Sharks in the Moat


by Phil Martin

In summary, as part of software acceptance, the following three actions must be carried out specific to the change management process:

Change requests are evaluated for impact on the overall security of the software.

The request is formally submitted, evaluated and approved only by those with the proper authority.

The asset management database is updated with the new information.

Figure 149: Risk Acceptance Template Elements

Risk Acceptance Policy and Exception Policy

When accepting risk, we should use the same template each time, and it should include four elements as shown in Figure 149 – risk, actions, issues and decisions, which makes a nice little acronym of RAID.

The risk section informs the business on the issues that can result from a change. From a security perspective, the issues will focus on risks that result in information disclosure, alteration or destruction – essentially, CIA. As the business owner will be the one evaluating risks, the verbiage must be kept free from technical mumbo-jumbo, and explain the risk in business terms as much as possible.

The actions section focuses on activities that the IT and development teams will need to take. This includes specific steps that have already been executed and those that will need to be executed. This can be as technical as necessary.

The issues section helps the development team understand at a technical level how threats can be realized with the software.

The decisions section lists the various options available to the decision makers when accepting the risk. We’re back to using non-technical words here.

Now let’s discuss the various options we have on how to handle risk that exceeds our acceptance and tolerance levels. At times we will not have the time or resources to properly handle unacceptable risk, and our only options are to either transfer or avoid that risk. We can transfer risk to an outside company by purchasing an insurance policy. We can avoid risk by discontinuing the use of the software that is causing the risk.

Unfortunately, sometimes neither one of those options is acceptable. For example, a new security policy or compliance requirement may pop up that causes a legacy application to fall out of compliance. Because the legacy app is mission-critical we can’t simply discontinue its use, and we cannot simply transfer the risk as insurance may not be available for such a situation. Let’s say that the legacy app does not encrypt certain types of data that are now required by legislation to be encrypted. It appears that we are out of options – we cannot terminate the use of the software, we cannot update it, but we will clearly be out of compliance if we do nothing. In short, we cannot mitigate, transfer or avoid the risk. In such a situation, our only recourse is to make an exception to a policy and accept the risk.

To make such an exception to an existing policy, the organization must have a process to allow policy exceptions – otherwise we will be guilty of ignoring a risk instead of accepting it. Ignoring a risk is failure to carry out due diligence and can result in some pretty serious legal actions down the road. But, if we have a recognized process to provide exceptions to a policy, AND we heavily document this activity, then when the time comes that an external party tells us we are out of compliance, we will be able to immediately show why such a situation is required. That doesn’t necessarily mean the external party will be OK with our decision, but it will provide some much-needed legal cover.

As a summary to this section, the software acceptance process must ensure that risk is within an acceptable range, and if not, then a policy exception process is used. Only when these two conditions are met can a release be approved for deployment.

Release Management

Release management is a process that ensures changes are planned, documented, tested and deployed with least privilege without negatively impacting the end-users. A common complaint is the reappearance of bugs that were previously fixed, which is a direct result of either improper version management or improper configuration management. For example, let’s say that a bug in the production environment resulted from a missing key in a configuration file. The fix is applied directly to production but is never reflected in the files used to deploy to that environment. Guess what happens on the next release? The key ‘mysteriously’ disappears again because proper configuration and version management procedures were not followed. This is why the release management process includes oversight of versioning, backups, and check-out and check-in procedures. Collectively, all of these procedures and processes make up something called the software configuration management plan, or SCMP.
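The missing-key scenario above can be caught before the next release by comparing what is actually running against what is in version control. Here is a minimal sketch of that idea. The function name, the key names and the sample values are all hypothetical, and a real SCMP would cover far more than a simple key comparison.

```python
def unversioned_keys(deployed: dict, versioned: dict) -> set:
    """Return keys that exist in the live configuration but not in the
    versioned copy used for deployments (i.e. fixes applied by hand)."""
    return set(deployed) - set(versioned)


# Hypothetical data: "cache_ttl" was hot-fixed directly in production
# and never checked in, so the next release would silently drop it.
production = {"db_host": "db01", "cache_ttl": "300"}
repository = {"db_host": "db01"}

drift = unversioned_keys(production, repository)
print(sorted(drift))  # ['cache_ttl']
```

Running a check like this as part of the release process turns a 'mysterious' regression into a visible difference that someone has to either check in or deliberately discard.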

An interesting side note is that sometimes debugging capabilities are accidentally, or even worse purposefully, deployed into a production environment. This usually deploys an additional program database file with a .pdb extension. While this file is not used until runtime when it is dynamically linked to the executable, it contains all sorts of juicy information that an attacker would love to get his hands on. This is often a compile-time switch and can slow performance as well. Ensuring this never happens is well within the SCMP wheelhouse.
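One simple guard is to scan a release package for stray .pdb files before it ships. The following is only a sketch under assumptions: the function name and the "release" directory are hypothetical, and it checks nothing beyond the file extension.

```python
from pathlib import Path


def find_debug_symbols(release_dir):
    """Return every .pdb file found anywhere under release_dir."""
    root = Path(release_dir)
    if not root.is_dir():
        # Nothing to scan; treat a missing directory as "no findings".
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdb"
    )


# "release" is a hypothetical folder name for the build output.
for path in find_debug_symbols("release"):
    print(f"Debug symbols found in release package: {path}")
```

A check like this could run as a gate in the build pipeline, failing the release whenever the returned list is not empty.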

For proper configuration management to happen, configuration information must be well-documented and maintained in a versioned repository – this type of capability is often referred to as a configuration management database, or CMDB. In fact, ISO/IEC 15408, which contains the Common Criteria, absolutely requires that such a thing exists in the form of a configuration management system, or CMS, and is used to maintain documentation and any build tools. CMDB is simply the database for a CMS. Any security level changes should be reflected in the CMS as well.

Chapter 45: The Auditor Role

Business Continuity

Back in the core concepts section, we very briefly went over how to calculate things such as ALE, RTO and RPO. We also discussed business continuity. It is the auditor’s role to ensure that such things are executed properly and in-place before a disaster strikes. Here, we’re going to dive into the same topic but with a much greater level of detail.

Annual Loss Expectancy, or ALE

Part of calculating risk to a valuable asset is to express that risk in terms of money – there is nothing like dollar signs to get the attention of executives! The most common means to do so is something called the annual loss expectancy, or ALE. But, there are several other terms we must understand before we can start generating an ALE.

First, an asset must be assigned a monetary value, called the asset value, or AV. If a building cost $400,000 to replace, then AV = $400,000.

The exposure factor, or EF, is the portion of an asset’s value that is likely to be destroyed by a particular risk, expressed as a percentage. For example, if we are assessing the risk of a building catching fire, and we estimate that one-third of the building will be destroyed in a fire, then EF = 33%.

The single loss expectancy, or SLE, is the loss we will encounter if we experienced a single instance of a specific risk. In other words, for our building just mentioned, SLE would be the asset value for the building ($400,000) multiplied by how much of the building would be destroyed (33%). So:

SLE = AV x EF

=$400,000 x 33%

=$132,000

In simple terms, we expect that if the building catches fire, it will cost us $132,000 to repair it.

We have one more term to cover before we can calculate ALE.

The annualized rate of occurrence, or ARO, is the number of times a threat on a single asset is expected to happen in a single year. This number can be less than 1, which means we expect it to happen every few years instead of multiple times per year. If we expect our building to catch fire once every 10 years, then ARO = 1/10, or .1. If we, for some bizarre reason, expect our building to catch fire 3 times each year, then ARO will be 3. Let’s go with the once per 10 years in our example, since that seems to be a bit more reasonable.

So now we finally get down to calculating ALE, which is simply how much money we will lose for each instance of the threat multiplied by how often it will happen each year. This will give us how much money we can expect to lose each year – that is why we call it the annualized loss expectancy. The formula is:

ALE=SLE x ARO

And calculating it for our example, we would use:

ALE =$132,000 x .1

=$13,200

The bottom line is that we can expect to lose $13,200 every year that we own our building due to the risk of it catching fire. This makes it a lot easier to factor that risk into our annual budget so that when the building does catch on fire, we will probably have the money already set aside.
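The two formulas are easy to capture in a few lines of code, which helps when the same calculation has to be repeated for many assets and threats. This is a minimal sketch using the building example from above. The function names are my own, and the exposure factor is passed as a fraction (0.33) rather than a percentage.

```python
def single_loss_expectancy(asset_value, exposure_factor):
    """SLE = AV x EF, with EF given as a fraction between 0 and 1."""
    return asset_value * exposure_factor


def annualized_loss_expectancy(sle, aro):
    """ALE = SLE x ARO, with ARO as expected occurrences per year."""
    return sle * aro


# The building example: AV = $400,000, EF = 33%, one fire every 10 years.
sle = single_loss_expectancy(400_000, 0.33)
ale = annualized_loss_expectancy(sle, 0.1)
print(f"SLE = ${sle:,.0f}, ALE = ${ale:,.0f}")  # SLE = $132,000, ALE = $13,200
```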

Remember, this is all in relation to performing a quantitative analysis, where the result for each risk will be expressed as either SLE or ALE, most commonly ALE.

Recovery Time Objective, or RTO

A good chunk of security management is focused on preventing bad things from happening. However, there is no way to completely prevent an incident from occurring, and in those cases, we must shift our attention to getting compromised facilities and systems back to an acceptable level of operation. The recovery time objective, or RTO, is the amount of time required to do this. The acceptable level is defined by the service delivery objective, or SDO (more on that in just a bit).

The acceptability of some risks can be quantified by using the approach of RTO, which tells us how much downtime we can absorb without serious consequences. RTO can then be used to quantify the cost of getting back to full recovery. For example, if we decide that our business can survive for only 3 days if our order taking system were to go down, then RTO must be no greater than 3 days. We can then estimate how much it would cost us to always be in a position where we could bring all order systems back to full operation within 3 days.

Recovery Point Objective, or RPO

The recovery point objective, or RPO, focuses on data backup and restoration. RPO will tell us how much data we can stand to permanently lose in case of interruption in terms of time, usually hours or days. Backup schemes normally will perform full or partial backups of data systems automatically on a periodic basis. RPO tells us the maximum amount of time we should ever go without performing some type of backup. Now, there is a scenario in which the time to restore exceeds the RPO or RTO. For example, the RPO dictates we can lose only 6 hours of data, but if an interruption occurs, it will take 8 hours to restore that 6 hours’ worth of data, in which case we will have exceeded the RPO by 2 hours. Or, perhaps the RPO is 2 days, but RTO may be set at 6 hours, in which case the RTO will be exceeded due to a slow restore. In either case, we are simply unable to meet the RPO or RTO, and if we cannot make them align by using different technologies, we just have to accept the risk.
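To make the comparison concrete, here is a small sketch that checks a backup schedule and a measured restore time against the two objectives. It assumes the simplest reading of the rules above: the backup interval must not exceed the RPO, and the restore time must not exceed the RTO. The function name is hypothetical, and the daily backup and 8-hour restore are illustrative values I picked to go with the 2-day RPO and 6-hour RTO from the text.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Compare a backup schedule and restore duration to RPO and RTO."""
    return {
        "rpo_met": backup_interval <= rpo,  # never go longer than RPO without a backup
        "rto_met": restore_time <= rto,     # restoring must finish within RTO
    }


result = meets_objectives(
    backup_interval=timedelta(days=1),   # assumed: nightly backups
    restore_time=timedelta(hours=8),     # assumed: a slow restore
    rpo=timedelta(days=2),
    rto=timedelta(hours=6),
)
print(result)  # {'rpo_met': True, 'rto_met': False}
```

In this case the backups are frequent enough, but the slow restore breaks the RTO, which is exactly the situation where we either find a faster technology or accept the risk.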

Service Delivery Objective, or SDO

The service delivery objective, or SDO, defines the minimum level of service that must be restored after an event until normal operations can be resumed. Both RTO and RPO affect the value of the SDO. The units of SDO are specific to the system, but some possibilities might be transactions per second (TPS) or the number of concurrent users.

Maximum Tolerable Outage, or MTO, or MTD

The maximum tolerable outage, or MTO, is the maximum time that an organization can operate in an alternate or recovery mode until normal operations are resumed. Many factors can limit MTO, such as the availability of fuel to operate emergency generators, or the accessibility of a remote backup site. MTO will have a direct impact on the RTO, which in turn impacts the RPO.

Maximum tolerable downtime, or MTD, is just another name for MTO.

Allowable Interruption Window, or AIW

The allowable interruption window, or AIW, reflects the amount of time normal operations are down before the organization faces major financial problems that threaten its existence. MTO should never be greater than AIW but can be much shorter. Increasing the gap between MTO and AIW will lessen the impact the organization feels from a given outage.

Bringing It All Together

Let’s assume that we work for a company manufacturing rocket engines for gigantic spaceships. We have committed to delivering 40 engines each week, and you have been tasked with figuring out how to keep the system up that runs our assembly lines in case of a disaster. The primary assembly line runs at 75% capacity, meaning if we need to, we can kick up the speed temporarily to 100% to churn out engines more quickly. The CEO tells you that the company cannot survive if it is down for more than 7 days, so we set AIW (allowable interruption window) to 7 days. AIW represents the downtime before the company will be forced to cease operations.

Now, if the main assembly line goes down, our plan is to shift to a backup facility until the primary facility can be repaired. But, the backup facility can only operate at 50% of our normal capacity. So, we can run on the backup facility for only a few days. Without going into the details, we calculate that to be 3 days to get back up to speed before we hit the AIW. Therefore, RTO (recovery time objective) would be set to 3 days – that is the maximum amount of time we have until the assembly lines must be back up and running.

But, since our backup facility only operates at 50% relative to our primary facility (which normally runs at 75% of its capability), once we have resumed normal operations the primary assembly line will need to run at 100% for a few days to catch back up. So, we define SDO (service delivery objective) to be 100%, and then once we have caught up we can return to 75%. This means that in the event of an outage, the primary facility must be ready to run at full speed for a few days.

But we discover that MTO (maximum tolerable outage) for the backup facility is only 2 days because it cannot store enough fuel to operate for longer. Since MTO is less than RTO, we have a problem. We solve it by installing an additional fuel tank for the backup facility, bringing MTO to 4 days. MTO >= RTO, so we’re good.

Once we solve MTO, we discover that we only back up the system running the assembly line once per week, forcing RPO (recovery point objective) to 7 days. Since the entire assembly process depends on tracking progress by the second, an outage would set us back by a week, which exceeds RTO. Obviously, this is unacceptable. So, we decide to start backing the system up once per day, meaning that our RPO is now 1 day, which is just enough to squeeze by. But, there is another problem – restoring the backup is a lengthy process and will take 2 days. That means we cannot bring the backup facility online for 2 days after the outage starts. Not a good thing, since RTO is 3 days. Therefore, we must invest some major money into purchasing a system that will allow us to restore data to the backup facility in only a few hours.
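The comparisons we just walked through can be written down as a handful of checks. The sketch below encodes only the relationships stated in this chapter (RTO and MTO must fit inside AIW, MTO must be at least RTO, and RPO is compared against RTO as in the example). The class and function names are hypothetical, and the restore-time problem is left out because the text does not give a precise threshold for it.

```python
from dataclasses import dataclass


@dataclass
class ContinuityPlan:
    aiw_days: float  # allowable interruption window
    rto_days: float  # recovery time objective
    mto_days: float  # maximum tolerable outage of the backup facility
    rpo_days: float  # recovery point objective


def violations(plan):
    """Return a description of every relationship the plan breaks."""
    problems = []
    if plan.rto_days > plan.aiw_days:
        problems.append("RTO exceeds AIW")
    if plan.mto_days > plan.aiw_days:
        problems.append("MTO exceeds AIW")
    if plan.mto_days < plan.rto_days:
        problems.append("MTO is shorter than RTO")
    if plan.rpo_days > plan.rto_days:
        problems.append("RPO exceeds RTO")
    return problems


# Values from the rocket engine example, before and after the fixes.
initial = ContinuityPlan(aiw_days=7, rto_days=3, mto_days=2, rpo_days=7)
revised = ContinuityPlan(aiw_days=7, rto_days=3, mto_days=4, rpo_days=1)

print(violations(initial))  # ['MTO is shorter than RTO', 'RPO exceeds RTO']
print(violations(revised))  # []
```

The initial plan fails on exactly the two problems we found by hand, the fuel limit and the weekly backup, and the revised plan passes once the extra fuel tank and daily backups are in place.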

So, how does ALE (annual loss expectancy) factor into this? Well, we now have a plan that kicks into place if an outage occurs, but we would rather not incur that cost if we can avoid it. We can calculate an ALE for a specific threat to help us understand how much money we should spend to avoid an outage. Let’s assume that the most likely reason our primary assembly facility would go down is due to an alien attack trying to take out our Roadster spaceship fleet. In our example, the following is calculated:

AV (asset value) of the primary facility is $100 million

EF (exposure factor) is the percentage of the facility destroyed in an attack, which we estimate to be 40%

The loss from a single attack would be SLE = AV x EF, or ($100 million) * (40%) = $40 million

We expect an attack every 4 years, so ARO (annualized rate of occurrence), would be .25

Finally, ALE = SLE x ARO = ($40 million) *.25 = $10 million

If ALE = $10 million, that means we can justify spending up to $10 million per year to prevent an alien attack. Obviously, this means that we should spend that $10 million each year on laser satellites to protect Planet Earth.

Let’s briefly summarize what each term means:

AV is the asset value, or how much it would cost to replace a given asset

EF is the exposure factor, or how much of an asset would be lost in a single disaster

SLE is the single loss expectancy, or how much value an asset would lose in a single disaster

SLE = AV x EF

ARO is the annualized rate of occurrence, or how often we can expect the disaster to happen in a single year

ALE is the annualized loss expectancy, or how much a given risk will cost us each year

ALE = SLE x ARO

RTO is the recovery time objective, or the amount of downtime we can absorb without serious consequences

RPO is the recovery point objective, or the amount of data we can lose without serious consequences

SDO is the service delivery objective, or the minimum level of service that must be restored before normal operations can resume

MTO is the maximum tolerable outage, or the maximum time we can spend in an alternate mode before normal operations must be resumed

MTD is the maximum tolerable downtime, and is another name for MTO

AIW is the allowable interruption window, or the maximum time we can spend in an alternate mode before the organization’s existence is threatened

Figure 150: Business Continuity Concept Summary

It is a bit complicated, but I hope you can see the value of tracking all the different variables and how the relationships among them provide guidance on where we need to reduce weaknesses. By defining AIW, RTO, MTO and RPO, we can contrast them with ALE to best decide where to spend our limited resources. Figure 150 provides a quick summary.

BCP, DRP and BIA
