Create a base profile of the network, system or software that is accepted as ‘normal’. Any deviation from this norm can be identified as ‘abnormal’ and result in alerts. Without a base profile, how will you know what ‘abnormal’ looks like?
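As a minimal sketch of the idea (the metric names and the three-sigma threshold are assumptions chosen for illustration), a baseline can be captured as an average and allowed spread per metric, with anything outside that band flagged as abnormal:

# Minimal sketch: flag metrics that stray too far from a recorded baseline.
# Metric names and the 3-sigma threshold are illustrative assumptions.
from statistics import mean, stdev

def build_baseline(samples):
    """samples: {'cpu_pct': [12, 15, 11, ...], 'conn_per_min': [...], ...}"""
    return {name: (mean(vals), stdev(vals)) for name, vals in samples.items()}

def find_anomalies(baseline, current, threshold=3.0):
    alerts = []
    for name, value in current.items():
        if name not in baseline:
            continue
        avg, sd = baseline[name]
        if sd and abs(value - avg) / sd > threshold:
            alerts.append(f"{name}={value} deviates from baseline {avg:.1f}")
    return alerts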
Maintain an updated diagnosis matrix to help incident handlers identify and categorize threats. When an emergency incident is encountered, people’s training and common sense often go right out the front door, and a prepared tool such as the matrix can often help them keep their cool in the heat of the moment.
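One lightweight way to capture such a matrix is a simple lookup table that handlers can consult under pressure; the symptoms, categories and first responses below are purely illustrative:

# Illustrative diagnosis matrix: observed symptom -> (likely threat category, first response)
DIAGNOSIS_MATRIX = {
    "spike in outbound traffic":  ("data exfiltration", "capture netflow, isolate host"),
    "repeated failed logins":     ("brute force",       "lock account, review auth logs"),
    "unknown process at startup": ("malware/rootkit",   "snapshot memory, quarantine host"),
    "defaced web page":           ("web compromise",    "take app offline, preserve logs"),
}

def triage(symptom):
    # Fall back to escalation when the symptom is not in the matrix.
    return DIAGNOSIS_MATRIX.get(symptom.lower(), ("unknown", "escalate to senior handler"))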
Document and timestamp every action taken from the start of an incident to its final conclusion. This performs multiple functions:
It is useful for legal actions if we ever find ourselves in court.
It keeps the incident handlers honest, as they know their activities will be reviewed later.
It helps with the post-incident analysis.
Since this data may be used in a court of law, it is important to protect the activity log from alteration, destruction, or disclosure.
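A hash-chained, timestamped log is one way to make later alteration evident. The sketch below illustrates the idea only and is not a substitute for proper forensic tooling or access controls:

# Sketch of a tamper-evident activity log: each entry is timestamped and
# chained to the previous entry's hash, so any later edit breaks the chain.
import hashlib, json
from datetime import datetime, timezone

class ActivityLog:
    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64

    def record(self, handler, action):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "handler": handler,
            "action": action,
            "prev_hash": self._last_hash,
        }
        entry["hash"] = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self._last_hash = entry["hash"]
        self.entries.append(entry)

    def verify(self):
        prev = "0" * 64
        for e in self.entries:
            body = {k: e[k] for k in ("timestamp", "handler", "action", "prev_hash")}
            expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev_hash"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True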
Be sure that some type of mechanism exists to prioritize incidents before they are handled. A first-come, first-served basis is NOT the way to handle emergencies. This mechanism cuts both ways – not only should it define what is ‘important’ so the response team has the flexibility to delay some actions, it must also establish the minimum response times the team must meet for the highest-priority events.
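A simple scoring scheme makes the point; the impact and urgency scale and the response-time targets below are invented for illustration and would be tuned to the organization:

# Illustrative incident prioritization: impact and urgency map to a priority
# with a maximum response time the team commits to meet.
RESPONSE_TARGETS = {1: "15 minutes", 2: "1 hour", 3: "next business day"}

def prioritize(impact, urgency):
    """impact and urgency: 1 (high) to 3 (low); a lower combined score means a higher priority."""
    score = impact + urgency
    priority = 1 if score <= 3 else 2 if score <= 4 else 3
    return priority, RESPONSE_TARGETS[priority]

# Example: a high-impact, high-urgency incident must be worked within 15 minutes.
# prioritize(1, 1) -> (1, "15 minutes")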
Containment, Eradication and Recovery
The third step in the incident response process is to contain the blast zone from the incident to as small a radius as possible, eradicate the threat, and then recover back to normal operations. While these can be viewed as three separate steps, they are grouped as one because they often are carried out at the same time or as part of the same activity.
Containment is concerned with limiting any further damage or additional risks. Some examples of containment might include shutting a system down, disconnecting it from the network, turning off services, or taking an application offline. Keep in mind that some of these actions could contaminate evidence, such as shutting down a system and thereby losing all data in volatile memory.
The difficulty with immediate containment is that by ending an incident too soon, we might be hurting our ability to collect evidence and identify the intruder or root cause. On the other hand, allowing an incident to continue in the name of evidence collection can result in additional damage – this approach is called delayed containment. The chosen containment strategy must be examined uniquely for each incident and you may need to confer with legal counsel in cases of a malicious breach. Some useful criteria we can leverage when deciding on the correct containment strategy are the following:
The potential impact and theft of resources that may result from a delayed containment strategy.
The importance of preserving evidence.
The importance of the continued availability of a service.
The time and resources needed to execute a strategy.
How ‘complete’ a strategy is – in other words, whether it will result in a partial containment or a full containment.
The duration and criticality of the solution. For example, a containment strategy might have a permanent duration if we can find a work-around to reduce criticality.
The chance that the attack will cause additional damage after the primary attack is contained. As an example, disconnecting a system might cause a logic bomb to detonate.
When considering the need to preserve evidence, the processes used to collect and store the evidence must be well-documented and should be created based on recommendations from the internal legal team and external law enforcement authorities. As an example, any volatile data must be carefully collected for it to remain admissible as evidence in court. This includes a list of network connections, running processes, login sessions, open files and memory contents. Analyzing persistent storage has its own requirements as a complete bit-copy must be made and used for analysis, leaving the original in a pristine state.
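For the persistent-storage requirement, the usual practice is to hash both the original and the bit-copy and confirm they match before any analysis begins. A minimal sketch follows; the file paths are placeholders, and real acquisitions rely on write-blockers and dedicated imaging tools:

# Sketch: verify that a forensic bit-copy matches the original before analysis,
# and document the resulting hashes in the evidence log.
import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_bit_copy(original_path, copy_path):
    return sha256_of(original_path) == sha256_of(copy_path)

# Example usage (placeholder paths):
# assert verify_bit_copy("/evidence/disk_original.img", "/evidence/disk_copy.img")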
Whereas containment stops the propagation of damage, the eradication step erases or rolls back any damage caused by the incident. Care must be taken that the appropriate authorization has been granted to prevent contamination of evidence and to ensure recovery efforts will succeed. As an example, if we decide that the users table has been compromised, erasing all contents in the table would be a very bad move if we do not have a copy that can be restored when carrying out the recovery step. Instead, we may need to clean the users table on a column-by-column basis. When modifying or reconfiguring third-party software to remove a vulnerability, we need to consult the licensing agreement to ensure we have the right to take such actions.
The recovery step aims to restore the network, system or software back into its normal working state. This might entail restoring backups, restoring compromised accounts, forcing a change of passwords, installing additional security controls, or even applying patches. This process might also entail rolling out enhanced logging capabilities to aid in detecting the same type of event in case it reoccurs.
Post-Incident Analysis
Once the containment, eradication and recovery steps have been completed, a post-incident analysis must be performed. How soon after recovery this needs to happen will depend on the severity. For low-impact events, a post-mortem can be conducted in batches at a later time. For higher-impact events, the post-mortem should be carried out immediately to reduce the risk of the same event happening again. Regardless, a post-incident analysis MUST happen for every incident, as it will deliver the following benefits:
It identifies the root cause.
It identifies security weaknesses in the network, system or software.
It identifies problems with policies and procedures.
It can be used as evidence later.
It can be referenced during future events to accelerate handling and findings.
It is useful for training and reference materials for less experienced IRT members.
Such a capability will require the use of some type of database, even if it is only a searchable wiki. If an organization is required to communicate details of an incident to external media, vendors or law enforcement agencies, the post-incident analysis must be completed before that communication occurs. Incorrect communication following an incident can often cause more damage than the incident itself. Therefore, communication guidelines need to be established well before an incident is encountered, along with a list of approved contacts. In general, there should be no communication to external parties until the IRT has had a chance to talk with the various internal authorities. Only the authorized point of contact should communicate with external parties, and that conversation should include only the minimum required details. Over-communication has caused more than one company to regret its lack of policies.
How a post-incident analysis, or post-mortem, is carried out should reflect the needs of each organization, but it should at a minimum include the five ‘Ws’:
What happened?
When did it happen?
Where did it happen?
Who was involved?
Why did it happen?
The ‘why’ leads us to the root cause and should never be left unanswered. In the software world, an unidentified root cause will always come back to haunt you. It is simply guaranteed. In fact, identifying root cause is so important that it leads us to a completely different area than incident management, called ‘problem management’.
Problem Management
The focus of incident management is to return an affected service to its normal state as quickly as possible. Contrast that to problem management, which has the goal of reducing the number of incidents and/or the severity of incidents. Stated another way, if problem management is unable to prevent a severe incident, then incident management kicks in and handles it. If we find ourselves spending too much time handling incidents, then perhaps we need to spend some time on problem management to reduce the number of incidents. Figure 46 illustrates the relationships between Incident, Problem, Change and Release Management.
Figure 46: Relationships between Incident, Problem, Change and Release Management
Problem management looks at incidents that have already been handled in an attempt to find the root cause. Solve the root cause and hopefully we won’t have a repeat next week. Some standard approaches to problem management include using a fishbone/Ishikawa cause-and-effect diagram, brainstorming and the use of the 5 Whys. This last approach – the 5 Whys, as shown in Figure 47 – is an iterative question-asking exercise in which we ask, ‘Why did it happen?’, followed by four more iterations asking why THAT answer happened. Eventually we drill down to the actual root cause instead of focusing only on symptoms. Once the root cause has been identified, it is called a known error, and a work-around or solution can be developed to prevent it from happening again.
Figure 47: The 5 Whys
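As a hypothetical walk-through of the technique, the chain below drills from a symptom down to a process failure; the scenario is invented purely for illustration:

# Hypothetical 5 Whys chain for an outage caused by an expired certificate.
five_whys = [
    ("Why did the service go down?",          "The TLS certificate expired."),
    ("Why did the certificate expire?",       "Nobody renewed it in time."),
    ("Why was it not renewed in time?",       "No expiry alert was ever raised."),
    ("Why was no alert raised?",              "The cert was never added to monitoring."),
    ("Why was it never added to monitoring?", "Provisioning has no step to register certs."),
]
# Root cause: a missing step in the provisioning process, not the certificate itself.
for why, answer in five_whys:
    print(f"{why} -> {answer}")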
If problems are not resolved in a timely manner, there needs to be an escalation process that gets the attention of IS management. Unresolved problems tend to eventually result in an incident that disrupts business operations, and even worse can cause corruption of data over time. In fact, when escalating an issue, the documentation should reflect whether the problem can wait until normal working hours.
If you look back at incident management, you will see that it pretty much wraps up with asking ‘Why did it happen?’, because ‘why’ is the last ‘W’ in the five Ws. Problem management starts with the exact same question – ‘why?’ – and continues from there. In some ways the two overlap with the ‘why’ question, but problem management takes it to a whole new level by asking ‘why’ five times in succession with the 5 Whys process.
Figure 48: Problem Management Process Flow
A mature problem management process will follow a repeatable list of steps, such as those shown in Figure 48. It begins with an incident notification, after which we start digging in to find the root cause (the ‘why’). A good litmus test to see if we have arrived at the root ‘why’ is to imagine fixing the last ‘why’ and then determining if the problem would go completely away. If it would not, then we need to keep asking questions.
A fishbone diagram is also a good tool as we previously mentioned. With this approach, shown in Figure 49, we can visualize and organize the various possible causes of a problem, and by narrowing down the possibilities we can zero in on the real root cause.
Another tool is to use categories to identify root cause. By using pre-defined categories, such as people, process, technology, network, host, and software, the big brains thinking through the possibilities can quickly rule out certain areas without the fear that other areas will not be considered. Figure 49 is an example of using categories and a fishbone diagram at the same time.
Figure 49: Root Cause Analysis using a Fishbone Diagram
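The same category-driven analysis can be captured in a simple structure while the team brainstorms; the categories below mirror those named above, and the candidate causes are illustrative only:

# Illustrative fishbone 'spines': pre-defined categories with candidate causes
# collected during brainstorming. Empty categories are ruled out explicitly.
fishbone = {
    "people":     ["on-call engineer unfamiliar with the subsystem"],
    "process":    ["no peer review required for configuration changes"],
    "technology": ["library version mismatch between staging and production"],
    "network":    [],
    "host":       ["disk filled by unrotated logs"],
    "software":   ["unhandled exception on malformed input"],
}
candidate_causes = [(cat, cause) for cat, causes in fishbone.items() for cause in causes]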
Or, we can choose to use the rapid problem resolution, or RPR, approach in which we examine possible causes using three steps – discovery, investigation and correction – as shown in Figure 50. RPR is fully aligned with ITIL, so if an organization is already using that framework, the steps will be instantly familiar.
Figure 50: Rapid Problem Resolution Steps
When carrying out a root cause analysis, it is important to separate the symptoms from the underlying cause. Incident management treats symptoms, whereas problem management drills down until the root cause is identified. That is why we use the five Ws with incident management, and the five Whys with problem management. As we mentioned, asking ‘Why?’ iteratively forces us to look past symptoms and find the real cause. Since problem management is responsible for fixing the root cause, vulnerability tracking is a natural part of this process. Once we determine root cause, we should then track the issue, ensure mitigations are applied properly, and then verify that the updates were successful.
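That tracking step can be as simple as a record that moves through a fixed set of states; the fields and states below are an illustrative minimum, not a prescribed schema:

# Sketch of a vulnerability-tracking record: identified -> mitigation applied -> verified.
from dataclasses import dataclass, field

VALID_STATES = ("identified", "mitigation_applied", "verified")

@dataclass
class TrackedVulnerability:
    identifier: str                 # e.g. an internal ticket or CVE reference
    root_cause: str
    state: str = "identified"
    history: list = field(default_factory=list)

    def advance(self, new_state, note=""):
        # Only allow moving one step forward, so no stage can be skipped.
        if VALID_STATES.index(new_state) != VALID_STATES.index(self.state) + 1:
            raise ValueError(f"Cannot move from {self.state} to {new_state}")
        self.history.append((self.state, new_state, note))
        self.state = new_state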
It should be obvious but let’s say it out loud anyway – without proper logging in place, we will have very little hope of identifying root cause. Not only do logs help us to look beyond symptoms, they also are a tremendous help when it comes time to reproduce an issue. Unless a developer can reproduce a problem, there is virtually no chance that he or she will be able to solve it.
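A small amount of structured, contextual logging goes a long way toward reproducing a problem later. This sketch shows the kind of context worth capturing; the event and field names are illustrative assumptions:

# Sketch: structured logging that captures enough context to reproduce an issue.
import json, logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("orders")

def log_event(event, **context):
    # One JSON object per line keeps logs searchable and easy to correlate.
    log.info(json.dumps({"event": event, **context}))

# Example: enough detail to replay the failing request, without sensitive data.
log_event("order_failed",
          order_id="A-1021", user_role="guest",
          input_size=42, error="TimeoutError", upstream="inventory-service")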
Change Management
Both incident management and problem management will almost always require some type of change to be made in the production environment. Any related updates must be handled very carefully for three reasons:
1) During emergencies people tend to become careless in their attempt to get a fix out quickly, and we need some type of gating process to force everyone to slow down and think things through before acting.
2) Changes made in an emergency are often not documented, and we are later left wondering “Who made what change, and why?” We need a process to force all changes to be documented before being rolled out.
3) Young Harry, the junior dev on the team, should not be in charge of making decisions simply because he was the only one who responded to the outage call. There must be a process in which all changes are properly approved before being applied to production.
Change management is the process we have been referring to that manages any update and is the primary goal of both configuration and vulnerability management. At no time should the need for a quick fix allow the change management process to be bypassed. I can personally guarantee that you will regret it at some point, usually within hours or days. Now, that does not mean that we should not allow an exception to the process when things get really hairy and customers are threatening to sue. If we need to roll out a fix with 0% testing because it is the least risky move, then that is exactly what should happen – AS LONG AS there is a process to manage change management exceptions. Of course, the funny thing here is that if we have a process to allow that exception, then guess what? It is not an exception to the change management process – it is PART of the change management process! The exception is to the normal flow of change management as long as we have a pre-defined process that allows it.
Let’s show that a little more clearly with a real-world example. Not long ago, the change management process that my teams worked within required development testing, staging testing, UAT and then finally a deployment to production. Nothing was allowed to go into production unless it had successfully completed all three phases of testing. Our product was rock solid as a result. That is, until one morning following a deployment the night before, we discovered that our largest client, who comprised 20% of our total revenue, was unable to run reports because of a missing index in production – the database was bogging down and exhausting the CPU because it was performing table scans. It was one hour before office hours, and no one had yet arrived to perform the required rounds of testing. Not only that, but we could not afford to wait to roll out the fix even if we had the testing team present. So, did we go around the change management process in the interest of keeping the customer happy? NO – we didn’t have to. Because we already had a process that said if a high-severity bug was causing a complete outage of services to more than 15% of traffic, a VP could override the process as an ‘exception’. The exception was clearly spelled out – the fix had to be placed into staging with the development team performing all testing, and once they gave the green light, a production deployment was approved as long as all documentation for the fix and rollout was properly created. And that is how we saved 20% of our total revenue without violating our own processes. But it would not have happened had we not taken the time to think ahead and plan for just such a situation.
Now let’s discuss management of both patches and vulnerabilities, which are often the same thing – the lack of proper patching can create a vulnerability or cause a vulnerability to remain unresolved. When a vulnerability is discovered, we have three choices:
1) Do nothing as the risk falls below acceptable levels.
2) Plan to address the vulnerability in an upcoming release.
3) Immediately fix the vulnerability.
The last option – updating code specifically to fix the vulnerability and rolling it out to production – is referred to as ‘patching’ the software. Often a patch is a temporary fix until we can roll out a better solution in a scheduled release. Patching can be seen as a subset of hardening, as fixing a vulnerability is by definition ‘hardening’ a product.
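A simple decision helper captures the triage among those three choices; the risk threshold and the severity scale are assumptions made for illustration:

# Illustrative triage of a discovered vulnerability into the three choices above.
ACCEPTABLE_RISK = 3.0   # assumed organizational threshold on a 0-10 risk scale

def remediation_decision(risk_score, exploit_in_the_wild):
    if risk_score < ACCEPTABLE_RISK:
        return "accept"            # 1) do nothing, risk is below acceptable levels
    if exploit_in_the_wild or risk_score >= 9.0:
        return "patch_now"         # 3) immediately fix (patch) the vulnerability
    return "next_release"          # 2) plan to address it in an upcoming release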
There are three primary methods we can employ to fix a vulnerability – apply a code patch, adjust the software’s configuration, or remove affected software. Software threats will usually come in the form of malware such as Trojan horses, worms, viruses, rootkits or exploit scripts, but they can also be human. Unfortunately, there is no software patch for careless people, and the best tool we have with that threat is awareness training.
When a patch is provided by an external vendor, it can come in two different forms – as a hotfix, or as a service pack. A hotfix, sometimes called quick fix engineering, or QFE, includes no new functionality and makes no hardware or application changes. Instead, it is usually related to updating the operating system or perhaps an external component. The nice thing about QFEs is that they can be applied selectively and usually have no dependence on one another.
A service pack will usually contain a number of QFEs, and often provides increased functionality as well. When a service pack is applied, it will more than likely include any previous QFEs, and you can be reasonably assured that the target software is completely up-to-date if the newest service pack is applied. Given the choice between multiple QFEs and a single service pack, the service pack is often the better option, as the supplier has committed to having no regression issues with a service pack.