Let’s define a few basic terms that feed off of the calculations that we just went over.
Business continuity is a strategy that:
Allows us to prevent most disasters from happening to our business.
Tells us how to handle the disasters that slip through.
Enables us to recover once the disaster has ended.
In other words, business continuity is a strategy to prevent disasters, recover from them, and continue operating afterward. Normally, though, most people think that business continuity is about prevention and how to keep functioning after a disaster has happened. That part in the middle – recovering from a disaster while it is still ongoing – is so important that it gets its own name – disaster recovery. That is why disaster recovery, while often discussed all by itself, is really a subset of business continuity.
We usually don’t call them business continuity and disaster recovery though. When discussing these matters, we will usually talk about the plan that addresses them. So, we have a business continuity plan, or BCP, and a disaster recovery plan, or DRP. DRP is contained within a BCP.
A disaster recovery plan, or DRP, documents how we will quickly restore data, applications and core services that run our business after a serious event happens. There will often be a disaster plan specific to IT as part of the larger DRP, so let’s take a second to see what that plan would look like. Some possible scenarios that would require the IT DRP to kick in are a loss of network connectivity, key systems, critical data or a service provider. A business continuity plan, or BCP, documents how an organization will prevent disruptions and continue operating at a strategic level with minimal or no downtime after a serious event happens.
In summary, a DRP is all about boots on the ground getting our systems back up at an operational level after some bad event has happened, while a BCP is all about how the organization will function before the event and after we have recovered.
Figure 151: The Relationship Between BIA, BCP and DRP
However, it turns out that before we can talk about either a BCP or DRP, we have to perform something called a business impact analysis, or BIA. The BIA helps us to understand what assets are important, and what their loss will mean to us. After all, if we don’t know which assets are important and why, how can we possibly know how to recover from their loss using a DRP or BCP? Figure 151 illustrates the relationships between the BCP, DRP and a BIA.
A BIA is undertaken so that we can easily see the impact to the organization of losing the availability of any given resource. One of the downsides of a BIA is that all assessments tend to be ‘worst-case’ and end up being inflated, which leads to management often discounting the estimates. An alternative is to look at a small subset of scenarios and have key stakeholders analyze each and produce a range of outcomes. Based on these results, we then estimate minimum, maximum and likely values along with a confidence level. We can then perform some quantitative magic to objectively come up with a final estimate that can be trusted.
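One common way to perform that quantitative step is a PERT-style three-point estimate, which weights the likely value most heavily. The sketch below is illustrative only – the scenario name and dollar figures are assumptions, not values from the text.

```python
# Illustrative PERT-style three-point estimate for a single BIA scenario.
# The formula (minimum + 4 * likely + maximum) / 6 weights the 'likely'
# value most heavily; the numbers below are invented for the example.

def pert_estimate(minimum: float, likely: float, maximum: float) -> float:
    """Combine min/likely/max stakeholder estimates into one value."""
    return (minimum + 4 * likely + maximum) / 6

# Hypothetical scenario: stakeholders estimate the hourly impact of losing
# the order-entry system, along with how confident they are in the range.
scenario = {"minimum": 5_000, "likely": 12_000, "maximum": 40_000, "confidence": 0.8}

impact = pert_estimate(scenario["minimum"], scenario["likely"], scenario["maximum"])
print(f"Weighted impact estimate: ${impact:,.0f}/hour "
      f"(confidence {scenario['confidence']:.0%})")
```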
Recall that the recovery time objective, or RTO, is how long it will take for us to get operational again. RTOs are defined when carrying out a business impact analysis as part of BCP development. Often, there can be two different perspectives on RTO, with each providing a different answer: the individuals who consume the information, and senior management who have a broader view of the organization and must consider costs. For example, a lower-level supervisor may believe that specific information is critical to his job, but a vice president may disagree because she is looking at overall organizational risk, and that particular asset is actually much lower in the overall priority. However, the information security manager should take both views into account and try to achieve an RTO that serves both.
The business continuity plan will take RTOs and use them to arrive at a priority order in which assets are restored – those with the shortest RTO first, with assets having the longest RTO being restored last. Of course, it’s never that simple, as some assets will have dependencies on other assets before they can be declared ‘recovered’. For example, a specific generator by itself may have an RTO of 2 weeks, but a system having an RTO of 2 days might depend on that generator being available. Therefore, the generator must jump in priority even though its own RTO is quite long.
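To make the generator example concrete, here is one hypothetical way to compute an ‘effective’ RTO, where each asset inherits the shortest RTO of anything that depends on it and restoration order is then sorted by that effective value. The asset names and hours are invented for illustration.

```python
# Hypothetical restoration-priority calculation: an asset's effective RTO is
# the smaller of its own RTO and the effective RTO of anything that depends
# on it, so a slow prerequisite (the generator) jumps the queue.

rtos = {"generator": 336, "billing system": 48, "intranet wiki": 120}  # hours, illustrative
depends_on = {"billing system": ["generator"]}  # billing system needs the generator first

# Reverse view: which assets depend on each asset?
dependents = {asset: [] for asset in rtos}
for asset, prereqs in depends_on.items():
    for prereq in prereqs:
        dependents[prereq].append(asset)

def effective_rto(asset):
    """An asset must be back at least as fast as anything that needs it."""
    return min([rtos[asset]] + [effective_rto(d) for d in dependents[asset]])

# Restore in order of effective RTO; prerequisites break ties so they come first.
restoration_order = sorted(rtos, key=lambda a: (effective_rto(a), -len(dependents[a])))
for asset in restoration_order:
    print(f"{asset}: own RTO {rtos[asset]}h, effective RTO {effective_rto(asset)}h")
```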
Costs must also be factored in when setting RTO and restoration priority. System owners will always lean toward shorter RTOs, but a shorter RTO usually comes at a cost. Near-instantaneous recovery is almost always technically possible, but not necessarily financially justifiable. For example, we can maintain a backup system that is an exact duplicate of the one we use in production, but if they both must have 20 servers, that can be costly. To justify such an expense, the system must generate significant revenue, or any downtime must be extremely impactful. In general, the longer the RTO, the less cost is involved. There is a break-even point in the time-period where the impact of the disruption begins to be greater than the cost of recovery, and we need to make sure that RTO never exceeds that value. In other words, RTO should be shorter than the point at which the impact loss exceeds the recovery cost.
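Here is a rough, illustrative way to sanity-check a proposed RTO against that break-even point; every figure in it is an assumption made up for the example.

```python
# Illustrative break-even check: downtime impact accumulates per hour, while
# a recovery capability has a fixed cost. The RTO we commit to should stay
# below the point where the accumulated impact exceeds that recovery cost.

impact_per_hour = 15_000      # assumed loss per hour of downtime
recovery_cost = 120_000       # assumed cost of the chosen recovery capability
proposed_rto_hours = 6        # the RTO the system owner is asking for

break_even_hours = recovery_cost / impact_per_hour
print(f"Impact exceeds recovery cost after {break_even_hours:.1f} hours of downtime")

if proposed_rto_hours <= break_even_hours:
    print(f"An RTO of {proposed_rto_hours}h is justifiable for this spend")
else:
    print(f"An RTO of {proposed_rto_hours}h means the disruption would cost more "
          f"than the recovery capability - shorten the RTO or spend less")
```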
Auditing
Logging is a serious concern and can only be implemented well if it is built into software from the beginning. I’ve already told my tale of struggling to figure out how an application was misbehaving without logging. Suffice it to say that we did not achieve any level of quality until we had quality logging in place. Logging performs four major functions:
It allows us to figure out what went wrong and how after the fact by performing a post-mortem.
It allows us to proactively identify trends before things go wrong through the use of instrumentation.
It allows us to know who did what, when, where and how by capturing an audit trail.
It allows us to establish a baseline for system performance and uptime.
Regulations such as SOX, HIPAA and PCI DSS require companies to collect and analyze logs from a variety of sources. It is a huge boon to security and stability to be able to correlate logs from multiple systems into a single log, but in order to do this we need to make sure that all computer clocks are synchronized and accurate. Not only does this allow us to interweave log entries from different sources together in a single stream, but it also allows us to correlate log entries with real world events such as badge readers, security cameras and CCTV recordings.
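To see why synchronized clocks matter, here is a small illustration that interleaves entries from two hypothetical sources into a single timeline, assuming every source stamps entries in UTC. The log entries themselves are invented.

```python
# Interleaving two log sources into one stream, assuming every source
# records timestamps in UTC (e.g. via NTP-synchronized clocks).
from heapq import merge

app_log = [
    ("2024-05-01T09:14:02Z", "app", "login failed for user jdoe"),
    ("2024-05-01T09:14:40Z", "app", "login failed for user jdoe"),
]
badge_log = [
    ("2024-05-01T09:13:55Z", "badge", "badge 4411 rejected at server-room door"),
]

# Both lists are already sorted, so a heap merge yields one ordered timeline.
for timestamp, source, message in merge(app_log, badge_log):
    print(f"{timestamp} [{source}] {message}")
```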
Of course, we must be able to trust the logs we collect, and so integrity must be assured. Log files must be secured against unauthorized alteration and access must be closely controlled. In this section we will be covering technologies to help us achieve all of these goals, including application event logs, Syslog and digital rights management.
Application Event Logs
An application event log captures the various activities that go on when an application is running normally, as well as when the unexpected happens. Care must be taken not to approach logging haphazardly, putting it in only when a problem is detected. Events to be logged must be logically identified and consistently applied across the entire application. For example, to log data access requests, we can easily apply logging in a single place if a data access layer, or DAL, is properly architected. We can even put in code to reflect CRUD operations. But this will only work if we design the capability in when the DAL is designed. This is a great example where security can help us create better designs and increase quality.
Additionally, we need to make sure to put in mechanisms so that we can increase or decrease verbosity at run-time, but in a way that an attacker cannot access this functionality. Events should be categorized to allow for easy filtering – otherwise we will be overwhelmed with so much data that the logs will be ignored.
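As a sketch of what designing logging into a DAL might look like, the hypothetical class below logs every CRUD operation in one place and exposes a run-time verbosity control that should only ever be reachable by authorized administrators. The class and method names are assumptions for illustration.

```python
# Hypothetical data access layer (DAL) with logging designed in: every CRUD
# operation passes through one place, so categorized audit entries are
# applied consistently rather than bolted on when problems appear.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app.dal")

class DataAccessLayer:
    def create(self, table: str, record: dict, user: str) -> None:
        logger.info("CRUD=create table=%s user=%s", table, user)
        # ... actual insert would go here ...

    def read(self, table: str, record_id: int, user: str) -> None:
        logger.debug("CRUD=read table=%s id=%s user=%s", table, record_id, user)
        # ... actual select would go here ...

def set_log_verbosity(level: str) -> None:
    """Run-time verbosity control; expose only to authorized administrators."""
    logger.setLevel(getattr(logging, level.upper()))

dal = DataAccessLayer()
dal.create("orders", {"item": "widget"}, user="jdoe")   # logged at INFO
set_log_verbosity("debug")                               # increase verbosity
dal.read("orders", 42, user="jdoe")                      # now visible at DEBUG
```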
The following events must be logged at a minimum:
Security breaches
Critical business processes
Performance metrics
Compliance-related events
Syslog
Syslog is a standard for event logging that sends data across the network to a server where it can be collated with other log sources who submit their own data using syslog. Syslog is built on top of TCP/IP and can leverage either the UDP or TCP protocols. UDP is lighter than TCP but does not provide any mechanism for assuring delivery. Cache servers can be used to increase performance but use of such mechanisms will increase the attack surface.
Syslog has been the standard on Linux and Unix platforms since the 1980s and is quickly gaining acceptance on Windows platforms. NIST SP 800-92 describes how to use Syslog for auditing as well. One of the drawbacks of this standard is that it does not include any security capabilities, and therefore all traffic must be encrypted using TLS or SSH tunneling. Hashing should also be employed to assure that no one has tampered with the data while in-transit. Syslog-NG, which stands for New Generation, is an open-source implementation of the Syslog standard.
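As one illustration, Python’s standard library includes a Syslog handler that can ship application events to a central collector. The server address below is an assumption, and as noted above the UDP traffic shown here would still need TLS or SSH tunneling to be protected in transit.

```python
# Sending application events to a central Syslog server over UDP. Plain
# Syslog offers no confidentiality or integrity, so in practice this traffic
# should be tunneled (TLS/SSH) and the collector should verify integrity.
import logging
import logging.handlers
import socket

syslog_handler = logging.handlers.SysLogHandler(
    address=("logs.example.internal", 514),     # assumed collector host/port
    facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    socktype=socket.SOCK_DGRAM,                 # UDP: lightweight, no delivery guarantee
)

logger = logging.getLogger("app.security")
logger.setLevel(logging.INFO)
logger.addHandler(syslog_handler)

logger.warning("Repeated login failures for account jdoe from 10.0.0.12")
```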
Digital Rights Management (DRM)
Figure 152: Forward Locking
Think back to the days when we used to watch DVDs or BluRay movies. I personally have not watched one of those in years - ever since Netflix started streaming in HD. If you can recall, an FBI message would appear warning pirates to not copy the movie. It looked something like Figure 152. If you tried to fast-forward through it, you quickly discovered that this was not allowed. This feature is called forward locking, and is one example of something called Digital Rights Management, or DRM. DRM refers to multiple technologies and standards with the goal of protecting intellectual property and digital content using technological controls.
Copyright law can only act as a deterrent – it cannot actually force someone to obey the law. DRM helps to enforce the law by preventing someone from making a digital copy. Copyright law functions by permitting all that which is not forbidden. DRM, conversely, operates by forbidding all that which is not permitted. Copyright acts like a blacklist and allows anything BUT what is on the list, while DRM acts like a whitelist and only allows things on the list.
DRM provides copy protection out of the box but can also be configured to provide usage rights by only allowing certain parties to view the digital contents and can assure authenticity and integrity. DRM provides presentation layer security – remember that is found at OSI layer 6 – by stopping the copy, printing or sharing of a file, but still allowing it to be viewed. This is usually carried out by tying the file to some type of unique hardware identifier. Even though the file may be copied, it cannot be used on a different system because DRM allows access only on the system containing that unique hardware identifier. For example, if you purchase a movie from the iTunes store, it will work on your computer, but if you simply copy it to another computer it will not work unless you explicitly authorize that computer as well.
Figure 153: How DRM Works
The three core entities of DRM are the user, content and rights. The user entity is the party wishing to access the content and can be either a human or a computer. The content entity is whatever is being protected, such as a song, a movie, a photo, a PDF document – whatever you like as long as it is an electronic file. The rights entity represents what the user can do with the content, and is expressed using a language called Rights Expression Language, or REL. Open Digital Rights Language, or ODRL, is an example of REL still under development which expresses rights using XML. eXtensible rights Markup Language, or XrML, is another example that is slightly more abstract than ODRL and uses a meta-language instead of XML. Specific to print media such as newspapers and magazines, Publishing Requirements for Industry Standard Metadata, or PRISM, is used between businesses and is mostly concerned with copyright protection. Look at Figure 153 for an overview of the DRM process.
Keep in mind that while REL expresses rights, it really does not have the ability to enforce those rights. When implementing REL, you will have to code the enforcement mechanism yourself and ensure it has been thoroughly tested.
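Since a REL only expresses rights, the enforcement check is up to the implementer. As a rough sketch, assuming a rights grant that has already been parsed out of an ODRL or XrML document, an enforcement check might look like the following; the names and data are hypothetical and do not come from any real DRM product.

```python
# Hypothetical enforcement of rights after a REL document has been parsed:
# the user, the content, the permitted actions, and the device the license
# is bound to. Real DRM systems are far more involved than this.
import uuid

def current_device_id() -> str:
    """Best-effort hardware identifier; real DRM uses stronger binding."""
    return format(uuid.getnode(), "x")   # MAC-derived id, illustration only

# Rights as they might look after parsing a grant (made-up data).
rights_grant = {
    "user": "jdoe",
    "content_id": "movie-8675309",
    "allowed_actions": {"play"},          # whitelist: anything absent is denied
    "bound_device": current_device_id(),
}

def is_permitted(user: str, content_id: str, action: str) -> bool:
    return (
        user == rights_grant["user"]
        and content_id == rights_grant["content_id"]
        and action in rights_grant["allowed_actions"]
        and current_device_id() == rights_grant["bound_device"]
    )

print(is_permitted("jdoe", "movie-8675309", "play"))    # True on this device
print(is_permitted("jdoe", "movie-8675309", "copy"))    # False: not whitelisted
```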
Naturally, there are a few challenges with DRM. First, tying usage rights to a piece of hardware is risky as hardware wears out and is replaced, resulting in the purchaser of the rights no longer being able to access the content. However, tying access rights to a person requires them to authenticate, which presents its own problems. For example, using personal data to uniquely identify a person can also lead to privacy concerns. In some cases, DRM forbids even legal copying of content, which conflicts with the Fair Use law.
Section 4: Secure Supply Chain Management
Back in the 1970s, just about every company wrote and maintained their own software. In the 1980s, companies began to purchase commercial off-the-shelf, or COTS, software. In the late 1990s, the same companies began to subscribe to web-based applications hosted in someone else’s data centers. In the 2000s, core computing capabilities started moving to the cloud. Today it is not uncommon to find that virtually all mission-critical applications for a company are hosted in a third-party’s cloud such as Amazon’s AWS or Microsoft’s Azure. Even more common is to leverage a Software as a Service, or SaaS, offering in which the complete application is both hosted and owned by a third party. Office applications such as Microsoft 365 and Google Documents are prime examples. SaaS applications service multiple entities, and the term tenant is used to represent individual customers all using the same application. That is why SaaS applications are often called multi-tenant.
The software supply chain is comprised of all entities and products involved in delivering some type of software capability to a company and managing that supply chain is called supply chain management, or SCM. Figure 154 illustrates the process.
SCM consists primarily of an acquirer, vendor and supplier. Looking at it from the viewpoint of the company purchasing software or services, a vendor is a specific external entity within that chain that resells software, while a supplier produces software for the vendor to resell to the company. In this case, we can refer to the company as the acquirer, as the company acquires and uses the software from the vendor.
Figure 154: A Supply Chain
In cases of SaaS, the same company is usually both the supplier and vendor. For example, when paying a monthly fee to use H&R Block’s tax software online, H&R Block is the supplier who created the software that it then ‘leases’ to companies, thereby acting as the vendor as well. To avoid confusion, for the remainder of this topic we will use the term ‘supplier’ to refer to both a supplier and vendor.
Chapter 46: Acquisition Models
There are several models that a company can use to acquire software or services, including a direct purchase, licensing from the supplier, partnering with a supplier, outsourcing and the use of managed services. By far the most common models are the last two – outsourcing and managed services. Figure 155 lists the various models.
Figure 155: Acquisition Models
Outsourcing refers to a company subcontracting development work to an external third-party. This normally involves splitting the work up into sub-components which are then handed over to a third-party to implement. When the external partner is based in a foreign country, this activity is referred to as off-shoring. Near-shoring is a subset of off-shore activity but takes place in a foreign country that is typically geographically close to the company’s native country and usually has a time zone very close to the company’s own. For example, a U.S. company doing business with an India-based facility is called off-shoring, while the same company doing business with a Mexico-based facility is termed near-shoring. Such an approach usually provides lower hourly costs, and access to ready-to-go skilled development resources and intellectual capital.
An important concept to understand when dealing with outsourcing is software provenance. When efforts are outsourced, a number of external entities are usually involved, not just one. For example, when off-shoring work, we might work with a local external vendor, who sub-contracts work to another entity, that then hands work off to a foreign office location, where individuals are managed by yet another corporate entity. These relationships can be seen as a series of steps as shown in Figure 156.
At each hand-off between entities, a software provenance point occurs where responsibilities are shifted from the previous entity to the next, with the danger of losing control being greatest at each provenance point. It is crucial that the supply chain be protected at each provenance point in three ways – conformance, trustworthiness and authenticity.
Conformance ensures that software follows a documented plan and meets the stated requirements. Additionally, any applicable standards and best practices must be followed.
Trustworthiness means that software does not possess any vulnerabilities that have been purposefully or accidentally injected into the source code. Open-source libraries and even purchased third-party libraries can have malicious embedded code, and any software that leverages such libraries is itself infected.
Authenticity ensures that all materials used in the production of software – such as external libraries, algorithms and entire software packages – are not counterfeited or pirated. In other words, the software does not violate any intellectual property rights.
Figure 156: Software Supply Chain Staircase
When all three attributes have been met – conformance, trustworthiness and authenticity – software will also have integrity, meaning that its execution is predictable. Figure 157 illustrates the relationships between each attribute.
Aside from outsourcing, managed services is the other common acquisition model that companies use. Instead of the company divvying up work to external parties, entire operations are delegated to a third-party that specializes in that area. This allows the company to focus on their core expertise and leave fringe, yet crucial, operations to the experts. As an example, development teams can subscribe to source control and defect tracking capabilities offered by SaaS providers such as Atlassian. While these activities are crucial for a software company delivering GPS tracking capabilities to the trucking industry, creating source control products is not considered to be the company’s core expertise – creating GPS tracking software is. Entire security operations can be covered as a managed subscription service, such as security risk management, penetration testing, and forensics, among many others. Beyond SaaS, we also have available on a subscription basis Platform-as-a-Service, or PaaS, and Infrastructure-as-a-Service, or IaaS. When leveraging such capabilities, the service level agreement, or SLA, is the primary tool to manage such services.