Sharks in the Moat

by Phil Martin

Asymmetric Weaknesses

Now there are a couple of weaknesses in the asymmetric scheme – an attacker could conceivably intercept the entire message and simply replay it later with no modifications, and the recipient wouldn’t be any the wiser. To prevent this, the date and time at which the message was sent is usually embedded inside of the content, so the recipient can hold it suspect if it is more than a few seconds old. A few seconds on the Internet is an eternity.
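The freshness check described above can be sketched in a few lines of Python. This is only an illustration: the five-second tolerance, the function name is_fresh and the choice to reject timestamps from the future are all assumptions, and the timestamp is assumed to travel inside the encrypted content so an attacker cannot simply rewrite it.

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerance; the right value depends on the system and its clock skew.
MAX_AGE = timedelta(seconds=5)


def is_fresh(sent_at, now=None, max_age=MAX_AGE):
    """Return True if a message's embedded timestamp is recent enough.

    sent_at must be a timezone-aware datetime taken from inside the
    decrypted message content.
    """
    now = now or datetime.now(timezone.utc)
    age = now - sent_at
    # A negative age means the timestamp is in the future, which this
    # sketch also treats as suspect.
    return timedelta(0) <= age <= max_age
```

A recipient would call is_fresh right after decrypting a message and discard anything that fails the check. A real system would normally allow for a small amount of clock drift between sender and recipient.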

Another possible weakness is a man-in-the-middle attack. For example, Matt, who is a no-good malicious hacker, could have somehow inserted himself between Sam and the Internet and substituted his own public key for Sam’s. Then, Bobby would be encrypting his information with evil Matt’s public key. All Matt has to do is decrypt the message with his own private key, read the message, re-encrypt it with Sam’s public key and forward it on. Neither Bobby nor Sam would ever know that Matt was listening to the entire conversation.

However, the CA has solved this as well. Matt would have to get in the middle of all traffic for both Bobby and Sam in order to replace the CA, which is just not going to happen. Because both parties will get the public key and certificates from the CA, there really is no way to get in the middle.

Chapter 12: Integrity

Now that we have covered encryption at length, let’s get back to our core security concepts. We previously discussed how to prevent disclosure of information through confidentiality, but we also need to ensure information is not altered or lost, at least until we want that to happen. Software that can do this is known as resilient software, and the security attribute that is implemented is called integrity.

As an example of this capability, consider your expectations when using bill pay to send an electronic check to make a payment on your mortgage, which might take 4 days to complete. If the dollar amount is $2,092.78/month, then you expect exactly $2,092.78 to be deducted from your checking account and $2,092.78 to be credited to your mortgage holder’s bank account. You certainly do not expect someone to log in and change that dollar amount in the middle of the transaction, nor would you expect the dollar amount to be accidentally rounded up to $2,100.00!

From this example, we can deduce two things about software integrity. First, it should ensure that data is transmitted, processed and stored as accurately as the originator intended, and second, the software should perform as reliably as it was intended to according to predefined rules. A specialized case of integrity within a database is called referential integrity, which we will discuss later.

Integrity should address two core security requirements – reliability assurance and prevention of unauthorized modifications. When integrity assures reliability, we really mean that the software is functioning as designed. Activities behind this capability include proper design, implementation and testing. When integrity prevents unauthorized modifications, our software is assuring that both the system and data remain accurate. More to the point, this activity ensures that only authorized individuals are allowed to change information and programs, and that they do so in an authorized manner. Not only do we need to make sure that Joe Bob in Accounting is not able to start modifying the database directly, but we also need to make sure that Veronica the DBA has read-only access to production data, and that any modifications to production data are approved in advance. When done properly, integrity assurance promises that a system and its data will remain complete and consistent.

Note that we keep referring to both the system and data. Together, the two make up a solution, but are so different that each needs to be addressed separately. For example, defending against SQL injection attacks requires certain actions in software, such as input validation, but requires different actions when communicating with the database, such as not using string concatenation to create SQL statements. And then we have data that might be in transit between the system and the database that still needs to be protected. The entire solution must be examined end-to-end to ensure gaps don’t appear. Let’s talk about some ways to ensure integrity.

Input validation is the act of never trusting external data coming into a system by examining it and correcting known vulnerabilities. This could take place in a browser, a Windows program or in a mid-tier web service.
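As a minimal sketch of both ideas mentioned so far, the Python below validates input against a known set of acceptable values and then builds the SQL statement with placeholders instead of string concatenation. The accounts table, the ALLOWED_STATES set and the function names are hypothetical and exist only for illustration.

```python
import re
import sqlite3

# Hypothetical allow-list and format rule, for illustration only.
ALLOWED_STATES = {"TX", "CA", "NY"}
ACCOUNT_ID = re.compile(r"[0-9]{1,10}")


def validate(account_id, state):
    """Reject anything that is not in the known set of acceptable values."""
    if not ACCOUNT_ID.fullmatch(account_id):
        raise ValueError("account_id must be 1 to 10 digits")
    if state not in ALLOWED_STATES:
        raise ValueError("unknown state code")
    return int(account_id), state


def find_account(conn, account_id, state):
    """Look up an account using validated input and bound parameters."""
    acct, st = validate(account_id, state)
    # The ? placeholders keep user input out of the SQL text itself.
    cur = conn.execute(
        "SELECT name FROM accounts WHERE id = ? AND state = ?", (acct, st)
    )
    return cur.fetchall()
```

With this in place, a value such as "42 OR 1=1" is rejected by validate before it ever reaches the database, and even a value that passed validation would be treated as data rather than as part of the SQL statement.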

Parity bit checking can be used to detect errors or changes made to data as it is transmitted. A parity bit is appended to a group of data bits and lets us know if the number of 1’s in the data bits is odd or even. If data was purposefully or accidentally modified, we have a good chance of detecting the problem. A related technique for messages that are longer than one byte (eight bits) is the cyclic redundancy check, or CRC. Right before transmission, each block of data is run through an algorithm that computes a CRC value, called a checksum, which the recipient recomputes and compares. Neither parity bit checking nor a CRC is foolproof, as different original values can mathematically result in the same parity value or CRC. A much better solution is called hashing, which we have already covered.
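To make the limits of these checks concrete, here is a small Python sketch. The function names are our own, and the CRC shown is the common CRC-32 variant from Python's standard zlib module, which is only one of many CRC algorithms in use.

```python
import zlib


def even_parity_bit(byte):
    """Return the bit that makes the total number of 1's even."""
    return bin(byte).count("1") % 2


def crc_checksum(data):
    """Compute a CRC-32 checksum over a block of bytes."""
    return zlib.crc32(data)


original = 0b10110010       # four 1's, so the parity bit is 0
one_bit_flip = 0b10110011   # five 1's, parity changes, error detected
two_bit_flip = 0b10110001   # four 1's again, parity unchanged, error missed
```

Comparing even_parity_bit(original) with the other two values shows the weakness: a single flipped bit changes the parity, but two flipped bits cancel each other out and go unnoticed. A CRC catches far more errors than a single parity bit, yet anyone who alters the data on purpose can simply recompute it.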

An aspect of integrity that is often overlooked is that of resource contention. This problem surfaces when two different processes attempt to use the same common resource at the same time. The resource could be a table record, a file, a memory location, or really anything that is shared among multiple consumers. Software must be purposefully written to handle situations in which integrity could be lost if two or more processes were to modify or delete a common resource. The pattern to avoid such an occurrence is to implement resource locking, which allows only one process access to the common resource at a time. For example, most relational databases have this capability already built in, and allow only one process to alter a record while forcing all others to queue up and wait their turn. Resource locking must be carefully implemented, or a deadlock condition can result in which each process is waiting for the other to release the resource. In this scenario both processes are effectively frozen and unresponsive, and will never recover.
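One common way to lock resources without inviting a deadlock is to always acquire locks in the same order. The Python sketch below assumes a hypothetical Account class with balances held in whole cents. It illustrates the pattern only and is not how any particular database implements locking.

```python
import threading


class Account:
    """A shared resource protected by its own lock (hypothetical example)."""

    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move funds between two accounts without risking a deadlock."""
    if src is dst:
        raise ValueError("source and destination must differ")
    # Always lock the account with the lower id first. If two transfers run
    # in opposite directions, both still request the locks in the same
    # order, so neither can hold one lock while waiting forever on the other.
    first, second = sorted((src, dst), key=lambda a: a.account_id)
    with first.lock, second.lock:
        if src.balance < amount:
            raise ValueError("insufficient funds")
        src.balance -= amount
        dst.balance += amount
```

If each transfer instead locked its source account first, a transfer from A to B and a simultaneous transfer from B to A could each grab one lock and wait on the other, which is exactly the frozen state described above.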

Let’s review some examples of good integrity requirements.

“All input forms and querystring values must be validated against a known set of acceptable values before the software accepts them for processing.”

“All outgoing data feeds must implement a computed checksum or hash function so that the recipient can validate its accuracy and completeness.”

“All non-human actors such as system and batch processes must be identified, monitored and prevented from altering data unless explicitly authorized to.”
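The requirement for a computed checksum or hash on outgoing data feeds could be met along the following lines. This is a sketch that assumes SHA-256 and uses function names of our own choosing. The requirement itself does not name an algorithm.

```python
import hashlib
import hmac


def feed_digest(payload):
    """Compute a SHA-256 digest that is sent along with the data feed."""
    return hashlib.sha256(payload).hexdigest()


def verify_feed(payload, expected_digest):
    """Recompute the digest on the receiving side and compare the two."""
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(feed_digest(payload), expected_digest)
```

The sender publishes feed_digest(payload) next to the feed, and the recipient calls verify_feed before trusting the contents. Keep in mind that a plain hash only proves the data matches the digest. If an attacker can replace both, a keyed hash such as an HMAC or a digital signature is needed.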

Chapter 13: Business Continuity

So far, we have covered confidentiality and integrity, and the last member of the classic CIA triad is availability. Before launching into that discussion though, we will need to cover a number of topics related to keeping a business up and running by calculating risk to our most important assets. While we will leave most of the details here to the Auditor role, we do need to define a few terms used when calculating risk.

The annual loss expectancy, or ALE, is the amount of money we expect to lose in a single year due to a specific risk. ALE is calculated using several other variables.

Each asset must be assigned a monetary value, called the asset value, or AV.

The exposure factor, or EF, is the percentage of an asset’s value that is likely to be destroyed by a particular risk.

The single loss expectancy, or SLE, is the loss we will encounter if we experience a single instance of a specific risk. SLE is calculated using the following formula:

SLE = AV × EF

The annualized rate of occurrence, or ARO, is the number of times a threat on a single asset is expected to happen in a single year.

To calculate ALE, we use the following formula:

ALE = SLE × ARO

The main thing to remember for now is that ALE is how much a single threat will probably cost us in a single year.
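To see how the two formulas fit together, here is a short worked example in Python. The numbers are hypothetical: an asset worth $200,000, a risk expected to destroy 25% of its value, and an occurrence rate of once every two years (an ARO of 0.5).

```python
def single_loss_expectancy(asset_value, exposure_factor):
    """SLE = AV x EF, with EF expressed as a fraction between 0 and 1."""
    if not 0.0 <= exposure_factor <= 1.0:
        raise ValueError("exposure factor must be between 0 and 1")
    return asset_value * exposure_factor


def annual_loss_expectancy(asset_value, exposure_factor, annual_rate):
    """ALE = SLE x ARO."""
    return single_loss_expectancy(asset_value, exposure_factor) * annual_rate


# Hypothetical figures, for illustration only.
sle = single_loss_expectancy(200_000, 0.25)       # 50,000 per incident
ale = annual_loss_expectancy(200_000, 0.25, 0.5)  # 25,000 per year
```

In other words, each incident would cost us $50,000, and because we only expect one every other year, we should plan on losing $25,000 per year to this particular risk.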

The recovery time objective, or RTO, is the amount of time required to get a compromised facility or system back to an acceptable level of operation.

The recovery point objective, or RPO, tells us how much data we can stand to permanently lose in case of interruption in terms of time, usually hours or days.

The service delivery objective, or SDO, defines the minimum level of service that must be restored after an event until normal operations can be resumed.

The maximum tolerable outage, or MTO, is the maximum time that an organization can operate in an alternate or recovery mode until normal operations are resumed. Many factors can limit MTO, such as the availability of fuel to operate emergency generators, or the accessibility of a remote backup site. MTO will have a direct impact on the RTO, which in turn impacts the RPO.

Maximum tolerable downtime, or MTD, is just another name for MTO.

The allowable interruption window, or AIW, reflects the amount of time normal operations are down before the organization faces major financial problems that threaten its existence.

Business continuity is a strategy to prevent disasters, recover from them and continue operating through them. If we focus only on the ‘recover from disasters’ part, then we are thinking about disaster recovery.

A disaster recovery plan, or DRP, documents how we will quickly restore data, applications and core services that run our business after a serious event happens. A business continuity plan, or BCP, documents how an organization will prevent disruptions and continue operating at a strategic level with minimal or no downtime after a serious event happens.

Before we can talk about either a BCP or DRP, we have to perform something called a business impact analysis, or BIA. The BIA helps us to understand what assets are important, and what their loss will mean to us. The BIA helps us to calculate ALE, RTO, RPO and MTO, which in turn helps to define the BCP and DRP. That enough acronyms for you?

Chapter 14: Service Level Agreements

A service level agreement, or SLA, stipulates and commits a provider to a required level of service or support, for both hardware and software. The power of an SLA kicks in when a provider fails to meet the minimum stipulations, and penalty provisions and enforcement options specified in the SLA take effect. For example, if an SLA requires a 99.999% uptime for a given system, but the provider has only delivered 99.900% uptime in the last month, the SLA might require the provider to pay a $25K penalty per week until service levels are brought up to the minimum required. That would be the stick in the ‘carrot or stick’ analogy. The carrot could be represented in the SLA by a $100K bonus if the provider delivers 1 month or more before the agreed upon deadline. In the case where the provider delivers on-time but not before, the provider has not violated the SLA but doesn’t get the reward for early delivery either.

We can also express SLAs in terms of service improvements such as:

Reductions in the number of help desk calls

Reductions in the number of system errors

Improvements to system availability

We will make frequent reference to SLAs as we go through the book, so just keep in mind that it is a tool between us and a supplier, vendor or provider that controls how we both behave. Never enter into a business relationship without a contract, and the contract will normally include an SLA at a minimum.

Let’s continue now with core security concepts and talk about availability – which just happens to reference all of the three-letter terrors that we have just discussed in the last couple of chapters.

Chapter 15: Availability

So far, we have covered confidentiality and integrity, and the last member of the classic CIA triad is availability.

When talking about business continuity, we look at the availability of crucial systems and data, meaning that those capabilities are up and functioning at a minimal level. But when we move into the realm of software security, the term ‘availability’ takes on some additional nuances. Available software means two things – that systems and data are accessible only to those who are authorized to access them, and that the systems and data are accessible only when required. While business continuity is really only concerned with making sure a capability is accessible, availability in the software security world operates in the opposite direction as well – not only must it make the capability accessible, but it must make it inaccessible when appropriate as well. Data should not be available to the wrong people or at the wrong time.

If availability requirements are not explicitly stated, the most common result is a system that is unreliable and unstable, often experienced as a denial of service, or DoS, by the intended audience. Put simply, availability requirements must protect the system against the destruction of the software and data. When determining these requirements, MTD, RTO and RPO are often used to both derive and express the need in a written fashion. We also need to note the crucial role that a service level agreement, or SLA, plays when defining good requirements. The SLA is one of the best ways to explicitly state and monitor compliance with requirements for both business partners and clients. MTD and RTO should be included in the SLA, along with the accepted availability as measured in ‘up time’. An industry-accepted way of measuring availability is to use the concept of ‘nines’. In short, the more nines in the uptime percentage, the less downtime a system should experience in a given time period. For example, if an SLA calls for three nines, or a 99.9% uptime, then we expect there to be less than 9 hours of downtime within a calendar year. If we were to increase that to four nines, or 99.99%, a given system should be down for less than one hour in a given year. The maximum reasonable uptime is expressed as six nines, or 99.9999%, which means that a system will not be down for more than 31.5 seconds in any given year. Six nines represents an extremely reliable and available system and is seldom achievable without a great deal of redundancy built in from the beginning stages of requirements definition, including load balancing and the elimination of single points of failure.
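The downtime figures for each level of ‘nines’ are simple arithmetic, sketched below in Python. The function name is ours, and the calculation assumes a 365-day year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def allowed_downtime_seconds(nines):
    """Return the yearly downtime budget for an uptime with this many nines.

    Three nines means 99.9% uptime, so 1/1000 of the year may be downtime.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return SECONDS_PER_YEAR / 10 ** nines


for n in (3, 4, 5, 6):
    seconds = allowed_downtime_seconds(n)
    print(f"{n} nines: {seconds:,.1f} seconds ({seconds / 3600:.2f} hours)")
```

Three nines works out to roughly 8.76 hours per year, four nines to about 52.6 minutes, five nines to a little over 5 minutes, and six nines to about 31.5 seconds, which matches the figures quoted above.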

Insecure development practices that lead to security breaches must be addressed during the requirements stage by addressing code-level implementation details such as dangling pointers, improper memory de-allocation and infinite loop constructs. We’ll discuss what those look like later.

Following are some good examples of availability requirements:

“The software must provide a high availability of five nines (99.999%) as defined in the SLA.”

“The number of supported concurrent users should be 300.”

“All software and data should be replicated across physical data centers to provide load balancing and redundancy.”

“Mission critical functionality should be restorable to normal operations within 1 hour of disruption; mission essential functionality should be restorable to normal operations within 4 hours of disruption; and mission support functionality should be restorable to normal operations within 24 hours of disruption.”

Although no code is actually written during the design phase, coding constructs such as incorrect database cursor usage and tight loops that lead to deadlocks can be examined. Replication, failover and scalability should be designed at this stage.

Both MTD and RTO must be considered during the design phase and should have already been used to explicitly state requirements. A single point of failure, or SPoF, can be best described as a system component that has no redundant capabilities. This is addressed by replicating data, databases and software across multiple systems, resulting in redundancy. In addition to eliminating SPoFs, redundancy also helps us to achieve scalability as workload is spread across more systems at the same time.

Replication is usually implemented as a master-slave backup scheme, sometimes called a primary-secondary scheme. In this layout, one system acts as the primary node and updates the secondary node in either an active or passive fashion. An active-active replication scheme applies any update to both the primary and secondary system in real-time, while an active-passive scheme allows the primary system to update first, followed by an update to the secondary system after the primary system reports success. When using an active-passive scheme, special attention should be paid to how well integrity is maintained across the two systems.

Whereas replication implies that two systems stay in a constant state of synchronization and both remain available at all times, a failover capability simply provides a standby system to be ready to take over in case a failure of the primary system is detected. The amount of elapsed time between the point of primary failure and when the standby is ready to take over could be seconds up to a number of hours. When we have a failover capability, it is assumed that the move to the standby system happens automatically with no manual human involvement. If we use the term switchover, the expectation is that a person will have to manually bring the standby system online in the event of a primary failure.

Closely related to availability is something called scalability. If a system cannot scale its resources up to accommodate increasing usage without experiencing a decrease in functionality or performance, then a decrease in availability will be the result. We have two types of scalability that can help us keep availability at an acceptable level – vertical and horizontal. When discussing scalability, keep in mind that we will use the term ‘node’ to refer to each identical copy of a system or its software.

Vertical scaling does not increase the number of nodes, but instead increases the capabilities of existing nodes, most often by increasing the hardware resources available to each node. For example, if we discover that system performance is suffering because the servers are running out of memory, we can simply increase the amount of physical or virtual memory available to a node. If we find that storage space is running out, we can install bigger hard drives or perhaps move to some type of network-attached storage. However, sometimes the problem can be solved through configuration. For example, most run-time environments use the concept of database connection pooling to save on memory and CPU usage. Instead of creating a dedicated connection to a database for every unique process, connection pooling allocates a specified number of connections and reuses them as needed. This could cause some processes to have to wait for a connection to free up, resulting in a loss of availability. By increasing the number of database connections in the available pool, we can alleviate some bottlenecks at the expense of an increase in memory usage on both the mid-tier server and the database server. Additionally, if not implemented properly, sharing database connections could result in some fairly serious security flaws.
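A bare-bones connection pool might look something like the Python sketch below. It uses SQLite from the standard library purely so the example is self-contained. The ConnectionPool class, its size setting and the one-second wait are assumptions for illustration, and real run-time environments ship their own, far more capable, pooling.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hand out a fixed number of reusable database connections."""

    def __init__(self, size, database=":memory:"):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a pooled connection be used by
            # whichever thread borrows it next.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=1.0):
        """Borrow a connection, waiting up to timeout seconds for a free one."""
        try:
            conn = self._idle.get(timeout=timeout)
        except queue.Empty:
            # Every connection is busy: this is the loss of availability
            # that a too-small pool produces.
            raise TimeoutError("no database connection became available") from None
        try:
            yield conn
        finally:
            # Discard uncommitted work so the next borrower does not
            # inherit another caller's transaction.
            conn.rollback()
            self._idle.put(conn)
```

Raising the size passed to ConnectionPool trades memory for shorter waits, which is the configuration change described above. The rollback in the finally block hints at the security concern: a connection handed to the next process should never carry over state from the previous one.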
