requirements and development efforts. Although most would say this is a breach of relationships, it has been my experience that the engineering manager must sometimes reach out directly to stakeholders to ensure their requirements are making it to the development team intact and that the stakeholder is indeed being properly represented by Product and included in key meetings by Project.
Chapter 41: The Testing Role
While it takes all roles covered in this book to effectively produce secure software, there are three core roles that shoulder the bulk of responsibility – Developer, Architect and Testing. I cannot stress strongly enough how important the testing capability is to truly secure software – if you cannot measure it, you cannot manage it. And testing is the only function that focuses exclusively on measuring how well software meets requirements and how secure it is.
Flaws vs. Bugs
Now, we previously made the statement that coding issues represent the source for most threats. However, when a significant problem regarding availability surfaces in a production environment, it will usually be related to an environmental or architectural issue as opposed to a problem introduced by a coding mistake. In this book, we use the term flaw to refer to issues related to infrastructure, configuration or architecture, and use the term bug to refer to coding mistakes. At times a single problem can be attributable to both a flaw and a bug, but nonetheless the sources remain distinct.
Threat modeling and architecture design reviews are helpful in uncovering potential flaws, while code reviews and penetration testing help with identifying bugs. For example, business logic flaws are difficult to detect when performing code reviews, nor will network security devices protect us from them. Our only hope of proactively flushing these issues out is to perform architectural and design reviews. Specific to security, attack surface evaluations, threat modeling and misuse case modeling are the best tools for locating flaws. Logical flaws, such as incorrect business logic, are sometimes called semantic issues, while bugs are called syntactic issues.
Quality Assurance
Quality assurance is the act of assessing and validating that software has achieved a specific level of quality. We can break quality down into five attributes – reliability, resiliency, recoverability, interoperability and privacy. Figure 118 lists each attribute.
Reliability testing measures how well software functions according to the requirements as stated by the business or customers. Given that the average application is extremely complex, it is not very likely that all possible paths have been properly tested, and this reality is often taken advantage of by an attacker.
Figure 118: Quality Assurance Attributes
Resiliency testing measures how well software can withstand an attack meant to compromise the application, as well as unintentional or accidental actions by users that impact its availability. If software does not have a high level of resilience, then it will be vulnerable to attacks such as injection, DoS, theft and memory corruption.
Recoverability testing validates that software can return to an operational state after it has been taken down by an accidental or intentional compromise.
Interoperability testing looks at how well software operates in different environments and with other applications.
Privacy testing is carried out to see if a proper level of confidentiality is maintained for PII, PHI, PFI, and any other information that is exclusive to the owner.
Keep in mind that just because software meets all of its security requirements does not mean that it is secure. Proper security testing – which we will cover in just a bit – is required before we can call something ‘secure’. Since security is one of the attributes associated with quality, if an application really is secure, then its quality increases as well.
Testing Artifacts
Before we dive into discussing the various types of QA testing, let’s discuss the artifacts that are produced – the strategy, plan, cases, suites, scripts and harnesses.
Figure 119: Testing Artifacts
The test strategy is the first testing artifact to be created and outlines testing goals, methods, test environment configuration and any required resources. The strategy controls how communication will be carried out among all team members, and describes the criteria used to determine if a test passes or fails. When developing the strategy, the team should take into account data classification, threat models, and the subject/object matrices.
Once a strategy has been finalized, we can create the test plan, which provides the workflow at a granular level that a tester will follow. A test plan includes three primary components – requirements, methods and coverage.
A test case takes the plan requirements and defines multiple ‘measurable conditions’ that must be met in order for the plan to pass. In other words, each test case represents a single unique path that must be tested. A test case will contain the following information (a minimal sketch follows the list below):
•	A unique identifier.
•	A pointer to the requirement that is being validated.
•	Any preconditions that need to be met.
•	Actions or steps to be carried out.
•	The required test inputs.
•	The expected results that equate to a ‘pass’.
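To make those fields concrete, here is a minimal sketch of a test case captured as a simple data structure. The field names, the TC-AUTH-001 identifier, and the account-lockout scenario are hypothetical illustrations, not part of any particular tool or standard.

```java
import java.util.List;

// A minimal, illustrative representation of a test case record.
// Field names and sample values are hypothetical.
public record TestCase(
        String id,                   // unique identifier, e.g. "TC-AUTH-001"
        String requirementRef,       // pointer to the requirement being validated
        List<String> preconditions,  // conditions that must hold before execution
        List<String> steps,          // actions to be carried out
        List<String> inputs,         // required test inputs
        String expectedResult) {     // the result that equates to a 'pass'

    public static TestCase sample() {
        return new TestCase(
                "TC-AUTH-001",
                "REQ-4.2: Lock account after five failed logins",
                List.of("A test account exists and is unlocked"),
                List.of("Submit five consecutive logins with a bad password",
                        "Attempt a sixth login with the correct password"),
                List.of("username=testuser", "badPassword=xxxxx"),
                "Sixth attempt is rejected and the account is flagged as locked");
    }
}
```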
Test cases can be grouped together in a test suite. Test suites are usually organized by sections, such as security, performance, load testing, etc. Security suites are often overlooked, so be sure that they exist.
Once a test case has been finalized, a test script is written detailing the exact steps a tester will need to follow. While a list of steps is included with the test case at a high level, the test script is very specific about how to execute each step. A single test case can require multiple test scripts.
All components necessary to carry out software testing are collectively called a test harness. Included in a harness are all testing tools, data samples, configurations, test cases and test scripts. A test harness can also be used to simulate functionality that has not yet been developed or is not yet available in a test environment.
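As a sketch of the ‘simulate functionality that is not yet available’ point, a harness can substitute a hand-rolled stub for a dependency that does not yet exist. The TaxService interface and the flat 10% rate below are assumptions invented purely for illustration.

```java
// Hypothetical interface for a service that has not yet been built.
interface TaxService {
    double taxFor(double subtotal, String region);
}

// Stub included in the test harness until the real service is available.
class StubTaxService implements TaxService {
    @Override
    public double taxFor(double subtotal, String region) {
        // Return a fixed, predictable rate so dependent tests remain deterministic.
        return subtotal * 0.10;
    }
}
```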
Types of Software QA Testing
QA testing can be broken down into three types – functional, non-functional and other. While functional testing focuses on ensuring the software meets the requirements as stated by the business or customer, non-functional testing ensures that the software meets certain quality attributes such as scalability, performance and recoverability, among others. The other group is a catchall for important testing that isn’t a functional or non-functional test. All types of testing we will cover are shown in Figure 120.
Figure 120: Software Quality Assurance Testing Types
Functional Testing
Functional testing, sometimes referred to as reliability testing, ensures that the software meets the needs of both the business and customer, who are often the same. Software is reliable when it meets the needs of the business owner. There are four types of functional tests that we can carry out – unit, logical, integration and regression.
Unit testing is executed by the development team, and as such is covered in detail under the Development role. As a quick summary, unit testing allows a developer to test small blocks of code by simulating both input and output. Unit tests should be automated so they can be run at any time as part of a regression suite.
At the same level as unit testing we find logic testing, which validates that code produces the expected logical result. For instance, if we have a function designed to multiply two numbers, then a logical test is that if the two inputs are 3 and 6, the result should be 18. A predicate is the condition that is either affirmed or denied by a function – in our example, the predicate is that if we provide 3 and 6 as inputs, the output will be 18. If software has a high level of cyclomatic complexity – meaning that the number of linearly independent paths through the code is high – then it must undergo extensive logical testing.
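A minimal sketch of such a logical test, assuming JUnit 5 and a hypothetical Calculator class (neither is prescribed by the book):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class CalculatorLogicTest {

    // Hypothetical unit under test.
    static class Calculator {
        int multiply(int a, int b) { return a * b; }
    }

    @Test
    void multiplyingThreeBySixYieldsEighteen() {
        // Predicate: given inputs 3 and 6, the function affirms a result of 18.
        assertEquals(18, new Calculator().multiply(3, 6));
    }
}
```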
After unit and logical testing have been successfully completed, we need to carry out integration testing, in which we combine modules that were previously unit and logic tested and see if they work together correctly. In our previous shopping cart example, the Cart, ShippingRate and CurrencyConversion classes might have all passed unit testing, but when we connect Cart with ShippingRate, and ShippingRate with CurrencyConversion, and perform integration testing, we might find that Cart assumed ShippingRate would already know the current currency in use. Integration testing will show us that the author of ShippingRate assumed that Cart would pass the currency in, resulting in a failed test.
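Here is a hedged sketch of an integration test covering that handoff once the mismatch is fixed. The book does not define these classes, so the constructors, method names and conversion rate below are all invented for illustration.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class CartShippingIntegrationTest {

    // Minimal, invented versions of the collaborating classes.
    static class CurrencyConversion {
        double toCurrency(double usdAmount, String currency) {
            return currency.equals("EUR") ? usdAmount * 0.9 : usdAmount;
        }
    }

    static class ShippingRate {
        private final CurrencyConversion conversion;
        ShippingRate(CurrencyConversion conversion) { this.conversion = conversion; }
        // The fix under test: Cart must pass the currency in explicitly.
        double rateFor(double weightKg, String currency) {
            return conversion.toCurrency(weightKg * 4.00, currency);
        }
    }

    static class Cart {
        private final ShippingRate shippingRate;
        private String currency = "USD";
        Cart(ShippingRate shippingRate) { this.shippingRate = shippingRate; }
        void setCurrency(String currency) { this.currency = currency; }
        double shippingFor(double weightKg) {
            return shippingRate.rateFor(weightKg, currency);
        }
    }

    @Test
    void cartPassesItsCurrencyToShippingRate() {
        Cart cart = new Cart(new ShippingRate(new CurrencyConversion()));
        cart.setCurrency("EUR");
        // 2 kg at an invented flat 4.00 USD/kg, converted to EUR at the invented 0.9 rate.
        assertEquals(7.20, cart.shippingFor(2.0), 0.001);
    }
}
```

The point is that the test exercises the seam between Cart and ShippingRate rather than either class in isolation.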
Once a development team has unit and integration tests passing 100%, it should be celebrated, because if the tests were properly written and there are no gaps in coverage across the product, there is a very good chance that the product is now of high quality. At least until the team makes a change in the code that breaks something. That is not a question of ‘if’ but ‘when’ – development teams WILL break something that used to work. And that is why regression testing is so important. Regression testing is simply the act of rerunning all previously passing tests to determine what we just broke and is sometimes called verification testing. Some people claim a regression test is created to verify a specific bug fix or update, but that is incorrect. A regression test is not a test we write like a unit or integration test – it is the act of rerunning existing unit and integration tests to see if the fix or update caused an unwanted side-effect.
Now, when we do fix a bug or make a minor modification, an automated test should be written to explicitly validate that the problem was resolved, but that test will be either a unit test or an integration test (see the sketch after this list). This test is written and executed along with all existing tests to prove five things:
•	We addressed the root cause instead of treating a symptom.
•	The fix did not introduce any new bugs.
•	The fix did not cause an old bug to reappear.
•	The fix is compliant with the specified requirements.
•	Code and data that were not modified are not impacted.
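As a sketch, such a fix-verification test is just another unit test that pins down the corrected behavior and then runs alongside the full existing suite. The Discount class and the zero-percent bug scenario are hypothetical.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class DiscountFixTest {

    // Hypothetical code under test, after the fix.
    static class Discount {
        // Reported bug: a 0% discount used to throw instead of returning the price unchanged.
        static double apply(double price, int percent) {
            if (percent < 0 || percent > 100) throw new IllegalArgumentException("percent");
            return price * (100 - percent) / 100.0;
        }
    }

    @Test
    void zeroPercentDiscountLeavesPriceUnchanged() {
        // Pins the root-cause fix; rerunning the rest of the suite (regression
        // testing) confirms nothing else was broken by the change.
        assertEquals(50.00, Discount.apply(50.00, 0), 0.001);
    }
}
```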
Now keep in mind that while we have been discussing functional requirements and fixes, security must also be addressed in the same manner. For example, a security update could very well prevent users from accessing features they should be able to reach, such as a menu option that suddenly disappears.
Ideally, all unit and integration tests should be run on each check-in, after each deployment to any environment, and as a final check when a new release is deployed to production. However, if unit and integration test coverage has been properly implemented, you will find that the time to run the entire suite of available automated tests is much too long. For example, we can’t afford to run a 50-minute suite of regression tests for every check-in – we would never get any work done, and all of the developers would leave for new jobs elsewhere. We therefore have to be able to define test suites that either focus on specific areas of the application as a vertical slice or focus on the depth of testing as a horizontal slice. For example, we might define a test suite that focuses only on authentication and authorization processes and run those only when code or data in that area is modified. Or, we might choose to test the entire application end-to-end but limit the number of tests so that it can be run in 2 minutes or less.
Whichever direction we choose to go, we must ensure that every major release has a full-featured suite of tests executed, and we just need to allow for the extended amount of time it will take. Deciding which security tests should be part of this version-release suite takes some special consideration. At a minimum we should include boundary condition and timing tests. Determining the RASQ for each version allows us to know if we are going forward or backward in our never-ending quest to increase security.
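One common way to carve out such vertical and horizontal slices, assuming JUnit 5 with tag-based filtering (the tag names and the placeholder test are invented):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

// Tagging lets the build run a vertical slice (only "auth" tests) when code in
// that area changes, and reserve the full suite for version-release builds.
@Tag("auth")
class AuthenticationSliceTest {

    @Test
    @Tag("fast")   // also part of the short, horizontal "fast" slice
    void lockedAccountsCannotLogIn() {
        assertTrue(true); // placeholder assertion for the sketch
    }
}
```

A build tool can then select a slice by tag – for example, Maven Surefire can run only the tagged tests via its groups property, while the release pipeline omits the filter and runs everything.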
Non-Functional Testing
Non-functional testing covers areas that cannot be tied directly to a requirement but are nonetheless crucial for success. Some attributes that fall under this area include load balancing or high-availability failover, interoperability, disaster recovery capabilities and the appropriate level of replication. There are three categories of non-functional tests – performance, scalability and environment.
Performance Testing
Performance testing is not concerned with finding vulnerabilities, but instead focuses on discovering if there are any bottlenecks in the software. From a security perspective this is fairly interesting, as increasing security will almost always decrease performance. For example, implementing complete mediation where authorization is checked on each and every transaction is expensive. We can offset this by caching data, but security makes us limit the amount of time that data has until it expires. Security might dictate that we have real-time, synchronous replication, but this will slow down every transaction because it now has to wait for the replicated server to catch up before continuing. Security may require that all communication channels are encrypted, but the overhead of the encrypt/decrypt cycle eats precious CPU and memory resources.
As a result, any increase in performance made so that an application can pass performance testing must be implemented in a way that does not decrease security below acceptable levels. Performance tuning can be carried out by optimizing code, changing the configuration of the environment, adjusting operating system settings and increasing hardware capabilities. As an example of configuration tuning, we might choose to increase the number of allowed database connections that are pooled.
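As one possible expression of that connection-pool example, here is a sketch assuming the HikariCP pooling library; the URL, pool size and timeout values are illustrative, not recommendations.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolTuning {
    public static HikariDataSource tunedPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db.example.com/store"); // illustrative URL
        config.setMaximumPoolSize(50);      // raised from the default to relieve a bottleneck
        config.setConnectionTimeout(3000);  // fail fast instead of queueing indefinitely
        return new HikariDataSource(config);
    }
}
```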
We can test performance in two different dimensions – load and stress.
Load testing measures the volume of tasks or users that a system can handle before its performance drops below an acceptable threshold. Sometimes this approach is referred to as longevity testing, endurance testing, or volume testing. When carrying out load testing, load starts out at a low level and is incrementally increased, usually up to the maximum load that we expect. For example, if requirements state that a system must support 40 concurrent users with 3,000 configured accounts in the database, we will probably start out with 5 concurrent users and increment by 5 for each stage of the test. Likewise, we would likely start out with 100 configured accounts and increase by 500 at a time.
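A toy sketch of that incremental ramp, stepping simulated concurrent users up by 5 toward the stated target of 40; the placeOrder() stand-in and the timing approach are assumptions made purely for illustration (a real load test would use a dedicated tool):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy load ramp: increase concurrent users in steps of 5 up to the target of 40,
// timing a hypothetical placeOrder() call at each step.
public class LoadRamp {

    static void placeOrder() {
        try { Thread.sleep(20); } catch (InterruptedException ignored) { } // stand-in for real work
    }

    public static void main(String[] args) throws InterruptedException {
        for (int users = 5; users <= 40; users += 5) {
            ExecutorService pool = Executors.newFixedThreadPool(users);
            long start = System.nanoTime();
            for (int i = 0; i < users * 10; i++) {      // 10 requests per simulated user
                pool.submit(LoadRamp::placeOrder);
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%d users -> %d ms for %d requests%n", users, millis, users * 10);
        }
    }
}
```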
While load testing generally stops once we have reached our target maximum load, stress testing keeps going until we reach the breaking point of the system, where the application actually fails in some manner. This lets us know where the absolute maximum load lives for a given configuration. In our previous example we certainly would increase concurrent users and the number of configured accounts, but we might also choose to artificially cause low memory or reduced CPU conditions to see how the application behaves.
Stress testing has two objectives:
1) Find out if the software can recover gracefully once it fails.
2) Find out if the software fails securely.
Stress testing is also a great tool to smoke out hidden timing and synchronization issues, race conditions and deadlocks, and resource exhaustion.
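A classic example of the kind of synchronization issue that only surfaces under this sort of pressure is an unsynchronized counter; the sketch below is purely illustrative and will usually print a total below the expected 400,000 because concurrent increments lose updates.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Race condition that tends to stay hidden until the system is pushed hard:
// unsynchronized read-modify-write increments silently drop updates.
public class RacyCounter {
    static int counter = 0;   // not atomic, not volatile, not synchronized

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int task = 0; task < 4; task++) {
            pool.submit(() -> {
                for (int i = 0; i < 100_000; i++) {
                    counter++;    // read-modify-write with no locking
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("expected 400000, got " + counter);
    }
}
```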
Scalability Testing
The second type of non-functional testing is called scalability testing, which allows us to see what limitations to scale a system currently has, and is normally carried out after performance testing. Whereas load testing finds the absolute maximum load a system can handle without performance dropping to unacceptable levels, scalability measures the ability of an app to avoid that scenario by allowing more hardware, servers, or processes to be added as load increases. It’s important to understand that scalability can be hampered not only by infrastructure limitations but also by the design of a system. For example, an application that works on a single server may not be scalable to two servers because its processes were not written to share resources with another server. A frequent reason for a lack of scalability is the database, because that shared, persisted on-disk resource is not easily spread across multiple servers. We could even be unable to scale simply because a table was not designed properly. As an example, if the primary ID field in a table will only handle values up to 65K, then once we hit that limit the application will stop functioning properly and will instead start throwing exceptions whenever it tries to insert a new record into the limited table.
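A small illustration of that 65K ceiling, using Java’s char as a stand-in for a 16-bit unsigned key column (the numbers are only meant to show the wrap-around, not to model any particular database):

```java
// Illustration only: a key backed by a 16-bit unsigned integer wraps (or the
// insert fails) once roughly 65K rows exist, capping how far the application
// can scale no matter how many servers we add.
public class IdOverflow {
    public static void main(String[] args) {
        char nextId = 65_534;          // char is Java's unsigned 16-bit type
        for (int i = 0; i < 3; i++) {
            System.out.println("next id = " + (int) nextId);
            nextId++;                  // 65,535 and then wraps back to 0
        }
    }
}
```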
Environment Testing
The last type of non-functional testing is called environment testing, which looks at the infrastructure to which software is deployed to ensure it is reliable and properly secured. With the advent of modern web apps, business logic is increasingly migrating into the browser as a thick client, which presents additional problems because the client may be aggregating data from multiple backends. Environment testing comprises three sub-areas – interoperability, disaster recovery, and simulation.
When software is connected to multiple environments, interoperability testing ensures that the various connections between those environments are resilient and protected. As an example, an online store connects to a third-party merchant, PayPal, FedEx and UPS. While the external environments are beyond the control of the store’s owner, the connections between the various environments must be secured and tested. When SSO is used to share credentials, as is the case with many intranets, testing must look at how the credential token is handed off and managed within each environment. There are four tests that, if executed in an SSO infrastructure, will at least guarantee a base level of confidence. They are: