If we choose to encrypt the data before handing it to the database, we can separate the key management and storage capability from the encryption processes, making it more difficult for an attacker to access both. When keys are stored in a hardware security module, or HSM, the attacker will need physical access to the hardware, increasing the work factor even more. This also has the side-effect of moving encryption overhead out of the database, which often limits scalability. On the downside, we will need to modify the application to encrypt the data, and traffic between the application and database will increase due to the larger size of encrypted values.
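The sketch below illustrates this pattern under a few assumptions: the key is loaded from a hypothetical CARD_ENC_KEY environment variable standing in for an HSM or key-management service, SQLite stands in for the real database, and the table and column names are illustrative. The point is simply that the database only ever sees ciphertext.

```python
# Minimal sketch: application-side encryption with the key held outside the database.
import os
import sqlite3
from cryptography.fernet import Fernet  # pip install cryptography

# In production the key would come from an HSM or KMS; the env var is a stand-in.
key = os.environ.get("CARD_ENC_KEY") or Fernet.generate_key()
cipher = Fernet(key)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, card_number BLOB)")

# Encrypt before the value ever reaches the database driver.
ciphertext = cipher.encrypt(b"4111-1111-1111-1111")
conn.execute("INSERT INTO customers (card_number) VALUES (?)", (ciphertext,))

# Decryption also happens in the application, never inside the database.
stored = conn.execute("SELECT card_number FROM customers").fetchone()[0]
print(cipher.decrypt(stored))
```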
Normalization
Duplication of data within a relational database should raise all sorts of red flags unless it was done purposefully and logically. For example, if the database for an application that sells merchandise saves the retail price in multiple tables, it is extremely easy for the values to get out of sync when one location changes but the others remain unaltered. Not only does this introduce instability, it wastes storage space as well. Additionally, the application code becomes less maintainable, as it must now know about multiple places to pull price information from. Storing data in multiple places also increases the chance that we will miss something and not apply proper protection.
This situation can be resolved by using normalization, a technique employed to reduce redundancy and inconsistency. When we organize data in this way, we follow certain rules, with each rule referred to as a normal form. There are three primary normalization rules that we can follow. If we follow the first rule, the database is said to be in first normal form, abbreviated as 1NF. If both the first and second rules are followed simultaneously, it is said to be in second normal form, or 2NF. 3NF follows the same pattern, and 4NF and 5NF exist as well, but anything past 3NF is seldom implemented and is beyond the scope of this book.
Take a look at Figure 54, which represents a database table without any normalization rules applied.
CustomersAndSalesRepsAndSales

CustomerId | FirstName | LastName  | SalesRepId | Product1   | Product2
1          | Grant     | Hereford  | 105        | Widget_003 | Widget_129
2          | Sal       | Miningent | 119        | Widget_129 | Widget_48
Figure 54: Unnormalized Form
Here, you can see that each record in the CustomersAndSalesRepsAndSales table contains a single unique customer, with each product purchased represented by a single column. As a customer purchases a new product, we must put the product ID into a new column. Since we only have two product columns, this means that a customer will be allowed to purchase at most two products before we have to modify the customer table. Obviously, this is not very scalable, and we will be missing tons of opportunities to sell more of our products to repeat customers.
The rule for First Normal Form, or 1NF, states that there are no repeating fields or groups of fields within a table, and is sometimes called the ‘No Repeating Groups’ rule. To have a 1NF database in this example, we will need to create a new record for each purchase instead of creating a new column for each purchase. New columns require the table structure to be modified, which is a HUGE deal, while adding new rows is just part of everyday business and easy. Our 1NF-compliant table would look like Figure 55.
CustomersAndSalesRepsAndSales

CustomerId | FirstName | LastName  | SalesRepId | ProductId
1          | Grant     | Hereford  | 105        | Widget_003
1          | Grant     | Hereford  | 105        | Widget_129
2          | Sal       | Miningent | 119        | Widget_129
2          | Sal       | Miningent | 119        | Widget_48
Figure 55: First Normal Form (1NF)
Here, we only have a single column representing all products purchased. If a customer purchases six products, then we find six records for that customer. While our profit margin has just increased due to repeat sales, our customer data is now duplicated among multiple rows.
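A minimal sketch of the 1NF shape from Figure 55 follows, using SQLite purely for illustration; the table and column names come straight from the figure. Note that a repeat purchase is just another row, with no schema change required.

```python
# Sketch of the 1NF table: one row per purchase instead of one column per product.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CustomersAndSalesRepsAndSales (
        CustomerId INTEGER,
        FirstName  TEXT,
        LastName   TEXT,
        SalesRepId INTEGER,
        ProductId  TEXT
    )
""")

# Each purchase simply adds a row -- no new columns needed.
conn.executemany(
    "INSERT INTO CustomersAndSalesRepsAndSales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Grant", "Hereford", 105, "Widget_003"),
        (1, "Grant", "Hereford", 105, "Widget_129"),
        (2, "Sal", "Miningent", 119, "Widget_129"),
        (2, "Sal", "Miningent", 119, "Widget_48"),
    ],
)
```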
Second Normal Form, or 2NF, mandates that repeated data must be removed. In this case, we need to address the fact that the name of a customer appears in multiple rows. If we needed to change the customer’s name, we would have to locate all rows in this table and update each one. This is unnecessarily complex, so we need to implement the 2NF rule called ‘Eliminate Redundant Data’. The way to do this is to create a new table representing the relationship between the duplicated data and whatever is unique about each row in the original table. In our case, we need to extract all information that is the same for every purchase made by the same customer. This means the customer name and sales rep ID are moved to a new table. We are assuming that each customer has only one sales rep. Our new tables will look like Figure 56.
Sales

CustomerId | ProductId
1          | Widget_003
1          | Widget_129
2          | Widget_129
2          | Widget_48

CustomersAndSalesReps

CustomerId | FirstName | LastName  | SalesRepId
1          | Grant     | Hereford  | 105
2          | Sal       | Miningent | 119
Figure 56: Second Normal Form (2NF)
The original table is now named Sales, and the new table is named CustomersAndSalesReps. Note that we have a column that both tables use called ‘CustomerId’. In the CustomersAndSalesReps table, this column is referred to as the primary key, as it uniquely identifies each and every row. A value of a primary key column is never duplicated. In the Sales table, the same column is found but, in this case, it is not a primary key and is instead called a foreign key, meaning that it contains a value that is a primary key in some other table. A foreign key value will usually be duplicated.
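A minimal sketch of the 2NF split from Figure 56 follows, again using SQLite for illustration, with CustomerId declared as a primary key in one table and a foreign key in the other.

```python
# Sketch of the 2NF tables with primary and foreign key declarations.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("""
    CREATE TABLE CustomersAndSalesReps (
        CustomerId INTEGER PRIMARY KEY,   -- never duplicated
        FirstName  TEXT,
        LastName   TEXT,
        SalesRepId INTEGER
    )
""")
conn.execute("""
    CREATE TABLE Sales (
        CustomerId INTEGER REFERENCES CustomersAndSalesReps(CustomerId),
        ProductId  TEXT                   -- CustomerId repeats here, once per purchase
    )
""")
```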
Third Normal Form, or 3NF, requires that any value in a table that is not dependent on the primary key should be moved to its own table. In our example, the CustomersAndSalesReps table lists the sales rep ID for each customer, but the value in this column will be duplicated if a sales rep can handle more than one customer. This is true in our pretend company, so the ‘Eliminate Non-Key-Dependent Duplicate Data’ rule, or 3NF, requires us to create yet another table representing sales reps. This results in a SalesReps table that looks like Figure 57.
SalesReps

SalesRepId | FirstName | LastName
105        | Sally     | Janeson
119        | Herb      | Blasinger
Figure 57: SalesReps table
Once we have extracted the sales rep information from the CustomersAndSalesReps table into the SalesReps table, the remaining table, now simply named Customers, holds only customer information, as represented in Figure 58.
Customers

CustomerId | FirstName | LastName
1          | Grant     | Hereford
2          | Sal       | Miningent
Figure 58: Third Normal Form (3NF)
Notice that when we carried out this operation, we managed to lose the connection between a sales rep and a customer. We will therefore need to create yet another table representing the relationship between those two as shown in Figure 59.
CustomerSalesReps

SalesRepId | CustomerId
105        | 1
119        | 2
Figure 59: Union Table
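Pulling Figures 57 through 59 together, a minimal sketch of the full 3NF layout might look like the following; SQLite and the file name sales.db are illustrative stand-ins.

```python
# Sketch of the full 3NF schema: customers, sales reps, the linking table, and sales.
import sqlite3

conn = sqlite3.connect("sales.db")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS Customers (
        CustomerId INTEGER PRIMARY KEY,
        FirstName  TEXT,
        LastName   TEXT
    );
    CREATE TABLE IF NOT EXISTS SalesReps (
        SalesRepId INTEGER PRIMARY KEY,
        FirstName  TEXT,
        LastName   TEXT
    );
    CREATE TABLE IF NOT EXISTS CustomerSalesReps (
        SalesRepId INTEGER REFERENCES SalesReps(SalesRepId),
        CustomerId INTEGER REFERENCES Customers(CustomerId)
    );
    CREATE TABLE IF NOT EXISTS Sales (
        CustomerId INTEGER REFERENCES Customers(CustomerId),
        ProductId  TEXT
    );
""")
```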
This highlights that while normalization can bring a much greater degree of organization to a data model, it can also increase the complexity of the application consuming the data model. However, in the application code we have exchanged unmaintainable duplication for a slight increase in complexity that remains highly maintainable, which is overall a definite win.
From a security standpoint, normalization allows us to enforce integrity. Integrity in a database consists of both accuracy and consistency. Normalization also allows us to implement secured access at a more granular level. For example, we can give a sales rep write access to the Customers table but only read access to the SalesReps table and the table linking the two together. This means a sales rep can locate and update customer information belonging to herself but is unable to assign more customers to herself.
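A hedged sketch of that granular access follows. SQLite has no GRANT statement, so the statements below use PostgreSQL-style syntax that a DBA would run against a server database; the role name is illustrative.

```python
# Sketch of table-level grants for a sales rep role (PostgreSQL-style syntax, illustrative).
ROLE_SETUP = """
CREATE ROLE sales_rep_role;
-- Reps may read and update customer records...
GRANT SELECT, UPDATE ON Customers TO sales_rep_role;
-- ...but get read-only access to the rep and assignment tables,
-- so a rep cannot assign additional customers to herself.
GRANT SELECT ON SalesReps TO sales_rep_role;
GRANT SELECT ON CustomerSalesReps TO sales_rep_role;
"""
print(ROLE_SETUP)
```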
Of course, any good thing comes with some drawbacks. In the case of normalization, we can expect to see some performance degradation, as we now have to join multiple tables together at run-time to produce the same set of data as we did before normalization was implemented. To address this problem, we have two possibilities. First, we can use a view, which is an abstracted aggregation of the underlying data. Beyond having a lot of uses for data hiding, a view can also increase performance. Essentially, it is an in-memory representation of the data that is updated whenever the underlying tables change. The benefit of a view is that when it represents multiple tables being joined together, updates to the view happen when the underlying data changes, not each time a view is read. This results in a much faster retrieval of the ‘viewed’ data. However, views can seldom be indexed, and in some databases a view is executed each time it is read, resulting in no performance gains.
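As a sketch of the first possibility, the view below reassembles the pre-normalization shape by joining the 3NF tables sketched earlier (the view name is illustrative). Note that SQLite re-executes a view on every read, so the caching behavior described above applies to server databases that support indexed or materialized views.

```python
# Sketch: a view that hides the run-time join across the normalized tables.
import sqlite3

conn = sqlite3.connect("sales.db")  # assumed to contain the 3NF tables sketched earlier
conn.execute("""
    CREATE VIEW IF NOT EXISTS CustomerSalesView AS
    SELECT c.CustomerId, c.FirstName, c.LastName, csr.SalesRepId, s.ProductId
    FROM Customers c
    JOIN CustomerSalesReps csr ON csr.CustomerId = c.CustomerId
    JOIN Sales s               ON s.CustomerId  = c.CustomerId
""")
rows = conn.execute("SELECT * FROM CustomerSalesView").fetchall()
```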
The second performance optimization is to selectively denormalize the database. This is quite different from never having normalized the database at all. In this approach, we normalize the entire database by default, and denormalize only when a performance issue is detected that cannot be addressed using other techniques such as indexing or views. There is nothing wrong with denormalized data as long as we are purposeful when it is applied, and it is an exception rather than the rule.
Triggers
Most enterprise-level databases support embedded code in the form of a stored procedure, written in a proprietary language that executes within the database. This can often result in much faster execution of any logic requiring access to the raw data in real time. Stored procedures are good candidates for code executing in a tight loop that requires access to large amounts of data. While we could implement this logic in a middle tier, or in a consuming application, the amount of data that would have to be transferred between the consumer and the database is so large as to be problematic. The downside to stored procedures is that they seldom have any type of real integrated development environment, the proprietary language is difficult to learn, and debugging is usually very painful for the developer. Scalability concerns also detract from the usefulness of ‘stored procs’.
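The data-transfer trade-off can be sketched without any proprietary language. SQLite has no stored procedures, so a server-side aggregate stands in here for logic embedded in the database: only one number crosses the wire instead of every row.

```python
# Sketch of why pushing logic to the database reduces data transfer.
import sqlite3

conn = sqlite3.connect("sales.db")  # assumed to contain the Sales table sketched earlier

# Logic in the consuming application: every row is transferred, then counted here.
total_client_side = sum(1 for _ in conn.execute("SELECT ProductId FROM Sales"))

# Logic pushed into the database: only the result is transferred.
total_db_side = conn.execute("SELECT COUNT(*) FROM Sales").fetchone()[0]
```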
A database trigger is a special type of stored proc that is automatically executed when certain conditions occur within the database. While normal stored procs are executed at the request of the consuming application, triggers are automatically executed when a specific event occurs within the database. The possible events are:
Data Manipulation Language statements, or DML statements, that modify data.
Data Definition Language statements, or DDL statements, that can carry out auditing and regulatory database operations.
Error events.
System events such as startup, shutdown, and restart.
User events such as logon and logoff.
Triggers can themselves execute normal stored procs, and they can be used to improve security by carrying out the following actions (an auditing example is sketched after this list):
Enforce complex business rules such as preventing updates during certain times of the day or carrying out calculations requiring data from multiple tables when an external event happens.
Prevent invalid transactions.
Ensure referential integrity is maintained.
Implement automated auditing and event logging.
Enforce complex security privileges and rights.
Synchronize data across replicated tables and databases.
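As a minimal sketch of the auditing use above, the trigger below writes an audit row whenever a customer record changes. It assumes the Customers table sketched earlier; the audit table and trigger names are illustrative.

```python
# Sketch: a trigger that records every update to the Customers table.
import sqlite3

conn = sqlite3.connect("sales.db")  # assumed to contain the Customers table
conn.executescript("""
    CREATE TABLE IF NOT EXISTS CustomerAudit (
        CustomerId  INTEGER,
        ChangedAt   TEXT DEFAULT CURRENT_TIMESTAMP,
        OldLastName TEXT,
        NewLastName TEXT
    );
    CREATE TRIGGER IF NOT EXISTS trg_customer_update
    AFTER UPDATE ON Customers
    BEGIN
        INSERT INTO CustomerAudit (CustomerId, OldLastName, NewLastName)
        VALUES (OLD.CustomerId, OLD.LastName, NEW.LastName);
    END;
""")
```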
While it sounds like triggers can solve 90% of our problems and possibly world hunger, there are some very problematic issues with their use. Because they are event-based, they cannot commit or roll back transactions and can cause strange behavior that is not easily identifiable. Triggers can call other triggers, resulting in cascading triggers, which can cause strange side-effects as well. In general, maintainability of an application will always be seriously undermined when triggers are overused. A much better approach is to contain all business logic in a business layer tier sitting outside of the database. In general, triggers and stored procedures should be used only when they are the best mechanism for addressing a specific issue. The overuse of embedded code in databases is a leading cause of scalability and maintainability issues. As an architect, my normal advice on the subject is the following:
Always put business logic in a mid-tier layer, use stored procs only when they provide a significant security or performance advantage, and strive to keep their use to an absolute minimum.
Views
We’ve already touched on views, but we need to add a little more color to these useful but potentially hazardous tools. A view can increase both performance and security, but another significant advantage of views is their ability to hide underlying complexity. A view is similar to an encapsulated object in classic object-oriented programming: we do not have to worry about how it was constructed or its internal complexity, and changes to the underlying logic are invisible if implemented correctly. A great use of views is to move the complex logic of how to assemble the data out of application code and back into the database itself. Because views are defined once in the database and accessed at will in application code by simply referencing the name, we can move knowledge of the underlying data complexity from the application back into the database. As a rule of thumb, we should address complex data structures inside of the database and take care of complex data processing outside of the database in a business logic tier.
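As a brief sketch of that complexity hiding, application code only needs the view’s name, not knowledge of the tables or joins behind it (this assumes the CustomerSalesView sketched earlier).

```python
# Sketch: the application reads the view by name and never sees the underlying joins.
import sqlite3

conn = sqlite3.connect("sales.db")
for customer_id, first, last, rep_id, product in conn.execute(
        "SELECT * FROM CustomerSalesView WHERE SalesRepId = ?", (105,)):
    print(first, last, product)
```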
Privilege Management
Most enterprise databases support the concept of users and roles. Instead of giving privileges directly to a user account, the user should be assigned a role, which is then assigned privileges. Because a role lives separately from a user account, it is possible to delete a user without impacting any role that user created or belongs to. Applications should be architected to use the concept of ‘datareader’ and ‘datawriter’ roles, thereby enforcing the principles of least privilege and separation of duties. It is not scalable or safe to associate database users with the end-user accounts used to log in to an application. Instead, the application should use impersonation to connect to the database using a standard and generic database account. In other words, the account the application uses to connect to the database should not be tied directly to the application user identity. Application code should use a specific privileged account to perform read operations and use a ‘write-enabled’ account only in certain conditions. This greatly reduces the ability of an attacker to carry out both read and write operations simultaneously. However, since the database will be executing under impersonated accounts, it will not be able to carry out effective auditing since the end-user’s identity will be unknown. That is why the application will need to implement its own auditing capability instead.
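A minimal sketch of the reader/writer split and application-level auditing follows. It assumes a server database reachable through two generic accounts mapped to datareader and datawriter roles; SQLite stands in for the real driver, and the function and table names are illustrative.

```python
# Sketch: generic reader/writer connections plus application-side audit logging.
import logging
import sqlite3  # stands in for a real server-database driver

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

def reader_connection():
    # In a real system this would connect as the generic read-only account.
    return sqlite3.connect("sales.db")

def writer_connection():
    # Used only by the few code paths that genuinely need to write.
    return sqlite3.connect("sales.db")

def update_customer_name(end_user, customer_id, new_last_name):
    # The database sees only the generic account; the end-user identity is known
    # to the application alone, so the application records the audit trail itself.
    with writer_connection() as conn:
        conn.execute("UPDATE Customers SET LastName = ? WHERE CustomerId = ?",
                     (new_last_name, customer_id))
    audit_log.info("user=%s updated customer %s", end_user, customer_id)
```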
Chapter 37: The Development Role
Also called a programmer or coder, a developer uses their technical know-how and skillsets to solve problems. While we usually jump right to thinking of how well a developer knows a specific language or development tool, the reality is that the most important skill a good developer can possess is problem solving. This core skill is then used to program and automate manual processes.
Because this book is about creating and deploying secure software, the team that writes software is obviously going to be front and center. The Development role encompasses team leads, tech leads, and individual developers. While an Architect by definition used to be a developer, that role is addressed separately as an architect will have many concerns that are much broader in scope than the average developer. This section will address some topics already covered in the Core Concepts section, but at a much deeper level. And of course, we will be discussing the down and dirty ways to implement safe and secure coding at the algorithm level. That is what the average developer is looking for anyway! Keep in mind that this book is written to be language-agnostic, so don’t expect to find code snippets ready for cut-and-paste action.
Computer Architecture
First, let’s dive into how computers are architected. Developers who only have high level programming experience might find some of this subject matter challenging, but knowledge of the underpinnings of hardware will make anyone a better programmer.
A computer is primarily composed of three components: the computer processor, system memory, and input/output, or I/O, devices, as shown in Figure 60.