Data is Forever, Programs are Ephemeral

Avoiding the costs of bad databases

There are two main kinds of problem database

The most problematic databases are inevitably the work of somebody with little or no programming or computer science background who has put something together in a GUI, typically Access, and added some simple VB scripts or Excel formulas to do the processing they need. Such projects frequently result from frustration when IT support has not been provided to build a necessary system and a user has resorted to making their own solution. These ad-hoc databases are rarely intended to serve for the long term. However, because they are frequently created by people with excellent domain knowledge, they are often eagerly adopted by other users and can turn out to be too useful to throw away. Because of their utility they can become troublesome legacy applications on which a business becomes dependent more easily than anticipated.

'Professionally' designed databases may also exhibit problems, for example subtle integrity issues, typically the result of insufficient normalization or of absent - but essential - constraints. They pose an entirely different class of problem to amateur ad-hoc databases, in that the problems are more time consuming to diagnose and there are more likely to be applications that rely on the existing database schema.

Such databases are often implemented on an enterprise-class DBMS and can be transformed in place to solve the problems. In some rare cases the old structure can be replicated through views so that legacy applications can interoperate with the new and improved database design, though this is typically not possible when the application updates the database state.
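
As a rough sketch of that approach (with hypothetical table and column names, not a schema from any real system), the new normalized structure might be exposed to legacy read-only code through a view that mimics the old single-table layout:

    -- New, normalized structure (hypothetical example)
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(100) NOT NULL,
        phone       VARCHAR(30)
    );

    CREATE TABLE sales_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        ordered_on  DATE NOT NULL
    );

    -- View presenting the old single-table layout to legacy read-only code
    CREATE VIEW order_legacy AS
    SELECT o.order_id,
           o.ordered_on,
           c.name  AS customer_name,
           c.phone AS customer_phone
    FROM sales_order o
    JOIN customer c ON c.customer_id = o.customer_id;

As noted above, this trick tends to break down once an application needs to write through the old structure rather than merely read it.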

The cause of many database problems is premature denormalization: an activity performed by incompetent 'professionals' who learned their database skills from other 'professionals' with exactly the same broken skills. These alleged experts believe that they are improving database performance by aggressively denormalizing it - which is just about the same as not normalizing it in the first place. It has been shown time and time again that these techniques don't really improve performance. They may save a few joins, but most modern DBMS are incredibly fast at joins. Instead they encumber the database with consistency issues that require extra checks and workarounds to make it function properly. This is also a source of terrible SQL code, loaded with special clauses to handle NULL cases. The reality is that adding appropriate constraints (not absurd ones that slow down the database) along with strong normalization to at least BCNF ultimately leads to better performance, reliability and developer productivity. Most catastrophically slow databases have the wrong index configuration and benefit not one bit from munging tables together to save a join here or there.
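
To make the contrast concrete, here is a minimal sketch (hypothetical table and column names) of the denormalized habit versus a normalized design carrying the kind of constraints described above:

    -- Denormalized habit: product details copied onto every order line
    -- CREATE TABLE order_line (
    --     order_id      INTEGER,
    --     product_name  VARCHAR(100),   -- duplicated, can drift out of sync
    --     product_price DECIMAL(10,2),  -- duplicated, may be NULL or stale
    --     quantity      INTEGER
    -- );

    -- Normalized alternative: each fact stored once, enforced by the DBMS
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       VARCHAR(100) NOT NULL UNIQUE,
        price      DECIMAL(10,2) NOT NULL CHECK (price >= 0)
    );

    CREATE TABLE order_line (
        order_id   INTEGER NOT NULL,
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        quantity   INTEGER NOT NULL CHECK (quantity > 0),
        PRIMARY KEY (order_id, product_id)
    );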

A third variety of problem database has always been a minority issue but is likely to become more common due to the ongoing fad for NoSQL databases. Hierarchical, network or multi-value databases - as well as modern NoSQL databases such as MongoDB - may function perfectly well with the application they were originally designed for, but are not suitable as a foundation for other applications. In other cases the problem with the NoSQL database may be the difficulty in sourcing developers who can work with it, or discontinued support for the database itself. MongoDB might look like a great choice right now, but compared to the SQL databases, experienced developers are hard to find, and the theory underlying the DBMS is sketchy at best. Most SQL-based DBMS can store a file as a datatype, so the performance advantages of MongoDB may not be relevant to your application. For some applications (probably most business applications) SQLite may be a better performer, if performance is your problem. The great thing about classic SQL DBMS like Oracle, DB2, MySQL and Postgres is that they perform reasonably well across a variety of use and access patterns and with a broad variety of data types. They are general purpose solutions that are likely to remain a safe bet to house your data for the long term.

Amateur Databases

Don't let short term savings turn into long term disasters

Friendly products like Access can encourage people to dive in and build a database without much in the way of planning. Office databases created by beginners can often be incredibly useful at solving a specific problem but are typically unsuitable as a foundation for ongoing development. There is often a need to migrate or transform that old but useful data to a more robust form.

In other cases there can be serious problems down the line. Fixing a badly designed database is usually going to cost more than paying to have it built appropriately from the beginning. Don't underestimate the cost of lost system availability while a broken database is rescued. If the owners of those databases had developed them properly at the outset they might have saved a lot of money in the long term. Instead they can end up investing so much of their own time, and the time of others, in the broken database that they daren't contemplate the cost of writing it off. There are highly successful companies that make their money from writing incredibly complex queries to make badly designed databases do something useful.

You can save money in the short term by putting a database together by trial and error, but in a couple of years' time, when your business is reliant on the data in it, you might find that hasty design decisions made when you were a beginner mean you have to enter a lot of your data more than once, and because of entry errors the copies sometimes don't match up. Problems, problems! You might end up paying for the design of a new database that works properly, paying to have all your data transferred, and suffering downtime as well.

A badly designed database and a little bad management can turn a bad business process into a liquidation. A broken database can lead to a broken billing system. If customers lose confidence in their bills, they don't pay. If the scale of the problem is large, disaster can ensue. Most small businesses don't have enough different customers for this to be a real threat, but some do.

If users lack confidence in a database application it will not deliver benefits to the business and will be avoided. This can certainly be an issue with an ad-hoc database or application crafted up by non-experts, but it can also be a problem with more expensive solutions. The cost and time required to train staff to use new products is typically underestimated or ignored. It is not uncommon for no time at all to be allowed to train staff on new systems. It should not come as a surprise when staff respond badly to the introduction of systems under such circumstances.

It could be cheaper than you think to get your databases designed and built by professionals, and it could save you more than you expect.

Common Database Blunders

The beginner's primary database blunder is 'bad keying'. For example, it might have seemed like a good idea to key on client name when the database was built, but when you have twelve Simon Smiths on your books it doesn't seem like quite such a good solution. To fix the problem you end up dragging huge chunks of customer data into unrelated queries, and they become complicated to write and slow to run. This is why the big companies assign cryptic IDs to customers, employees, products, contracts, support queries, etc.
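
A minimal sketch of the fix, using hypothetical table names: a short, meaningless identifier keeps every client distinct and gives other tables something small to join on:

    -- Keying on name breaks as soon as two clients share one
    -- CREATE TABLE client (name VARCHAR(100) PRIMARY KEY, ...);

    -- A short, meaningless identifier keeps every client distinct
    CREATE TABLE client (
        client_id INTEGER PRIMARY KEY,       -- the 'cryptic id'
        name      VARCHAR(100) NOT NULL      -- twelve Simon Smiths are fine
    );

    CREATE TABLE support_query (
        query_id  INTEGER PRIMARY KEY,
        client_id INTEGER NOT NULL REFERENCES client(client_id),
        details   VARCHAR(500)
    );

Queries then join on the small identifier instead of dragging name and address columns everywhere.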

Bad keys can cause problems when you decide what to put in each table. If you accidentally choose a non-unique key for a table, then that table will end up being useless. This might seem obvious, but sometimes it doesn't show up until the table gets a lot of real data in it. It could go unnoticed for even longer if it's a table that you aren't really using. When you add new functionality, you can suddenly discover you have tables with duplicate keys that have grown unchecked for months - the result is chaos. For this reason you need to make sure that every table has a unique key.
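
Declaring the key to the DBMS is what turns 'unique by intention' into 'unique in fact'. A small sketch (hypothetical table) of a duplicate being rejected at insert time rather than discovered months later:

    -- The DBMS rejects the duplicate immediately instead of letting it
    -- accumulate unnoticed
    CREATE TABLE invoice (
        invoice_no INTEGER PRIMARY KEY,
        issued_on  DATE NOT NULL
    );

    INSERT INTO invoice VALUES (1001, '2024-01-05');
    INSERT INTO invoice VALUES (1001, '2024-02-07');  -- fails: duplicate key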

A related problem is when data gets into a table where it shouldn't be. This is the second big mistake seen in amateur databases and the number one error in databases architected by poorly trained "professionals". When rarely used data that should have been in its own table with its own keys creeps into a heavily used table, performance suffers and you might even end up entering duplicate data when filling in records.
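
One common remedy, sketched here with hypothetical tables, is to move the rarely used data into a table of its own keyed on the same identifier, leaving the heavily used table lean:

    -- Heavily used table stays small
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );

    -- Rarely used detail lives in its own table, keyed the same way
    CREATE TABLE employee_medical_note (
        employee_id INTEGER PRIMARY KEY REFERENCES employee(employee_id),
        note        VARCHAR(1000) NOT NULL
    );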

Put simply, the correct distribution of data amongst tables is called normalization. The importance of normalization is a mathematical fact, yet "experts" who never studied the underlying theory of relational databases continue to play fast and loose with it. For example, they may claim that patchy normalization improves performance by reducing the need for JOIN operations. Sadly, this is almost always false. While in some cases a few joins may be avoided, the price is the total loss of database integrity, which can have far worse consequences than a few extra milliseconds on a query. Usually the claim is made that the problem is solved in application code - though if the database were constructed properly the DBMS could ensure integrity itself. Solutions implemented in applications are obscure, frequently duplicated and often vary from application to application; worse, they may simply not work as intended at all while appearing to as long as the database is lightly loaded. Application-based integrity checks often perform more joins than were avoided in the first place, so the net outcome is a decrease in performance (at least on writes).
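
As an illustration of letting the DBMS do the work (hypothetical tables, not a prescribed design), a single declared constraint replaces a check that every application would otherwise have to repeat - and, unlike an application-side lookup, it cannot be raced by a concurrent writer:

    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,
        holder     VARCHAR(100) NOT NULL
    );

    CREATE TABLE payment (
        payment_id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES account(account_id),
        amount     DECIMAL(10,2) NOT NULL CHECK (amount > 0)
    );

    -- An application-side "does the account exist?" SELECT followed by an
    -- INSERT can race with a concurrent DELETE; the declared foreign key
    -- holds however lightly or heavily the database is loaded.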

The above issue leads us to another problem: duplicate data. If you have good keys and have split your data into different tables wisely, you shouldn't have any duplicate data at all. The only duplicate data in a database should be 'by design' allowances, such as allowing a customer to have multiple addresses. Such things aren't really duplicate data at all; they're part of good planning.

Think carefully about data items that you might need more than one of. It might be tempting to assume that a customer can only ever have one address, but that assumption could cause you problems later.
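
A sketch of the 'by design' case mentioned above, with hypothetical names: one customer row, any number of address rows, and nothing that has to be typed in twice:

    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );

    CREATE TABLE customer_address (
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        label       VARCHAR(20) NOT NULL,   -- 'home', 'work', 'delivery', ...
        address     VARCHAR(200) NOT NULL,
        PRIMARY KEY (customer_id, label)
    );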

Less Common Database Blunders

A weakness amongst some commercially trained database architects - including many with considerable experience - is a muddled understanding of database theory. A lot of architects learned the theory at university but not the value of it. They went on to learn database design 'on the job' and picked up a belief in a variety of performance myths that are passed around the internet.

Such architects may apply good rules inappropriately or employ terrible techniques they have been taught to believe are advantageous (but which are actually highly damaging), such as introducing denormalization into a database "to improve efficiency". The latter mistake can be disastrous for the integrity and long term performance of a database. There are surprisingly few situations where denormalization makes any sense. At the very least it prevents the efficient use of DBMS constraints to ensure data integrity, forcing check code to be implemented in every application that must update the database - so performance is ultimately reduced, not improved - and at worst it fatally compromises integrity in a way that no check code can mitigate, let alone genuinely prevent.

Possibly the denormalization myth comes from an over-estimation of the cost of joins. In practice most SQL DBMS have highly optimized join capabilities. Though it can depend on how you perform the join (for example, the older comma syntax can produce different execution times from an explicit JOIN), the quality of query optimization on modern SQL systems is sufficient to make joins a non-issue in most cases. Where they do pose a problem, judicious use of other techniques can often solve the bottleneck. The slowdown in most databases is due to write transactions; reads are very rarely the bottleneck.
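
For example (hypothetical customer and sales_order tables), the comma syntax and the explicit JOIN below ask for the same rows; on most modern optimizers they produce the same plan, but the explicit form keeps the join condition where it belongs and makes an accidental cross join harder to write:

    -- Comma syntax: join condition mixed into the WHERE clause
    SELECT c.name, o.ordered_on
    FROM customer c, sales_order o
    WHERE o.customer_id = c.customer_id
      AND o.ordered_on >= '2024-01-01';

    -- Explicit JOIN: same rows, condition attached to the join itself
    SELECT c.name, o.ordered_on
    FROM customer c
    JOIN sales_order o ON o.customer_id = c.customer_id
    WHERE o.ordered_on >= '2024-01-01';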

Another blunder sometimes made by 'experts' who failed to grasp database theory (or practice) is the introduction of completely unnecessary synthetic keys "by rote": the architect does not really understand what a database key is and mechanically creates a synthetic (typically integer) key for every table, even when a perfectly good natural key exists. Such architects are so afraid of creating a table with a non-unique key that they add redundant keys to tables that already have a unique key naturally present in the data. While on the surface this appears harmless, it is not; it can create all kinds of problems with deletion and duplicate data within applications.
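
A small sketch of the difference, using currency codes purely as an illustration: the natural key is already unique, and the 'by rote' synthetic key adds nothing while quietly removing the uniqueness guarantee unless someone remembers to add it back:

    -- The natural key is already unique
    CREATE TABLE currency (
        currency_code CHAR(3) PRIMARY KEY,    -- 'USD', 'EUR', ...
        name          VARCHAR(50) NOT NULL
    );

    -- The 'by rote' version: unless a separate UNIQUE constraint is
    -- remembered, two rows for 'USD' can quietly coexist
    -- CREATE TABLE currency (
    --     currency_id   INTEGER PRIMARY KEY,
    --     currency_code CHAR(3),             -- no longer forced to be unique
    --     name          VARCHAR(50)
    -- );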

Some architects who have prospered working on large systems have scarcely any regard for how a database will be accessed. They may do things like represent a fundamentally identical kind of relation, such as parent to child, in several different ways. Because they are viewing the data from a business perspective rather than an abstract one, they might see identical kinds of relation as different when they are not. In some cases there may be a tendency towards premature optimization of the database model. However it arises, this inconsistent way of creating relationships is the bane of developers, resulting in a developer hand-coding a tailored query where (for example) a general purpose query engine might have been used to cope with all parent-child relations. Even using iBatis/MyBatis, digging out of such a hole is bothersome and time consuming. It certainly makes using Hibernate or similar object-first solutions impossible (not that object-first is usually the best way).
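
The consistent alternative is boring, and that is the point. A sketch (hypothetical tables) of representing every parent-to-child relation the same way, with the child row carrying a foreign key to its parent, so that generic code can walk any of them:

    CREATE TABLE department (
        department_id INTEGER PRIMARY KEY,
        name          VARCHAR(100) NOT NULL
    );

    CREATE TABLE team (
        team_id       INTEGER PRIMARY KEY,
        department_id INTEGER NOT NULL REFERENCES department(department_id),
        name          VARCHAR(100) NOT NULL
    );

    -- Every such relationship can then be read the same way
    SELECT t.name
    FROM team t
    WHERE t.department_id = 42;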

Another somewhat nasty technique is the introduction of unnecessary explicit type fields. Rather than store two unrelated kinds of data in different tables, the architect perceives an unimportant relationship between the different types of data as highly consequential and believes that a single relation (table) should be used for the different kinds of data. The use of explicit type fields may be appropriate in many cases, typically where the type value is some minor property that all the objects share. A good example of such a case might be cars with different colours: it would be foolish to store data about cars of different colours in different tables, one for each colour. However, it might (in some cases) make sense to store some data about trucks in a table of its own, since trucks have many attributes that a car simply does not. This is where the over-eager architect goes wrong: instead of normalizing out the data shared between trucks and cars and putting the data unique to each in its own table, one for trucks and one for cars, he stuffs everything into a single table and adds a type field to distinguish trucks from cars. The meaning of some data fields then becomes dependent on the introduced type field. The architect might do this because he hates to introduce new tables (an irrational fear of joins) or because he thinks he is expressing something important about the connection between trucks and cars. In either case he is doing the wrong thing for the wrong reason and has compromised database integrity. He has probably made the database harder for developers to use too, as queries involving type fields generally require the introduction of joins (oh the irony) to resolve.
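
A sketch of the two designs with hypothetical tables: shared attributes stored once, truck-only attributes in their own table, versus the one-big-table-plus-type-field version whose extra columns mean nothing for most rows:

    -- Shared attributes in one table, truck-only attributes in their own
    CREATE TABLE vehicle (
        vehicle_id INTEGER PRIMARY KEY,
        colour     VARCHAR(20) NOT NULL,     -- fine as a plain attribute
        make       VARCHAR(50) NOT NULL
    );

    CREATE TABLE truck (
        vehicle_id  INTEGER PRIMARY KEY REFERENCES vehicle(vehicle_id),
        axle_count  INTEGER NOT NULL,
        max_load_kg INTEGER NOT NULL
    );

    -- The blunder: one table for everything, with a type field deciding
    -- which columns mean anything for a given row
    -- CREATE TABLE vehicle (
    --     vehicle_id  INTEGER PRIMARY KEY,
    --     type        VARCHAR(10),    -- 'car' or 'truck'
    --     colour      VARCHAR(20),
    --     axle_count  INTEGER,        -- NULL for cars
    --     max_load_kg INTEGER         -- NULL for cars
    -- );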