Do you have data waste or data base?
By Gints Plivna
Every person more or less is familiar with data bases. Starting with heap of scratch papers, notebooks (I mean the old ones on the paper :), and ending with relational data base management systems storing zillions of records. Not so many people think they are familiar with data wastes. But in reality most probably they are! You ask – what is data waste? In short – this is data base where there aren’t clearly defined algorithms how to get out rational data.
You probably have (or had) an old notebook on paper with very important information including phone numbers, addresses, peoples etc. randomly written in. The same probably could be said about files and directories of your own work computer (if you have a perfect system, then most probably you are an exception :).
So who cares about your own notebook or computer? Nobody do, except you. But in case of databases used by many people situation turns out far worse.
The first big chapter “Characteristics for data waste” describes how one can identify data waste. Second chapter “Consequences of data waste” outlines most important and worst consequences of data waste.
This chapter tries to formulate most common and important characteristics describing a data waste. Most of the time none of them are true/false type features but the more of them you have the more your data base is data waste with all possible (negative) consequences.
Although the references are mostly related to Oracle the concepts are valid for each and every database management system that has more advanced features than simply storing data in tables.
What I understand by documentation? Documentation in reality can be stored in many ways. You can have it in Software design description according to IEEE standard, separate document called something like Database design description or even in comment format in database. The most important thing isn’t that you have enumerated all the tables and columns, but that you have described what they contains, what possible values they could have, what these values mean, what relationships they have etc. One can easily generate table, column, column data type, primary key, foreign key and list of other database objects from data dictionary. The main problem is to understand what these objects contain and what the purpose they serve is.
1.http://www.bbc.co.uk/guidelines/newmedia/word_docs/OracleDatabaseDesignTemplate.doc - New Media (Oracle) Database Design Template.
In conjunction with “lack of documentation” (1) and “all logic in application / db has only tables” (5) you are in a big trouble if you need to change something or get reasonable data out. The only way is to analyze the data model, analyze the data and analyze the application code (if source is available). This takes time. This takes lot of time. And there is no guarantee that you have made changes in a right way. Probably after some time you’ll find that your changes won’t work for a very special case and worst of all have silently damaged your other data. So take care of your only well-informed person and force him to correct “lack of documentation” or at least deliver his knowledge to others.
Initial design is crucial for almost everything that is at least a bit complex. You cannot build a big house without accepted project. Unfortunately many times because of false expectations that it would be cheaper information systems are built without any initial design. Actually it will be far more expensive at least in long term. The earlier the mistake is allowed the harder is to correct it. If your data model is incorrect in the very root it will affect all the system and even best programmers in the world wouldn’t be able to make the application performance acceptable.
1. Data Model Patterns: Conventions of Thought by David C. Hay, ISBN: 0932633293;
2. The Data Model Resource Book, Vol. 1: A Library of Universal Data Models for All Enterprises by Len Silverston, ISBN: 0471380237;
3. Requirements Analysis: From Business Views to Architecture by David C. Hay, ISBN: 0130282286;
4. http://www.phlonx.com/resources/nf3/ - Tutorial on Database Normalization;
5.http://asktom.oracle.com/pls/ask/f?p=4950:8:::::F4950_P8_DISPLAYID:6692296628899 – only one table with complex object as blob;
6. Oracle Insights Tales of the Oak Table, Chapter 11 Bad CaRMa by Tim Gorman, ISBN: 1590593871;
7. http://www.learndatamodeling.com/ - a short description of data modelling.
8. http://www.tdan.com/edatt1_archive.htm - article archive. You can search for example for David Hay; he has some valuable free articles there.
9. http://www.utexas.edu/its/windows/database/datamodeling/dm/overview.html - introduction to data modelling and relational modelling.
Physical modelling for Oracle
1. Expert Oracle Database Architecture: 9i and 10g Programming Techniques and Solutions by Thomas Kyte, ISBN: 1590595300;
2. Effective Oracle by Design (Osborne ORACLE Press Series) by Thomas Kyte, ISBN: 0072230657;
3. Oracle Insights Tales of the Oak Table, Chapter 10 Design Disasters by Jonathan Lewis, ISBN: 1590593871;
4. http://asktom.oracle.com – information mountain by Thomas Kyte.
Usually that goes hand in hand with “3. Lack of initial design” It will raise the total incomprehension level about your data base because every object has to be analyzed separately. You cannot create patterns from already known information and use analogies that greatly reduce necessary overall work.
1. http://www.gplivna.eu/papers/naming_conventions.htm - Naming conventions for Oracle tables, columns, indexes, keys etc;
2. http://en.wikipedia.org/wiki/Analogy - Analogy, From Wikipedia, the free encyclopaedia.
Do you have all business logic outside your Oracle database? Yes? Then why on earth you have spent so many $$ for Oracle licence? You could easily live with MySQL! And MySQL would certainly be faster! Why? Because Oracle has so much possible functionality compared to MySQL, that it certainly adds some overhead. This is just like you have rented a big house with tens of rooms and tons of equipment and advanced features and you are using only living-room. You are not using WC. Why WC? You can go out in the garden and do necessary things there! You are not using kitchen. Why kitchen? You can go out and do the cooking on fire! So would you pay rent for the entire house with all its features and wouldn’t use them?
The same situation is with databases, especially advanced databases with many features. You have paid so much money and are just deliberately robbing yourself.
If you are seeking for a miracle called database independence then you have achieved the wonderful result when your application isn’t performing at possible best on every one of them. All the built in features are as close as possible to the data and that’s the primary reason they are doing things better than any feature outside database.
This is just like you are living in several separate houses each with different architecture and using only living-rooms there. I’m sure you wouldn’t be satisfied with your quality of life.
1. http://www.rittman.net/archives/001140.html - The Cost of Database Independence by Mark Rittman;
2.http://asktom.oracle.com/pls/ask/f?p=4950:8:::::F4950_P8_DISPLAYID:883929178230 – business logic in DB versus out of DB.
6. Many persons have already taken part to make data waste bigger each with his own imagination and level of understanding
With “3. Lack of initial design” and “4. Lack of naming conventions” all people developing your data base are like ten people decorating your ten room house without any guidance. If you are lucky achieved result will be quite acceptable, if you aren’t result will strike terror into everyone. The same situation is with databases. There are many conventions, standards and best practises to follow. If more than one of them are used in your database the ambiguity level is increased and it becomes apparent most brightly when you have to do something with object A made after convention one and object B made after convention two.
7.1. No constraints and referential integrity
This usually is the most radical form of “5. All logic in application / db has only tables” or result of simple incompetence. If you haven’t enforced referential integrity in database then this is only matter of time when you’ll have orphan records, duplicate values and unwanted values. Sooner or later you’ll access your data by passing your application, sooner or later someone will use a bug in your application and your data will be damaged.
Trying to enforce referential integrity and for example unique constraints in application is far far harder than doing that in database, especially if you have multi-user application (and who hasn’t?).
1. http://en.wikipedia.org/wiki/Referential_integrity - Referential integrity, from Wikipedia, the free encyclopaedia;
2. http://download-uk.oracle.com/docs/cd/B19306_01/appdev.102/b14251/adfns_constraints.htm - Oracle® Database Application Developer's Guide – Fundamentals, 6 Maintaining Data Integrity in Application Development;
3. http://tkyte.blogspot.com/2006/06/what-did-i-decide-on.html - The Tom Kyte Blog, some worst practises.
7.2. Cryptic and uninformative column/table names
Tables with names tab1, tab2 or AKP1, AKP2 and columns with names col1, col2 etc. can turn everyone’s life into a hell. I prefer then table names in unknown language; at least there is a reason to learn some words in it. In a new database initially it is hard to remember logical names to say nothing of col1 and col2. Each and every usage of them results in looking into some document and/or sticker.
7.3. Many columns like notes without any classifiers. Each user enters data in his own style
A human can
probably get along without classifiers because he understands that
7.4. Character data type instead of date, number
This is another one “feature” of database independence because each data base treats its data types differently. So instead of using DATE and NUMBER you use VARCHAR (10) and VARCHAR (XX). As a result you have wonderful potential for a big mess. Ordering for characters works like ‘1’, ‘10’, ‘2’ and not like you’d like 1, 2, 10, nothing to speak of ordering of varchar dates. Date and number mathematic becomes Sisyphean task and there is always a threat that someone will enter either date in incorrect format or number with for example character “O”.
1. http://en.wikipedia.org/wiki/Sisyphus - Sisyphus, from Wikipedia, the free encyclopedia.
7.5. Chaotic indexes
This is more physical, than logical aspect of database. In case the database has more or less only necessary indexes one can found common data access patterns that can tell something about valuable data and not so valuable data. In case the database has chaotic indexes one can found only bad example of indexing scheme.
List of consequences described below of course isn’t complete and these are for the worst case scenario when database complies with all criteria described above. Therefore I hope there are only a few data bases in the world fulfilling all characteristics mentioned above and I hope even more that you don’t have to deal with one of them. Even if you have, maybe you are lucky enough to find a person that has researched your problem source before and can give you some knowledge about it, maybe you are lucky enough and the database is so small that even with all characteristics above fulfilled you can understand it in a reasonable time, maybe you are lucky enough and can agree with customer to start a new life without old data or at least only with data that satisfies some minimal criteria.
The more your data base is data waste the more each change costs.
- With “1. Lack of documentation” and “2. The only well-informed person has left a while ago” you don’t know WHAT to change;
- With “3. Lack of initial design” and “4. Lack of naming conventions” you don’t know HOW to change;
- With “5. All logic in application / db has only tables” and “7.1. No constraints and referential integrity” you cannot be sure that your changes sometime won’t break all the system.
You are stick with your data waste and there is nothing but pray that something won’t break.
OK, you’ve understood that you have data waste, let’s create a data base and start things over! The bad news is – that’s quite a big work or almost impossible for a real data waste.
- With “1. Lack of documentation” and “2. The only well-informed person has left a while ago” you don’t know WHAT to migrate. You don’t know which tables and columns take seriously, which tables and columns are simple waste, you even don’t know what all this mess means!
- With “5. All logic in application / db has only tables” and “7.1. No constraints and referential integrity” you have records without necessary information (because of lack of NOT NULL constraints), you have orphan records and don’t know where to put them and you have duplicate records with different values.
- With “7.3. Many columns like notes without any classifiers. Each user enters data in his own style” and “7.4. Character data type instead of date, number” you haven’t clear algorithms how to make data transformation and how to map old a la classifiers to new classifiers.
- With “7.2. Cryptic and uninformative column/table names” your task of writing migration code becomes nightmare and every evening you have headache J
As a result with completely inadequate work you have made partial migration and all users complain that old system was better because had more information and was less restrictive.
Data waste is probably one of the biggest causes of viewpoint that the biggest value is application, not the data. I’ve personally seen only the contrary. What is data conversion from text files, various ancient DBMSes, previous data model versions? All these data were entered via some application. They are gone forever; nobody cares about them, but the great requirement to convert old data remains. In case of data waste the overall quality of data and inability to convert them leads to this false conclusion.
Avoiding mistakes described in “Characteristics for data waste” doesn’t guarantee that your database will be good. But allowing them does guarantee you problems that you won’t be able to correct (except “7.5. Chaotic indexes”) via tuning, caching, writing better SQL, using bind variables, buying bigger box or whatever. These are logic mistakes and the only ultimate solution is to redesign data base and most probably rewrite your application from scratch. If you really think that you need to allow at least one of them then think at least twice and try to formulate both minuses and pluses for yourself. This will at least allow you to look deeper in possible problems and gather new experience for decision making skills J
1. http://en.wikipedia.org/wiki/Anti-pattern - Anti-pattern, from Wikipedia, the free encyclopedia;
2. http://www.web-hits.org/txt/codingunmaintainable.html - How To Write Unmaintainable Code.
Gints Plivna firstname.lastname@example.org is system analyst in Rix Technologies Ltd. (www.rixtech.lv). He has experience in working with Oracle since 1997 and his interests mostly have been connected with analyzing system requirements, design, development and SQL tuning.
e-mail - email@example.com
website - http://www.gplivna.eu/
is licensed under the Creative Commons Attribution-ShareAlike 2.5 License. To
view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/
or send a letter to
First edition: 2006-07-24
Changes: 2006-08-09 added url for logical modelling