Do you have data waste or
data base?
By Gints Plivna
Characteristics for data waste
2. The only well-informed person has
left a while ago
5. All logic in application / db has
only tables
7. Physical attributes of database
objects
7.1.
No constraints and referential
integrity
7.2.
Cryptic and uninformative
column/table names
7.3.
Many columns like notes without
any classifiers. Each user enters data in his own style
7.4.
Character data type instead of
date, number
1. Evolution and changes are almost
impossible
2. Migration is almost impossible
3. Theory that application is the
main entity not data
Every
person more or less is familiar with data bases. Starting with heap of scratch
papers, notebooks (I mean the old ones on the paper :), and ending with
relational data base management systems storing zillions of records. Not so
many people think they are familiar with data wastes. But in reality most
probably they are! You ask – what is data waste? In short – this is data base
where there aren’t clearly defined algorithms how to get out rational data.
You
probably have (or had) an old notebook on paper with very important information
including phone numbers, addresses, peoples etc. randomly written in. The same
probably could be said about files and directories of your own work computer
(if you have a perfect system, then most probably you are an exception :).
So who
cares about your own notebook or computer? Nobody do, except you. But in case
of databases used by many people situation turns out far worse.
The first
big chapter “Characteristics
for data waste” describes how one can identify data waste. Second
chapter “Consequences of
data waste” outlines most important and worst consequences of
data waste.
Characteristics
for data waste
This
chapter tries to formulate most common and important characteristics describing
a data waste. Most of the time none of them are true/false type features but
the more of them you have the more your data base is data waste with all
possible (negative) consequences.
Although
the references are mostly related to Oracle the concepts are valid for each and
every database management system that has more advanced features than simply
storing data in tables.
1. Lack of
documentation
What I understand by documentation? Documentation in reality can be stored in many ways. You can have it in Software design description according to IEEE standard, separate document called something like Database design description or even in comment format in database. The most important thing isn’t that you have enumerated all the tables and columns, but that you have described what they contains, what possible values they could have, what these values mean, what relationships they have etc. One can easily generate table, column, column data type, primary key, foreign key and list of other database objects from data dictionary. The main problem is to understand what these objects contain and what the purpose they serve is.
Further reading:
1.http://www.bbc.co.uk/guidelines/newmedia/word_docs/OracleDatabaseDesignTemplate.doc
- New Media (Oracle) Database Design Template.
2. The only
well-informed person has left a while ago
In conjunction
with “lack of documentation” (1) and “all logic in application / db has only
tables” (5) you are in a big trouble if you need to change something or get
reasonable data out. The only way is to analyze the data model, analyze the
data and analyze the application code (if source is available). This takes
time. This takes lot of time. And there is no guarantee that you have made
changes in a right way. Probably after some time you’ll find that your changes won’t
work for a very special case and worst of all have silently damaged your other
data. So take care of your only well-informed person and force him to correct
“lack of documentation” or at least deliver his knowledge to others.
3. Lack of
initial design
Initial
design is crucial for almost everything that is at least a bit complex. You
cannot build a big house without accepted project. Unfortunately many times
because of false expectations that it would be cheaper information systems are
built without any initial design. Actually it will be far more expensive at
least in long term. The earlier the mistake is allowed the harder is to correct
it. If your data model is incorrect in the very root it will affect all the
system and even best programmers in the world wouldn’t be able to make the
application performance acceptable.
Further reading:
Logical modelling
1. Data
Model Patterns: Conventions of Thought by David C. Hay, ISBN: 0932633293;
2. The Data
Model Resource Book, Vol. 1: A Library of Universal Data Models for All
Enterprises by Len Silverston, ISBN: 0471380237;
3. Requirements
Analysis: From Business Views to Architecture by David C. Hay, ISBN: 0130282286;
4. http://www.phlonx.com/resources/nf3/
- Tutorial on Database Normalization;
5.http://asktom.oracle.com/pls/ask/f?p=4950:8:::::F4950_P8_DISPLAYID:6692296628899
– only one table with complex object as blob;
6. Oracle
Insights Tales of the Oak Table, Chapter 11 Bad CaRMa by Tim Gorman, ISBN:
1590593871;
7. http://www.learndatamodeling.com/
- a short description of data modelling.
8. http://www.tdan.com/edatt1_archive.htm - article archive. You can search for example for David Hay; he has some valuable free articles there.
9. http://www.utexas.edu/its/windows/database/datamodeling/dm/overview.html - introduction to data modelling and relational modelling.
Physical modelling for Oracle
1. Expert
Oracle Database Architecture: 9i and 10g Programming Techniques and Solutions by
Thomas Kyte, ISBN: 1590595300;
2. Effective
Oracle by Design (Osborne ORACLE Press Series) by Thomas Kyte, ISBN: 0072230657;
3. Oracle Insights
Tales of the Oak Table, Chapter 10 Design Disasters by Jonathan Lewis, ISBN:
1590593871;
4. http://asktom.oracle.com – information mountain by Thomas Kyte.
4. Lack of naming
conventions
Usually
that goes hand in hand with “3.
Lack of initial design” It will raise the total incomprehension level about
your data base because every object has to be analyzed separately. You cannot
create patterns from already known information and use analogies that greatly reduce
necessary overall work.
Further reading:
1. http://www.gplivna.eu/papers/naming_conventions.htm
- Naming conventions for Oracle tables, columns, indexes, keys etc;
2. http://en.wikipedia.org/wiki/Analogy
- Analogy, From Wikipedia, the free encyclopaedia.
5. All logic in
application / db has only tables
Do you have
all business logic outside your Oracle database? Yes? Then why on earth you
have spent so many $$ for Oracle licence? You could easily live with MySQL! And
MySQL would certainly be faster! Why? Because Oracle has so much possible
functionality compared to MySQL, that it certainly adds some overhead. This is
just like you have rented a big house with tens of rooms and tons of equipment
and advanced features and you are using only living-room. You are not using WC.
Why WC? You can go out in the garden and do necessary things there! You are not
using kitchen. Why kitchen? You can go out and do the cooking on fire! So would
you pay rent for the entire house with all its features and wouldn’t use them?
The same
situation is with databases, especially advanced databases with many features.
You have paid so much money and are just deliberately robbing yourself.
If you are
seeking for a miracle called database independence then you have achieved the
wonderful result when your application isn’t performing at possible best on every
one of them. All the built in features are as close as possible to the data and
that’s the primary reason they are doing things better than any feature outside
database.
This is
just like you are living in several separate houses each with different architecture and using only
living-rooms there. I’m sure you wouldn’t be satisfied with your quality of
life.
Further reading:
1. http://www.rittman.net/archives/001140.html
- The Cost of Database Independence by Mark Rittman;
2.http://asktom.oracle.com/pls/ask/f?p=4950:8:::::F4950_P8_DISPLAYID:883929178230
– business logic in DB versus out of DB.
6.
Many persons have already taken part to make data waste bigger each with his
own imagination and level of understanding
With “3. Lack of
initial design” and “4.
Lack of naming conventions” all people developing your data base are like ten people decorating
your ten room house without any guidance. If you are lucky achieved result will
be quite acceptable, if you aren’t result will strike terror into everyone. The
same situation is with databases. There are many conventions, standards and
best practises to follow. If more than one of them are used in your database the
ambiguity level is increased and it becomes apparent most brightly when you
have to do something with object A made after convention one and object B made
after convention two.
7.
Physical attributes of database objects
7.1. No constraints and referential
integrity
This
usually is the most radical form of “5.
All logic in application / db has only tables” or result of simple incompetence. If you haven’t
enforced referential integrity in database then this is only matter of time
when you’ll have orphan records, duplicate values and unwanted values. Sooner
or later you’ll access your data by passing your application, sooner or later
someone will use a bug in your application and your data will be damaged.
Trying to
enforce referential integrity and for example unique constraints in application
is far far harder than doing that in database, especially if you have multi-user
application (and who hasn’t?).
Further reading:
1. http://en.wikipedia.org/wiki/Referential_integrity
- Referential integrity, from Wikipedia, the free encyclopaedia;
2. http://download-uk.oracle.com/docs/cd/B19306_01/appdev.102/b14251/adfns_constraints.htm
- Oracle® Database Application Developer's Guide – Fundamentals, 6 Maintaining
Data Integrity in Application Development;
3. http://tkyte.blogspot.com/2006/06/what-did-i-decide-on.html
- The Tom Kyte Blog, some worst practises.
7.2. Cryptic and uninformative column/table names
Tables with
names tab1, tab2 or AKP1, AKP2 and columns with names col1, col2 etc. can turn
everyone’s life into a hell. I prefer then table names in unknown language; at
least there is a reason to learn some words in it. In a new database initially
it is hard to remember logical names to say nothing of col1 and col2. Each and
every usage of them results in looking into some document and/or sticker.
7.3. Many columns like notes without any classifiers. Each user enters data in
his own style
A human can
probably get along without classifiers because he understands that
7.4. Character data type instead of date, number
This is
another one “feature” of database independence because each data base treats its
data types differently. So instead of using DATE and NUMBER you use VARCHAR (10)
and VARCHAR (XX). As a result you have wonderful potential for a big mess. Ordering
for characters works like ‘1’, ‘10’, ‘2’ and not like you’d like 1, 2, 10,
nothing to speak of ordering of varchar dates. Date and number mathematic
becomes Sisyphean task and
there is always a threat that someone will enter either date in incorrect
format or number with for example character “O”.
Further reading:
1. http://en.wikipedia.org/wiki/Sisyphus
- Sisyphus, from Wikipedia, the free encyclopedia.
7.5. Chaotic indexes
This is
more physical, than logical aspect of database. In case the database has more
or less only necessary indexes one can found common data access patterns that
can tell something about valuable data and not so valuable data. In case the
database has chaotic indexes one can found only bad example of indexing scheme.
Consequences of
data waste
List of
consequences described below of course isn’t complete and these are for the
worst case scenario when database complies with all criteria described above.
Therefore I hope there are only a few data bases in the world fulfilling all
characteristics mentioned above and I hope even more that you don’t have to
deal with one of them. Even if you have, maybe you are lucky enough to find a
person that has researched your problem source before and can give you some
knowledge about it, maybe you are lucky enough and the database is so small
that even with all characteristics above fulfilled you can understand it in a
reasonable time, maybe you are lucky enough and can agree with customer to
start a new life without old data or at least only with data that satisfies
some minimal criteria.
1.
Evolution and changes are almost impossible
The more
your data base is data waste the more each change costs.
- With “1. Lack of documentation” and “2.
The only well-informed person has left a while ago” you don’t know WHAT to change;
- With “3. Lack of initial design” and “4.
Lack of naming conventions” you don’t know HOW to change;
- With “5. All logic in application / db has only
tables” and “7.1. No constraints and
referential integrity” you cannot
be sure that your changes sometime won’t
break all the system.
You are
stick with your data waste and there is nothing but pray that something won’t
break.
2.
Migration is almost impossible
OK, you’ve
understood that you have data waste, let’s create a data base and start things
over! The bad news is – that’s quite a big work or almost impossible for a real
data waste.
- With “1. Lack of documentation” and “2.
The only well-informed person has left a while ago” you don’t know WHAT to migrate. You don’t know which tables and columns take
seriously, which tables and columns are simple waste, you even don’t know
what all this mess means!
- With “5. All logic in application / db has only
tables” and “7.1. No constraints and
referential integrity” you have records without necessary information
(because of lack of NOT NULL constraints), you have orphan records and
don’t know where to put them and you have duplicate records with different
values.
- With “7.3.
Many columns like notes without any classifiers. Each user enters data
in his own style” and “7.4. Character data type
instead of date, number” you haven’t clear algorithms how to make data
transformation and how to map old a la classifiers to new classifiers.
- With “7.2.
Cryptic and uninformative column/table names” your task of writing migration code becomes
nightmare and every evening you have headache J
As a result
with completely inadequate work you have made partial migration and all users
complain that old system was better because had more information and was less
restrictive.
3.
Theory that application is the main entity not data
Data waste
is probably one of the biggest causes of viewpoint that the biggest value is
application, not the data. I’ve personally seen only the contrary. What is data
conversion from text files, various ancient DBMSes, previous data model
versions? All these data were entered via some application. They are gone forever;
nobody cares about them, but the great requirement to convert old data remains.
In case of data waste the overall quality of data and inability to convert them
leads to this false conclusion.
Summary
Avoiding
mistakes described in “Characteristics
for data waste” doesn’t guarantee that your database will be good. But
allowing them does guarantee you problems that you won’t be able to correct (except
“7.5. Chaotic indexes”) via tuning, caching, writing better SQL, using bind
variables, buying bigger box or whatever. These are logic mistakes and the only
ultimate solution is to redesign data base and most probably rewrite your
application from scratch. If you really think that you need to allow at least
one of them then think at least twice and try to formulate both minuses and
pluses for yourself. This will at least allow you to look deeper in possible
problems and gather new experience for decision making skills J
Further reading:
1. http://en.wikipedia.org/wiki/Anti-pattern
- Anti-pattern, from Wikipedia, the free encyclopedia;
2. http://www.web-hits.org/txt/codingunmaintainable.html
- How To Write Unmaintainable Code.
About
the author
Gints
Plivna gints.plivna@gmail.com is
system analyst in Rix Technologies Ltd. (www.rixtech.lv). He has experience in
working with Oracle since 1997 and his interests mostly have been connected
with analyzing system requirements, design, development and SQL tuning.
Contacts:
e-mail - gints.plivna@gmail.com
website -
http://www.gplivna.eu/
Licenses
This work
is licensed under the Creative Commons Attribution-ShareAlike 2.5 License. To
view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/
or send a letter to
First
edition: 2006-07-24
Changes: 2006-08-09 added url for logical modelling