Data migration from old to
new application: an experience
By Gints Plivna
Organize and code your
data migration effectively
There are
several valuable articles around the internet that provides general overview
and strategy for data migration or data conversion method (for example The Complete Data Migration
Methodology by Joseph R. Hudicka, Dulcian, Inc., http://www.dulcian.com/papers/The%20Complete%20Data%20Migration%20Methodology.html)
but I couldn’t find an article that described more technical information about
data migration. This article tries to fill this gap and only shortly describes
chosen strategy and provides more technical details that are specific for data
migration from an old application to the new one.
Data
migration from the old system was part of a new application we recently
finished. Data model of the old system was completely different from the new
one. Old data were stored in 8i Oracle and had to be moved to new application
in Oracle 9i. Total amount of data to migrate wasn’t very big (several
gigabytes) but it was spread off on ~100 tables and included more than 40
million records. We had one definite positive factor – our client knew old
application well even in table and column level, so our task was to gather
information, identify what should be migrated and what shouldn’t, develop SRS
for migration, develop code and test results.
Establishing migration strategy
In order to
avoid a big monstrous migration problem all data were divided in several
logically separated groups called data
groups that can be processed serially. Of course they had some
dependencies, for example classifier data group has to be migrated first of
all. Division provided obvious benefits such as easier planning for project
manager, each analyst could focus on a tighter problem and software for
migration of each data group could be produced separately.
All data
groups were migrated by the same scenario – source
data from the system that has to be migrated were moved to migration
environment, I’ll refer to them as migration
data later. In migration environment they were processed in such a manner
that they could be almost without any change moved to our new application data
base, let’s call them target data. Such
a process gave us following benefits:
- Source data remained unchanged.
In case of emergence no harm was done to them;
- All migration process except
initial data copy could be done away from old production server;
- All changes in migration
process could easily be tracked. One could without any problems restore
link from migration data to source data, track how they has been changed and how they looked when
inserted into target data base;
- Migration process could be easily
restarted if necessary.
During the
migration process two kinds of operations were applied to migration data:
- Validations – checked whether a row satisfied a certain criteria e.g. if a
column was not null or values was in specified range. Validations were
either warnings i.e. a flag was
set for a row that violated a validation but further processing could be
done or errors i.e. a row
violated validation and couldn’t be processed further. Validations didn’t
change any business data they only set the appropriate flag. Both warnings
and errors were logged in a separate table for further examination if
necessary.
- Transformations – made actual changes to business data and examples started with
simple sequence number assign and ended with complex data derivation
depending on some other data.
During
migration analysis all validations and transformations were identified and they
were uniquely ordered for each data group. Of course such an approach led to
identification of global validations and
transformations that improved overall
data migration quality i.e. helped us not to forget a common validation and
transformation and make them in a consistent way.
As a result
separate System Requirements/Design Specification (SRS) for each data group and one global SRS was produced generally
containing following information:
- Listed prerequisites for the
migration of a particular data group;
- Identified source tables that
has to be migrated for the particular data group;
- Described all migration tables
that stored migration data;
- Described and enumerated all
validations and transformations that had to be applied to the data group.
Next
chapter will describe how logical units mentioned in SRSes can be mapped to
real PL/SQL code.
Transforming requirements
to code
All
migration was done in PL/SQL packages. Generally one package was created for
each data group and one procedure for one or more validation and
transformation. In development of migration code one has to consider following
differences between migration code and usual OLTP application code:
- Used once vs. used many times
for many years. Normally migration code has to be developed, tested and
then run only once to move data from source data base to target data base.
If you don’t plan using old and new systems in parallel only a critical
bug in migration process can force you to run it more than once in
production.
- Can be pretty static and deal
only with current situation vs. more dynamic and easy to evolve. By static
I mean that in migration software one can for example hardcode classifier
values, think only about current data structures and doesn’t pay much
effort how this code would handle alternating future requirements.
- Performed by one user
simultaneously vs. performed by many users simultaneously. Generally in
migration you run one migration process at the same time and you don’t
have to worry about other users. If you have underutilized resources you
can run several parallel processes if possible, but you can definitely
control how many they would be.
- Can use all computer’s
resources vs. can use only a part of computer’s resources. Generally the
migration is done on a separate box and you can consider all resources
provided by it. It gives you sense of stability, there isn’t any chance
that someone would intrude, spawn hundreds of processes and take away all
your so valuable resources.
You can
modify usual approach for developing application code according to above
mentioned differences. Next chapter will list some tips how to make your
migration run faster. Some of them will be a usual good programming practise
you have to use for every application development, some the very opposite –
you’d like to avoid in normal OLTP application.
Coding techniques
Following
coding and environment setup tips and techniques arrived in our project. I’ll
try to explain them in order how they appeared. If you avoid the same mistakes
we have met the goal of this article will be achieved.
1. We found
I/O our main bottleneck. More than half of I/O operations overall was done on
migration data, and here I mean only data, not indexes. Undo, migration
indexes, source data, target data each took about 10% of overall I/O and temp
~5%. So the conclusion for us was to reduce I/O contention as much as we could
on migration data. If you experience the same problems and haven’t access
and/or resources for high level I/O load balancing schemes you can use at least
poor man’s striping for migration data. We didn’t care much about source and
target data load balancing because source tables where accessed only once when
all data moved to migration tables and target tables also where mostly accessed
once when inserting migrated data. Only small amount of additional lookup was
necessary to target data in migration process.
2. Adjust
enough space for temp and undo. Think in gigabytes here, not megabytes.
3. Use some
kind of logger to log start and finish of each transformation and validation.
You can use autonomous transaction for this purpose. Measure time each
transformation and validation requires and focus on tuning only those that run
significant time. If overall migration process takes 3 hours you don’t care
about transformation that runs 10 seconds even if you can reduce necessary
running time 10 times. But you do care if you can reduce a transformation that
takes 2 hours by a small 10%.
4. Most
probably you don’t care about media failure during migration (it would be
easier restart the process from beginning) so you can use INSERT with APPEND
hint and CREATE INDEX with NOLOGGING option. This helps reducing a big amount
of redo generation. These two options are especially useful when moving data
from source tables to migration tables and later when doing inserts in target
tables. But remember that inserting with VALUES keyword always generates redo
as well as redo is generated if you have any indexes on the table or database
is in FORCE LOGGING mode.
5. In
initial load from source tables to migration tables try to filter out data that
obviously aren’t possible to migrate to reduce the migration data size that
will be changed over and over again. Each full scan on table (although
generally these aren’t bad see paragraph 14) will go through all table blocks
even those you’ll never need.
6. Use
referential integrity for migration data only where you really need it. This is
contrary to normal applications where you have to define referential integrity as
much as possible in database. Why? Because your code would be the only one modifying
migration data. Because migration data would be interesting only in migration
process and shortly after it, normally you don’t need to enforce integrity
constraints because in future someone could easily corrupt integrity. In
migration process you don’t care about distant future, you do care about
current moment and speed. Of course if you are doubtful and think that your
code cannot guarantee uniqueness of some column(-s) and parent/child
relationship between some of your tables then you should enforce primary keys
and foreign keys respectively. Enforcing them in the database is the fastest
way if you think they could be
broken.
7. Consider
seriously PCTFREE parameter for migration tables. If these are subjects of
heavily updates then default value of 10 is too less. With default value you’ll
end up with so many chained rows that they will very much slow down updates and
selects on your tables. After testing the „real migration” (see paragraph 16) ANALYZE
your tables and look for CHAIN_CNT in dba/all/user_tables view. If it is larger
than some percents of total rows in table you should increase PCTFREE. Of
course it all depends and I cannot give you exact value, but I become
suspicious for values larger than 5%.
8. Standardize
your migration tables e.g. if you use some kind of logging info then use it in
a consistent way. If you have a flag that indicates whether row has to be moved
to target tables then name it in the same manner and use the same data type
everywhere. We used following standard columns for all our migration tables:
- Last transformation or
validation that processed this row;
- Timestamp when it was last processed;
- Does it must be migrated?
- Timestamp when it has been
migrated to target system.
9. Use SQL
and set operations as much as you can. Use row by row operations only if these
are really necessary. Remember that you have such constructions like DECODE and
its younger brother CASE, NVL, various string functions, WITH keyword for
SELECTS, INSERT into multiple tables that somehow helps you calculate and
manipulate with values based on current row. And become familiar with ANALYTIC
functions that give you ability to look on other rows and their products in a
result set. Remember that you can create as many migration tables as you really
need. Consider row by row operations and cursors only as a last option.
10. If you
really cannot avoid loops and row by row operations (see paragraph above) then
use some kind of tool that provides you with information how much work is done
already. This tool for example can be
DBMS_APPLICATION_INFO.SET_SESSION_LONGOPS. For more information read Oracle9i
Supplied PL/SQL Packages and Types Reference or PL/SQL Packages and Types
Reference 10g Release 1.
11. Calculate
constants only once. Do not recalculate some lookup parameters (for example
current user or user’s organization) for every row. Even if you have some
classifier that has several values it is better to initialize some variables or
constants in the beginning of package or procedure than to query from a lookup
classifier table for each row. Your code wouldn’t be dynamic and probably not
so elegant, but it would be much faster and that is main goal at least for
migration code.
12. Consider
using sysdate everywhere. Do you really care whether row was transformed in
10:31:46 when it really was transformed or all rows were transformed in
10:31:00 when transformation started? It is especially important if you use a
construction SELECT sysdate INTO variable FROM dual. Why not simply initialize
a variable in beginning of each transformation rather than doing this for each
row? If you are concerned whether it costs anything just run dbms_profiler and
look at results.
13. In your
requirements gathering process you should identify some common transformations
and/or validations for migration data. How to realize them? There are two
possibilities: as a common procedure/function or everybody uses his own
solution. Common solution would give consistent results every time but probably
wouldn’t be the fastest one, in opposition to everyone’s own solution. So you
have to understand what benefits and disadvantages would give each solution (for
example how big the risk is that somebody would code transformation in a way
that isn’t compatible with others) and make a decision.
14. Be
suspicious about user-defined functions that are called for each row and do
some queries on their own. Better incorporate them into upper statement.
15. Full
scans aren’t bad. Especially in migration process. Migration involves many
transformations and validations that affect all or most of the data in a table.
If you have to process all data in a table then do it! Do it with full scan! Try
hash joins. Nested loops and access path via indexes can give you wonderful
results in an OLTP application where you have to select or update several
records, but it can be very slow process if you have to select or update all
records. And one more benefit from full scans and hash joins - you can monitor
them in v$session_longops and some tools like Enterprise Manager even can
visualise them. I prefer to know how much work has been done and how much
remained.
16. Before
loading data into target schema turn off all auditing and drop all indexes if
possible. Consider creating audit data later especially if this is some kind of
home grown auditing via triggers. Loading audit data in one single move will be
much faster than firing a trigger on each row. The other thing is indexes and
constraints on target tables. Creating indexes later would avoid possible
fragmentation issues as well as with NOLOGGING will save your time. Enabling
constraints NOVALIDATE also will save your time.
17. Measure
overall migration performance! We have tried to measure and compare many
overall statistics and events from v$session_event and v$mystat and found that
only reliable measurement is how many
rows package migrated in the given time. Our migration rate was from 20 000
till 100 000 rows in a minute for various data groups. Unfortunately it was
obvious that data groups with lowest rate wasn’t the most complicated ones with
most quantity or most sophisticated transformations. The biggest reason was the
programming style of each developer. Developers having the least ratio didn’t
follow at least one or several suggestions given here.
Some more
advices for migration performance measurement:
- Do that on the same hardware
with the same environment;
- You can measure either target
or source rows depending on your specificity. We measured target row
count;
- Analyze the best and worst
cases (assuming the complexity is comparable) and try to use practices
from best case and avoid those used in worst case.
18. Test
migration process at least once on all your real data. This would give you
following benefits:
- Provide with a real case how
long it takes to migrate all the data;
- Ensure that you have dealt with
all possible data combinations at least in that given moment;
- Give you real information about
physical parameters such as overall disk space both for migration data and
target data, necessary temp and undo space, PCTFREE parameter (more in paragraph
7);
- Provide possibility to test
target data whether anything isn’t forgotten;
- Give you at least a little
guarantee that main migration would run without emergency cases.
19. In
actual migration be prepared for emergency. You should monitor migration
process even if you have tested it several times, especially if data has been
changed since then. New unexpected classifier values, changed execution plans –
at least these are things that could affect you without notice.
20. And the
last but not least: if you have any doubts about statements written here as
well as anything else you’ve heard is a correct approach, try it yourself! Test
possible scenarios, measure results and draw a conclusion.
Summary
Data
migration method described in this article is suitable for complicated
transformations and validations in migration process. It can be used for small
and middle sized data migration projects because it requires much additional
storage along with source data and target data storage. And there are also
involved two steps of big data movement from source to migration and from
migration to target environments. For big databases without complicated
transformations you probably may choose some kind of direct data moving from
source to target environment.
Migration
software design and coding issues highlighted above should give you some
insight what problems you’ll have, some tips how to avoid them and provide with
overview of the successful experience.
About the author
Gints
Plivna is system analyst in Rix Technologies Ltd. (www.rixtech.lv). He has experience
in working with Oracle since 1997 and his interests mostly have been connected
with analyzing system requirements, design, development and SQL tuning.
Contacts:
e-mail - gints.plivna@gmail.com
website - http://www.gplivna.eu/
Licenses
This work
is licensed under the Creative Commons Attribution-ShareAlike 2.5 License. To
view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/
or send a letter to