Data migration from old to new application: an experience
By Gints Plivna
Organize and code your data migration effectively
There are several valuable articles around the internet that provides general overview and strategy for data migration or data conversion method (for example The Complete Data Migration Methodology by Joseph R. Hudicka, Dulcian, Inc., http://www.dulcian.com/papers/The%20Complete%20Data%20Migration%20Methodology.html) but I couldnít find an article that described more technical information about data migration. This article tries to fill this gap and only shortly describes chosen strategy and provides more technical details that are specific for data migration from an old application to the new one.
Data migration from the old system was part of a new application we recently finished. Data model of the old system was completely different from the new one. Old data were stored in 8i Oracle and had to be moved to new application in Oracle 9i. Total amount of data to migrate wasnít very big (several gigabytes) but it was spread off on ~100 tables and included more than 40 million records. We had one definite positive factor Ė our client knew old application well even in table and column level, so our task was to gather information, identify what should be migrated and what shouldnít, develop SRS for migration, develop code and test results.
Establishing migration strategy
In order to avoid a big monstrous migration problem all data were divided in several logically separated groups called data groups that can be processed serially. Of course they had some dependencies, for example classifier data group has to be migrated first of all. Division provided obvious benefits such as easier planning for project manager, each analyst could focus on a tighter problem and software for migration of each data group could be produced separately.
All data groups were migrated by the same scenario Ė source data from the system that has to be migrated were moved to migration environment, Iíll refer to them as migration data later. In migration environment they were processed in such a manner that they could be almost without any change moved to our new application data base, letís call them target data. Such a process gave us following benefits:
- Source data remained unchanged. In case of emergence no harm was done to them;
- All migration process except initial data copy could be done away from old production server;
- All changes in migration process could easily be tracked. One could without any problems restore link from migration data to source data, track how they has been† changed and how they looked when inserted into target data base;
- Migration process could be easily restarted if necessary.
During the migration process two kinds of operations were applied to migration data:
- Validations Ė checked whether a row satisfied a certain criteria e.g. if a column was not null or values was in specified range. Validations were either warnings i.e. a flag was set for a row that violated a validation but further processing could be done or errors i.e. a row violated validation and couldnít be processed further. Validations didnít change any business data they only set the appropriate flag. Both warnings and errors were logged in a separate table for further examination if necessary.
- Transformations Ė made actual changes to business data and examples started with simple sequence number assign and ended with complex data derivation depending on some other data.
During migration analysis all validations and transformations were identified and they were uniquely ordered for each data group. Of course such an approach led to identification of global validations and transformations that improved overall data migration quality i.e. helped us not to forget a common validation and transformation and make them in a consistent way.
As a result separate System Requirements/Design Specification (SRS) for each data group and one global SRS was produced generally containing following information:
- Listed prerequisites for the migration of a particular data group;
- Identified source tables that has to be migrated for the particular data group;
- Described all migration tables that stored migration data;
- Described and enumerated all validations and transformations that had to be applied to the data group.
Next chapter will describe how logical units mentioned in SRSes can be mapped to real PL/SQL code.
Transforming requirements to code
All migration was done in PL/SQL packages. Generally one package was created for each data group and one procedure for one or more validation and transformation. In development of migration code one has to consider following differences between migration code and usual OLTP application code:
- Used once vs. used many times for many years. Normally migration code has to be developed, tested and then run only once to move data from source data base to target data base. If you donít plan using old and new systems in parallel only a critical bug in migration process can force you to run it more than once in production.
- Can be pretty static and deal only with current situation vs. more dynamic and easy to evolve. By static I mean that in migration software one can for example hardcode classifier values, think only about current data structures and doesnít pay much effort how this code would handle alternating future requirements.
- Performed by one user simultaneously vs. performed by many users simultaneously. Generally in migration you run one migration process at the same time and you donít have to worry about other users. If you have underutilized resources you can run several parallel processes if possible, but you can definitely control how many they would be.
- Can use all computerís resources vs. can use only a part of computerís resources. Generally the migration is done on a separate box and you can consider all resources provided by it. It gives you sense of stability, there isnít any chance that someone would intrude, spawn hundreds of processes and take away all your so valuable resources.
You can modify usual approach for developing application code according to above mentioned differences. Next chapter will list some tips how to make your migration run faster. Some of them will be a usual good programming practise you have to use for every application development, some the very opposite Ė youíd like to avoid in normal OLTP application.
Following coding and environment setup tips and techniques arrived in our project. Iíll try to explain them in order how they appeared. If you avoid the same mistakes we have met the goal of this article will be achieved.
1. We found I/O our main bottleneck. More than half of I/O operations overall was done on migration data, and here I mean only data, not indexes. Undo, migration indexes, source data, target data each took about 10% of overall I/O and temp ~5%. So the conclusion for us was to reduce I/O contention as much as we could on migration data. If you experience the same problems and havenít access and/or resources for high level I/O load balancing schemes you can use at least poor manís striping for migration data. We didnít care much about source and target data load balancing because source tables where accessed only once when all data moved to migration tables and target tables also where mostly accessed once when inserting migrated data. Only small amount of additional lookup was necessary to target data in migration process.
2. Adjust enough space for temp and undo. Think in gigabytes here, not megabytes.
3. Use some kind of logger to log start and finish of each transformation and validation. You can use autonomous transaction for this purpose. Measure time each transformation and validation requires and focus on tuning only those that run significant time. If overall migration process takes 3 hours you donít care about transformation that runs 10 seconds even if you can reduce necessary running time 10 times. But you do care if you can reduce a transformation that takes 2 hours by a small 10%.
4. Most probably you donít care about media failure during migration (it would be easier restart the process from beginning) so you can use INSERT with APPEND hint and CREATE INDEX with NOLOGGING option. This helps reducing a big amount of redo generation. These two options are especially useful when moving data from source tables to migration tables and later when doing inserts in target tables. But remember that inserting with VALUES keyword always generates redo as well as redo is generated if you have any indexes on the table or database is in FORCE LOGGING mode.
5. In initial load from source tables to migration tables try to filter out data that obviously arenít possible to migrate to reduce the migration data size that will be changed over and over again. Each full scan on table (although generally these arenít bad see paragraph 14) will go through all table blocks even those youíll never need.
6. Use referential integrity for migration data only where you really need it. This is contrary to normal applications where you have to define referential integrity as much as possible in database. Why? Because your code would be the only one modifying migration data. Because migration data would be interesting only in migration process and shortly after it, normally you donít need to enforce integrity constraints because in future someone could easily corrupt integrity. In migration process you donít care about distant future, you do care about current moment and speed. Of course if you are doubtful and think that your code cannot guarantee uniqueness of some column(-s) and parent/child relationship between some of your tables then you should enforce primary keys and foreign keys respectively. Enforcing them in the database is the fastest way if you think they could be broken.
7. Consider seriously PCTFREE parameter for migration tables. If these are subjects of heavily updates then default value of 10 is too less. With default value youíll end up with so many chained rows that they will very much slow down updates and selects on your tables. After testing the Ąreal migrationĒ (see paragraph 16) ANALYZE your tables and look for CHAIN_CNT in dba/all/user_tables view. If it is larger than some percents of total rows in table you should increase PCTFREE. Of course it all depends and I cannot give you exact value, but I become suspicious for values larger than 5%.
8. Standardize your migration tables e.g. if you use some kind of logging info then use it in a consistent way. If you have a flag that indicates whether row has to be moved to target tables then name it in the same manner and use the same data type everywhere. We used following standard columns for all our migration tables:
- Last transformation or validation that processed this row;
- Timestamp when it was last processed;
- Does it must be migrated?
- Timestamp when it has been migrated to target system.
9. Use SQL and set operations as much as you can. Use row by row operations only if these are really necessary. Remember that you have such constructions like DECODE and its younger brother CASE, NVL, various string functions, WITH keyword for SELECTS, INSERT into multiple tables that somehow helps you calculate and manipulate with values based on current row. And become familiar with ANALYTIC functions that give you ability to look on other rows and their products in a result set. Remember that you can create as many migration tables as you really need. Consider row by row operations and cursors only as a last option.
10. If you really cannot avoid loops and row by row operations (see paragraph above) then use some kind of tool that provides you with information how much work is done already. This tool for example can be DBMS_APPLICATION_INFO.SET_SESSION_LONGOPS. For more information read Oracle9i Supplied PL/SQL Packages and Types Reference or PL/SQL Packages and Types Reference 10g Release 1.
11. Calculate constants only once. Do not recalculate some lookup parameters (for example current user or userís organization) for every row. Even if you have some classifier that has several values it is better to initialize some variables or constants in the beginning of package or procedure than to query from a lookup classifier table for each row. Your code wouldnít be dynamic and probably not so elegant, but it would be much faster and that is main goal at least for migration code.
12. Consider using sysdate everywhere. Do you really care whether row was transformed in 10:31:46 when it really was transformed or all rows were transformed in 10:31:00 when transformation started? It is especially important if you use a construction SELECT sysdate INTO variable FROM dual. Why not simply initialize a variable in beginning of each transformation rather than doing this for each row? If you are concerned whether it costs anything just run dbms_profiler and look at results.
13. In your requirements gathering process you should identify some common transformations and/or validations for migration data. How to realize them? There are two possibilities: as a common procedure/function or everybody uses his own solution. Common solution would give consistent results every time but probably wouldnít be the fastest one, in opposition to everyoneís own solution. So you have to understand what benefits and disadvantages would give each solution (for example how big the risk is that somebody would code transformation in a way that isnít compatible with others) and make a decision.
14. Be suspicious about user-defined functions that are called for each row and do some queries on their own. Better incorporate them into upper statement.
15. Full scans arenít bad. Especially in migration process. Migration involves many transformations and validations that affect all or most of the data in a table. If you have to process all data in a table then do it! Do it with full scan! Try hash joins. Nested loops and access path via indexes can give you wonderful results in an OLTP application where you have to select or update several records, but it can be very slow process if you have to select or update all records. And one more benefit from full scans and hash joins - you can monitor them in v$session_longops and some tools like Enterprise Manager even can visualise them. I prefer to know how much work has been done and how much remained.
16. Before loading data into target schema turn off all auditing and drop all indexes if possible. Consider creating audit data later especially if this is some kind of home grown auditing via triggers. Loading audit data in one single move will be much faster than firing a trigger on each row. The other thing is indexes and constraints on target tables. Creating indexes later would avoid possible fragmentation issues as well as with NOLOGGING will save your time. Enabling constraints NOVALIDATE also will save your time.
17. Measure overall migration performance! We have tried to measure and compare many overall statistics and events from v$session_event and v$mystat and found that only reliable measurement is how many rows package migrated in the given time. Our migration rate was from 20 000 till 100 000 rows in a minute for various data groups. Unfortunately it was obvious that data groups with lowest rate wasnít the most complicated ones with most quantity or most sophisticated transformations. The biggest reason was the programming style of each developer. Developers having the least ratio didnít follow at least one or several suggestions given here.
Some more advices for migration performance measurement:
- Do that on the same hardware with the same environment;
- You can measure either target or source rows depending on your specificity. We measured target row count;
- Analyze the best and worst cases (assuming the complexity is comparable) and try to use practices from best case and avoid those used in worst case.
18. Test migration process at least once on all your real data. This would give you following benefits:
- Provide with a real case how long it takes to migrate all the data;
- Ensure that you have dealt with all possible data combinations at least in that given moment;
- Give you real information about physical parameters such as overall disk space both for migration data and target data, necessary temp and undo space, PCTFREE parameter (more in paragraph 7);
- Provide possibility to test target data whether anything isnít forgotten;
- Give you at least a little guarantee that main migration would run without emergency cases.
19. In actual migration be prepared for emergency. You should monitor migration process even if you have tested it several times, especially if data has been changed since then. New unexpected classifier values, changed execution plans Ė at least these are things that could affect you without notice.
20. And the last but not least: if you have any doubts about statements written here as well as anything else youíve heard is a correct approach, try it yourself! Test possible scenarios, measure results and draw a conclusion.
Data migration method described in this article is suitable for complicated transformations and validations in migration process. It can be used for small and middle sized data migration projects because it requires much additional storage along with source data and target data storage. And there are also involved two steps of big data movement from source to migration and from migration to target environments. For big databases without complicated transformations you probably may choose some kind of direct data moving from source to target environment.
Migration software design and coding issues highlighted above should give you some insight what problems youíll have, some tips how to avoid them and provide with overview of the successful experience.
About the author
Gints Plivna is system analyst in Rix Technologies Ltd. (www.rixtech.lv). He has experience in working with Oracle since 1997 and his interests mostly have been connected with analyzing system requirements, design, development and SQL tuning.
e-mail - firstname.lastname@example.org
website - http://www.gplivna.eu/
is licensed under the Creative Commons Attribution-ShareAlike 2.5 License. To
view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/
or send a letter to