Using Analytic Functions in Reports
What is
so cool in analytic functions?.
Case
one: persons and their citizenships
Case
two: peoples without valid identity cards
Case
three: getting the current value and first value in a hierarchy
Case
four: do all that in few sections
Appendix
1 Database model diagram used in this document
Introduction
Analytics
has been around since 8i but unfortunately it is terra incognita for many Oracle users. Some authors already have
pointed out valuable aspects of analytics including following books:
[1].
Effective Oracle by Design by Thomas Kyte ISBN:
[2].
Mastering Oracle SQL, 2nd Edition by Sanjay Mishra, Alan Beaulieu
ISBN:
0-596-00632-2.
As well as
articles and Oracle documentation:
[3].
Analytic functions by Example, by Shouvik Basu (http://www.orafaq.com/node/55);
[4]. Oracle®
Database Data Warehousing Guide, 10g Release 2 (10.2) (http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14223/analysis.htm#i1007779).
And of
course many more links can be found in Google.
I’ll try to
share my experience of rewriting reports using analytic functions. Although
much simplified and slightly changed, all of them were part of a recent project.
Sometimes rewritings led to amazing results.
Before
starting to read following chapters it would be worth to get familiar with data
model used throughout examples (see Figure
1 Database model diagram used in this
document). And one more thing – each case has create table
statements, data generation script as well as all select statements here.
I strongly encourage you to download them and try out yourself in a test
environment.
Here is
short explanation for each case:
1) Reports
usually tend to make use of all or at least big part of a table data. Do that
via full scan if you need all the data. After you have read all the data you have the basis for the whole report. “Case one: persons and
their citizenships” illustrate this situation.
2) Think
twice about your problem if you have to access a large part of a big table via index scan. Probably full scan with
hash join (or even merge join) will speed things up even if in a first glance
it seems impossible. Such a situation is shown in “Case two: peoples without
valid identity cards”.
3) If you
cannot format your data according to your needs in a single step but they
contain all the necessary information
for that – do that in a few steps. That’s the reason for “Case three: getting
the current value and first value in a hierarchy”.
4) Think
more than twice if you need to full scan a big table more than once. There are
SQL constructions now providing ability to make preaggregates and do necessary calculation with them. “Case four:
do all that in few sections” is an example of that.
What
is so cool in analytic functions?
Analytic
functions give you possibility to look at previous and following rows in a row
set without expensive and often impossible joins because there are times you don’t know how many joins you’ll need.
With analytics you can simply look for a value (or compute it using) either foregone or calculated rows forward or backwards. That means Oracle can scan
source data once and then sort it (if
necessary more than once) in memory or temp. My experience shows that sorting
even several times is much faster
than scanning source data more than once.
In next four cases I’ll try to show four different real world reports that
exploited analytic functions a lot.
Case
one: persons and their citizenships
Report
basically counted people and their citizenships. Mostly people have only one citizenship,
but those who are agile enough get more than one. Respectively report stated
following algorithm:
1) If
person has citizenship X (most of the people has it) then is counted it and
only it regardless of all other his citizenships;
2) If
person hasn’t citizenship X then are counted all his citizenships.
Table
PERSONS contains each person along with its name and surname. Table
CITIZENSHIPS is joined to PERSONS and each person’s citizenship has one row in
CITIZENSHIPS. So how to get the report using algorithm defined above?
Traditional
approach would be something like count all distinct people with citizenship X
(scan table CITIZENSHIPS via full scan, because X is very common citizenship)
and then count all citizenships whose people haven’t X (scan source data second
time most probably again via full scan and perform some index scan to check if
the current person has X or not). Query is shown in Code
listing 1:
SELECT
count(distinct ctz_prs_id) cnt,
ctz_code code
FROM citizenships
WHERE ctz_code = 'X'
GROUP BY ctz_code
UNION ALL
SELECT count(distinct ctz_prs_id) cnt, ctz_code code
FROM citizenships c1
WHERE ctz_code <> 'X'
AND NOT EXISTS (
SELECT 'X'
FROM citizenships c2
WHERE c1.ctz_prs_id = c2.ctz_prs_id
AND c2.ctz_code = 'X'
)
GROUP BY ctz_code
ORDER BY cnt desc;
Code listing 1 Traditional query
for Case one
But looking
from our perspective after first full scan of all citizenships Oracle already
knows all information it needs to determine
whether the current citizenship record has to be counted or not. The only
problem is that we have to look unknown number of rows back or forth in the same query result. But if we’d at
first try to order our query result then we’ll know exactly for what row we
should look. So the first step is clear – let’s order so that row with
citizenship X for each person will be the first one. In the second step we
should follow the algorithm defined in the beginning of this chapter using some
temporary flag column to mark whether we’ll count this row or not. And the last
step is to perform final counting.
First step is
implemented using partition and order by clause PARTITION BY ctz_prs_id ORDER BY
decode (ctz_code, 'X', 0, 1) of analytical function first_value (ctz_code). Using this analytical function for each row we
can move on to our second step – perform calculation whether we are interested
in current row or not i.e. if the current row belongs to another person than
previous row (which is counted using some other analytical function) then it is
applicable; if the current row belongs to the same person than previous row
then we should check whether the first citizenship for this person is X or not
and set flag accordingly. The second step is as follows:
CASE
WHEN ctz_prs_id <> prev_prs_id THEN 1
WHEN first_code = 'X' THEN 0
ELSE 1
END flag
Third step
is to filter out only rows with flag = 1 and perform simple group by citizenship. And putting together all pieces
we get final query (see Code listing 2).
SELECT count(*) cnt, ctz_code
FROM (
SELECT ctz_code,
CASE WHEN ctz_prs_id <>
prev_prs_id THEN 1
WHEN first_code = 'X' THEN 0
ELSE 1
END flag
FROM (
SELECT
ctz_prs_id,
ctz_code,
first_value (ctz_code) OVER (
PARTITION BY ctz_prs_id ORDER BY
decode (ctz_code, 'X', 0, 1))
first_code,
lag(ctz_prs_id, 1, -1) OVER (
ORDER BY ctz_prs_id,
decode (ctz_code, 'X', 0, 1))
prev_prs_id
FROM citizenships
)
)
WHERE flag = 1
GROUP BY ctz_code
ORDER BY cnt desc;
Code listing 2 Analytic query for
Case one
As a result
instead of two times full scanning and additional some index scan we have
scanned source once and according to autotrace performed the same count of
sorts. Using generated data (see here)
Table 1 shows compared results of these two queries:
|
Query/Metric |
Scans |
Elapsed
time (seconds) |
Consistent
gets |
Sorts |
|
1st
traditional |
2 full
scans, 1 index scan |
0.40 |
50977 |
3 |
|
2nd
with analytics |
1 full
scan |
0.18 |
61 |
3 |
Table 1 Compared runtime
statistics for Case one
Case
two: peoples without valid identity cards
This query
illustrates opposing situation than previous one. Current report had to count
peoples without (as opposing to
previous with) something,
particularly valid identity cards. Unfortunately it is even worse – if person
is citizen of citizenship X then he needs valid Citizen ID card, but if he is non-citizen of citizenship X then he
needs valid Non-citizen ID card.
Valid document means that today’s date is between its start date and end date.
The traditional
approach would be scanning all citizenships and see whether this person hasn’t
valid ID card of appropriate type (see Code listing 3).
SELECT COUNT(distinct ctz_prs_id), ctz_code, ctz_type
FROM citizenships
WHERE ctz_code = 'X'
AND NOT EXISTS (
SELECT 'X'
FROM documents
WHERE doc_valid_from < sysdate
AND doc_valid_to > sysdate
AND doc_prs_id = ctz_prs_id
AND (ctz_type = 'C' AND
doc_type = 'Citizen ID card'
OR
ctz_type = 'N' AND
doc_type = 'Non-citizen ID
card')
)
GROUP BY ctz_code, ctz_type;
Code listing 3 Traditional query
for Case two
As a result
seeking possible valid document for almost each person we had effectively
scanned almost all table documents. Unfortunately it wasn’t done via full scan
but using index-first-table-after approach. What if we’d try to join them
somehow? The idea can be expressed in following steps:
1) Join
both tables via outer join (because some people have citizenships but haven’t
valid ID cards).
2) Use
again already known technique of a “flag” column – this time it will show us
whether person has valid ID card or not.
3) After
that simple analytic partitioning by person and ordering by flag is performed row_number() OVER (PARTITION BY
ctz_prs_id ORDER BY flag desc). It means first row for each person will show us whether he has valid
ID card or not.
4) Final
step is to count just these persons that are flagged they haven’t valid ID
cards (see Code listing 4).
SELECT count(distinct ctz_prs_id), ctz_code, ctz_type
FROM (
SELECT ctz_prs_id, ctz_code, ctz_type,
flag,
row_number() OVER (PARTITION BY
ctz_prs_id
ORDER BY flag
desc) rn
FROM (
SELECT /*+ FULL(citizenships)
FULL(documents) */
ctz_prs_id, ctz_code, ctz_type,
CASE WHEN ctz_type = 'C' AND
doc_type = 'Citizen ID
card' THEN 1
WHEN ctz_type = 'N' AND
doc_type = 'Non-citizen
ID card' THEN 1
ELSE 0
END flag
FROM citizenships, documents
WHERE ctz_code = 'X'
AND doc_valid_from (+) < sysdate
AND doc_valid_to (+) > sysdate
AND doc_prs_id (+) = ctz_prs_id
)
)
WHERE rn = 1 AND flag = 0
GROUP BY ctz_code, ctz_type;
Code listing 4 Analytic query for
Case two
Using
generated data (see here)
Table 2 shows compared results of these two queries:
|
Query/Metric |
Scans |
Elapsed
time (seconds) |
Consistent
gets |
Sorts |
|
1st
traditional |
1 full
scan, 1 index scan (full table) |
0.10 |
17965 |
1 |
|
2nd
with analytics |
2 full
scans |
0.06 |
309 |
2 |
Table 2 Compared runtime
statistics for Case two
Case
three: getting the current value and first value in a hierarchy
In this
case I’ll try to explain the method proposed by Thomas Kyte in http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:13946369553642
as well as in [1]. In fact methods are two:
(1) One
general method – how to solve complicated problem using analytics. With complicated I mean problems that can’t
be solved in one/two obvious steps;
(2) One
technical method for a special case – how to make groups in a row set.
The
original problem for a report was as follows:
Addresses
have tendency to change over time. Most usual reason is street name change, and
some administrative reforms. To record this information address history was
designed as recursive relationship in ADDRESS table pointing to next address in
hierarchy. So the problem was to find all current
addresses for a given set of address and output both
1) Address
identifying information that conforms given criteria.
2) Most
recent address for each found address.
Both
methods will be explained using generated example data that can be found here.
So let’s
start iterative thinking with very simple first step – create a query that
shows information for the given address criteria (see Code
listing 5):
SELECT
adr_id, adr_adr_id, adr_country_code, adr_street
FROM addresses
WHERE adr_country_code = 'WIN'
AND adr_region_code = 'DOW';
Code listing 5 Step one query for
Case three
Output from
query looks as follows:
ADR_ID ADR_ADR_ID ADR ADR_STREET
----------
---------- --- ----------
8451 WIN WEEKNIGHT_
Last row haven’t child (adr_adr_id is empty),
but first two rows has at least one child. Having recursive relationship in
ADDRESS table obviously leads us to hierarchical query clause (CONNECT BY). Let’s
look at query that shows all address history over time for above given criteria
(Code listing 6).
SELECT
rownum rn, level lvl,
adr_id, adr_adr_id, adr_country_code,
adr_street
FROM addresses
START WITH adr_country_code = 'WIN'
AND adr_region_code = 'DOW'
CONNECT BY PRIOR adr_adr_id = adr_id;
Code listing 6 Step two query for
Case three
Output from
query looks as follows:
RN
LVL ADR_ID ADR_ADR_ID ADR
ADR_STREET
----
---------- ---------- ---------- --- ----------
1
1 8451 WIN WEEKNIGHT_
Looking at data
output we can see that actually we have groups of records that always start
with lvl = 1, but end with lvl >= 1. Rows with lvl = 1 always indicate rows
that conforms to starting address criteria. Row with maximal lvl value just
before next group of records indicates most current address. So we can draw at least
following conclusions:
1) There
are obvious groups of records.
2) Count of
groups is the same as row count conforming starting address criteria.
So what do we
really need for final output? We need somehow to combine first and last records
in each group and our task would be achieved! We can remember that we have used
analytical function first_value to found a first row for some partitioning. The only problem is to
somehow mark each group with some unique tag and we already know how to deal
with such a situation. As you already may guess rownum for above row set wasn’t
added just for fun, it is very useful unique identifier. So let’s mark start of
each group with a unique marker adding column in the end (see Code listing 7).
SELECT
rownum rn, level lvl,
adr_id, adr_adr_id,
adr_country_code, adr_street,
CASE WHEN level = 1 THEN rownum
ELSE NULL
END new_rn
FROM addresses
START WITH adr_country_code = 'WIN'
AND adr_region_code = 'DOW'
CONNECT BY PRIOR adr_adr_id = adr_id;
Code listing 7 Step three query
for Case three
Output from
query looks as follows:
RN
LVL ADR_ID ADR_ADR_ID ADR
ADR_STREET NEW_RN
----
---------- ---------- ---------- --- ---------- ----------
1
1 8451 WIN WEEKNIGHT_ 1
2
1 208452 408452 WIN WEEKEND_WI 2
3
2 408452 WIN EKEND_WIND
4 1
608453 808453 WIN
MAINTENANC 4
5
2 808453 1008453 WIN INTENANCE_
6
3 1008453 WIN ENANCE_WIN
Next task
is to mark all rows for each group with this unique identifier. It is done using
an analytical function that almost everyone has used as a set function – max. This time the only difference is
that it is used in analytical way. So we need max(new_rn) OVER (ORDER BY rn). We cannot use partitioning clause for
max simply because there isn’t any
column suitable for that. There is one thing that is hidden in this function
use – by default it is working from unbounded
preceding and current row. It means using this function for every row for
column new_rn it will display maximal value of
this column since the beginning up to current row. So the first level of
analytics has been applied and query now looks as follows (see Code listing 8).
SELECT rn, lvl, adr_id, adr_adr_id,
adr_country_code, adr_street,
max(new_rn) OVER (ORDER BY rn) grp
FROM (
SELECT rownum rn, level lvl,
adr_id, adr_adr_id, adr_country_code,
adr_street,
CASE WHEN level = 1 THEN rownum
ELSE NULL
END new_rn
FROM addresses
START WITH adr_country_code = 'WIN'
AND adr_region_code = 'DOW'
CONNECT BY PRIOR adr_adr_id = adr_id
);
Code listing 8 Step four query for
Case three
Output as
you already can imagine looks as follows:
RN
LVL ADR_ID ADR_ADR_ID ADR
ADR_STREET GRP
----
---------- ---------- ---------- --- ---------- ----------
1
1 8451 WIN WEEKNIGHT_ 1
2
1 208452 408452 WIN WEEKEND_WI 2
3
2 408452 WIN EKEND_WIND 2
4
1 608453 808453 WIN MAINTENANC 4
5
2 808453 1008453 WIN INTENANCE_ 4
6
3 1008453 WIN ENANCE_WIN 4
Each group
has been successfully marked and you already know what to do with such a result
set – simply use analytic function first_value and partition by just now computed column grp. But this time to make things a bit
different we will use exactly opposed function last_value. And this will give us another
possibility use analytic constructions because the working window from unbounded preceding and current row is true
also for functions first_value and last_value. Because we will show rows with lvl
= 1 along with it’s corresponding group last row that is after it in the
ordering we have to use non-default working window from current row and unbounded following. The last step performed is to show
only one row for each group – of course we have to leave only rows with lvl =
1. So the final query with second level of analytics and last filter is (see Code listing 9).
SELECT
* FROM (
SELECT last_value(adr_id) OVER (
PARTITION BY grp ORDER BY rn
ROWS BETWEEN CURRENT ROW AND
UNBOUNDED FOLLOWING) curr_addr_id,
last_value(adr_street) OVER (
PARTITION BY grp ORDER BY rn
ROWS BETWEEN CURRENT ROW AND
UNBOUNDED FOLLOWING)
curr_street,
lvl, adr_id, adr_country_code,
adr_street
FROM (
SELECT rn, lvl, adr_id, adr_adr_id,
adr_country_code, adr_street,
max(new_rn) OVER (ORDER BY rn)
grp
FROM (
SELECT rownum rn, level lvl,
adr_id, adr_adr_id,
adr_country_code, adr_street,
CASE WHEN level = 1 THEN
rownum
ELSE NULL
END new_rn
FROM addresses
START WITH adr_country_code = 'WIN'
AND adr_region_code = 'DOW'
CONNECT BY PRIOR adr_adr_id =
adr_id
)
)
)
WHERE lvl = 1;
Code listing 9 Step five query for
Case three
And the
result is the same as we have asked in the very beginning:
CURR_ADDR_ID
CURR_STREE LVL ADR_ID ADR ADR_STREET
------------
---------- ---------- ---------- --- ----------
8451 WEEKNIGHT_ 1 8451 WIN WEEKNIGHT_
408452 EKEND_WIND 1
208452 WIN WEEKEND_WI
1008453 ENANCE_WIN 1
608453 WIN MAINTENANC
One more
remark for this particular Case three – I’m by no means saying that the
solution provided here was the ultimate best one. The main idea was to show
method for solving queries step by step. I can imagine at least 3 more
solutions for this problem:
1) Method
provided by Thomas Kyte in (http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:489772591421)
that definitely could be written with much less code, but asks ~5 sorts for
each returned row compared to constant sort count for query with analytic
functions (see Code listing 10).
SELECT
to_number(substr((
SELECT max(to_char(level, 'fm000009')
|| ' ' || adr_id)
FROM addresses adr2
START WITH adr2.adr_id = adr1.adr_id
CONNECT BY PRIOR adr_adr_id IS NOT NULL
AND PRIOR adr_adr_id = adr_id
), 8)) curr_addr_id,
adr_id
FROM addresses adr1
WHERE adr_country_code = 'WIN'
AND adr_region_code = 'DOW';
Code listing 10 Another approach
for Case three
2) One can
use built-in function SYS_CONNECT_BY_PATH and filter out result later even in
9i.
3) Using
10g one can use CONNECT_BY_ROOT operator that returns column value using data
from the root row.
Case
four: do all that in few sections
This case
wouldn’t be about analytics although impact on a report execution time could be
in the same magnitude or sometimes even more. The basic idea is – for
complicated reports divide target numbers into preaggregated values and then
manipulate with them. I assume if you have created several complicated reports
in your life there were times you have used or at least thought of using
temporary tables holding temporary preaggregated values. WITH clause introduced
as of 9i provides this ability right in the very SELECT statement itself.
The real
example is taken from several of our reports that showed peoples citizenship
count over their address, like shown in Table 3:
|
Territory |
CitizenshipX
Citizens |
CitizenshipX
Non-citizens |
Other
citizenships Citizens |
Other
citizenships Non-citizens |
|
All territories |
|
|
|
|
|
Country A |
|
|
|
|
|
Big city1 |
|
|
|
|
|
Suburb1 |
|
|
|
|
|
Suburb2 |
|
|
|
|
|
Big city2 |
|
|
|
|
|
Region1 |
|
|
|
|
|
District1 |
|
|
|
|
|
District2 |
|
|
|
|
|
Big city1 |
|
|
|
|
|
Region2 |
|
|
|
|
Table 3 People count breakdown
by citizenships and address
The problem
is that I cannot use ROLLUP because 1) sum of elements is before elements
(probably even that could be overcome after with some ordering of rows) 2) the
main reason – big cities are shown several times in report and the overall sum
isn’t sum of all elements.
The
original query tried to solve this situation very straightforward – for each
territory type scan all involved tables schematically it was something like:
1) SELECT
count of citizens and non-citizens from CITIZENSHIP table representing all
territories row.
2) SELECT
count of citizens and non-citizens from CITIZENSHIP table, joined to PERSON
ADDRESSES table joined to ADDRESSES table joined to COUNTRIES table
representing rows for each country.
3) SELECT
count of citizens and non-citizens from CITIZENSHIP table, joined to PERSON
ADDRESSES table joined to ADDRESSES table joined to CITIES table filtering out
only big cities representing rows for each big city.
4) And
similarly for suburbs, regions, and districts.
As a result
some of such reports never ended on production data i.e. after running for more
than 48 hours they were simply killed. A simple question what to do was raised.
Of course we could eliminate some of
these continuous scans of the same data over and over again but I somehow had
feeling that after the first full scan of all involved large base tables we
have collected all information we need to
display the result. The only problem
was how to twist and format the result
to show it desired times in the desired place.
Solution
was simple – create preaggregated counts grouped by citizenship, citizenship
type and address components. After that we could easily sum and join them to
necessary address classifier components.
Following
query is basically the same query taken from “Case one: persons and their
citizenships” and joined with PERSON ADDRESSES and ADDRESSES tables (see Code listing 11).
SELECT count(*) cnt,
ctz_code ctz,
ctz_type,
adr_country_code cnt,
adr_region_code reg,
adr_city_code cty
FROM addresses, person_addresses,
-- a bit modified select from case1,
-- just without grouping
(SELECT * FROM (
SELECT ctz_prs_id, ctz_code,
ctz_type,
CASE WHEN ctz_prs_id <> prev_prs_id
THEN 1
WHEN first_code = 'X' THEN 0
ELSE 1
END flag
FROM (
SELECT
ctz_prs_id,
ctz_code,
ctz_type,
first_value (ctz_code) OVER (
PARTITION BY ctz_prs_id ORDER
BY
decode (ctz_code, 'X', 0, 1)) first_code,
lag(ctz_prs_id, 1, -1) OVER
(ORDER BY
ctz_prs_id, decode (ctz_code,
'X', 0, 1))
prev_prs_id
FROM citizenships
)
)
WHERE flag = 1) ctz_source
WHERE ctz_source.ctz_prs_id = pra_prs_id
AND pra_adr_id = adr_id
GROUP BY ctz_code,
ctz_type,
adr_country_code,
adr_region_code,
adr_city_code;
Code listing 11 Query used n WITH
clause for Case four
So here it
is – a preaggregated source to use in WITH clause and now we can take the same
steps we had in the original query:
1) Sum all
counts of citizens and non-citizens from preaggregate representing all
territories row.
2) Sum
counts of citizens and non-citizens from preaggregate joined to COUNTRIES table
representing rows for each country.
3) Sum
counts of citizens and non-citizens from preaggregate joined to CITIES table
filtering out only big cities representing rows for each big city.
4) And
similarly for suburbs, regions, and districts.
As a result
we eliminated roughly 5/6 of all work and additionally using technique described
in “Case one: persons and their citizenships” all such reports run on
production data in less than 20 minutes down from at least 48 hours and it was
achieved mostly because following factors:
1) Scanned big source tables (CITIZENSHIPS, ADDRESSES, PERSON_ADDRESSES in example, actually there were more) only once via full scan.
2) Scan created small temporary table in memory. In our real production system it contained several thousand rows.
3) Instead of joining big source table results several times to classifier tables we join small temporary table. The same number of times though.
Complete
query and sample data generation can be seen here.
Summary
I had few
goals in my mind writing this article:
1) Using
analytic functions makes reporting easier in many cases. It is like cycle
operator you can do on your result set. If you can successfully exploit it, you
can free up much time and resources for other tasks.
2) Solving
problems step by step is an ancient method for solving various tasks that could
be well used also finding solutions for complex reports.
3) Divide
et impera was (and is) successfully used method in politics, why not use it in
SQL? WITH clause and preaggregate values working like this ancient method can
sometimes be even more effective than analytic functions.
The minimum
goal of this article would be achieved if you’d be at least a little more
interested in analytic functions and WITH clause. The maximum goal would be
achieved if you’d be encouraged enough to try out them yourself.
About
author
Gints Plivna gints.plivna@gmail.com is system analyst in Rix Technologies Ltd. (www.rixtech.lv). He has experience in working with Oracle since 1997 and his interests mostly have been connected with analyzing system requirements, design, development and SQL tuning.
Contacts:
e-mail - gints.plivna@gmail.com
website - http://www.gplivna.eu/
Licenses
This work is licensed under the Creative Commons
Attribution-ShareAlike 2.5 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/
or send a letter to
Appendix 1
Database model diagram used in this document

Figure 1 Database model diagram used in this document