gplivna.eu

Using Analytic Functions in Reports

Introduction

What is so cool about analytic functions?

Case one: persons and their citizenships

Case two: people without valid identity cards

Case three: getting the current value and first value in a hierarchy

Case four: do all that in a few sections

Summary

About author

Licenses

Appendix 1 Database model diagram used in this document

Introduction

Analytic functions have been around since 8i, but unfortunately they are still terra incognita for many Oracle users. Several authors have already pointed out valuable aspects of analytics, including the following books:

[1] Effective Oracle by Design, by Thomas Kyte, ISBN 0072230657;

[2] Mastering Oracle SQL, 2nd Edition, by Sanjay Mishra and Alan Beaulieu, ISBN 0-596-00632-2.

As well as articles and Oracle documentation:

[3] Analytic Functions by Example, by Shouvik Basu (http://www.orafaq.com/node/55);

[4] Oracle® Database Data Warehousing Guide, 10g Release 2 (10.2) (http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14223/analysis.htm#i1007779).

And of course many more links can be found in Google.

I’ll try to share my experience of rewriting reports using analytic functions. Although much simplified and slightly changed, all of the examples were part of a recent project. Sometimes the rewrites led to amazing results.

Before starting on the following chapters it is worth getting familiar with the data model used throughout the examples (see Figure 1 Database model diagram used in this document). And one more thing: each case has create table statements, a data generation script, and all the select statements here. I strongly encourage you to download them and try them out yourself in a test environment.
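For orientation, here is a sketch of the core tables as they can be inferred from the queries in this article. Column names are taken from the queries; data types and constraints are my assumptions, and the downloadable create table scripts are authoritative.

```sql
-- Inferred sketch of the data model (types/constraints assumed).
CREATE TABLE persons (
  prs_id      NUMBER PRIMARY KEY,
  prs_name    VARCHAR2(100),
  prs_surname VARCHAR2(100)
);

CREATE TABLE citizenships (
  ctz_prs_id NUMBER REFERENCES persons,  -- person holding the citizenship
  ctz_code   VARCHAR2(3),                -- citizenship code, e.g. 'X'
  ctz_type   VARCHAR2(1)                 -- 'C' citizen / 'N' non-citizen
);

CREATE TABLE documents (
  doc_prs_id     NUMBER REFERENCES persons,
  doc_type       VARCHAR2(30),           -- e.g. 'Citizen ID card'
  doc_valid_from DATE,
  doc_valid_to   DATE
);

CREATE TABLE addresses (
  adr_id           NUMBER PRIMARY KEY,
  adr_adr_id       NUMBER REFERENCES addresses,  -- next address in history
  adr_country_code VARCHAR2(3),
  adr_region_code  VARCHAR2(3),
  adr_city_code    VARCHAR2(3),
  adr_street       VARCHAR2(100)
);

CREATE TABLE person_addresses (
  pra_prs_id NUMBER REFERENCES persons,
  pra_adr_id NUMBER REFERENCES addresses
);
```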

Here is a short explanation of each case:

1) Reports usually tend to use all, or at least a big part, of a table's data. Do that via a full scan if you need all the data. After you have read all the data you have the basis for the whole report. “Case one: persons and their citizenships” illustrates this situation.

2) Think twice about your problem if you have to access a large part of a big table via an index scan. Probably a full scan with a hash join (or even a merge join) will speed things up, even if at first glance it seems impossible. Such a situation is shown in “Case two: people without valid identity cards”.

3) If you cannot format your data according to your needs in a single step, but the data contain all the necessary information, do it in a few steps. That's the reason for “Case three: getting the current value and first value in a hierarchy”.

4) Think more than twice if you need to full scan a big table more than once. There are now SQL constructs that let you build preaggregates and do the necessary calculations with them. “Case four: do all that in a few sections” is an example of that.

What is so cool about analytic functions?

Analytic functions give you the possibility to look at previous and following rows in a row set without expensive, and often impossible, joins (there are times you don't know how many joins you'll need). With analytics you can simply look up a value, or compute one, from rows before or after the current row. That means Oracle can scan the source data once and then sort it (if necessary, more than once) in memory or in temp space. My experience shows that sorting even several times is much faster than scanning the source data more than once. In the next four cases I'll show four different real-world reports that exploited analytic functions heavily.
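As a tiny illustration of the idea, here is a hedged sketch (against the CITIZENSHIPS table used later in this article) that reads the previous row's value without a self-join:

```sql
-- For every citizenship row, show the person id of the previous row
-- in the ordering; -1 is returned for the very first row, where no
-- previous row exists. No self-join is needed.
SELECT ctz_prs_id,
       ctz_code,
       lag(ctz_prs_id, 1, -1) OVER (ORDER BY ctz_prs_id) prev_prs_id
FROM citizenships;
```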

Case one: persons and their citizenships

The report basically counted people and their citizenships. Most people have only one citizenship, but those who are agile enough get more than one. Accordingly, the report implemented the following algorithm:

1) If a person has citizenship X (most people have it), then it and only it is counted, regardless of all his other citizenships;

2) If a person doesn't have citizenship X, then all his citizenships are counted.

Table PERSONS contains each person along with his name and surname. Table CITIZENSHIPS is joined to PERSONS, and each citizenship a person holds has one row in CITIZENSHIPS. So how do we get the report using the algorithm defined above?

The traditional approach would be something like this: count all distinct people with citizenship X (scan table CITIZENSHIPS via full scan, because X is a very common citizenship), and then count all citizenships of the people who don't have X (scan the source data a second time, most probably again via full scan, and perform some index scans to check whether the current person has X or not). The query is shown in Code listing 1:

SELECT
  count(distinct ctz_prs_id) cnt,
  ctz_code code
FROM citizenships
WHERE ctz_code = 'X'
GROUP BY ctz_code
UNION ALL
SELECT count(distinct ctz_prs_id) cnt, ctz_code code
FROM citizenships c1
WHERE ctz_code <> 'X'
  AND NOT EXISTS (
    SELECT 'X'
    FROM citizenships c2
    WHERE c1.ctz_prs_id = c2.ctz_prs_id
      AND c2.ctz_code = 'X'
  )
GROUP BY ctz_code
ORDER BY cnt desc;

Code listing 1 Traditional query for Case one

But looking from our perspective, after the first full scan of all citizenships Oracle already knows all the information it needs to determine whether the current citizenship record has to be counted or not. The only problem is that we would have to look an unknown number of rows back or forth in the same query result. But if we first order the query result, then we'll know exactly which row to look for. So the first step is clear: order the rows so that the row with citizenship X, if the person has one, comes first for each person. In the second step we follow the algorithm defined at the beginning of this chapter, using a temporary flag column to mark whether we'll count the row or not. And the last step is to perform the final counting.

The first step is implemented using the partition and order by clause PARTITION BY ctz_prs_id ORDER BY decode(ctz_code, 'X', 0, 1) of the analytic function first_value(ctz_code). With this analytic value available for each row, we can move on to the second step: decide whether we are interested in the current row or not. If the current row belongs to a different person than the previous row (which is detected using another analytic function, lag), then it is applicable; if the current row belongs to the same person as the previous row, then we check whether the first citizenship for this person is X or not, and set the flag accordingly. The second step is as follows:

CASE WHEN ctz_prs_id <> prev_prs_id THEN 1
     WHEN first_code = 'X' THEN 0
     ELSE 1
END flag

The third step is to filter out only the rows with flag = 1 and perform a simple group by citizenship. Putting all the pieces together, we get the final query (see Code listing 2).

SELECT count(*) cnt, ctz_code
FROM (
  SELECT ctz_code,
    CASE WHEN ctz_prs_id <> prev_prs_id THEN 1
         WHEN first_code = 'X' THEN 0
         ELSE 1
    END flag
  FROM (
    SELECT
      ctz_prs_id,
      ctz_code,
      first_value (ctz_code) OVER (
        PARTITION BY ctz_prs_id ORDER BY
        decode (ctz_code, 'X', 0, 1)) first_code,
      lag(ctz_prs_id, 1, -1) OVER (
        ORDER BY ctz_prs_id,
        decode (ctz_code, 'X', 0, 1)) prev_prs_id
    FROM citizenships
  )
)
WHERE flag = 1
GROUP BY ctz_code
ORDER BY cnt desc;

Code listing 2 Analytic query for Case one

As a result, instead of two full scans and an additional index scan, we have scanned the source once and, according to autotrace, performed the same number of sorts. Using generated data (see here), Table 1 compares the results of the two queries:

Query/Metric         Scans                       Elapsed time (seconds)  Consistent gets  Sorts
1st traditional      2 full scans, 1 index scan  0.40                    50977            3
2nd with analytics   1 full scan                 0.18                    61               3

Table 1 Compared runtime statistics for Case one

Case two: people without valid identity cards

This query illustrates the opposite situation from the previous one. This report had to count people without (as opposed to with) something, in particular valid identity cards. Unfortunately it is even worse: a person holding citizenship X as a citizen needs a valid Citizen ID card, while a person holding citizenship X as a non-citizen needs a valid Non-citizen ID card. A valid document means that today's date is between its start date and end date.

The traditional approach would be to scan all citizenships and check whether each person lacks a valid ID card of the appropriate type (see Code listing 3).

SELECT COUNT(distinct ctz_prs_id), ctz_code, ctz_type
FROM citizenships
WHERE ctz_code = 'X'
  AND NOT EXISTS (
    SELECT 'X'
    FROM documents
    WHERE doc_valid_from < sysdate
      AND doc_valid_to > sysdate
      AND doc_prs_id = ctz_prs_id
      AND (ctz_type = 'C' AND
           doc_type = 'Citizen ID card' OR
           ctz_type = 'N' AND
           doc_type = 'Non-citizen ID card')
  )
GROUP BY ctz_code, ctz_type;

Code listing 3 Traditional query for Case two

As a result, seeking a possible valid document for almost every person, we effectively scanned almost the whole DOCUMENTS table. Unfortunately it wasn't done via a full scan but using an index-first-table-after approach. What if we tried to join the tables somehow? The idea can be expressed in the following steps:

1) Join both tables via an outer join (because some people have citizenships but no valid ID cards).

2) Again use the already familiar technique of a “flag” column; this time it shows whether the person has a valid ID card or not.

3) After that, simple analytic partitioning by person and ordering by flag is performed: row_number() OVER (PARTITION BY ctz_prs_id ORDER BY flag desc). This means the first row for each person shows whether he has a valid ID card or not.

4) The final step is to count just those persons whose flag shows they don't have a valid ID card (see Code listing 4).

SELECT count(distinct ctz_prs_id), ctz_code, ctz_type
FROM (
  SELECT ctz_prs_id, ctz_code, ctz_type, flag,
    row_number() OVER (PARTITION BY ctz_prs_id
                       ORDER BY flag desc) rn
  FROM (
    SELECT /*+ FULL(citizenships) FULL(documents) */
      ctz_prs_id, ctz_code, ctz_type,
      CASE WHEN ctz_type = 'C' AND
                doc_type = 'Citizen ID card' THEN 1
           WHEN ctz_type = 'N' AND
                doc_type = 'Non-citizen ID card' THEN 1
           ELSE 0
      END flag
    FROM citizenships, documents
    WHERE ctz_code = 'X'
      AND doc_valid_from (+) < sysdate
      AND doc_valid_to (+) > sysdate
      AND doc_prs_id (+) = ctz_prs_id
  )
)
WHERE rn = 1 AND flag = 0
GROUP BY ctz_code, ctz_type;

Code listing 4 Analytic query for Case two

Using generated data (see here), Table 2 compares the results of the two queries:

Query/Metric         Scans                                    Elapsed time (seconds)  Consistent gets  Sorts
1st traditional      1 full scan, 1 index scan (full table)   0.10                    17965            1
2nd with analytics   2 full scans                             0.06                    309              2

Table 2 Compared runtime statistics for Case two

Case three: getting the current value and first value in a hierarchy

In this case I'll try to explain the method proposed by Thomas Kyte in http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:13946369553642 as well as in [1]. In fact there are two methods:

(1) One general method: how to solve a complicated problem using analytics. By complicated I mean problems that can't be solved in one or two obvious steps;

(2) One technical method for a special case: how to make groups in a row set.

The original problem for a report was as follows:

Addresses have a tendency to change over time. The most usual reasons are street name changes and administrative reforms. To record this information, address history was designed as a recursive relationship in the ADDRESSES table pointing to the next address in the hierarchy. So the problem was to find the current address for each of a given set of addresses and output both:

1) Address-identifying information that conforms to the given criteria.

2) The most recent address for each found address.

Both methods will be explained using the generated example data that can be found here.

So let's start our iterative thinking with a very simple first step: create a query that shows information for the given address criteria (see Code listing 5):

SELECT adr_id, adr_adr_id, adr_country_code, adr_street
FROM addresses
WHERE adr_country_code = 'WIN'
  AND adr_region_code = 'DOW';

Code listing 5 Step one query for Case three

The query output looks as follows:

    ADR_ID ADR_ADR_ID ADR ADR_STREET

---------- ---------- --- ----------

    208452     408452 WIN WEEKEND_WI

    608453     808453 WIN MAINTENANC

      8451            WIN WEEKNIGHT_

The last row has no child (adr_adr_id is empty), but the first two rows have at least one child. The recursive relationship in the ADDRESSES table obviously leads us to the hierarchical query clause (CONNECT BY). Let's look at a query that shows the entire address history over time for the criteria given above (Code listing 6).

SELECT rownum rn, level lvl,
  adr_id, adr_adr_id, adr_country_code, adr_street
FROM addresses
START WITH adr_country_code = 'WIN'
       AND adr_region_code = 'DOW'
CONNECT BY PRIOR adr_adr_id = adr_id;

Code listing 6 Step two query for Case three

The query output looks as follows:

  RN        LVL     ADR_ID ADR_ADR_ID ADR ADR_STREET

---- ---------- ---------- ---------- --- ----------

   1          1       8451            WIN WEEKNIGHT_

   2          1     208452     408452 WIN WEEKEND_WI

   3          2     408452            WIN EKEND_WIND

   4          1     608453     808453 WIN MAINTENANC

   5          2     808453    1008453 WIN INTENANCE_

   6          3    1008453            WIN ENANCE_WIN

Looking at the output we can see that we actually have groups of records that always start with lvl = 1 and end with lvl >= 1. Rows with lvl = 1 always indicate rows that conform to the starting address criteria. The row with the maximal lvl value, just before the next group of records, indicates the most current address. So we can draw at least the following conclusions:

1) There are obvious groups of records.

2) The count of groups is the same as the count of rows conforming to the starting address criteria.

So what do we really need for the final output? We need somehow to combine the first and last records of each group, and our task would be achieved! Remember that we have already used the analytic function first_value to find the first row of some partitioning. The only problem is to mark each group with a unique tag, and we already know how to deal with such a situation. As you may already guess, rownum in the above row set wasn't added just for fun; it is a very useful unique identifier. So let's mark the start of each group with a unique marker by adding a column at the end (see Code listing 7).

SELECT rownum rn, level lvl,
       adr_id, adr_adr_id, adr_country_code, adr_street,
       CASE WHEN level = 1 THEN rownum
            ELSE NULL
       END new_rn
FROM addresses
START WITH adr_country_code = 'WIN'
       AND adr_region_code = 'DOW'
CONNECT BY PRIOR adr_adr_id = adr_id;

Code listing 7 Step three query for Case three

The query output looks as follows:

  RN        LVL     ADR_ID ADR_ADR_ID ADR ADR_STREET     NEW_RN

---- ---------- ---------- ---------- --- ---------- ----------

   1          1       8451            WIN WEEKNIGHT_          1

   2          1     208452     408452 WIN WEEKEND_WI          2

   3          2     408452            WIN EKEND_WIND

   4          1     608453     808453 WIN MAINTENANC          4

   5          2     808453    1008453 WIN INTENANCE_

   6          3    1008453            WIN ENANCE_WIN

The next task is to mark all rows of each group with this unique identifier. It is done using an analytic function that almost everyone has used as an aggregate function: max. The only difference this time is that it is used in an analytic way. So we need max(new_rn) OVER (ORDER BY rn). We cannot use a partitioning clause for max, simply because there isn't any column suitable for it. There is one thing hidden in this use of the function: by default its window runs from unbounded preceding to the current row. That means that for every row the function returns the maximal value of column new_rn from the beginning of the row set up to the current row. So the first level of analytics has been applied, and the query now looks as follows (see Code listing 8).
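The default window can also be spelled out explicitly; a sketch of the equivalence (since rn is unique here, the default RANGE window behaves the same as the corresponding ROWS window):

```sql
-- implicit default window
max(new_rn) OVER (ORDER BY rn)

-- the same window written out explicitly
max(new_rn) OVER (ORDER BY rn
                  RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```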

SELECT rn, lvl, adr_id, adr_adr_id,
       adr_country_code, adr_street,
       max(new_rn) OVER (ORDER BY rn) grp
FROM (
  SELECT rownum rn, level lvl,
    adr_id, adr_adr_id, adr_country_code, adr_street,
    CASE WHEN level = 1 THEN rownum
         ELSE NULL
    END new_rn
  FROM addresses
  START WITH adr_country_code = 'WIN'
         AND adr_region_code = 'DOW'
  CONNECT BY PRIOR adr_adr_id = adr_id
);

Code listing 8 Step four query for Case three

The output, as you can already imagine, looks as follows:

  RN        LVL     ADR_ID ADR_ADR_ID ADR ADR_STREET        GRP

---- ---------- ---------- ---------- --- ---------- ----------

   1          1       8451            WIN WEEKNIGHT_          1

   2          1     208452     408452 WIN WEEKEND_WI          2

   3          2     408452            WIN EKEND_WIND          2

   4          1     608453     808453 WIN MAINTENANC          4

   5          2     808453    1008453 WIN INTENANCE_          4

   6          3    1008453            WIN ENANCE_WIN          4

Each group has been successfully marked, and you already know what to do with such a result set: simply use the analytic function first_value and partition by the just-computed column grp. But this time, to make things a bit different, we will use the exact opposite function, last_value. This gives us another opportunity to use analytic constructions, because the default window from unbounded preceding to the current row also applies to first_value and last_value. Because we want to show each row with lvl = 1 along with the last row of its group, which comes after it in the ordering, we have to use the non-default window from the current row to unbounded following. The last step is to show only one row per group: of course, we keep only the rows with lvl = 1. So the final query, with the second level of analytics and the last filter, is shown in Code listing 9.

SELECT * FROM (
  SELECT last_value(adr_id) OVER (
           PARTITION BY grp ORDER BY rn
           ROWS BETWEEN CURRENT ROW AND
           UNBOUNDED FOLLOWING) curr_addr_id,
         last_value(adr_street) OVER (
           PARTITION BY grp ORDER BY rn
           ROWS BETWEEN CURRENT ROW AND
           UNBOUNDED FOLLOWING) curr_street,
         lvl, adr_id, adr_country_code, adr_street
  FROM (
    SELECT rn, lvl, adr_id, adr_adr_id,
           adr_country_code, adr_street,
          max(new_rn) OVER (ORDER BY rn) grp
    FROM (
      SELECT rownum rn, level lvl,
             adr_id, adr_adr_id, adr_country_code, adr_street,
             CASE WHEN level = 1 THEN rownum
                  ELSE NULL
             END new_rn
      FROM addresses
      START WITH adr_country_code = 'WIN'
             AND adr_region_code = 'DOW'
      CONNECT BY PRIOR adr_adr_id = adr_id
    )
  )
)
WHERE lvl = 1;

Code listing 9 Step five query for Case three

And the result is exactly what we asked for at the very beginning:

CURR_ADDR_ID CURR_STREE        LVL     ADR_ID ADR ADR_STREET

------------ ---------- ---------- ---------- --- ----------

        8451 WEEKNIGHT_          1       8451 WIN WEEKNIGHT_

      408452 EKEND_WIND          1     208452 WIN WEEKEND_WI

     1008453 ENANCE_WIN          1     608453 WIN MAINTENANC

One more remark for this particular Case three: I'm by no means saying that the solution provided here is the ultimate best one. The main idea was to show a method for solving queries step by step. I can imagine at least three more solutions to this problem:

1) The method provided by Thomas Kyte in (http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:489772591421), which definitely could be written with much less code, but which needs ~5 sorts for each returned row, compared to a constant sort count for the query with analytic functions (see Code listing 10).

SELECT to_number(substr((
  SELECT max(to_char(level, 'fm000009') || ' ' || adr_id)
  FROM addresses adr2
  START WITH adr2.adr_id = adr1.adr_id
  CONNECT BY PRIOR adr_adr_id IS NOT NULL
    AND PRIOR adr_adr_id = adr_id
  ), 8)) curr_addr_id,
  adr_id
FROM addresses adr1
WHERE adr_country_code = 'WIN'
  AND adr_region_code = 'DOW';

Code listing 10 Another approach for Case three

2) One can use the built-in function SYS_CONNECT_BY_PATH and filter the result afterwards, even in 9i.

3) In 10g one can use the CONNECT_BY_ROOT operator, which returns a column value using data from the root row.
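For completeness, here is a hedged sketch of the 10g variant (untested against the article's data set; CONNECT_BY_ISLEAF is used to keep only the last address in each chain):

```sql
-- Traverse each history chain from the matching start addresses down
-- to its leaf; the leaf row is the current address, while
-- CONNECT_BY_ROOT returns values from the starting (original) address.
SELECT adr_id                      curr_addr_id,
       adr_street                  curr_street,
       CONNECT_BY_ROOT adr_id      orig_addr_id,
       CONNECT_BY_ROOT adr_street  orig_street
FROM addresses
WHERE CONNECT_BY_ISLEAF = 1
START WITH adr_country_code = 'WIN'
       AND adr_region_code = 'DOW'
CONNECT BY PRIOR adr_adr_id = adr_id;
```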

Case four: do all that in a few sections

This case isn't about analytics, although the impact on report execution time can be of the same magnitude or sometimes even greater. The basic idea is: for complicated reports, divide the target numbers into preaggregated values and then manipulate those. I assume that if you have created several complicated reports in your life, there were times when you used, or at least thought of using, temporary tables to hold preaggregated values. The WITH clause, introduced in 9i, provides this ability right in the SELECT statement itself.

The real example is taken from several of our reports that showed people's citizenship counts over their addresses, as shown in Table 3:

Territory        | CitizenshipX | CitizenshipX | Other citizenships | Other citizenships
                 | Citizens     | Non-citizens | Citizens           | Non-citizens
All territories  |              |              |                    |
Country A        |              |              |                    |
  Big city1      |              |              |                    |
    Suburb1      |              |              |                    |
    Suburb2      |              |              |                    |
  Big city2      |              |              |                    |
  Region1        |              |              |                    |
    District1    |              |              |                    |
    District2    |              |              |                    |
    Big city1    |              |              |                    |
  Region2        |              |              |                    |

Table 3 People count breakdown by citizenships and address

The problem is that I cannot use ROLLUP, because 1) the sum of elements comes before the elements (probably even that could be overcome with some ordering of rows), and 2), the main reason, big cities are shown several times in the report and the overall sum isn't the sum of all elements.

The original query tried to solve this situation very straightforwardly: for each territory type, scan all involved tables. Schematically it was something like:

1) SELECT the count of citizens and non-citizens from the CITIZENSHIPS table, representing the all-territories row.

2) SELECT the count of citizens and non-citizens from the CITIZENSHIPS table joined to the PERSON_ADDRESSES table, joined to the ADDRESSES table, joined to the COUNTRIES table, representing the rows for each country.

3) SELECT the count of citizens and non-citizens from the CITIZENSHIPS table joined to the PERSON_ADDRESSES table, joined to the ADDRESSES table, joined to the CITIES table, filtering out only big cities, representing the rows for each big city.

4) And similarly for suburbs, regions, and districts.

As a result, some of these reports never finished on production data, i.e. after running for more than 48 hours they were simply killed. The simple question of what to do was raised. Of course we could have eliminated some of these continuous scans of the same data over and over again, but I somehow had the feeling that after the first full scan of all the involved large base tables we had collected all the information needed to display the result. The only problem was how to twist and format the result to show it the desired number of times in the desired places.

The solution was simple: create preaggregated counts grouped by citizenship, citizenship type, and address components. After that we could easily sum them and join them to the necessary address classifier components.

The following query is basically the same query taken from “Case one: persons and their citizenships”, joined with the PERSON_ADDRESSES and ADDRESSES tables (see Code listing 11).

SELECT count(*) cnt,
       ctz_code ctz,
       ctz_type,
       adr_country_code ctr,
       adr_region_code  reg,
       adr_city_code    cty
FROM addresses, person_addresses,
-- a bit modified select from case1,
-- just without grouping
  (SELECT * FROM (
    SELECT ctz_prs_id, ctz_code, ctz_type,
      CASE WHEN ctz_prs_id <> prev_prs_id THEN 1
           WHEN first_code = 'X' THEN 0
           ELSE 1
      END flag
    FROM (
      SELECT
        ctz_prs_id,
        ctz_code,
        ctz_type,
        first_value (ctz_code) OVER (
          PARTITION BY ctz_prs_id ORDER BY
          decode (ctz_code, 'X', 0, 1)) first_code,
        lag(ctz_prs_id, 1, -1) OVER (ORDER BY
          ctz_prs_id, decode (ctz_code, 'X', 0, 1))
          prev_prs_id
      FROM citizenships
    )
  )
  WHERE flag = 1) ctz_source
WHERE ctz_source.ctz_prs_id = pra_prs_id
  AND pra_adr_id = adr_id
GROUP BY ctz_code,
         ctz_type,
         adr_country_code,
         adr_region_code,
         adr_city_code;

Code listing 11 Query used in the WITH clause for Case four

So here it is: a preaggregated source to use in the WITH clause. Now we can take the same steps we had in the original query:

1) Sum all counts of citizens and non-citizens from the preaggregate, representing the all-territories row.

2) Sum the counts of citizens and non-citizens from the preaggregate joined to the COUNTRIES table, representing the rows for each country.

3) Sum the counts of citizens and non-citizens from the preaggregate joined to the CITIES table, filtering out only big cities, representing the rows for each big city.

4) And similarly for suburbs, regions, and districts.
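The steps above can be sketched in a single statement. This is only a hedged skeleton with hypothetical column and classifier names (cou_code, cou_name, and the alias ctz_counts are my inventions); the complete, authoritative query is in the downloadable scripts.

```sql
-- The preaggregate is computed once in the WITH clause and then
-- reused by every territory-level branch of the report.
WITH ctz_counts AS (
  -- place the full query from Code listing 11 here; assumed column
  -- aliases: cnt, ctz, ctz_type, ctr (country), reg (region), cty (city)
  SELECT /* ... preaggregated counts ... */ NULL FROM dual
)
SELECT 'All territories' territory, ctz, ctz_type, sum(cnt)
FROM ctz_counts
GROUP BY ctz, ctz_type
UNION ALL
SELECT cou_name, ctz, ctz_type, sum(cnt)
FROM ctz_counts, countries          -- hypothetical classifier table
WHERE ctr = cou_code
GROUP BY cou_name, ctz, ctz_type
-- UNION ALL ... similar branches for big cities, suburbs,
-- regions, and districts
;
```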

As a result we eliminated roughly 5/6 of all the work, and additionally, using the technique described in “Case one: persons and their citizenships”, all such reports now run on production data in less than 20 minutes, down from at least 48 hours. This was achieved mostly because of the following factors:

1) The big source tables (CITIZENSHIPS, ADDRESSES, PERSON_ADDRESSES in the example; actually there were more) are scanned only once, via full scan.

2) The small temporary result set is scanned in memory. In our real production system it contained several thousand rows.

3) Instead of joining the big source table results several times to the classifier tables, we join the small temporary result set. The same number of times, though.

The complete query and sample data generation can be seen here.

Summary

I had few goals in my mind writing this article:

1) Using analytic functions makes reporting easier in many cases. It is like a loop operator you can run over your result set. If you can successfully exploit it, you can free up much time and resources for other tasks.

2) Solving problems step by step is an ancient method for all kinds of tasks, and it can be put to good use in finding solutions for complex reports as well.

3) Divide et impera was (and is) a successfully used method in politics, so why not use it in SQL? The WITH clause and preaggregated values, working like this ancient method, can sometimes be even more effective than analytic functions.

 

The minimum goal of this article is achieved if you are now at least a little more interested in analytic functions and the WITH clause. The maximum goal is achieved if you are encouraged enough to try them out yourself.

About author

Gints Plivna (gints.plivna@gmail.com) is a system analyst at Rix Technologies Ltd. (www.rixtech.lv). He has been working with Oracle since 1997, and his interests have mostly been connected with analyzing system requirements, design, development, and SQL tuning.

 

Contacts:

e-mail - gints.plivna@gmail.com

website - http://www.gplivna.eu/

Licenses

 

This work is licensed under the Creative Commons Attribution-ShareAlike 2.5 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

 

Appendix 1 Database model diagram used in this document

Figure 1 Database model diagram used in this document