DB2 data sampling

Sometimes it would be very helpful to have a chance gather sample values. What if rownum limitation affects quality of records? What if the main table logic is residental in the middle of huge table? Welcome to data sampling world 🙂

Some reasons for data sampling (please take it as my brainstorming result, not unchangable true):

1) sample can represent whole table very well in some conditions

2) in case we dont know „data demography“ – the only way how to obtain relevant result

3) volume of data is rising very fast, logic of filling can repeat in circles

For my test I created sample table (10 000 records), whole set is saved there. After running from command line ( db2 -tvsf /tmp/db2sample.sql), RUNSTAT + REORG circle was continuation and (inspiration here, more reading recommended :-D) data are prepared.


A) Good old way to query partial data

SELECT * FROM SAMTAB WHERE rand() < 0.01 order by 1

returning ~1% of overall dataset. You can always check data demography (ID values starts at 100 000, step 1 = last record 109 999)

B) Mr.Bernoulli? Come and play with us!

You can find further info about this sampling here. Usage is very simple:

select * from samtab tablesample bernoulli(0.8) order by 1

returning ~0,8% of overall dataset.

As Bernoulli is pure row-level sample method, every record must be read = can be very demanding and usage needs your consideration.

C) Let´s turn to next page

Very efficient sampling method (using pages, not rows)

select * from samtab tablesample system(0.8) order by 1

returning ~0,8% of overall dataset, using same pages = demography is completely different very often. In this case, there are sets of following records.


This entry was posted in Databases, DB2. Bookmark the permalink.

Napsat komentář