DATA
DISTRIBUTION CHARACTERISTICS.
What is data distribution ? Why is it important to know about
your data distribution ? What do you do after you know your data
distribution ?
Before we go into data distribution, a quick recap of the
optimizer. When presented with a SQL statement DB2 tries to find an
access path with the lowest cost.
This cost is an "estimate" only, and DB2 selects the access path that
has the "lowest estimate". It follows that if the "estimate" is wrong,
the selected access path is also wrong. This is why we need to
know about data distribution.
When DB2 performs optimization it heavily depends on some thing
called the filter factor. And when the data distribution is skewed the
filter factor is wrong!!!
Filter Factor = 1/Column Cardinality
Where Column Cardinality is the number of "distinct
values" in the column. Let us assume a table EMPLOYEE with
column EMPLOYEE_NUMBER, if there are 500 distinct values for this
column, then the filter factor for a matching predicate on this column
( WHERE EMPLOYEE_NUMBER=:HV) is 1/500.
In other words, DB2 estimates that
SELECT * FROM EMPLOYEE WHERE EMPLOYEE_NUMBER=:HV ;
will return 1/500 th of the total rows in the table. DB2
estimates 1 row to be returned ( I am assuming that EMPLOYEE_NUMBER is
the primary key).
Based on this estimate DB2 will allocate the resources it needs to
execute the SQL.
Next let us move on to a different example.
Let us assume that the same EMPLOYEE table has a column called
EMPLOYEE_SEX, and the Column Cardinality is 2. That means, among the
500 different occurrences of this column there are only two distinct
values of EMPLOYEE_SEX , Male and Female.
It follows that the filter factor for a matching predicate on this
column ( WHERE EMPLOYEE_SEX=:HV) is 1/2.
In other words, DB2 estimates that
SELECT * FROM EMPLOYEE WHERE EMPLOYEE_SEX=:HV ;
will return 1/2 of the total rows in the table. DB2 estimates 250
rows to be returned.
Based on this estimate DB2 will allocate the resources it needs to
execute the SQL.
Let us further assume that this EMPLOYEE table is used to store sales
people details for a retail outlet that sells cosmetics such as
lipstick, perfume, face cream etc. Abandoning political correctness for
scientific reasons, let us think for a second about the chances of an
employee being Male. Pretty Slim. But it does not preclude a couple of
artistically inclined Males from seeking employment with the outlet .
Hence we end up with a situation where out of our 500 employees 498 are
Females and 2 are Males. Does the fact that there are only 2
Males out of 500 employees change the cardinality of the EMPLOYEE_SEX
column? No. There are still 2 distinct values Male and Female.
Let us again look at the SQL
SELECT * FROM EMPLOYEE WHERE EMPLOYEE_SEX = 'Female';
The cardinality of EMPLOYEE_SEX is 2. Therefore the predicate
EMPLOYEE_SEX='Female' has a filter factor of 1/2 and DB2 will
"estimate" that 250 rows will be returned. But this estimate is wrong
because we know that 498 rows are going to be returned.
This SQL will perform poorly because DB2 is ill prepared to handle the
additional data. A SQL on EMPLOYEE_SEX='Male' will cause other SQL
statements to perform poorly as DB2 will over allocate precious
resources that other needy SQLs can use. Either way it is not a good
situation.
This is why Skewed data distribution is bad news for the optimizer.
What can you do about it? The answer depends on whether the predicate
column belongs to an index or not. If it belongs to an index then you
can run a RUNSTATS with the KEYCARD FREQVAL
NUMCOLS COUNT key words. This collects the data distribution
statistics for indexed columns and updates the catalog. If
the columns are not indexed then you can use IBM's DSTATS program
to collect the stats on the non indexed columns and update the catalog
tables manually with the collected values so that the optimizer can
make use of it. The DSTATS is free and can be downloaded
from
http://www-1.ibm.com/support/docview.wss?uid=swg24001598