What is data distribution ? Why is it important to know about your data distribution ? What do you do after you know your data distribution ?

Before we go into data distribution, a quick recap of the optimizer. When presented with a SQL statement DB2 tries to find an access path with the lowest cost.

This cost is an "estimate" only, and DB2 selects the access path that has the "lowest estimate". It follows that if the "estimate" is wrong, the selected access path is also wrong. This is why we need to know about data distribution.

When DB2 performs optimization it heavily depends on some thing called the filter factor. And when the data distribution is skewed the filter factor is wrong!!!

Filter Factor = 1/Column Cardinality

Where Column Cardinality is the number of "distinct values" in the column. Let us assume a table EMPLOYEE with column EMPLOYEE_NUMBER, if there are 500 distinct values for this column, then the filter factor for a matching predicate on this column ( WHERE EMPLOYEE_NUMBER=:HV) is 1/500.

In other words, DB2 estimates that

SELECT * FROM EMPLOYEE WHERE EMPLOYEE_NUMBER=:HV ;

will return 1/500 th of the total rows in the table. DB2 estimates 1 row to be returned ( I am assuming that EMPLOYEE_NUMBER is the primary key).

Based on this estimate DB2 will allocate the resources it needs to execute the SQL.

Next let us move on to a different example.

Let us assume that the same EMPLOYEE table has a column called EMPLOYEE_SEX, and the Column Cardinality is 2. That means, among the 500 different occurrences of this column there are only two distinct values of EMPLOYEE_SEX , Male and Female.

It follows that the filter factor for a matching predicate on this column ( WHERE EMPLOYEE_SEX=:HV) is 1/2.

In other words, DB2 estimates that

SELECT * FROM EMPLOYEE WHERE EMPLOYEE_SEX=:HV ;

will return 1/2 of the total rows in the table. DB2 estimates 250 rows to be returned.

Based on this estimate DB2 will allocate the resources it needs to execute the SQL.

Let us further assume that this EMPLOYEE table is used to store sales people details for a retail outlet that sells cosmetics such as lipstick, perfume, face cream etc. Abandoning political correctness for scientific reasons, let us think for a second about the chances of an employee being Male. Pretty Slim. But it does not preclude a couple of artistically inclined Males from seeking employment with the outlet .

Hence we end up with a situation where out of our 500 employees 498 are Females and 2 are Males. Does the fact that there are only 2 Males out of 500 employees change the cardinality of the EMPLOYEE_SEX column? No. There are still 2 distinct values Male and Female.

Let us again look at the SQL

SELECT * FROM EMPLOYEE WHERE EMPLOYEE_SEX = 'Female';

The cardinality of EMPLOYEE_SEX is 2. Therefore the predicate EMPLOYEE_SEX='Female' has a filter factor of 1/2 and DB2 will "estimate" that 250 rows will be returned. But this estimate is wrong because we know that 498 rows are going to be returned.

This SQL will perform poorly because DB2 is ill prepared to handle the additional data. A SQL on EMPLOYEE_SEX='Male' will cause other SQL statements to perform poorly as DB2 will over allocate precious resources that other needy SQLs can use. Either way it is not a good situation.

This is why Skewed data distribution is bad news for the optimizer.

What can you do about it? The answer depends on whether the predicate column belongs to an index or not. If it belongs to an index then you can run a RUNSTATS with the KEYCARD FREQVAL NUMCOLS COUNT key words. This collects the data distribution statistics for indexed columns and updates the catalog. If the columns are not indexed then you can use IBM's DSTATS program to collect the stats on the non indexed columns and update the catalog tables manually with the collected values so that the optimizer can make use of it. The DSTATS is free and can be downloaded from http://www-1.ibm.com/support/docview.wss?uid=swg24001598