1 Hybrid Histograms

官网：http://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm#TGSQL372

A hybrid histogram combines characteristics of both height-based histograms and frequency histograms. This "best of both worlds" approach enables the optimizer to obtain better selectivity estimates in some situations.

Hybrid Histograms结合了height-based histograms和frequency histograms的特点。这种“两全其美”的方法使优化器能够在某些情况下获得更好的选择性估计。

The height-based histogram sometimes produces inaccurate estimates for values that are almost popular. For example, a value that occurs as an endpoint value of only one bucket but almost occupies two buckets is not considered popular.

To solve this problem, a hybrid histogram distributes values so that no value occupies more than one bucket, and then stores the endpoint repeat count value, which is the number of times the endpoint value is repeated, for each endpoint (bucket) in the histogram. By using the repeat count, the optimizer can obtain accurate estimates for almost popular values.

hybrid histogram分布值，因此没有值占用超过一个bucket，然后存储endpoint重复值，对于直方图中的每个endpoint(bucket)来说，这是endpoint值重复的次数。

1.1 Endpoint是如何重复统计

下面例子是官网上的。

图1，是coins列的所有值。

1.1.1 收集统计信息

method_opt =>FOR ALL COLUMNS SIZE 3 ---分3组收集直方图

--第一次值的分布情况

3个分组中，Bucker1中有1和5，Bucket2中有5,10,25，Bucket3中有25,50,100。

之后优化器会移动bucket的边界(就是同一个bucket中所包含的值)，以保证同一个值都在同一个bucket中。

--优化之后的分布情况

重新优化后，要重新计算每个bucket中endpoint重复值的总和。如下图中的，Bucket1中重复值5总和是3。

通过使用重复值的出现次数，优化器可以准确的知道endpoint值出现的次数。例如：优化器知道值5出现了3次，25出现4次，100出现2次。这种信息帮助优化器产生更准确的基数估计。

1.2 产生Hybrid Histograms的条件

与 top frequency histograms相反，Hybrid Histograms是top n出现频率要小于P阈值。

l NDV is greater than n, where n is the number of histogram buckets (default is 254).

l The criteria for top frequency histograms do not apply.

This is another way to stating that the percentage of rows occupied by the top n frequent values is less than threshold p, where p is (1-(1/n))*100. See "Criteria For Top Frequency Histograms."

l The estimate_percent parameter in the DBMS_STATS statistics gathering procedure is set to AUTO_SAMPLE_SIZE.

If users specify their own percentage, then the database creates frequency or height-balanced histograms.

1.1 实验-如何生成Hybrid Histograms

1.1.1 创建表并插入数据

--创建表

create table hybrid_t(id number);

--插入数据

insert into hybrid_t values(2014);
insert into hybrid_t values(2014);
insert into hybrid_t values(2014);
insert into hybrid_t values(2014);
insert into hybrid_t values(2014);
insert into hybrid_t values(2014);
insert into hybrid_t values(2014);
insert into hybrid_t values(2014);
insert into hybrid_t values(2055);
insert into hybrid_t values(2055);
insert into hybrid_t values(2055);
insert into hybrid_t values(2055);
insert into hybrid_t values(2055);
insert into hybrid_t values(2055);
insert into hybrid_t values(2055);
insert into hybrid_t values(2032);
insert into hybrid_t values(2032);
insert into hybrid_t values(2032);
insert into hybrid_t values(2032);
insert into hybrid_t values(2032);
insert into hybrid_t values(2032);
insert into hybrid_t values(2054);
insert into hybrid_t values(2054);
insert into hybrid_t values(2054);
insert into hybrid_t values(2054);
insert into hybrid_t values(2054);
insert into hybrid_t values(2054);
insert into hybrid_t values(2056);
insert into hybrid_t values(2056);
insert into hybrid_t values(2056);
insert into hybrid_t values(2056);
insert into hybrid_t values(2056);
insert into hybrid_t values(2031);
insert into hybrid_t values(2031);
insert into hybrid_t values(2031);
insert into hybrid_t values(2031);
insert into hybrid_t values(2031);
insert into hybrid_t values(2042);
insert into hybrid_t values(2042);
insert into hybrid_t values(2042);
insert into hybrid_t values(2042);
insert into hybrid_t values(2042);
insert into hybrid_t values(2051);
insert into hybrid_t values(2051);
insert into hybrid_t values(2051);
insert into hybrid_t values(2051);
insert into hybrid_t values(2022);
insert into hybrid_t values(2022);
insert into hybrid_t values(2022);
insert into hybrid_t values(2011);
insert into hybrid_t values(2021);
insert into hybrid_t values(2052);

1.1.2 生成hybrid histogram：

1.1.2.1 收集统计信息

收集表topfreq_t 10 buckets的统计信息

BEGIN  DBMS_STATS.GATHER_TABLE_STATS (
    ownname     => 'LEI'
,   tabname     => 'HYBRID_T'
,   method_opt  => 'FOR COLUMNS ID SIZE 5'
);
END;
/

1.1.2.2 查看每个值的总数

SQL> SELECT COUNT(id) AS NUM_OF_ROWS, id
FROM  HYBRID_T
GROUP BY id
ORDER BY 1 DESC;  2    3    4  
 
NUM_OF_ROWS	    ID
----------- ----------
  8	  2014
  7	  2055
  6	  2032
  6	  2054
  5	  2031
  5	  2056
  5	  2042
  4	  2051
  3	  2022
  1	  2011
  1	  2052
 
NUM_OF_ROWS	    ID
----------- ----------
  1	  2021
 
12 rows selected.

共有12个不同的值，分组(bucket)为5，小于12。

P = 90%

Top 5最频繁的值所占比例：32/75=62%，小于p=80%

1.1.2.3 查看列ID上的直方图信息

SQL> SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM
FROM   USER_TAB_COL_STATISTICS
WHERE  TABLE_NAME='HYBRID_T'
AND    COLUMN_NAME='ID';  2    3    4  
 
TABLE_NAME COLUMN_NAME	NUM_DISTINCT HISTOGRAM
---------- -------------------- ------------ -----
HYBRID_T  ID	  12  HYBRID

可以看到，已经生成了hybrid histogram

1.1.2.4 查看ID列上的ENDPOINT_NUMBER，ENDPOINT_VALUE，ENDPOINT_REPEAT_COUNT

SQL> SELECT ENDPOINT_NUMBER, ENDPOINT_VALUE, ENDPOINT_REPEAT_COUNT
FROM   USER_HISTOGRAMS
WHERE  TABLE_NAME='HYBRID_T'
AND    COLUMN_NAME='ID'
ORDER BY 1;  2    3    4    5  
 
ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_REPEAT_COUNT
--------------- -------------- ---------------------
      1   2011	   1
     18   2031	   5
     33   2051	   4
     40   2054	   6
     52   2056	   5

1.1 总结

height-based histogram中会将70条数据平均分配到10个bucket中，每个bucket有7条数据，可能会造成同一的数据分配到不同的bucket中。而hybrid histogram只取top 10的值分配到10个bucket中，而且相同值只会存在一个bucket中。然后通过endpoint repeat count，优化器可以准确估计基数。

oracle12c Hybrid Histograms

签到成功

CNDBA社区

oracle12c新类型的优化器统计信息-Hybrid Histograms