Sometimes the existence of a histogram may mislead the optimizer into a suboptimal plan

INTRO 

In one of our critical OLTP databases, we faced a "latch: cache buffers chains" wait event issue caused by a long-running query hammering a commonly used block. The query was called inside a PL/SQL package function. To alleviate the situation, we first declared the function DETERMINISTIC and rewrote the query's function call as a scalar subquery, allowing Oracle to use scalar subquery caching to reduce the number of function calls. As these measures did not cure the issue and segment access was unevenly distributed, we changed the table partitioning strategy from list to hash. This solved the problem for the original query, but a batch job that previously completed in 5 minutes started to experience performance problems, so I dug into the issue. The root cause was the change of partitioning strategy: one query suffered deeply. This is how I investigated the poor query performance and cured it.

Partial index usage by creating a function-based index

ANALYSIS

After changing the partitioning strategy, the optimizer started to choose a nested loop join in the outer query instead of a hash join. I will not share the real query or its plan output, to comply with data protection regulations; it is not at all easy to understand and consists of many different subqueries. The most expensive section of the query was as below: full table scans were performed on the T_ACC_DOC and T_ACC_DOC_ENT tables.

It looked pretty easy at first glance: there were two IS NULL predicate filters that forced the full table scan operation. By default, key values that are entirely NULL are not stored in a B-tree index, so no index could be used and a full table scan was chosen.
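A minimal sketch of the shape of that expensive section, assuming only the filter columns named in this article (the real query is withheld, so everything else here is illustrative):

```sql
-- Hypothetical reconstruction of the predicate shape on T_ACC_DOC;
-- the two IS NULL filters are what rule out any plain B-tree index.
SELECT d.*
FROM   T_ACC_DOC d
WHERE  d.DOC_TYPE       IN ('X11', 'X21', 'X71', 'X81')
AND    d.RELATED_DOC_ID IS NULL
AND    d.RELATED_DOC    IS NULL
AND    d.DOC_REC_DATE   >= TO_DATE ('01.01.2014', 'DD.MM.YYYY');
```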

The query was called inside a PL/SQL package function that is solely managed by the DBA team. For that reason, we had the chance to rewrite the query as we wished, so I examined the selectivity of each condition on the T_ACC_DOC table. I calculated selectivity based on the assumption of replacing NULL values with an unused value, as below: on the RELATED_DOC_ID column I replaced NULL values with a negative value (-1903), and on the RELATED_DOC column I replaced NULL values with 'BJK'.

Selectivity of DOC_TYPE IN ('X11', 'X21', 'X71', 'X81')                      => 0.175438596491228

Selectivity of NVL ("RELATED_DOC_ID", -1903) = -1903                         => 0.543859649122807

Selectivity of DOC_REC_DATE >= TO_DATE ('01.01.2014', 'DD.MM.YYYY')          => 0.719298245614035

Selectivity of NVL ("RELATED_DOC", 'BJK') = 'BJK'                            => 0.929824561403509
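The selectivities above can be measured directly against the table; a sketch of one way to do it (selectivity = matching rows / total rows):

```sql
-- Hypothetical measurement query; each expression mirrors one of the
-- rewritten filter conditions listed above.
SELECT COUNT(CASE WHEN DOC_TYPE IN ('X11','X21','X71','X81')
             THEN 1 END) / COUNT(*) AS sel_doc_type,
       COUNT(CASE WHEN NVL(RELATED_DOC_ID, -1903) = -1903
             THEN 1 END) / COUNT(*) AS sel_related_doc_id,
       COUNT(CASE WHEN DOC_REC_DATE >= TO_DATE('01.01.2014','DD.MM.YYYY')
             THEN 1 END) / COUNT(*) AS sel_doc_rec_date,
       COUNT(CASE WHEN NVL(RELATED_DOC, 'BJK') = 'BJK'
             THEN 1 END) / COUNT(*) AS sel_related_doc
FROM   T_ACC_DOC;
```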

I created a concatenated index ordered according to the selectivity of the conditions (most selective first) and rewrote the query as below.
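The index and the rewrite can be sketched as follows. The index name and filter expressions come from this article; the exact DDL is an assumption:

```sql
-- Sketch of the concatenated function-based index, most selective
-- column first. The NVL expressions replace the IS NULL predicates,
-- so the affected rows become indexable.
CREATE INDEX IN_ACCDOC_DCTYPRLTDDDTRLTD ON T_ACC_DOC
  (DOC_TYPE,
   NVL(RELATED_DOC_ID, -1903),
   DOC_REC_DATE,
   NVL(RELATED_DOC, 'BJK'));

-- The query's filters are rewritten to match the indexed expressions:
--   RELATED_DOC_ID IS NULL  ->  NVL(RELATED_DOC_ID, -1903) = -1903
--   RELATED_DOC    IS NULL  ->  NVL(RELATED_DOC, 'BJK')    = 'BJK'
```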

Initially, the problem looked solved. Block gets decreased heavily, by 87%, and execution time dropped dramatically from 60 seconds to 1 second. But two or three days later, the batch job did not complete in time again. The problem revived. I examined the execution plan in the cursor cache and saw that the plan_hash_value had changed: our concatenated index was not used in the new execution plan. Why the optimizer changed its choice was the new mystery. While searching for a clue, I noticed that table statistics on one of the base tables had been refreshed the day before. Gathering the table statistics triggered the change of execution plan, and it resulted in degraded performance for our query.

I abbreviated the sample query to the one below to simplify testing.

Our concatenated index was not preferred by the CBO. I dropped and recreated the index and executed the query again.

Strange but true, the index was preferred again and the query performed better. I then gathered table statistics (EXEC dbms_stats.gather_table_stats( 'ACCS', 'T_ACC_DOC')) and executed the query once more; it performed poorly again. The behaviour was reproducible.

I tried to clarify why the optimizer changed its decision after statistics were gathered freshly, and compared the 10053 trace output of the query before and after the statistics gathering operation. I noticed that after the fresh gather, the calculations were made according to hybrid histogram statistics. The plan statistics of the query are as below.
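A quick way to confirm what the 10053 trace showed is to check the histogram metadata on the hidden virtual columns directly (column names as observed later in this post; the query itself is a sketch):

```sql
-- After the fresh gather, these columns should report HISTOGRAM = 'HYBRID'.
SELECT column_name, histogram, num_buckets, num_distinct
FROM   dba_tab_cols
WHERE  owner = 'ACCS'
AND    table_name = 'T_ACC_DOC'
AND    column_name IN ('SYS_NC00076$', 'SYS_NC00077$');
```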

Estimated rows (623000) and actual rows (85619) differed wildly. The optimizer's cardinality estimates looked quite inaccurate, so I tried the /*+ dynamic_sampling(11) */ hint to see whether the optimizer could generate an optimal execution plan. Here is the output.

When dynamic statistics at level 11 were used, the optimizer's estimates became far more accurate (86000), almost perfect. In the predicate information section, the SYS_NC00076$ and SYS_NC00077$ virtual columns are used. With the default dynamic statistics level of 2 in charge, what happened is explained step by step here.

  * A concatenated index (IN_ACCDOC_DCTYPRLTDDDTRLTD) was created according to the filter conditions.

  * This concatenated index references 4 columns of the base table (T_ACC_DOC), two of them inside NVL expressions.

  * Oracle internally created two virtual columns for the NVL expressions (SYS_NC00076$, SYS_NC00077$). This is the default behaviour when a formula or expression is indexed. These virtual columns can be seen in the dba_tab_cols dictionary view, but not in dba_tab_columns.

  * As no table statistics gathering preference was in use for the T_ACC_DOC table, table statistics were collected with the default parameters (METHOD_OPT => 'FOR ALL COLUMNS SIZE AUTO').

  * Oracle evaluated column usage statistics (select * from sys.col_usage$ where obj#=83662 and intcol# in (76,77)) and decided to collect hybrid histograms to help the optimizer make better estimates. As the problematic query's predicate filters were recorded in the col_usage$ table, Oracle was aware that statistics on these virtual columns might be useful for subsequent executions. Nigel Bayliss's blog post "How does the Oracle Database choose what histograms to create?" has more information about it.

  * The data is highly skewed towards some values, such as NVL ("RELATED_DOC_ID", -1903) = -1903 and NVL ("RELATED_DOC", 'BJK') = 'BJK'. These values were spotted as popular values in the histogram statistics.

  * In the light of all this information, a full table scan looked reasonable to the CBO when the -1903 and 'BJK' values were queried all together.
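The popular values that drove the CBO's decision can be seen in the histogram endpoints; a sketch of how to inspect them (in a hybrid histogram, popular values appear as endpoints with a large repeat count):

```sql
-- Endpoints whose ENDPOINT_REPEAT_COUNT is large correspond to the
-- popular values (e.g. the -1903 substitute for NULL).
SELECT endpoint_number, endpoint_value,
       endpoint_actual_value, endpoint_repeat_count
FROM   dba_tab_histograms
WHERE  owner = 'ACCS'
AND    table_name = 'T_ACC_DOC'
AND    column_name = 'SYS_NC00076$'
ORDER  BY endpoint_number;
```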

We have lots of possible solutions; some of them are:

     ✔ We can use an index hint: /*+ INDEX(D IN_ACCDOC_DCTYPRLTDDDTRLTD) */

     ✔ We can use a SQL patch.

     ✔ We can fix the good plan with a SQL plan baseline.

     ✔ We can use a SQL profile.

     ✔ We can set a table preference and limit the bucket count to 1, so that no histogram is gathered on the virtual columns (SYS_NC00076$, SYS_NC00077$). The blog post "How do I drop an existing histogram on a column and stop the Auto Stats gathering job from creating it in the future?" may be referenced for more information.

     ✔ We can rewrite the query.

    * As a rule of thumb, I always try to make things work with all the defaults in place. Using a hint or fixing the SQL plan may also be a good solution for this problem, but what I did was emulate a partial/filtered index in Oracle. The approach is described in my blog post "How to Implement a Filtered Index in Oracle".
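The table preference option above can be sketched as follows. This is an assumption based on the referenced blog post, not the exact command used; verify the virtual column names in dba_tab_cols first:

```sql
-- Keep AUTO sizing for regular columns, but force a single bucket
-- (i.e. no histogram) on the two virtual columns.
BEGIN
  DBMS_STATS.SET_TABLE_PREFS(
    ownname => 'ACCS',
    tabname => 'T_ACC_DOC',
    pname   => 'METHOD_OPT',
    pvalue  => 'FOR ALL COLUMNS SIZE AUTO ' ||
               'FOR COLUMNS SIZE 1 SYS_NC00076$ SYS_NC00077$');
  -- Regather so the existing hybrid histograms are replaced immediately.
  DBMS_STATS.GATHER_TABLE_STATS('ACCS', 'T_ACC_DOC');
END;
/
```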

Hope it helps.

Comments

  1. I was just wondering what level of dynamic sampling Oracle decided to go with when you set it to 11. Was it available from the optimizer trace?

    1. Sorry Fadai, unfortunately the trace is not available. If I get a chance, I will try to check.


