Tuesday, April 10, 2012

On queries with many values in the IN clause

A few customers with rather extreme needs have contacted us about a performance issue with the range optimizer. Our solution to the problem is to introduce a new variable in MySQL 5.6, eq_range_index_dive_limit, which can be used to control whether or not the range optimizer will a) do index dives, or b) use index statistics when estimating the number of rows in the ranges of the query. The former method gives a far more accurate estimate while the latter costs a lot less to compute.

This is what the help text has to tell about the variable:

The optimizer will use existing index statistics instead of doing index dives for equality ranges if the number of equality ranges for the index is larger than or equal to [the value of variable]. If set to 0, index dives are always used.
"Equality range" means predicates using operators IN() or =, and it's important to notice that the number of such ranges is counted on a per index basis. Furthermore, index dives are not used on unique/primary indexes, so only non-unique indexes are affected. 

The two ways row estimates can be computed

For as long as there have been a range access method in MySQL, the number of rows in a range has been estimated by diving down the index to find the start and end of the range and use these to count the number of rows between them. This technique is accurate, and is therefore a good basis to make the best possible execution plan (QEP) for the query. Unfortunately, it involves two index dives for each range. That's not a big deal if you run normal queries with a few equality ranges like

SELECT title FROM albums WHERE artist IN (10, 12)
But some MySQL users have queries with hundreds of values in the IN clause. For such queries it may take considerably longer to optimize the query than to execute it. This is when it makes sense to use the less accurate index statistics. Consider this example:

mysql> SELECT @@eq_range_index_dive_limit;
+-----------------------------+
| @@eq_range_index_dive_limit |
+-----------------------------+
|                          10 |
+-----------------------------+
1 row in set (0.00 sec)

mysql> /* Uses index statistics */ 
mysql> SELECT * FROM my_table WHERE col1 IN(<200 values>);
(cut)
199 rows in set (0.02 sec)

mysql> SELECT duration 
    -> FROM INFORMATION_SCHEMA.PROFILING
    -> WHERE state LIKE "statistics" ORDER BY query_id DESC LIMIT 1;
+----------+
| duration |
+----------+
| 0.001275 |
+----------+
1 row in set (0.00 sec)

mysql> SET eq_range_index_dive_limit=0;
Query OK, 0 rows affected (0.00 sec)

mysql> /* Uses index dives */ 
mysql> SELECT * FROM my_table WHERE col1 IN(<200 values>);
(cut)
199 rows in set (0.05 sec)

mysql> SELECT duration 
    -> FROM INFORMATION_SCHEMA.PROFILING
    -> WHERE state LIKE "statistics" ORDER BY query_id DESC LIMIT 1;
+----------+
| duration |
+----------+
| 0.029883 |
+----------+
1 row in set (0.00 sec)

This is for a table with 1M rows and the index completely in memory. Expect bigger impact for indexes partially on disk.

When to use index statistics instead of index dives

"Index statistics" refers to the very same statistics that is used to estimate the number of matching rows for [eq-]ref access, and the same that is displayed by SHOW INDEX . If the statistics is fairly up to date (e.g. because you've run ANALYZE TABLE recently), the numbers stored for this index pretty much matches the actual index cardinality (average number of rows per index value). Notice average! This number may be way off for some values but will be more and more accurate the more equality ranges your query contains. Further more, you don't save much by using index statistics for just a few equality ranges because just a few index dives doesn't cost much. I therefore suggest that tuning eq_range_index_dive_limit should be way down on your list of things to do to improve your query response time.


How many equality ranges does your query contain?

Let's finish off by showing how many equality ranges a few example queries contain:

CREATE TABLE my_table (
  col1 INT, 
  col2 INT, 
  INDEX col1_idx (col1), 
  INDEX col2_idx (col2)
);

/* 1 equality range for col1_idx, 0 equality ranges for col2_idx            */
/* if eq_range_index_dive_limit=1, these queries will use index statistics  */
/* to estimate rows in the range for col1_idx. If 0 or >1, index dives      */
/* will be used.                                                            */
SELECT * FROM my_table WHERE col1=1;
SELECT * FROM my_table WHERE col1 IN(1);
SELECT * FROM my_table WHERE col1 IN(1) AND col2>4;

/* 2 equality ranges for col1_idx, 0 equality ranges for col2_idx           */
/* if eq_range_index_dive_limit=1 or 2, this query will use index           */
/* statistics to estimate rows in the ranges for col1_idx.                  */
*/ Otherwise, index dives will be used.                                     */
SELECT * FROM my_table WHERE col1=1 OR col1=2;
SELECT * FROM my_table WHERE col1 IN(1, 2);
SELECT * FROM my_table WHERE col1 IN(1, 2) AND col2>4;

/* 5 equality ranges for col1_idx                                           */
SELECT * FROM my_table WHERE col1=1 OR col1 IN(2, 4, 7, 10);

Composite indexes 

It's harder to count equality ranges for composite indexes. If you're in doubt how many equality ranges your query contains you can use optimizer tracing and count the number of ranges that contain only ranges with this pattern:
"val1 <= col1 <= val1 AND val2 <= col2 <= val2 AND ..."
The optimizer trace also tells you whether or not the number of equality ranges exceeded the value of eq_range_index_dive_limit. Look for this part:

"range_scan_alternatives": [
 {
    "index": "col1_col2",
    "index_dives_for_eq_ranges": true,
    "ranges": [
       "1 <= col1 <= 1 AND 4 <= col2 <= 4"
    ],
    ...
}

This trace tells us that there is 1 equality range for index col1_col2 and that row estimation will be done by index dives ("index_dives_for_eq_ranges": true).


Further reading

4 comments:

  1. Hello,

    I have a query with 10,000 values in the IN clause that gets run against a table containing ~30 million rows. It is running against a single index. On MySQL 5.5.30 it performs well, averaging about 0.381 seconds (when the index and data is in the Innodb buffer). On 5.6.10, it averages about 3 seconds, and almost all of that is in the statistics state. I tried setting eq_range_index_dive_limit to a higher value, much higher than the total number of equality ranges, and it made a very small difference (~0.2 seconds faster).

    Do you know why the optimizer would take so much longer for this type of query in 5.6.10 than 5.5.x? My understanding from your post and the documentation is that the range optimizer has always had to do index dives prior to this new variable, so I wouldn't have expected the performance to decrease.

    On a brighter note, I am seeing a very significant improvement in other queries, particularly ones that take advantage of MRR, materialized subqueries or ICP. Thanks for the great work!

    ReplyDelete
    Replies
    1. Hi Lucas,

      It's impossible to tell for sure without more info, but it sounds like this bug: http://bugs.mysql.com/bug.php?id=68046

      It was fixed in MySQL 5.6.11, which is the next version to be released.

      Delete
    2. Thank you for replying. That's exactly the same problem I'm seeing. I did almost the same test as Martijn, using a simple table with a single column, and had the same result. Thanks for pointing me in the right direction!

      I'm glad my search for a solution led me here. I've benefitted quite a bit from reading your posts, along with some of the others on the optimizer team blog. Optimizer tracing in particular is a really great new feature that I'm sure I'll use quite a bit.

      Delete
    3. That's great to hear! To be honest we're pretty excited about this release ourselves :-)

      Delete