Tuesday, August 23, 2011

The MySQL range access method explained

The range access method uses an index to read a subset of rows that form one or multiple continuous index value intervals. The intervals are defined by the query's range predicates, which are comparisons using any of =, <=>, IN(), IS NULL, IS NOT NULL, >, <, >=, <=, BETWEEN, !=, <> or LIKE.

Some examples:
SELECT * FROM blog WHERE author_id IN (1, 7, 8, 10)
SELECT * FROM orders WHERE value > 1000

You can tell that the range access method is used when EXPLAIN shows type=range.
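
For example, assuming the blog table has an index on author_id (the table and index here are purely for illustration), you could check the first query above like this and look for type=range and the chosen key in the output:

EXPLAIN SELECT * FROM blog WHERE author_id IN (1, 7, 8, 10);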

Naturally, there has to be an index on the column used by the range predicate. Since indexes are ordered, MySQL will, for each interval, dive down the index using the interval start value and read its way through the index leaves until it reaches the interval end value:



FOR all intervals
   idx_row = index row at starting point of the interval
   WHILE range predicate is true for idx_row
      read row from table
      idx_row = next index row
   ENDWHILE
ENDFOR

The case for multiple-part indexes
Just to be clear: a multiple-part index, aka composite index, is an index over multiple columns from a single table. You create them like this:

CREATE TABLE orders (
   order_id INT NOT NULL PRIMARY KEY,
   customer_id INT,
   value INT,
   order_date DATE,
   KEY custid_value (customer_id, value)    -- multiple-part index
);
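
To make the later examples concrete, here is some hypothetical sample data (the order_ids and dates are made up; only the (customer_id, value) pairs matter for the discussion below):

INSERT INTO orders (order_id, customer_id, value, order_date) VALUES
   (1, 0, 1000, '2011-08-01'),
   (2, 1,  500, '2011-08-02'),
   (3, 1, 1000, '2011-08-03'),
   (4, 2,  800, '2011-08-04'),
   (5, 2, 1500, '2011-08-05');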

These indexes are great for range access. There is no limit to how complex a range predicate you can write and still make use of the range access method. Except that there is. A limit, that is.

A common misconception among MySQL users is that as long as you stick to the operators mentioned above, you can pretty much use a multiple-part index for any range predicate. Example:

SELECT * FROM orders WHERE customer_id = 2 AND value > 1000;

For this query, MySQL will dive down the custid_value index to the start of the interval (the first index row where customer_id = 2 AND value > 1000) and retrieve rows from the table as long as both predicates are true. Just as expected.
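
If you want to see the exact interval the optimizer builds, MySQL 5.6 and later expose it through the optimizer trace (output heavily abbreviated here, and the exact wording of the trace differs between versions):

SET optimizer_trace = 'enabled=on';
SELECT * FROM orders WHERE customer_id = 2 AND value > 1000;
SELECT trace FROM information_schema.OPTIMIZER_TRACE\G
-- the range_analysis part of the trace lists a single interval,
-- roughly: "ranges": [ "2 <= customer_id <= 2 AND 1000 < value" ]
SET optimizer_trace = 'enabled=off';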

However, the following query will only use the first keypart, customer_id, of the index:

SELECT * FROM orders WHERE customer_id < 2 AND value = 1000;

To understand why, we have to look at how the index is built up: it is ordered first on customer_id and then on value. The order of the columns is significant.

Remember that "the range access method uses an index to read a subset of rows that form one or multiple continuous index value intervals"?

The range predicate of the first query makes up a continuous interval in the index. Not so for the second query: if the orders table has these (customer_id, value) pairs: (0, 1000), (1, 500), (1, 1000), then the rows matching the predicate, (0, 1000) and (1, 1000), are separated by (1, 500) in index order and do not form a continuous interval. Thus, the range retrieval algorithm sketched above would stop at the (1, 500) row and never read (1, 1000). That is why the manual says:

"The optimizer attempts to use additional key parts to determine the interval as long as the comparison operator is =, <=>, or IS NULL. If the operator is >, <, >=, <=, !=, <>, BETWEEN, or LIKE, the optimizer uses it but considers no more key parts."

Thus, range access will return all rows with customer_id < 2 from the storage engine and the "value = 1000" predicate is applied after the row has been read from the table. The result is still correct, of course.
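
You can see the interleaving for yourself by listing the index entries in index order (using the hypothetical sample data above; FORCE INDEX and ORDER BY are only there to make the index order visible):

SELECT customer_id, value
FROM orders FORCE INDEX (custid_value)
WHERE customer_id < 2
ORDER BY customer_id, value;
-- (0, 1000)   matches value = 1000
-- (1,  500)   does not match, but sits between the two matching rows
-- (1, 1000)   matches value = 1000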

Conclusion
If you are trying to improve performance of a query with range predicates over multiple columns indexed by a multiple-part index, the column order in the index matters. To get the best performance out of your index, the columns with predicate operators =, <=>, or IS NULL should be listed first. 

You can use EXPLAIN to see how many keyparts are used by the range access method:

EXPLAIN SELECT * FROM orders WHERE customer_id = 2 AND value > 1000;
+----+---------+-------+--------------+---------+------+-----------------------+
| id | table   | type  | key          | key_len | rows | Extra                 |
+----+---------+-------+--------------+---------+------+-----------------------+
| 1  | orders  | range | custid_value | 10      | 2    | Using index condition |
+----+---------+-------+--------------+---------+------+-----------------------+

EXPLAIN SELECT * FROM orders WHERE customer_id < 2 AND value = 1000;
+----+---------+-------+--------------+---------+------+-----------------------+
| id | table   | type  | key          | key_len | rows | Extra                 |
+----+---------+-------+--------------+---------+------+-----------------------+
| 1  | orders  | range | custid_value | 5       | 5    | Using index condition |
+----+---------+-------+--------------+---------+------+-----------------------+

Notice the difference in key_len. For the first query, the range access method uses the first 10 bytes of each index row: each INT takes 4 bytes and the NULL/NOT NULL flag takes 1 byte, so customer_id + value adds up to 10 bytes. For the second query, key_len is 5, which indicates that only the customer_id keypart was used (one INT plus the NULL/NOT NULL flag).
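
As an illustration (this index is not part of the original schema), if the second query is the one you need to speed up, an index with the columns in the opposite order lets range access use both keyparts, because the equality on value then comes first:

ALTER TABLE orders ADD KEY value_custid (value, customer_id);

-- value = 1000 now pins the first keypart, and customer_id < 2 becomes a
-- range over the second keypart, so both keyparts can be used:
EXPLAIN SELECT * FROM orders WHERE customer_id < 2 AND value = 1000;

Whether the optimizer actually picks this index over custid_value depends on the index statistics for your data.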

3 comments:

  1. See my blog post about ICP (Index Condition Pushdown) for a new 5.6 feature that partially remedies the case for multiple-part indexes with columns in the "wrong order".

  2. Thanks for the excellent article.

    I found that my query was running slowly, despite having my fields in the correct order, when choosing a range between 2 values after an equals:

    WHERE id = 10 AND arrived > '2012-10-10' AND arrived < '2012-12-10'

    I found that it improved drastically when I used a BETWEEN rather than "< .. AND > ...":

    WHERE id = 10 AND arrived BETWEEN '2012-10-10' AND '2012-12-10'

  3. Thanks for the feedback, Andrew.

    The difference between <, > and BETWEEN sounds strange, as these conditions should be treated the same way in the optimizer code. One theory is that the optimizer considers two ways of executing this query to be almost equally good and therefore sometimes picks a good plan and other times a suboptimal plan due to small differences in index statistics.

    If you are able to reproduce this issue consistently, however, you might want to file a bug at bugs.mysql.com.
