MySQL Internals Series, Part 4: Query Execution & Optimization
How MySQL optimizes queries — cost model, statistics, execution plans, and steering the optimizer with hints.
The MySQL query optimizer is responsible for choosing how to execute a query. Given SELECT * FROM users WHERE age > 30 AND status = 'active', should it scan the table, use the age index, use the status index, or both? The optimizer estimates the cost of each strategy and picks the cheapest one.
But the optimizer is not always right. Understanding how it works — its cost model, statistics, and limitations — gives you the power to fix bad plans and guide the optimizer when it makes wrong choices.
The Optimizer: A Quick Overview
The MySQL optimizer executes these steps:
SQL Query
↓
Parser (syntax check)
↓
Preprocessor (semantic validation)
↓
Optimizer (cost-based planning)
├─ Generate possible execution plans
├─ Estimate cost of each plan
└─ Choose the cheapest
↓
Query Executor (executes the chosen plan)
↓
Results
Our focus is on the optimizer — the component that chooses the plan.
Execution Plans and EXPLAIN
Use EXPLAIN to see the optimizer’s chosen plan:
EXPLAIN SELECT * FROM users WHERE age > 30 AND status = 'active'\G
Output (simplified):
id 1
select_type SIMPLE
table users
partitions NULL
type ALL
possible_keys age, status
key NULL
key_len NULL
ref NULL
rows 10000
filtered 15
Extra Using where
Key fields:
- type: How MySQL accesses the table (
ALL= full scan,range= index range,eq_ref= unique index lookup,const= constant) - possible_keys: Indexes that could be used
- key: The index actually chosen (NULL = table scan)
- rows: Estimated number of rows examined (not returned)
- filtered: Percentage of rows passing the WHERE clause
Type Values (from fastest to slowest):
-
const — Single row match (PK or UNIQUE key)
SELECT * FROM users WHERE id = 1; -- const -
eq_ref — Join via unique key
SELECT * FROM orders o JOIN users u ON u.id = o.user_id; -- eq_ref -
ref — Join via non-unique index
SELECT * FROM orders WHERE user_id = 123; -- ref -
range — Index range scan (BETWEEN, >, <, IN)
SELECT * FROM users WHERE age BETWEEN 30 AND 40; -- range -
index — Full index scan (all keys, no table access)
SELECT id, email FROM users ORDER BY id; -- index -
ALL — Full table scan (slowest)
SELECT * FROM users; -- ALL
Cost Model and Statistics
The optimizer uses statistics about the table to estimate cost:
- Table size — row count
- Index cardinality — distinct values in the index
- Data distribution — histograms of value frequencies
Histograms
MySQL 8.0 introduced histograms for accurate statistics on non-uniform data:
-- Create a histogram
ANALYZE TABLE users UPDATE HISTOGRAM ON age, status;
-- View histograms
SELECT * FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
WHERE TABLE_NAME = 'users'\G
Without histograms, the optimizer assumes uniform distribution. If your data is skewed (e.g., 99% of users have status=‘active’), the optimizer may make poor estimates.
Cost Formula
The optimizer estimates cost roughly as:
Cost = (rows_examined * CPU_cost) + (disk_reads * I/O_cost)
Default weights (configurable):
- cpu_cost: 0.35 per row examined
- io_cost: 1.0 per page read
So examining 1000 rows = 350 units, reading 1 page = 1 unit. The optimizer prefers smaller rows estimates and fewer disk accesses.
Why the Optimizer Makes Bad Decisions
Reason 1: Outdated Statistics
-- Stats are stale; row count estimate is wrong
-- Rebuild statistics
ANALYZE TABLE users;
-- Check when stats were last updated
SHOW TABLE STATUS WHERE NAME = 'users'\G
Reason 2: Optimizer Limitations
The optimizer has hard-coded heuristics that override the cost model:
- Range scan limit: If an index can match >1000 rows, the optimizer may prefer a table scan
- Full text search: Often prefers table scan even with indexes
- Subqueries: May not correlate properly with outer query
- UNION: Uses OR expansion, which can be suboptimal
Reason 3: Incorrect Cardinality Estimation
Secondary indexes don’t store exact cardinality; InnoDB estimates it by sampling pages. Large tables with skewed data get bad estimates.
-- Check index cardinality
SELECT
OBJECT_SCHEMA,
OBJECT_NAME,
COUNT_DISTINCT
FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_BRIEF
WHERE OBJECT_NAME LIKE 'idx_%';
Steering the Optimizer: Hints and Techniques
1. Index Hints (not recommended, but sometimes necessary)
Force the optimizer to use (or avoid) specific indexes:
-- Force index usage
SELECT * FROM users USE INDEX (idx_age) WHERE age > 30;
-- Ignore an index
SELECT * FROM users IGNORE INDEX (idx_status) WHERE status = 'active';
-- Use index for WHERE, JOIN, and ORDER BY
SELECT * FROM users
WHERE age > 30
ORDER BY created_at
USE INDEX FOR WHERE (idx_age)
USE INDEX FOR ORDER BY (idx_created);
2. Optimizer Hints (MySQL 5.7+, preferred)
More flexible hints using comments:
-- Force a specific index
SELECT /*+ INDEX(users idx_age) */ * FROM users WHERE age > 30;
-- Hint to join two tables in a specific order
SELECT /*+ BKA(o) */ * FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.age > 30;
-- Use HASH join (MySQL 8.0.20+)
SELECT /*+ HASH_JOIN(u, o) */ * FROM users u
JOIN orders o ON u.id = o.user_id;
-- Skip ICP (index condition pushdown) if it's hurting performance
SELECT /*+ NO_ICP(users idx_age) */ * FROM users WHERE age > 30 AND status = 'active';
Available hints: INDEX, BKA, HASH_JOIN, NO_BKA, NO_HASH_JOIN, DERIVED_MERGE, NO_DERIVED_MERGE, SEMIJOIN, NO_SEMIJOIN, and more.
3. Query Rewriting
Sometimes rewriting the query helps:
-- Bad: Optimizer doesn't use index due to function
SELECT * FROM users WHERE YEAR(created_at) = 2024;
-- Good: Optimizer uses range index
SELECT * FROM users WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';
-- Bad: Subquery not optimized
SELECT * FROM orders WHERE user_id IN (SELECT id FROM users WHERE status = 'vip');
-- Good: Explicit JOIN
SELECT o.* FROM orders o
JOIN users u ON u.id = o.user_id
WHERE u.status = 'vip';
4. STRAIGHT_JOIN
Force left-to-right join order (MySQL 5.0+, deprecated in favor of hints):
SELECT STRAIGHT_JOIN * FROM users u
JOIN orders o ON u.id = o.user_id;
-- Guarantees: users is joined first, then orders
Analyzing Join Plans
Multi-table queries have different join strategies:
EXPLAIN FORMAT=JSON
SELECT * FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.age > 30\G
Output shows:
{
"query_plan": {
"select#": 1,
"table": {
"table_name": "u",
"access_type": "range", -- users table: range scan on age index
"possible_keys": ["idx_age"],
"key": "idx_age",
"rows": 100
},
"nested_loop": [
{
"table": {
"table_name": "o",
"access_type": "ref", -- orders table: ref access via user_id
"key": "idx_user_id",
"rows": 5
}
}
]
}
}
Join strategies:
-
Nested Loop Join (always available, but slow for large sets)
- For each row in
u, scan all rows inolooking for matches
- For each row in
-
Block Nested Loop Join (pre-MySQL 8.0.20)
- Buffer rows from
u, then scanoonce per buffer
- Buffer rows from
-
Hash Join (MySQL 8.0.20+, preferred for large joins)
- Hash the smaller table, then scan the larger table
Practical Optimization Example
Problem query:
SELECT o.id, o.total, u.name FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.created_at > '2024-01-01' AND u.age > 30
ORDER BY o.created_at DESC
LIMIT 100;
EXPLAIN FORMAT=JSON ... \G
-- Shows: Full table scan on orders, then lookup in users
-- Estimated rows: 100,000 (very high!)
Analysis:
ordershas no index oncreated_at— full scanuserslookup is fast (index on id), but the first step is slow
Solution:
- Add index on
orders(created_at)ororders(created_at, user_id)to help filter - Let the optimizer use
usersindex forage > 30if it helps (add histogram)
CREATE INDEX idx_orders_created ON orders(created_at DESC, user_id);
ANALYZE TABLE users UPDATE HISTOGRAM ON age;
-- Re-run EXPLAIN
-- Should show: Range scan on orders(created_at), ref access to users(id)
Key Takeaways
- EXPLAIN shows the plan — use
EXPLAIN FORMAT=JSONfor details - Statistics drive optimization — keep them fresh with
ANALYZE TABLE - Histograms help with skewed data — MySQL 8.0+ feature
- Hints guide the optimizer — use
/*+ ... */comments, not old hint syntax - Join order matters — left-to-right is joined first, which affects cost
- Query rewriting can help — avoid functions on columns, use explicit JOINs
Next Steps
In Part 5, we’ll explore the buffer pool — how MySQL caches pages in memory, evicts old pages, and manages dirty page flushing. Understanding this is critical for tuning memory usage.
Part 4 complete. Next: Buffer Pool & Memory Management