MySQL Internals Series, Part 3: Indexes & B-Trees

Indexes are the single most important factor in query performance. But most developers create them via trial-and-error, adding an index whenever a query is slow. Understanding B-tree structure, clustered vs. secondary indexes, and covering indexes transforms you from index-guessing to index-designing.

This part explores how InnoDB’s B-trees work, why secondary index lookups require two searches, and how to design indexes that actually help the optimizer.

B-Tree Basics

InnoDB uses B-trees (balanced search trees) for both tables (clustered indexes) and secondary indexes. A B-tree has these properties:

Balanced — all leaves are at the same depth
Multi-way branching — each internal node can have many children
Sorted keys — keys within each node are sorted, enabling binary search
Self-balancing — insertions/deletions automatically rebalance

A simplified B-tree looks like:

                    ┌──────────────────┐
                    │  [40 | 60 | 80]  │ (root)
                    └──────┬───┬───┬───┘
          ┌─────────────────┘   │   │
          │                     │   │
    ┌─────┴─────┐      ┌───────┘   │
    │ [10|20|30]│      │[50]│[70]  │
    └────┬──────┘      └──┬─┴──────┘
         │                │
    leaf nodes with       leaf nodes
    actual data

Key properties:

Node capacity — limited by page size (16 KB). A typical node has 100-1000 keys depending on key size
Search — logarithmic O(log N), e.g., 50 million rows need ~6 disk accesses
Insertion/deletion — also O(log N), but may trigger node splits/merges

Clustered Index vs Secondary Indexes

Clustered Index (the table itself)

In InnoDB, the clustered index is the table. It stores:

Primary key columns
All user columns
MVCC metadata (transaction ID, undo pointer)

CREATE TABLE users (
    id BIGINT PRIMARY KEY,      -- Clustered index key
    email VARCHAR(255),
    name VARCHAR(100),
    age INT
);

-- This creates ONE B-tree:
-- Clustered Index: (id -> email, name, age, mvcc_metadata)

Leaf nodes contain full rows. Searching by primary key is fast (one B-tree search).

Secondary Indexes

Secondary indexes are separate B-trees that store:

Secondary key columns
Clustered index key (the primary key)
NOT other columns

CREATE INDEX idx_email ON users(email);

-- This creates a second B-tree:
-- Secondary Index: (email -> id)
-- Note: id is appended automatically

Why is this important? When you query:

SELECT name FROM users WHERE email = 'alice@example.com';

MySQL must:

Search the secondary index (email) to find the primary key id
Search the clustered index (id) to fetch the row (and get name)

This is called a two-level lookup or index + table access.

Covering Indexes

A covering index contains all the columns needed for a query, eliminating the second lookup:

CREATE INDEX idx_email_name ON users(email, name);

-- Query the same query:
SELECT name FROM users WHERE email = 'alice@example.com';

-- MySQL now does:
-- 1. Search idx_email_name for (email, name)
-- 2. Done! No second lookup to the clustered index

This is called an index-only scan or covered query. It’s much faster because it avoids the second B-tree search.

Design rule: For frequently accessed columns, add them to the index via the INCLUDE clause (MySQL 8.0.20+):

-- INCLUDE columns are stored but not used for sorting/filtering
CREATE INDEX idx_email ON users(email) INCLUDE (name, age);

-- Or the old way (adds them to the sort key, but effective):
CREATE INDEX idx_email_name_age ON users(email, name, age);

Composite Indexes and Key Ordering

When you create an index on multiple columns, order matters:

CREATE INDEX idx_email_age ON users(email, age);

-- This index helps:
-- - WHERE email = '...'
-- - WHERE email = '...' AND age > 30
-- - WHERE email = '...' AND age = 30

-- This index does NOT help:
-- - WHERE age > 30  (must scan entire index)
-- - WHERE age = 30 AND email = '...'  (email is 2nd, can't use without email)

Ordering rule (for a WHERE clause with filters):

Put equality columns first (used for filtering, don’t require range scans)
Put range columns next (used for range scans)
Put sort columns last (if you need sorting)

Example:

-- Query: SELECT name FROM users WHERE email = '...' AND age > 30 ORDER BY created_at
-- Best index: (email, age, created_at)
CREATE INDEX idx_optimal ON users(email, age, created_at);

Adaptive Hash Index

InnoDB has a secret weapon: the adaptive hash index (AHI). It’s an in-memory hash table (not a B-tree) that speeds up frequently accessed B-tree nodes.

When InnoDB detects that certain B-tree searches are repeated (e.g., lookups by primary key in tight loops), it automatically builds a hash index on hot parts of the B-tree.

SHOW ENGINE INNODB STATUS\G
-- Look for "Adaptive hash index stats:" section

-- Monitor AHI
SELECT * FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE NAME LIKE 'adaptive%'\G

You can’t control AHI directly, but you can disable it if it’s consuming too much memory:

SET GLOBAL innodb_adaptive_hash_index = OFF;

Index Fragmentation

Over time, indexes become fragmented — leaf pages are not sequentially stored on disk, and internal pages have gaps. This increases I/O because a B-tree search bounces between scattered pages.

Check fragmentation:

SELECT
    OBJECT_SCHEMA,
    OBJECT_NAME,
    COUNT_LEAF_PAGES,
    COUNT_LEAF_DELETED,
    RATIO_DELETED
FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_BRIEF
WHERE OBJECT_NAME LIKE 'users%';

If RATIO_DELETED is high (>10%), rebuild the index:

-- Full table rebuild (rewrites clustered index + all secondary indexes)
OPTIMIZE TABLE users;

-- Or just rebuild a specific index (faster)
ALTER TABLE users ENGINE=InnoDB;  -- Rebuilds all indexes

Partial Indexes and Invisible Indexes

Partial Indexes (MySQL 8.0.13+)

Index only rows matching a WHERE condition:

-- Index only active users (skip millions of inactive rows)
CREATE INDEX idx_email_active ON users(email) WHERE status = 'active';

-- Query must match the condition to use the index
SELECT * FROM users WHERE status = 'active' AND email = '...';

Invisible Indexes

Test removing an index without deleting it:

-- Make index invisible (optimizer won't use it)
ALTER TABLE users ALTER INDEX idx_email INVISIBLE;

-- Monitor query performance
EXPLAIN SELECT * FROM users WHERE email = '...' \G

-- If performance is bad, make it visible again
ALTER TABLE users ALTER INDEX idx_email VISIBLE;

-- If not needed, delete it
DROP INDEX idx_email ON users;

Practical Index Design

Given this table:

CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    user_id BIGINT,
    status VARCHAR(50),
    total_amount DECIMAL(10, 2),
    created_at TIMESTAMP
);

Queries:

SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC → Index: (user_id, created_at DESC)
SELECT * FROM orders WHERE status = 'pending' AND created_at > ? → Index: (status, created_at)
SELECT COUNT(*) FROM orders WHERE status = 'completed' → Index: (status) (or full scan if rare status)

Final indexes:

CREATE INDEX idx_user_date ON orders(user_id, created_at DESC);
CREATE INDEX idx_status_date ON orders(status, created_at);

Key Takeaways

Clustered index is the table — it stores all columns, secondary indexes only store keys
Secondary index lookups are two-stage — index search + clustered key lookup
Covering indexes eliminate the second lookup — put all needed columns in the index
Order matters in composite indexes — equality first, range second, sort last
Adaptive hash index is automatic — InnoDB optimizes hot paths without configuration
Fragmentation reduces performance — use OPTIMIZE TABLE when ratio_deleted is high

Next Steps

In Part 4, we’ll explore how the MySQL optimizer uses indexes to build execution plans. You’ll learn the cost model, statistics, and how to steer the optimizer with hints.

Part 3 complete. Next: Query Execution & Optimization