Database Internals: Indexing Strategies for Performance
Database Internals: Indexing Strategies for Performance
"Just add an index" is the common advice for slow queries. But blind indexing can hurt write performance and consume disk space without solving the read problem. To truly optimize databases, one must understand the underlying data structures.
How Data is Stored: The Heap
By default (in most relational databases like PostgreSQL/MySQL), data is stored in a heap structure. Rows are appended to the file in no particular order. Finding a specific row requires a Full Table Scan—reading every page from disk until the record is found (O(N)). This is disastrous for performance on large tables.
The B-Tree Index
The most common index type is the Balanced Tree (B-Tree). It generalizes the binary search tree but allows nodes to have more than two children.
How it works:
- The B-Tree keeps keys in sorted order.
- It consists of a Root Node, Branch Nodes, and Leaf Nodes.
- Leaf Nodes contain the actual pointer to the heap location (Resource ID / TID) of the data.
- Because the tree is balanced, all leaf nodes are at the same depth.
Looking up a record in a table with millions of rows might only require 3 or 4 disk jumps (O(log N)).
Clustered vs. Non-Clustered Indexes
- Clustered Index: The physical order of rows on the disk is the same as the index order. A table can typically have only one clustered index (usually the Primary Key). In MySQL (InnoDB), the leaf nodes are the actual data pages. This makes primary key lookups extremely fast.
- Non-Clustered (Secondary) Index: A separate structure from the data rows. The leaf node contains the key value and a pointer (or Primary Key) to the actual data row. Using a secondary index often involves two lookups: finding the pointer in the index, then fetching the data.
Composite Indexes and the Leftmost Prefix Rule
You can create an index on multiple columns: CREATE INDEX idx_name_age ON users(last_name, age).
The order matters immensely! The database sorts first by last_name, and then by age.
The query matters:
WHERE last_name = 'Smith'-> Uses IndexWHERE last_name = 'Smith' AND age > 20-> Uses IndexWHERE age > 20-> Index NOT Used (or inefficient scan)
This is because the tree is not sorted by age globally, only locally within a specific last name.
Other Index Types
- Hash Index: O(1) lookups but only supports equality checks (
=). Cannot handle range queries (>,<) or sorting. - GIN / GiST (PostgreSQL): Used for complex data types like JSONB, Full-text Search, and Geo-spatial data.
Optimizing Implementation
- Cardinality: Index columns with high cardinality (many unique values). Indexing a 'Gender' column (M/F) is usually useless because the DB engine will prefer a sequential scan over jumping back and forth for 50% of the table.
- Covering Indexes: If an index contains all the columns required by a query (e.g.,
SELECT age FROM users WHERE last_name='Smith'), the DB doesn't need to look up the heap at all. It answers purely from the index (Index Only Scan). This is a massive performance win. - Write Penalty: Every
INSERT,UPDATE, orDELETErequires updating every index on the table. Over-indexing makes writes slow.
Conclusion
Databases are efficient, but they aren't magic. Understanding B-Trees and execution plans distinguishes a senior engineer from a junior one. Use EXPLAIN ANALYZE to prove your theories.
Comments
Sign in to join the conversation