Database Internals: Indexing Strategies for Performance
Comments
Sign in to join the conversation
Sign in to join the conversation
"Just add an index" is the common advice for slow queries. But blind indexing can hurt write performance and consume disk space without solving the read problem. To truly optimize databases, one must understand the underlying data structures.
By default (in most relational databases like PostgreSQL/MySQL), data is stored in a heap structure. Rows are appended to the file in no particular order. Finding a specific row requires a Full Table Scan—reading every page from disk until the record is found (O(N)). This is disastrous for performance on large tables.
The most common index type is the Balanced Tree (B-Tree). It generalizes the binary search tree but allows nodes to have more than two children.
How it works:
Looking up a record in a table with millions of rows might only require 3 or 4 disk jumps (O(log N)).
You can create an index on multiple columns: CREATE INDEX idx_name_age ON users(last_name, age).
The order matters immensely! The database sorts first by last_name, and then by age.
The query matters:
WHERE last_name = 'Smith' -> Uses IndexWHERE last_name = 'Smith' AND age > 20 -> Uses IndexWHERE age > 20 -> Index NOT Used (or inefficient scan)This is because the tree is not sorted by age globally, only locally within a specific last name.
=). Cannot handle range queries (>, <) or sorting.SELECT age FROM users WHERE last_name='Smith'), the DB doesn't need to look up the heap at all. It answers purely from the index (Index Only Scan). This is a massive performance win.INSERT, UPDATE, or DELETE requires updating every index on the table. Over-indexing makes writes slow.Databases are efficient, but they aren't magic. Understanding B-Trees and execution plans distinguishes a senior engineer from a junior one. Use EXPLAIN ANALYZE to prove your theories.