Understanding Database Indexing: B-Trees, Hash Indexes, and Beyond

Database performance is often the difference between a smooth user experience and a frustratingly slow application. At the heart of this performance lies the index. But not all indexes are created equal. Understanding the underlying data structures is critical for any engineer working with data at scale.

What is a Database Index?

An index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Think of it like the index at the back of a textbook: instead of scanning every page to find a topic, you look it up in the index and jump straight to the relevant page.

1. The B-Tree Index: The Swiss Army Knife

B-Trees (and their variants like B+ Trees) are the default indexing structure for most relational databases (MySQL, PostgreSQL, Oracle). Their primary strength is handling a wide range of queries.

How it Works

In a B-Tree, data is organized in a balanced tree structure where each node can have multiple children. This allows the database to find any record in O(log n) time. Unlike binary trees, B-Trees are optimized for systems that read and write large blocks of data, like disk drives.

Best Use Cases:

  • Equality Queries: SELECT * FROM users WHERE id = 101
  • Range Queries: SELECT * FROM orders WHERE created_at > '2026-01-01'
  • Sorting: SELECT * FROM products ORDER BY price ASC

2. The Hash Index: Speed at a Cost

Hash indexes use a hash function to map keys to specific buckets.

How it Works

When you query by a key, the database hashes the key and goes directly to the bucket where the data (or a pointer to it) is stored. This provides O(1) average lookup time.

The Trade-offs:

  • No Range Queries: Because hashes are essentially random, you cannot use them to find a range of values.
  • No Sorting: Data is not stored in any particular order.
  • Collisions: If multiple keys hash to the same bucket, performance degrades as the database must scan a list of entries in that bucket.

3. Advanced Indexing: GIN and GiST

In modern databases like PostgreSQL, we have access to more specialized indexes:

  • GIN (Generalized Inverted Index): Perfect for indexing "composite" items like JSONB blobs or arrays. If you need to search for a value inside a list, GIN is your friend.
  • GiST (Generalized Search Tree): Useful for geometric data and full-text search.

4. The Cost of Indexing: "Write Amplification"

Indexing isn't free. Every time you INSERT, UPDATE, or DELETE a row, the database must also update every index associated with that table.

Best Practices:

  • Index the "WHERE" and "JOIN" columns: These are your most frequent access points.
  • Avoid indexing low-cardinality columns: Indexing a "Boolean" column (like is_active) usually provides little benefit because the database still has to scan half the table.
  • Monitor Index Usage: Use tools like EXPLAIN ANALYZE to see if your queries are actually using the indexes you've built.
Decision Framework: Use B-Trees for 99% of use cases. Only consider Hash indexes for purely equality-based lookups in massive, read-heavy environments where range scans are never required.

Conclusion

A well-indexed database is the backbone of a scalable application. By choosing the right index for the right job, you can reduce query times from seconds to milliseconds, significantly improving the end-user experience.