Wondered what is Database Indexing and how it works?

Database

In computing, a database is an organized collection of data stored and accessed electronically from a computer system. It is a system for storing data in an ordered manner. The data is stored in a specified structure within the storage. Each database type has its own data storage format. They've been tweaked and tuned for various scenarios.

Let's take this simple Employee Table as an example

Wondered how this simple table is actually saved Internally.

Storage

Internally, each database is kept as a file with a certain encoding and layout. Let's pretend that a database is backed by a CSV file for the purpose of simplicity. Here's how it appears:

Id,FirstName,LastName,Designation
1,Analise,Holsall,Account Executive
2,Agnes,St. Hill,Structural Analysis Engineer
3,Ulrica,Willimot,Mechanical Systems Engineer
4,Coop,Awcoate,VP Product Management
5,Willi,De Maria,Business Systems Development Analyst

Everything appears to be straightforward. It's not difficult to perform a lookup with only five entries but what if there were 100,000 entries? It would take a long time to go through the entire file. The query time grows in direct proportion to the file size.

This brings us to finding an optimal solution to retrieve data faster.

Indexing comes to the rescue!!

Playmobil mountain rescue service heli with crew. Made with Canon 5d Mark III and analog vintage lens, Leica Elmarit-R 2.8 135mm (Year: 1987) — Indexing at Rescue

Indexing

A database index is a data structure that can be used to speed up data retrieval activities. What does it resemble?

It would be much faster to skip to the appropriate row without looping through the remainder of the table if we needed to fetch an employee with ID 5. This is the central concept of indexing. We must additionally save the offset, which leads to the appropriate entry.

The simplest way to achieve this would be to use hashing. The key is the value of the column we want to index (in this example, it is the Id column). The hash value is the offset(starting index) in the database file. For ID = 1, the offset is 0. For ID = 2, the offset is 36.

The end-to-end structure will look like this:

1 -> 0  -------> 1,Analise,Holsall,Account Executive
2 -> 36  ------> 2,Agnes,St. Hill,Structural Analysis Engineer
3 -> 72  ------> 3,Ulrica,Willimot,Mechanical Systems Engineer
4 -> 118 ------> 4,Coop,Awcoate,VP Product Management
5 -> 155 ------> 5,Willi,De Maria,Business Systems Development Analyst

Querying employees by ID will yield faster results if we build an index. The retrieved request accesses the hash index and retrieves the offset for the desired ID.

There is also the option of having several indexes. If any other column needs to be accessed quickly, we create an index for it as well. For example, we could create a designation index and query employees by designation faster! However, each new index adds an additional load to the database.

The Cost of Indexing

Firstly, each index hash requires additional Memory space. It's crucial to remember to index only the columns that will be queried frequently. Otherwise, indexing each column would take up a huge amount of Memory space.

Secondly, there will be slightly slower write operations for the quick read operations. We must construct an item in the hash index every time we add an entry to the table. This makes insert operations slow.

TL;DR

A database index can help speed up read queries.
Memory usage increases for each index added.
Adding an index has an impact on database writing operations.

In this article, we saw a simple file-based data storage with offset-based indexing. There are many other approaches like BTree or B+Tree but the concept remains relevant.

I always wondered how to enhance my API performances and indexing was one of many ways we can use to make our APIs faster.

Hope you like it. Cheers 🍻