It was a typical day of coding, and I was working on a straightforward task: building a user registration page. Everything seemed fine until I hit the submit button to check whether a username already existed in the database. And then, time stood still. Seconds turned into minutes, then hours, with no result in sight. Frustration set in as I realized this wasn’t a one-time glitch; it was a performance issue.
What went wrong? As Arundhati Roy said,
The trouble is, once you see it, you can’t unsee it.
My system wasn’t optimized to handle the increasing load of user data. The username check was performed with a full table scan, which took longer and longer as the user base grew.
This experience opened my eyes to various strategies for optimising such scenarios.
Once, I visited a library and found a situation very close to what we are discussing.
Suppose you walked into a library with no sections or genres, just endless racks of books scattered everywhere. If you were asked to find a specific book, you would have to search every shelf, which could take hours. This is exactly like running a full table scan in a database. It works, but it becomes highly inefficient as the data grows.
Now, think about what could improve this experience. If the library had genre-based sections, you could go directly to the section you need, cutting down your search time significantly. This is like using an index in a database—it allows you to quickly find what you’re looking for without scanning every entry.
Another way to speed up the process would be to place popular books on a display right at the reception. If the book you’re looking for happens to be popular, you can grab it immediately without searching the shelves—this is similar to caching frequently accessed data for quicker retrieval.
Finally, you could ask the librarian if the book is available in the library at all. If they say it’s not, you save time by not looking for it—this is how a Bloom filter works, helping you avoid unnecessary database queries when something definitely doesn’t exist.
Indexing: A frequently used strategy
The first thing I did was implement indexing on the username field in the database. Indexing is like a well-organized book’s index—it allows the database to quickly locate a record instead of scanning every entry. Without an index, the database performs a full table scan, which becomes slower as the data grows.
By adding an index to the username column, I reduced the query time from hours to milliseconds.
-- Create an index on the username column
CREATE INDEX idx_username ON users(username);
-- Optimized query using the index
SELECT * FROM users WHERE username = 'soumya';
Before the index, the system had to go through every record to find if a username existed. With the index, the database could jump directly to the record it needed, drastically improving performance. However, indexing comes with a cost—adding too many indexes can slow down inserts and updates, so it’s important to use them wisely.
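You can watch the planner make this switch yourself. Here is a small sketch using SQLite’s EXPLAIN QUERY PLAN (other databases expose the same information through EXPLAIN or EXPLAIN ANALYZE); the table and data are illustrative:

```python
import sqlite3

# In-memory database with a toy users table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany(
    "INSERT INTO users (username) VALUES (?)",
    [(f"user{i}",) for i in range(1000)],
)

# Before the index: the planner has no choice but a full scan
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE username = ?", ("soumya",)
).fetchone()[-1]

conn.execute("CREATE INDEX idx_username ON users(username)")

# After the index: the planner searches idx_username instead
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE username = ?", ("soumya",)
).fetchone()[-1]

print(plan_before)  # reports a scan of the users table
print(plan_after)   # reports a search using idx_username
```

Checking the plan like this before and after adding an index is a quick way to confirm the index is actually being used.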
Caching: Speed Up Frequent Checks
Even after optimizing with indexing, I realized that the system was still doing redundant checks for usernames that had already been queried. This is where caching came into play.
Caching involves storing frequently accessed data in memory (like Redis or Memcached) to avoid querying the database for every request. For example, once a username check is done, it can be stored in a cache, so if the same username is checked again, the result can be returned instantly.
def check_username(username):
    # Check if the result is in the cache first
    cached = cache.get(f'username:{username}')
    if cached is not None:
        return cached
    # Cache miss: query the database (parameterized to avoid SQL injection)
    user = db.query('SELECT * FROM users WHERE username = ?', (username,))
    cache.set(f'username:{username}', user)
    return user
This simple strategy dramatically cut down on database load and response time.
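One practical wrinkle is keeping cached results fresh. A common answer is to give every entry a time-to-live (TTL) so stale data expires on its own. Redis and Memcached support this natively through key expirations; the small in-memory class below is purely illustrative of the mechanism:

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-entry expiry (a sketch; in
    production, a store like Redis with key expirations plays this role)."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=0.1)
cache.set("username:soumya", {"id": 1, "username": "soumya"})
```

A short TTL trades a few extra database hits for the guarantee that stale entries disappear quickly, which is often simpler than explicit invalidation.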
Bloom Filters: Avoiding Unnecessary Queries
While caching is great, it has its own challenges, like invalidation and keeping the cache up-to-date. This led me to explore Bloom filters, a probabilistic data structure that tells you if an element might be present in a set or if it definitely is not.
In my case, using a Bloom filter helped prevent unnecessary database queries. If the Bloom filter returned a negative result, I knew for sure that the username didn’t exist, so I could skip the database check altogether. However, a positive result required confirming with a database query, as there’s a small chance of false positives.
def check_username(username):
    # Check the Bloom filter first
    if not bloom_filter.check(username):
        # Definitely does not exist; no need to query the database
        return "Username available"
    # Might exist (false positives are possible), so confirm with a query
    return db.query('SELECT * FROM users WHERE username = ?', (username,))
Bloom filters are highly efficient when dealing with large datasets, as they help reduce the load on the database by filtering out negative checks.
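The mechanics fit in a few lines of Python. This is a toy implementation (real systems use tuned libraries, and the bit-array size and hash count here are arbitrary), but it shows why a "no" is definitive while a "yes" is only probable:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit array.
    Adding an item sets its k bits; checking tests whether all k are set."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive k deterministic positions by salting the hash input
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def check(self, item):
        # True means "might exist" (another item may have set the same bits);
        # False means "definitely not present" (no false negatives)
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("soumya")
```

Because the filter is just a bit array, it stays tiny even for millions of usernames, which is what makes it so cheap to consult before every query.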
Hybrid Strategy: Combining Indexing, Caching, and Bloom Filters
After experimenting with indexing, caching, and Bloom filters separately, I realized that the best results came from a hybrid approach. Each method has its strengths, and using them together ensures optimal performance.
Here’s how I combined these strategies:
1. The system first checks the Bloom filter to see if the username might exist.
2. If the Bloom filter says no, the username doesn’t exist, and no further checks are needed.
3. If the Bloom filter says yes, the system checks the cache.
4. If the username is found in the cache, the result is served immediately.
5. If not, the system queries the indexed database, updates both the cache and the Bloom filter, and serves the result.
def check_username(username):
    # 1. Bloom filter: a negative answer is definitive
    if not bloom_filter.check(username):
        return "Username available"
    # 2. Bloom filter says it might exist, check the cache next
    cached = cache.get(f'username:{username}')
    if cached is not None:
        return cached
    # 3. Cache miss: fall through to the indexed database
    user = db.query('SELECT * FROM users WHERE username = ?', (username,))
    if user is None:
        # The Bloom filter gave a false positive
        return "Username available"
    # 4. Found: update the cache and the Bloom filter, then serve the result
    cache.set(f'username:{username}', user)
    bloom_filter.add(username)  # harmless if already present
    return user
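One detail the lookup path depends on is the write path: the Bloom filter and cache only stay trustworthy if registration updates them too. Here is a sketch of that ordering, with plain dicts and a set standing in for the database, cache, and Bloom filter (all names are illustrative); the database is written first so the filter never claims a username the database lacks:

```python
db = {}       # stand-in for the authoritative users table
cache = {}    # stand-in for Redis/Memcached
bloom = set() # stand-in for the Bloom filter (same add/contains idea)

def register_user(username):
    if username in db:
        raise ValueError("username taken")
    db[username] = {"username": username}         # 1. authoritative store first
    bloom.add(username)                           # 2. future checks short-circuit
    cache[f"username:{username}"] = db[username]  # 3. warm the read path

register_user("soumya")
```

With this in place, the hybrid lookup never produces a false "available" for a registered user, because every registration reaches the filter before any later check.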
Lessons Learned
Reflecting on this experience, I was reminded of Chetan Bhagat’s quote:
Success comes from making better decisions and avoiding problems early on.
In my case, the problem wasn’t the code—it was the system design. Not planning for scalability early on led to performance issues that could have been avoided. By applying the right optimization techniques, I transformed an inefficient process into a scalable, high-performance system.
FAQs
How does indexing improve query performance?
Indexing allows the database to quickly locate records, bypassing the need for a full table scan, significantly reducing query time for large datasets.
What is caching, and how does it help?
Caching stores frequently accessed data in memory, reducing the need to repeatedly query the database and improving response time for repeated queries.
How does a Bloom filter work?
A Bloom filter is a space-efficient probabilistic data structure that helps avoid unnecessary database queries by determining if an element is likely in a set or definitely not.
When should I use a hybrid strategy?
A hybrid strategy combining caching, indexing, and Bloom filters is ideal for systems that need to handle large volumes of data while minimizing database queries and maximizing performance.
What are the downsides of too many indexes?
While indexes improve read performance, having too many can slow down writes (like inserts and updates) since the database has to maintain the index structures.
Conclusion
Optimizing system performance is not just about writing clean code—it’s about planning for scale and implementing strategies that grow with your application. Indexing, caching, and Bloom filters each play a vital role in solving performance issues, but combining them in a hybrid approach provides the best results.
As Rabindranath Tagore said,
You can’t cross the sea merely by standing and staring at the water.
You have to dive deep into system design, explore potential issues, and implement solutions that ensure your application can handle increasing loads smoothly.