How I Rebuilt a Search System
Background
In the course of refactoring an older Django project involving books, I ended up building a new search system. It comprised an autocomplete and a main search that had to complement one another, so that the experience for the user was predictable, responsive, and accessible.
You can study the completed files for yourself in my GitHub repo here - they're currently in the 'refactor/ui' branch but will show up in the master branch soon. The system consists of the search bar (markup and styles), the frontend JS logic, the Django views (autocomplete, book_search), and a Python file of helpers used by the views.
Better links for the above paragraph will show up in the future.
Below I'll try to explain how I did it, for my benefit as much as anyone else's.
What was the goal?
In my existing Django project, I had a main search and an autocomplete search, but these were both quite different:
Notice how the autocomplete performs a naive title-only text search, while the main book search takes advantage of powerful Postgres tools like SearchVector, SearchQuery and SearchRank to filter and rank across both title and author name.
What would often happen with this split was that the autocomplete results would differ from the main search results, sometimes in peculiar ways. This meant the user experience wasn't great.
One approach I could have taken here was to study these Postgres tools further, use them to improve the main search, and then make the autocomplete consistent with how the main search operates. But I wanted to move away from these tools and instead come up with my own custom-built way of searching titles.
Unified search model
The unified search system I settled on uses this search filter:
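The original snippet isn't reproduced here, so below is a rough sketch of its shape, reconstructed from the description that follows. All the names (`build_search_groups`, the field lookups) are hypothetical, and the plural handling described further down is omitted for brevity; in the real code each group would be built from Django `Q` objects.

```python
def build_search_groups(terms, allow_prefix=False):
    """Build one OR-group of (field_lookup, value) pairs per search term.

    A book must satisfy every group (AND across terms, OR within a group).
    Plain tuples stand in for Django Q objects to keep the sketch
    self-contained.
    """
    *completed, active = terms
    groups = []
    for term in completed:
        groups.append([
            ("title__icontains", term),
            ("author__first_name__iexact", term),
            ("author__last_name__iexact", term),
        ])
    # The in-progress word is treated differently: for autocomplete
    # (allow_prefix=True) it may prefix-match an author name, while the
    # main search demands an exact author match.
    active_group = [("title__icontains", active)]
    if allow_prefix:
        active_group.append(("author__last_name__istartswith", active))
    else:
        active_group.append(("author__last_name__iexact", active))
    groups.append(active_group)
    return groups
```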
There is a fair bit to unpack here.
First of all, the function takes in a list of terms representing each word in a query, e.g. `['quick', 'brown', 'fox']`.
This list of terms is then split between the last term and everything else - or rather the 'active' term and the 'completed' terms. We want to do this because how we search the last term for an autocomplete needs to be different from how we search the last term for the main search. That is, the last term should be fuzzily matched to an author name when using autocomplete, whereas in the main search we only want exact matches to author names.
The way in which we distinguish between main search and autocomplete search is via the parameter allow_prefix - if it's set to True then the autocomplete branch of the logic will run.
The list of terms is then iterated over, and within each iteration we consider plurals - for each term we ensure that both a plural form and a singular form are present.
The helper function below builds a set of two terms - the original term and its most likely plural form (or vice versa). So, if a user searches 'diary', the word 'diaries' will also be included in the search filter; if a user searches 'catches', then 'catch' is also included, and so on. The intention was not to over-engineer this kind of function by trying to pick out every potential edge case - you don't want to invent words, or be overly predictive about what the user is entering.
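Something along these lines - the helper's name and the exact suffix rules here are my guesses, covering only a handful of common English patterns, deliberately:

```python
def plural_forms(term):
    """Return a set containing the term plus one likely plural or
    singular counterpart, using a few deliberately simple suffix rules."""
    forms = {term}
    if term.endswith("ies") and len(term) > 4:
        forms.add(term[:-3] + "y")            # diaries -> diary
    elif term.endswith(("ches", "shes", "sses", "xes", "zes")):
        forms.add(term[:-2])                  # catches -> catch
    elif term.endswith("s") and not term.endswith("ss"):
        forms.add(term[:-1])                  # books -> book
    elif term.endswith("y") and len(term) > 2 and term[-2] not in "aeiou":
        forms.add(term[:-1] + "ies")          # diary -> diaries
    elif term.endswith(("ch", "sh", "s", "x", "z")):
        forms.add(term + "es")                # fox -> foxes
    else:
        forms.add(term + "s")                 # book -> books
    return forms
```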
For each form we add filters to our query, and pass the combined query filter to the Book.objects.filter method:
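The combination itself has the shape of an AND of ORs: OR within each term's group, AND across the groups. With Django Q objects that fold produces a single expression to hand to Book.objects.filter. The toy below (my own illustration, not the original code) demonstrates just the folding, with booleans standing in for whether a given lookup matched:

```python
from functools import reduce
from operator import and_, or_

def combine(match_groups):
    """OR within each group, AND across groups. With Q objects the same
    fold yields one filter expression for Book.objects.filter."""
    return reduce(and_, (reduce(or_, group) for group in match_groups))

# A book matching at least one lookup in every term's group passes.
```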
Normalising the input and using tokens
This has to be done in both the frontend and the backend. The main idea is that the search is based around meaningful tokens of a certain minimum length. So, for example, this is how I normalised the input in the frontend:
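A sketch of the idea - this is my reconstruction, and the names and minimum token length are placeholders rather than the original code:

```javascript
const MIN_TOKEN_LENGTH = 2; // placeholder value

function normaliseQuery(raw) {
  // Lowercase, strip anything that isn't a letter, digit or space,
  // split into tokens, and drop tokens below the minimum length.
  const tokens = raw
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length >= MIN_TOKEN_LENGTH);
  // Joined back into a single string, ready for the autocomplete endpoint.
  return tokens.join(" ");
}
```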
The function above joins the tokens back into a single string because the whole thing has to be sent to the autocomplete endpoint. The Django view that handles this request does the same normalising, but keeps the words as tokens so they can be used both to filter the results and to score them.
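On the Django side, the mirror of that normalisation might look like this - again a hypothetical sketch, with the view keeping the token list rather than re-joining it:

```python
import re

MIN_TOKEN_LENGTH = 2  # placeholder; should match the frontend's value

def tokenise_query(raw):
    """Mirror of the frontend normalisation: lowercase, strip punctuation,
    and keep the words as a list of tokens for filtering and scoring."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    return [t for t in cleaned.split() if len(t) >= MIN_TOKEN_LENGTH]
```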
Scoring the results for relevance
This was maybe the most 'fun' part of the process - figuring out a way to score every result the queryset brought back from the search. It took some thought to work out how best to score them, but it helped to group the scores into a meaningful set, so that I could play around with the order and the numbers:
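The original scoring code isn't shown here, but its shape was roughly as follows. The weights, names, and individual checks are placeholders of mine, grouped into one dict purely to illustrate the 'tuneable set' idea:

```python
# Placeholder weights, grouped so they're easy to reorder and retune.
WEIGHTS = {
    "title_exact_word": 10,
    "title_prefix": 6,
    "title_contains": 3,
    "author_exact": 8,
    "author_prefix": 4,
}

def score_book(book, terms):
    """Score one result: how well does it match the overall query?

    `book` is a dict here for the sake of a runnable sketch; in the real
    code it would be a Book model instance.
    """
    title_words = book["title"].lower().split()
    author_words = book["author"].lower().split()
    score = 0
    for term in terms:
        if term in title_words:
            score += WEIGHTS["title_exact_word"]
        elif any(w.startswith(term) for w in title_words):
            score += WEIGHTS["title_prefix"]
        elif term in book["title"].lower():
            score += WEIGHTS["title_contains"]
        if term in author_words:
            score += WEIGHTS["author_exact"]
        elif any(a.startswith(term) for a in author_words):
            score += WEIGHTS["author_prefix"]
    return score
```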
In the function above, for a single book result, we iterate through every term and build a score that represents how well that book result matches to the overall query.
We do this for every book result, where we then order the results based on this score:
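The ordering step is then a straightforward sort by that score. In this sketch the scorer is passed in (and a trivial stand-in is used) so the snippet runs on its own:

```python
def rank_results(books, terms, score_book):
    """Order results best-first by their relevance score."""
    return sorted(books, key=lambda book: score_book(book, terms), reverse=True)

# Stand-in scorer: counts how many query terms appear in the title.
demo_score = lambda book, terms: sum(t in book["title"].lower() for t in terms)

books = [{"title": "Foxes"}, {"title": "The Quick Brown Fox"}]
ranked = rank_results(books, ["quick", "fox"], demo_score)
```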
Confidently redirecting
I wanted to cater for the scenario where the user's search query most likely belongs to a single book in the database. There are different ways of going about this, but I felt that if a search left just a single result, there should be a way to determine whether that result was worthy of being redirected straight to the book page itself.
Initially I thought I could piggyback on the score_book function, but I found it difficult. That scoring system is specifically designed to rank a group of books, whereas here we are trying to determine if a single book is redirect-worthy. So, I needed to come up with a new kind of scoring system, specifically for the scenario where a search returns only one book:
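A sketch of that second pass, reconstructed from the criteria listed below - the name `should_redirect` and the exact checks are my own:

```python
def should_redirect(book, terms):
    """Decide whether a lone search result is worth redirecting to.

    `book` is a dict for the sake of a runnable sketch; in the real code
    it would be a Book model instance.
    """
    title = book["title"].lower()
    title_words = title.split()
    first = book["author_first"].lower()
    last = book["author_last"].lower()

    title_matches = sum(1 for t in terms if t in title_words)
    author_matches = sum(1 for t in terms if t in (first, last))
    exact_author = first in terms and last in terms

    if any(title.startswith(t) for t in terms):
        return True   # the title starts with one of the terms
    if title_matches == 1 and author_matches == 1:
        return True   # one title hit plus one author-name hit
    if exact_author:
        return True   # both author names matched exactly
    if title_matches > 1:
        return True   # multiple title hits
    return False
```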
So, once again I iterate through every term in the query (there's a whole lot of iterating going on in the course of a single search!), but at least this time it's only for one book.
Basically, I decided the following would meet the criteria for redirecting to the book:
- if the book title starts with one of the terms
- if there is a single match in the title and a single match in either the author's first or last name
- if there are exact matches for both the author's first and last names
- if there is more than one title match.
So, if there's only a partial match, or if only an author surname was entered, it's not enough for a redirect and the user just sees that single search result on the search results page.
Again, this was another fun exercise to work through, unlike the next one...
Dealing with autocomplete behaviour in the frontend
Aside from normalising input, there are two important things to consider doing in the frontend while carrying out an autocomplete search.
Dead-end detection
If a search query produces zero autocomplete results, but the user carries on extending their query, we should not continue hitting that endpoint. Here's what we can do to achieve this behaviour:
- We need to create some 'state' that tracks the last effective query and whether that query produced results.
- Next, in the input's change event listener, we normalise the new query and check whether the last query returned zero results and the new query only extends it - if both are true, we exit early.
- If we get past this, we do the fetch and update the state.
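Put together, the guard might look something like this - a sketch with hypothetical names, where `fetchSuggestions` stands in for the real call to the autocomplete endpoint:

```javascript
// State tracking the last query we actually sent, and whether it found anything.
let lastQuery = "";
let lastQueryHadResults = true;

async function onInput(query, fetchSuggestions) {
  // Dead-end: the last query found nothing and this one only extends it,
  // so it cannot find anything either - skip the request entirely.
  if (!lastQueryHadResults && query.startsWith(lastQuery)) {
    return [];
  }
  const results = await fetchSuggestions(query);
  lastQuery = query;
  lastQueryHadResults = results.length > 0;
  return results;
}

function resetSearchState() {
  // Called when the query dips below the minimum length, so the user
  // can begin a new query afresh.
  lastQuery = "";
  lastQueryHadResults = true;
}
```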
This guard detects when a search is at a dead-end, and protects the endpoint from being hit unnecessarily. Two points worth mentioning: if the user begins deleting their query, the database still won't be hit until the query shrinks back past the last effective query - pretty sweet! The other is that it's important to reset the state if the query's length dips below the minimum, so that the user can begin writing a new query 'afresh'.
Abort-safe fetch
If a user types really quickly, we want to avoid race conditions, UI flickering and wasted work. This is where the abort-safe fetch mechanism is really useful. The steps are:
- Wrap the code in a try-catch block. This is important so that you can swallow the AbortError that is produced when a fetch is aborted.
- Abort any previous in-flight request's controller, then create a new AbortController object.
- Pass the controller's signal to the fetch via its options.
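The steps above, sketched out - the names are mine, and the `fetchImpl` parameter is only there so the function can be exercised without a network (by default it uses the global `fetch`, which needs a modern browser or Node 18+):

```javascript
let controller = null; // the in-flight request's controller, if any

async function abortSafeFetch(url, fetchImpl = fetch) {
  // A newer keystroke supersedes any in-flight request: abort it.
  if (controller) controller.abort();
  controller = new AbortController();
  try {
    // Associate the controller's signal with this fetch.
    const response = await fetchImpl(url, { signal: controller.signal });
    return await response.json();
  } catch (err) {
    // Swallow only the abort; anything else is a real error.
    if (err.name === "AbortError") return null;
    throw err;
  }
}
```

Combined with the dead-end guard, this means only the latest keystroke's request can ever update the UI.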
Summary
So, that about wraps up how I rebuilt the search. I replaced a search system built from differing approaches with a cohesive, human-aware, intent-aware one that behaves predictably. Users aren't shown differing results between the autocomplete and the main search, except where they should be (fuzzy matching).
I didn't mention the accessibility stuff, which is important, nor did I get too much into the markup and styles and how they're wired to event listeners (how do you get an autocomplete box to appear under a search bar in an accessibility-driven way I hear you ask - well, maybe I'll save that for another exciting post!).
I also haven't really gone over the process of how I got from where I started to where I finished - suffice it to say I started simple, I allowed the code to be clunky and unorganised, I iterated a lot, did a lot of trial and error, a fair bit of research, and then only towards the end did I organise the code better, refine, and retune little bits here and there. I also accepted my limitations - I didn't even get into stuff like trigrams and stemming!
Anyway, this was a long, arduous task, but ultimately really fruitful and edifying. I feel a lot more knowledgeable about some of the key aspects involved in coordinating a good search system. Honestly, even though a small part of me wonders why I didn't just brush up on those Postgres tools that can do everything I managed and a whole heap more, I still feel I gained a lot. Most importantly, the search is in a much better place to serve the people who will use it.