February 17, 2022
I run a newsreading service called Blaðberi.is. The service reads the RSS feeds of all major media in Iceland, saves the items to a database and presents them to visitors immediately.
I wanted to add search functionality to the website, but I noticed immediately when I started testing that the basic search functionality of Postgres (e.g. SELECT title FROM items WHERE title ILIKE '%hello%') is very slow, as it needs to scan the entire table every time a search is performed.
The solution is to index the text and use the index for searching. But for the index to work, the text needs to be vectorized, i.e. analyzed and extracted into a list of words which can be searched for. To create this index, Postgres first needs to convert the text into a tsvector.
The issue with Icelandic is that every noun and name has 8 forms: 4 declension cases in the singular and 4 in the plural. So a tsvector built from thousands of items would quickly become very large, and searches would also return incomplete results, since a query for one form of a word would miss the other seven.
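To illustrate what vectorization does, here is to_tsvector with the built-in English configuration (since the Icelandic one is only set up further below): the text is normalized into a sorted list of lexemes with their positions, with stop words like "the" dropped:

SELECT to_tsvector('english', 'The quick brown foxes jumped');
-- 'brown':3 'fox':4 'jump':5 'quick':2

Note how "foxes" and "jumped" are reduced to their base forms, so a search for "fox" matches "foxes". This is exactly the normalization Icelandic needs, but across its 8 noun forms.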
Unfortunately, Postgres does not include any language support for Icelandic. But a few individuals from the Icelandic open-source community have created an Icelandic dictionary, called Hunspell-is, which is mainly used for spell-checking in the LibreOffice software.
I found this thread on StackOverflow which helped me figure out how to use a custom dictionary like the Icelandic one.
To add the Icelandic dictionary to Postgres, I did the following:
1. Downloaded the is.aff and is.dic files from the FreeDesktop repository
2. Renamed is.aff -> is_is.affix and is.dic -> is_is.dict (Postgres expects the .affix and .dict file extensions)
3. Copied both files into the $SHAREDIR/tsearch_data directory
4. Created an empty stopword file icelandic.stop in $SHAREDIR/tsearch_data
5. Ran the following:
CREATE TEXT SEARCH DICTIONARY icelandic_hunspell (
TEMPLATE = ispell,
DictFile = is_is,
AffFile = is_is,
Stopwords = icelandic);
Now the dictionary is available in Postgres. The next step is to create a text search configuration which specifies how Postgres should use it. The Postgres docs provided the information I needed.
First, create a new text search configuration, using English defaults which we will then change:
CREATE TEXT SEARCH CONFIGURATION public.icelandic ( COPY = pg_catalog.english );
Next, we alter this configuration to use the Icelandic dictionary first, and a language-agnostic dictionary as a backup:
ALTER TEXT SEARCH CONFIGURATION icelandic
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH icelandic_hunspell, simple;
Here, simple is a built-in Postgres dictionary which returns the lower-cased version of the word; it acts as a fallback for words the Hunspell dictionary does not recognize.
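To verify that the dictionary works, ts_lexize can be used to look up a single word directly. The exact lexemes returned depend on the Hunspell files, so no output is shown here; an empty result would indicate the dictionary files were not loaded correctly:

-- Look up an inflected Icelandic word in the new dictionary.
-- (hundarnir means "the dogs"; the returned lexemes depend on the Hunspell files)
SELECT ts_lexize('icelandic_hunspell', 'hundarnir');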
Lastly, we drop certain token types to reduce index bloat:
ALTER TEXT SEARCH CONFIGURATION icelandic
DROP MAPPING FOR email, url, url_path, sfloat, float;
Now we have configured Postgres text search to look up words in the Icelandic Hunspell dictionary, fall back to simple lower-casing for unknown words, and skip emails, URLs and numbers entirely.
We want to compute the tsvector for each item beforehand, as computing it for every search would defeat the purpose of the index: it would be no faster than the basic search functionality. The Postgres docs also provide instructions for this.
The table I will be searching looks like this:
CREATE TABLE items (
id BIGSERIAL PRIMARY KEY,
pubdate timestamp(0) without time zone,
medium text NOT NULL,
publisher text NOT NULL,
category text NOT NULL,
title text NOT NULL,
body text,
url text NOT NULL,
inserted_at timestamp(0) without time zone NOT NULL,
updated_at timestamp(0) without time zone NOT NULL
);
To store the index, we create a stored generated column as follows:
ALTER TABLE items
ADD COLUMN text_index_col tsvector
GENERATED ALWAYS AS (to_tsvector('icelandic', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
Next, we create an index on this column to make searching fast:
CREATE INDEX textsearch_idx ON items USING GIN (text_index_col);
In my testing database, I have around 4,000 items in the table. Let's compare the performance using this data.
Our baseline is the basic ILIKE search functionality (gæludýr is Icelandic for pet):
explain analyze select pubdate, title, body
from items
where title ilike '%gæludýr%' or body ilike '%gæludýr%'
Results:
Execution Time: 44.984 ms
Let's now use the index for comparison:
SELECT pubdate, title, body
FROM items
WHERE text_index_col @@ to_tsquery('icelandic', 'gæludýr')
ORDER BY pubdate DESC
Results:
Execution Time: 0.109 ms
That's a 400x speedup!
Additionally, I would like the search to be live, so that results start to show up immediately as the user starts typing. For this to work, the full-text search should match the beginning of words. I found a solution on StackOverflow, which formats the search term to allow for this. Here is my modified query for Icelandic:
WITH search AS (
SELECT to_tsquery(string_agg(lexeme || ':*', ' & ' order by positions)) AS query
FROM unnest(to_tsvector('icelandic', 'gæludýr'))
)
SELECT items.pubdate, items.title, items.body, search.query
FROM items, search
WHERE (items.text_index_col @@ search.query)
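To see what the inner query builds, here it is on its own, using the built-in simple configuration so that the result does not depend on the Hunspell files: the term is split into lexemes, each lexeme gets the :* prefix-match suffix, and the pieces are joined with the & operator:

SELECT to_tsquery(string_agg(lexeme || ':*', ' & ' ORDER BY positions)) AS query
FROM unnest(to_tsvector('simple', 'gæludýr'));
-- query: 'gæludýr':*

The resulting tsquery matches any lexeme starting with gæludýr, which is what makes matches appear while the user is still typing.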
For the live search to be intuitive, I also want the matched words to be highlighted in the results. Postgres includes a function for this as well, called ts_headline().
This function is expensive computationally, as Postgres needs to fetch the whole text, read it, and modify it to highlight the matched words. Thus, we only use this function on the already matched results, by finding the matched records using the above query as a subquery:
SELECT pubdate,
ts_headline(title, query, 'StartSel = <em>, StopSel = </em>, HighlightAll=True') AS title,
ts_headline(body, query, 'StartSel = <em>, StopSel = </em>,MaxFragments=2,MaxWords=31, MinWords=30') AS body
FROM (WITH search AS
(SELECT to_tsquery(string_agg(lexeme || ':*', ' & '
ORDER BY positions)) AS query
FROM unnest(to_tsvector('icelandic', 'gæludý')))
SELECT items.pubdate,
items.title,
items.body,
search.query
FROM items,
search
WHERE (items.text_index_col @@ search.query) LIMIT 10) AS foo
This still incurs a significant performance penalty: EXPLAIN ANALYZE now shows an execution time of 3.5 ms instead of the previous 0.1 ms. However, this penalty should not grow with the number of records, as the subquery limits them to 10, so ts_headline() runs on at most 10 rows regardless of the number of items in the database.
I now have a first version of a fast full-text search engine in Icelandic ready in Postgres. Now, all that is left is to implement it in the app itself.