February 17, 2022
I run a newsreading service called Blaðberi.is. The service reads the RSS feeds of all major media in Iceland, saves the items to a database and presents them to visitors immediately.
I wanted to add search functionality to the website, but I noticed immediately when I started testing that the basic search functionality of Postgres (e.g. SELECT title FROM items WHERE title ILIKE '%hello%') is very slow, as it needs to scan the entire table every time a search is performed.
The solution is to index the text and use the index for searching. But for the index to work, the text needs to be vectorized, i.e. analyzed and extracted into a list of words which can be searched for. To create this index, Postgres first needs to convert the text into a tsvector.
The issue with Icelandic is that every noun and name has 8 forms: 4 declension cases in the singular and 4 in the plural. So a tsvector built from thousands of items would quickly become very large, and searches would also return incomplete results, since a query for one form of a word would miss the other seven.
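To illustrate what vectorization does, here is to_tsvector with the built-in English configuration (since the Icelandic one is only set up further below): the text is normalized into a sorted list of lexemes with their positions, with stop words like "the" dropped:

SELECT to_tsvector('english', 'The quick brown foxes jumped');
-- 'brown':3 'fox':4 'jump':5 'quick':2

Note how "foxes" and "jumped" are reduced to their base forms, so a search for "fox" matches "foxes". This is exactly the normalization Icelandic needs, but across its 8 noun forms.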
Unfortunately, Postgres does not include any language support for Icelandic. But a few individuals from the Icelandic open-source community have created an Icelandic dictionary, called Hunspell-is, which is mainly used for spell-checking in the LibreOffice software.
I found this thread on StackOverflow which helped me figure out how to use a custom dictionary like the Icelandic one.
To add the Icelandic dictionary to Postgres, I did the following:
1. Downloaded the is.aff and is.dic files from the FreeDesktop repository
2. Renamed is.aff -> is_is.affix and is.dic -> is_is.dict (Postgres expects the .affix and .dict file extensions)
3. Copied both files into the $SHAREDIR/tsearch_data directory
4. Created an empty stopword file icelandic.stop in $SHAREDIR/tsearch_data
5. Ran the following:
CREATE TEXT SEARCH DICTIONARY icelandic_hunspell (
TEMPLATE = ispell,
DictFile = is_is,
AffFile = is_is,
Stopwords = icelandic);
Now the dictionary is available in Postgres. The next step is to create a text search configuration which specifies how Postgres should use it. The Postgres docs provided the information I needed.
First, create a new text search configuration, using English defaults which we will then change:
CREATE TEXT SEARCH CONFIGURATION public.icelandic ( COPY = pg_catalog.english );
Next, we alter this configuration to use the Icelandic dictionary first, and a language-agnostic dictionary as a backup:
ALTER TEXT SEARCH CONFIGURATION icelandic
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH icelandic_hunspell, simple;
Here, simple is a built-in Postgres dictionary which returns the lower-cased version of the word; it acts as a fallback for words the Hunspell dictionary does not recognize.
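To verify that the dictionary works, ts_lexize can be used to look up a single word directly. The exact lexemes returned depend on the Hunspell files, so no output is shown here; an empty result would indicate the dictionary files were not loaded correctly:

-- Look up an inflected Icelandic word in the new dictionary.
-- (hundarnir means "the dogs"; the returned lexemes depend on the Hunspell files)
SELECT ts_lexize('icelandic_hunspell', 'hundarnir');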
Lastly, we drop certain token types to reduce index bloat:
ALTER TEXT SEARCH CONFIGURATION icelandic
DROP MAPPING FOR email, url, url_path, sfloat, float;
Now we have configured Postgres text search to look up words in the Icelandic Hunspell dictionary, fall back to simple lower-casing for unknown words, and skip emails, URLs and numbers entirely.
We want to compute the tsvector for each item beforehand, as computing it for every search would defeat the purpose of the index: it would be no faster than the basic search functionality. The Postgres docs also provide instructions for this.
The table I will be searching looks like this:
CREATE TABLE items (
id BIGSERIAL PRIMARY KEY,
pubdate timestamp(0) without time zone,
medium text NOT NULL,
publisher text NOT NULL,
category text NOT NULL,
title text NOT NULL,
body text,
url text NOT NULL,
inserted_at timestamp(0) without time zone NOT NULL,
updated_at timestamp(0) without time zone NOT NULL
);
To store the index, we create a stored generated column as follows:
ALTER TABLE items
ADD COLUMN text_index_col tsvector
GENERATED ALWAYS AS (to_tsvector('icelandic', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
Next, we create an index on this column to make searching fast:
CREATE INDEX textsearch_idx ON items USING GIN (text_index_col);
In my testing database, I have around 4,000 items in the table. Let's compare the performance using this data.
Our baseline is the basic ILIKE search functionality (gæludýr is Icelandic for pet):
explain analyze select pubdate, title, body
from items
where title ilike '%gæludýr%' or body ilike '%gæludýr%'
Results:
Execution Time: 44.984 ms
Let's now use the index for comparison:
SELECT pubdate, title, body
FROM items
WHERE text_index_col @@ to_tsquery('icelandic', 'gæludýr')
ORDER BY pubdate DESC
Results:
Execution Time: 0.109 ms
That's a 400x speedup!
Additionally, I would like the search to be live, so that results start to show up immediately as the user starts typing. For this to work, the full-text search should match the beginning of words. I found a solution on StackOverflow, which formats the search term to allow for this. Here is my modified query for Icelandic:
WITH search AS (
SELECT to_tsquery(string_agg(lexeme || ':*', ' & ' order by positions)) AS query
FROM unnest(to_tsvector('icelandic', 'gæludýr'))
)
SELECT items.pubdate, items.title, items.body, search.query
FROM items, search
WHERE (items.text_index_col @@ search.query)
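To see what the inner query builds, here it is on its own, using the built-in simple configuration so that the result does not depend on the Hunspell files: the term is split into lexemes, each lexeme gets the :* prefix-match suffix, and the pieces are joined with the & operator:

SELECT to_tsquery(string_agg(lexeme || ':*', ' & ' ORDER BY positions)) AS query
FROM unnest(to_tsvector('simple', 'gæludýr'));
-- query: 'gæludýr':*

The resulting tsquery matches any lexeme starting with gæludýr, which is what makes matches appear while the user is still typing.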
For the live search to be intuitive, I also want the matched words to be highlighted in the results. Postgres includes a function for this as well, called ts_headline().
This function is expensive computationally, as Postgres needs to fetch the whole text, read it, and modify it to highlight the matched words. Thus, we only use this function on the already matched results, by finding the matched records using the above query as a subquery:
SELECT pubdate,
ts_headline(title, query, 'StartSel = <em>, StopSel = </em>, HighlightAll=True') AS title,
ts_headline(body, query, 'StartSel = <em>, StopSel = </em>,MaxFragments=2,MaxWords=31, MinWords=30') AS body
FROM (WITH search AS
(SELECT to_tsquery(string_agg(lexeme || ':*', ' & '
ORDER BY positions)) AS query
FROM unnest(to_tsvector('icelandic', 'gæludý')))
SELECT items.pubdate,
items.title,
items.body,
search.query
FROM items,
search
WHERE (items.text_index_col @@ search.query) LIMIT 10) AS foo
This still incurs a significant performance penalty: EXPLAIN ANALYZE now shows an execution time of 3.5 ms instead of the previous 0.1 ms. However, this penalty should not grow with the number of records, as the subquery limits them to 10, so ts_headline() runs on at most 10 rows regardless of the number of items in the database.
I now have a first version of a fast full-text search engine in Icelandic ready in Postgres. Now, all that is left is to implement it in the app itself.