Full-text search

Full-text search is a powerful way to search text in the database. You should already be familiar with how indexes work, but let's go over the basics.

An index works like a lookup table that allows the query engine to quickly find records with a given value. For example, if you have a title field in your object, you can create an index on that field to make it faster to find objects with a given title.

Why is full-text search useful?

You can easily search text using filters. There are various string operations for example .startsWith(), .contains() and .matches(). The problem with filters is that their runtime is O(n) where n is the number of records in the collection. String operations like .matches() are especially expensive.

TIP

Full-text search is much faster than filters but indexes have some limitations. In this recipe we will explore how to work around these limitations.

Basic example

The idea is always the same: Instead of indexing the whole text, we index the words in the text so we can search for them individually.

Let's create the most basic full-text index:

class Post {
    int? id;

    late String title;

    @Index()
    List<String> get titleWords => title.split(' ');
}

We can now search for posts with specific words in the title:

final posts = await isar.posts
  .where()
  .titleWordsAnyEqualTo('hello')
  .findAll();

This query is super fast but there are some problems:

  1. We can only search for entire words
  2. We do not consider punctation
  3. We do not support other whitespace characters

Splitting text the right way

Let's try to improve the previous example. We could try to come up with a complicated regex to fix word splitting but it will likely be slow and wrong for edge cases.

The Unicode Annex #29open in new window defines how to split text into words correctly for almost all languages. It is quite complicated but fortunately, Isar does the heavy lifting for us:

Isar.splitWords('hello world'); // -> ['hello', 'world']

Isar.splitWords('The quick (“brown”) fox can’t jump 32.3 feet, right?');
// -> ['The', 'quick', 'brown', 'fox', 'can’t', 'jump', '32.3', 'feet', 'right']

I want more control

Easy peasy! We can change our index to also support prefix matching and case-insensitive matching:

class Post {
    int? id;

    late String title;

    @Index(type: IndexType.value, caseSensitive: false)
    List<String> get titleWords => title.split(' ');
}

By default Isar will store the words as hashed values. This is a good thing because it is faster and more space efficient. But hashes can't be used for prefix matching. By using IndexType.value, we can change the index to use the words directly instead. It gives us the .titleWordsAnyStartsWith() where clause:

final posts = await isar.posts
  .where()
  .titleWordsAnyStartsWith('hel')
  .or()
  .titleWordsAnyStartsWith('welco')
  .or()
  .titleWordsAnyStartsWith('howd')
  .findAll();

I also need .endsWith()

Sure thing! We will use a trick to achieve .endsWith() matching:

class Post {
    int? id;

    late String title;

    @Index(type: IndexType.value, caseSensitive: false)
    List<String> get revTitleWords {
        return Isar.splitWords(title).map((word) => word.reversed).toList();
    }
}

Don't forget to also reverse the ending you want to search for:

final posts = await isar.posts
  .where()
  .revTitleWordsAnyStartsWith('lcome'.reversed)
  .findAll();

Stemming algorithms

Unfortunately indexes do not support contains matching (this is true for other databases as well). But there are a few alternatives that are worth exploring. It highly depends on your use case which one you choose. You can for example index word stems instead of the full word.

A stemming algorithm is a process of linguistic normalisation, in which the variant forms of a word are reduced to a common form, for example,

connection
connections
connective          --->   connect
connected
connecting

Popular examples are the Porter stemming algorithmopen in new window and the Snowball stemming algorithmsopen in new window.

There are also more advanced forms like lemmatizationopen in new window.

Phonetic algorithms

A phonetic algorithmopen in new window is an algorithm for indexing of words by their pronunciation. In other words, it allows you to find words that sound similar to the one you are looking for.

WARNING

Most phonetic algorithms only support a single language.

Soundex

Soundexopen in new window is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. It is a very simple algorithm and there are multiple improved versions.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261".

Double Metaphone

The Double Metaphoneopen in new window phonetic encoding algorithm is the second generation of this algorithm. It makes a number of fundamental design improvements over the original Metaphone algorithm.

Double Metaphone tries to account for myriad irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origins.