How Semantic Search Works
My last side project was a proof-of-concept multi-agent platform called A.M.I.C.A. Among other features, it provides centralized tool management, allowing agents to search for relevant tools when handling user requests.
The initial implementation of the Tool Manager interface was the LuceneToolManager class, which used an in-memory Lucene index to search for the tools relevant to a user prompt.
After publishing that article, I thought it would make sense to create another implementation based on semantic search, a more powerful mechanism for identifying relationships between text fragments.
The term “semantic search” refers to using the meaning of words to find similarities, in contrast with lexical search, which matches keywords without understanding what they mean. RAG (Retrieval-Augmented Generation) applications are a common example of systems that use semantic search to identify which documents are relevant to a user query.
In this article I will describe how each search strategy works and compare their results to understand the difference.
Lexical Search
Lexical (keyword-based) search relies on a candidate containing at least one word of the query to consider it relevant. The input texts are used to build an index through the following process:
- Convert each text entry into a collection of unique elements. Several techniques are used to achieve this:
- Tokenization reduces each text into individual parts (tokens). In its simplest form, it uses spaces and punctuation marks as separators.
- Stemming reduces each word to its stem, for example removing plurals.
- Stop Words Filtering removes the most common words in a language. This will reduce noise and increase search performance.
- Normalization removes accents and other marks.
- Synonym Expansion adds known synonyms of each word to the index.
- Build a map, known as an inverted index, where the keys are the unique tokens and the values are the lists of input texts where each token was found.
Then, when we introduce a query, it is converted to tokens using the same process, and the index is used to identify the input texts that contain at least one of those tokens. The relevance of each match is typically derived from statistics such as how many query tokens a text contains and how rare those tokens are across the whole collection (the intuition behind scoring functions like TF-IDF and BM25).
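To make the mechanics concrete, here is a minimal sketch of such an inverted index in plain Java. The class and method names are illustrative only; real engines like Lucene layer stemming, stop-word filtering and much more sophisticated scoring on top of this idea:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class InvertedIndexSketch {

  // Token -> list of input texts that contain it.
  private final Map<String, List<String>> index = new HashMap<>();

  // Simplest possible analysis: lowercase, then split on anything that is not a letter.
  private static Set<String> tokenize(String text) {
    return Arrays.stream(text.toLowerCase().split("[^a-z]+"))
        .filter(token -> !token.isEmpty())
        .collect(Collectors.toSet());
  }

  public void add(String text) {
    for (String token : tokenize(text)) {
      index.computeIfAbsent(token, t -> new ArrayList<>()).add(text);
    }
  }

  // Rank texts by how many query tokens they contain (a crude stand-in for real scoring).
  public List<String> search(String query) {
    Map<String, Integer> hits = new HashMap<>();
    for (String token : tokenize(query)) {
      index.getOrDefault(token, List.of()).forEach(text -> hits.merge(text, 1, Integer::sum));
    }
    return hits.entrySet().stream()
        .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
        .map(Map.Entry::getKey)
        .toList();
  }
}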
Lucene is a powerful search engine library that provides analysis tools for common use cases. It powers many of the text search utilities you use every day.
Lucene also supports vector search, but for this example I focused on classical text analysis, using the built-in EnglishAnalyzer.
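To see what that analyzer does to a sentence, here is a small sketch using Lucene's TokenStream API (the field name "description" is arbitrary):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {

  public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new EnglishAnalyzer();
        TokenStream stream =
            analyzer.tokenStream("description", "How can I control pests on my cotton farm?")) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      // Prints one processed token per line: stop words such as "on" are dropped,
      // and "pests" is stemmed to "pest".
      while (stream.incrementToken()) {
        System.out.println(term.toString());
      }
      stream.end();
    }
  }
}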
Semantic Search
Semantic search uses a very different approach. It converts each word to a vector (a sequence of numbers) that represents its meaning. Then, it defines the similarity between two words as the distance between their vector representations.
The process of representing words as numerical vectors is called embedding and uses machine learning models trained with huge amounts of text to identify word relations.
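The metric most commonly used for this is cosine similarity, the cosine of the angle between the two vectors: 1.0 means they point in the same direction (very similar meaning), and lower values mean they diverge. It reduces to a few lines of plain Java:

public class Similarity {

  // Cosine similarity: the dot product of the vectors divided by the product of their magnitudes.
  static double cosineSimilarity(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}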
Modern embedding models are based on BERT (“Bidirectional Encoder Representations from Transformers”), introduced by Google in 2018 in the famous paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. They used a dataset of more than 3 billion words, combining the BookCorpus and the English Wikipedia, and trained the model by randomly masking 15% of the input tokens and asking it to predict them using both left and right context (hence the term bidirectional, in contrast with unidirectional models like GPT).
This deep learning process identifies the different roles that a word can have in text and represents them as vectors in a high-dimensional space. In that space, the word “apple” will be close to “banana” or “grape”, but also to “smartphone”, “microsoft” or “mac”.
This goes beyond the concept of synonyms, as it also identifies words with related meanings, such as “pests” and “insects”.
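We can check this directly with an embedding model. The sketch below uses the same LangChain4J model as the comparison tool later in this article; “keyboard” is just an arbitrary unrelated word for contrast, and the exact scores will depend on the model:

import dev.langchain4j.model.embedding.EmbeddingModel;
// The package of this class may differ between langchain4j-embeddings versions.
import dev.langchain4j.model.embedding.onnx.bgesmallenv15.BgeSmallEnV15EmbeddingModel;
import dev.langchain4j.store.embedding.CosineSimilarity;

public class WordSimilarityDemo {

  public static void main(String[] args) {
    EmbeddingModel model = new BgeSmallEnV15EmbeddingModel();
    var pests = model.embed("pests").content();
    var insects = model.embed("insects").content();
    var keyboard = model.embed("keyboard").content();
    // The pests/insects score should come out noticeably higher than pests/keyboard.
    System.out.printf("pests vs insects:  %.2f%n", CosineSimilarity.between(pests, insects));
    System.out.printf("pests vs keyboard: %.2f%n", CosineSimilarity.between(pests, keyboard));
  }
}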
BERT launched the era of large-scale pre-trained Transformers in Natural Language Processing, laying the foundation for successors like RoBERTa, ALBERT, and DistilBERT.
Results Comparison
I have written a simple tool to compare the results of both search strategies. It takes a string as an argument and compares it with a fixed set of sentences:
import java.util.List;

public class SearchComparison {

  public static void main(String[] args) {
    if (args.length != 1) {
      System.out.println("Usage: SearchComparison <query>");
      System.exit(1);
    }
    // Run the same query through both implementations and print the ranked candidates.
    var searchers = List.of(new LuceneTextSearch(), new SemanticTextSearch());
    var candidates =
        List.of(
            "What is the best time to plant rice?",
            "How often should I water tomato plants?",
            "What fertilizer is good for wheat crops?",
            "How can I control pests on my cotton farm?",
            "Which crop is best for sandy soil?",
            "Is there a difference between four-wheel drive and all-wheel drive?");
    var query = args[0];
    System.out.printf("\nCandidates for query \"%s\"\n\n", query);
    for (TextSearch searcher : searchers) {
      System.out.printf("Searching with \"%s\"\n", searcher.getClass().getSimpleName());
      searcher.filterCandidates(candidates, query).forEach(System.out::println);
      System.out.println("---");
    }
  }
}
The LuceneTextSearch class uses an in-memory Lucene index to calculate the similarity between the query and the candidates:
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class LuceneTextSearch implements TextSearch {

  private static final String FIELD_DESCRIPTION = "description";
  private static final Analyzer analyzer = new EnglishAnalyzer();
  private static final QueryParser parser = new QueryParser(FIELD_DESCRIPTION, analyzer);

  @Override
  public List<Result> filterCandidates(List<String> candidates, String query) {
    var indices = buildIndices(candidates);
    Query parsedQuery = getParsedQuery(query);
    // Score every candidate index against the parsed query and sort by descending score.
    return indices.entrySet().stream()
        .map(
            entry -> {
              var score = entry.getValue().search(parsedQuery);
              return new Result(entry.getKey(), score);
            })
        .sorted(Comparator.comparingDouble(Result::score).reversed())
        .toList();
  }

  private static Query getParsedQuery(String query) {
    try {
      return parser.parse(query);
    } catch (ParseException e) {
      throw new RuntimeException(e);
    }
  }

  // Builds one in-memory index per candidate, keyed by the candidate text itself.
  private Map<String, MemoryIndex> buildIndices(List<String> candidates) {
    return candidates.stream()
        .collect(
            Collectors.toMap(
                Function.identity(),
                candidate -> {
                  var idx = new MemoryIndex();
                  idx.addField(FIELD_DESCRIPTION, candidate, analyzer);
                  return idx;
                }));
  }
}
In contrast, SemanticTextSearch uses a model from the LangChain4J embeddings library to create the embeddings, and then calculates the cosine similarity to measure text relevance:
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
// The package of this class may differ between langchain4j-embeddings versions.
import dev.langchain4j.model.embedding.onnx.bgesmallenv15.BgeSmallEnV15EmbeddingModel;
import dev.langchain4j.store.embedding.CosineSimilarity;

public class SemanticTextSearch implements TextSearch {

  private EmbeddingModel embeddingModel;

  public SemanticTextSearch() {
    // Trigger the lazy initialization eagerly so the model is loaded up front.
    getEmbeddingModel();
  }

  @Override
  public List<Result> filterCandidates(List<String> candidates, String query) {
    var embeddings = createEmbeddings(candidates);
    var queryEmbedding = getEmbeddingModel().embed(query).content();
    // Rank candidates by the cosine similarity between their embedding and the query's.
    return embeddings.entrySet().stream()
        .map(
            entry -> {
              var score = CosineSimilarity.between(entry.getValue(), queryEmbedding);
              return new Result(entry.getKey(), score);
            })
        .sorted(Comparator.comparingDouble(Result::score).reversed())
        .toList();
  }

  private Map<String, Embedding> createEmbeddings(List<String> candidates) {
    return candidates.stream()
        .collect(
            Collectors.toMap(
                Function.identity(), candidate -> getEmbeddingModel().embed(candidate).content()));
  }

  private EmbeddingModel getEmbeddingModel() {
    if (embeddingModel == null) {
      embeddingModel = new BgeSmallEnV15EmbeddingModel();
    }
    return embeddingModel;
  }
}
Thanks to this tool, we can see that when a query shares words with the candidate sentences, both mechanisms agree on the most relevant match:
Candidates for query "when is a good time to plant cotton?"
Searching with "LuceneTextSearch"
0.26 What is the best time to plant rice?
0.13 How often should I water tomato plants?
0.13 What fertilizer is good for wheat crops?
0.00 Which crop is best for sandy soil?
0.00 How can I control pests on my cotton farm?
0.00 Is there a difference between four-wheel drive and all-wheel drive?
---
Searching with "SemanticTextSearch"
0.77 What is the best time to plant rice?
0.69 How can I control pests on my cotton farm?
0.66 Which crop is best for sandy soil?
0.61 How often should I water tomato plants?
0.58 What fertilizer is good for wheat crops?
0.38 Is there a difference between four-wheel drive and all-wheel drive?
---
But things become more interesting when we introduce a query containing words that appear in none of the candidates. Here, keyword search fails to find any connection between the query and the candidates, while semantic search recognizes the relationship between “insects” and “pests”, identifying a meaningful match.
Candidates for query "We want to avoid insects"
Searching with "LuceneTextSearch"
0.00 Which crop is best for sandy soil?
0.00 How can I control pests on my cotton farm?
0.00 How often should I water tomato plants?
0.00 What fertilizer is good for wheat crops?
0.00 What is the best time to plant rice?
0.00 Is there a difference between four-wheel drive and all-wheel drive?
---
Searching with "SemanticTextSearch"
0.72 How can I control pests on my cotton farm?
0.54 What is the best time to plant rice?
0.54 Which crop is best for sandy soil?
0.50 What fertilizer is good for wheat crops?
0.50 How often should I water tomato plants?
0.41 Is there a difference between four-wheel drive and all-wheel drive?
---
Conclusion
In this article, we discussed the advantages of semantic search over lexical search. By understanding the meaning of words, semantic search is a far more powerful tool for identifying text relevant to user queries.
Fortunately, many organizations train and publish open-source models, and projects like Hugging Face make these models more accessible.