
I still remember the days of using search engines in different portals and not getting even one relevant result. That could be the reason Google became so popular, and Google certainly resolved that problem.

As the amount of content grows daily and at an increasing pace, giving users such powerful search capabilities is becoming more important as well. That is why I want to share some fundamental information on the topic.

Currently, Elasticsearch is ranked as the most popular search engine according to DB-Engines. This article explores fundamental Elasticsearch concepts such as indexes, documents, and inverted indexes, and how these concepts work together to provide storage and relevance scoring. Additionally, I will present a case where these Elasticsearch features were evaluated and Elasticsearch was proposed as the main data repository for an internal project.

FUNDAMENTAL CONCEPTS OF THE APACHE LUCENE LIBRARY

The core of Elasticsearch is the Apache Lucene library, which includes features for indexing, searching, retrieving and updating documents, and text analysis. The figure below depicts the integration between Elasticsearch and Lucene, and how they interact with external systems:

The fundamental concepts required to understand the theory behind Apache Lucene are indexes, documents, inverted indexes, scoring, and tokenisation. I will briefly describe these concepts below.

An index is a collection of documents sharing conceptual and logical similarities. You can think of an index as a folder with multiple related documents. Given that Elasticsearch is a distributed system and nodes can be added to a cluster on demand, there is virtually no limit to the number of documents an Elasticsearch server can store.

A document is a record containing information related to the index. Documents do not have a fixed schema, and every document pushed into the index is tagged with a unique identifier. A document contains a set of fields and is generally stored in JSON format. Fields are the minimal unit of storage in the Lucene ecosystem. A simple document is illustrated in the following figure:

An inverted index is a data structure that stores information in a complex HashMap in order to facilitate the search for terms contained within the fields of the documents. The concept of the inverted index is close to the concept of a book index. When you need to search for a specific term in a book, you use the book index and check which pages contain information about the queried term. The following table shows an example of an inverted index structure built from a set of documents:

Scoring is the process that compares the user input against the stored documents and assigns a relevance value to each result. Similar to how the Google search engine works, the outcome of the scoring process is a sorted array containing all the search matches ordered by score. Three aspects should be considered for the scoring process. The first aspect is related to the definition of relevance itself. Even when different users perform the same search, the relevant results could be different for each of them; the relevance should therefore be aligned with user expectations. The second aspect concerns knowledge about a user's satisfaction with the relevance score provided in the search. In fact, giving feedback to the algorithm on the user's perception is a fascinating field related to reinforcement learning techniques, which by itself would warrant a separate article. The third aspect encompasses the algorithms which calculate the scores. There are two main algorithms used for scoring: Term Frequency-Inverse Document Frequency (TF-IDF) and Best Match 25 (BM25). Both algorithms are rooted in the concept of tokenisation.

Tokenisation is a fundamental concept of the Natural Language Processing (NLP) field, which is also applied to search engines. Tokenisation consists of splitting a text string or document into tokens, or smaller chunks. Although tokenisation libraries provide a set of methods for splitting text, users can implement their own rules using regular expressions.
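
To make the idea of custom tokenisation rules concrete, here is a minimal Python sketch that uses only the standard `re` module. The pattern and the `tokenise` helper are my own illustrative assumptions, not the rules that Lucene or Elasticsearch analysers apply: a token is taken to be a run of lowercase letters or digits, optionally joined by straight apostrophes.

```python
import re

# Hypothetical rule: letters/digits, optionally joined by straight apostrophes,
# so that "user's" stays a single token.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def tokenise(text: str) -> list[str]:
    """Lowercase the text and return every substring matching TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text.lower())


if __name__ == "__main__":
    print(tokenise("Elasticsearch's core is the Apache Lucene library."))
```

Changing the pattern changes what counts as a token, which in turn changes which terms can ever be matched by a search.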
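
The book-index analogy for the inverted index can also be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (a plain dictionary keyed by term, a naive regex tokeniser, hypothetical function names), not the on-disk structure Lucene actually uses: each term maps to the identifiers of the documents that contain it, together with the positions where it occurs.

```python
import re
from collections import defaultdict


def tokenise(text):
    """Naive tokeniser (assumption): lowercase runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to {document id: [positions of the term in that document]}."""
    inverted = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenise(text)):
            inverted[term].setdefault(doc_id, []).append(position)
    return inverted


def search(inverted, term):
    """Return the sorted ids of the documents containing the term."""
    return sorted(inverted.get(term.lower(), {}))


# Hypothetical sample documents, keyed by their unique identifier.
documents = {
    1: "Lucene indexes documents",
    2: "Elasticsearch builds on Lucene",
    3: "Documents contain fields",
}
inverted = build_inverted_index(documents)

if __name__ == "__main__":
    print(search(inverted, "lucene"))
    print(inverted["documents"])
```

Looking up a term is a single dictionary access, in the same way that a book index sends you straight to the right pages instead of making you read the whole book.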
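
Finally, the two scoring algorithms can be compared with a small Python sketch. It uses common textbook forms of TF-IDF and BM25 and the frequently cited parameters `k1=1.2` and `b=0.75`. The exact formulas inside Lucene differ in detail, and all function names and sample documents here are assumptions made for illustration. The result is a list of matches sorted by score, as described above.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Naive tokeniser (assumption): lowercase runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents, k1=1.2, b=0.75):
    """Return (doc_id, tfidf, bm25) for matching documents, best BM25 first."""
    if not documents:
        return []
    tokenised = {doc_id: tokenise(text) for doc_id, text in documents.items()}
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised.values() for term in set(tokens))

    results = []
    for doc_id, tokens in tokenised.items():
        counts = Counter(tokens)
        tfidf = 0.0
        bm25 = 0.0
        for term in tokenise(query):
            tf = counts[term]
            if tf == 0:
                continue
            # Textbook TF-IDF: relative term frequency times log(N / df).
            tfidf += (tf / len(tokens)) * math.log(n_docs / df[term])
            # Textbook BM25 with a smoothed, always-positive IDF.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            bm25 += idf * tf * (k1 + 1) / (tf + length_norm)
        if bm25 > 0:
            results.append((doc_id, tfidf, bm25))
    return sorted(results, key=lambda row: row[2], reverse=True)


# Hypothetical sample documents.
documents = {
    1: "lucene lucene search",
    2: "search engine",
    3: "book index pages",
}

if __name__ == "__main__":
    for doc_id, tfidf, bm25 in rank("search", documents):
        print(doc_id, round(tfidf, 3), round(bm25, 3))
```

In this toy corpus the shorter document ranks first for the query "search", because both formulas reward a term that makes up a larger share of a document; BM25 makes that length normalisation explicit through the `b` parameter.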
