Introduction to Apache Lucene

November 19, 2024


Have you ever wondered what drives some of the best search applications, such as Elasticsearch and Solr, in use cases like e-commerce and other high-performance document retrieval systems? Apache Lucene is a powerful search library written in Java that performs very fast searches over large volumes of data. Its indexing and search capabilities make it a strong foundation for search engines.

By the end of this article, you will have mastered the fundamentals of Apache Lucene, even if you are new to the field of search engineering.

Learning objectives

Learn the fundamental concepts of Apache Lucene.
See how Lucene powers search applications like Elasticsearch, Solr, and so on.
Understand how indexing and searching work in Lucene.
Learn about the different types of queries supported by Apache Lucene.
Understand how to create a simple search application using Lucene and Java.

This article was published as a part of the Data Science Blogathon.

What is Apache Lucene?

To understand Lucene in depth, you need to know some key terms and concepts. Let us look at each of them in detail, along with examples. Consider an example where we have the following information about three different products in our collection.

{
  "product_id": "1",
  "title": "Wireless Noise Cancelling Headphones",
  "brand": "Bose",
  "category": ["Electronics", "Audio", "Headphones"],
  "price": 300
}

{
  "product_id": "2",
  "title": "Bluetooth Mouse",
  "brand": "Jelly Comb",
  "category": ["Electronics", "Computer Accessories", "Mouse"],
  "price": 30
}

{
  "product_id": "3",
  "title": "Wireless Keyboard",
  "brand": "iClever",
  "category": ["Electronics", "Computer Accessories", "Keyboard"],
  "price": 40
}

Document

A document is the fundamental unit of indexing and searching in Lucene. A document ID identifies each document. Lucene converts raw content into documents that contain fields and values.

Field

A Lucene document contains one or more fields. Each field has a name and a value. In the product example, the fields are:

product_id
title
brand
category
price

Term

A term is the unit of search in Lucene. Before creating terms, Lucene performs several preprocessing steps on the raw content, such as tokenization.

Document ID	Terms
1	title: wireless, noise, cancelling, headphones; brand: bose; category: electronics, audio, headphones
2	title: bluetooth, mouse; brand: jelly, comb; category: electronics, computer, accessories
3	title: wireless, keyboard; brand: iclever; category: electronics, computer, accessories
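
As a rough illustration of how raw field values could become terms, the following sketch lowercases text and splits it on non-alphanumeric characters. It is a simplified stand-in written for this article, not Lucene's StandardAnalyzer; the class and method names (SimpleTokenizer, tokenize) and the splitting rule are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SimpleTokenizer {
    // Lowercase the text, then split on runs of non-alphanumeric characters
    // (an assumed, simplified rule; real analyzers do considerably more).
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Wireless Noise Cancelling Headphones"));
        // prints [wireless, noise, cancelling, headphones]
        System.out.println(tokenize("Jelly Comb"));
        // prints [jelly, comb]
    }
}
```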

Inverted index

The underlying data structure in Lucene that enables super-fast searches is the inverted index. In an inverted index, each term is mapped to the documents that contain it, along with the positions of the term in those documents. This mapping is called a postings list.
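
To make the idea concrete, here is a minimal in-memory sketch of an inverted index that records term positions. It only illustrates the data structure; Lucene's actual on-disk format is far more elaborate, and the class name ToyInvertedIndex and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToyInvertedIndex {
    // term -> (document ID -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Split the text on whitespace and record each term's position for this document.
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int position = 0; position < tokens.length; position++) {
            postings.computeIfAbsent(tokens[position], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
        }
    }

    // Look up a term directly instead of scanning every document.
    Map<Integer, List<Integer>> postingsFor(String term) {
        return postings.getOrDefault(term, Map.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Wireless Noise Cancelling Headphones");
        index.add(2, "Bluetooth Mouse");
        index.add(3, "Wireless Keyboard");
        System.out.println(index.postingsFor("wireless")); // {1=[0], 3=[0]}
        System.out.println(index.postingsFor("mouse"));    // {2=[1]}
    }
}
```

A lookup for "wireless" goes straight to the documents that contain it, which is why searching an inverted index avoids reading documents that cannot match.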

Segment

Lucene can subdivide an index into multiple segments. Each segment is an index in itself. Segments are usually searched sequentially.

Scoring

Lucene calculates the relevance of a document using scoring mechanisms such as term frequency-inverse document frequency (TF-IDF). There are also other scoring algorithms, such as BM25, that improve on TF-IDF.

Now let's understand how TF-IDF is calculated.

Term Frequency (TF)

Term frequency is the number of times a term t appears in a document.

Document Frequency (DF)

Document frequency is the number of documents that contain a term t. The inverse document frequency divides the number of documents in the collection by the number of documents containing the term t. It measures how unique a particular term is, to avoid giving too much importance to frequently repeated words such as "a", "the", and so on. The "1+" is added to the denominator to handle the case where the number of documents containing the term t is 0.
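
The description above can be written as a formula. This is a sketch reconstructed from the text, not a formula taken from the article: N (the number of documents in the collection) and df(t) (the document frequency of t) are symbols introduced here, and the logarithm is the damping conventionally used in TF-IDF, which is an assumption.

```latex
\[
\mathrm{idf}(t) = \log\left(\frac{N}{1 + \mathrm{df}(t)}\right)
\]
```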

Term Frequency Inverse Document Frequency (TF-IDF)

TF-IDF is the product of the term frequency and the inverse document frequency. A higher TF-IDF value indicates that the term is more distinctive and unique relative to the entire collection.
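
The calculation can be sketched in a few lines of plain Java. This is an illustrative toy, not Lucene's scoring code: it assumes documents are already tokenized, uses the natural logarithm, and adds 1 to the document frequency in the denominator as described earlier. The class name TfIdfSketch and its methods are hypothetical.

```java
import java.util.List;

final class TfIdfSketch {
    // Number of times the term appears in one tokenized document.
    static long termFrequency(String term, List<String> doc) {
        return doc.stream().filter(term::equals).count();
    }

    // Number of documents in the collection that contain the term.
    static long documentFrequency(String term, List<List<String>> docs) {
        return docs.stream().filter(d -> d.contains(term)).count();
    }

    // idf = ln(N / (1 + df)); the natural logarithm is an assumption.
    static double inverseDocumentFrequency(String term, List<List<String>> docs) {
        return Math.log((double) docs.size() / (1 + documentFrequency(term, docs)));
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> docs) {
        return termFrequency(term, doc) * inverseDocumentFrequency(term, docs);
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("wireless", "noise", "cancelling", "headphones"),
                List.of("bluetooth", "mouse"),
                List.of("wireless", "keyboard"));
        System.out.println(tfIdf("mouse", docs.get(1), docs));    // ln(3/2), about 0.405
        System.out.println(tfIdf("wireless", docs.get(0), docs)); // ln(3/3) = 0.0
    }
}
```

In this toy collection, "mouse" appears in one document and scores higher than "wireless", which appears in two.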


Components of a Lucene Search Application

Lucene has two main components:

Indexer – Lucene uses the IndexWriter class for indexing.
Searcher – Lucene uses the IndexSearcher class for searching.

Lucene Indexer

The Lucene indexer is responsible for indexing documents for the search application. Lucene performs several text analysis and processing steps, such as tokenization, before indexing the terms in an inverted index. Lucene uses the IndexWriter class for indexing.

IndexWriter requires a directory where the index will be stored, as well as an analyzer for the raw content. Although it is fairly easy to write your own custom analyzer, Lucene's StandardAnalyzer does a great job at this.

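// Open the on-disk index directory and configure the writer with an analyzer
Directory directory = FSDirectory.open(Paths.get(INDEX_DIR));
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);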

Lucene Searcher

Lucene searches using the IndexSearcher class. The IndexSearcher class requires us to specify a valid Query object. A user query string can be converted to a valid Query object using the QueryParser class.

When we specify the maximum number of search results (also known as hits) that we want for the query, the Lucene searcher returns a TopDocs object containing the top hits for the query. The TopDocs object holds a score for each of the retrieved document IDs.

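// IndexSearcher reads from an IndexReader opened on the index directory
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
// "query" is the default field searched when the query string names no field
QueryParser parser = new QueryParser("query", new StandardAnalyzer());
Query query = parser.parse(searchString);
TopDocs topDocs = searcher.search(query, numHits);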

Types of search queries supported by Lucene

Lucene supports several different types of queries. Let us look at the five most commonly used queries, along with examples.

Term query

A term query matches documents that contain a particular term.

Query query = new TermQuery(new Term("brand", "jelly"));

Boolean query

Boolean queries match documents that satisfy a Boolean combination of other queries.

BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new TermQuery(new Term("category", "Computer Accessories")), BooleanClause.Occur.SHOULD);
builder.add(new TermQuery(new Term("brand", "Jelly")), BooleanClause.Occur.SHOULD);
Query query = builder.build();
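
Conceptually, combining clauses comes down to set operations on the document IDs that each clause matches. The sketch below is a toy model in plain Java, not Lucene code: two SHOULD clauses behave like a union, while two MUST clauses behave like an intersection. The class name and document IDs are made up for illustration, and scoring is ignored.

```java
import java.util.Set;
import java.util.TreeSet;

final class BooleanClauseSketch {
    // SHOULD + SHOULD: a document matches if either clause matches (set union).
    static Set<Integer> should(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // MUST + MUST: a document matches only if both clauses match (set intersection).
    static Set<Integer> must(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> categoryMatches = Set.of(2, 3); // hypothetical IDs matching the category clause
        Set<Integer> brandMatches = Set.of(2);       // hypothetical IDs matching the brand clause
        System.out.println(should(categoryMatches, brandMatches)); // [2, 3]
        System.out.println(must(categoryMatches, brandMatches));   // [2]
    }
}
```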

Range query

Range queries match documents that contain field values within a range. The following example finds products whose price is between 30 and 50.

// Point queries replaced NumericRangeQuery in Lucene 6; both bounds are inclusive and the field must be indexed as an IntPoint
Query query = IntPoint.newRangeQuery("price", 30, 50);

Phrase query

A phrase query matches documents that contain a particular sequence of terms.

Query query = new PhraseQuery("title", "Noise", "Cancelling");
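
The matching rule behind a phrase query can be sketched without Lucene: the phrase terms must appear next to each other and in the same order. The toy below scans a tokenized document directly, whereas Lucene relies on the term positions stored in the index; the class and method names are hypothetical.

```java
import java.util.List;

final class PhraseMatchSketch {
    // True if the phrase terms occur in the document consecutively and in order.
    static boolean containsPhrase(List<String> docTerms, List<String> phrase) {
        for (int start = 0; start + phrase.size() <= docTerms.size(); start++) {
            if (docTerms.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> title = List.of("wireless", "noise", "cancelling", "headphones");
        System.out.println(containsPhrase(title, List.of("noise", "cancelling"))); // true
        System.out.println(containsPhrase(title, List.of("cancelling", "noise"))); // false
    }
}
```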

Function query

A function query calculates scores for documents based on the value of a field. It can be used to boost the score of results based on a field in the document.

Query query = new FunctionQuery(new FloatFieldSource("price"));

Building a simple search application with Lucene

So far, we have learned about the fundamentals of Lucene, indexing, searching, and the different types of queries you can use.

Let's now put all these pieces together in a practical example, where we build a simple search application using the main components of Lucene: the indexer and the searcher.

In the following example, we index three documents, where each document contains two fields: name and email.

The name is added as a text field and the email is added as a string field. Lucene does not tokenize string fields.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

import java.io.IOException;

public class MyIndexer {
    private static final String NAME = "name";
    private static final String EMAIL = "email";
    private final Directory indexDirectory;
    private final Analyzer analyzer;

    public MyIndexer(Directory directory, Analyzer analyzer) {
        this.indexDirectory = directory;
        this.analyzer = analyzer;
    }

    public void indexDocuments() throws IOException {
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        IndexWriter indexWriter = new IndexWriter(indexDirectory, indexWriterConfig);
        indexNewDocument(indexWriter, "john", "[email protected]");
        indexNewDocument(indexWriter, "jane", "[email protected]");
        indexNewDocument(indexWriter, "ana", "[email protected]");
        // Closing the writer commits the added documents to the index
        indexWriter.close();
    }

    public void indexNewDocument(IndexWriter indexWriter, String name, String email) throws IOException {
        Document document = new Document();
        // TextField is analyzed (tokenized); StringField is indexed as a single term
        document.add(new TextField(NAME, name, Field.Store.YES));
        document.add(new StringField(EMAIL, email, Field.Store.YES));
        indexWriter.addDocument(document);
    }
}

Once the documents are indexed, we can query them using Lucene queries. In the following example, we use a simple TermQuery to find and print documents that match the term "jane".

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;

public class SimpleSearchApplication {
    public static void main(String[] args) throws IOException {
        String INDEX_DIRECTORY = "directory";
        Directory indexDirectory = FSDirectory.open(Paths.get(INDEX_DIRECTORY));
        Analyzer analyzer = new StandardAnalyzer();
        MyIndexer indexer = new MyIndexer(indexDirectory, analyzer);
        indexer.indexDocuments();

        // Search the indexed documents
        IndexReader indexReader = DirectoryReader.open(indexDirectory);
        IndexSearcher searcher = new IndexSearcher(indexReader);

        // Construct a term query to search for the name "jane"
        Query query = new TermQuery(new Term("name", "jane"));
        int maxHits = 10;

        TopDocs searchResults = searcher.search(query, maxHits);

        System.out.println("Documents with name 'jane':");
        for (ScoreDoc scoreDoc : searchResults.scoreDocs) {
            // Load the stored fields of each matching document by its ID
            Document document = searcher.doc(scoreDoc.doc);
            System.out.println("name: " + document.get("name") + ", email: " + document.get("email"));
        }
        indexReader.close();
    }
}

The above code returns the following result:

Documents with name 'jane':
name: jane, email: [email protected]

Conclusion

Apache Lucene is a powerful search library that enables the development of high-performance search applications. With the release of Lucene 9.9, significant improvements to query evaluation, vector search, and other features have extended its capabilities. Throughout this guide, we covered the fundamental components of Lucene, how indexers and searchers work, and how to create a simple search application in Java. We also explored the different types of search queries supported by Lucene. Armed with this knowledge, you should now feel confident in your understanding of Lucene and be prepared to build more advanced search applications using its powerful features.

Key takeaways

Apache Lucene is a powerful Java library that can perform super-fast full-text searches.
Lucene supports several types of queries that suit different search use cases.
Lucene forms the backbone of several high-performance search applications like Elasticsearch, Solr, Nrtsearch, and so on.
Lucene's IndexWriter and IndexSearcher are important classes that enable fast indexing and searching.

Frequently asked questions

Q1. Does Lucene support Python?

A. Yes, Apache Lucene has a PyLucene project that supports Python search applications.

Q2. What are the different open source search engines available?

A. Some open source search engines include Solr, OpenSearch, Meilisearch, Swirl, and so on.

Q3. Does Lucene support semantic and vector search?

A. Yes, it does. However, the maximum number of dimensions for vector fields is limited to 1024, a number that is expected to increase in the future.

Q4. What are the different relevance scoring algorithms?

A. Some of them include Term Frequency Inverse Document Frequency (TF-IDF), Best Match 25 (BM25), Latent Semantic Analysis (LSA), Vector Space Models (VSM), and so on.

Q5. What are some examples of complex queries supported by Lucene?

A. Some examples of complex queries include fuzzy queries, span queries, multi-phrase queries, regular expression queries, and so on.

The media shown in this article is not owned by Analytics Vidhya and is used at the author's discretion.