DMelt:Text/2 Searching in text using Lucene

From HandWiki


DMelt includes the Apache Lucene library ([lucene.apache.org]), which provides Java-based indexing and search, as well as spell checking, hit highlighting and advanced analysis/tokenization capabilities. DMelt ships version 2.3.2, an older, flagship version of Lucene. The main advantages of this version are that it is simple and that it has robust C++ support through the CLucene project ([clucene.sourceforge.net]), a high-performance, scalable C++ analog of Java Lucene. CLucene is a port of this same Java Lucene 2.3.2, and many parts of the DMelt web site are powered by the C++ version of Lucene.

As usual in DMelt, all Java statements can be rewritten in any of the scripting languages that run on the Java platform.

Let us consider a simple example. Assume we have a list of text in which each sentence is one entry. We will build a search engine in Python that creates an index (in memory) and then uses it to search for a given word, printing the score of each result. A score close to 1 means the highest likelihood of a match for the given word.
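Before turning to Lucene itself, the idea of "index in memory, then search and score" can be illustrated with a toy sketch in plain Python. This is not Lucene and does not reproduce Lucene's scoring formula: the function names (`tokenize`, `build_index`, `search`), the sample sentences, and the scoring rule (square root of the term frequency, divided by the square root of the sentence length, then rescaled so that the best hit gets 1.0) are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def build_index(sentences):
    """Build an in-memory inverted index: term -> {sentence id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, sentence in enumerate(sentences):
        for term, freq in Counter(tokenize(sentence)).items():
            index[term][doc_id] = freq
    return index


def search(index, sentences, word):
    """Return (score, sentence) pairs, best first; the top hit is rescaled to 1.0."""
    postings = index.get(word.lower(), {})
    raw = {}
    for doc_id, freq in postings.items():
        length = len(tokenize(sentences[doc_id]))
        # Toy score (an assumption, not Lucene's formula):
        # more occurrences raise the score, longer sentences lower it.
        raw[doc_id] = math.sqrt(freq) / math.sqrt(length)
    if not raw:
        return []
    top = max(raw.values())
    return sorted(((s / top, sentences[d]) for d, s in raw.items()), reverse=True)


# Hypothetical sample data: one sentence per list entry.
sentences = [
    "Lucene is a search library",
    "Search engines index text, then search the index",
    "DMelt supports Python scripting",
]

index = build_index(sentences)
hits = search(index, sentences, "search")
for score, sentence in hits:
    print("%.2f  %s" % (score, sentence))
```

The sketch only shows the shape of the workflow: sentences are indexed once, and each query is answered from the index rather than by rescanning the text. Lucene adds analyzers, query parsing and a more elaborate ranking on top of the same principle.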

In the Lucene example we will use org.apache.lucene.index.IndexWriter to write the index in memory, and then apply org.apache.lucene.search.IndexSearcher to find a given word. The simple search engine looks as follows: