DMelt:Text/2 Searching in text using Lucene



DMelt includes the Apache Lucene library, which provides Java-based indexing and search, as well as spell checking, hit highlighting and advanced analysis/tokenization capabilities. DMelt includes version 2.3.2, an older, flagship version of Lucene. The main advantage of this version is that it is simple and has robust C++ support via the CLucene project, a high-performance, scalable C++ analog of Java Lucene. CLucene is a port of this same Java Lucene 2.3.2, and many parts of the DMelt web page are powered by the C++ version of Lucene.

As usual in DMelt, all Java statements can be rewritten in the scripting languages that run on the Java platform.

Let us consider a simple example. Assume we have a list of text in which each sentence is one entry of the list. Let us build a search engine in Python that creates an index (in memory) and then uses it to search for a given word, printing the score of each result. A score close to 1 means the highest likelihood that the entry matches the given word.
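Before turning to Lucene itself, the idea of "index first, then search and score" can be illustrated with a minimal plain-Python sketch. This is not the Lucene API and not Lucene's scoring formula: the functions `tokenize`, `build_index` and `search`, the sample sentences, and the toy score (word frequency divided by sentence length, rescaled so that the best hit gets 1.0) are all assumptions made for illustration only.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(sentences):
    """Build an in-memory index: word -> {sentence number: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, sentence in enumerate(sentences):
        for word in tokenize(sentence):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

def search(index, sentences, word):
    """Return (score, sentence) pairs for one word, best match first.

    Toy score (an assumption, not Lucene's formula): occurrences of the
    word divided by the sentence length, rescaled so the best hit is 1.0.
    """
    postings = index.get(word.lower(), {})
    raw = {}
    for doc_id, count in postings.items():
        raw[doc_id] = float(count) / len(tokenize(sentences[doc_id]))
    if not raw:
        return []
    top = max(raw.values())
    hits = [(value / top, sentences[doc_id]) for doc_id, value in raw.items()]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits

# Hypothetical input: each sentence is one entry of the list.
sentences = [
    "Lucene is a search library",
    "DMelt includes Lucene for text search and Lucene examples",
    "Python scripts can call Java classes",
]

index = build_index(sentences)
for score, sentence in search(index, sentences, "lucene"):
    print("%.2f  %s" % (score, sentence))
```

The sketch only looks up single words and ignores everything a real analyzer does (stop words, stemming, phrase queries); Lucene handles these and uses its own, more elaborate scoring.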

In this example we will use org.apache.lucene.index.IndexWriter to write the index in memory, and then search that index for a given word. The simple search engine looks as follows: