DMelt:Text/2 Searching in text using Lucene

From HandWiki

DMelt includes the Apache Lucene library ([lucene.apache.org]), which provides Java-based indexing and search technology, as well as spell checking, hit highlighting and advanced analysis/tokenization capabilities. DMelt includes version 2.3.2, an older flagship version of Lucene. The main advantages of this version are that it is simple and that it has robust C++ support through the CLucene project ([clucene.sourceforge.net]), a high-performance, scalable C++ analog of Java Lucene. CLucene is a port of this same Java Lucene 2.3.2, and many parts of the DMelt web page are powered by the C++ version of Lucene.

As usual in DMelt, all Java statements can be rewritten in the scripting languages that run on the Java platform.

Let us consider a simple example. Assume we have a list of text, where each sentence is one entry in the list. Let us make a search engine in Python that creates an index (in memory) and then uses it to search for a given word, printing the score of each result. A score close to 1 indicates the best match for the given word.

In this example we use org.apache.lucene.index.IndexWriter to write the index in memory, and then apply org.apache.lucene.search.IndexSearcher to find a given word. The simple search engine looks like this:


from org.apache.lucene.document import Field, Document
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriter
from org.apache.lucene.store import RAMDirectory
from org.apache.lucene.search import IndexSearcher,Query
from org.apache.lucene.queryParser import QueryParser


# create index file in memory, optimize, and perform searches
text=["Use dmelt data analysis for any statistical data analysis. DMelt is good for this",
     "Dmelt is even more. You can analyze text, images and use in natural sciences", 
     "Any activity can be possible. Any combination, any task. Try it to see this",
     "Let those who are in favour with their stars, Of public honour and proud titles boast",
     "Whilst I, whom fortune of such triumph bars, Unlook for joy in that I honour most"]

def addText(writer, txt):
  # each list entry becomes its own Lucene document with a single "content" field
  doc = Document()
  doc.add(Field("content", txt, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS))
  writer.addDocument(doc)


print "Create index in memory using input string"
xdir = RAMDirectory()
writer = IndexWriter(xdir, StandardAnalyzer(), True) # True: create a new index
for t in text: addText(writer, t) # add text page by page
writer.optimize() # merge index segments once, after all pages are added
writer.close()


print 'Find the pattern inside text..'
searcher = IndexSearcher(xdir)
parser = QueryParser("content", StandardAnalyzer())
query=parser.parse("dmelt") # search dmelt 
hits = searcher.search(query)
print "Searching for: ",query.toString("content")
print "Number of found hits=",hits.length()," Look at score for each page:" 
for i in range(hits.length()):
    print "doc=",hits.id(i)," score=",hits.score(i)
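
To make the idea of indexing and scoring more concrete, the pure-Python sketch below mimics the same workflow without Lucene: it builds an in-memory inverted index (word to page ids) and ranks the pages for a query word. The function names (tokenize, build_index, search) and the scoring rule are assumptions made for illustration only; Lucene's analyzers and its actual scoring formula are considerably more elaborate.

```python
# Toy in-memory inverted index. This is NOT Lucene's implementation or
# scoring formula; all names here are hypothetical and for illustration.
import re
from collections import defaultdict

def tokenize(txt):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", txt.lower())

def build_index(pages):
    """Map each word to {page id: number of occurrences on that page}."""
    index = defaultdict(dict)
    for doc_id, txt in enumerate(pages):
        for word in tokenize(txt):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

def search(index, pages, word):
    """Return (page id, score) pairs, best match first.

    The raw score is the fraction of the page's words equal to the query
    word; it is divided by the best raw score so the top hit scores 1.0.
    """
    postings = index.get(word.lower(), {})
    raw = {}
    for doc_id, count in postings.items():
        raw[doc_id] = float(count) / len(tokenize(pages[doc_id]))
    if not raw:
        return []
    best = max(raw.values())
    return sorted(((d, s / best) for d, s in raw.items()),
                  key=lambda hit: (-hit[1], hit[0]))

pages = ["Use dmelt data analysis for any statistical data analysis. DMelt is good for this",
         "Dmelt is even more. You can analyze text, images and use in natural sciences",
         "Any activity can be possible. Any combination, any task. Try it to see this"]

index = build_index(pages)
hits = search(index, pages, "dmelt")
for doc_id, score in hits:
    print("doc=%d score=%.3f" % (doc_id, score))
```

With these three sample pages, page 0 (where the word occurs twice) is ranked ahead of page 1, and page 2 is not returned at all. The normalization to a top score of 1 is a simplification chosen here so that the numbers read like the scores printed by the Lucene example.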