Methods
GetAnyTokenStream ( IndexReader reader, int docId, String field, Analyzer analyzer ) : TokenStream |
A convenience method that tries a number of approaches to getting a token stream. The cost of discovering that there are no term vectors in the index is minimal (1,000 invocations still register 0 ms), so this lazy, try-everything approach to coding is acceptable in practice. A usage sketch follows.
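For illustration, a minimal sketch of feeding the result to a highlighter. The Highlighter and QueryScorer types come from the Lucene.Net contrib highlighter, and the reader, docId, query, analyzer, and "body" names are assumptions for the example, not part of this method's contract:

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight; // Lucene.Net 3.x contrib; 2.x used Lucene.Net.Highlight

// 'reader' (IndexReader), 'docId' (int), 'query' (Query), and 'analyzer' (Analyzer)
// are assumed to be set up elsewhere.
TokenStream tokenStream =
    TokenSources.GetAnyTokenStream(reader, docId, "body", analyzer);

// Hand the stream to a highlighter along with the original stored field text.
var highlighter = new Highlighter(new QueryScorer(query));
string storedText = reader.Document(docId).Get("body");
string fragment = highlighter.GetBestFragment(tokenStream, storedText);
```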
|
GetAnyTokenStream ( IndexReader reader, int docId, String field, Document doc, Analyzer analyzer ) : TokenStream |
A convenience method that first tries to get a TermPositionVector for the specified docId and then falls back to using the passed-in Document to retrieve the TokenStream. This is useful when you already have the document but would still prefer to use the term vector first. See the sketch below.
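A short sketch of when this overload pays off, continuing the previous sketch's setup (variable names are hypothetical):

```csharp
using Lucene.Net.Documents;

// We already needed the stored Document for something else, so pass it in:
// the term vector is still tried first, and 'doc' is only re-analyzed with
// 'analyzer' if no vector is available.
Document doc = reader.Document(docId);
string title = doc.Get("title"); // some other use of the loaded document

TokenStream tokenStream =
    TokenSources.GetAnyTokenStream(reader, docId, "body", doc, analyzer);
```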
|
GetTokenStream ( Document doc, String field, Analyzer analyzer ) : TokenStream |
|
|
GetTokenStream ( IndexReader reader, int docId, String field, Analyzer analyzer ) : TokenStream |
|
|
GetTokenStream ( IndexReader reader, int docId, String field ) : TokenStream
|
|
GetTokenStream ( String field, String contents, Analyzer analyzer ) : TokenStream |
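No description accompanies this overload above; judging from the signature, it analyzes raw text directly without touching the index. A minimal sketch under that assumption:

```csharp
// Highlighting text that never came from the index, e.g. a snippet held in memory.
// Assumes this overload simply runs 'analyzer' over 'contents' for field "body".
string contents = "The quick brown fox jumps over the lazy dog";
TokenStream tokenStream = TokenSources.GetTokenStream("body", contents, analyzer);
```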
|
|
GetTokenStream ( TermPositionVector tpv ) : TokenStream |
|
|
GetTokenStream ( TermPositionVector tpv, bool tokenPositionsGuaranteedContiguous ) : TokenStream |
Low-level API. Returns a token stream, or null if no offset info is available in the index. This can be used to feed the highlighter with a pre-parsed token stream; a sketch follows this entry. In my tests, the times to recreate 1,000 token streams using this method were:

- with TermVector offset data only stored: 420 milliseconds
- with TermVector offset AND position data stored: 271 milliseconds

(NB: timings for TermVector with position data assume a tokenizer with contiguous positions, i.e. no overlaps or gaps.)

The cost of not using TermPositionVector to store pre-parsed content, and instead using an analyzer to re-parse the original content:

- re-analyzing the original content: 980 milliseconds

Re-analysis timings will typically vary depending on:

1) the complexity of the analyzer code (the timings above used a stemmer/lowercaser/stopword combination);
2) the number of other fields (Lucene reads ALL fields off the disk when accessing just one document field, which can cost dear);
3) the use of compression on field storage, which could be faster (less disk IO) or slower (more CPU burn) depending on the content.
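A sketch of the low-level path described above. The GetTermFreqVector call and the cast to TermPositionVector follow the Lucene.Net 2.x-era IndexReader API; treat those details as assumptions:

```csharp
// Rebuild a TokenStream from stored term vector data instead of re-analyzing.
var vector = reader.GetTermFreqVector(docId, "body") as TermPositionVector;
if (vector != null)
{
    // 'false': do not assume token positions are contiguous (overlaps and gaps
    // allowed), trading the faster position-based path for safety.
    TokenStream tokenStream = TokenSources.GetTokenStream(vector, false);
    // ... feed tokenStream to the highlighter ...
}
else
{
    // No term vector stored for this field: fall back to re-analysis,
    // e.g. via GetAnyTokenStream above.
}
```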
|