C# (CSharp) Lucene.Net.Analysis.Ngram Namespace

Classes

Name Description
EdgeNGramFilterFactory Creates new instances of EdgeNGramTokenFilter.
 <fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1"/>
   </analyzer>
 </fieldType>
EdgeNGramTokenFilter Tokenizes the given token into n-grams of given size(s).

This TokenFilter creates n-grams from the beginning edge or ending edge of an input token.

As of Lucene 4.4, this filter does not support Side#BACK (you can use ReverseStringFilter up-front and afterward to get the same behavior), handles supplementary characters correctly, and no longer updates offsets.
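
For illustration, a minimal C# sketch of wiring this filter into an analysis chain. It assumes Lucene.NET 4.8-style signatures (a LuceneVersion first constructor argument, named Version in some builds of the port); the demo class name is hypothetical.

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Core;             // WhitespaceTokenizer
    using Lucene.Net.Analysis.Ngram;            // EdgeNGramTokenFilter
    using Lucene.Net.Analysis.TokenAttributes;  // ICharTermAttribute
    using Lucene.Net.Util;                      // LuceneVersion

    public static class EdgeNGramFilterDemo
    {
        public static void Main()
        {
            // Whitespace-tokenize the input, then emit 1..3-char grams
            // anchored at the front edge of each token.
            var reader = new StringReader("lucene net");
            var tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
            using (var filter = new EdgeNGramTokenFilter(LuceneVersion.LUCENE_48, tokenizer, 1, 3))
            {
                var term = filter.AddAttribute<ICharTermAttribute>();
                filter.Reset();
                while (filter.IncrementToken())
                {
                    Console.WriteLine(term.ToString()); // l, lu, luc, n, ne, net
                }
                filter.End();
            }
        }
    }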

EdgeNGramTokenizer Tokenizes the input from an edge into n-grams of given size(s).

This Tokenizer creates n-grams from the beginning edge or ending edge of an input token.

As of Lucene 4.4, this tokenizer's behavior has changed in several ways.

Although highly discouraged, it is still possible to use the old behavior through Lucene43EdgeNGramTokenizer.
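
A short C# sketch of using the tokenizer directly (same assumptions as above: 4.8-style signatures; attribute property spellings such as StartOffset may differ slightly between builds of the port):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Ngram;            // EdgeNGramTokenizer
    using Lucene.Net.Analysis.TokenAttributes;  // ICharTermAttribute, IOffsetAttribute
    using Lucene.Net.Util;                      // LuceneVersion

    public static class EdgeNGramTokenizerDemo
    {
        public static void Main()
        {
            using (var tokenizer = new EdgeNGramTokenizer(
                LuceneVersion.LUCENE_48, new StringReader("abcde"), 1, 3))
            {
                var term = tokenizer.AddAttribute<ICharTermAttribute>();
                var offset = tokenizer.AddAttribute<IOffsetAttribute>();
                tokenizer.Reset();
                while (tokenizer.IncrementToken())
                {
                    // Front-edge grams of sizes 1..3: a [0,1), ab [0,2), abc [0,3)
                    Console.WriteLine($"{term} [{offset.StartOffset},{offset.EndOffset})");
                }
                tokenizer.End();
            }
        }
    }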

EdgeNGramTokenizerFactory Creates new instances of EdgeNGramTokenizer.
 <fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="1"/>
   </analyzer>
 </fieldType>
EdgeNGramTokenizerTest Tests EdgeNGramTokenizer for correctness.
EdgeNGramTokenizerTest.AnalyzerAnonymousInnerClassHelper
EdgeNGramTokenizerTest.AnalyzerAnonymousInnerClassHelper2
Lucene43EdgeNGramTokenizer
NGramFilterFactory Factory for NGramTokenFilter.
 <fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="2"/>
   </analyzer>
 </fieldType>
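
The factory can also be used programmatically. The sketch below assumes the Lucene.NET factory mirrors the Java API: the constructor takes the XML attributes as an IDictionary<string, string>, a luceneMatchVersion entry tells it which behavior to apply, and Create wraps an existing TokenStream. Parameter spellings and the accepted version string may vary between builds.

    using System.Collections.Generic;
    using System.IO;
    using Lucene.Net.Analysis;                  // TokenStream, Tokenizer
    using Lucene.Net.Analysis.Core;             // WhitespaceTokenizer
    using Lucene.Net.Analysis.Ngram;            // NGramFilterFactory
    using Lucene.Net.Util;                      // LuceneVersion

    public static class NGramFilterFactoryDemo
    {
        public static TokenStream BuildChain(TextReader reader)
        {
            // Same settings as the XML snippet above.
            var args = new Dictionary<string, string>
            {
                { "luceneMatchVersion", "LUCENE_48" },  // assumed to be required
                { "minGramSize", "1" },
                { "maxGramSize", "2" }
            };
            var factory = new NGramFilterFactory(args);
            Tokenizer tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
            return factory.Create(tokenizer);
        }
    }
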
NGramTokenFilter Tokenizes the input into n-grams of the given size(s).

You must specify the required Version compatibility when creating an NGramTokenFilter. As of Lucene 4.4, this token filter:

  • handles supplementary characters correctly,
  • emits all n-grams for the same token at the same position,
  • does not modify offsets,
  • sorts n-grams by their offset in the original token first, then increasing length (meaning that "abc" will give "a", "ab", "abc", "b", "bc", "c").

You can make this filter use the old behavior by providing a version < Version#LUCENE_44 in the constructor, but this is not recommended, as it will lead to broken TokenStreams that will cause highlighting bugs.

If you were using this TokenFilter to perform partial highlighting, this won't work anymore since this filter doesn't update offsets. You should modify your analysis chain to use NGramTokenizer, and potentially override NGramTokenizer#isTokenChar(int) to perform pre-tokenization.
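
The post-4.4 behavior described above can be observed with a small C# sketch (4.8-style signatures assumed; the demo names are illustrative): all n-grams of a token share its position, and offsets always span the whole original token.

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Core;             // KeywordTokenizer
    using Lucene.Net.Analysis.Ngram;            // NGramTokenFilter
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    public static class NGramFilterDemo
    {
        public static void Main()
        {
            // Treat the whole input as one token, then split it into 1- and 2-grams.
            var keyword = new KeywordTokenizer(new StringReader("abc"));
            using (var ngrams = new NGramTokenFilter(LuceneVersion.LUCENE_48, keyword, 1, 2))
            {
                var term = ngrams.AddAttribute<ICharTermAttribute>();
                var offset = ngrams.AddAttribute<IOffsetAttribute>();
                var posInc = ngrams.AddAttribute<IPositionIncrementAttribute>();
                ngrams.Reset();
                while (ngrams.IncrementToken())
                {
                    // Expected terms: a, ab, b, bc, c -- posInc is 1 for the first
                    // gram and 0 afterwards; offsets stay [0,3) because the filter
                    // no longer updates them.
                    Console.WriteLine($"{term} posInc={posInc.PositionIncrement} " +
                                      $"offsets=[{offset.StartOffset},{offset.EndOffset})");
                }
                ngrams.End();
            }
        }
    }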

NGramTokenFilter.PositionIncrementAttributeAnonymousInnerClassHelper
NGramTokenFilter.PositionLengthAttributeAnonymousInnerClassHelper
NGramTokenFilterTest Tests NGramTokenFilter for correctness.
NGramTokenFilterTest.AnalyzerAnonymousInnerClassHelper
NGramTokenFilterTest.AnalyzerAnonymousInnerClassHelper2
NGramTokenFilterTest.AnalyzerAnonymousInnerClassHelper3
NGramTokenizer Tokenizes the input into n-grams of the given size(s).

Unlike NGramTokenFilter, this class sets offsets so that the characters between startOffset and endOffset in the original stream are the same as the term chars.

For example, "abcde" would be tokenized as follows (minGram=2, maxGram=3):

Term                  ab      abc     bc      bcd     cd      cde     de
Position increment    1       1       1       1       1       1       1
Position length       1       1       1       1       1       1       1
Offsets               [0,2[   [0,3[   [1,3[   [1,4[   [2,4[   [2,5[   [3,5[
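
The table can be reproduced with a few lines of C# (a sketch assuming the 4.8-style NGramTokenizer(LuceneVersion, TextReader, minGram, maxGram) constructor):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Ngram;            // NGramTokenizer
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    public static class NGramTokenizerDemo
    {
        public static void Main()
        {
            using (var tokenizer = new NGramTokenizer(
                LuceneVersion.LUCENE_48, new StringReader("abcde"), 2, 3))
            {
                var term = tokenizer.AddAttribute<ICharTermAttribute>();
                var offset = tokenizer.AddAttribute<IOffsetAttribute>();
                tokenizer.Reset();
                while (tokenizer.IncrementToken())
                {
                    // Prints: ab [0,2), abc [0,3), bc [1,3), bcd [1,4), ...
                    Console.WriteLine($"{term} [{offset.StartOffset},{offset.EndOffset})");
                }
                tokenizer.End();
            }
        }
    }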

This tokenizer changed a lot in Lucene 4.4 in order to:

  • tokenize in a streaming fashion to support streams which are larger than 1024 chars (limit of the previous version),
  • count grams based on unicode code points instead of java chars (and never split in the middle of surrogate pairs),
  • give the ability to pre-tokenize the stream via #isTokenChar(int) before computing n-grams (see the sketch below).

Additionally, this class doesn't trim trailing whitespace and emits tokens in a different order: tokens are now emitted by increasing start offset, whereas they used to be emitted by increasing length (which prevented support for large input streams).

Although highly discouraged, it is still possible to use the old behavior through Lucene43NGramTokenizer.
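
Pre-tokenization via #isTokenChar(int) (see the list above) amounts to subclassing the tokenizer. A hedged sketch, assuming the Lucene.NET method is spelled IsTokenChar(int) and is overridable like its Java counterpart; the subclass name is hypothetical:

    using System.IO;
    using Lucene.Net.Analysis.Ngram;  // NGramTokenizer
    using Lucene.Net.Util;            // LuceneVersion

    // N-grams are computed per run of letters; any other character acts
    // as a separator, so grams never cross whitespace or punctuation.
    public sealed class LetterNGramTokenizer : NGramTokenizer
    {
        public LetterNGramTokenizer(LuceneVersion version, TextReader input,
                                    int minGram, int maxGram)
            : base(version, input, minGram, maxGram)
        {
        }

        protected override bool IsTokenChar(int c)
        {
            // c is a Unicode code point (not a char), so convert before testing;
            // this keeps supplementary characters classified correctly.
            return char.IsLetter(char.ConvertFromUtf32(c), 0);
        }
    }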

NGramTokenizerTest Tests NGramTokenizer for correctness.
NGramTokenizerTest.AnalyzerAnonymousInnerClassHelper
NGramTokenizerTest.NGramTokenizerAnonymousInnerClassHelper
TestNGramFilters Simple tests to ensure the NGram filter factories are working.