C# (CSharp) org.apache.lucene.analysis.miscellaneous Namespace

Classes

Name Description
CodepointCountFilter Removes words that are too long or too short from the stream.

Note: Length is calculated as the number of Unicode codepoints.
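
The difference between counting code points and counting UTF-16 code units only shows up for characters outside the Basic Multilingual Plane. The following is a minimal sketch in plain Java with no Lucene dependency; the class and method names are hypothetical, and the inclusive min/max bounds are an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CodepointCountSketch {
    // Keeps only the tokens whose code point count lies in [min, max] (assumed inclusive).
    static List<String> filterByCodepoints(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int count = token.codePointCount(0, token.length());
            if (count >= min && count <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600: one code point, two UTF-16 code units
        System.out.println(emoji.length());                          // prints 2
        System.out.println(emoji.codePointCount(0, emoji.length())); // prints 1
        // With max = 1 the emoji survives a code point filter even though its length() is 2.
        System.out.println(filterByCodepoints(Arrays.asList("a", "ab", emoji), 0, 1).size()); // prints 2
    }
}
```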

CodepointCountFilterFactory Factory for CodepointCountFilter.
 <fieldType name="text_lngth" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.CodepointCountFilterFactory" min="0" max="1" />
   </analyzer>
 </fieldType>
EmptyTokenStream An always exhausted token stream.
HyphenatedWordsFilter When plain text is extracted from documents, many words are often hyphenated and broken across two lines. This is common in documents that use narrow text columns, such as newsletters. To increase search efficiency, this filter puts hyphenated words that were broken across two lines back together. This filter should be used at indexing time only. Example field definition in schema.xml:
 <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"/>
     <filter class="solr.HyphenatedWordsFilterFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldtype>
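
The joining step described for HyphenatedWordsFilter can be sketched in plain Java over a list of whitespace-separated tokens. This is an illustration of the idea only, not the filter's implementation; the class and method names are hypothetical, and emitting a dangling final fragment with its hyphen intact is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HyphenJoinSketch {
    // Joins tokens that end in '-' with the token(s) that follow them.
    static List<String> joinHyphenated(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean open = false; // true while a hyphenated fragment is waiting for its continuation
        for (String token : tokens) {
            if (token.endsWith("-")) {
                pending.append(token, 0, token.length() - 1); // drop the trailing hyphen
                open = true;
            } else if (open) {
                out.add(pending.append(token).toString());
                pending.setLength(0);
                open = false;
            } else {
                out.add(token);
            }
        }
        if (open) {
            // Assumption: a fragment with no continuation is emitted unchanged.
            out.add(pending.append('-').toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(joinHyphenated(
            Arrays.asList("eco-", "logical", "devel-", "op-", "ment", "plan")));
        // prints [ecological, development, plan]
    }
}
```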
HyphenatedWordsFilterFactory Factory for HyphenatedWordsFilter.
 <fieldType name="text_hyphn" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.HyphenatedWordsFilterFactory"/>
   </analyzer>
 </fieldType>
KeywordMarkerFilter Marks terms as keywords via the KeywordAttribute.
KeywordRepeatFilter This TokenFilter emits each incoming token twice, once as a keyword and once as a non-keyword; in other words, once with KeywordAttribute.setKeyword(boolean) set to true and once set to false. This is useful when combined with a stem filter that respects the KeywordAttribute, so that both the stemmed and the unstemmed version of a term are indexed into the same field.
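
How the repeated tokens interact with a keyword-aware stemmer can be sketched without Lucene. Everything here is hypothetical: the Token class stands in for a term plus its keyword flag, and toyStem is a deliberately naive stemmer that only strips a trailing "s".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KeywordRepeatSketch {
    /** Hypothetical stand-in for a term carrying a keyword flag. */
    static final class Token {
        final String text;
        final boolean keyword;
        Token(String text, boolean keyword) {
            this.text = text;
            this.keyword = keyword;
        }
    }

    // Emits every term twice: first flagged as a keyword, then as a non-keyword.
    static List<Token> repeat(List<String> terms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, true));
            out.add(new Token(term, false));
        }
        return out;
    }

    // Toy stemmer for illustration only: strips one trailing "s".
    static String toyStem(String term) {
        return term.length() > 1 && term.endsWith("s")
            ? term.substring(0, term.length() - 1)
            : term;
    }

    // Stems only the tokens that are not flagged as keywords.
    static List<String> stemRespectingKeywords(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token token : tokens) {
            out.add(token.keyword ? token.text : toyStem(token.text));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(stemRespectingKeywords(repeat(Arrays.asList("cats", "dog"))));
        // prints [cats, cat, dog, dog]
    }
}
```

When stemming leaves a term unchanged, as with "dog" above, the two copies are identical, so a chain like this would typically be followed by a duplicate-removal step.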
LengthFilter Removes words that are too long or too short from the stream.

Note: Length is calculated as the number of UTF-16 code units.

LengthFilterFactory Factory for LengthFilter.
 <fieldType name="text_lngth" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LengthFilterFactory" min="0" max="1" />
   </analyzer>
 </fieldType>
LimitTokenCountAnalyzer This Analyzer limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside org.apache.lucene.index.IndexWriter.
LimitTokenCountFilter This TokenFilter limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside org.apache.lucene.index.IndexWriter.

By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which can result in reset() being called prior to incrementToken() returning false. For most TokenStream implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use the consumeAllTokens option of the LimitTokenCountFilter(TokenStream, int, boolean) constructor.
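
The effect of consumeAllTokens can be illustrated with a plain Java iterator standing in for the wrapped stream. This is a sketch under that assumption, not the filter's code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class LimitTokenCountSketch {
    // Returns at most maxTokenCount tokens; optionally drains the rest of the source.
    static List<String> limit(Iterator<String> source, int maxTokenCount, boolean consumeAllTokens) {
        List<String> out = new ArrayList<>();
        while (out.size() < maxTokenCount && source.hasNext()) {
            out.add(source.next());
        }
        if (consumeAllTokens) {
            // Exhaust the source so that a producer which needs to see
            // end-of-stream can finish its work; the extra tokens are discarded.
            while (source.hasNext()) {
                source.next();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> lazy = Arrays.asList("a", "b", "c", "d").iterator();
        System.out.println(limit(lazy, 2, false)); // prints [a, b]
        System.out.println(lazy.hasNext());        // prints true: "c" and "d" were never read

        Iterator<String> drained = Arrays.asList("a", "b", "c", "d").iterator();
        System.out.println(limit(drained, 2, true)); // prints [a, b]
        System.out.println(drained.hasNext());       // prints false
    }
}
```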

LimitTokenCountFilterFactory Factory for LimitTokenCountFilter.
 <fieldType name="text_lngthcnt" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10" consumeAllTokens="false" />
   </analyzer>
 </fieldType>

The consumeAllTokens property is optional and defaults to false. See LimitTokenCountFilter for an explanation of its use.

LimitTokenPositionFilter This TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.

By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been exceeded, which can result in reset() being called prior to incrementToken() returning false. For most TokenStream implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use the consumeAllTokens option of the LimitTokenPositionFilter(TokenStream, int, boolean) constructor.
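
Limiting by position differs from limiting by count when several tokens share a position, for example after synonym expansion. The sketch below models each token as a term plus a position increment; the names are hypothetical, positions are assumed to start at 1 for a first increment of 1, and the consumeAllTokens behaviour is left out.

```java
import java.util.ArrayList;
import java.util.List;

class LimitTokenPositionSketch {
    // Emits tokens until the running position exceeds maxTokenPosition.
    static List<String> limitByPosition(String[] terms, int[] positionIncrements, int maxTokenPosition) {
        List<String> out = new ArrayList<>();
        int position = 0;
        for (int i = 0; i < terms.length; i++) {
            position += positionIncrements[i]; // an increment of 0 stacks a token on the previous position
            if (position > maxTokenPosition) {
                break;
            }
            out.add(terms[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"quick", "fast", "brown", "fox", "jumps"};
        int[] increments = {1, 0, 1, 1, 1}; // "fast" sits at the same position as "quick"
        System.out.println(limitByPosition(terms, increments, 2));
        // prints [quick, fast, brown]
    }
}
```

Three tokens pass a position limit of 2 here because "quick" and "fast" occupy the same position.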

LimitTokenPositionFilterFactory Factory for LimitTokenPositionFilter.
 <fieldType name="text_limit_pos" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LimitTokenPositionFilterFactory" maxTokenPosition="3" consumeAllTokens="false" />
   </analyzer>
 </fieldType>

The consumeAllTokens property is optional and defaults to false. See LimitTokenPositionFilter for an explanation of its use.

Lucene47WordDelimiterFilter
Lucene47WordDelimiterFilter.WordDelimiterConcatenation A WDF concatenated 'run'
PatternAnalyzer
PatternAnalyzer.FastStringReader A StringReader that exposes its contained string for fast direct access. Might make sense to generalize this to CharSequence and make it public?
PatternAnalyzer.FastStringTokenizer Special-case class for best performance in common cases; this class is otherwise unnecessary.
PatternAnalyzer.PatternTokenizer The workhorse; performance isn't fantastic, but it's not nearly as bad as one might think. Kudos to the Sun regex developers.
PatternKeywordMarkerFilter Marks terms as keywords via the KeywordAttribute. Each token that matches the provided pattern is marked as a keyword by setting KeywordAttribute.setKeyword(boolean) to true.
PerFieldAnalyzerWrapper This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use the Map argument of the PerFieldAnalyzerWrapper(Analyzer, java.util.Map) constructor to add non-default analyzers for fields.

Example usage:

 Map<String, Analyzer> analyzerPerField = new HashMap<>();
 analyzerPerField.put("firstname", new KeywordAnalyzer());
 analyzerPerField.put("lastname", new KeywordAnalyzer());
 PerFieldAnalyzerWrapper aWrapper =
     new PerFieldAnalyzerWrapper(new StandardAnalyzer(version), analyzerPerField);

In this example, StandardAnalyzer will be used for all fields except "firstname" and "lastname", for which KeywordAnalyzer will be used.

A PerFieldAnalyzerWrapper can be used like any other analyzer, for both indexing and query parsing.

PrefixAndSuffixAwareTokenFilter Links two PrefixAwareTokenFilter.

NOTE: This filter might not behave correctly if used with custom Attributes, i.e. Attributes other than the ones located in org.apache.lucene.analysis.tokenattributes.

PrefixAndSuffixAwareTokenFilter.PrefixAwareTokenFilterAnonymousInnerClassHelper
PrefixAndSuffixAwareTokenFilter.PrefixAwareTokenFilterAnonymousInnerClassHelper2
PrefixAwareTokenFilter Joins two token streams and leaves the last token of the first stream available to be used when updating the token values in the second stream based on that token. The default implementation adds last prefix token end offset to the suffix token start and end offsets.

NOTE: This filter might not behave correctly if used with custom Attributes, i.e. Attributes other than the ones located in org.apache.lucene.analysis.tokenattributes.
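
The default offset adjustment described for PrefixAwareTokenFilter can be sketched as follows. The Token class and join method are hypothetical stand-ins written in plain Java; only the rule "add the last prefix token's end offset to the suffix token's start and end offsets" is taken from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PrefixAwareSketch {
    /** Hypothetical token with character offsets. */
    static final class Token {
        final String text;
        final int start;
        final int end;
        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Concatenates two token lists, shifting suffix offsets by the last prefix end offset.
    static List<Token> join(List<Token> prefix, List<Token> suffix) {
        List<Token> out = new ArrayList<>(prefix);
        int shift = prefix.isEmpty() ? 0 : prefix.get(prefix.size() - 1).end;
        for (Token token : suffix) {
            out.add(new Token(token.text, token.start + shift, token.end + shift));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> joined = join(
            Arrays.asList(new Token("hello", 0, 5)),
            Arrays.asList(new Token("world", 0, 5)));
        Token second = joined.get(1);
        System.out.println(second.text + " " + second.start + "-" + second.end);
        // prints world 5-10
    }
}
```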

TrimFilter Trims leading and trailing whitespace from Tokens in the stream.

As of Lucene 4.4, this filter no longer supports updateOffsets=true, as it can lead to broken token streams.
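
Trimming a single token's text can be sketched in plain Java. The method name is hypothetical, and treating whitespace as whatever Character.isWhitespace reports is an assumption of this sketch; only the term text changes, so no offsets are involved.

```java
class TrimSketch {
    // Removes leading and trailing whitespace from a token's text.
    static String trimToken(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && Character.isWhitespace(token.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(token.charAt(end - 1))) {
            end--;
        }
        return token.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + trimToken("  a b \t") + "]"); // prints [a b]
        System.out.println("[" + trimToken("   ") + "]");      // prints []
    }
}
```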

TrimFilterFactory Factory for TrimFilter.
 <fieldType name="text_trm" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.NGramTokenizerFactory"/>
     <filter class="solr.TrimFilterFactory" />
   </analyzer>
 </fieldType>
WordDelimiterFilter Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:
  • split on intra-word delimiters (by default, all non alpha-numeric characters): "Wi-Fi" → "Wi", "Fi"
  • split on case transitions: "PowerShot" → "Power", "Shot"
  • split on letter-number transitions: "SD500" → "SD", "500"
  • leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" → "hello", "there", "dude"
  • trailing "'s" are removed for each subword: "O'Neil's" → "O", "Neil"
    • Note: this step isn't performed in a separate filter because of possible subword combinations.
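
The splitting rules above can be approximated in a few lines of plain Java. This is a simplified sketch, not the filter's algorithm: the names are hypothetical, only lower-to-upper case transitions are treated as splits, and a possessive is recognized only as "'s" followed by a delimiter or the end of the input.

```java
import java.util.ArrayList;
import java.util.List;

class WordDelimiterSplitSketch {
    static List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (isPossessive(word, i)) {
                flush(parts, current);
                i++; // skip the 's' that follows the apostrophe
                continue;
            }
            if (!Character.isLetterOrDigit(c)) {
                flush(parts, current); // any other non-alphanumeric character is a delimiter
                continue;
            }
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                boolean caseTransition = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean letterNumberTransition = Character.isDigit(prev) != Character.isDigit(c);
                if (caseTransition || letterNumberTransition) {
                    flush(parts, current);
                }
            }
            current.append(c);
        }
        flush(parts, current);
        return parts;
    }

    // True when position i starts a trailing "'s" (end of input or delimiter follows).
    static boolean isPossessive(String word, int i) {
        if (word.charAt(i) != '\'' || i + 1 >= word.length()) {
            return false;
        }
        char next = word.charAt(i + 1);
        if (next != 's' && next != 'S') {
            return false;
        }
        return i + 2 == word.length() || !Character.isLetterOrDigit(word.charAt(i + 2));
    }

    static void flush(List<String> parts, StringBuilder current) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(split("Wi-Fi"));                    // prints [Wi, Fi]
        System.out.println(split("PowerShot"));                // prints [Power, Shot]
        System.out.println(split("SD500"));                    // prints [SD, 500]
        System.out.println(split("//hello---there, 'dude'")); // prints [hello, there, dude]
        System.out.println(split("O'Neil's"));                 // prints [O, Neil]
    }
}
```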
The combinations parameter affects how subwords are combined:
  • combinations="0" causes no subword combinations: "PowerShot" → 0:"Power", 1:"Shot" (0 and 1 are the token positions)
  • combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position as the last subword in the run:
    • "PowerShot" → 0:"Power", 1:"Shot", 1:"PowerShot"
    • "A's+B's&C's" → 0:"A", 1:"B", 2:"C", 2:"ABC"
    • "Super-Duper-XL500-42-AutoCoder!" → 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
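
The combinations="1" behaviour can be sketched over an already-split list of subwords. The names are hypothetical, a subword is treated as numeric when its first character is a digit, and skipping the catenation for a run of a single subword is an assumption (the examples above only show runs of two or more).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordDelimiterCombineSketch {
    // Returns "position:text" entries: each subword, plus the catenation of every
    // maximal run of non-numeric subwords at the position of the run's last subword.
    static List<String> withCombinations(List<String> subwords) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int runLength = 0;
        for (int pos = 0; pos < subwords.size(); pos++) {
            String sub = subwords.get(pos);
            out.add(pos + ":" + sub);
            if (isNumeric(sub)) {
                continue; // numeric subwords never join a run
            }
            run.append(sub);
            runLength++;
            boolean lastOfRun = pos + 1 == subwords.size() || isNumeric(subwords.get(pos + 1));
            if (lastOfRun) {
                if (runLength > 1) {
                    out.add(pos + ":" + run);
                }
                run.setLength(0);
                runLength = 0;
            }
        }
        return out;
    }

    static boolean isNumeric(String subword) {
        return !subword.isEmpty() && Character.isDigit(subword.charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(withCombinations(
            Arrays.asList("Super", "Duper", "XL", "500", "42", "Auto", "Coder")));
        // prints [0:Super, 1:Duper, 2:XL, 2:SuperDuperXL, 3:500, 4:42, 5:Auto, 6:Coder, 6:AutoCoder]
    }
}
```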
One use for WordDelimiterFilter is to help match words with different subword delimiters. For example, if the source text contained "wi-fi", one may want the queries "wifi", "WiFi", "wi-fi", and "wi+fi" to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing, and combinations="0" (the default) in the analyzer used for querying. Because the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).
WordDelimiterFilter.OffsetSorter
WordDelimiterFilter.WordDelimiterConcatenation A WDF concatenated 'run'
WordDelimiterIterator A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterFilter rules. Internal API.