C# (CSharp) Lucene.Net.Analysis.Standard Namespace

Nested Namespaces

Lucene.Net.Analysis.Standard.Std31
Lucene.Net.Analysis.Standard.Std34
Lucene.Net.Analysis.Standard.Std36
Lucene.Net.Analysis.Standard.Std40

Classes

Name Description
ClassicFilter Normalizes tokens extracted with ClassicTokenizer.
ClassicFilterFactory Factory for ClassicFilter.
 <fieldType name="text_clssc" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ClassicTokenizerFactory"/>
     <filter class="solr.ClassicFilterFactory"/>
   </analyzer>
 </fieldType>
ClassicTokenizer A grammar-based tokenizer constructed with JFlex.

This should be a good tokenizer for most European-language documents:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer. ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, StandardTokenizer implements Unicode text segmentation, as specified by UAX#29.
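The dot and hyphen rules above can be observed directly. A minimal sketch, assuming the Lucene.Net 4.8 API (`LuceneVersion.LUCENE_48` and `ICharTermAttribute` are taken from that version; adjust for yours):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

class ClassicTokenizerSketch
{
    static void Main()
    {
        // "lucene.net": the dot is not followed by whitespace, so it stays in the token.
        // "XY-2000": the token contains a number, so the hyphen does not split it.
        // "semi-final": no number, so it splits into "semi" and "final".
        var reader = new StringReader("See lucene.net for part XY-2000 in the semi-final.");
        using var tokenizer = new ClassicTokenizer(LuceneVersion.LUCENE_48, reader);
        var term = tokenizer.AddAttribute<ICharTermAttribute>();
        tokenizer.Reset();
        while (tokenizer.IncrementToken())
            Console.WriteLine(term.ToString());
        tokenizer.End();
    }
}
```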

ClassicTokenizerFactory Factory for ClassicTokenizer.
 <fieldType name="text_clssc" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="120"/>
   </analyzer>
 </fieldType>
ClassicTokenizerImpl This class implements the classic Lucene StandardTokenizer as it existed up to version 3.0.
StandardAnalyzer Filters {@link StandardTokenizer} with {@link StandardFilter}, {@link LowerCaseFilter} and {@link StopFilter}, using a list of English stop words.
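A sketch of the full analyzer pipeline, assuming the Lucene.Net 4.8 signatures (`GetTokenStream` and `LuceneVersion.LUCENE_48` are assumptions from that version):

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

class StandardAnalyzerSketch
{
    static void Main()
    {
        // Tokenizes, lowercases, and drops English stop words such as "the".
        using var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        using var ts = analyzer.GetTokenStream("body", "The Quick Brown Fox");
        var term = ts.AddAttribute<ICharTermAttribute>();
        ts.Reset();
        while (ts.IncrementToken())
            Console.WriteLine(term.ToString()); // with the default stop set: quick, brown, fox
        ts.End();
    }
}
```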
StandardAnalyzer.SavedStreams
StandardAnalyzer.TokenStreamComponentsAnonymousInnerClassHelper
StandardFilter Normalizes tokens extracted with {@link StandardTokenizer}.
StandardFilterFactory Factory for StandardFilter.
 <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StandardFilterFactory"/>
   </analyzer>
 </fieldType>
StandardTokenizer A grammar-based tokenizer constructed with JFlex.

This should be a good tokenizer for most European-language documents:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
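A minimal usage sketch, assuming the Lucene.Net 4.8 API. The matchVersion argument selects the version-appropriate grammar (the Std31/Std34/Std36/Std40 nested namespaces listed above appear to hold these version-specific implementations):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

class StandardTokenizerSketch
{
    static void Main()
    {
        var reader = new StringReader("Lucene.Net 4.8 tokenizes Unicode text.");
        // Passing LuceneVersion.LUCENE_48 requests the current (UAX#29) behavior;
        // older versions emulate the grammar of that release.
        using var tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
        var term = tokenizer.AddAttribute<ICharTermAttribute>();
        tokenizer.Reset();
        while (tokenizer.IncrementToken())
            Console.WriteLine(term.ToString());
        tokenizer.End();
    }
}
```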

StandardTokenizerFactory Factory for StandardTokenizer.
 <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
   </analyzer>
 </fieldType>
StandardTokenizerImpl This class is a scanner generated by JFlex 1.4.1 on 12/18/07 9:22 PM from the specification file /Volumes/User/grantingersoll/projects/lucene/java/lucene-clean/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
TestStandardFactories Simple tests to ensure the standard Lucene factories are working.
TestUAX29URLEmailTokenizerFactory A few tests based on org.apache.lucene.analysis.TestUAX29URLEmailTokenizer
UAX29URLEmailAnalyzer Filters org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

You must specify the required org.apache.lucene.util.Version compatibility when creating UAX29URLEmailAnalyzer.
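For example, a sketch assuming the Lucene.Net 4.8 port, where the compatibility version is Lucene.Net.Util.LuceneVersion rather than the Java org.apache.lucene.util.Version:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// The match version must be passed explicitly; it selects back-compatible behavior.
using var analyzer = new UAX29URLEmailAnalyzer(LuceneVersion.LUCENE_48);
```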

UAX29URLEmailAnalyzer.TokenStreamComponentsAnonymousInnerClassHelper
UAX29URLEmailTokenizerFactory Factory for UAX29URLEmailTokenizer.
 <fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
   </analyzer>
 </fieldType>
UAX29URLEmailTokenizerImpl This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

Tokens produced are of the following types:

  • <ALPHANUM>: A sequence of alphabetic and numeric characters
  • <NUM>: A number
  • <URL>: A URL
  • <EMAIL>: An email address
  • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
  • <IDEOGRAPHIC>: A single CJKV ideographic character
  • <HIRAGANA>: A single hiragana character
  • <KATAKANA>: A sequence of katakana characters
  • <HANGUL>: A sequence of Hangul characters
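The token types can be inspected through the type attribute. A sketch assuming the Lucene.Net 4.8 API (`UAX29URLEmailTokenizer` and `ITypeAttribute` are taken from that version):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

class TokenTypeSketch
{
    static void Main()
    {
        var reader = new StringReader("Mail me@example.com or visit http://lucene.apache.org today");
        using var tokenizer = new UAX29URLEmailTokenizer(LuceneVersion.LUCENE_48, reader);
        var term = tokenizer.AddAttribute<ICharTermAttribute>();
        var type = tokenizer.AddAttribute<ITypeAttribute>();
        tokenizer.Reset();
        while (tokenizer.IncrementToken())
            Console.WriteLine($"{term} [{type.Type}]"); // e.g. "me@example.com [<EMAIL>]"
        tokenizer.End();
    }
}
```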