C# Class Lucene.Net.Analysis.Standard.ClassicTokenizer

A grammar-based tokenizer constructed with JFlex.

This should be a good tokenizer for most European-language documents:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.
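
The rules above can be tried out with a small helper. This is a minimal sketch, not part of the library: it assumes the Lucene.NET 4.8 packages (where the attribute namespace is spelled Lucene.Net.Analysis.TokenAttributes and the version constant is LuceneVersion.LUCENE_48), and the class name ClassicTokenizerDemo and method Tokenize are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class ClassicTokenizerDemo
{
    // Hypothetical helper: returns the term text of every token in the input.
    public static List<string> Tokenize(string text)
    {
        var terms = new List<string>();

        // Assumption: Lucene.NET 4.8, hence LuceneVersion.LUCENE_48.
        using (var tokenizer = new ClassicTokenizer(LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            // The term attribute exposes the text of the current token.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

            tokenizer.Reset();                    // must be called before the first IncrementToken()
            while (tokenizer.IncrementToken())    // returns false once the input is exhausted
            {
                terms.Add(term.ToString());
            }
            tokenizer.End();                      // finalizes end-of-stream state
        }                                         // Dispose() releases the reader

        return terms;
    }
}
```

Passing a string such as "Mail jane@example.com about part XJ-200" to Tokenize is a quick way to see how the email and product-number rules behave on your own data. Note that the tokenizer itself does not lowercase or remove stop words; that is left to filters further down the analysis chain.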

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, StandardTokenizer implements Unicode text segmentation, as specified by UAX#29.

Inheritance: Tokenizer
Project: apache/lucenenet

Public Properties

Property      Type       Description
TOKEN_TYPES   string[]   String token types that correspond to token type int constants

Public Methods

Method Description
ClassicTokenizer ( LuceneVersion matchVersion, AttributeFactory factory, System.IO.TextReader input )

Creates a new ClassicTokenizer with a given Lucene.Net.Util.AttributeSource.AttributeFactory.

ClassicTokenizer ( LuceneVersion matchVersion, System.IO.TextReader input )

Creates a new instance of the ClassicTokenizer. Attaches the input to the newly created JFlex scanner.

Dispose ( ) : void
End ( ) : void
IncrementToken ( ) : bool
Reset ( ) : void

Private Methods

Method Description
Init ( LuceneVersion matchVersion ) : void

Method Details

ClassicTokenizer() public method

Creates a new ClassicTokenizer with a given Lucene.Net.Util.AttributeSource.AttributeFactory.
public ClassicTokenizer ( LuceneVersion matchVersion, AttributeFactory factory, System.IO.TextReader input )
matchVersion LuceneVersion
factory AttributeFactory
input System.IO.TextReader
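
A minimal sketch of calling this overload, assuming Lucene.NET 4.8, where AttributeFactory is nested inside Lucene.Net.Util.AttributeSource and a default instance is exposed as AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY. The class FactoryDemo and its method CountTokens are hypothetical names used only for illustration.

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

public static class FactoryDemo
{
    // Hypothetical helper: counts the tokens produced for the given text.
    public static int CountTokens(string text)
    {
        // Assumption: the default factory is used here; a custom AttributeFactory
        // subclass could be passed instead to control how attribute instances are created.
        AttributeSource.AttributeFactory factory =
            AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;

        int count = 0;
        using (var tokenizer = new ClassicTokenizer(LuceneVersion.LUCENE_48, factory, new StringReader(text)))
        {
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                count++;
            }
            tokenizer.End();
        }
        return count;
    }
}
```

With the default factory this should behave the same as the two-argument constructor; the overload matters mainly when attribute creation needs to be customized.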

ClassicTokenizer() public method

Creates a new instance of the ClassicTokenizer. Attaches the input to the newly created JFlex scanner.
public ClassicTokenizer ( LuceneVersion matchVersion, System.IO.TextReader input )
matchVersion LuceneVersion
input System.IO.TextReader The input reader. See http://issues.apache.org/jira/browse/LUCENE-1068

Dispose() public method

public Dispose ( ) : void
return void

End() public method

public End ( ) : void
return void

IncrementToken() public method

public IncrementToken ( ) : bool
return bool

Reset() public method

public Reset ( ) : void
return void

Property Details

TOKEN_TYPES public static property

String token types that correspond to token type int constants
public static string[] TOKEN_TYPES
return string[]
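
The type label assigned to each token can be read through the type attribute and compared against TOKEN_TYPES. The following is a sketch under the assumption of Lucene.NET 4.8 (ITypeAttribute and ICharTermAttribute in Lucene.Net.Analysis.TokenAttributes); TokenTypeDemo and TermsWithTypes are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class TokenTypeDemo
{
    // Hypothetical helper: pairs each term with the type label the tokenizer assigned to it.
    public static List<KeyValuePair<string, string>> TermsWithTypes(string text)
    {
        var result = new List<KeyValuePair<string, string>>();

        using (var tokenizer = new ClassicTokenizer(LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // type.Type holds one of the strings listed in ClassicTokenizer.TOKEN_TYPES.
                result.Add(new KeyValuePair<string, string>(term.ToString(), type.Type));
            }
            tokenizer.End();
        }

        return result;
    }
}
```

A downstream filter can use these labels to treat, for example, email or hostname tokens differently from ordinary alphanumeric words.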