C# Class Lucene.Net.Analysis.Core.LetterTokenizer

A LetterTokenizer is a tokenizer that divides text at non-letters. That is to say, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate.

Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
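The splitting rule described above can be sketched without Lucene.Net at all. The helper below is hypothetical (LetterRuns and Split are illustration names, not part of the library), and it uses .NET's char.IsLetter on UTF-16 code units as a stand-in for the Character.isLetter predicate, so it only approximates the real tokenizer for characters outside the Basic Multilingual Plane.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of Lucene.Net: it only illustrates the
// "maximal runs of adjacent letters" rule.
public static class LetterRuns
{
    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        foreach (char ch in text)
        {
            if (char.IsLetter(ch))
            {
                // Still inside a run of letters: extend the current token.
                current.Append(ch);
            }
            else if (current.Length > 0)
            {
                // A non-letter ends the run: emit the token and start over.
                tokens.Add(current.ToString());
                current.Clear();
            }
        }

        // Emit a trailing token if the text ended on a letter.
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

With this sketch, "Hello, world! It's 2024." splits into Hello, world, It and s, while a run of CJK ideographs such as "我爱北京" comes back as one token because every ideograph counts as a letter and nothing separates the words.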

You must specify the required LuceneVersion compatibility when creating a LetterTokenizer.

Project: apache/lucenenet

Public Methods

Method Description
LetterTokenizer ( LuceneVersion matchVersion, Lucene.Net.Util.AttributeSource.AttributeFactory factory, TextReader @in )

Construct a new LetterTokenizer using a given Lucene.Net.Util.AttributeSource.AttributeFactory.

LetterTokenizer ( LuceneVersion matchVersion, TextReader @in )

Construct a new LetterTokenizer.

Protected Methods

Method Description
IsTokenChar ( int c ) : bool

Collects only characters which satisfy Character#isLetter(int).

Method Details

IsTokenChar() protected method

Collects only characters which satisfy Character#isLetter(int).
protected IsTokenChar ( int c ) : bool
c int
return bool
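
Because IsTokenChar is protected, a subclass can widen or narrow what counts as a token character. The following is a minimal sketch, assuming the Lucene.Net 4.8 packages (Lucene.Net and Lucene.Net.Analysis.Common) and assuming LetterTokenizer declares IsTokenChar as an overridable method. The class name LetterOrApostropheTokenizer is hypothetical and does not exist in the library.

```csharp
using System.IO;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Hypothetical subclass for illustration only: it keeps apostrophes inside
// tokens, so "It's" stays one token instead of splitting into "It" and "s".
public sealed class LetterOrApostropheTokenizer : LetterTokenizer
{
    public LetterOrApostropheTokenizer(LuceneVersion matchVersion, TextReader input)
        : base(matchVersion, input)
    {
    }

    // c is a Unicode code point, not a UTF-16 char.
    protected override bool IsTokenChar(int c)
    {
        return c == '\'' || base.IsTokenChar(c);
    }
}
```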

LetterTokenizer() public method

Construct a new LetterTokenizer using a given Lucene.Net.Util.AttributeSource.AttributeFactory.
public LetterTokenizer ( LuceneVersion matchVersion, Lucene.Net.Util.AttributeSource.AttributeFactory factory, TextReader @in )
matchVersion LuceneVersion Lucene version to match; see the class description above.
factory Lucene.Net.Util.AttributeSource.AttributeFactory The attribute factory to use for this tokenizer.
@in System.IO.TextReader The input to split up into tokens.

LetterTokenizer() public method

Construct a new LetterTokenizer.
public LetterTokenizer ( LuceneVersion matchVersion, TextReader @in )
matchVersion LuceneVersion Lucene version to match; see the class description above.
@in System.IO.TextReader The input to split up into tokens.
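
The two-argument constructor can be exercised as in the sketch below, which assumes the Lucene.Net 4.8 packages (Lucene.Net and Lucene.Net.Analysis.Common) and the LuceneVersion.LUCENE_48 constant. LetterTokenizerDemo and Tokens are illustration names, not library members. The Reset, IncrementToken, End sequence follows the usual TokenStream consumer workflow.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Hypothetical demo class: collects the tokens LetterTokenizer produces.
public static class LetterTokenizerDemo
{
    public static List<string> Tokens(string text)
    {
        var result = new List<string>();

        using (var reader = new StringReader(text))
        using (var tokenizer = new LetterTokenizer(LuceneVersion.LUCENE_48, reader))
        {
            // The term attribute exposes the text of the current token.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

            tokenizer.Reset();                  // required before the first IncrementToken()
            while (tokenizer.IncrementToken())
            {
                result.Add(term.ToString());
            }
            tokenizer.End();                    // finalizes end-of-stream state
        }
        return result;
    }

    public static void Main()
    {
        foreach (string token in Tokens("Hello, world! It's 2024."))
        {
            Console.WriteLine(token);
        }
    }
}
```

The three-argument overload is called the same way, with an attribute factory passed between the version and the reader.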