C# Class Lucene.Net.Analysis.Standard.UAX29URLEmailTokenizerImpl

This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

Tokens produced are of the following types:

  • <ALPHANUM>: A sequence of alphabetic and numeric characters
  • <NUM>: A number
  • <URL>: A URL
  • <EMAIL>: An email address
  • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
  • <IDEOGRAPHIC>: A single CJKV ideographic character
  • <HIRAGANA>: A single hiragana character
  • <KATAKANA>: A sequence of katakana characters
  • <HANGUL>: A sequence of Hangul characters
Inheritance: IStandardTokenizerInterface
Project: apache/lucenenet

Public Properties

Property Type Description
EMAIL_TYPE int
HANGUL_TYPE int
HIRAGANA_TYPE int
IDEOGRAPHIC_TYPE int
KATAKANA_TYPE int
NUMERIC_TYPE int
SOUTH_EAST_ASIAN_TYPE int
URL_TYPE int
WORD_TYPE int
YYEOF int
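
The static type codes above line up by name with the token types listed at the top of the page. As a rough sketch, the helper below maps a code returned by the scanner back to its label. `TokenTypeNames` and `Describe` are hypothetical names introduced here for illustration, and the pairing of `WORD_TYPE` with <ALPHANUM> and `NUMERIC_TYPE` with <NUM> is assumed from the naming rather than stated on this page.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;

// Hypothetical helper, not part of Lucene.Net.
public static class TokenTypeNames
{
    // Built at runtime with index initializers because the members are
    // listed as static ints, which are not necessarily compile-time constants.
    private static readonly Dictionary<int, string> Labels = new Dictionary<int, string>
    {
        [UAX29URLEmailTokenizerImpl.WORD_TYPE] = "<ALPHANUM>",             // assumed pairing
        [UAX29URLEmailTokenizerImpl.NUMERIC_TYPE] = "<NUM>",               // assumed pairing
        [UAX29URLEmailTokenizerImpl.URL_TYPE] = "<URL>",
        [UAX29URLEmailTokenizerImpl.EMAIL_TYPE] = "<EMAIL>",
        [UAX29URLEmailTokenizerImpl.SOUTH_EAST_ASIAN_TYPE] = "<SOUTHEAST_ASIAN>",
        [UAX29URLEmailTokenizerImpl.IDEOGRAPHIC_TYPE] = "<IDEOGRAPHIC>",
        [UAX29URLEmailTokenizerImpl.HIRAGANA_TYPE] = "<HIRAGANA>",
        [UAX29URLEmailTokenizerImpl.KATAKANA_TYPE] = "<KATAKANA>",
        [UAX29URLEmailTokenizerImpl.HANGUL_TYPE] = "<HANGUL>"
    };

    // Returns the label for a type code, or a placeholder for any other value
    // (including YYEOF, which marks end of input rather than a token type).
    public static string Describe(int tokenType)
    {
        string label;
        return Labels.TryGetValue(tokenType, out label)
            ? label
            : "<UNKNOWN:" + tokenType + ">";
    }
}
```

A dictionary is used instead of a `switch` so the sketch compiles whether or not the codes are declared `const`.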

Public Methods

Method Description
GetNextToken ( ) : int
GetText ( ICharTermAttribute t ) : void
UAX29URLEmailTokenizerImpl ( TextReader @in )
YyBegin ( int newState ) : void
YyCharAt ( int pos ) : char
YyClose ( ) : void
YyPushBack ( int number ) : void
YyReset ( TextReader reader ) : void
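
The public methods suggest a pull-style scanner: call `GetNextToken` until it returns `YYEOF`, and copy the text of each match out with `GetText`. The following is a minimal sketch of that loop under some assumptions: `ScanDemo` and `Tokenize` are hypothetical names, `CharTermAttribute` is assumed to be the concrete `ICharTermAttribute` available in the same Lucene.Net release, and the attribute namespace is spelled as it appears on this page (newer releases may use a different casing).

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
// Namespace spelling taken from this page; adjust if your release differs.
using Lucene.Net.Analysis.Tokenattributes;

// Hypothetical demo class, not part of Lucene.Net.
public static class ScanDemo
{
    // Returns (type code, token text) pairs for the given input.
    public static List<KeyValuePair<int, string>> Tokenize(string text)
    {
        var tokens = new List<KeyValuePair<int, string>>();
        using (var reader = new StringReader(text))
        {
            var scanner = new UAX29URLEmailTokenizerImpl(reader);
            ICharTermAttribute term = new CharTermAttribute(); // assumed concrete type

            int type;
            // GetNextToken returns a type code per token and YYEOF at end of input.
            while ((type = scanner.GetNextToken()) != UAX29URLEmailTokenizerImpl.YYEOF)
            {
                scanner.GetText(term); // copies the matched text into the attribute
                tokens.Add(new KeyValuePair<int, string>(type, term.ToString()));
            }
        }
        return tokens;
    }
}
```

In application code this class is normally driven indirectly through a tokenizer; calling the scanner directly, as here, is mainly useful for inspecting the type code assigned to each token.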

Private Methods

Method Description
ZzRefill ( ) : bool
ZzScanError ( int errorCode ) : void
ZzUnpackAction ( string packed, int offset, int result ) : int
ZzUnpackAction ( ) : int[]
ZzUnpackAttribute ( string packed, int offset, int result ) : int
ZzUnpackAttribute ( ) : int[]
ZzUnpackCMap ( string packed ) : char[]
ZzUnpackRowMap ( string packed, int offset, int result ) : int
ZzUnpackRowMap ( ) : int[]
ZzUnpackTrans ( string packed, int offset, int result ) : int
ZzUnpackTrans ( ) : int[]

Method Details

GetNextToken() public method

public GetNextToken ( ) : int
return int

GetText() public method

public GetText ( ICharTermAttribute t ) : void
t ICharTermAttribute
return void

UAX29URLEmailTokenizerImpl() public constructor

public UAX29URLEmailTokenizerImpl ( TextReader @in )
@in System.IO.TextReader

YyBegin() public method

public YyBegin ( int newState ) : void
newState int
return void

YyCharAt() public method

public YyCharAt ( int pos ) : char
pos int
return char

YyClose() public method

public YyClose ( ) : void
return void

YyPushBack() public method

public YyPushBack ( int number ) : void
number int
return void

YyReset() public method

public YyReset ( TextReader reader ) : void
reader System.IO.TextReader
return void
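
`YyReset` takes a new `TextReader`, which suggests that one scanner instance can be reused across several inputs. A small sketch of that pattern follows, assuming the scanner can be reset after reaching end of input; `ScannerReuse` and `CountTokens` are hypothetical names, and reader disposal is left to the caller.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;

// Hypothetical demo class, not part of Lucene.Net.
public static class ScannerReuse
{
    // Counts the tokens in each input while reusing a single scanner.
    public static List<int> CountTokens(IEnumerable<string> inputs)
    {
        var counts = new List<int>();
        // Start on an empty reader; every real input is supplied through YyReset.
        var scanner = new UAX29URLEmailTokenizerImpl(new StringReader(string.Empty));

        foreach (string input in inputs)
        {
            scanner.YyReset(new StringReader(input));

            int count = 0;
            while (scanner.GetNextToken() != UAX29URLEmailTokenizerImpl.YYEOF)
            {
                count++;
            }
            counts.Add(count);
        }
        return counts;
    }
}
```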

Property Details

EMAIL_TYPE public static property

public static int EMAIL_TYPE
return int

HANGUL_TYPE public static property

public static int HANGUL_TYPE
return int

HIRAGANA_TYPE public static property

public static int HIRAGANA_TYPE
return int

IDEOGRAPHIC_TYPE public static property

public static int IDEOGRAPHIC_TYPE
return int

KATAKANA_TYPE public static property

public static int KATAKANA_TYPE
return int

NUMERIC_TYPE public static property

public static int NUMERIC_TYPE
return int

SOUTH_EAST_ASIAN_TYPE public static property

public static int SOUTH_EAST_ASIAN_TYPE
return int

URL_TYPE public static property

public static int URL_TYPE
return int

WORD_TYPE public static property

public static int WORD_TYPE
return int

YYEOF public static property

public static int YYEOF
return int