C# (CSharp) Lucene.Net.Analysis.CJK Namespace

Classes

Name Description
CJKAnalyzer An analyzer that filters CJKTokenizer output with StopFilter. Author: Che, Dong
CJKAnalyzer.DefaultSetHolder
CJKAnalyzer.SavedStreams
CJKTokenizer

CJKTokenizer was modified from StopTokenizer, which does a decent job for most European languages. For double-byte characters it uses a different tokenization method: it returns a token for every two adjacent characters, so consecutive tokens overlap by one character.
Example: "java C1C2C3C4" (where C1 to C4 each stand for one double-byte character) is segmented into "java", "C1C2", "C2C3", "C3C4". Zero-length tokens ("") also need to be filtered out.
Digits: a digit, '+' or '#' is tokenized as a letter.
For more information on Asian-language (Chinese, Japanese, Korean) text segmentation, please search Google.
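
The overlapping two-character segmentation described above can be illustrated with a minimal, self-contained sketch. This is not the Lucene.Net implementation: the class CjkBigramSketch and its helpers are hypothetical names, the test for "double-byte" characters (any non-ASCII letter) is a simplifying assumption, and emitting an isolated double-byte character as its own token is also an assumption, since the description above does not cover that case.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of Lucene.Net.
public static class CjkBigramSketch
{
    // ASCII letters and digits, plus '+' and '#', extend a single-byte token.
    private static bool IsWordChar(char c) =>
        c < 128 && (char.IsLetterOrDigit(c) || c == '+' || c == '#');

    // Assumption: any letter outside ASCII is treated as a double-byte character.
    // Characters encoded as surrogate pairs are not letters here and are skipped.
    private static bool IsWideChar(char c) => c >= 128 && char.IsLetter(c);

    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            if (IsWordChar(text[i]))
            {
                // Collect a whole run of single-byte word characters as one token.
                int start = i;
                while (i < text.Length && IsWordChar(text[i])) i++;
                tokens.Add(text.Substring(start, i - start));
            }
            else if (IsWideChar(text[i]))
            {
                // Collect a run of double-byte characters, then emit overlapping pairs.
                int start = i;
                while (i < text.Length && IsWideChar(text[i])) i++;
                if (i - start == 1)
                {
                    // Assumption: a lone double-byte character becomes its own token.
                    tokens.Add(text.Substring(start, 1));
                }
                for (int j = start; j + 1 < i; j++)
                {
                    tokens.Add(text.Substring(j, 2));
                }
            }
            else
            {
                // Whitespace and punctuation separate tokens and never yield empty ones.
                i++;
            }
        }
        return tokens;
    }
}
```

Calling CjkBigramSketch.Tokenize("java 中文分词") yields "java", "中文", "文分", "分词", which mirrors the "java C1C2C3C4" example. Because separators are skipped rather than turned into tokens, the sketch never produces the zero-length token that the description says must be filtered out.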

@author Che, Dong @version $Id: CJKTokenizer.java,v 1.3 2003/01/22 20:54:47 otis Exp $