C# (CSharp) Lucene.Net.Analysis.Cn Namespace

Classes

Name Description
ChineseAnalyzer An Analyzer that tokenizes text with ChineseTokenizer and filters the resulting tokens with ChineseFilter.
ChineseFilter A TokenFilter with a stop word table.
  • Numeric tokens are removed.
  • English tokens must be longer than one character.
  • Each Chinese character is treated as one Chinese word.
TO DO:
  1. Add Chinese stop words, such as \ue400
  2. Dictionary based Chinese word extraction
  3. Intelligent Chinese word extraction
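
The three ChineseFilter rules listed above can be illustrated with a small standalone sketch. It does not use the Lucene.Net API: the class name ChineseFilterSketch, the methods Keep and Filter, and the contents of the stop word table are all hypothetical. It also assumes that a token is classified by its first character and that tokens starting with anything other than a digit or a letter are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChineseFilterSketch
{
    // Hypothetical stop word table; the real table is not shown in this page.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.Ordinal) { "and", "the", "of" };

    // Returns true if the token should stay in the stream.
    public static bool Keep(string token)
    {
        if (string.IsNullOrEmpty(token) || StopWords.Contains(token))
        {
            return false;
        }

        char first = token[0];
        if (char.IsDigit(first))
        {
            return false;                 // numeric tokens are removed
        }
        if (first < 128 && char.IsLetter(first))
        {
            return token.Length > 1;      // English tokens must be longer than one character
        }
        // Other letters (for example Chinese characters) are kept as
        // single-character words; anything else is dropped (assumption).
        return char.IsLetter(first);
    }

    public static List<string> Filter(IEnumerable<string> tokens)
    {
        return tokens.Where(Keep).ToList();
    }
}
```

Under these assumptions, filtering the tokens "the", "a", "ok", "42", "中", "b" keeps only "ok" and "中".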
ChineseTokenizer Tokenizes Chinese text into individual Chinese characters.

ChineseTokenizer and CJKTokenizer differ in how they split text into tokens: ChineseTokenizer emits one token per character, while CJKTokenizer emits overlapping two-character tokens.

For example, if the Chinese text "C1C2C3C4" is to be indexed:

  • The tokens returned from ChineseTokenizer are C1, C2, C3, C4.
  • The tokens returned from CJKTokenizer are C1C2, C2C3, C3C4.

Therefore the index created by CJKTokenizer is much larger.

The trade-off shows up at search time: for queries such as C1, C1C2, C1C3, C4C2, C1C2C3 ..., ChineseTokenizer works, but CJKTokenizer does not.
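
The two token parsing strategies in the "C1C2C3C4" example can be sketched without any Lucene.Net dependency. The class TokenizerSketch and its methods Unigrams and Bigrams are hypothetical names used only for illustration. The sketch also assumes every character fits in a single UTF-16 code unit, which holds for common Chinese characters but not for supplementary-plane ones.

```csharp
using System;
using System.Collections.Generic;

public static class TokenizerSketch
{
    // One token per character, the behavior described for ChineseTokenizer.
    public static List<string> Unigrams(string text)
    {
        var tokens = new List<string>();
        foreach (char c in text)
        {
            tokens.Add(c.ToString());
        }
        return tokens;
    }

    // Overlapping two-character tokens, the behavior described for CJKTokenizer.
    public static List<string> Bigrams(string text)
    {
        var tokens = new List<string>();
        for (int i = 0; i + 1 < text.Length; i++)
        {
            tokens.Add(text.Substring(i, 2));
        }
        return tokens;
    }
}
```

For the four-character input "中华人民", Unigrams returns 中, 华, 人, 民 and Bigrams returns 中华, 华人, 人民, mirroring C1, C2, C3, C4 versus C1C2, C2C3, C3C4. A single-character query such as C1 has no matching bigram token, which is one way to see why such searches fail against the bigram index.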

ChineseTokenizerFactory
TestChineseFilterFactory Simple tests to ensure the Chinese filter factory is working.
TestChineseTokenizer
TestChineseTokenizer.JustChineseFilterAnalyzer
TestChineseTokenizer.JustChineseTokenizerAnalyzer
TestChineseTokenizerFactory Simple tests to ensure the Chinese tokenizer factory is working.