C# (CSharp) Lucene.Net.Analysis.Core Namespace

Classes

Name Description
KeywordTokenizer Emits the entire input as a single token.
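
For illustration, a minimal sketch of consuming a KeywordTokenizer directly, assuming the Lucene.NET 4.8 API (the sample text is invented):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;

    // KeywordTokenizer treats the entire reader contents as one token.
    using (var tokenizer = new KeywordTokenizer(new StringReader("Hello Keyword World")))
    {
        var termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
        tokenizer.Reset();
        while (tokenizer.IncrementToken())
        {
            Console.WriteLine(termAtt.ToString()); // prints "Hello Keyword World" exactly once
        }
        tokenizer.End();
    }
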
LetterTokenizer A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate.

Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.

You must specify the required LuceneVersion compatibility when creating LetterTokenizer:

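For illustration, a minimal sketch of what LetterTokenizer produces, assuming the Lucene.NET 4.8 API (the LuceneVersion.LUCENE_48 value and the sample text are placeholders):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    // Digits and punctuation are treated as token boundaries, so
    // "3.14 isn't a word" yields: "isn", "t", "a", "word".
    using (var tokenizer = new LetterTokenizer(LuceneVersion.LUCENE_48,
        new StringReader("3.14 isn't a word")))
    {
        var termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
        tokenizer.Reset();
        while (tokenizer.IncrementToken())
        {
            Console.WriteLine(termAtt.ToString());
        }
        tokenizer.End();
    }
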
LowerCaseFilter Normalizes token text to lower case.

You must specify the required LuceneVersion compatibility when creating LowerCaseFilter:

  • As of 3.1, supplementary characters are properly lowercased.

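For illustration, a minimal sketch of LowerCaseFilter wrapped around a tokenizer, assuming the Lucene.NET 4.8 API (the version value and the sample text are placeholders):

    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    // Wrap a WhitespaceTokenizer with LowerCaseFilter: "FOO Bar" -> "foo", "bar".
    TokenStream stream = new LowerCaseFilter(
        LuceneVersion.LUCENE_48,
        new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("FOO Bar")));
    var termAtt = stream.AddAttribute<ICharTermAttribute>();
    stream.Reset();
    while (stream.IncrementToken())
    {
        Console.WriteLine(termAtt.ToString());
    }
    stream.End();
    stream.Dispose();
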
LowerCaseFilterFactory Factory for LowerCaseFilter.
 <fieldType name="text_lwrcase" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>
LowerCaseTokenizer LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts the resulting tokens to lower case. While it is functionally equivalent to the combination of LetterTokenizer and LowerCaseFilter, there is a performance advantage to doing the two tasks at once, hence this (redundant) implementation.

Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.

You must specify the required Version compatibility when creating LowerCaseTokenizer:

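For illustration, a minimal sketch contrasting the two equivalent chains, assuming the Lucene.NET 4.8 API (the version value and the sample text are placeholders):

    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Util;

    var version = LuceneVersion.LUCENE_48;
    var text = "Foo BAR";

    // Two-stage chain: LetterTokenizer followed by LowerCaseFilter ...
    using (TokenStream chained = new LowerCaseFilter(version,
        new LetterTokenizer(version, new StringReader(text))))
    // ... versus LowerCaseTokenizer, which does both in a single pass.
    using (TokenStream combined = new LowerCaseTokenizer(version, new StringReader(text)))
    {
        // Both streams emit the same tokens for this input: "foo", "bar".
    }
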
PayloadSetter
SimpleAnalyzer An Analyzer that filters LetterTokenizer with LowerCaseFilter.

You must specify the required Version compatibility when creating CharTokenizer:

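For illustration, a minimal sketch of running text through SimpleAnalyzer, assuming the Lucene.NET 4.8 API (the field name "body", the version value, and the sample text are invented):

    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    var analyzer = new SimpleAnalyzer(LuceneVersion.LUCENE_48);

    // Letters only, lowercased: "XY&Z Corporation - xyz@example.com"
    // -> "xy", "z", "corporation", "xyz", "example", "com"
    using (TokenStream stream = analyzer.GetTokenStream("body",
        new StringReader("XY&Z Corporation - xyz@example.com")))
    {
        var termAtt = stream.AddAttribute<ICharTermAttribute>();
        stream.Reset();
        while (stream.IncrementToken())
        {
            Console.WriteLine(termAtt.ToString());
        }
        stream.End();
    }
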
StopAnalyzer Filters LetterTokenizer with LowerCaseFilter and StopFilter.

You must specify the required LuceneVersion compatibility when creating StopAnalyzer:

  • As of 3.1, StopFilter correctly handles Unicode 4.0 supplementary characters in stopwords
  • As of 2.9, position increments are preserved

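For illustration, a minimal sketch of StopAnalyzer with a custom stop set, assuming the Lucene.NET 4.8 API (the stop words, field name, and sample text are invented):

    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    // Build a custom stop set; the single-argument StopAnalyzer constructor
    // uses the default English stop words instead.
    var stopWords = StopFilter.MakeStopSet(LuceneVersion.LUCENE_48, "the", "and");
    var analyzer = new StopAnalyzer(LuceneVersion.LUCENE_48, stopWords);

    // "The Quick AND the Dead" -> "quick", "dead" (lowercased, stop words removed)
    using (TokenStream stream = analyzer.GetTokenStream("body",
        new StringReader("The Quick AND the Dead")))
    {
        var termAtt = stream.AddAttribute<ICharTermAttribute>();
        stream.Reset();
        while (stream.IncrementToken())
        {
            Console.WriteLine(termAtt.ToString());
        }
        stream.End();
    }
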
StopFilterFactory Factory for StopFilter.
 <fieldType name="text_stop" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" format="wordset" />
   </analyzer>
 </fieldType>

All attributes are optional:

  • ignoreCase defaults to false
  • words should be the name of a stopwords file to parse; if not specified, the factory will use StopAnalyzer#ENGLISH_STOP_WORDS_SET
  • format defines how the words file will be parsed, and defaults to wordset. If words is not specified, then format must not be specified.

The valid values for the format option are:

  • wordset - This is the default format, which supports one word per line (including any intra-word whitespace) and allows whole-line comments beginning with the "#" character. Blank lines are ignored. See WordlistLoader.getLines for details.
  • snowball - This format allows multiple words to be specified on each line, and trailing comments may be specified using the vertical line ("|"). Blank lines are ignored. See WordlistLoader.getSnowballWordSet for details.
TestAnalyzers
TestAnalyzers.LowerCaseWhitespaceAnalyzer
TestAnalyzers.UpperCaseWhitespaceAnalyzer
TestBugInSomething
TestBugInSomething.AnalyzerAnonymousInnerClassHelper
TestBugInSomething.AnalyzerAnonymousInnerClassHelper100
TestBugInSomething.AnalyzerAnonymousInnerClassHelper2
TestBugInSomething.CharFilterAnonymousInnerClassHelper
TestBugInSomething.SopTokenFilter
TestClassicAnalyzer
TestDuelingAnalyzers Compares MockTokenizer (which is simple, with no optimizations) with equivalent core tokenizers (which have optimizations such as buffering). Any tests here probably need to consider the Unicode version of the JRE (it could cause false failures).
TestDuelingAnalyzers.AnalyzerAnonymousInnerClassHelper
TestDuelingAnalyzers.AnalyzerAnonymousInnerClassHelper2
TestDuelingAnalyzers.AnalyzerAnonymousInnerClassHelper3
TestDuelingAnalyzers.AnalyzerAnonymousInnerClassHelper4
TestDuelingAnalyzers.AnalyzerAnonymousInnerClassHelper5
TestDuelingAnalyzers.AnalyzerAnonymousInnerClassHelper6
TestKeywordAnalyzer
TestStandardAnalyzer
TestStandardAnalyzer.AnalyzerAnonymousInnerClassHelper
TestStandardAnalyzer.AnalyzerAnonymousInnerClassHelper2
TestStandardAnalyzer.AnalyzerAnonymousInnerClassHelper3
TestStandardAnalyzer.AnalyzerAnonymousInnerClassHelper4
TestStopAnalyzer
TestStopFilter
TestStopFilter.AnalyzerAnonymousInnerClassHelper
TestStopFilter.MockSynonymFilter
TestStopFilterFactory
TestTypeTokenFilter
TestTypeTokenFilterFactory Testcase for TypeTokenFilterFactory
TestUAX29URLEmailAnalyzer
TestUAX29URLEmailTokenizer
TestUAX29URLEmailTokenizer.AnalyzerAnonymousInnerClassHelper
TestUAX29URLEmailTokenizer.AnalyzerAnonymousInnerClassHelper2
TestUAX29URLEmailTokenizer.AnalyzerAnonymousInnerClassHelper3
TestUAX29URLEmailTokenizer.AnalyzerAnonymousInnerClassHelper4
TestUAX29URLEmailTokenizer.AnalyzerAnonymousInnerClassHelper5
TestUAX29URLEmailTokenizer.AnalyzerAnonymousInnerClassHelper6
TestUAX29URLEmailTokenizer.EmailFilter Passes through tokens with type "<EMAIL>" and blocks all other types.
TestUAX29URLEmailTokenizer.URLFilter Passes through tokens with type "<URL>" and blocks all other types.
TestUAX29URLEmailTokenizer.UrlAnalyzerAnonymousInnerClassHelper
TypeTokenFilterFactory Factory class for TypeTokenFilter.
 <fieldType name="chars" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt" useWhitelist="false"/>
   </analyzer>
 </fieldType>
UpperCaseFilter Normalizes token text to UPPER CASE.

You must specify the required LuceneVersion compatibility when creating UpperCaseFilter

NOTE: In Unicode, this transformation may lose information when the upper case character represents more than one lower case character. Use this filter when you require uppercase tokens. Use the LowerCaseFilter for general search matching.

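For illustration, a minimal sketch of UpperCaseFilter over a whitespace tokenizer, assuming the Lucene.NET 4.8 API (the version value and the sample text are placeholders):

    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    // "mixed Case" -> "MIXED", "CASE"
    TokenStream stream = new UpperCaseFilter(
        LuceneVersion.LUCENE_48,
        new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("mixed Case")));
    var termAtt = stream.AddAttribute<ICharTermAttribute>();
    stream.Reset();
    while (stream.IncrementToken())
    {
        Console.WriteLine(termAtt.ToString());
    }
    stream.End();
    stream.Dispose();
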
UpperCaseFilterFactory Factory for UpperCaseFilter.
 <fieldType name="text_uppercase" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.UpperCaseFilterFactory"/>
   </analyzer>
 </fieldType>

NOTE: In Unicode, this transformation may lose information when the upper case character represents more than one lower case character. Use this filter when you require uppercase tokens. Use the LowerCaseFilterFactory for general search matching.

WhitespaceAnalyzer An Analyzer that uses WhitespaceTokenizer.

You must specify the required Version compatibility when creating CharTokenizer:

WhitespaceTokenizer A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.

You must specify the required LuceneVersion compatibility when creating WhitespaceTokenizer:

  • As of 3.1, CharTokenizer uses an int based API to normalize and detect token characters. See CharTokenizer#isTokenChar(int) and CharTokenizer#normalize(int) for details.

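For illustration, a minimal sketch showing how WhitespaceTokenizer differs from LetterTokenizer on the same kind of input, assuming the Lucene.NET 4.8 API (the version value and the sample text are placeholders):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    // Only whitespace separates tokens, so punctuation and digits are kept:
    // "won't-stop 3.14" -> "won't-stop", "3.14"
    using (var tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
        new StringReader("won't-stop 3.14")))
    {
        var termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
        tokenizer.Reset();
        while (tokenizer.IncrementToken())
        {
            Console.WriteLine(termAtt.ToString());
        }
        tokenizer.End();
    }
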
WordBreakTestUnicode_6_3_0 This class was automatically generated by generateJavaUnicodeWordBreakTest.pl from: http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt

WordBreakTest.txt indicates the points in the provided character sequences at which conforming implementations must and must not break words. This class tests for expected token extraction from each of the test sequences in WordBreakTest.txt, where the expected tokens are those character sequences bounded by word breaks and containing at least one character from one of the following character sets:

  • \p{Script = Han} (from http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt)
  • \p{Script = Hiragana}
  • \p{LineBreak = Complex_Context} (from http://www.unicode.org/Public/6.3.0/ucd/LineBreak.txt)
  • \p{WordBreak = ALetter} (from http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakProperty.txt)
  • \p{WordBreak = Hebrew_Letter}
  • \p{WordBreak = Katakana}
  • \p{WordBreak = Numeric} (excludes full-width Arabic digits)
  • [\uFF10-\uFF19] (full-width Arabic digits)