C# (CSharp) Lucene.Net.Analysis.Pattern Namespace

Classes

Name Description
PatternCaptureGroupFilterFactory Factory for PatternCaptureGroupTokenFilter.
 <fieldType name="text_ptncapturegroup" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.PatternCaptureGroupFilterFactory" pattern="([^a-z])" preserve_original="true"/>
   </analyzer>
 </fieldType>
PatternCaptureGroupTokenFilter CaptureGroup uses Java regexes to emit multiple tokens - one for each capture group in one or more patterns.

For example, a pattern like:

"(https?://([a-zA-Z\-_0-9.]+))"

when matched against the string "http://www.foo.com/index" would return the tokens "http://www.foo.com" and "www.foo.com".

If none of the patterns match, or if preserveOriginal is true, the original token will be preserved.

Each pattern is matched as often as it can be, so the pattern "(...)", when matched against "abcdefghi", would produce ["abc","def","ghi"].
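The matching behaviour described above can be sketched with plain java.util.regex, without any Lucene classes. This is only an illustration: the class name CaptureGroupSketch and the method captureGroups are hypothetical, and the sketch ignores token positions, offsets, and the preserveOriginal option.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: shows "one token per capture group".
class CaptureGroupSketch {

    // Returns the text of every non-empty capture group, for every match of regex in input.
    static List<String> captureGroups(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {                              // the pattern is matched as often as it can be
            for (int g = 1; g <= m.groupCount(); g++) { // group 0 (the whole match) is skipped
                String text = m.group(g);
                if (text != null && !text.isEmpty()) {
                    tokens.add(text);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [http://www.foo.com, www.foo.com]
        System.out.println(captureGroups("(https?://([a-zA-Z\\-_0-9.]+))", "http://www.foo.com/index"));
        // prints [abc, def, ghi]
        System.out.println(captureGroups("(...)", "abcdefghi"));
    }
}
```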

A camelCaseFilter could be written as:

"([A-Z]{2,})",
"(?<![A-Z])([A-Z][a-z]+)",
"(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
"([0-9]+)"

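When matched against the string "camelCaseFilter", these patterns would return the tokens "camel", "Case", and "Filter"; if preserveOriginal is true, it would also return "camelCaseFilter".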
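To see what the four camelCase patterns produce, the sketch below applies them with java.util.regex and orders the captured spans by start offset. The class CamelCaseSketch and its split method are hypothetical. Ordering by offset and placing the original token first are assumptions made for this illustration, not a statement about how the real filter orders its output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: applies several capture patterns to one token.
class CamelCaseSketch {

    // The four patterns listed above.
    static final String[] CAMEL_CASE_PATTERNS = {
        "([A-Z]{2,})",
        "(?<![A-Z])([A-Z][a-z]+)",
        "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
        "([0-9]+)"
    };

    static List<String> split(String token, boolean preserveOriginal) {
        List<int[]> spans = new ArrayList<>();          // {start, end} of each captured group
        for (String regex : CAMEL_CASE_PATTERNS) {
            Matcher m = Pattern.compile(regex).matcher(token);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    // start(g) is -1 when the group did not take part in the match
                    if (m.start(g) >= 0 && m.end(g) > m.start(g)) {
                        spans.add(new int[] {m.start(g), m.end(g)});
                    }
                }
            }
        }
        // Assumption for this sketch: emit captured spans in order of start offset.
        spans.sort(Comparator.comparingInt((int[] s) -> s[0]));

        List<String> tokens = new ArrayList<>();
        if (preserveOriginal || spans.isEmpty()) {      // keep the original if asked, or if nothing matched
            tokens.add(token);
        }
        for (int[] s : spans) {
            tokens.add(token.substring(s[0], s[1]));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("camelCaseFilter", false)); // prints [camel, Case, Filter]
        System.out.println(split("camelCaseFilter", true));  // prints [camelCaseFilter, camel, Case, Filter]
    }
}
```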

PatternReplaceCharFilter CharFilter that uses a regular expression as the target of the replacement. The pattern match is performed on each "block" of the char stream.

ex1) source="aa  bb aa bb", pattern="(aa)\\s+(bb)" replacement="$1#$2"
output="aa#bb aa#bb"

NOTE: If the replacement produces a phrase whose length differs from the source string, and the field is used to highlight a term in that phrase, the highlighted region can be wrong.

ex2) source="aa123bb", pattern="(aa)\\d+(bb)" replacement="$1 $2"
output="aa bb"
If you then search for bb and highlight it, you will get
highlight snippet="aa1<em>23bb</em>"
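Both replacement examples can be reproduced with java.util.regex alone. The sketch below is a hypothetical stand-in (the class PatternReplaceSketch is not a Lucene type) and works on whole strings rather than on a character stream, so it says nothing about how the real filter processes blocks or corrects offsets.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: regex replacement over a whole string.
class PatternReplaceSketch {

    // Replaces every match of pattern in source; $1, $2, ... refer to capture groups.
    static String replace(String source, String pattern, String replacement) {
        return Pattern.compile(pattern).matcher(source).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // ex1: prints aa#bb aa#bb
        System.out.println(replace("aa  bb aa bb", "(aa)\\s+(bb)", "$1#$2"));

        // ex2: prints aa bb
        String source = "aa123bb";
        String output = replace(source, "(aa)\\d+(bb)", "$1 $2");
        System.out.println(output);

        // The output is shorter than the source (5 vs 7 characters), which is why
        // offsets computed on the output no longer line up with the source text.
        System.out.println(source.length() + " -> " + output.length()); // prints 7 -> 5
    }
}
```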

@since Solr 1.5
PatternReplaceCharFilterFactory Factory for PatternReplaceCharFilter.
 <fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-z])" replacement=""/>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
   </analyzer>
 </fieldType>
@since Solr 3.1
PatternReplaceFilter A TokenFilter which applies a Pattern to each token in the stream, replacing match occurrences with the specified replacement string.

Note: Depending on the pattern used and the input TokenStream, this TokenFilter may produce Tokens whose text is the empty string.
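The empty-token case is easy to reproduce outside Lucene. The sketch below applies a pattern to a list of strings standing in for tokens. The class TokenReplaceSketch and the boolean parameter all are hypothetical; the flag only mimics the difference between replacing every occurrence and replacing the first one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: per-token regex replacement.
class TokenReplaceSketch {

    // Applies the replacement to each token; all=true replaces every match, all=false only the first.
    static List<String> replaceInTokens(List<String> tokens, String regex, String replacement, boolean all) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            Matcher m = pattern.matcher(token);
            result.add(all ? m.replaceAll(replacement) : m.replaceFirst(replacement));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("A1b2", "123");
        // Every non-lowercase character is removed; "123" becomes the empty string.
        System.out.println(replaceInTokens(tokens, "([^a-z])", "", true));  // prints [b, ]
        // Only the first non-lowercase character of each token is removed.
        System.out.println(replaceInTokens(tokens, "([^a-z])", "", false)); // prints [1b2, 23]
    }
}
```

Because the filter may emit such empty tokens, a later stage of the analysis chain may need to drop them.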

PatternReplaceFilterFactory Factory for PatternReplaceFilter.
 <fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
   </analyzer>
 </fieldType>
PatternTokenizer This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group".

  • "pattern" is the regular expression.
  • "group" says which group to extract into tokens.

group=-1 (the default) is equivalent to "split". In this case, the tokens are equivalent to the output of String#split(java.lang.String), with empty tokens removed.

Using group >= 0 selects the matching group as the token. For example, if you have:

 pattern = \'([^\']+)\'
 group = 0
 input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks)

NOTE: This Tokenizer does not output tokens that are of zero length.
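The two modes can be sketched with java.util.regex as follows. The class PatternTokenizerSketch and its tokenize method are hypothetical and operate on a complete string rather than a stream. The comma-separated input used for the split case is an illustrative assumption, not an example from the original documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: the "split" and "group" modes on a plain string.
class PatternTokenizerSketch {

    static List<String> tokenize(String input, String regex, int group) {
        List<String> tokens = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex);
        if (group < 0) {
            // group = -1: the pattern is a delimiter, as with String#split
            for (String piece : pattern.split(input)) {
                if (!piece.isEmpty()) {                 // zero-length tokens are not emitted
                    tokens.add(piece);
                }
            }
        } else {
            // group >= 0: each match contributes the text of the selected group
            Matcher m = pattern.matcher(input);
            while (m.find()) {
                String text = m.group(group);
                if (text != null && !text.isEmpty()) {  // zero-length tokens are not emitted
                    tokens.add(text);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(tokenize(input, "'([^']+)'", 0));     // prints ['bbb', 'ccc']
        System.out.println(tokenize(input, "'([^']+)'", 1));     // prints [bbb, ccc]
        System.out.println(tokenize("a, b,,c", ",\\s*", -1));    // prints [a, b, c]
    }
}
```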

PatternTokenizerFactory Factory for PatternTokenizer. This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group".

  • "pattern" is the regular expression.
  • "group" says which group to extract into tokens.

group=-1 (the default) is equivalent to "split". In this case, the tokens are equivalent to the output of String#split(java.lang.String), with empty tokens removed.

Using group >= 0 selects the matching group as the token. For example, if you have:

 pattern = \'([^\']+)\'
 group = 0
 input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks)

NOTE: This Tokenizer does not output tokens that are of zero length.

 <fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.PatternTokenizerFactory" pattern="\'([^\']+)\'" group="1"/>
   </analyzer>
 </fieldType>
TestPatternCaptureGroupTokenFilter
TestPatternCaptureGroupTokenFilter.AnalyzerAnonymousInnerClassHelper
TestPatternReplaceCharFilter Tests PatternReplaceCharFilter
TestPatternReplaceCharFilter.AnalyzerAnonymousInnerClassHelper
TestPatternReplaceFilter
TestPatternReplaceFilter.AnalyzerAnonymousInnerClassHelper
TestPatternReplaceFilter.AnalyzerAnonymousInnerClassHelper2
TestPatternReplaceFilter.AnalyzerAnonymousInnerClassHelper3
TestPatternReplaceFilterFactory Simple tests to ensure this factory is working
TestPatternTokenizer
TestPatternTokenizer.AnalyzerAnonymousInnerClassHelper
TestPatternTokenizer.AnalyzerAnonymousInnerClassHelper2
TestPatternTokenizerFactory Simple Tests to ensure this factory is working