C# Class Lucene.Net.Analysis.Shingle.ShingleMatrixFilter

A ShingleMatrixFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.

For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
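The sentence example above can be sketched in a few lines of plain C#. This is only an illustration of the shingling idea under simple assumptions (whitespace tokens, a string spacer); `ShingleSketch` and its `Shingles` method are hypothetical names and are not part of Lucene.Net.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.Net.
public static class ShingleSketch
{
    // Joins every run of `size` adjacent tokens into one shingle,
    // for each size from minSize to maxSize inclusive.
    public static List<string> Shingles(
        IReadOnlyList<string> tokens, int minSize, int maxSize, string spacer)
    {
        var result = new List<string>();
        for (int start = 0; start < tokens.Count; start++)
        {
            for (int size = minSize;
                 size <= maxSize && start + size <= tokens.Count;
                 size++)
            {
                // Copy the window [start, start + size) and join it.
                var parts = new string[size];
                for (int i = 0; i < size; i++)
                {
                    parts[i] = tokens[start + i];
                }
                result.Add(string.Join(spacer, parts));
            }
        }
        return result;
    }
}
```

Calling `ShingleSketch.Shingles("please divide this sentence into shingles".Split(' '), 2, 2, " ")` yields the five two-token shingles listed above, from "please divide" to "into shingles".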

Using a shingle filter at both index and query time can in some instances replace phrase queries, especially those with 0 slop.

Without a spacer character it can be used to handle composition and decomposition of words such as searching for "multi dimensional" instead of "multidimensional". It is a rather common human problem at query time in several languages, notably the northern Germanic branch.
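To illustrate the spacer-less case, the sketch below joins adjacent query tokens with nothing between them and checks whether one of the resulting shingles equals an indexed term. It is a simplified stand-in for what happens inside an index lookup; `CompoundSketch` and its members are hypothetical names, not Lucene.Net API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.Net.
public static class CompoundSketch
{
    // Joins each pair of adjacent tokens with no spacer between them,
    // e.g. ["multi", "dimensional"] gives "multidimensional".
    public static HashSet<string> SpacerlessBigrams(IReadOnlyList<string> tokens)
    {
        var shingles = new HashSet<string>();
        for (int i = 0; i + 1 < tokens.Count; i++)
        {
            shingles.Add(tokens[i] + tokens[i + 1]);
        }
        return shingles;
    }

    // True when any spacer-less bigram of the query equals the indexed term.
    public static bool QueryMatchesTerm(string query, string indexedTerm)
    {
        string[] tokens = query.Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return SpacerlessBigrams(tokens).Contains(indexedTerm);
    }
}
```

With this sketch, `CompoundSketch.QueryMatchesTerm("multi dimensional data", "multidimensional")` returns true, while a query such as "multi level" does not match.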

Among other things, shingles are also known to help with problems in spell checking, language detection and document clustering.

This filter is backed by a three-dimensional, column-oriented matrix. It creates permutations of the second dimension, the rows, and leaves the third, the z-axis, for multi-token synonyms.

In order to use this filter you need to define a way of positioning the input stream tokens in the matrix. This is done using a ShingleMatrixFilter.TokenSettingsCodec. There are three simple implementations for demonstration purposes; see ShingleMatrixFilter.OneDimensionalNonWeightedTokenSettingsCodec, ShingleMatrixFilter.TwoDimensionalNonWeightedSynonymTokenSettingsCodec and ShingleMatrixFilter.SimpleThreeDimensionalTokenSettingsCodec.
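As a rough illustration of what "positioning tokens in the matrix" means, the sketch below builds a column/row structure from tokens that each carry a placement instruction. The `Positioner` enum and `MatrixSketch` class are hypothetical stand-ins invented for this example; they are not the library's TokenSettingsCodec API, and the real codec may encode the placement differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical placement instruction attached to each input token.
public enum Positioner { NewColumn, NewRow, SameRow }

// Hypothetical helper, not part of Lucene.Net.
public static class MatrixSketch
{
    // Builds matrix[column][row] = list of token texts in that row.
    public static List<List<List<string>>> Build(
        IEnumerable<(string Text, Positioner Pos)> tokens)
    {
        var matrix = new List<List<List<string>>>();
        foreach (var (text, pos) in tokens)
        {
            // Start a new column when asked, or when none exists yet.
            if (pos == Positioner.NewColumn || matrix.Count == 0)
            {
                matrix.Add(new List<List<string>>());
            }
            var column = matrix[matrix.Count - 1];

            // Start a new row unless the token continues the current row.
            if (pos != Positioner.SameRow || column.Count == 0)
            {
                column.Add(new List<string>());
            }
            column[column.Count - 1].Add(text);
        }
        return matrix;
    }
}
```

Feeding it hello (NewColumn), greetings (NewRow), and (SameRow), salutations (SameRow), world (NewColumn), earth (NewRow), tellus (NewRow) gives two columns: the first with the rows {hello} and {greetings, and, salutations}, the second with the rows {world}, {earth} and {tellus}.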

Consider this token matrix:

 Token[column][row][z-axis]{
   {{hello}, {greetings, and, salutations}},
   {{world}, {earth}, {tellus}}
 };

It would produce the following 2-3 gram sized shingles:

 "hello_world"
 "greetings_and"
 "greetings_and_salutations"
 "and_salutations"
 "and_salutations_world"
 "salutations_world"
 "hello_earth"
 "and_salutations_earth"
 "salutations_earth"
 "hello_tellus"
 "and_salutations_tellus"
 "salutations_tellus"
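The shingles listed above can be reproduced with a small sketch that walks every combination of one row per column, concatenates the tokens of the chosen rows, and emits each 2-3 token window once. This is an independent illustration, not the filter's actual implementation; `MatrixShingleSketch` is a hypothetical name, the matrix is modelled as a plain `string[][][]`, and every column is assumed to contain at least one row.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.Net.
public static class MatrixShingleSketch
{
    // matrix[column][row] is the array of token texts in that row.
    // Assumes every column has at least one row.
    public static List<string> Shingles(
        string[][][] matrix, int minSize, int maxSize, string spacer)
    {
        var seen = new HashSet<string>();   // shingles already emitted
        var output = new List<string>();
        if (matrix.Length == 0) return output;

        // rowIndex[c] is the row currently selected in column c.
        var rowIndex = new int[matrix.Length];
        while (true)
        {
            // Concatenate the tokens of the selected row of each column.
            var tokens = new List<string>();
            for (int c = 0; c < matrix.Length; c++)
            {
                tokens.AddRange(matrix[c][rowIndex[c]]);
            }

            // Emit every window of minSize..maxSize tokens, skipping duplicates.
            for (int start = 0; start < tokens.Count; start++)
            {
                for (int size = minSize;
                     size <= maxSize && start + size <= tokens.Count;
                     size++)
                {
                    string shingle = string.Join(spacer, tokens.GetRange(start, size));
                    if (seen.Add(shingle)) output.Add(shingle);
                }
            }

            // Advance to the next row combination; the first column varies fastest.
            int col = 0;
            while (col < matrix.Length && ++rowIndex[col] == matrix[col].Length)
            {
                rowIndex[col] = 0;
                col++;
            }
            if (col == matrix.Length) break;   // all combinations visited
        }
        return output;
    }
}
```

The `seen` set plays the role of the duplicate tracking described in this documentation: "greetings_and", for instance, appears in three row combinations but is emitted only once. It is also the part that grows with the number of distinct shingles.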

This implementation can be rather heap demanding if (maximum shingle size - minimum shingle size) is a great number and the stream contains many columns, or if each column contains a great number of rows.

The problem is that, in order to avoid producing duplicates, the filter needs to keep track of every shingle already produced and returned to the consumer.

There is a bit of resource management to handle this, but it would of course be much better if the filter were written so that it never created the same shingle more than once in the first place.

The filter also has basic support for calculating weights for the shingles based on the weights of the tokens from the input stream, output shingle size, etc. See CalculateShingleWeight.

NOTE: This filter might not behave correctly if used with custom Attributes, i.e. Attributes other than the ones located in org.apache.lucene.analysis.tokenattributes.

Inheritance: Lucene.Net.Analysis.TokenStream
Project: synhershko/lucene.net

Public Properties

Property Type Description
DefaultSettingsCodec Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec
DefaultSpacerCharacter Char
IgnoringSinglePrefixOrSuffixShingleByDefault bool

Public Methods

Method Description
CalculateShingleWeight ( Lucene.Net.Analysis.Token shingleToken, List shingle, int currentPermutationStartOffset, List currentPermutationRows, List currentPermuationTokens ) : float

Evaluates the new shingle token weight.
for (shingle part token in shingle) weight += shingle part token weight * (1 / sqrt(all shingle part token weights summed))
This algorithm gives a slightly greater score for longer shingles and is rather penalising to great shingle token part weights.

IncrementToken ( ) : bool
Reset ( ) : void
ShingleMatrixFilter ( Matrix matrix, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, TokenSettingsCodec settingsCodec )

Creates a shingle filter based on a user-defined matrix. The filter will delete columns from the input matrix! You will not be able to reset the filter if you used this constructor. TODO: don't touch the matrix; use a bool, set the input stream to null or something, and keep track of where in the matrix we are.

ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize )

Creates a shingle filter using default settings. See ShingleMatrixFilter.DefaultSpacerCharacter, ShingleMatrixFilter.IgnoringSinglePrefixOrSuffixShingleByDefault and ShingleMatrixFilter.DefaultSettingsCodec.

ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter )

Creates a shingle filter using default settings. See IgnoringSinglePrefixOrSuffixShingleByDefault and DefaultSettingsCodec.

ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle )

Creates a shingle filter using the default TokenSettingsCodec. See DefaultSettingsCodec.

ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, TokenSettingsCodec settingsCodec )

Creates a shingle filter with ad hoc parameter settings.

UpdateToken ( Lucene.Net.Analysis.Token token, List shingle, int currentPermutationStartOffset, List currentPermutationRows, List currentPermuationTokens ) : void

Final touch of a shingle token before it is passed on to the consumer from method IncrementToken(). Calculates and sets type, flags, position increment, start/end offsets and weight.

Protected Methods

Method Description
Dispose ( bool disposing ) : void

Private Methods

Method Description
GetNextInputToken ( Lucene.Net.Analysis.Token token ) : Lucene.Net.Analysis.Token
GetNextToken ( Lucene.Net.Analysis.Token token ) : Lucene.Net.Analysis.Token
NextTokensPermutation ( ) : void

Gets the next permutation of row combinations, creates a list of all tokens in those rows along with an index from each token to the row it belongs to, and finally resets the current (next) shingle size and offset.

ProduceNextToken ( Lucene.Net.Analysis.Token reusableToken ) : Lucene.Net.Analysis.Token

This method exists in order to avoid recursive calls: with recursion, the complexity of even a fairly small matrix could easily require a gigabyte-sized stack per thread.

ReadColumn ( ) : bool

Loads one column from the token stream. When the last token has been read from the token stream, the column is marked as the last one (column.setLast(true)).

Method Details

CalculateShingleWeight() public method

Evaluates the new shingle token weight.
for (shingle part token in shingle) weight += shingle part token weight * (1 / sqrt(all shingle part token weights summed))
This algorithm gives a slightly greater score for longer shingles and is rather penalising to great shingle token part weights.
public CalculateShingleWeight ( Lucene.Net.Analysis.Token shingleToken, List shingle, int currentPermutationStartOffset, List currentPermutationRows, List currentPermuationTokens ) : float
shingleToken Lucene.Net.Analysis.Token token returned to consumer
shingle List tokens used to produce the shingle token.
currentPermutationStartOffset int start offset in parameters currentPermutationRows and currentPermutationTokens.
currentPermutationRows List an index of the matrix row in which each token in parameter currentPermutationTokens exists.
currentPermuationTokens List all tokens in the current row permutation of the matrix. A sub list (parameter offset, parameter shingle.size) equals parameter shingle.
return float
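The weighting rule described above can be sketched in isolation as follows. This is a minimal reading of the pseudocode, operating on plain numbers rather than tokens; `ShingleWeightSketch` is a hypothetical name, and the guard for a non-positive total is an added assumption to avoid dividing by zero, not something the description specifies.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.Net.
public static class ShingleWeightSketch
{
    // partWeights holds the weight of each token that makes up the shingle.
    public static float Calculate(IReadOnlyList<double> partWeights)
    {
        double total = 0d;
        foreach (double w in partWeights)
        {
            total += w;
        }

        // Assumption: treat an empty or zero-weight shingle as weight 0.
        if (total <= 0d) return 0f;

        // Each part contributes its weight scaled by 1 / sqrt(total).
        double factor = 1d / Math.Sqrt(total);
        double weight = 0d;
        foreach (double w in partWeights)
        {
            weight += w * factor;
        }
        return (float)weight;
    }
}
```

Because every part is scaled by the same factor, the result equals the square root of the summed part weights. Two parts of weight 1 give about 1.414 and four parts of weight 1 give 2, so longer shingles score somewhat higher, while a single part of weight 4 also gives only 2, which shows how large part weights are dampened.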

Dispose() protected method

protected Dispose ( bool disposing ) : void
disposing bool
return void

IncrementToken() public final method

public final IncrementToken ( ) : bool
return bool

Reset() public method

public Reset ( ) : void
return void

ShingleMatrixFilter() public method

Creates a shingle filter based on a user-defined matrix. The filter will delete columns from the input matrix! You will not be able to reset the filter if you used this constructor. TODO: don't touch the matrix; use a bool, set the input stream to null or something, and keep track of where in the matrix we are.
public ShingleMatrixFilter ( Matrix matrix, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, TokenSettingsCodec settingsCodec )
matrix Matrix the input basis for creating shingles. Does not need to contain any information until ShingleMatrixFilter.IncrementToken() is called for the first time.
minimumShingleSize int minimum number of tokens in any shingle.
maximumShingleSize int maximum number of tokens in any shingle.
spacerCharacter Char character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle bool if true, shingles that only contain permutations of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'.
settingsCodec Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec codec used to read input token weight and matrix positioning.

ShingleMatrixFilter() public method

Creates a shingle filter using default settings. See ShingleMatrixFilter.DefaultSpacerCharacter, ShingleMatrixFilter.IgnoringSinglePrefixOrSuffixShingleByDefault and ShingleMatrixFilter.DefaultSettingsCodec.
public ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize )
input Lucene.Net.Analysis.TokenStream stream from which to construct the matrix
minimumShingleSize int minimum number of tokens in any shingle.
maximumShingleSize int maximum number of tokens in any shingle.

ShingleMatrixFilter() public method

Creates a shingle filter using default settings. See IgnoringSinglePrefixOrSuffixShingleByDefault and DefaultSettingsCodec.
public ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter )
input Lucene.Net.Analysis.TokenStream stream from which to construct the matrix
minimumShingleSize int minimum number of tokens in any shingle.
maximumShingleSize int maximum number of tokens in any shingle.
spacerCharacter Char character to use between texts of the token parts in a shingle. null for none.

ShingleMatrixFilter() public method

Creates a shingle filter using the default TokenSettingsCodec. See DefaultSettingsCodec.
public ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle )
input Lucene.Net.Analysis.TokenStream stream from which to construct the matrix
minimumShingleSize int minimum number of tokens in any shingle.
maximumShingleSize int maximum number of tokens in any shingle.
spacerCharacter Char character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle bool if true, shingles that only contain permutations of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'.

ShingleMatrixFilter() public method

Creates a shingle filter with ad hoc parameter settings.
public ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, TokenSettingsCodec settingsCodec )
input Lucene.Net.Analysis.TokenStream stream from which to construct the matrix
minimumShingleSize int minimum number of tokens in any shingle.
maximumShingleSize int maximum number of tokens in any shingle.
spacerCharacter Char character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle bool if true, shingles that only contain permutations of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'.
settingsCodec Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec codec used to read input token weight and matrix positioning.

UpdateToken() public method

Final touch of a shingle token before it is passed on to the consumer from method IncrementToken(). Calculates and sets type, flags, position increment, start/end offsets and weight.
public UpdateToken ( Lucene.Net.Analysis.Token token, List shingle, int currentPermutationStartOffset, List currentPermutationRows, List currentPermuationTokens ) : void
token Lucene.Net.Analysis.Token Shingle Token
shingle List Tokens used to produce the shingle token.
currentPermutationStartOffset int Start offset in parameter currentPermutationTokens
currentPermutationRows List index to Matrix.Column.Row from the position of tokens in parameter currentPermutationTokens
currentPermuationTokens List tokens of the current permutation of rows in the matrix.
return void

Property Details

DefaultSettingsCodec public static property

public static Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec DefaultSettingsCodec
return Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec

DefaultSpacerCharacter public static property

public static Char DefaultSpacerCharacter
return Char

IgnoringSinglePrefixOrSuffixShingleByDefault public static property

public static bool IgnoringSinglePrefixOrSuffixShingleByDefault
return bool