C# Class Lucene.Net.Analysis.Shingle.ShingleMatrixFilter

A ShingleMatrixFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.

For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
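
As a rough illustration of the example sentence, the following sketch builds the same two-token shingles from a whitespace-split list. It is plain Python rather than the Lucene.Net API, and the helper name `shingles` is hypothetical.

```python
def shingles(tokens, size, spacer=" "):
    """Return every run of `size` consecutive tokens joined by `spacer`."""
    return [
        spacer.join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    ]


sentence = "please divide this sentence into shingles"
bigrams = shingles(sentence.split(), 2)
print(bigrams)
```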

Using a shingle filter at both index and query time can in some instances replace phrase queries, especially those with 0 slop.

Without a spacer character it can be used to handle composition and decomposition of words such as searching for "multi dimensional" instead of "multidimensional". It is a rather common human problem at query time in several languages, notably the northern Germanic branch.
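
A minimal sketch of the no-spacer case, assuming the query has already been split into the tokens "multi" and "dimensional". Joining adjacent tokens with an empty string yields the compound form, which can then match an index containing "multidimensional". The helper name is hypothetical and the code does not use the Lucene.Net API.

```python
def shingles_without_spacer(tokens, size):
    """Join each run of `size` consecutive tokens with no character in between."""
    return [
        "".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    ]


query_terms = shingles_without_spacer(["multi", "dimensional"], 2)
print(query_terms)
```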

Among other things, shingles are also known to help with spell checking, language detection and document clustering.

This filter is backed by a three-dimensional, column-oriented matrix. It creates permutations of the second dimension, the rows, and leaves the third, the z-axis, for multi-token synonyms.

To use this filter you need to define how the input stream tokens are positioned in the matrix. This is done using a ShingleMatrixFilter.TokenSettingsCodec. There are three simple implementations for demonstration purposes: ShingleMatrixFilter.OneDimensionalNonWeightedTokenSettingsCodec, ShingleMatrixFilter.TwoDimensionalNonWeightedSynonymTokenSettingsCodec and ShingleMatrixFilter.SimpleThreeDimensionalTokenSettingsCodec.
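
The idea of positioning tokens in the matrix can be sketched as follows. This is an illustration only, not the codec API: it assumes each input token arrives with one of three hypothetical labels (`new_column`, `new_row`, `same_row`) standing in for whatever positioning a TokenSettingsCodec reports, and it represents each token as a plain string.

```python
def build_matrix(tagged_tokens):
    """Arrange (text, position) pairs into columns -> rows -> tokens.

    The position labels are hypothetical names chosen for this sketch.
    """
    columns = []
    for text, position in tagged_tokens:
        if position == "new_column" or not columns:
            columns.append([[text]])          # start a column with one row
        elif position == "new_row":
            columns[-1].append([text])        # alternative row in the same column
        elif position == "same_row":
            columns[-1][-1].append(text)      # continue a multi-token row
        else:
            raise ValueError(f"unknown position: {position}")
    return columns


matrix = build_matrix([
    ("hello", "new_column"),
    ("greetings", "new_row"),
    ("and", "same_row"),
    ("salutations", "same_row"),
    ("world", "new_column"),
    ("earth", "new_row"),
    ("tellus", "new_row"),
])
print(matrix)
```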

Consider this token matrix:

Token[column][row][z-axis] { {{hello}, {greetings, and, salutations}}, {{world}, {earth}, {tellus}} };

It would produce the following 2-3 gram sized shingles:

"hello_world", "greetings_and", "greetings_and_salutations", "and_salutations", "and_salutations_world", "salutations_world", "hello_earth", "and_salutations_earth", "salutations_earth", "hello_tellus", "and_salutations_tellus", "salutations_tellus"

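The expansion of that token matrix can be reproduced with a small sketch. It is plain Python and not the filter's actual implementation. The name `matrix_shingles` is hypothetical, tokens are plain strings (the z-axis is ignored), and the iteration order, with the first column's rows varying fastest, is an assumption chosen so that the output matches the listed order. A set of already emitted shingles suppresses duplicates.

```python
from itertools import product


def matrix_shingles(columns, min_size, max_size, spacer="_"):
    """Emit unique shingles for every combination of one row per column."""
    seen = set()
    result = []
    # product() varies its last argument fastest, so reverse the columns
    # to make the first column's rows vary fastest (an assumption).
    for combo in product(*reversed(columns)):
        tokens = [tok for row in reversed(combo) for tok in row]
        for start in range(len(tokens)):
            for size in range(min_size, max_size + 1):
                if start + size > len(tokens):
                    break
                shingle = spacer.join(tokens[start:start + size])
                if shingle not in seen:       # skip shingles already returned
                    seen.add(shingle)
                    result.append(shingle)
    return result


columns = [
    [["hello"], ["greetings", "and", "salutations"]],
    [["world"], ["earth"], ["tellus"]],
]
produced = matrix_shingles(columns, 2, 3)
print(produced)
```
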
This implementation can be rather heap demanding if (maximum shingle size - minimum shingle size) is a great number and the stream contains many columns, or if each column contains a great number of rows.

The problem is that, in order to avoid producing duplicates, the filter needs to keep track of every shingle already produced and returned to the consumer.

There is a bit of resource management to handle this, but it would of course be much better if the filter were written so that it never created the same shingle more than once in the first place.

The filter also has basic support for calculating weights for the shingles based on the weights of the tokens from the input stream, output shingle size, etc. See CalculateShingleWeight.

NOTE: This filter might not behave correctly if used with custom Attributes, i.e. Attributes other than the ones located in org.apache.lucene.analysis.tokenattributes.

Inheritance: Lucene.Net.Analysis.TokenStream

Public properties

Property Type Description
DefaultSettingsCodec Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec
DefaultSpacerCharacter Char
IgnoringSinglePrefixOrSuffixShingleByDefault bool

Public methods

Method Description
CalculateShingleWeight ( Lucene.Net.Analysis.Token shingleToken, List shingle, int currentPermutationStartOffset, List currentPermutationRows, List currentPermuationTokens ) : float

Evaluates the new shingle token weight: for each part token in the shingle, weight += part token weight * (1 / sqrt(sum of all part token weights)). This algorithm gives a slightly greater score to longer shingles and rather penalises large part token weights.

IncrementToken ( ) : bool
Reset ( ) : void
ShingleMatrixFilter ( Matrix matrix, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, TokenSettingsCodec settingsCodec ) : System

Creates a shingle filter based on a user-defined matrix. The filter will delete columns from the input matrix, so you will not be able to reset the filter if you used this constructor. TODO: don't touch the matrix; use a bool, set the input stream to null or something, and keep track of where in the matrix we are.

ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize ) : System

Creates a shingle filter using default settings. See ShingleMatrixFilter.DefaultSpacerCharacter, ShingleMatrixFilter.IgnoringSinglePrefixOrSuffixShingleByDefault, and ShingleMatrixFilter.DefaultSettingsCodec

ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter ) : System

Creates a shingle filter using default settings. See IgnoringSinglePrefixOrSuffixShingleByDefault, and DefaultSettingsCodec

ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle ) : System

Creates a shingle filter using the default TokenSettingsCodec. See DefaultSettingsCodec

ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, TokenSettingsCodec settingsCodec ) : System

Creates a shingle filter with ad hoc parameter settings.

UpdateToken ( Lucene.Net.Analysis.Token token, List shingle, int currentPermutationStartOffset, List currentPermutationRows, List currentPermuationTokens ) : void

Final touch of a shingle token before it is passed on to the consumer from method IncrementToken(). Calculates and sets type, flags, position increment, start/end offsets and weight.

Protected methods

Method Description
Dispose ( bool disposing ) : void

Private methods

Method Description
GetNextInputToken ( Lucene.Net.Analysis.Token token ) : Lucene.Net.Analysis.Token
GetNextToken ( Lucene.Net.Analysis.Token token ) : Lucene.Net.Analysis.Token
NextTokensPermutation ( ) : void

Gets the next permutation of row combinations, then creates a list of all tokens in those rows and an index from each such token to the row it belongs to. Finally, it resets the current (next) shingle size and offset.

ProduceNextToken ( Lucene.Net.Analysis.Token reusableToken ) : Lucene.Net.Analysis.Token

This method exists to avoid recursive calls, since with recursion even a fairly small matrix could easily require a gigabyte-sized stack per thread.

ReadColumn ( ) : bool

Loads one column from the token stream. When the last token has been read from the token stream, the column is marked as last (column.setLast(true)).

Method details

CalculateShingleWeight() public method

Evaluates the new shingle token weight: for each part token in the shingle, weight += part token weight * (1 / sqrt(sum of all part token weights)). This algorithm gives a slightly greater score to longer shingles and rather penalises large part token weights.
public CalculateShingleWeight ( Lucene.Net.Analysis.Token shingleToken, List shingle, int currentPermutationStartOffset, List currentPermutationRows, List currentPermuationTokens ) : float
shingleToken Lucene.Net.Analysis.Token token returned to consumer
shingle List the tokens used to produce the shingle token.
currentPermutationStartOffset int start offset in parameters currentPermutationRows and currentPermutationTokens.
currentPermutationRows List an index to the matrix row in which each token in parameter currentPermutationTokens exists.
currentPermuationTokens List all tokens in the current row permutation of the matrix. The sub list (parameter offset, parameter shingle.size) equals parameter shingle.
Returns float
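
The weight formula described for CalculateShingleWeight can be sketched as follows. This is a standalone Python illustration of the documented arithmetic, not the library's code. The function name and the guard for a non-positive total are assumptions. Note that for a positive total the expression simplifies to sqrt(total), which is why longer shingles score only slightly higher and large part weights are dampened.

```python
import math


def calculate_shingle_weight(part_weights):
    """Sum each part weight scaled by 1 / sqrt(sum of all part weights)."""
    total = sum(part_weights)
    if total <= 0:
        # Assumption for this sketch: avoid dividing by zero.
        return 0.0
    factor = 1.0 / math.sqrt(total)
    return sum(weight * factor for weight in part_weights)


two_parts = calculate_shingle_weight([1.0, 1.0])
three_parts = calculate_shingle_weight([1.0, 1.0, 1.0])
heavy_part = calculate_shingle_weight([4.0])
print(two_parts, three_parts, heavy_part)
```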

Dispose() protected method

protected Dispose ( bool disposing ) : void
disposing bool
Returns void

IncrementToken() public sealed method

public sealed IncrementToken ( ) : bool
Returns bool

Reset() public method

public Reset ( ) : void
Returns void

ShingleMatrixFilter() public method

Creates a shingle filter based on a user-defined matrix. The filter will delete columns from the input matrix, so you will not be able to reset the filter if you used this constructor. TODO: don't touch the matrix; use a bool, set the input stream to null or something, and keep track of where in the matrix we are.
public ShingleMatrixFilter ( Matrix matrix, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, TokenSettingsCodec settingsCodec ) : System
matrix Matrix the input base for creating shingles. Does not need to contain any information until ShingleMatrixFilter.IncrementToken() is called the first time.
minimumShingleSize int minimum number of tokens in any shingle.
maximumShingleSize int maximum number of tokens in any shingle.
spacerCharacter Char character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle bool if true, shingles that only contain permutations of the first or the last column will not be produced as shingles. Useful when adding boundary marker tokens such as '^' and '$'.
settingsCodec Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec codec used to read input token weight and matrix positioning.
Returns System

ShingleMatrixFilter() public method

Creates a shingle filter using default settings. See ShingleMatrixFilter.DefaultSpacerCharacter, ShingleMatrixFilter.IgnoringSinglePrefixOrSuffixShingleByDefault, and ShingleMatrixFilter.DefaultSettingsCodec
public ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize ) : System
input Lucene.Net.Analysis.TokenStream stream from which to construct the matrix
minimumShingleSize int minimum number of tokens in any shingle.
maximumShingleSize int maximum number of tokens in any shingle.
Returns System

ShingleMatrixFilter() public method

Creates a shingle filter using default settings. See IgnoringSinglePrefixOrSuffixShingleByDefault, and DefaultSettingsCodec
public ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter ) : System
input Lucene.Net.Analysis.TokenStream stream from which to construct the matrix
minimumShingleSize int minimum number of tokens in any shingle.
maximumShingleSize int maximum number of tokens in any shingle.
spacerCharacter Char character to use between texts of the token parts in a shingle. null for none.
Returns System

ShingleMatrixFilter() public method

Creates a shingle filter using the default TokenSettingsCodec. See DefaultSettingsCodec
public ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle ) : System
input Lucene.Net.Analysis.TokenStream stream from which to construct the matrix
minimumShingleSize int minimum number of tokens in any shingle.
maximumShingleSize int maximum number of tokens in any shingle.
spacerCharacter Char character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle bool if true, shingles that only contain permutations of the first or the last column will not be produced as shingles. Useful when adding boundary marker tokens such as '^' and '$'.
Returns System

ShingleMatrixFilter() public method

Creates a shingle filter with ad hoc parameter settings.
public ShingleMatrixFilter ( TokenStream input, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, TokenSettingsCodec settingsCodec ) : System
input Lucene.Net.Analysis.TokenStream stream from which to construct the matrix
minimumShingleSize int minimum number of tokens in any shingle.
maximumShingleSize int maximum number of tokens in any shingle.
spacerCharacter Char character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle bool if true, shingles that only contain permutations of the first or the last column will not be produced as shingles. Useful when adding boundary marker tokens such as '^' and '$'.
settingsCodec Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec codec used to read input token weight and matrix positioning.
Returns System

UpdateToken() public method

Final touch of a shingle token before it is passed on to the consumer from method IncrementToken(). Calculates and sets type, flags, position increment, start/end offsets and weight.
public UpdateToken ( Lucene.Net.Analysis.Token token, List shingle, int currentPermutationStartOffset, List currentPermutationRows, List currentPermuationTokens ) : void
token Lucene.Net.Analysis.Token the shingle token.
shingle List tokens used to produce the shingle token.
currentPermutationStartOffset int start offset in parameter currentPermutationTokens.
currentPermutationRows List index to Matrix.Column.Row from the position of tokens in parameter currentPermutationTokens.
currentPermuationTokens List tokens of the current permutation of rows in the matrix.
Returns void

Property details

DefaultSettingsCodec public static property

public static Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec DefaultSettingsCodec
Returns Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec

DefaultSpacerCharacter public static property

public static Char DefaultSpacerCharacter
Returns Char

IgnoringSinglePrefixOrSuffixShingleByDefault public static property

public static bool IgnoringSinglePrefixOrSuffixShingleByDefault
Returns bool