Property | Type | Description |
---|---|---|
DefaultSettingsCodec | | |
DefaultSpacerCharacter | Char | |
IgnoringSinglePrefixOrSuffixShingleByDefault | bool | |
Method | Description | |
---|---|---|
CalculateShingleWeight ( Lucene.Net.Analysis.Token shingleToken, List |
Evaluates the new shingle token weight. For each part token in the shingle: weight += part token weight * (1 / sqrt(sum of all part token weights in the shingle)). This algorithm gives a slightly greater score to longer shingles and penalises large part token weights.
|
|
IncrementToken ( ) : bool | ||
Reset ( ) : void | ||
ShingleMatrixFilter ( Matrix matrix, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, |
Creates a shingle filter based on a user-defined matrix. The filter will delete columns from the input matrix! You will not be able to reset the filter if you used this constructor. TODO: don't touch the matrix; use a bool, set the input stream to null or something similar, and keep track of the current position in the matrix.
|
|
ShingleMatrixFilter ( |
Creates a shingle filter using default settings. See ShingleMatrixFilter.DefaultSpacerCharacter, ShingleMatrixFilter.IgnoringSinglePrefixOrSuffixShingleByDefault, and ShingleMatrixFilter.DefaultSettingsCodec
|
|
ShingleMatrixFilter ( |
Creates a shingle filter using default settings. See IgnoringSinglePrefixOrSuffixShingleByDefault and DefaultSettingsCodec.
|
|
ShingleMatrixFilter ( |
Creates a shingle filter using the default TokenSettingsCodec. See DefaultSettingsCodec
|
|
ShingleMatrixFilter ( |
Creates a shingle filter with ad hoc parameter settings.
|
|
UpdateToken ( Lucene.Net.Analysis.Token token, List |
Final touch of a shingle token before it is passed on to the consumer from method IncrementToken(). Calculates and sets type, flags, position increment, start/end offsets and weight.
|
Method | Description | |
---|---|---|
Dispose ( bool disposing ) : void |
Method | Description | |
---|---|---|
GetNextInputToken ( Lucene.Net.Analysis.Token token ) : Lucene.Net.Analysis.Token | ||
GetNextToken ( Lucene.Net.Analysis.Token token ) : Lucene.Net.Analysis.Token | ||
NextTokensPermutation ( ) : void |
Gets the next permutation of row combinations, then creates a list of all tokens in the row and an index from each such token to the row it exists in. Finally, resets the current (next) shingle size and offset.
|
|
ProduceNextToken ( Lucene.Net.Analysis.Token reusableToken ) : Lucene.Net.Analysis.Token |
This method exists to avoid recursive calls: with recursion, even a fairly small matrix could easily require a gigabyte-sized stack per thread.
|
|
ReadColumn ( ) : bool |
Loads one column from the token stream. When the last token is read from the token stream, the column is marked as the last one (column.setLast(true)).
|
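
The row permutation described for NextTokensPermutation can be pictured with a small standalone sketch. It is not the Lucene.Net implementation: the nested-list matrix layout and the name `row_permutations` are assumptions made for illustration. Each permutation picks one row per column, flattens the tokens of the chosen rows into one list, and keeps a parallel list that records which row each token came from.

```python
from itertools import product


def row_permutations(matrix):
    """Yield (tokens, token_rows) for every choice of one row per column.

    matrix is a hypothetical stand-in for the filter's Matrix: a list of
    columns, each column a list of rows, each row a list of token strings.
    """
    for rows in product(*matrix):
        tokens = []      # all tokens of this permutation, in column order
        token_rows = []  # token_rows[i] is the row that tokens[i] belongs to
        for row in rows:
            for tok in row:
                tokens.append(tok)
                token_rows.append(row)
        yield tokens, token_rows


# Two columns with two alternative rows each give 2 * 2 permutations.
matrix = [
    [["hello"], ["greetings", "and", "salutations"]],
    [["world"], ["earth"]],
]
perms = list(row_permutations(matrix))
```

Because `itertools.product` is a generator, the permutations are produced one at a time in a loop rather than through recursive calls, which is in the spirit of the note on ProduceNextToken.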
public CalculateShingleWeight ( Lucene.Net.Analysis.Token shingleToken, List |
||
shingleToken | Lucene.Net.Analysis.Token | token returned to consumer |
shingle | List | the tokens used to produce the shingle token. |
currentPermutationStartOffset | int | start offset in parameter currentPermutationRows and currentPermutationTokens. |
currentPermutationRows | List | an index to the matrix row in which each token in parameter currentPermutationTokens exists. |
currentPermuationTokens | List | all tokens in the current row permutation of the matrix. A sublist (parameter offset, parameter shingle.size) equals parameter shingle. |
return | float |
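
The weight formula given in the CalculateShingleWeight description can be checked with a short sketch. This follows only the formula as documented here, not the library source; the function name `shingle_weight` and the guard for a non-positive sum are assumptions.

```python
import math


def shingle_weight(part_weights):
    """Sum each part weight scaled by 1 / sqrt(sum of all part weights)."""
    total = sum(part_weights)
    if total <= 0:
        return 0.0  # assumption: avoid dividing by zero for empty input
    factor = 1.0 / math.sqrt(total)
    weight = 0.0
    for w in part_weights:
        weight += w * factor
    return weight


two_parts = shingle_weight([1.0, 1.0])
three_parts = shingle_weight([1.0, 1.0, 1.0])
heavy_part = shingle_weight([4.0])  # 4.0 * (1 / sqrt(4.0)) = 2.0
```

Written this way, the result equals the square root of the summed part weights. That is why a longer shingle scores only slightly higher than a shorter one, and why a single large part weight (4.0 above) is damped to 2.0.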
public ShingleMatrixFilter ( Matrix matrix, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, |
||
matrix | Matrix | the input used for creating shingles. Does not need to contain any information until ShingleMatrixFilter.IncrementToken() is called the first time. |
minimumShingleSize | int | minimum number of tokens in any shingle. |
maximumShingleSize | int | maximum number of tokens in any shingle. |
spacerCharacter | Char | character to use between texts of the token parts in a shingle. null for none. |
ignoringSinglePrefixOrSuffixShingle | bool | if true, shingles that contain only a permutation of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'. |
settingsCodec | | codec used to read input token weight and matrix positioning. |
return | System |
public ShingleMatrixFilter ( |
||
input | | stream from which to construct the matrix |
minimumShingleSize | int | minimum number of tokens in any shingle. |
maximumShingleSize | int | maximum number of tokens in any shingle. |
return | System |
public ShingleMatrixFilter ( |
||
input | | stream from which to construct the matrix |
minimumShingleSize | int | minimum number of tokens in any shingle. |
maximumShingleSize | int | maximum number of tokens in any shingle. |
spacerCharacter | Char | character to use between texts of the token parts in a shingle. null for none. |
return | System |
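
To make the roles of minimumShingleSize, maximumShingleSize and spacerCharacter concrete, here is a minimal sketch over a flat token list (a matrix with a single row per column). The function name `shingles`, the `"_"` default spacer, and the output order are assumptions for illustration and may differ from what the filter actually emits.

```python
def shingles(tokens, min_size, max_size, spacer="_"):
    """Join every run of min_size..max_size adjacent tokens with the spacer."""
    sep = "" if spacer is None else spacer  # None means no spacer at all
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # the run would extend past the last token
            out.append(sep.join(tokens[start:start + size]))
    return out


with_spacer = shingles(["please", "divide", "this"], 2, 3)
without_spacer = shingles(["please", "divide", "this"], 2, 2, spacer=None)
```

With sizes 2 to 3 the three tokens yield two bigrams and one trigram; passing `None` as the spacer concatenates the token texts directly.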
public ShingleMatrixFilter ( |
||
input | | stream from which to construct the matrix |
minimumShingleSize | int | minimum number of tokens in any shingle. |
maximumShingleSize | int | maximum number of tokens in any shingle. |
spacerCharacter | Char | character to use between texts of the token parts in a shingle. null for none. |
ignoringSinglePrefixOrSuffixShingle | bool | if true, shingles that contain only a permutation of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'. |
return | System |
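
The effect of ignoringSinglePrefixOrSuffixShingle with boundary markers can be sketched as follows. This reflects one reading of the parameter description (a shingle made up solely of the first or the last column is dropped) on a flat token list; the function name, the `"_"` spacer and the output order are assumptions, not the library's behaviour.

```python
def shingles_with_boundaries(tokens, min_size, max_size,
                             ignore_single_boundary=True, spacer="_"):
    """Build shingles, optionally skipping a lone first or last token."""
    last = len(tokens) - 1
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(tokens):
                break  # the run would extend past the last token
            # A size-1 shingle at either end is only the boundary marker.
            if ignore_single_boundary and size == 1 and start in (0, last):
                continue
            out.append(spacer.join(tokens[start:end]))
    return out


marked = ["^", "hello", "world", "$"]
ignoring = shingles_with_boundaries(marked, 1, 2)
keeping = shingles_with_boundaries(marked, 1, 2, ignore_single_boundary=False)
```

In this sketch the markers still appear inside longer shingles such as `^_hello`, so the position of a term at the start or end of the input remains visible, while the bare `^` and `$` tokens are not emitted on their own.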
public ShingleMatrixFilter ( |
||
input | | stream from which to construct the matrix |
minimumShingleSize | int | minimum number of tokens in any shingle. |
maximumShingleSize | int | maximum number of tokens in any shingle. |
spacerCharacter | Char | character to use between texts of the token parts in a shingle. null for none. |
ignoringSinglePrefixOrSuffixShingle | bool | if true, shingles that contain only a permutation of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'. |
settingsCodec | | codec used to read input token weight and matrix positioning. |
return | System |
public UpdateToken ( Lucene.Net.Analysis.Token token, List |
||
token | Lucene.Net.Analysis.Token | Shingle Token |
shingle | List | Tokens used to produce the shingle token. |
currentPermutationStartOffset | int | Start offset in parameter currentPermutationTokens |
currentPermutationRows | List | index to Matrix.Column.Row from the position of tokens in parameter currentPermutationTokens |
currentPermuationTokens | List | tokens of the current permutation of rows in the matrix. |
return | void |
public static Lucene.Net.Analysis.Shingle.Codec.TokenSettingsCodec DefaultSettingsCodec | ||
return |
public static Char DefaultSpacerCharacter | ||
return | Char |