C# Class Lucene.Net.Analysis.Util.SegmentingTokenizerBase

Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.

This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.

This API is experimental (@lucene.experimental).

Inheritance: Tokenizer
Project: apache/lucenenet

Protected Properties

Property Type Description
buffer char[]
offset int

Public Methods

Method Description
End ( ) : void
IncrementToken ( ) : bool
Reset ( ) : void

Protected Methods

Method Description
IncrementWord ( ) : bool

Returns true if another word is available.

IsSafeEnd ( char ch ) : bool

Returns true if ch marks an unambiguous break position for sentence tokenization.

SegmentingTokenizerBase ( AttributeFactory factory, TextReader reader, BreakIterator iterator )

Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.

SegmentingTokenizerBase ( TextReader reader, BreakIterator iterator )

Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.

Never share a BreakIterator across different TokenStreams; always pass a newly created or cloned one to this constructor.

SetNextSentence ( int sentenceStart, int sentenceEnd ) : void

Provides the next input sentence for analysis.

Private Methods

Method Description
FindSafeEnd ( ) : int

Returns the last unambiguous break position in the text.

IncrementSentence ( ) : bool

Returns true if there is a token from the buffer, or false if it is exhausted.

Read ( TextReader input, char[] buffer, int offset, int length ) : int

Behaves like commons-io's readFully, but works correctly when offset != 0.

Refill ( ) : void

Refills the buffer, accumulating the offset and setting usableLength to the last unambiguous break position.
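
The private helpers above (Read, FindSafeEnd, Refill) can be illustrated with a standalone sketch. It is not the Lucene.Net source: the class name RefillSketch, the method name ReadFully, the bufferSize parameter, and the isSafeEnd delegate are hypothetical names chosen for illustration, and the policy of using the whole buffer when the reader is exhausted or no break is found is an assumption.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the buffering logic described above;
// it is not the Lucene.Net source.
public sealed class RefillSketch
{
    private readonly TextReader input;
    private readonly Func<char, bool> isSafeEnd; // assumed break predicate

    public char[] Chars { get; }
    public int Length { get; private set; }       // chars currently held
    public int UsableLength { get; private set; } // chars up to the last safe break
    public int Offset { get; private set; }       // chars consumed by earlier buffers

    public RefillSketch(TextReader input, int bufferSize, Func<char, bool> isSafeEnd)
    {
        this.input = input;
        this.isSafeEnd = isSafeEnd;
        Chars = new char[bufferSize];
    }

    // Keeps reading until 'length' chars are stored or the reader is exhausted.
    public static int ReadFully(TextReader input, char[] buffer, int offset, int length)
    {
        int remaining = length;
        while (remaining > 0)
        {
            int location = length - remaining;
            // The write position includes the caller's offset.
            int count = input.Read(buffer, offset + location, remaining);
            if (count <= 0) break; // 0 signals end of input
            remaining -= count;
        }
        return length - remaining;
    }

    // Index just past the last safe break character, or -1 if there is none.
    private int FindSafeEnd()
    {
        for (int i = Length - 1; i >= 0; i--)
        {
            if (isSafeEnd(Chars[i])) return i + 1;
        }
        return -1;
    }

    public void Refill()
    {
        Offset += UsableLength;               // account for text already consumed
        int leftover = Length - UsableLength; // unconsumed tail of the old buffer
        Array.Copy(Chars, UsableLength, Chars, 0, leftover);
        int requested = Chars.Length - leftover;
        int returned = ReadFully(input, Chars, leftover, requested);
        Length = leftover + returned;
        if (returned < requested)
        {
            UsableLength = Length; // reader exhausted: use everything
        }
        else
        {
            int safeEnd = FindSafeEnd();
            UsableLength = safeEnd < 0 ? Length : safeEnd; // no break found: use everything
        }
    }
}
```

In this sketch, text after the last safe break is carried over to the front of the buffer on the next Refill, so a sentence is not cut in two at a buffer boundary unless no safe break exists in the buffer.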

Method Details

End() public method

public End ( ) : void
return void

IncrementToken() public sealed method

public sealed IncrementToken ( ) : bool
return bool

IncrementWord() protected abstract method

Returns true if another word is available.
protected abstract IncrementWord ( ) : bool
return bool

IsSafeEnd() protected method

Returns true if ch marks an unambiguous break position for sentence tokenization.
protected IsSafeEnd ( char ch ) : bool
ch char
return bool
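
This page does not list which characters count as unambiguous break positions. As a minimal sketch, assuming they are the hard line and paragraph terminators, such a predicate could look like the following. The class name SafeEndSketch is hypothetical, and the character set is an assumption rather than a statement about the library.

```csharp
// Hypothetical illustration of a "safe end" predicate; the character set
// below is an assumption, not taken from this page.
public static class SafeEndSketch
{
    public static bool IsSafeEnd(char ch)
    {
        switch (ch)
        {
            case '\u000D': // carriage return
            case '\u000A': // line feed
            case '\u0085': // next line (NEL)
            case '\u2028': // line separator
            case '\u2029': // paragraph separator
                return true;
            default:
                return false;
        }
    }
}
```

Under this assumption, a period is not a safe end, because it can also appear inside abbreviations or numbers; deciding that is left to the BreakIterator.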

Reset() public method

public Reset ( ) : void
return void

SegmentingTokenizerBase() protected method

Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.
protected SegmentingTokenizerBase ( AttributeFactory factory, TextReader reader, BreakIterator iterator )
factory AttributeFactory
reader System.IO.TextReader
iterator BreakIterator

SegmentingTokenizerBase() protected method

Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.

Never share a BreakIterator across different TokenStreams; always pass a newly created or cloned one to this constructor.

protected SegmentingTokenizerBase ( TextReader reader, BreakIterator iterator )
reader System.IO.TextReader
iterator BreakIterator

SetNextSentence() protected abstract method

Provides the next input sentence for analysis.
protected abstract SetNextSentence ( int sentenceStart, int sentenceEnd ) : void
sentenceStart int
sentenceEnd int
return void
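
The division of labour between SetNextSentence and IncrementWord can be sketched without the library. The class below is a hypothetical stand-in, not a Lucene.Net subclass: it splits each sentence on whitespace, it assumes sentenceEnd is exclusive, and the names WhitespaceWordsSketch, WordStart, WordLength, and WordsOf are invented for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical analogue of a subclass: receives a sentence range, then
// hands out the words inside it one at a time.
public sealed class WhitespaceWordsSketch
{
    private readonly char[] buffer;
    private int pos; // next position to examine
    private int end; // end of the current sentence (assumed exclusive)

    public int WordStart { get; private set; }
    public int WordLength { get; private set; }

    public WhitespaceWordsSketch(char[] buffer)
    {
        this.buffer = buffer;
    }

    // Analogue of SetNextSentence: remember the range to decompose.
    public void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        pos = sentenceStart;
        end = sentenceEnd;
    }

    // Analogue of IncrementWord: true if another word was found.
    public bool IncrementWord()
    {
        while (pos < end && char.IsWhiteSpace(buffer[pos])) pos++;
        if (pos >= end) return false;
        WordStart = pos;
        while (pos < end && !char.IsWhiteSpace(buffer[pos])) pos++;
        WordLength = pos - WordStart;
        return true;
    }

    // Convenience driver: collects the words of one sentence range.
    public static List<string> WordsOf(string text, int sentenceStart, int sentenceEnd)
    {
        char[] chars = text.ToCharArray();
        var segmenter = new WhitespaceWordsSketch(chars);
        segmenter.SetNextSentence(sentenceStart, sentenceEnd);
        var words = new List<string>();
        while (segmenter.IncrementWord())
        {
            words.Add(new string(chars, segmenter.WordStart, segmenter.WordLength));
        }
        return words;
    }
}
```

A real subclass would typically populate token attributes inside IncrementWord instead of exposing WordStart and WordLength; the attribute handling is omitted here because this page does not describe it.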

Property Details

buffer protected property

protected char[] buffer
return char[]

offset protected property

Accumulated offset of the previous buffers for this reader, used when setting the offset attribute (offsetAtt).
protected int offset
return int
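
Because buffer positions restart at zero after each refill, a position inside the current buffer has to be shifted by the accumulated offset to give a position in the overall input. A minimal sketch of that arithmetic follows; the helper name ToAbsolute and its parameters are hypothetical, and any further offset correction a tokenizer might apply is left out.

```csharp
// Hypothetical helper: converts a buffer-relative word span into
// start/end offsets relative to the whole input.
public static class OffsetSketch
{
    public static (int Start, int End) ToAbsolute(int accumulatedOffset, int wordStart, int wordLength)
    {
        int start = accumulatedOffset + wordStart; // shift by chars consumed earlier
        return (start, start + wordLength);        // end offset is exclusive
    }
}
```

For example, if earlier buffers consumed 1024 characters and a word of length 3 starts at index 5 of the current buffer, the sketch yields start 1029 and end 1032.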