C# Class Lucene.Net.Analysis.Util.SegmentingTokenizerBase

Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.

This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing. This API is experimental (@lucene.experimental).

Inheritance: Tokenizer

Protected Properties

Property Type Description
buffer char[]
offset int

Public Methods

Method Description
End ( ) : void
IncrementToken ( ) : bool
Reset ( ) : void

Protected Methods

Method Description
IncrementWord ( ) : bool

Returns true if another word is available

IsSafeEnd ( char ch ) : bool

For sentence tokenization, these are the unambiguous break positions.

SegmentingTokenizerBase ( AttributeFactory factory, TextReader reader, BreakIterator iterator )

Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.

SegmentingTokenizerBase ( TextReader reader, BreakIterator iterator )

Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.

Note that you should never share BreakIterators across different TokenStreams; instead, a newly created or cloned one should always be provided to this constructor.

SetNextSentence ( int sentenceStart, int sentenceEnd ) : void

Provides the next input sentence for analysis

Private Methods

Method Description
FindSafeEnd ( ) : int

Returns the last unambiguous break position in the text.

IncrementSentence ( ) : bool

Returns true if there is a token from the buffer, or false if it is exhausted.

Read ( TextReader input, char[] buffer, int offset, int length ) : int

Equivalent to commons-io's readFully, but without bugs when offset != 0.

Refill ( ) : void

Refill the buffer, accumulating the offset and setting usableLength to the last unambiguous break position
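
The interplay of Read, FindSafeEnd, and Refill described above can be sketched in standalone C#. This is an illustration only, not the Lucene.Net source: the class name RefillSketch, its public fields, and the Usable helper are hypothetical, and the set of safe-end characters (line and paragraph terminators) is an assumption. It does not depend on Lucene.Net or ICU4NET.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the buffering part of the tokenizer.
public sealed class RefillSketch
{
    private readonly TextReader input;
    private readonly char[] buffer;
    public int Length;        // chars currently held in the buffer
    public int UsableLength;  // chars up to the last unambiguous break
    public int Offset;        // accumulated offset of previously consumed chars

    public RefillSketch(TextReader input, int bufferSize)
    {
        this.input = input;
        this.buffer = new char[bufferSize];
    }

    // The part of the buffer that is safe to hand to sentence segmentation.
    public string Usable => new string(buffer, 0, UsableLength);

    // Assumed set of unambiguous break characters.
    private static bool IsSafeEnd(char ch) =>
        ch == '\r' || ch == '\n' || ch == '\u0085' || ch == '\u2028' || ch == '\u2029';

    // Keeps reading until 'length' chars are stored or the reader is exhausted.
    private static int ReadFully(TextReader input, char[] buffer, int offset, int length)
    {
        int remaining = length;
        while (remaining > 0)
        {
            int count = input.Read(buffer, offset + (length - remaining), remaining);
            if (count <= 0) break; // TextReader.Read returns 0 at end of input
            remaining -= count;
        }
        return length - remaining;
    }

    // Index just past the last safe-end character, or -1 if there is none.
    private int FindSafeEnd()
    {
        for (int i = Length - 1; i >= 0; i--)
            if (IsSafeEnd(buffer[i])) return i + 1;
        return -1;
    }

    public void Refill()
    {
        Offset += UsableLength;                  // account for what was consumed
        int leftover = Length - UsableLength;    // tail after the last safe break
        Array.Copy(buffer, UsableLength, buffer, 0, leftover);
        int requested = buffer.Length - leftover;
        int returned = ReadFully(input, buffer, leftover, requested);
        Length = returned + leftover;
        if (returned < requested)
        {
            UsableLength = Length;               // reader exhausted: use everything
        }
        else
        {
            UsableLength = FindSafeEnd();
            if (UsableLength < 0) UsableLength = Length; // no safe break found
        }
    }
}
```

With a 4-char buffer over the input "ab\ncd\nef", successive Refill calls in this sketch expose "ab\n", "cd\n", and "ef", while Offset advances by the length of each consumed chunk.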

Method Details

End() public method

public End ( ) : void
Returns void

IncrementToken() public sealed method

public sealed IncrementToken ( ) : bool
Returns bool

IncrementWord() protected abstract method

Returns true if another word is available.
protected abstract IncrementWord ( ) : bool
Returns bool

IsSafeEnd() protected method

For sentence tokenization, these are the unambiguous break positions.
protected IsSafeEnd ( char ch ) : bool
ch char
Returns bool
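
The page does not list which characters count as unambiguous break positions. A minimal sketch, assuming they are the Unicode line and paragraph terminators, might look like the following; the class name SafeEndSketch is hypothetical and the character set is an assumption, not something this page confirms.

```csharp
using System;

// Hypothetical standalone illustration; not part of Lucene.Net.
public static class SafeEndSketch
{
    // Returns true for characters assumed to end a sentence unambiguously.
    public static bool IsSafeEnd(char ch)
    {
        switch (ch)
        {
            case '\u000D': // carriage return
            case '\u000A': // line feed
            case '\u0085': // next line (NEL)
            case '\u2028': // line separator
            case '\u2029': // paragraph separator
                return true;
            default:
                return false;
        }
    }
}
```

A subclass that needs a different notion of a safe break would presumably replace this check with its own character set.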

Reset() public method

public Reset ( ) : void
Returns void

SegmentingTokenizerBase() protected constructor

Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.
protected SegmentingTokenizerBase ( AttributeFactory factory, TextReader reader, BreakIterator iterator )
factory AttributeFactory
reader System.IO.TextReader
iterator ICU4NET.BreakIterator

SegmentingTokenizerBase() protected constructor

Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.

Note that you should never share BreakIterators across different TokenStreams; instead, a newly created or cloned one should always be provided to this constructor.

protected SegmentingTokenizerBase ( TextReader reader, BreakIterator iterator )
reader System.IO.TextReader
iterator ICU4NET.BreakIterator

SetNextSentence() protected abstract method

Provides the next input sentence for analysis.
protected abstract SetNextSentence ( int sentenceStart, int sentenceEnd ) : void
sentenceStart int
sentenceEnd int
Returns void
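
The contract between SetNextSentence and IncrementWord can be illustrated without Lucene.Net: a driver hands over the bounds of one sentence, then pulls words until none remain. In this rough sketch the class WordWalkerSketch, its Current property, and the Tokenize driver are all hypothetical; splitting sentences at '.' and words at non-alphanumeric characters are simplifying assumptions standing in for a real BreakIterator.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical standalone illustration of the two-level iteration.
public sealed class WordWalkerSketch
{
    private readonly char[] buffer;
    private int pos, end;
    public string Current { get; private set; }

    public WordWalkerSketch(char[] buffer) { this.buffer = buffer; }

    // Counterpart of SetNextSentence: remember the bounds of the sentence.
    public void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        pos = sentenceStart;
        end = sentenceEnd;
    }

    // Counterpart of IncrementWord: advance to the next run of letters or digits.
    public bool IncrementWord()
    {
        while (pos < end && !char.IsLetterOrDigit(buffer[pos])) pos++;
        if (pos >= end) return false;
        int start = pos;
        while (pos < end && char.IsLetterOrDigit(buffer[pos])) pos++;
        Current = new string(buffer, start, pos - start);
        return true;
    }

    // Stand-in for the base class loop: one sentence at a time, then its words.
    public static List<string> Tokenize(string text)
    {
        char[] chars = text.ToCharArray();
        var walker = new WordWalkerSketch(chars);
        var words = new List<string>();
        int sentenceStart = 0;
        for (int i = 0; i < chars.Length; i++)
        {
            bool boundary = chars[i] == '.' || i == chars.Length - 1;
            if (!boundary) continue;
            walker.SetNextSentence(sentenceStart, i + 1);
            while (walker.IncrementWord()) words.Add(walker.Current);
            sentenceStart = i + 1;
        }
        return words;
    }
}
```

Calling WordWalkerSketch.Tokenize("Hi there. Bye now") yields the four words Hi, there, Bye, and now, with the second sentence processed only after the first is exhausted.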

Property Details

buffer protected property

protected char[] buffer
Returns char[]

offset protected property

Accumulated offset of previous buffers for this reader, used for offsetAtt.
protected int offset
Returns int