C# Class Lucene.Net.Analysis.Util.SegmentingTokenizerBase

Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.

This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing. This API is marked experimental (@lucene.experimental).

Inheritance: Tokenizer

Protected properties

Property	Type	Description
buffer	char[]
offset	int

Public methods

Method	Description
End ( ) : void
IncrementToken ( ) : bool
Reset ( ) : void

Protected methods

Method	Description
IncrementWord ( ) : bool	Returns true if another word is available
IsSafeEnd ( char ch ) : bool	For sentence tokenization, these are the unambiguous break positions.
SegmentingTokenizerBase ( AttributeFactory factory, TextReader reader, BreakIterator iterator )	Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.
SegmentingTokenizerBase ( TextReader reader, BreakIterator iterator )	Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation. Never share a BreakIterator across different TokenStreams; always pass a newly created or cloned one to this constructor.
SetNextSentence ( int sentenceStart, int sentenceEnd ) : void	Provides the next input sentence for analysis

Private methods

Method	Description
FindSafeEnd ( ) : int	Returns the last unambiguous break position in the text.
IncrementSentence ( ) : bool	Returns true if there is a token from the buffer, or false if it is exhausted.
Read ( TextReader input, char[] buffer, int offset, int length ) : int	Equivalent to commons-io's readFully, but without bugs when offset != 0.
Refill ( ) : void	Refills the buffer, accumulating the offset and setting usableLength to the last unambiguous break position.
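
The Read helper listed above follows the usual "read fully" pattern: keep calling the reader until the requested number of characters has arrived or the input ends. The following is a minimal sketch of that pattern, not the library's actual source. The class name ReadFullySketch is hypothetical, and the code assumes the standard System.IO.TextReader contract, in which Read returns 0 at end of input.

```csharp
using System;
using System.IO;

// Hypothetical stand-alone helper; it only illustrates the "read fully" loop.
public static class ReadFullySketch
{
    // Reads up to `length` chars into `buffer`, starting at index `offset`.
    // Returns the number of chars actually read, which is smaller than
    // `length` only when the reader runs out of input.
    public static int Read(TextReader input, char[] buffer, int offset, int length)
    {
        if (length < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        int remaining = length;
        while (remaining > 0)
        {
            // Always write relative to `offset`, so a non-zero offset is handled.
            int location = length - remaining;
            int count = input.Read(buffer, offset + location, remaining);
            if (count <= 0)
            {
                break; // end of input
            }
            remaining -= count;
        }
        return length - remaining;
    }
}
```

A caller can compare the return value with the requested length to detect that the input has been exhausted.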

Method details

End() public method

public End ( ) : void
Returns	void

IncrementToken() public sealed method

public sealed IncrementToken ( ) : bool
Returns	bool

IncrementWord() protected abstract method

Returns true if another word is available.

protected abstract IncrementWord ( ) : bool
Returns	bool

IsSafeEnd() protected method

For sentence tokenization, these are the unambiguous break positions.

protected IsSafeEnd ( char ch ) : bool
ch	char
Returns	bool
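
To make "unambiguous break positions" concrete, the sketch below treats hard line and paragraph terminators as safe ends, which is how the upstream Java Lucene class handles them as far as I am aware; this port may differ. It also shows a backward scan in the spirit of FindSafeEnd. The class name SafeEndSketch, the explicit buffer and length parameters, and the -1 "not found" return value are assumptions made for illustration.

```csharp
using System;

// Hypothetical stand-alone illustration; not the Lucene.Net source.
public static class SafeEndSketch
{
    // A character after which a sentence can never continue.
    public static bool IsSafeEnd(char ch)
    {
        switch (ch)
        {
            case '\r':     // carriage return
            case '\n':     // line feed
            case '\u0085': // next line (NEL)
            case '\u2028': // line separator
            case '\u2029': // paragraph separator
                return true;
            default:
                return false;
        }
    }

    // Scans backwards over the first `length` chars and returns the index
    // just past the last safe end, or -1 if the text contains none.
    public static int FindSafeEnd(char[] buffer, int length)
    {
        for (int i = length - 1; i >= 0; i--)
        {
            if (IsSafeEnd(buffer[i]))
            {
                return i + 1;
            }
        }
        return -1;
    }
}
```

Cutting a partially filled buffer at such a position means no sentence is split between two buffer fills.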

Reset() public method

public Reset ( ) : void
Returns	void

SegmentingTokenizerBase() protected constructor

Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.

protected SegmentingTokenizerBase ( AttributeFactory factory, TextReader reader, BreakIterator iterator )
factory	AttributeFactory
reader	System.IO.TextReader
iterator	BreakIterator

SegmentingTokenizerBase() protected constructor

Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.

Never share a BreakIterator across different TokenStreams; always pass a newly created or cloned one to this constructor.

protected SegmentingTokenizerBase ( TextReader reader, BreakIterator iterator )
reader	System.IO.TextReader
iterator	BreakIterator

SetNextSentence() protected abstract method

Provides the next input sentence for analysis.

protected abstract SetNextSentence ( int sentenceStart, int sentenceEnd ) : void
sentenceStart	int
sentenceEnd	int
Returns	void
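
The contract between SetNextSentence and IncrementWord can be pictured with a small simulation that does not depend on Lucene.Net at all. In this sketch the class WordSplitterSketch, its Word property, and the Tokenize driver are hypothetical. The sentence boundaries are passed in as a plain array, standing in for what a BreakIterator would report, and words are simply runs of letters or digits.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical simulation of the sentence/word hand-off; not Lucene.Net code.
public class WordSplitterSketch
{
    private readonly char[] buffer;
    private int pos; // next position to examine in the current sentence
    private int end; // exclusive end of the current sentence

    public string Word { get; private set; }

    public WordSplitterSketch(char[] buffer)
    {
        this.buffer = buffer;
    }

    // Receives the next sentence as a [start, end) range of the buffer.
    public void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        pos = sentenceStart;
        end = sentenceEnd;
    }

    // Advances to the next word of the current sentence, if any.
    public bool IncrementWord()
    {
        while (pos < end && !char.IsLetterOrDigit(buffer[pos])) pos++;
        if (pos >= end) return false;
        int start = pos;
        while (pos < end && char.IsLetterOrDigit(buffer[pos])) pos++;
        Word = new string(buffer, start, pos - start);
        return true;
    }

    // Driver: hands each sentence to the splitter, then drains its words.
    public static List<string> Tokenize(string text, int[] sentenceBounds)
    {
        var splitter = new WordSplitterSketch(text.ToCharArray());
        var words = new List<string>();
        for (int i = 0; i + 1 < sentenceBounds.Length; i++)
        {
            splitter.SetNextSentence(sentenceBounds[i], sentenceBounds[i + 1]);
            while (splitter.IncrementWord())
            {
                words.Add(splitter.Word);
            }
        }
        return words;
    }
}
```

Because each call to IncrementWord only ever sees one sentence range, a subclass can use the whole sentence as context when deciding where words begin and end.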

Property details

buffer protected property

protected char[] buffer
Returns	char[]

offset protected property

Accumulated offset of previous buffers for this reader, used for offsetAtt.

protected int offset
Returns	int
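
The bookkeeping behind offset is simple arithmetic: a position inside the current buffer becomes a position in the whole input once the lengths of all earlier buffers are added to it. The sketch below illustrates only that arithmetic. The class OffsetSketch and its members are hypothetical names and are not part of Lucene.Net.

```csharp
using System;

// Hypothetical illustration of accumulated-offset bookkeeping.
public class OffsetSketch
{
    // Number of chars consumed by all previous buffers (the role of `offset`).
    public int Offset { get; private set; }

    // Called when the buffer is refilled: the consumed part is accounted for.
    public void OnRefill(int consumedFromPreviousBuffer)
    {
        Offset += consumedFromPreviousBuffer;
    }

    // Converts an index within the current buffer to an index in the input.
    public int ToStreamOffset(int indexInBuffer)
    {
        return Offset + indexInBuffer;
    }
}
```

A subclass would apply the same addition when it reports token start and end offsets for words found in the current buffer.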