C# Class Lucene.Net.Analysis.Util.SegmentingTokenizerBase

Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.

This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing. This API is experimental.

Inheritance: Tokenizer

Project: apache/lucenenet

Protected Properties

Property	Type	Description
buffer	char[]
offset	int	Accumulated offset of previous buffers for this reader, for offsetAtt.

Public Methods

Method	Description
End ( ) : void
IncrementToken ( ) : bool
Reset ( ) : void

Protected Methods

Method	Description
IncrementWord ( ) : bool	Returns true if another word is available
IsSafeEnd ( char ch ) : bool	For sentence tokenization, these are the unambiguous break positions.
SegmentingTokenizerBase ( AttributeFactory factory, TextReader reader, BreakIterator iterator )	Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.
SegmentingTokenizerBase ( TextReader reader, BreakIterator iterator )	Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation. Never share a BreakIterator across different TokenStreams; always pass a newly created or cloned one to this constructor.
SetNextSentence ( int sentenceStart, int sentenceEnd ) : void	Provides the next input sentence for analysis

Private Methods

Method	Description
FindSafeEnd ( ) : int	Returns the last unambiguous break position in the text.
IncrementSentence ( ) : bool	Returns true if there is a token from the buffer, or false if it is exhausted.
Read ( TextReader input, char[] buffer, int offset, int length ) : int	Equivalent to commons-io's readFully, but correct when offset != 0.
Refill ( ) : void	Refill the buffer, accumulating the offset and setting usableLength to the last unambiguous break position
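
The "read fully" behaviour described for Read can be sketched as below. This is an illustration only, not the library's private implementation: the class name ReaderUtil and the method name ReadFully are hypothetical. The loop keeps reading until the requested number of chars has been stored or the reader is exhausted, and it always writes relative to the caller's offset.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of Lucene.NET.
public static class ReaderUtil
{
    // Reads up to `length` chars into buffer[offset..], looping because a single
    // TextReader.Read call may return fewer chars than requested.
    // Returns the number of chars actually stored.
    public static int ReadFully(TextReader input, char[] buffer, int offset, int length)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));

        int remaining = length;
        while (remaining > 0)
        {
            int location = length - remaining;          // chars stored so far
            int count = input.Read(buffer, offset + location, remaining);
            if (count <= 0) break;                      // TextReader.Read returns 0 at end of input
            remaining -= count;
        }
        return length - remaining;
    }
}
```

The write position is always offset + location, so a non-zero starting offset is honoured on every iteration.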

Method Details

End() public method

public End ( ) : void
return	void

IncrementToken() public sealed method

public sealed IncrementToken ( ) : bool
return	bool

IncrementWord() protected abstract method

Returns true if another word is available

protected abstract IncrementWord ( ) : bool
return	bool

IsSafeEnd() protected method

For sentence tokenization, these are the unambiguous break positions.

protected IsSafeEnd ( char ch ) : bool
ch	char
return	bool
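
A minimal sketch of what "unambiguous break positions" might look like, assuming (as in the upstream Java Lucene source) that the safe characters are line and paragraph terminators. The class name SafeEnd is hypothetical, and FindSafeEnd here is a stand-alone illustration of the private method of the same name, scanning backwards for the last safe position.

```csharp
// Hypothetical stand-alone illustration, not the Lucene.NET implementation.
public static class SafeEnd
{
    // Assumption: a sentence can never span a line or paragraph terminator.
    public static bool IsSafeEnd(char ch)
    {
        switch (ch)
        {
            case '\r':      // carriage return
            case '\n':      // line feed
            case '\u0085':  // next line
            case '\u2028':  // line separator
            case '\u2029':  // paragraph separator
                return true;
            default:
                return false;
        }
    }

    // Returns the index just past the last safe character in buffer[0..length),
    // or -1 if the buffer contains no safe break position.
    public static int FindSafeEnd(char[] buffer, int length)
    {
        for (int i = length - 1; i >= 0; i--)
        {
            if (IsSafeEnd(buffer[i])) return i + 1;
        }
        return -1;
    }
}
```

A subclass could override IsSafeEnd to add other characters that are unambiguous for its language or input format.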

Reset() public method

public Reset ( ) : void
return	void

SegmentingTokenizerBase() protected constructor

Constructs a new SegmentingTokenizerBase, also supplying the AttributeFactory.

protected SegmentingTokenizerBase ( AttributeFactory factory, TextReader reader, BreakIterator iterator )
factory	AttributeFactory
reader	System.IO.TextReader
iterator	BreakIterator

SegmentingTokenizerBase() protected constructor

Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.

Never share a BreakIterator across different TokenStreams; always pass a newly created or cloned one to this constructor.

protected SegmentingTokenizerBase ( TextReader reader, BreakIterator iterator )
reader	System.IO.TextReader
iterator	BreakIterator

SetNextSentence() protected abstract method

Provides the next input sentence for analysis

protected abstract SetNextSentence ( int sentenceStart, int sentenceEnd ) : void
sentenceStart	int
sentenceEnd	int
return	void
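
The division of labour between the base class and its subclasses can be sketched without any Lucene.NET dependency. Everything below is hypothetical: MiniSegmenter is a simplified stand-in for the base class, it splits sentences naively on '.' instead of using a BreakIterator, and LetterRunSegmenter is an example subclass. The point is only to show how IncrementToken might hand each sentence to SetNextSentence and then call IncrementWord until the sentence is exhausted.

```csharp
using System;

// Hypothetical stand-in for the base class; not the Lucene.NET type.
public abstract class MiniSegmenter
{
    protected readonly char[] buffer;
    private int sentencePos;   // start of the next unread sentence
    private bool inSentence;   // true while the current sentence may still hold words

    protected MiniSegmenter(string text) { buffer = text.ToCharArray(); }

    protected abstract void SetNextSentence(int sentenceStart, int sentenceEnd);
    protected abstract bool IncrementWord();

    public bool IncrementToken()
    {
        while (true)
        {
            if (inSentence && IncrementWord()) return true;
            inSentence = false;
            if (sentencePos >= buffer.Length) return false;

            // Assumption: a sentence ends at '.'; the real class uses a BreakIterator.
            int end = Array.IndexOf(buffer, '.', sentencePos);
            end = end < 0 ? buffer.Length : end + 1;
            SetNextSentence(sentencePos, end);
            sentencePos = end;
            inSentence = true;
        }
    }
}

// Example subclass: words are runs of letters or digits within the sentence.
public sealed class LetterRunSegmenter : MiniSegmenter
{
    private int pos, end;
    public string Current { get; private set; } = "";

    public LetterRunSegmenter(string text) : base(text) { }

    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        pos = sentenceStart;
        end = sentenceEnd;
    }

    protected override bool IncrementWord()
    {
        while (pos < end && !char.IsLetterOrDigit(buffer[pos])) pos++;
        if (pos >= end) return false;
        int start = pos;
        while (pos < end && char.IsLetterOrDigit(buffer[pos])) pos++;
        Current = new string(buffer, start, pos - start);
        return true;
    }
}
```

A caller would loop with `while (seg.IncrementToken())` and read `seg.Current` each time. Because the subclass sees sentence boundaries explicitly, it could also emit a marker or adjust a position increment inside SetNextSentence.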

Property Details

buffer protected property

protected char[] buffer
return	char[]

offset protected property

Accumulated offset of previous buffers for this reader, used when computing offsets for offsetAtt.

protected int offset
return	int
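
To illustrate why an accumulated offset is needed, the sketch below scans a reader in fixed-size chunks and reports absolute positions by adding the accumulated offset to each buffer-relative index. It is a simplified assumption of how such bookkeeping could work: OffsetDemo and AbsolutePositions are hypothetical names, and the sketch advances by the full chunk length rather than by the last unambiguous break position.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical illustration; not the Lucene.NET implementation.
public static class OffsetDemo
{
    // Returns the absolute positions of `target` in `text`, reading through
    // a buffer of `bufferSize` chars at a time.
    public static List<int> AbsolutePositions(string text, char target, int bufferSize)
    {
        var result = new List<int>();
        var reader = new StringReader(text);
        var buffer = new char[bufferSize];
        int offset = 0;   // accumulated length of all previously consumed buffers
        int length;

        while ((length = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < length; i++)
            {
                // A buffer-relative index becomes absolute by adding the accumulated offset.
                if (buffer[i] == target) result.Add(offset + i);
            }
            offset += length;
        }
        return result;
    }
}
```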