C# (CSharp) Lucene.Net.Codecs.Memory Namespace

Classes

Name Description
DirectDocValuesProducer Reader for DirectDocValuesFormat
DirectDocValuesProducer.BinaryDocValuesAnonymousInnerClassHelper
DirectDocValuesProducer.BinaryEntry
DirectDocValuesProducer.FSTEntry
DirectDocValuesProducer.NumericDocValuesAnonymousInnerClassHelper
DirectDocValuesProducer.NumericDocValuesAnonymousInnerClassHelper2
DirectDocValuesProducer.NumericDocValuesAnonymousInnerClassHelper3
DirectDocValuesProducer.NumericDocValuesAnonymousInnerClassHelper4
DirectDocValuesProducer.NumericEntry
DirectDocValuesProducer.RandomAccessOrdsAnonymousInnerClassHelper
DirectDocValuesProducer.SortedDocValuesAnonymousInnerClassHelper
DirectDocValuesProducer.SortedEntry
DirectDocValuesProducer.SortedSetEntry
DirectDocValuesProducer.SortedSetRawValues
DirectPostingsFormat Wraps Lucene41PostingsFormat for on-disk storage, but at read time loads and stores all terms and postings directly in RAM as byte[] and int[].

WARNING: This is exceptionally RAM intensive: it makes no effort to compress the postings data, storing terms as separate byte[] and postings as separate int[], but as a result it gives a substantial increase in search performance.

This postings format supports TermsEnum#ord and TermsEnum#seekExact(long).

Because this holds all term bytes as a single byte[], you cannot have more than 2.1GB worth of term bytes in a single segment. @lucene.experimental
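
As an illustration of how a single flat byte array can support lookups by ord, here is a minimal Python sketch. It is not the Lucene.Net implementation: the class name FlatTerms, the offsets list, and the method names are all assumptions made for this example. The only facts taken from the description above are that term bytes live in one array and that terms can be addressed by ord.

```python
class FlatTerms:
    """Hypothetical stand-in: sorted terms packed into one flat byte array."""

    def __init__(self, terms):
        self.term_bytes = bytearray()  # all term bytes, stored back to back
        self.offsets = [0]             # offsets[i] is where term i starts
        for term in sorted(set(terms)):
            self.term_bytes += term
            self.offsets.append(len(self.term_bytes))

    def __len__(self):
        return len(self.offsets) - 1

    def term_at(self, ord_):
        """Seek by ord: slice the flat array between two offsets."""
        start, end = self.offsets[ord_], self.offsets[ord_ + 1]
        return bytes(self.term_bytes[start:end])

    def ord_of(self, term):
        """Binary search over the sorted terms; returns the ord, or -1 if absent."""
        lo, hi = 0, len(self) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            candidate = self.term_at(mid)
            if candidate == term:
                return mid
            if candidate < term:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1


terms = FlatTerms([b"lucene", b"codec", b"memory"])
```

In this sketch every offset is an index into the one array. With a signed 32-bit index the largest addressable position is 2^31 - 1 = 2,147,483,647, which is consistent with the 2.1GB limit quoted above.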

DirectPostingsFormat.DirectField
DirectPostingsFormat.DirectField.DirectIntersectTermsEnum
DirectPostingsFormat.DirectField.DirectIntersectTermsEnum.State
DirectPostingsFormat.DirectField.DirectTermsEnum
DirectPostingsFormat.DirectField.HighFreqTerm
DirectPostingsFormat.DirectField.IntArrayWriter
DirectPostingsFormat.DirectField.LowFreqTerm
DirectPostingsFormat.DirectField.TermAndSkip
DirectPostingsFormat.DirectFields
DirectPostingsFormat.HighFreqDocsAndPositionsEnum
DirectPostingsFormat.HighFreqDocsEnum
DirectPostingsFormat.LowFreqDocsAndPositionsEnum
DirectPostingsFormat.LowFreqDocsEnum
DirectPostingsFormat.LowFreqDocsEnumNoPos
DirectPostingsFormat.LowFreqDocsEnumNoTF
FSTOrdTermsWriter FST-based term dict, using ord as FST output. The FST holds the mapping between <term, ord>, and each term's metadata is delta-encoded into a single byte block. Typically the byte block consists of four parts: 1. term statistics: docFreq, totalTermFreq; 2. monotonic long[], e.g. the pointer to the postings list for that term; 3. generic byte[], e.g. other information customized by the postings base; 4. single-level skip list to speed up metadata decoding by ord.

Files:

Term Index

The .tix file contains a list of FSTs, one for each field. Each FST maps a term to its ordinal (ord) within the current field.

  • TermIndex(.tix) --> Header, TermFST^NumFields, Footer
  • TermFST --> FST
  • Header --> CodecUtil#writeHeader CodecHeader
  • Footer --> CodecUtil#writeFooter CodecFooter

Notes:

  • Since terms are already sorted before they are written to the Term Block, their ords can be used directly to seek term metadata in the Term Block.
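
The two-step lookup described here (term to ord, then ord to metadata) can be sketched in Python as follows. This is a simplified illustration, not the Lucene.Net code: a plain dict stands in for the FST, a list stands in for the Term Block, and the sample terms and pointer values are made up.

```python
def build_term_index(terms):
    """Map each term to its ord, i.e. its position in sorted order.

    A dict stands in for the per-field FST in this sketch.
    """
    return {term: ord_ for ord_, term in enumerate(sorted(terms))}


# Hypothetical per-term metadata: a pointer to each term's postings list.
postings_pointers = {"cherry": 310, "apple": 0, "banana": 120}

term_index = build_term_index(postings_pointers)
# Metadata is stored in the same sorted order, so the ord is a direct index.
term_block = [postings_pointers[t] for t in sorted(postings_pointers)]


def lookup(term):
    """Resolve term -> ord -> metadata; returns None for an unknown term."""
    ord_ = term_index.get(term)
    return None if ord_ is None else term_block[ord_]
```

Because both structures follow the same sort order, no second search is needed once the ord is known.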

Term Block

The .tbk file contains all the statistics and metadata for terms, along with a field summary (per-field data such as the number of documents in the current field). For each field, there are four blocks:

  • statistics bytes block: contains term statistics;
  • metadata longs block: delta-encodes monotonic part of metadata;
  • metadata bytes block: encodes other parts of metadata;
  • skip block: contains skip data, to speed up metadata seeking and decoding
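
The interplay between the delta-encoded longs block and the skip block can be sketched as below. This is a minimal Python illustration under stated assumptions: SKIP_INTERVAL = 4 is an arbitrary value chosen for the example, each term carries a single monotonic long, and the skip entries hold absolute values rather than the file-pointer deltas the real format writes.

```python
SKIP_INTERVAL = 4  # hypothetical value, chosen only for this example


def encode(values):
    """Delta-encode a monotonic sequence and record a checkpoint
    (the value preceding the term) every SKIP_INTERVAL terms."""
    deltas, skips, prev = [], [], 0
    for ord_, value in enumerate(values):
        if ord_ % SKIP_INTERVAL == 0:
            skips.append(prev)
        deltas.append(value - prev)
        prev = value
    return deltas, skips


def decode_at(deltas, skips, ord_):
    """Recover the value for one ord by starting at the nearest checkpoint
    and summing at most SKIP_INTERVAL deltas."""
    block = ord_ // SKIP_INTERVAL
    value = skips[block]
    for i in range(block * SKIP_INTERVAL, ord_ + 1):
        value += deltas[i]
    return value


pointers = [10, 25, 25, 40, 100, 130, 131, 200, 512]
deltas, skips = encode(pointers)
```

Without the checkpoints, decoding the value for ord n would require summing all n + 1 deltas from the start of the block.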

File Format:

  • TermBlock(.tbk) --> Header, PostingsHeader, FieldSummary, DirOffset
  • FieldSummary --> NumFields, <FieldNumber, NumTerms, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, DataBlock>^NumFields, Footer
  • DataBlock --> StatsBlockLength, MetaLongsBlockLength, MetaBytesBlockLength, SkipBlock, StatsBlock, MetaLongsBlock, MetaBytesBlock
  • SkipBlock --> <StatsFPDelta, MetaLongsSkipFPDelta, MetaBytesSkipFPDelta, MetaLongsSkipDelta^LongsSize>^NumTerms
  • StatsBlock --> <DocFreq[Same?], (TotalTermFreq-DocFreq)?>^NumTerms
  • MetaLongsBlock --> <LongDelta^LongsSize, BytesSize>^NumTerms
  • MetaBytesBlock --> Byte^MetaBytesBlockLength
  • Header --> CodecUtil#writeHeader CodecHeader
  • DirOffset --> DataOutput#writeLong Uint64
  • NumFields, FieldNumber, DocCount, DocFreq, LongsSize --> DataOutput#writeVInt VInt
  • NumTerms, SumTotalTermFreq, SumDocFreq, StatsBlockLength, MetaLongsBlockLength, MetaBytesBlockLength, StatsFPDelta, MetaLongsSkipFPDelta, MetaBytesSkipFPDelta, MetaLongsSkipStart, TotalTermFreq, LongDelta --> DataOutput#writeVLong VLong
  • Footer --> CodecUtil#writeFooter CodecFooter
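
Most fields above are written as VInt or VLong, a variable-length encoding that stores seven data bits per byte, low-order group first, with the high bit set on every byte except the last. The Python sketch below illustrates that scheme for non-negative values only. The function names write_vint and read_vint are local to this example and are not the Lucene.Net API.

```python
def write_vint(value):
    """Encode a non-negative integer, seven bits per byte, low-order group first.
    The high bit of a byte signals that another byte follows."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def read_vint(data, pos=0):
    """Decode one value starting at pos; returns (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7
```

Small values such as DocFreq for rare terms therefore cost a single byte, which is why the format favours deltas and small counts wherever it can.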

Notes:

  • The formats of PostingsHeader and MetaBytes are customized by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information), and per-term data (non-monotonic data such as pulsed postings data).
  • During initialization the reader loads all the blocks into memory. SkipBlock is decoded so that, during a seek, the term dict can look up file pointers directly. StatsFPDelta, MetaLongsSkipFPDelta, etc. are file offsets for every SkipInterval's term. MetaLongsSkipDelta is the difference from the previous one, which indicates the value of the preceding metadata longs for every SkipInterval's term.
  • DocFreq is the count of documents which contain the term. TotalTermFreq is the total number of occurrences of the term. Usually these two values are the same for long-tail terms, so one bit is stolen from DocFreq to flag this case, which allows the encoding of TotalTermFreq to be omitted.
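
The stolen-bit trick for DocFreq and TotalTermFreq can be sketched in Python as follows. Which bit carries the flag, and its polarity, are assumptions made for this example (here the lowest bit is 1 when the two values are equal); the notes above only state that one bit of DocFreq marks the case. The function names are hypothetical.

```python
def encode_stats(doc_freq, total_term_freq):
    """Return the integers to write for one term.

    The low bit of the first value is the assumed "same" flag: when it is set,
    TotalTermFreq equals DocFreq and no second value is written.
    """
    if total_term_freq == doc_freq:
        return [(doc_freq << 1) | 1]
    return [doc_freq << 1, total_term_freq - doc_freq]


def decode_stats(values, pos=0):
    """Read one term's stats; returns (doc_freq, total_term_freq, next_pos)."""
    code = values[pos]
    pos += 1
    doc_freq = code >> 1
    if code & 1:
        return doc_freq, doc_freq, pos
    return doc_freq, doc_freq + values[pos], pos + 1
```

For a term that occurs once in each of three documents, encode_stats(3, 3) yields a single value, whereas a term with ten occurrences across three documents needs two.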
@lucene.experimental
FSTOrdTermsWriter.FieldMetaData
FSTOrdTermsWriter.TermsWriter
FSTTermsWriter FST-based term dict, using metadata as FST output. The FST directly holds the mapping between <term, metadata>. Term metadata consists of three parts: 1. term statistics: docFreq, totalTermFreq; 2. monotonic long[], e.g. the pointer to the postings list for that term; 3. generic byte[], e.g. other information needed by the postings reader.

File:

Term Dictionary

The .tst file contains a list of FSTs, one for each field. Each FST maps a term to its corresponding statistics (e.g. docFreq) and metadata (e.g. information for the postings list reader, such as the file pointer to the postings list).

Typically the metadata is separated into two parts:

  • Monotonic long array: Some metadata always ascends in the same order as the corresponding terms. The FST uses this part to share outputs between arcs.
  • Generic byte array: Used to store non-monotonic metadata.

File format:

  • TermsDict(.tst) --> Header, PostingsHeader, FieldSummary, DirOffset
  • FieldSummary --> NumFields, <FieldNumber, NumTerms, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, TermFST>^NumFields
  • TermFST --> FST<TermData>
  • TermData --> Flag, BytesSize?, LongDelta^LongsSize?, Byte^BytesSize?, <DocFreq[Same?], (TotalTermFreq-DocFreq)>?
  • Header --> CodecUtil#writeHeader CodecHeader
  • DirOffset --> DataOutput#writeLong Uint64
  • DocFreq, LongsSize, BytesSize, NumFields, FieldNumber, DocCount --> DataOutput#writeVInt VInt
  • TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq, LongDelta --> DataOutput#writeVLong VLong

Notes:

  • The formats of PostingsHeader and the generic meta bytes are customized by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information), and per-term data (non-monotonic data such as pulsed postings data).
  • The format of TermData is determined by the FST. Typically, monotonic metadata is dense around shallow arcs, while deeper arcs hold only generic bytes and term statistics.
  • The Flag byte indicates which parts of the metadata exist on the current arc. Specifically, the monotonic part is omitted when it is an array of 0s.
  • Since LongsSize is fixed per field, it is written only once, in the field summary.
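
The role of the Flag byte can be illustrated with the Python sketch below. The bit assignments (HAS_LONGS, HAS_BYTES, HAS_STATS) and the function name are invented for this example and are not the values Lucene.Net uses. The sketch only demonstrates the idea stated in the notes: each part is written only when present, and an all-zero monotonic array is skipped.

```python
# Hypothetical bit assignments, for illustration only.
HAS_LONGS = 0x01
HAS_BYTES = 0x02
HAS_STATS = 0x04


def encode_arc_output(longs, meta_bytes, stats):
    """Return (flag, parts), where parts lists only the pieces that get written.

    longs:      monotonic metadata for this arc (omitted when all zeros)
    meta_bytes: generic metadata bytes (omitted when empty)
    stats:      (doc_freq, total_term_freq) tuple, or None if absent
    """
    flag = 0
    parts = []
    if any(longs):
        flag |= HAS_LONGS
        parts.append(("longs", list(longs)))
    if meta_bytes:
        flag |= HAS_BYTES
        parts.append(("bytes", bytes(meta_bytes)))
    if stats is not None:
        flag |= HAS_STATS
        parts.append(("stats", stats))
    return flag, parts
```

An arc whose output carries nothing costs only the flag itself, which matters because most arcs deep in the FST carry little or no monotonic data.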
@lucene.experimental
FSTTermsWriter.FieldMetaData
FSTTermsWriter.TermsWriter
MemoryPostingsFormat
MemoryPostingsFormat.FSTDocsAndPositionsEnum
MemoryPostingsFormat.FSTDocsEnum
MemoryPostingsFormat.FSTTermsEnum
MemoryPostingsFormat.FieldsConsumerAnonymousInnerClassHelper
MemoryPostingsFormat.FieldsProducerAnonymousInnerClassHelper
MemoryPostingsFormat.TermsReader
MemoryPostingsFormat.TermsWriter
MemoryPostingsFormat.TermsWriter.PostingsWriter
TestDirectDocValuesFormat Tests DirectDocValuesFormat
TestFSTPulsing41PostingsFormat Tests FSTPulsing41PostingsFormat
TestMemoryPostingsFormat Tests MemoryPostingsFormat