Name |
Description |
KeepLargestFulltextBlockFilter |
Keeps only the largest NBoilerpipe.Document.TextBlock (by number of words). If several blocks have the same number of words, the first one is chosen. All discarded blocks are marked "not content" and flagged as NBoilerpipe.Labels.DefaultLabels.MIGHT_BE_CONTENT. Unlike NBoilerpipe.Filters.Heuristics.KeepLargestBlockFilter, the number of words is computed using HeuristicFilterBase.GetNumFullTextWords(NBoilerpipe.Document.TextBlock), which only counts words that occur in text elements with at least 9 words and are thus believed to be full text. NOTE: Without language-specific fine-tuning (i.e., when running the default instance), this filter may lead to suboptimal results. Consider using NBoilerpipe.Filters.Heuristics.KeepLargestBlockFilter instead, which works at the level of number-of-words instead of text densities. |
MinFulltextWordsFilter |
Keeps only those content blocks that contain at least k full-text words (measured by HeuristicFilterBase.GetNumFullTextWords(NBoilerpipe.Document.TextBlock)). k is 30 by default. |
NumWordsRulesClassifier |
Classifies NBoilerpipe.Document.TextBlock objects as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using the number of words per block and the link density per block. |
TerminatingBlocksFinder |
Finds blocks that potentially indicate the end of an article text and marks them with NBoilerpipe.Labels.DefaultLabels.INDICATES_END_OF_TEXT. This can be used in conjunction with a downstream IgnoreBlocksAfterContentFilter. |
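
The selection rules described for KeepLargestFulltextBlockFilter and MinFulltextWordsFilter can be illustrated with a minimal, language-neutral sketch. It is written in Python and does not use the NBoilerpipe API: the `Block` class, its fields, and both function names are hypothetical stand-ins, and the full-text word count is assumed to be precomputed rather than derived by HeuristicFilterBase.GetNumFullTextWords. Treat it as an approximation of the documented behavior, not as the library's implementation.

```python
from dataclasses import dataclass, field

# Stand-in for NBoilerpipe.Labels.DefaultLabels.MIGHT_BE_CONTENT (assumption).
MIGHT_BE_CONTENT = "MIGHT_BE_CONTENT"


@dataclass
class Block:
    """Hypothetical text block; not the NBoilerpipe TextBlock type."""
    text: str
    num_full_text_words: int  # assumed to be precomputed
    is_content: bool = True
    labels: set = field(default_factory=set)


def keep_largest_fulltext_block(blocks):
    """Keep the content block with the most full-text words.

    On ties the first block wins. Every other block is marked
    "not content" and labelled MIGHT_BE_CONTENT.
    Returns True if the block list was processed.
    """
    candidates = [b for b in blocks if b.is_content]
    if not candidates:
        return False
    largest = candidates[0]
    for b in candidates[1:]:
        # Strict ">" keeps the earlier block when counts are equal.
        if b.num_full_text_words > largest.num_full_text_words:
            largest = b
    for b in blocks:
        if b is not largest:
            b.is_content = False
            b.labels.add(MIGHT_BE_CONTENT)
    return True


def min_fulltext_words(blocks, k=30):
    """Demote content blocks with fewer than k full-text words.

    Returns True if at least one block changed.
    """
    changed = False
    for b in blocks:
        if b.is_content and b.num_full_text_words < k:
            b.is_content = False
            changed = True
    return changed


if __name__ == "__main__":
    doc = [Block("menu", 0), Block("intro", 40),
           Block("body", 40), Block("footer", 12)]
    keep_largest_fulltext_block(doc)
    for b in doc:
        print(b.text, b.is_content, sorted(b.labels))
```

In this sketch the two 40-word blocks tie, so only the earlier one ("intro") stays content, which mirrors the "first block is chosen" rule in the table.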