C# Class NClassifier.SimpleHtmlTokenizer

A simple HTML tokenizer. Its goal is to tokenize the words that a normal web browser would display.
It does not handle meta tags or alt and text attributes, but it does remove CSS style definitions and JavaScript code. It handles entity references by replacing each one with a space; this behavior can be overridden.
Inheritance: DefaultTokenizer
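
The behavior described above can be approximated with a few regular expressions. The following is only a minimal sketch under stated assumptions: `HtmlTokenizerSketch` is a hypothetical stand-in written for illustration, not the NClassifier source, and the patterns are simplifications that a real HTML parser would handle more robustly.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for illustration; not the NClassifier implementation.
public static class HtmlTokenizerSketch
{
    public static string[] Tokenize(string html)
    {
        // 1. Drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // 2. Drop every remaining tag; attribute values such as alt text go with it.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // 3. Replace entity references such as &nbsp; or &#169; with a space.
        text = Regex.Replace(text, @"&#?\w+;", " ");

        // 4. Break on runs of non-word characters and discard empty fragments.
        return Array.FindAll(Regex.Split(text, @"\W+"), part => part.Length > 0);
    }
}
```

Calling `HtmlTokenizerSketch.Tokenize` on a page whose body is `<p>Hello&nbsp;<b>world</b></p>` yields the two tokens `Hello` and `world`; the contents of any `<style>` or `<script>` element never reach the output.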

Public Methods

Method Description
ResolveEntities ( string contentsWithUnresolvedEntityReferences ) : string

Replaces entity references with spaces.

SimpleHtmlTokenizer ( )

The constructor uses the BREAK_ON_WORD_BREAKS tokenizer config by default.

SimpleHtmlTokenizer ( int tokenizerConfig )
SimpleHtmlTokenizer ( string regularExpression )
Tokenize ( string input ) : string[]

Method Details

ResolveEntities() public method

Replaces entity references with spaces.
public ResolveEntities ( string contentsWithUnresolvedEntityReferences ) : string
contentsWithUnresolvedEntityReferences string The contents with the entity references.
Returns string
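
To illustrate the default behavior and how it might be overridden, here is a hedged sketch. It assumes the method is virtual, which the class summary implies but this page does not state outright. `EntityResolverSketch` and `DecodingEntityResolverSketch` are hypothetical names used so the example compiles on its own; only `ResolveEntities` and its parameter name come from the documentation. `WebUtility.HtmlDecode` is a standard .NET method.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the documented method; not the NClassifier source.
public class EntityResolverSketch
{
    // Mirrors the documented default: every entity reference becomes a space.
    public virtual string ResolveEntities(string contentsWithUnresolvedEntityReferences)
    {
        return Regex.Replace(contentsWithUnresolvedEntityReferences, @"&#?\w+;", " ");
    }
}

// One possible override: decode entities into the characters they stand for.
public class DecodingEntityResolverSketch : EntityResolverSketch
{
    public override string ResolveEntities(string contentsWithUnresolvedEntityReferences)
    {
        return WebUtility.HtmlDecode(contentsWithUnresolvedEntityReferences);
    }
}
```

With the input `AT&amp;T`, the default sketch returns `AT T`, while the decoding override returns `AT&T`. Which of the two is preferable depends on whether the decoded characters should take part in tokenization.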

SimpleHtmlTokenizer() public method

The constructor uses the BREAK_ON_WORD_BREAKS tokenizer config by default.
public SimpleHtmlTokenizer ( )

SimpleHtmlTokenizer() public method

public SimpleHtmlTokenizer ( int tokenizerConfig )
tokenizerConfig int

SimpleHtmlTokenizer() public method

public SimpleHtmlTokenizer ( string regularExpression )
regularExpression string
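
This page does not document how `regularExpression` is applied. The sketch below assumes it acts as a split delimiter, and that BREAK_ON_WORD_BREAKS corresponds to splitting at non-word characters, as its name suggests. Both points are assumptions, and `SplitSketch` is a hypothetical helper rather than part of NClassifier.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of NClassifier.
public static class SplitSketch
{
    // Splits the input on the given pattern and drops empty fragments.
    public static string[] SplitOn(string input, string regularExpression)
    {
        return Array.FindAll(
            Regex.Split(input, regularExpression),
            part => part.Length > 0);
    }
}
```

Under these assumptions, `SplitSketch.SplitOn("state-of-the-art design", @"\W")` breaks the hyphenated word into separate tokens (`state`, `of`, `the`, `art`, `design`), whereas the pattern `@"\s"` keeps `state-of-the-art` as a single token. A custom expression would be the way to choose between such behaviors.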

Tokenize() public method

public Tokenize ( string input ) : string[]
input string
Returns string[]