C# Class NClassifier.SimpleHtmlTokenizer

A simple HTML tokenizer whose goal is to tokenize the words that a normal web browser would display.
It does not handle meta tags or alt or text attributes, but it does remove CSS style definitions and JavaScript code. It handles entity references by replacing each one with a space; this behavior can be overridden.
Inheritance: DefaultTokenizer
Project: colin-dumitru/Proiect-AI-2012---GUI
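
The behavior described above can be illustrated with a minimal, self-contained sketch. This is not the NClassifier source: the class name `HtmlTokenizerSketch`, the regular expressions, and the order of the steps are all assumptions chosen only to mirror the description (scripts and styles removed, tags and their attributes dropped, entity references replaced by spaces, input split on word breaks).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in, NOT the NClassifier implementation.
// It only mirrors the behavior described in the class summary.
public static class HtmlTokenizerSketch
{
    public static string[] Tokenize(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return new string[0];
        }

        // 1. Drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // 2. Drop the remaining tags; attribute values (alt, meta content, ...)
        //    disappear with them, so they are never tokenized.
        text = Regex.Replace(text, @"<[^>]*>", " ");

        // 3. Replace entity references such as &amp; or &#160; with a space.
        text = Regex.Replace(text, @"&#?\w+;", " ");

        // 4. Break on runs of non-word characters and discard empty tokens.
        string[] parts = Regex.Split(text, @"\W+");
        return Array.FindAll(parts, p => p.Length > 0);
    }
}
```

With this sketch, an input such as `<style>p { color: red; }</style><p>Fish &amp; chips</p>` yields the two tokens `Fish` and `chips`. Regular expressions are used here for brevity; they are not a robust way to parse arbitrary HTML.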

Public Methods

| Method | Description |
| --- | --- |
| ResolveEntities ( string contentsWithUnresolvedEntityReferences ) : string | Replaces entity references with spaces. |
| SimpleHtmlTokenizer ( ) | Constructor; uses the BREAK_ON_WORD_BREAKS tokenizer config by default. |
| SimpleHtmlTokenizer ( int tokenizerConfig ) | |
| SimpleHtmlTokenizer ( string regularExpression ) | |
| Tokenize ( string input ) : string[] | |

Method Details

ResolveEntities() public method

Replaces entity references with spaces.
public ResolveEntities ( string contentsWithUnresolvedEntityReferences ) : string
contentsWithUnresolvedEntityReferences string The contents with the entity references.
return string
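
Since the class summary says the entity handling can be overridden, the following sketch shows one way a subclass might do that. It assumes `ResolveEntities` is declared `virtual`, which the listing does not confirm. To stay self-contained, it uses two hypothetical stand-in classes (`EntityResolverSketch` and `DecodingEntityResolverSketch`) rather than the NClassifier types.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the documented default behavior.
// NOT NClassifier code; the regular expression is an assumption.
public class EntityResolverSketch
{
    // Default: every entity reference is replaced with a single space.
    public virtual string ResolveEntities(string contentsWithUnresolvedEntityReferences)
    {
        return Regex.Replace(contentsWithUnresolvedEntityReferences, @"&#?\w+;", " ");
    }
}

// Hypothetical subclass that overrides the default.
public class DecodingEntityResolverSketch : EntityResolverSketch
{
    // Override: decode each entity into the character it stands for.
    public override string ResolveEntities(string contentsWithUnresolvedEntityReferences)
    {
        return WebUtility.HtmlDecode(contentsWithUnresolvedEntityReferences);
    }
}
```

For the input `Fish &amp; chips`, the base class returns the text with the entity replaced by a space, while the subclass returns `Fish & chips`.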

SimpleHtmlTokenizer() public method

Constructor uses BREAK_ON_WORD_BREAKS tokenizer config by default.
public SimpleHtmlTokenizer ( )

SimpleHtmlTokenizer() public method

public SimpleHtmlTokenizer ( int tokenizerConfig )
tokenizerConfig int The tokenizer config to use, for example BREAK_ON_WORD_BREAKS.

SimpleHtmlTokenizer() public method

public SimpleHtmlTokenizer ( string regularExpression )
regularExpression string
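
The listing does not say how `regularExpression` is used. A plausible reading, and an assumption here, is that it is the pattern the input is split on. The sketch below uses a hypothetical helper, `SplitPatternSketch`, to show how the choice of split pattern changes the tokens. The pattern `\W+` is presumably close to what BREAK_ON_WORD_BREAKS refers to, though the source does not confirm this.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, NOT part of NClassifier: it splits the input on a
// caller-supplied pattern and drops empty tokens.
public static class SplitPatternSketch
{
    public static string[] Split(string input, string regularExpression)
    {
        string[] parts = Regex.Split(input, regularExpression);
        return Array.FindAll(parts, p => p.Length > 0);
    }
}
```

Splitting `don't stop, believing` on `\W+` gives `don`, `t`, `stop`, `believing`, whereas splitting on `\s+` keeps the punctuation and gives `don't`, `stop,`, `believing`.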

Tokenize() public method

public Tokenize ( string input ) : string[]
input string
return string[]