C# Class NClassifier.SimpleHtmlTokenizer

A simple HTML tokenizer. Its goal is to tokenize the words that a normal web browser would display.
It does not handle meta tags or alt and text attributes, but it does remove CSS style definitions and JavaScript code. It handles entity references by replacing each one with a space; this behavior can be overridden.
Inherits: DefaultTokenizer
Project: colin-dumitru/Proiect-AI-2012---GUI
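
The markup-stripping behavior described above can be illustrated with a minimal, self-contained sketch. This is not NClassifier's actual implementation: the class name HtmlStripSketch, the method StripMarkup, and both regular expressions are assumptions made for illustration. It shows why script and style contents disappear and why attribute text such as alt is lost along with the tag.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of NClassifier: approximates the
// markup-stripping step that the class description outlines.
public static class HtmlStripSketch
{
    // Matches <script>...</script> and <style>...</style> blocks, contents included.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag together with its attributes,
    // which is why alt/text attribute values never become tokens.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    public static string StripMarkup(string html)
    {
        // Drop code and style definitions first, then the remaining tags.
        string withoutCode = ScriptOrStyle.Replace(html, " ");
        return AnyTag.Replace(withoutCode, " ");
    }
}
```

Each removed span is replaced with a space rather than an empty string so that words on either side of a tag do not run together before tokenization.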

Public Methods

Method Description
ResolveEntities ( string contentsWithUnresolvedEntityReferences ) : string

Replaces entity references with spaces.

SimpleHtmlTokenizer ( )

Constructor uses BREAK_ON_WORD_BREAKS tokenizer config by default.

SimpleHtmlTokenizer ( int tokenizerConfig )
SimpleHtmlTokenizer ( string regularExpression )
Tokenize ( string input ) : string[]

Method Details

ResolveEntities() Public Method

Replaces entity references with spaces.
public ResolveEntities ( string contentsWithUnresolvedEntityReferences ) : string
contentsWithUnresolvedEntityReferences string The contents with the entity references.
Returns string
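
The class description notes that the default entity handling can be overridden. The sketch below shows one way that could look, using stand-in classes rather than the real tokenizer: EntityResolverSketch, DecodingResolverSketch, and the entity regular expression are all hypothetical, and whether the real method is declared virtual is an assumption. Only the method name and parameter name come from the signature above.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the tokenizer, reduced to the entity step.
public class EntityResolverSketch
{
    // Matches named (&amp;), decimal (&#160;) and hex (&#xA0;) references.
    private static readonly Regex EntityReference =
        new Regex(@"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Documented default: every entity reference becomes a single space.
    public virtual string ResolveEntities(string contentsWithUnresolvedEntityReferences)
    {
        return EntityReference.Replace(contentsWithUnresolvedEntityReferences, " ");
    }
}

// Hypothetical override that decodes entities instead of discarding them.
public class DecodingResolverSketch : EntityResolverSketch
{
    public override string ResolveEntities(string contentsWithUnresolvedEntityReferences)
    {
        // WebUtility.HtmlDecode turns "&amp;" into "&", "&lt;" into "<", and so on.
        return WebUtility.HtmlDecode(contentsWithUnresolvedEntityReferences);
    }
}
```

Replacing with a space keeps the token stream simple for classification, while decoding preserves characters such as accented letters that would otherwise split a word in two.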

SimpleHtmlTokenizer() Public Method

Constructor uses BREAK_ON_WORD_BREAKS tokenizer config by default.
public SimpleHtmlTokenizer ( )

SimpleHtmlTokenizer() Public Method

public SimpleHtmlTokenizer ( int tokenizerConfig )
tokenizerConfig int

SimpleHtmlTokenizer() Public Method

public SimpleHtmlTokenizer ( string regularExpression )
regularExpression string

Tokenize() Public Method

public Tokenize ( string input ) : string[]
input string
Returns string[]
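
The page does not say how the input is split into tokens. As a rough sketch, assuming that the BREAK_ON_WORD_BREAKS configuration means splitting on runs of non-word characters, the final step might resemble the code below. WordBreakSketch and the \W+ pattern are hypothetical and are not taken from NClassifier.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of NClassifier: splits already-stripped
// text into word tokens on non-word characters.
public static class WordBreakSketch
{
    public static string[] Tokenize(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return new string[0];
        }

        // Regex.Split yields empty strings when the input starts or ends
        // with a separator, so those are filtered out afterwards.
        string[] parts = Regex.Split(input, @"\W+");
        return Array.FindAll(parts, part => part.Length > 0);
    }
}
```

A constructor that accepts a regular expression, like the one documented above, would presumably let callers swap the split pattern, for example to break on whitespace only.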