C# Class Majestic12.HTMLparser

Parses HTML by splitting it into small tokens (HTMLchunks) such as tags, text and comments. Do NOT create multiple instances of this class; REUSE a single instance. Do NOT call the same instance from multiple threads; it is NOT thread safe.
Inherits: IDisposable
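
A minimal usage sketch of the reuse guidance above: one parser instance processes several documents in turn and is re-initialised with Init(string) between documents. It assumes that ParseNext() returns null at the end of the document, which this page states only for ParseNextTag(). ChunkCounter and CountChunks are hypothetical names, not part of the library.

```csharp
using System.Collections.Generic;
using Majestic12;

public static class ChunkCounter
{
    // Hypothetical helper: counts chunks across several documents with ONE parser.
    public static int CountChunks(IList<string> documents)
    {
        if (documents.Count == 0) return 0;

        int total = 0;
        // Construct once with the first document, then reuse the instance via Init(string).
        using (HTMLparser parser = new HTMLparser(documents[0]))
        {
            for (int i = 0; i < documents.Count; i++)
            {
                if (i > 0) parser.Init(documents[i]);

                // Assumption: ParseNext() returns null once the input is exhausted.
                while (parser.ParseNext() != null) total++;
            }
        }
        return total;
    }
}
```

Because the class is not thread safe, a multi-threaded caller would need one parser per thread rather than a shared instance.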
Project: arktronic/sevenauth

Public Properties

Property Type Description
bAutoExtractBetweenTagsOnly bool
bAutoKeepComments bool
bAutoKeepScripts bool
bAutoMarkClosedTagsWithParamsAsOpen bool
bCompressWhiteSpaceBeforeTag bool
bKeepRawHTML bool
bThrowExceptionOnEncodingSetFailure bool
oEnc System.Text.Encoding
oHE HTMLheuristics

Public Methods

Method Description
CalculateWidth ( string sWidth, int iAvailWidth, bool &bRelative ) : int

Parses WIDTH param and calculates width

ChangeToEntities ( string sLine ) : string
ChangeToEntities ( string sLine, bool bChangeDangerousCharsOnly ) : string

Parses a line and changes known entity characters into proper HTML entities

CleanUp ( ) : void

Cleans up parser in preparation for next parsing

Close ( ) : void

Closes object and releases all allocated resources

DecodeEntities ( string sData ) : string

This function will decode any entities found in a string - not fast!

Dispose ( ) : void
HTMLparser ( string p_oHTML ) : System

Constructs parser object using provided HTML as source for parsing

HandleMetaEncoding ( HTMLparser oP, HTMLchunk oChunk, bool &bEncodingSet ) : bool

Handles META tags that set page encoding

Init ( byte[] p_bHTML ) : void

Initialises the parser with HTML to be parsed from the provided data buffer: this is best in terms of correctness of parsing of the various encodings that can be used in HTML

Init ( byte[] p_bHTML, int p_iHtmlLength ) : void

Initialises parsing

Init ( string p_oHTML ) : void

Initialises the parser with HTML to be parsed from the provided string

InitMiniEntities ( ) : void

Inits mini-entities mode: only "nbsp" will be converted into space, all other entities will be left as is

IsBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool

Checks if first font is bigger than the second

IsEqualOrBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool

Checks if the first font is equal to or bigger than the second

LoadFromFile ( string sFileName ) : void

Loads HTML from file

ParseFontSize ( string sSize, FontSize oCurSize ) : FontSize

Parses the size param of a font tag

ParseNext ( ) : HTMLchunk

Parses the next chunk and returns it

ParseNextTag ( ) : HTMLchunk

Returns next tag or null if end of document, text will be ignored completely

Reset ( ) : void

Resets current parsed data to start

SetChunkHashMode ( bool bHashMode ) : void

Sets chunk param hash mode

SetEncoding ( string p_sCharSet ) : bool

Sets current encoding in format used in HTTP headers and HTML META tags

SetEncoding ( Encoding p_oEnc ) : void

Sets encoding

SetRawHTML ( HTMLchunk oChunk ) : void

Sets oHTML variable in a chunk to the raw HTML that was parsed for that chunk.

Private Methods

Method Description
Dispose ( bool bDisposing ) : void
GetCharSet ( string sData ) : string

Retrieves charset information from format used in HTTP headers and META descriptions

GetNextTag ( ) : HTMLchunk

Internally parses tag and returns it from point when '<' was found

HTMLparser ( ) : System
ParseTextWithEntities ( ) : HTMLchunk

Method Details

CalculateWidth() public static method

Parses the WIDTH param and calculates the width
public static CalculateWidth ( string sWidth, int iAvailWidth, bool &bRelative ) : int
sWidth string WIDTH param from tag
iAvailWidth int Currently available width for relative calculations; if negative, the width will be returned as is
bRelative bool Flag that will be set to true if the width was relative
Returns int
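
The parameter descriptions above can be illustrated with a small stand-alone sketch. This is not the library's implementation: WidthSketch.Calculate is a hypothetical re-implementation, and it assumes that a relative width is written as a percentage, that the flag is passed with ref, and that empty or unparsable input falls back to the available width.

```csharp
using System;
using System.Globalization;

public static class WidthSketch
{
    // Hypothetical stand-in for CalculateWidth; illustrates the documented contract only.
    public static int Calculate(string sWidth, int iAvailWidth, ref bool bRelative)
    {
        bRelative = false;
        if (string.IsNullOrWhiteSpace(sWidth)) return iAvailWidth; // assumed fallback

        string s = sWidth.Trim();
        bool percent = s.EndsWith("%", StringComparison.Ordinal);
        if (percent) s = s.Substring(0, s.Length - 1).Trim();

        int value;
        if (!int.TryParse(s, NumberStyles.Integer, CultureInfo.InvariantCulture, out value))
            return iAvailWidth; // assumed fallback for unparsable input

        if (!percent) return value; // absolute width: returned unchanged

        bRelative = true;
        // A negative available width means "return the width as is".
        if (iAvailWidth < 0) return value;
        return iAvailWidth * value / 100;
    }
}
```

With these assumptions, "50%" against an available width of 800 yields 400 and sets the flag, while "120" yields 120 and leaves the flag false.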

ChangeToEntities() public method

public ChangeToEntities ( string sLine ) : string
sLine string
Returns string

ChangeToEntities() public method

Parses a line and changes known entity characters into proper HTML entities
public ChangeToEntities ( string sLine, bool bChangeDangerousCharsOnly ) : string
sLine string Line of text
bChangeDangerousCharsOnly bool
Returns string

CleanUp() public method

Cleans up the parser in preparation for the next parsing
public CleanUp ( ) : void
Returns void

Close() public method

Closes the object and releases all allocated resources
public Close ( ) : void
Returns void

DecodeEntities() public static method

This function will decode any entities found in a string - not fast!
public static DecodeEntities ( string sData ) : string
sData string
Returns string

Dispose() public method

public Dispose ( ) : void
Returns void

HTMLparser() public method

Constructs a parser object using the provided HTML as the source for parsing
public HTMLparser ( string p_oHTML ) : System
p_oHTML string
Returns System

HandleMetaEncoding() public static method

Handles META tags that set page encoding
public static HandleMetaEncoding ( HTMLparser oP, HTMLchunk oChunk, bool &bEncodingSet ) : bool
oP HTMLparser HTML parser object that is used for parsing
oChunk HTMLchunk Parsed chunk that should contain tag META
bEncodingSet bool Your own flag that shows whether the encoding was already set or not; if set once then it should not be changed - this is the logic applied by major browsers
Returns bool

Init() public method

Initialises the parser with HTML to be parsed from the provided data buffer: this is best in terms of correctness of parsing of the various encodings that can be used in HTML
public Init ( byte[] p_bHTML ) : void
p_bHTML byte[] Data buffer with HTML in it
Returns void

Init() public method

Initialises parsing
public Init ( byte[] p_bHTML, int p_iHtmlLength ) : void
p_bHTML byte[] Data buffer
p_iHtmlLength int Length of data (buffer itself can be longer) - start offset assumed to be 0
Returns void
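
The note that the buffer can be longer than the data is easiest to see with a MemoryStream, whose internal buffer is usually larger than the bytes written to it. The sketch below assumes the first parameter is a byte array (byte[]); BufferInitExample and InitFromStream are hypothetical names.

```csharp
using System.IO;
using Majestic12;

public static class BufferInitExample
{
    // Hypothetical helper: feeds raw bytes to the parser so that it can apply
    // the page's own encoding instead of a pre-decoded string.
    public static void InitFromStream(HTMLparser parser, Stream input)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            input.CopyTo(ms);

            // GetBuffer() returns the internal array, which may be longer than
            // the data; ms.Length is the number of valid bytes from offset 0.
            byte[] buffer = ms.GetBuffer();
            int length = (int)ms.Length;

            parser.Init(buffer, length);
        }
    }
}
```

Passing the valid length avoids copying the data into an exactly sized array first.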

Init() public method

Initialises the parser with HTML to be parsed from the provided string
public Init ( string p_oHTML ) : void
p_oHTML string String with HTML in it
Returns void

InitMiniEntities() public method

Initialises mini-entities mode: only "nbsp" will be converted into a space, all other entities will be left as is
public InitMiniEntities ( ) : void
Returns void

IsBiggerFont() public static method

Checks if the first font is bigger than the second
public static IsBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool
oFont1 FontSize Font #1
oFont2 FontSize Font #2
Returns bool

IsEqualOrBiggerFont() public static method

Checks if the first font is equal to or bigger than the second
public static IsEqualOrBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool
oFont1 FontSize Font #1
oFont2 FontSize Font #2
Returns bool

LoadFromFile() public method

Loads HTML from a file
public LoadFromFile ( string sFileName ) : void
sFileName string Full filename
Returns void

ParseFontSize() public static method

Parses the size param of a font tag
public static ParseFontSize ( string sSize, FontSize oCurSize ) : FontSize
sSize string String value of the size param
oCurSize FontSize
Returns FontSize

ParseNext() public method

Parses the next chunk and returns it
public ParseNext ( ) : HTMLchunk
Returns HTMLchunk

ParseNextTag() public method

Returns the next tag, or null at the end of the document; text will be ignored completely
public ParseNextTag ( ) : HTMLchunk
Returns HTMLchunk

Reset() public method

Resets the current parsed data to the start
public Reset ( ) : void
Returns void

SetChunkHashMode() public method

Sets chunk param hash mode
public SetChunkHashMode ( bool bHashMode ) : void
bHashMode bool If true then the tag's params will be kept in the Chunk's hashtable (slower), otherwise they are kept in arrays (sParams/sValues)
Returns void

SetEncoding() public method

Sets the current encoding, given in the format used in HTTP headers and HTML META tags
public SetEncoding ( string p_sCharSet ) : bool
p_sCharSet string
Returns bool
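
To make "the format used in HTTP headers and HTML META tags" concrete, here is a stand-alone sketch that extracts a charset name from a Content-Type style value and resolves it with System.Text.Encoding.GetEncoding. It does not reproduce the library's code; CharsetSketch, ExtractCharset and TryResolve are hypothetical names, and the bool result only mirrors the success/failure shape of SetEncoding(string) as an assumption.

```csharp
using System;
using System.Text;

public static class CharsetSketch
{
    // Pulls "utf-8" out of values such as "text/html; charset=utf-8".
    public static string ExtractCharset(string headerValue)
    {
        if (headerValue == null) return null;

        const string key = "charset=";
        int idx = headerValue.IndexOf(key, StringComparison.OrdinalIgnoreCase);
        if (idx < 0) return null;

        string rest = headerValue.Substring(idx + key.Length);
        int end = rest.IndexOfAny(new[] { ';', ' ' });
        if (end >= 0) rest = rest.Substring(0, end);

        rest = rest.Trim('"', '\'', ' ');
        return rest.Length == 0 ? null : rest;
    }

    // Returns false instead of throwing when the charset name is unknown.
    public static bool TryResolve(string charset, out Encoding encoding)
    {
        encoding = null;
        if (string.IsNullOrEmpty(charset)) return false;
        try
        {
            encoding = Encoding.GetEncoding(charset);
            return true;
        }
        catch (ArgumentException)
        {
            // Thrown by GetEncoding for unknown or unsupported names.
            return false;
        }
    }
}
```

On .NET Core and later, many legacy code pages resolve only after a code page encoding provider has been registered, so a false result does not necessarily mean the name is malformed.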

SetEncoding() public method

Sets the encoding
public SetEncoding ( Encoding p_oEnc ) : void
p_oEnc System.Text.Encoding Encoding object
Returns void

SetRawHTML() public method

Sets the oHTML variable in a chunk to the raw HTML that was parsed for that chunk.
public SetRawHTML ( HTMLchunk oChunk ) : void
oChunk HTMLchunk Chunk returned by the ParseNext function; it must belong to the same HTMLparser that was initiated with the same HTML data that this chunk belongs to
Returns void
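
The sketch below shows one way to fetch raw HTML on demand: bKeepRawHTML stays false during parsing and SetRawHTML is called only for the chunk of interest. It assumes that oHTML is a string field of HTMLchunk; RawHtmlExample and FirstTagHtml are hypothetical names.

```csharp
using Majestic12;

public static class RawHtmlExample
{
    // Hypothetical helper: returns the raw HTML of the first tag, or null if there is none.
    public static string FirstTagHtml(string html)
    {
        using (HTMLparser parser = new HTMLparser(html))
        {
            // Do not store raw HTML for every tag chunk.
            parser.bKeepRawHTML = false;

            HTMLchunk chunk = parser.ParseNextTag();
            if (chunk == null) return null; // end of document, no tag found

            // Fill in oHTML for this one chunk, using the parser that produced it.
            parser.SetRawHTML(chunk);
            return chunk.oHTML; // assumption: oHTML is a string
        }
    }
}
```

SetRawHTML is called here before any further parsing, so the chunk still refers to the data it was parsed from.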

Property Details

bAutoExtractBetweenTagsOnly public property

If true (and either bAutoKeepComments or bAutoKeepScripts is true), then oHTML will be set to the data BETWEEN tags, excluding the tags themselves. Otherwise the FULL HTML will be set, i.e. '<!-- comments -->', but if this is set to true then only ' comments ' will be returned
public bool bAutoExtractBetweenTagsOnly
Returns bool

bAutoKeepComments public property

If true (default) then the HTML for the comment tags themselves AND the content between them will be set to the oHTML variable, otherwise it will be empty, but you can always set it later
public bool bAutoKeepComments
Returns bool

bAutoKeepScripts public property

If true (default: false) then the HTML for the script tags themselves AND the content between them will be set to the oHTML variable, otherwise it will be empty, but you can always set it later
public bool bAutoKeepScripts
Returns bool

bAutoMarkClosedTagsWithParamsAsOpen public property

Long-winded name... by default, if a tag is closed BUT it has parameters, then it will be considered an open tag; this is not right for proper XML parsing
public bool bAutoMarkClosedTagsWithParamsAsOpen
Returns bool

bCompressWhiteSpaceBeforeTag public property

If true (default), then all whitespace before the start of a TAG will be compressed to a single space char (32 or 0x20); this makes the parser run a bit faster. If you need the exact whitespace before tags, change this flag to FALSE
public bool bCompressWhiteSpaceBeforeTag
Returns bool

bKeepRawHTML public property

If true (default: false) then parsed tag chunks will contain raw HTML, otherwise only comments will have it set

Performance hint: keep it as false; you can always get to the original HTML, as each chunk contains the offsets at which its parsing started and finished, which allows the exact HTML that was parsed to be set

public bool bKeepRawHTML
Returns bool

bThrowExceptionOnEncodingSetFailure public property

If true then an exception will be thrown when the encoding taken from the HTML cannot be set. This is possible if the encoding was incorrect or not supported, and it would abort processing. The default behavior is to use the Default encoding, which should keep symbols as is; the result will most likely look like garbage if the encoding was not supported.
public bool bThrowExceptionOnEncodingSetFailure
Returns bool

oEnc public property

Encoding used to convert binary data into a string
public System.Text.Encoding oEnc
Returns System.Text.Encoding

oHE public property

Heuristics engine used by the tag parser to quickly match known tags and attribute names. It can be disabled, or you can add more tags to it to fit your most likely cases; it is currently tuned for HTML
public Majestic12.HTMLheuristics oHE
Returns HTMLheuristics