C# Class Majestic12.HTMLparser

Parses HTML by splitting it into small tokens (HTMLchunks) such as tags, text, and comments. Do NOT create multiple instances of this class - REUSE a single instance. Do NOT call the same instance from multiple threads - it is NOT thread safe.
Inheritance: IDisposable
Project: arktronic/sevenauth

Public Properties

Property Type Description
bAutoExtractBetweenTagsOnly bool
bAutoKeepComments bool
bAutoKeepScripts bool
bAutoMarkClosedTagsWithParamsAsOpen bool
bCompressWhiteSpaceBeforeTag bool
bKeepRawHTML bool
bThrowExceptionOnEncodingSetFailure bool
oEnc System.Text.Encoding
oHE HTMLheuristics

Public Methods

Method Description
CalculateWidth ( string sWidth, int iAvailWidth, bool &bRelative ) : int

Parses WIDTH param and calculates width

ChangeToEntities ( string sLine ) : string
ChangeToEntities ( string sLine, bool bChangeDangerousCharsOnly ) : string

Parses a line and changes known entity characters into proper HTML entities

CleanUp ( ) : void

Cleans up parser in preparation for next parsing

Close ( ) : void

Closes object and releases all allocated resources

DecodeEntities ( string sData ) : string

This function will decode any entities found in a string - not fast!

Dispose ( ) : void
HTMLparser ( string p_oHTML ) : System

Constructs parser object using provided HTML as source for parsing

HandleMetaEncoding ( HTMLparser oP, HTMLchunk oChunk, bool &bEncodingSet ) : bool

Handles META tags that set page encoding

Init ( byte[] p_bHTML ) : void

Initialises the parser with HTML to be parsed from the provided data buffer; this is best in terms of correctness of parsing of the various encodings that can be used in HTML

Init ( byte[] p_bHTML, int p_iHtmlLength ) : void

Inits parsing

Init ( string p_oHTML ) : void

Initialises the parser with HTML to be parsed from the provided string

InitMiniEntities ( ) : void

Inits mini-entities mode: only "nbsp" will be converted into space, all other entities will be left as is

IsBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool

Checks if first font is bigger than the second

IsEqualOrBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool

Checks if first font is equal or bigger than the second

LoadFromFile ( string sFileName ) : void

Loads HTML from file

ParseFontSize ( string sSize, FontSize oCurSize ) : FontSize

Parses font's tag size param

ParseNext ( ) : HTMLchunk

Parses the next chunk and returns it

ParseNextTag ( ) : HTMLchunk

Returns next tag or null if end of document, text will be ignored completely

Reset ( ) : void

Resets current parsed data to start

SetChunkHashMode ( bool bHashMode ) : void

Sets chunk param hash mode

SetEncoding ( string p_sCharSet ) : bool

Sets current encoding in format used in HTTP headers and HTML META tags

SetEncoding ( Encoding p_oEnc ) : void

Sets encoding

SetRawHTML ( HTMLchunk oChunk ) : void

Sets oHTML variable in a chunk to the raw HTML that was parsed for that chunk.

Private Methods

Method Description
Dispose ( bool bDisposing ) : void
GetCharSet ( string sData ) : string

Retrieves charset information from format used in HTTP headers and META descriptions

GetNextTag ( ) : HTMLchunk

Internally parses tag and returns it from point when '<' was found

HTMLparser ( ) : System
ParseTextWithEntities ( ) : HTMLchunk

Method Details

CalculateWidth() public static method

Parses WIDTH param and calculates width
public static CalculateWidth ( string sWidth, int iAvailWidth, bool &bRelative ) : int
sWidth string WIDTH param from tag
iAvailWidth int Currently available width for relative calculations, if negative width will be returned as is
bRelative bool Flag that will be set to true if width was relative
return int
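
The parameter descriptions above suggest how a WIDTH value might be resolved. The following is a minimal sketch of that idea using only the .NET standard library. It is not the Majestic12 implementation: the helper name `ResolveWidth`, the treatment of a trailing `%` as "relative", and returning 0 for unparsable input are all assumptions made for illustration.

```csharp
using System;
using System.Globalization;

public static class WidthSketch
{
    // Hypothetical helper (not part of Majestic12): resolves a WIDTH attribute value.
    public static int ResolveWidth(string sWidth, int iAvailWidth, ref bool bRelative)
    {
        bRelative = false;
        string s = (sWidth ?? string.Empty).Trim();

        // Assumption: a trailing '%' marks a relative width.
        if (s.EndsWith("%", StringComparison.Ordinal))
        {
            bRelative = true;
            s = s.Substring(0, s.Length - 1).Trim();
        }

        int value;
        if (!int.TryParse(s, NumberStyles.Integer, CultureInfo.InvariantCulture, out value))
            return 0; // Assumption: unparsable input yields 0.

        // Absolute widths, and relative widths with a negative available width,
        // are returned as is.
        if (!bRelative || iAvailWidth < 0)
            return value;

        // Relative width: percentage of the available width (integer arithmetic).
        return iAvailWidth * value / 100;
    }

    public static void Main()
    {
        bool relative = false;
        Console.WriteLine(ResolveWidth("50%", 200, ref relative) + " relative=" + relative);
        Console.WriteLine(ResolveWidth("120", 200, ref relative) + " relative=" + relative);
    }
}
```

The `ref bool` parameter mirrors the `bool &bRelative` flag in the signature above, so the caller can tell a percentage apart from an absolute width even when both resolve to the same number.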

ChangeToEntities() public method

public ChangeToEntities ( string sLine ) : string
sLine string
return string

ChangeToEntities() public method

Parses a line and changes known entity characters into proper HTML entities
public ChangeToEntities ( string sLine, bool bChangeDangerousCharsOnly ) : string
sLine string Line of text
bChangeDangerousCharsOnly bool
return string
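
The page does not say which characters `bChangeDangerousCharsOnly` covers. As a rough illustration of the general idea only, the sketch below escapes the four characters that are commonly treated as unsafe in HTML text and attribute values (`&`, `<`, `>`, `"`). The helper name `EscapeDangerousChars` and this choice of character set are assumptions, not the library's behavior.

```csharp
using System;
using System.Text;

public static class EntitySketch
{
    // Hypothetical helper (not part of Majestic12): escapes a small,
    // assumed set of "dangerous" characters and leaves everything else alone.
    public static string EscapeDangerousChars(string sLine)
    {
        if (sLine == null)
            return string.Empty;

        var sb = new StringBuilder(sLine.Length + 16);
        foreach (char c in sLine)
        {
            switch (c)
            {
                case '&': sb.Append("&amp;"); break;
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(EscapeDangerousChars("<a href=\"x\">A & B</a>"));
    }
}
```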

CleanUp() public method

Cleans up parser in preparation for next parsing
public CleanUp ( ) : void
return void

Close() public method

Closes object and releases all allocated resources
public Close ( ) : void
return void

DecodeEntities() public static method

This function will decode any entities found in a string - not fast!
public static DecodeEntities ( string sData ) : string
sData string
return string
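
For comparison, the .NET standard library offers a similar operation in `System.Net.WebUtility.HtmlDecode`. The sketch below uses that standard call rather than `DecodeEntities`, so it only illustrates what entity decoding does; the two may differ in which entities they recognise. The wrapper name `Decode` is hypothetical.

```csharp
using System;
using System.Net;

public static class DecodeSketch
{
    // Hypothetical wrapper around the standard-library decoder,
    // shown here as a stand-in for the general technique.
    public static string Decode(string sData)
    {
        return WebUtility.HtmlDecode(sData);
    }

    public static void Main()
    {
        // Named and numeric entities are turned back into characters.
        Console.WriteLine(Decode("a &lt; b &amp;&amp; c &gt; d"));
        Console.WriteLine(Decode("&#169; 2024"));
    }
}
```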

Dispose() public method

public Dispose ( ) : void
return void

HTMLparser() public method

Constructs parser object using provided HTML as source for parsing
public HTMLparser ( string p_oHTML ) : System
p_oHTML string
return System

HandleMetaEncoding() public static method

Handles META tags that set page encoding
public static HandleMetaEncoding ( HTMLparser oP, HTMLchunk oChunk, bool &bEncodingSet ) : bool
oP HTMLparser HTML parser object that is used for parsing
oChunk HTMLchunk Parsed chunk that should contain tag META
bEncodingSet bool Your own flag that shows whether encoding was already set or not, if set once then it should not be changed - this is the logic applied by major browsers
return bool

Init() public method

Initialises the parser with HTML to be parsed from the provided data buffer; this is best in terms of correctness of parsing of the various encodings that can be used in HTML
public Init ( byte[] p_bHTML ) : void
p_bHTML byte[] Data buffer with HTML in it
return void

Init() public method

Inits parsing
public Init ( byte[] p_bHTML, int p_iHtmlLength ) : void
p_bHTML byte[] Data buffer
p_iHtmlLength int Length of data (buffer itself can be longer) - start offset assumed to be 0
return void
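
The length parameter matters when a buffer is larger than the data it holds, for example a reused read buffer. The sketch below shows the same "first N bytes, offset 0" convention with the standard `Encoding.GetString(byte[], int, int)` call. The helper `DecodePrefix` is hypothetical and is not how the parser itself consumes the buffer.

```csharp
using System;
using System.Text;

public static class BufferSketch
{
    // Hypothetical helper: decodes only the first 'length' bytes of 'buffer',
    // starting at offset 0, and ignores any unused tail of the buffer.
    public static string DecodePrefix(byte[] buffer, int length, Encoding enc)
    {
        if (buffer == null) throw new ArgumentNullException("buffer");
        if (enc == null) throw new ArgumentNullException("enc");
        if (length < 0 || length > buffer.Length)
            throw new ArgumentOutOfRangeException("length");

        return enc.GetString(buffer, 0, length);
    }

    public static void Main()
    {
        string html = "<p>hi</p>";
        byte[] buffer = new byte[64]; // deliberately larger than the data
        int used = Encoding.UTF8.GetBytes(html, 0, html.Length, buffer, 0);

        Console.WriteLine(DecodePrefix(buffer, used, Encoding.UTF8));
    }
}
```

Passing the byte count explicitly avoids decoding the zero-filled remainder of the buffer as part of the document.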

Init() public method

Initialises the parser with HTML to be parsed from the provided string
public Init ( string p_oHTML ) : void
p_oHTML string String with HTML in it
return void

InitMiniEntities() public method

Inits mini-entities mode: only "nbsp" will be converted into space, all other entities will be left as is
public InitMiniEntities ( ) : void
return void

IsBiggerFont() public static method

Checks if first font is bigger than the second
public static IsBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool
oFont1 FontSize Font #1
oFont2 FontSize Font #2
return bool

IsEqualOrBiggerFont() public static method

Checks if first font is equal or bigger than the second
public static IsEqualOrBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool
oFont1 FontSize Font #1
oFont2 FontSize Font #2
return bool

LoadFromFile() public method

Loads HTML from file
public LoadFromFile ( string sFileName ) : void
sFileName string Full filename
return void

ParseFontSize() public static method

Parses font's tag size param
public static ParseFontSize ( string sSize, FontSize oCurSize ) : FontSize
sSize string String value of the size param
oCurSize FontSize
return FontSize

ParseNext() public method

Parses the next chunk and returns it
public ParseNext ( ) : HTMLchunk
return HTMLchunk

ParseNextTag() public method

Returns next tag or null if end of document, text will be ignored completely
public ParseNextTag ( ) : HTMLchunk
return HTMLchunk

Reset() public method

Resets current parsed data to start
public Reset ( ) : void
return void

SetChunkHashMode() public method

Sets chunk param hash mode
public SetChunkHashMode ( bool bHashMode ) : void
bHashMode bool If true then tag's params will be kept in Chunk's hashtable (slower), otherwise kept in arrays (sParams/sValues)
return void

SetEncoding() public method

Sets current encoding in format used in HTTP headers and HTML META tags
public SetEncoding ( string p_sCharSet ) : bool
p_sCharSet string
return bool
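
The "format used in HTTP headers and HTML META tags" is typically a value such as `text/html; charset=utf-8`. The sketch below shows one way such a value could be turned into a `System.Text.Encoding` with a success flag, using only standard .NET calls. The helper names `ExtractCharset` and `TryGetEncoding` are hypothetical, and reading the `bool` result as "true on success" is an assumption; the page does not state what the return value means.

```csharp
using System;
using System.Text;

public static class CharsetSketch
{
    // Hypothetical helper: pulls "utf-8" out of "text/html; charset=utf-8".
    // Returns null when no charset is present.
    public static string ExtractCharset(string sData)
    {
        if (sData == null) return null;

        const string key = "charset=";
        int pos = sData.IndexOf(key, StringComparison.OrdinalIgnoreCase);
        if (pos < 0) return null;

        string rest = sData.Substring(pos + key.Length).TrimStart();
        int end = rest.IndexOfAny(new[] { ';', ' ' });
        if (end >= 0) rest = rest.Substring(0, end);

        rest = rest.Trim('"', '\'', ' ');
        return rest.Length == 0 ? null : rest;
    }

    // Hypothetical helper: returns false instead of throwing when the
    // charset name is missing, malformed, or not supported on this platform.
    public static bool TryGetEncoding(string charset, out Encoding enc)
    {
        try
        {
            enc = Encoding.GetEncoding(charset);
            return true;
        }
        catch (ArgumentException)
        {
            enc = null;
            return false;
        }
        catch (NotSupportedException)
        {
            enc = null;
            return false;
        }
    }

    public static void Main()
    {
        Encoding enc;
        string name = ExtractCharset("text/html; charset=utf-8");
        Console.WriteLine(TryGetEncoding(name, out enc) ? enc.WebName : "unsupported");
    }
}
```

A caller that prefers an exception over a silent fallback can simply throw when the flag is false, which is the trade-off that `bThrowExceptionOnEncodingSetFailure` exposes.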

SetEncoding() public method

Sets encoding
public SetEncoding ( Encoding p_oEnc ) : void
p_oEnc System.Text.Encoding Encoding object
return void

SetRawHTML() public method

Sets oHTML variable in a chunk to the raw HTML that was parsed for that chunk.
public SetRawHTML ( HTMLchunk oChunk ) : void
oChunk HTMLchunk Chunk returned by ParseNext function, it must belong to the same HTMLparser that was initiated with the same HTML data that this chunk belongs to
return void

Property Details

bAutoExtractBetweenTagsOnly public property

If true (and either bAutoKeepComments or bAutoKeepScripts is true), then oHTML will be set to the data BETWEEN tags, excluding the tags themselves. Otherwise the FULL HTML will be set, i.e. '<!-- comments -->', but if this is set to true then only ' comments ' will be returned
public bool bAutoExtractBetweenTagsOnly
return bool
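
The difference between the two modes can be illustrated on a comment such as `<!-- comments -->`. The sketch below is a simplified stand-alone illustration using plain string searches. The helper `ExtractComment` is hypothetical and only handles the first comment in the input; it is not the parser's own logic.

```csharp
using System;

public static class CommentSketch
{
    // Hypothetical helper: returns the first HTML comment in 'html', either
    // with its delimiters (full HTML) or only the text between them.
    // Returns null when no complete comment is found.
    public static string ExtractComment(string html, bool betweenTagsOnly)
    {
        const string open = "<!--";
        const string close = "-->";

        int start = html.IndexOf(open, StringComparison.Ordinal);
        if (start < 0) return null;

        int end = html.IndexOf(close, start + open.Length, StringComparison.Ordinal);
        if (end < 0) return null;

        return betweenTagsOnly
            ? html.Substring(start + open.Length, end - start - open.Length)
            : html.Substring(start, end + close.Length - start);
    }

    public static void Main()
    {
        string html = "<p><!-- comments --></p>";
        Console.WriteLine("[" + ExtractComment(html, false) + "]");
        Console.WriteLine("[" + ExtractComment(html, true) + "]");
    }
}
```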

bAutoKeepComments public property

If true (default) then HTML for comments tags themselves AND between them will be set to oHTML variable, otherwise it will be empty but you can always set it later
public bool bAutoKeepComments
return bool

bAutoKeepScripts public property

If true (default: false) then HTML for script tags themselves AND between them will be set to oHTML variable, otherwise it will be empty but you can always set it later
public bool bAutoKeepScripts
return bool

bAutoMarkClosedTagsWithParamsAsOpen public property

Long-winded name... By default, if a tag is closed BUT has parameters, it is treated as an open tag; this is not correct for proper XML parsing
public bool bAutoMarkClosedTagsWithParamsAsOpen
return bool

bCompressWhiteSpaceBeforeTag public property

If true (default), all whitespace before a tag starts is compressed to a single space char (32 or 0x20), which makes the parser run a bit faster. If you need the exact whitespace before tags, set this flag to false
public bool bCompressWhiteSpaceBeforeTag
return bool

bKeepRawHTML public property

If true (default: false) then parsed tag chunks will contain raw HTML, otherwise only comments will have it set

Performance hint: keep it as false, you can always get to original HTML as each chunk contains offset from which parsing started and finished, thus allowing to set exact HTML that was parsed

public bool bKeepRawHTML
return bool

bThrowExceptionOnEncodingSetFailure public property

If true, an exception will be thrown when the encoding taken from the HTML cannot be set, which can happen if the encoding was incorrect or not supported; this aborts processing. The default behavior is to fall back to the Default encoding, which should keep symbols as is, so text in an unsupported encoding will most likely look like garbage.
public bool bThrowExceptionOnEncodingSetFailure
return bool

oEnc public property

Encoding used to convert binary data into string
public System.Text.Encoding oEnc
return System.Text.Encoding

oHE public property

Heuristics engine used by the tag parser to quickly match known tags and attribute names. It can be disabled, or you can add more tags to it to fit your most likely cases; it is currently tuned for HTML
public Majestic12.HTMLheuristics oHE
return HTMLheuristics