C# Class Majestic12.HTMLparser

Parses HTML by splitting it into small tokens (HTMLchunks) such as tags, text and comments. Do NOT create multiple instances of this class: REUSE a single instance. Do NOT call the same instance from multiple threads: it is NOT thread-safe.
Inheritance: IDisposable
Project: arktronic/sevenauth
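
The reuse advice above can be sketched roughly as follows. This is a minimal, untested illustration rather than the project's own usage: it assumes the string constructor also prepares the first document for parsing, and that HTMLchunk exposes the tag name as `sTag` (that member is not listed on this page).

```csharp
using System;
using Majestic12;

class TagLister
{
    static void Main()
    {
        // Two hypothetical documents, used only for illustration.
        string[] pages =
        {
            "<html><body><p>First page</p></body></html>",
            "<div><a href=\"/next\">Second page</a></div>"
        };

        // One parser instance, reused for every document and used from one thread only.
        using (HTMLparser parser = new HTMLparser(pages[0]))
        {
            for (int i = 0; i < pages.Length; i++)
            {
                if (i > 0)
                {
                    parser.CleanUp();        // prepare the parser for the next parsing run
                    parser.Init(pages[i]);   // point the same instance at the next document
                }

                HTMLchunk chunk;
                // ParseNextTag skips text and returns null at the end of the document.
                while ((chunk = parser.ParseNextTag()) != null)
                {
                    Console.WriteLine(chunk.sTag); // sTag: assumed member name
                }
            }
        }
    }
}
```

The `using` block relies only on the class implementing IDisposable, as stated above.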

Public Properties

Property Type Description
bAutoExtractBetweenTagsOnly bool
bAutoKeepComments bool
bAutoKeepScripts bool
bAutoMarkClosedTagsWithParamsAsOpen bool
bCompressWhiteSpaceBeforeTag bool
bKeepRawHTML bool
bThrowExceptionOnEncodingSetFailure bool
oEnc System.Text.Encoding
oHE HTMLheuristics

Public Methods

Method Description
CalculateWidth ( string sWidth, int iAvailWidth, bool &bRelative ) : int

Parses WIDTH param and calculates width

ChangeToEntities ( string sLine ) : string
ChangeToEntities ( string sLine, bool bChangeDangerousCharsOnly ) : string

Parses a line and changes known entity characters into proper HTML entities

CleanUp ( ) : void

Cleans up parser in preparation for next parsing

Close ( ) : void

Closes object and releases all allocated resources

DecodeEntities ( string sData ) : string

This function will decode any entities found in a string - not fast!

Dispose ( ) : void
HTMLparser ( string p_oHTML ) : System

Constructs parser object using provided HTML as source for parsing

HandleMetaEncoding ( HTMLparser oP, HTMLchunk oChunk, bool &bEncodingSet ) : bool

Handles META tags that set page encoding

Init ( byte[] p_bHTML ) : void

Initialises the parser with HTML to be parsed from the provided data buffer; this is the best option for correctly parsing the various encodings that can be used in HTML

Init ( byte[] p_bHTML, int p_iHtmlLength ) : void

Initialises parsing

Init ( string p_oHTML ) : void

Initialises the parser with HTML to be parsed from the provided string

InitMiniEntities ( ) : void

Inits mini-entities mode: only "nbsp" will be converted into space, all other entities will be left as is

IsBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool

Checks if first font is bigger than the second

IsEqualOrBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool

Checks if first font is equal or bigger than the second

LoadFromFile ( string sFileName ) : void

Loads HTML from file

ParseFontSize ( string sSize, FontSize oCurSize ) : FontSize

Parses font's tag size param

ParseNext ( ) : HTMLchunk

Parses next chunk and returns it with

ParseNextTag ( ) : HTMLchunk

Returns next tag or null if end of document, text will be ignored completely

Reset ( ) : void

Resets current parsed data to start

SetChunkHashMode ( bool bHashMode ) : void

Sets chunk param hash mode

SetEncoding ( string p_sCharSet ) : bool

Sets current encoding in format used in HTTP headers and HTML META tags

SetEncoding ( Encoding p_oEnc ) : void

Sets encoding

SetRawHTML ( HTMLchunk oChunk ) : void

Sets oHTML variable in a chunk to the raw HTML that was parsed for that chunk.

Private Methods

Method Description
Dispose ( bool bDisposing ) : void
GetCharSet ( string sData ) : string

Retrieves charset information from format used in HTTP headers and META descriptions

GetNextTag ( ) : HTMLchunk

Internally parses tag and returns it from point when '<' was found

HTMLparser ( ) : System
ParseTextWithEntities ( ) : HTMLchunk

Method Details

CalculateWidth() public static method

Parses WIDTH param and calculates width
public static CalculateWidth ( string sWidth, int iAvailWidth, bool &bRelative ) : int
sWidth string WIDTH param from tag
iAvailWidth int Currently available width for relative calculations; if negative, the width will be returned as is
bRelative bool Flag that will be set to true if the width was relative
Returns int

ChangeToEntities() public method

public ChangeToEntities ( string sLine ) : string
sLine string
Returns string

ChangeToEntities() public method

Parses a line and changes known entity characters into proper HTML entities
public ChangeToEntities ( string sLine, bool bChangeDangerousCharsOnly ) : string
sLine string Line of text
bChangeDangerousCharsOnly bool
Returns string

CleanUp() public method

Cleans up the parser in preparation for the next parsing
public CleanUp ( ) : void
Returns void

Close() public method

Closes the object and releases all allocated resources
public Close ( ) : void
Returns void

DecodeEntities() public static method

Decodes any entities found in a string. Not fast!
public static DecodeEntities ( string sData ) : string
sData string
Returns string
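
A rough sketch of how the two entity helpers might be called, based only on the signatures listed above: DecodeEntities is static, while ChangeToEntities needs a parser instance. The input strings are made up for illustration, and the exact output depends on the library's entity tables, so none is shown.

```csharp
using System;
using Majestic12;

class EntityDemo
{
    static void Main()
    {
        // Static call: no parser instance is required.
        string decoded = HTMLparser.DecodeEntities("Fish &amp; chips &lt;daily&gt;");
        Console.WriteLine(decoded);

        // Instance call: the placeholder HTML only serves to construct the parser.
        using (HTMLparser parser = new HTMLparser("<p></p>"))
        {
            // Second argument true: convert only the "dangerous" characters.
            string encoded = parser.ChangeToEntities("1 < 2 & 3 > 2", true);
            Console.WriteLine(encoded);
        }
    }
}
```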

Dispose() public method

public Dispose ( ) : void
Returns void

HTMLparser() public method

Constructs a parser object using the provided HTML as the source for parsing
public HTMLparser ( string p_oHTML ) : System
p_oHTML string
Returns System

HandleMetaEncoding() public static method

Handles META tags that set page encoding
public static HandleMetaEncoding ( HTMLparser oP, HTMLchunk oChunk, bool &bEncodingSet ) : bool
oP HTMLparser HTML parser object that is used for parsing
oChunk HTMLchunk Parsed chunk that should contain tag META
bEncodingSet bool Your own flag that shows whether the encoding was already set or not; once set, it should not be changed. This is the logic applied by major browsers
Returns bool
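
As an illustration of the flag described above, the sketch below feeds a byte buffer to the parser and lets HandleMetaEncoding react to the first META tag it sees. It is an untested sketch under several assumptions: the `&` parameter is passed with `ref`, HTMLchunk exposes the tag name as `sTag`, and `oEnc` is updated once the encoding has been set.

```csharp
using System;
using System.Text;
using Majestic12;

class MetaEncodingDemo
{
    static void Main()
    {
        // Hypothetical page bytes, as they might arrive from disk or the network.
        byte[] data = Encoding.UTF8.GetBytes(
            "<html><head><meta http-equiv=\"Content-Type\" " +
            "content=\"text/html; charset=utf-8\"></head>" +
            "<body>caf\u00e9</body></html>");

        // Constructed from a placeholder string, then re-initialised from the buffer.
        using (HTMLparser parser = new HTMLparser("<html></html>"))
        {
            parser.CleanUp();
            parser.Init(data);

            bool encodingSet = false; // our own flag, owned by the caller
            HTMLchunk chunk;
            while ((chunk = parser.ParseNextTag()) != null)
            {
                // sTag is an assumed member name; only META tags are passed on.
                if (!encodingSet &&
                    string.Equals(chunk.sTag, "meta", StringComparison.OrdinalIgnoreCase))
                {
                    HTMLparser.HandleMetaEncoding(parser, chunk, ref encodingSet);
                }
            }

            if (encodingSet)
                Console.WriteLine(parser.oEnc.WebName);
        }
    }
}
```

Keeping the flag on the caller's side means that a second META tag later in the document does not override the first one.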

Init() public method

Initialises the parser with HTML to be parsed from the provided data buffer; this is the best option for correctly parsing the various encodings that can be used in HTML
public Init ( byte[] p_bHTML ) : void
p_bHTML byte[] Data buffer with HTML in it
Returns void

Init() public method

Initialises parsing
public Init ( byte[] p_bHTML, int p_iHtmlLength ) : void
p_bHTML byte[] Data buffer
p_iHtmlLength int Length of data (the buffer itself can be longer); the start offset is assumed to be 0
Returns void

Init() public method

Initialises the parser with HTML to be parsed from the provided string
public Init ( string p_oHTML ) : void
p_oHTML string String with HTML in it
Returns void

InitMiniEntities() public method

Initialises mini-entities mode: only "nbsp" will be converted into a space, all other entities will be left as is
public InitMiniEntities ( ) : void
Returns void

IsBiggerFont() public static method

Checks if the first font is bigger than the second
public static IsBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool
oFont1 FontSize Font #1
oFont2 FontSize Font #2
Returns bool

IsEqualOrBiggerFont() public static method

Checks if the first font is equal to or bigger than the second
public static IsEqualOrBiggerFont ( FontSize oFont1, FontSize oFont2 ) : bool
oFont1 FontSize Font #1
oFont2 FontSize Font #2
Returns bool

LoadFromFile() public method

Loads HTML from file
public LoadFromFile ( string sFileName ) : void
sFileName string Full filename
Returns void

ParseFontSize() public static method

Parses the size param of a font tag
public static ParseFontSize ( string sSize, FontSize oCurSize ) : FontSize
sSize string String value of the size param
oCurSize FontSize
Returns FontSize

ParseNext() public method

Parses next chunk and returns it with
public ParseNext ( ) : HTMLchunk
Returns HTMLchunk

ParseNextTag() public method

Returns the next tag, or null at the end of the document; text is ignored completely
public ParseNextTag ( ) : HTMLchunk
Returns HTMLchunk

Reset() public method

Resets current parsed data to start
public Reset ( ) : void
Returns void

SetChunkHashMode() public method

Sets chunk param hash mode
public SetChunkHashMode ( bool bHashMode ) : void
bHashMode bool If true, the tag's params will be kept in the chunk's hashtable (slower); otherwise they are kept in arrays (sParams/sValues)
Returns void
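
A sketch of reading tag params in array mode, as described for bHashMode above. The `sParams` and `sValues` arrays are named on this page; the counter `iParams` and the tag name `sTag` are assumed member names and should be checked against the HTMLchunk class before use.

```csharp
using System;
using Majestic12;

class ParamArrayDemo
{
    static void Main()
    {
        // Hypothetical input used only for illustration.
        string html = "<a href=\"/home\" title=\"Home\">Home</a><img src=\"logo.png\">";

        using (HTMLparser parser = new HTMLparser(html))
        {
            // false: keep params in the sParams/sValues arrays instead of the hashtable.
            parser.SetChunkHashMode(false);

            HTMLchunk chunk;
            while ((chunk = parser.ParseNextTag()) != null)
            {
                // iParams (assumed name) is taken to be the number of valid array entries.
                for (int i = 0; i < chunk.iParams; i++)
                {
                    Console.WriteLine("{0}: {1} = {2}",
                        chunk.sTag, chunk.sParams[i], chunk.sValues[i]);
                }
            }
        }
    }
}
```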

SetEncoding() public method

Sets the current encoding, given in the format used in HTTP headers and HTML META tags
public SetEncoding ( string p_sCharSet ) : bool
p_sCharSet string
Returns bool

SetEncoding() public method

Sets encoding
public SetEncoding ( Encoding p_oEnc ) : void
p_oEnc System.Text.Encoding Encoding object
Returns void

SetRawHTML() public method

Sets oHTML variable in a chunk to the raw HTML that was parsed for that chunk.
public SetRawHTML ( HTMLchunk oChunk ) : void
oChunk HTMLchunk Chunk returned by the ParseNext function; it must belong to the same HTMLparser that was initialised with the same HTML data that this chunk belongs to
Returns void
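
The on-demand approach can be sketched like this: raw HTML is not kept for every tag, and SetRawHTML is called only for the chunks of interest. This is an untested sketch; `sTag` is an assumed HTMLchunk member name, and the lower-case comparison assumes tag names are normalised.

```csharp
using System;
using Majestic12;

class RawHtmlDemo
{
    static void Main()
    {
        // Hypothetical input used only for illustration.
        string html = "<p>See <a href=\"/docs\" class=\"ref\">the docs</a>.</p>";

        using (HTMLparser parser = new HTMLparser(html))
        {
            parser.bKeepRawHTML = false; // do not fill raw HTML for every tag chunk

            HTMLchunk chunk;
            while ((chunk = parser.ParseNextTag()) != null)
            {
                if (chunk.sTag == "a") // sTag: assumed member name
                {
                    // Fill oHTML only for this chunk, from the parser that produced it.
                    parser.SetRawHTML(chunk);
                    Console.WriteLine(chunk.oHTML);
                }
            }
        }
    }
}
```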

Property Details

bAutoExtractBetweenTagsOnly public property

If true (and either bAutoKeepComments or bAutoKeepScripts is true), then oHTML will be set to the data BETWEEN tags, excluding the tags themselves; otherwise the FULL HTML will be set, i.e. '<!-- comments -->', but if this is set to true then only ' comments ' will be returned
public bool bAutoExtractBetweenTagsOnly
Returns bool

bAutoKeepComments public property

If true (default), then the HTML for the comment tags themselves AND the content between them will be set to the oHTML variable; otherwise it will be empty, but you can always set it later
public bool bAutoKeepComments
Returns bool

bAutoKeepScripts public property

If true (default: false), then the HTML for the script tags themselves AND the content between them will be set to the oHTML variable; otherwise it will be empty, but you can always set it later
public bool bAutoKeepScripts
Returns bool

bAutoMarkClosedTagsWithParamsAsOpen public property

Long-winded name... By default, if a tag is closed BUT it has parameters, it is considered an open tag; this is not right for proper XML parsing
public bool bAutoMarkClosedTagsWithParamsAsOpen
Returns bool

bCompressWhiteSpaceBeforeTag public property

If true (default), then all whitespace before a tag starts will be compressed to a single space char (32 or 0x20); this makes the parser run a bit faster. If you need the exact whitespace before tags, set this flag to FALSE
public bool bCompressWhiteSpaceBeforeTag
Returns bool

bKeepRawHTML public property

If true (default: false), then parsed tag chunks will contain raw HTML; otherwise only comments will have it set

Performance hint: keep it as false. You can always get to the original HTML, as each chunk contains the offsets at which parsing started and finished, which allows the exact HTML that was parsed to be set

public bool bKeepRawHTML
Returns bool

bThrowExceptionOnEncodingSetFailure public property

If true, an exception will be thrown when the encoding taken from the HTML cannot be set. This can happen if the encoding was incorrect or is not supported, and it would abort processing. The default behavior is to use the Default encoding, which should keep symbols as is; they will most likely look like garbage if the encoding was not supported.
public bool bThrowExceptionOnEncodingSetFailure
Returns bool

oEnc public property

Encoding used to convert binary data into string
public System.Text.Encoding oEnc
Returns System.Text.Encoding

oHE public property

Heuristics engine used by the tag parser to quickly match known tags and attribute names. It can be disabled, or you can add more tags to it to fit your most likely cases; it is currently tuned for HTML
public Majestic12.HTMLheuristics oHE
Returns HTMLheuristics