C# Class Lucene.Net.Util.UnicodeUtil

Class to encode Java's UTF-16 char[] into UTF-8 byte[] without always allocating a new byte[], as String.getBytes("UTF-8") does.

WARNING: This API is new and experimental and may change suddenly.

Project: paulirwin/lucene.net

Public Properties

Property	Type	Description
BIG_TERM	BytesRef	A binary term consisting of a number of 0xff bytes. Not a valid UTF-8 term.

Public Methods

Method	Description
CodePointCount ( BytesRef utf8 ) : int	Returns the number of code points in this UTF-8 sequence. This method assumes valid UTF-8 input and does not perform full UTF-8 validation: it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).
NewString ( int[] codePoints, int offset, int count ) : string	Cover for the JDK 1.5 API. Creates a string from an array of code points.
ToCharArray ( int[] codePoints, int offset, int count ) : char[]	Generates a char array that represents the provided input code points.
ToHexString ( string s ) : string
UTF16toUTF8 ( CharsRef source, int offset, int length, BytesRef result ) : void	Encode characters from a char[] source, starting at offset for length chars. After encoding, result.offset will always be 0.
UTF16toUTF8 ( char s, int offset, int length, BytesRef result ) : void	Encode characters from this String, starting at offset for length characters. After encoding, result.offset will always be 0.
UTF8toUTF16 ( BytesRef bytesRef, CharsRef chars ) : void	Utility method for UTF8toUTF16(byte[], int, int, CharsRef).
UTF8toUTF16 ( byte[] utf8, int offset, int length, CharsRef chars ) : void	Interprets the given byte array as UTF-8 and converts it to UTF-16. The CharsRef will be extended if it doesn't provide enough space to hold the worst case of each byte becoming a UTF-16 code point. NOTE: Full characters are read, even if this reads past the length passed (which can result in an out-of-bounds exception, IndexOutOfRangeException in .NET, if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.
UTF8toUTF32 ( BytesRef utf8, IntsRef utf32 ) : void	This method assumes valid UTF-8 input and does not perform full UTF-8 validation: it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).
ValidUTF16String ( char s ) : bool
ValidUTF16String ( char s, int size ) : bool

Private Methods

Method	Description
UnicodeUtil ( ) : System

Method Details

CodePointCount() public static method

Returns the number of code points in this UTF8 sequence.

This method assumes valid UTF-8 input. It does not perform full UTF-8 validation: it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).

Throws an exception if an invalid code point header byte occurs or the content is prematurely truncated.

public static CodePointCount ( BytesRef utf8 ) : int
utf8	BytesRef
return	int
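The head-byte-only counting described above can be sketched as follows. This is a minimal standalone illustration, not the Lucene.Net implementation: the class `Utf8CountSketch`, the method `CountCodePoints`, and the choice of `ArgumentException` are all assumptions made for the example, and it works on a plain byte[] rather than a BytesRef.

```csharp
using System;

public static class Utf8CountSketch
{
    // Hypothetical helper: counts code points by looking only at the
    // lead byte of each sequence. Continuation bytes are skipped unchecked.
    public static int CountCodePoints(byte[] bytes, int offset, int length)
    {
        int pos = offset;
        int limit = offset + length;
        int count = 0;
        while (pos < limit)
        {
            int b = bytes[pos];
            if (b < 0x80) pos += 1;                 // 0xxxxxxx: single byte (ASCII)
            else if ((b & 0xE0) == 0xC0) pos += 2;  // 110xxxxx: 2-byte sequence
            else if ((b & 0xF0) == 0xE0) pos += 3;  // 1110xxxx: 3-byte sequence
            else if ((b & 0xF8) == 0xF0) pos += 4;  // 11110xxx: 4-byte sequence
            else throw new ArgumentException("Invalid UTF-8 lead byte at index " + pos);
            count++;
        }
        // If the last sequence claimed more bytes than were available,
        // the position has run past the limit: the input was truncated.
        if (pos > limit) throw new ArgumentException("Truncated UTF-8 sequence");
        return count;
    }
}
```

For the UTF-8 encoding of a string made of one 1-byte, one 2-byte, one 3-byte, and one 4-byte character (10 bytes in total), the sketch returns 4.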

NewString() public static method

Cover JDK 1.5 API. Create a String from an array of codePoints.

Throws an exception if an invalid code point is encountered, or if the offset or count are out of bounds.

public static NewString ( int[] codePoints, int offset, int count ) : string
codePoints	int[]	The code array
offset	int	The start of the text in the code point array
count	int	The number of code points
return	string
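As a rough illustration of building a string from a slice of a code point array, here is a minimal sketch using only the .NET base library. `CodePointStringSketch` and `FromCodePoints` are hypothetical names, and the bounds check and exception types are assumptions; the Lucene.Net method may behave differently.

```csharp
using System;
using System.Text;

public static class CodePointStringSketch
{
    // Hypothetical helper: builds a string from codePoints[offset .. offset + count).
    public static string FromCodePoints(int[] codePoints, int offset, int count)
    {
        if (offset < 0 || count < 0 || offset > codePoints.Length - count)
            throw new ArgumentOutOfRangeException(nameof(offset), "offset or count out of bounds");

        var sb = new StringBuilder(count);
        for (int i = offset; i < offset + count; i++)
        {
            // char.ConvertFromUtf32 throws ArgumentOutOfRangeException for values
            // outside U+0000..U+10FFFF or inside the surrogate range U+D800..U+DFFF.
            sb.Append(char.ConvertFromUtf32(codePoints[i]));
        }
        return sb.ToString();
    }
}
```

A code point above U+FFFF contributes two chars (a surrogate pair) to the result, so the string length can exceed count.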

ToCharArray() public static method

Generates char array that represents the provided input code points

public static ToCharArray ( int[] codePoints, int offset, int count ) : char[]
codePoints	int[]	The code array
offset	int	The start of the text in the code point array
count	int	The number of code points
return	char[]
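The conversion from code points to UTF-16 chars comes down to surrogate pair arithmetic for values above U+FFFF. The sketch below shows that arithmetic explicitly. `CodePointCharsSketch` and `ToChars` are hypothetical names, and rejecting surrogate and out-of-range values is an assumption of this example, not a documented behavior of ToCharArray.

```csharp
using System;

public static class CodePointCharsSketch
{
    // Hypothetical helper: converts codePoints[offset .. offset + count) to UTF-16 chars.
    public static char[] ToChars(int[] codePoints, int offset, int count)
    {
        var chars = new char[count * 2]; // worst case: every code point needs a surrogate pair
        int w = 0;
        for (int i = offset; i < offset + count; i++)
        {
            int cp = codePoints[i];
            if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                throw new ArgumentException("Invalid code point at index " + i);

            if (cp < 0x10000)
            {
                chars[w++] = (char)cp; // fits in a single UTF-16 code unit
            }
            else
            {
                int v = cp - 0x10000;                      // 20-bit value
                chars[w++] = (char)(0xD800 + (v >> 10));   // high surrogate: top 10 bits
                chars[w++] = (char)(0xDC00 + (v & 0x3FF)); // low surrogate: bottom 10 bits
            }
        }
        Array.Resize(ref chars, w); // trim to the number of chars actually written
        return chars;
    }
}
```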

ToHexString() public static method

public static ToHexString ( string s ) : string
s	string
return	string

UTF16toUTF8() public static method

Encode characters from a char[] source, starting at offset for length chars. After encoding, result.offset will always be 0.

public static UTF16toUTF8 ( CharsRef source, int offset, int length, BytesRef result ) : void
source	CharsRef
offset	int
length	int
result	BytesRef
return	void

UTF16toUTF8() public static method

Encode characters from this String, starting at offset for length characters. After encoding, result.offset will always be 0.

public static UTF16toUTF8 ( char s, int offset, int length, BytesRef result ) : void
s	char
offset	int
length	int
result	BytesRef
return	void
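The point of both overloads is to encode into a caller-owned, reusable buffer instead of allocating a new byte[] per call. A minimal sketch of that pattern with the .NET base library follows. `ReusableUtf8Buffer` is a hypothetical stand-in for BytesRef, and `Encoding.UTF8` is swapped in for the library's own encoder, so details such as the handling of unpaired surrogates may differ.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a reusable result buffer such as BytesRef.
public sealed class ReusableUtf8Buffer
{
    public byte[] Bytes = new byte[16];
    public int Length; // number of valid bytes, always starting at index 0

    public void Encode(char[] source, int offset, int length)
    {
        // Grow only when the worst-case encoded size does not fit.
        int maxBytes = Encoding.UTF8.GetMaxByteCount(length);
        if (Bytes.Length < maxBytes)
        {
            Bytes = new byte[maxBytes];
        }
        // Encode directly into the existing array, starting at index 0.
        Length = Encoding.UTF8.GetBytes(source, offset, length, Bytes, 0);
    }
}
```

Once the buffer has grown to fit the largest input seen so far, repeated calls with inputs of the same size or smaller reuse the same array.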

UTF8toUTF16() public static method

Utility method for UTF8toUTF16(byte[], int, int, CharsRef).

public static UTF8toUTF16 ( BytesRef bytesRef, CharsRef chars ) : void
bytesRef	BytesRef
chars	CharsRef
return	void

UTF8toUTF16() public static method

Interprets the given byte array as UTF-8 and converts it to UTF-16. The CharsRef will be extended if it doesn't provide enough space to hold the worst case of each byte becoming a UTF-16 code point.

NOTE: Full characters are read, even if this reads past the length passed (which can result in an out-of-bounds exception, IndexOutOfRangeException in .NET, if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.

public static UTF8toUTF16 ( byte[] utf8, int offset, int length, CharsRef chars ) : void
utf8	byte[]
offset	int
length	int
chars	CharsRef
return	void
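The worst-case sizing rule (one UTF-16 char per input byte) can be illustrated with a short sketch. `Utf8ToUtf16Sketch` and `Decode` are hypothetical names, and a plain char[] passed by ref stands in for CharsRef. Note one deliberate difference: `Encoding.UTF8` validates its input and substitutes U+FFFD for invalid bytes, whereas the method documented above performs no such checks.

```csharp
using System;
using System.Text;

public static class Utf8ToUtf16Sketch
{
    // Hypothetical helper: decodes utf8[offset .. offset + length) into chars,
    // growing chars if needed, and returns the number of chars written.
    public static int Decode(byte[] utf8, int offset, int length, ref char[] chars)
    {
        // Worst case: every byte is ASCII and becomes exactly one char.
        // Multi-byte sequences always produce fewer chars than bytes.
        if (chars.Length < length)
        {
            chars = new char[length];
        }
        return Encoding.UTF8.GetChars(utf8, offset, length, chars, 0);
    }
}
```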

UTF8toUTF32() public static method

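This method assumes valid UTF-8 input. It does not perform full UTF-8 validation: it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).

Throws an exception if an invalid code point header byte occurs or the content is prematurely truncated.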

public static UTF8toUTF32 ( BytesRef utf8, IntsRef utf32 ) : void
utf8	BytesRef
utf32	IntsRef
return	void
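A decoder that trusts its input in this way can be sketched as below: the lead byte decides how many bytes to consume, and continuation bytes contribute their low six bits without being validated. `Utf8ToUtf32Sketch` and `Decode` are hypothetical names, the int[] return value stands in for IntsRef, and `ArgumentException` is an assumed exception type.

```csharp
using System;

public static class Utf8ToUtf32Sketch
{
    // Hypothetical helper: decodes utf8[offset .. offset + length) to code points.
    public static int[] Decode(byte[] utf8, int offset, int length)
    {
        var result = new int[length]; // at most one code point per input byte
        int count = 0;
        int pos = offset;
        int limit = offset + length;
        while (pos < limit)
        {
            int b = utf8[pos++];
            int extra; // number of continuation bytes announced by the lead byte
            int cp;    // payload bits taken from the lead byte
            if (b < 0x80) { extra = 0; cp = b; }
            else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
            else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
            else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
            else throw new ArgumentException("Invalid UTF-8 lead byte at index " + (pos - 1));

            if (pos + extra > limit) throw new ArgumentException("Truncated UTF-8 sequence");

            for (int i = 0; i < extra; i++)
            {
                // Continuation bytes are not checked for the 10xxxxxx pattern.
                cp = (cp << 6) | (utf8[pos++] & 0x3F);
            }
            result[count++] = cp;
        }
        Array.Resize(ref result, count);
        return result;
    }
}
```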

ValidUTF16String() public static method

public static ValidUTF16String ( char s ) : bool
s	char
return	bool

ValidUTF16String() public static method

public static ValidUTF16String ( char s, int size ) : bool
s	char
size	int
return	bool
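The page gives no description for the ValidUTF16String overloads. Assuming the name means "every surrogate in the first size chars is correctly paired", such a check could look like the sketch below. `Utf16ValidationSketch` and `IsValidUtf16` are hypothetical names, and the semantics are an assumption rather than the documented behavior.

```csharp
using System;

public static class Utf16ValidationSketch
{
    // Hypothetical helper: returns true if the first `size` chars of `s`
    // contain no unpaired surrogates.
    public static bool IsValidUtf16(char[] s, int size)
    {
        for (int i = 0; i < size; i++)
        {
            char ch = s[i];
            if (char.IsHighSurrogate(ch))
            {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 >= size || !char.IsLowSurrogate(s[i + 1]))
                {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            }
            else if (char.IsLowSurrogate(ch))
            {
                // A low surrogate with no preceding high surrogate.
                return false;
            }
        }
        return true;
    }
}
```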

Property Details

BIG_TERM public static property

A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter, and definitely bigger than any UTF-8 terms.

WARNING: This is not a valid UTF-8 term.

public static BytesRef BIG_TERM
return	BytesRef
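The claim that such a term sorts after any UTF-8 term follows from the fact that the byte 0xff never occurs in valid UTF-8, so an unsigned byte-by-byte comparison decides at the first byte. The sketch below illustrates this with a hypothetical comparer. `ByteOrderSketch` and `CompareUnsigned` are assumed names, and the length of 10 used in the usage note is arbitrary, not a statement about the actual size of BIG_TERM.

```csharp
using System;

public static class ByteOrderSketch
{
    // Hypothetical helper: lexicographic comparison of two byte arrays,
    // treating each byte as an unsigned value (C# byte is already unsigned).
    public static int CompareUnsigned(byte[] a, byte[] b)
    {
        int n = Math.Min(a.Length, b.Length);
        for (int i = 0; i < n; i++)
        {
            int diff = a[i] - b[i];
            if (diff != 0)
            {
                return diff;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return a.Length - b.Length;
    }
}
```

Comparing an array of ten 0xff bytes against the UTF-8 encoding of U+10FFFF (the highest code point, whose first byte is 0xf4) yields a positive result, since 0xff is greater than 0xf4.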