C# Class Lucene.Net.Util.UnicodeUtil

Class to encode Java's UTF-16 char[] into a UTF-8 byte[] without always allocating a new byte[], as String.getBytes("UTF-8") does.

WARNING: This API is new and experimental and may change suddenly.

Project: paulirwin/lucene.net

Public Properties

Property Type Description
BIG_TERM BytesRef

Public Methods

Method Description
CodePointCount ( BytesRef utf8 ) : int

Returns the number of code points in this UTF8 sequence.

This method assumes valid UTF-8 input. It does not perform full UTF-8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).

NewString ( int[] codePoints, int offset, int count ) : string

Covers the JDK 1.5 API. Creates a string from an array of code points.

ToCharArray ( int[] codePoints, int offset, int count ) : char[]

Generates a char array that represents the provided input code points.

ToHexString ( string s ) : string
UTF16toUTF8 ( CharsRef source, int offset, int length, BytesRef result ) : void

Encode characters from a char[] source, starting at offset for length chars. After encoding, result.offset will always be 0.

UTF16toUTF8 ( char[] s, int offset, int length, BytesRef result ) : void

Encode characters from s, starting at offset for length characters. After encoding, result.offset will always be 0.

UTF8toUTF16 ( BytesRef bytesRef, CharsRef chars ) : void

Utility method for UTF8toUTF16(byte[], int, int, CharsRef) that takes its bytes, offset, and length from a BytesRef.

UTF8toUTF16 ( byte[] utf8, int offset, int length, CharsRef chars ) : void

Interprets the given byte array as UTF-8 and converts it to UTF-16. The CharsRef will be extended if it doesn't provide enough space to hold the worst case of each byte becoming a UTF-16 char.

NOTE: Full characters are read, even if this reads past the length passed (which can result in an out-of-bounds array access exception if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.

UTF8toUTF32 ( BytesRef utf8, IntsRef utf32 ) : void

This method assumes valid UTF-8 input. It does not perform full UTF-8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).

ValidUTF16String ( char s ) : bool
ValidUTF16String ( char s, int size ) : bool

Private Methods

Method Description
UnicodeUtil ( )

Method Details

CodePointCount() public static method

Returns the number of code points in this UTF8 sequence.

This method assumes valid UTF-8 input. It does not perform full UTF-8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).

Throws an exception if an invalid code point header byte occurs or the content is prematurely truncated.
public static CodePointCount ( BytesRef utf8 ) : int
utf8 BytesRef
return int
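
The head-byte counting described above can be sketched without the library. This is a minimal illustration, not the Lucene.Net implementation: the class name `Utf8CountSketch`, the method name `CountCodePoints`, and the choice of `ArgumentException` are assumptions made for the example, and the input is a plain byte[] rather than a BytesRef.

```csharp
using System;

// Hypothetical sketch: counts code points by looking only at lead bytes.
public static class Utf8CountSketch
{
    public static int CountCodePoints(byte[] utf8, int offset, int length)
    {
        int pos = offset;
        int limit = offset + length;
        int count = 0;
        while (pos < limit)
        {
            int lead = utf8[pos];
            int width;
            if (lead < 0x80) width = 1;        // 0xxxxxxx: single byte
            else if (lead < 0xC0)              // 10xxxxxx: continuation byte, not a head
                throw new ArgumentException("Invalid header byte at index " + pos);
            else if (lead < 0xE0) width = 2;   // 110xxxxx
            else if (lead < 0xF0) width = 3;   // 1110xxxx
            else if (lead < 0xF8) width = 4;   // 11110xxx
            else
                throw new ArgumentException("Invalid header byte at index " + pos);
            pos += width;                      // continuation bytes are skipped, not checked
            count++;
        }
        if (pos > limit)
            throw new ArgumentException("Content is prematurely truncated");
        return count;
    }
}
```

Because only lead bytes are inspected, malformed continuation bytes pass unnoticed, which matches the "assumes valid UTF-8 input" caveat above.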

NewString() public static method

Covers the JDK 1.5 API. Creates a string from an array of code points.
Throws an exception if an invalid code point is encountered, or if the offset or count is out of bounds.
public static NewString ( int[] codePoints, int offset, int count ) : string
codePoints int[] The code point array
offset int The start of the text in the code point array
count int The number of code points
return string
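
As an illustration of turning a slice of code points into UTF-16 text, the following sketch uses only the .NET base class library. The names `CodePointStringSketch` and `FromCodePoints` are hypothetical and the exception types are an assumption; the library's own behavior may differ.

```csharp
using System;
using System.Text;

// Hypothetical sketch: builds a string from a range of Unicode code points.
public static class CodePointStringSketch
{
    public static string FromCodePoints(int[] codePoints, int offset, int count)
    {
        if (offset < 0 || count < 0 || offset > codePoints.Length - count)
            throw new ArgumentOutOfRangeException(nameof(offset), "offset or count is out of bounds");

        var sb = new StringBuilder(count);
        for (int i = offset; i < offset + count; i++)
        {
            // char.ConvertFromUtf32 throws ArgumentOutOfRangeException for values
            // outside U+0000..U+10FFFF or inside the surrogate range.
            sb.Append(char.ConvertFromUtf32(codePoints[i]));
        }
        return sb.ToString();
    }
}
```

Calling `ToCharArray()` on the returned string gives the char[] form; supplementary code points occupy two chars (a surrogate pair) in either representation.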

ToCharArray() public static method

Generates a char array that represents the provided input code points.
public static ToCharArray ( int[] codePoints, int offset, int count ) : char[]
codePoints int[] The code point array
offset int The start of the text in the code point array
count int The number of code points
return char[]

ToHexString() public static method

public static ToHexString ( string s ) : string
s string
return string

UTF16toUTF8() public static method

Encode characters from a char[] source, starting at offset for length chars. After encoding, result.offset will always be 0.
public static UTF16toUTF8 ( CharsRef source, int offset, int length, BytesRef result ) : void
source CharsRef
offset int
length int
result BytesRef
return void

UTF16toUTF8() public static method

Encode characters from s, starting at offset for length characters. After encoding, result.offset will always be 0.
public static UTF16toUTF8 ( char[] s, int offset, int length, BytesRef result ) : void
s char[]
offset int
length int
result BytesRef
return void
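
The point of both UTF16toUTF8 overloads is to encode into a caller-owned buffer instead of allocating a new byte[] on every call. A rough sketch of that pattern follows, using `System.Text.Encoding` rather than Lucene.Net itself. `ReusableBytes` is a hypothetical stand-in for BytesRef, and its field names are assumptions.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for BytesRef: a growable buffer plus offset and length.
public sealed class ReusableBytes
{
    public byte[] Bytes = new byte[16];
    public int Offset;
    public int Length;
}

public static class Utf16ToUtf8Sketch
{
    // Encodes source[offset .. offset + length) as UTF-8 into result,
    // growing result.Bytes only when the worst case would not fit.
    public static void Encode(char[] source, int offset, int length, ReusableBytes result)
    {
        int maxBytes = Encoding.UTF8.GetMaxByteCount(length);
        if (result.Bytes.Length < maxBytes)
            result.Bytes = new byte[maxBytes];

        result.Length = Encoding.UTF8.GetBytes(source, offset, length, result.Bytes, 0);
        result.Offset = 0; // output always starts at index 0, as documented above
    }
}
```

Reusing one `ReusableBytes` instance across many calls means the buffer is reallocated only when a longer input arrives.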

UTF8toUTF16() public static method

Utility method for UTF8toUTF16(byte[], int, int, CharsRef) that takes its bytes, offset, and length from a BytesRef.
public static UTF8toUTF16 ( BytesRef bytesRef, CharsRef chars ) : void
bytesRef BytesRef
chars CharsRef
return void

UTF8toUTF16() public static method

Interprets the given byte array as UTF-8 and converts it to UTF-16. The CharsRef will be extended if it doesn't provide enough space to hold the worst case of each byte becoming a UTF-16 char.

NOTE: Full characters are read, even if this reads past the length passed (which can result in an out-of-bounds array access exception if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.

public static UTF8toUTF16 ( byte[] utf8, int offset, int length, CharsRef chars ) : void
utf8 byte[]
offset int
length int
chars CharsRef
return void
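
The "extend to the worst case, then decode" pattern can be sketched with the base class library. This is an illustration only: `ReusableChars` is a hypothetical stand-in for CharsRef, and unlike the method documented here, `System.Text.Encoding.UTF8` validates its input and substitutes U+FFFD for malformed bytes instead of reading past the given length.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for CharsRef: a growable char buffer plus offset and length.
public sealed class ReusableChars
{
    public char[] Chars = new char[8];
    public int Offset;
    public int Length;
}

public static class Utf8ToUtf16Sketch
{
    // Decodes utf8[offset .. offset + length) into chars, growing the buffer
    // up front so the decode itself never has to resize.
    public static void Decode(byte[] utf8, int offset, int length, ReusableChars chars)
    {
        int maxChars = Encoding.UTF8.GetMaxCharCount(length);
        if (chars.Chars.Length < maxChars)
            chars.Chars = new char[maxChars];

        chars.Length = Encoding.UTF8.GetChars(utf8, offset, length, chars.Chars, 0);
        chars.Offset = 0; // sketch choice: output is written from index 0
    }
}
```

Sizing for the worst case trades a little memory for a single bounds decision per call, which is the same trade-off the description above makes.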

UTF8toUTF32() public static method

This method assumes valid UTF-8 input. It does not perform full UTF-8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).

Throws an exception if an invalid code point header byte occurs or the content is prematurely truncated.
public static UTF8toUTF32 ( BytesRef utf8, IntsRef utf32 ) : void
utf8 BytesRef
utf32 IntsRef
return void
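
To make the decoding concrete, here is a small sketch that converts UTF-8 bytes to code points while trusting the continuation bytes, as the description above says. It is not the library's code: `Utf8ToUtf32Sketch` and `Decode` are hypothetical names, the result is a plain int[] rather than an IntsRef, and `ArgumentException` is an assumed exception type.

```csharp
using System;

// Hypothetical sketch: UTF-8 to code points, validating lead bytes only.
public static class Utf8ToUtf32Sketch
{
    public static int[] Decode(byte[] utf8, int offset, int length)
    {
        int[] result = new int[length]; // at most one code point per input byte
        int count = 0;
        int pos = offset;
        int limit = offset + length;
        while (pos < limit)
        {
            int b = utf8[pos++];
            int extra; // number of continuation bytes that follow the lead byte
            int cp;    // code point being assembled
            if (b < 0x80) { cp = b; extra = 0; }
            else if (b < 0xC0) throw new ArgumentException("Invalid header byte");
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
            else if (b < 0xF8) { cp = b & 0x07; extra = 3; }
            else throw new ArgumentException("Invalid header byte");

            if (pos + extra > limit)
                throw new ArgumentException("Content is prematurely truncated");

            // Each continuation byte contributes its low 6 bits; its top bits are not checked.
            for (int i = 0; i < extra; i++)
                cp = (cp << 6) | (utf8[pos++] & 0x3F);

            result[count++] = cp;
        }
        Array.Resize(ref result, count);
        return result;
    }
}
```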

ValidUTF16String() public static method

public static ValidUTF16String ( char s ) : bool
s char
return bool

ValidUTF16String() public static method

public static ValidUTF16String ( char s, int size ) : bool
s char
size int
return bool
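
This page gives no description for ValidUTF16String. Going by the name alone, a plausible reading is a check that surrogates are correctly paired; the sketch below illustrates that check under this assumption. `Utf16ValidationSketch` and `IsWellFormed` are hypothetical names, and the library's actual rules may differ.

```csharp
using System;

// Hypothetical sketch: true if the first `size` chars contain no unpaired surrogates.
public static class Utf16ValidationSketch
{
    public static bool IsWellFormed(char[] s, int size)
    {
        for (int i = 0; i < size; i++)
        {
            char ch = s[i];
            if (char.IsHighSurrogate(ch))
            {
                // A high surrogate must be followed immediately by a low surrogate.
                if (i + 1 >= size || !char.IsLowSurrogate(s[i + 1]))
                    return false;
                i++; // skip the low half of the pair
            }
            else if (char.IsLowSurrogate(ch))
            {
                // A low surrogate with no preceding high surrogate is unpaired.
                return false;
            }
        }
        return true;
    }
}
```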

Property Details

BIG_TERM public static property

A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter, and definitely bigger than any UTF-8 terms.

WARNING: this is not a valid UTF8 Term

public static BytesRef BIG_TERM
return BytesRef
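
Why a run of 0xff bytes is "definitely bigger than any UTF-8 terms" can be checked with a short sketch, assuming terms are compared byte by byte as unsigned values. The comparer below is hypothetical, and the four-byte array is an arbitrary example length, not the actual size of BIG_TERM.

```csharp
using System;

// Hypothetical sketch: unsigned lexicographic comparison of two byte arrays.
public static class BigTermSketch
{
    public static int CompareUnsigned(byte[] a, byte[] b)
    {
        int n = Math.Min(a.Length, b.Length);
        for (int i = 0; i < n; i++)
        {
            int diff = a[i] - b[i]; // byte is unsigned in C#, so this is an unsigned comparison
            if (diff != 0)
                return diff;
        }
        return a.Length - b.Length; // a shorter prefix sorts first
    }
}
```

The largest lead byte in valid UTF-8 is 0xF4 (used for code points up to U+10FFFF), so any term that begins with 0xFF compares greater at the very first byte. For the same reason such a term cannot itself be valid UTF-8, which is what the warning above states.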