C# Class Lucene.Net.Analysis.Ckb.SoraniNormalizer

Normalizes the Unicode representation of Sorani text.

Normalization consists of:

  • Alternate forms of 'y' (064A, 0649) are converted to 06CC (FARSI YEH)
  • Alternate form of 'k' (0643) is converted to 06A9 (KEHEH)
  • Alternate forms of vowel 'e' (0647+200C, word-final 0647, 0629) are converted to 06D5 (AE)
  • Alternate (joining) form of 'h' (06BE) is converted to 0647
  • Alternate forms of 'rr' (0692, word-initial 0631) are converted to 0695 (REH WITH SMALL V BELOW)
  • Harakat, tatweel, and formatting characters such as directional controls are removed.
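
The rules above can be sketched as a single pass over the buffer. This is an illustrative re-implementation written from the list, not the library's source: the class name `SoraniNormalizationSketch` and the `Delete` helper are hypothetical, and it assumes the buffer holds one word, so "word-initial" and "word-final" mean the first and last positions of the buffer.

```csharp
using System;
using System.Globalization;

// Illustrative sketch of the listed rules; names here are hypothetical.
public static class SoraniNormalizationSketch
{
    // Removes the char at pos by shifting the tail left; returns the new length.
    private static int Delete(char[] s, int pos, int len)
    {
        if (pos < len - 1)
        {
            Array.Copy(s, pos + 1, s, pos, len - pos - 1);
        }
        return len - 1;
    }

    // Rewrites s in place and returns the number of valid chars afterward.
    public static int Normalize(char[] s, int len)
    {
        for (int i = 0; i < len; i++)
        {
            switch (s[i])
            {
                case '\u064A': // YEH
                case '\u0649': // ALEF MAKSURA (dotless yeh)
                    s[i] = '\u06CC'; // FARSI YEH
                    break;
                case '\u0643': // KAF
                    s[i] = '\u06A9'; // KEHEH
                    break;
                case '\u200C': // ZWNJ: a preceding HEH becomes AE, the ZWNJ is dropped
                    if (i > 0 && s[i - 1] == '\u0647') s[i - 1] = '\u06D5';
                    len = Delete(s, i, len);
                    i--;
                    break;
                case '\u0647': // HEH: only the buffer-final one becomes AE
                    if (i == len - 1) s[i] = '\u06D5';
                    break;
                case '\u0629': // TEH MARBUTA
                    s[i] = '\u06D5'; // AE
                    break;
                case '\u06BE': // HEH DOACHASHMEE (joining form of 'h')
                    s[i] = '\u0647'; // HEH
                    break;
                case '\u0631': // REH: only the buffer-initial one becomes 0695
                    if (i == 0) s[i] = '\u0695';
                    break;
                case '\u0692': // REH WITH SMALL V
                    s[i] = '\u0695'; // REH WITH SMALL V BELOW
                    break;
                default:
                    // Tatweel (0640), harakat (064B..0652), and format characters are removed.
                    bool isTatweel = s[i] == '\u0640';
                    bool isHarakat = s[i] >= '\u064B' && s[i] <= '\u0652';
                    bool isFormat = CharUnicodeInfo.GetUnicodeCategory(s[i]) == UnicodeCategory.Format;
                    if (isTatweel || isHarakat || isFormat)
                    {
                        len = Delete(s, i, len);
                        i--;
                    }
                    break;
            }
        }
        return len;
    }
}
```

Deleting in place and stepping the index back keeps the pass single and allocation-free, at the cost of shifting the tail on every removal; for word-sized buffers that cost is negligible.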

Project: apache/lucenenet

Public Methods

Method                                   Description
normalize ( char[] s, int len ) : int    Normalize an input buffer of Sorani text

Method Details

normalize() public method

Normalize an input buffer of Sorani text
public normalize ( char[] s, int len ) : int
s       char[]  input buffer
len     int     length of input buffer
return  int     length of the buffer after normalization
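
Because the buffer is rewritten in place and may shrink, a caller should read only the first `return` chars afterward. A minimal sketch of that calling pattern, assuming an array-based `(char[], int) -> int` signature; `NormalizeToString` and the stand-in `StripTatweel` are hypothetical helpers for illustration, not part of Lucene.Net:

```csharp
using System;

public static class NormalizerUsageSketch
{
    // Hypothetical helper: runs any (buffer, length) -> newLength normalizer
    // over a copy of the text and keeps only the chars reported as valid.
    public static string NormalizeToString(Func<char[], int, int> normalize, string text)
    {
        char[] buffer = text.ToCharArray();
        int newLength = normalize(buffer, buffer.Length);
        // Chars at index newLength and beyond are leftovers and must be ignored.
        return new string(buffer, 0, newLength);
    }

    // Stand-in normalizer for illustration only: strips TATWEEL (U+0640) in place.
    public static int StripTatweel(char[] s, int len)
    {
        int write = 0;
        for (int read = 0; read < len; read++)
        {
            if (s[read] != '\u0640') s[write++] = s[read];
        }
        return write;
    }
}
```

With the library available, the normalizer's own method could be passed where `StripTatweel` appears, for example `NormalizeToString(NormalizerUsageSketch.StripTatweel, text)` becomes a call that hands over the instance method instead.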