The proposal to encode Sinhala Rakaaraanshaya and Yanshaya in Unicode

Harsha Wijayawardhana B.S. in Biochemistry (Miami), CITP (UK), FBCS (UK)

In collaboration with LK Domain Registry

1. Introduction

The Sinhala script is primarily employed for writing the Sinhala language. In Sri Lanka it is also used to write Pali (the ecclesiastical language of the Theravada Buddhist scriptures) and Sanskrit. One very distinctive feature of the Sinhala writing system is its use of numerous consonant conjuncts. A select few of these are noteworthy because they are derived from consonant modifiers and can be regarded as allographs belonging to a single phoneme in a grapho-phonemic study of Sinhala.

Accordingly, when the consonant /r/ [ර] is preceded by any other consonant with a Halant (Hal kirima), it is typically represented with Rakaaraanshaya, except in a few specific cases, such as certain consonant clusters spanning two different words. Statistics from the 10-million-word UCSC Sinhala corpus show that Rakaaraanshaya occurs in 503k words, while the same consonant cluster without Rakaaraanshaya appears in the corpus only 8k times. Furthermore, when this consonant (/r/) itself carries a Halant and is followed by another consonant, the cluster can be written either with or without Rephaya interchangeably; in the corpus such clusters occur with Rephaya around 500 times and without Rephaya 328k times. Thus, Rakaaraanshaya and Rephaya are two allographs of the consonant Rakaraya [ර] in Sinhala. Rakaaraanshaya exhibits complementary distribution, while Rephaya shows free variation, since the forms with and without Rephaya can be used interchangeably.

Like Rakaaraanshaya, Yanshaya is an allograph with complementary distribution, in this case of the phoneme /y/ [ය]. According to the corpus statistics, the corresponding consonant cluster occurs with Yanshaya 251k times and without Yanshaya in 45k words. It is therefore justified to retain Rakaaraanshaya and Yanshaya in the Sinhala writing system, as they serve as allographs with complementary distribution for specific phonemes. Their representation on computer screens should also be supported, because the non-conjunct forms are not considered correct writing style.
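The corpus figures quoted above can be turned into proportions to make the contrast between the three forms easier to see. The following is only a rough sketch: the counts are the rounded values given in the text, the names `counts` and `conjunct_share` are illustrative, and word counts and occurrence counts are treated as comparable, which is an assumption.

```python
# Rounded counts from the text: (with the conjunct form, without it).
counts = {
    "Rakaaraanshaya": (503_000, 8_000),
    "Yanshaya": (251_000, 45_000),
    "Rephaya": (500, 328_000),
}


def conjunct_share(with_form, without_form):
    """Fraction of the relevant clusters written with the conjunct form."""
    return with_form / (with_form + without_form)


for name, (with_form, without_form) in counts.items():
    share = conjunct_share(with_form, without_form)
    print(f"{name}: {share:.2%} of clusters use the conjunct form")
```

On these rounded figures, Rakaaraanshaya and Yanshaya account for the large majority of their clusters, while Rephaya is used in well under one percent of the clusters where it could appear.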

Rakaaraanshaya

ක්‍ර(Kra) = ක් ( K – Pure Consonant or without the inherent vowel) + ර (Ra)

Yanshaya

ක්‍ය (Kya) = ක් ( K – Pure Consonant or without the inherent vowel) + ය (Ya)

The above two forms are currently represented in Unicode by the following code point sequences, each of which includes the Zero Width Joiner (ZWJ).

SINHALA MODIFIER RAKAARAANSAYA; 0DCA 200D 0DBB

SINHALA MODIFIER YANSAYA; 0DCA 200D 0DBA
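As an illustration, the two sequences above can be assembled from their code points and inspected with Python's standard `unicodedata` module. This is a minimal sketch; the constant names and the `describe` helper are chosen here for illustration, and ක (U+0D9A) is used as the base consonant, as in the Kra and Kya examples.

```python
import unicodedata

AL_LAKUNA = "\u0DCA"  # Halant / Hal kirima
ZWJ = "\u200D"        # Zero Width Joiner
KA = "\u0D9A"         # ක
RA = "\u0DBB"         # ර
YA = "\u0DBA"         # ය

# The two sequences listed above.
RAKAARAANSHAYA = AL_LAKUNA + ZWJ + RA  # 0DCA 200D 0DBB
YANSHAYA = AL_LAKUNA + ZWJ + YA        # 0DCA 200D 0DBA

kra = KA + RAKAARAANSHAYA
kya = KA + YANSHAYA


def describe(text):
    """List each code point of text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in (("Kra", kra), ("Kya", kya)):
    print(label, text)
    for line in describe(text):
        print("   ", line)
```

Each conjunct is therefore stored as four code points: the base consonant, the Halant, the ZWJ, and the second consonant.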

2. Issues with Rakaaraanshaya and Yanshaya forms

In the mid-2000s, in the early days of Sinhala Unicode implementation, most online and offline applications stripped the ZWJ, so that words with Rakaaraanshaya and Yanshaya were rendered as follows:

Correct Form | Wrong Form | Meaning
ශ්‍රී ලංකා | ශ්රී ලංකා | Sri Lanka, the name of the country
සත්‍ය | සත්ය | "Truth" in Sinhala

Although most online and offline applications took steps to stop stripping the ZWJ after ICTA complained to WG2 of ISO in 2012, the problem persists in applications such as the desktop version of Facebook. On rare occasions, the ZWJ is still either stripped or not recognized when switching fonts.
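The effect of stripping the ZWJ can be reproduced in a few lines. The sketch below is only an imitation of the faulty behaviour described above, not the code of any particular application; `strip_zwj` and `code_points` are illustrative names, and the strings are written as escapes so that the ZWJ is visible in the source.

```python
ZWJ = "\u200D"


def strip_zwj(text):
    """Imitate the faulty behaviour: drop every ZWJ from the text."""
    return text.replace(ZWJ, "")


def code_points(text):
    """Return the code points of text as a space-separated hex string."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


sri_correct = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3"    # Sri, with Rakaaraanshaya
satya_correct = "\u0DC3\u0DAD\u0DCA\u200D\u0DBA"  # "truth", with Yanshaya

sri_wrong = strip_zwj(sri_correct)
satya_wrong = strip_zwj(satya_correct)

for correct, wrong in ((sri_correct, sri_wrong), (satya_correct, satya_wrong)):
    print(correct, code_points(correct), "->", wrong, code_points(wrong))
```

Once the ZWJ is gone, nothing in the stored text distinguishes the intended conjunct from an ordinary Halant followed by ර or ය, so a renderer has no basis for restoring the conjunct form.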

However, the problem was aggravated by the advent of Internationalized Domain Names (IDNs). IDNA 2003 barred labels containing ZWJ and ZWNJ, which in turn prevented Sri (ශ්‍රී) from being used in root-level or second-level labels. Although IDNA 2008 relaxed this restriction and allows ZWJ and ZWNJ, the current LGR ruleset for Sinhala does not allow ZWJ and ZWNJ in generic Top-Level Domains (gTLDs); ZWJ is permitted in second-level labels. Almost all browsers still fail to render Rakaaraanshaya and Yanshaya forms correctly in the address bar, because they still follow IDNA 2003.
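For reference, IDNA 2008 admits ZWJ only in context: the contextual rule in RFC 5892 (Appendix A.2) requires the character immediately before the ZWJ to have the Virama canonical combining class (9), which the Sinhala Halant U+0DCA has. The sketch below checks only that one rule; it is not a full IDNA 2008 validator and does not model the Sinhala LGR, and `zwj_context_ok` is an illustrative name.

```python
import unicodedata

ZWJ = "\u200D"
VIRAMA_CCC = 9  # canonical combining class "Virama"


def zwj_context_ok(label):
    """Return True if every ZWJ in label directly follows a Virama-class character."""
    for i, ch in enumerate(label):
        if ch == ZWJ:
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True


sri = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3"  # Sri: the ZWJ follows the Halant
misplaced = "\u0DC1\u200D\u0DBB"        # the ZWJ follows a plain consonant

print(zwj_context_ok(sri))        # True
print(zwj_context_ok(misplaced))  # False
```

Under this rule the ZWJ in Rakaaraanshaya and Yanshaya sequences is contextually valid, so the remaining obstacles are the LGR restriction at the top level and software that still follows IDNA 2003.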

3. Encoding of Rakaaraanshaya and Yanshaya in Unicode

Because of the issues described above, Sri Lanka has begun to propose encoding Rakaaraanshaya and Yanshaya in Unicode, with backward compatibility for text that already exists on the Internet. Once the two forms are encoded, the NFD and NFC forms will coexist in font rules. There is precedent for this approach: Myanmar encoded four dependent consonant signs, and Malayalam more recently encoded Chillu letters.

The following is the plan for encoding and the implementation:

  1. Sinhala Yanshaya and Rakaaraanshaya will be encoded in the Basic Multilingual Plane (BMP).
  2. The above two forms will occupy 0DE1 to 0DE3.
  3. Yanshaya will have the code point 0DE1, and 0DE0 will be reserved.
  4. 0DE2 will be reserved for Rephaya if Sri Lanka decides to encode it in the future.
  5. 0DE3 will be allocated for the Rakaaraanshaya.
  6. After encoding, the two keys that currently represent Yanshaya and Rakaaraanshaya will, when pressed, store the new code points (NFC form) instead of the longer code point sequences with the ZWJ (NFD form).
  7. Pressing the Conjunct key followed by the key for either of the above two forms will store the older code point sequence with the ZWJ.
  8. All fonts must carry both sets of rules: rules for rendering the two newly encoded forms, and the older ZWJ-based rules for backward compatibility.
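The relationship between the existing ZWJ sequences and the proposed code points in the plan above can be sketched as a simple two-way mapping. This is purely hypothetical: 0DE1 and 0DE3 are proposed values that are not assigned in Unicode, `unicodedata.normalize` does not perform this conversion, and the function names `to_proposed` and `to_legacy` are invented for illustration.

```python
ZWJ = "\u200D"
AL_LAKUNA = "\u0DCA"  # Halant / Hal kirima

# PROPOSED code points from the plan above; not assigned in Unicode.
PROPOSED_YANSHAYA = "\u0DE1"
PROPOSED_RAKAARAANSHAYA = "\u0DE3"

LEGACY_TO_PROPOSED = {
    AL_LAKUNA + ZWJ + "\u0DBA": PROPOSED_YANSHAYA,        # 0DCA 200D 0DBA
    AL_LAKUNA + ZWJ + "\u0DBB": PROPOSED_RAKAARAANSHAYA,  # 0DCA 200D 0DBB
}
PROPOSED_TO_LEGACY = {new: old for old, new in LEGACY_TO_PROPOSED.items()}


def to_proposed(text):
    """Replace each ZWJ sequence with its proposed single code point."""
    for legacy, proposed in LEGACY_TO_PROPOSED.items():
        text = text.replace(legacy, proposed)
    return text


def to_legacy(text):
    """Expand each proposed code point back into its ZWJ sequence."""
    for proposed, legacy in PROPOSED_TO_LEGACY.items():
        text = text.replace(proposed, legacy)
    return text


sri_legacy = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3"  # five code points today
sri_proposed = to_proposed(sri_legacy)         # three code points if encoded
print(len(sri_legacy), len(sri_proposed))
```

Because the mapping is one-to-one in both directions, existing text could be converted to the shorter form and back without loss, which is what the backward-compatibility requirement on fonts and keyboards relies on.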

As mentioned earlier, the non-conjunct forms of other consonant conjuncts, including Rephaya, are considered acceptable in the Sinhala writing system. Even if the ZWJ is stripped, they do not appear awkward.