Introduction to Unicode and how to type and store in Sinhala using Unicode Fonts

Harsha Wijayawardhana B.Sc., FBCS

COO & CTO of Theekshana R & D

  1. Introduction

Sinhala and Tamil were categorized as complex scripts when the technology to render them on computers was being created. Sinhala has taken a long and arduous journey to computerization; as an outcome, Sinhala script now stands shoulder to shoulder with English in its operation on computers. Sinhala can be stored alongside any other script in the same document, a list of Sinhala words can be sorted according to the Sinhala collation sequence just as words in a Latin script can, and it has become an ordinary, day-to-day occurrence for Sinhala data to be stored in a Database Management System (DBMS) like data in English. Spell checkers have already been developed for Sinhala, and grammar checkers are on the way. Theekshana R & D and the Language Technology Research Laboratory (LTRL) of the University of Colombo School of Computing (UCSC) have been working on Optical Character Recognition (OCR) and Text to Speech for more than ten years, and this research output is about to be released as commercial products.

Sri Lanka made history in 2011 by releasing the non-Latin country code top-level domains (ccTLDs) “.ලංකා” and “.இலங்கை”, which have now been active for eight years under the management of the LK Domain Registry. Last year, I also co-chaired the ICANN Sinhala Generation Panel, which released the rules for top-level domains in Sinhala. All these are great achievements for a nation, yet they are not known to the ordinary public using Sinhala on computers, and none of them would have been possible without Unicode. The objective of this article is to give an introduction to Unicode while paying attention to the 3rd revision of SLS 1134, which provides the standard for input of Sinhala using the Wijesekera keyboard.

  2. Unicode Standard or Universal Encoding Standard

Before the introduction of the Unicode Standard, numerous encoding standards, each unique to a language or group of languages, had been introduced worldwide. The American Standard Code for Information Interchange (ASCII), whose international counterpart is the ISO 646 group of standards, is a 7-bit character encoding for the basic Latin character set; it was first published in 1963 and substantially revised in 1967. Based upon ASCII, several other countries released similar standards of their own. India released the Indian Standard Code for Information Interchange, or ISCII, an 8-bit code supporting several Indian character sets such as Assamese, Bengali, Devanagari, Gujarati, Marathi, Malayalam, Tamil and Telugu. Following in the footsteps of India, Sri Lanka released the Sri Lankan Standard Code for Information Interchange, or SLASCII, in 1990. In SLASCII, the lower 128 code points encoded the Latin character set and the upper code points supported the Sinhala character set.

The world has more than fifty written scripts with different characteristics. Most European languages are written in the Latin script. Russian, which belongs to the Slavic family of languages, is written in the Cyrillic script, and both of these scripts belong to the family known as alphabetic scripts. Hebrew and Arabic, which belong to the family of Semitic languages, are written from right to left, and text in these scripts is bidirectional because numerals and embedded Latin text still run from left to right. Arabic and Hebrew are known as Abjad scripts, and Arabic, unlike the alphabetic scripts, was a challenge to computerize. Sinhala, Tamil and Hindi are written with scripts that originated from the Brahmi script and are phonetic in nature. These scripts are also known as complex scripts or Abugida scripts, in which vowel modifiers modify consonants. Some scripts in this family are very complex, such as the Tibetan script used to write Dzongkha, the official language of Bhutan. It is one of the most complex scripts in the world and can stack up to six levels, whereas Sinhala has only single-level stacking, with vowel modifiers occurring together or alone on either side of a consonant and/or above and below it.
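The way a vowel modifier attaches to a consonant can be seen by listing the code points stored for a single Sinhala syllable. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe and the choice of the syllable "ki" are illustrative assumptions, not part of any standard.

```python
import unicodedata

# The syllable "ki": the consonant KA (U+0D9A) followed by a dependent
# vowel sign (U+0DD2). Two code points are stored; the rendering engine
# draws them as a single visual unit.
syllable = "\u0D9A\u0DD2"


def describe(text):
    """Return (code point, general category, name) for each character.

    'describe' is a hypothetical helper written for this illustration.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


rows = describe(syllable)
for row in rows:
    print(row)
```

The general category separates the two roles: the consonant is a letter (a category starting with "L"), while the vowel modifier is a mark (a category starting with "M") that cannot stand on its own.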

Having texts in different languages, such as Sinhala and Tamil, on a single page was a challenge before the introduction of Unicode, the Universal Character Encoding standard. Before the advent of Unicode, texts were stored using fonts that followed the numerous encoding standards mentioned at the beginning of this section. In the early days of text rendering, when a text was written using a particular font, the font had to be shipped with the text, or a URL had to be given from which the font could be downloaded. Without the specific font with which the text was created, gibberish would be shown.
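The reason gibberish appeared can be sketched with three legacy 8-bit code pages that ship with Python. They are only stand-ins chosen for illustration (they are not Sinhala font encodings, and the helper name decode_under is hypothetical), but the effect is the same: one stored byte means a different character under each encoding.

```python
# One stored byte, interpreted under three different legacy 8-bit encodings.
raw = b"\xe0"

# Stand-in code pages for illustration: Western European, Cyrillic, Greek.
LEGACY_CODE_PAGES = ["latin-1", "cp1251", "iso8859_7"]


def decode_under(raw_bytes, encodings):
    """Decode the same bytes under each encoding (hypothetical helper)."""
    return {enc: raw_bytes.decode(enc) for enc in encodings}


decoded = decode_under(raw, LEGACY_CODE_PAGES)
for enc, text in decoded.items():
    # Print the resulting Unicode code point for each interpretation.
    print(enc, f"U+{ord(text):04X}")
```

Unless the reader knows which encoding (or, in the font-based era, which font) the writer used, the byte cannot be interpreted correctly.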

The problem of encoding all scripts together was solved in 1990 after long deliberation by the initial Unicode group, which included Joe Becker from Xerox, Mark Davis from Apple and Ken Whistler, among others. The initial solution was simple: a 16-bit character model. In this model, the first byte broadly identifies the script and the second byte identifies the character within it. For instance, in the code point 0D85, “0D” leads to the row that holds Sinhala (Sinhala occupies its upper half, 0D80 to 0DFF, and Malayalam the lower half) and “85” indicates “Ayanna” as the character. Unicode is not merely a font-based solution; it was designed as a technology in which Unicode-supporting fonts, both TrueType Fonts (TTF) and OpenType Fonts (OTF), require a rendering engine. Several rendering engines have been released for different operating systems: the rendering engine of Microsoft Windows is known as Uniscribe, whereas Linux has two, namely Pango and Qt.
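The two-byte view of a character described above can be checked directly. This is a minimal sketch; the function name split_code_point is an assumption made for illustration, and it deliberately handles only 16-bit code points.

```python
def split_code_point(ch):
    """Split a 16-bit code point into (high byte, low byte).

    Hypothetical helper; rejects characters outside the 16-bit range.
    """
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("character does not fit in 16 bits")
    return cp >> 8, cp & 0xFF


# U+0D85 is the Sinhala letter Ayanna.
high, low = split_code_point("\u0D85")
print(f"{high:02X} {low:02X}")  # prints "0D 85"
```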

Today, Unicode has become more versatile, and the standard has been expanded to include the character sets of dead languages as well. The ideographs shared by Chinese, Japanese and Korean (CJK) account for more than 50,000 encoded characters, since these are logographic scripts. The Basic Multilingual Plane (BMP), or plane 0, holds the characters of nearly all living scripts (Sinhala is in the 0D segment of the BMP), while the supplementary planes consist largely of characters that are not in use at present. For instance, a supplementary plane holds Sinhala Illakkam, a set of numerals that is not in use in modern Sinhala.
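The plane of a character follows from its code point: each plane holds 65,536 code points, so the plane number is the code point shifted right by 16 bits. The sketch below assumes U+111E1, a code point in the Sinhala Archaic Numbers block, as the example of Sinhala Illakkam; the function name plane_of is hypothetical.

```python
def plane_of(ch):
    """Return the Unicode plane number of a character (hypothetical helper)."""
    return ord(ch) >> 16


ayanna = "\u0D85"              # Sinhala letter Ayanna, in the BMP
archaic_digit = "\U000111E1"   # assumed example from Sinhala Archaic Numbers

print(plane_of(ayanna))         # plane 0, the BMP
print(plane_of(archaic_digit))  # plane 1, a supplementary plane
```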

Presently, UTF-8, UTF-16 and UTF-32 are the encoding forms in which Unicode text is stored, with UTF-8 dominating the web. New scripts and characters are added to Unicode on a regular basis. Sinhala was slow to be encoded initially, and the Unicode Consortium had to be convinced at several ISO/IEC Working Group 2 (WG2) meetings to encode Sinhala in its proper alphabetical sequence. Originally it had been proposed to take two vowels which are not found in the other languages of the Indic family out of the alphabetic sequence and move them to the bottom of the Sinhala code chart. Subsequently, the government intervened and proposed that the sequence be maintained, and Sinhala was finally encoded in 1999. The government of Sri Lanka intervened again to have hitherto forgotten numeral sets encoded in Unicode by 2013, with the Information and Communication Technology Agency of Sri Lanka (ICTA) becoming the pivotal government agency both in the encoding and in the release of the 2nd and 3rd revisions of SLS 1134. It must be emphasized that, contrary to popular belief, Sinhala Unicode is only an encoding for Sinhala characters; it is not a digital alphabet for Sinhala.
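The difference between the three encoding forms shows up in how many bytes each one uses to store the same Sinhala letter. This is a minimal sketch using Python's built-in codecs; the big-endian variants are chosen so that no byte order mark is written, and the helper name encoded_forms is an assumption for illustration.

```python
def encoded_forms(ch):
    """Encode one character in UTF-8, UTF-16 and UTF-32 (hypothetical helper)."""
    return {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}


# U+0D85, the Sinhala letter Ayanna.
forms = encoded_forms("\u0D85")
for name, data in forms.items():
    print(name, data.hex(), len(data), "bytes")
```

The same code point takes three bytes in UTF-8, two in UTF-16 and four in UTF-32; the character itself is unchanged, only its stored form differs.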

  3. Sinhala Keyboard and SLS 1134

In 1990, when SLASCII was released as the standard for storing Sinhala characters, the keyboard for Sinhala character input was also decided in the first version of SLS 1134. The committee responsible for designing SLASCII decided to use the same Wijesekera keyboard layout that had been standardized for typewriters. The layout released for computer input was known as the Wijesekera extended keyboard for Sinhala input. The second and third revisions of SLS 1134 give the Sinhala encoding, and how Sinhala characters are stored on digital devices, together with the keyboard layout. In the third revision in particular, Sinhala numerals were standardized, and input for Lith Illakkam and Sinhala Illakkam was specified. The Sinhala keyboard layout has also been designed to input Sinhala Bandi (conjunct) letters and the touching letters used to write Pali, provided the font supports them.
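How a conjunct or touching letter is stored can be sketched as a sequence of code points. The sketch below assumes the sequences described for Sinhala in the Unicode Standard, namely consonant, al-lakuna, zero width joiner, consonant for a conjunct, and consonant, zero width joiner, al-lakuna, consonant for touching letters; the text of SLS 1134 itself should be consulted for the authoritative rules. The function names are hypothetical.

```python
AL_LAKUNA = "\u0DCA"  # Sinhala sign al-lakuna (the virama)
ZWJ = "\u200D"        # zero width joiner


def conjunct(first, second):
    """Assumed stored sequence for a Bandi (conjunct) letter."""
    return first + AL_LAKUNA + ZWJ + second


def touching(first, second):
    """Assumed stored sequence for touching letters."""
    return first + ZWJ + AL_LAKUNA + second


# The conjunct of KA (U+0D9A) and SSA (U+0DC2).
ksha = conjunct("\u0D9A", "\u0DC2")
print([f"U+{ord(ch):04X}" for ch in ksha])
```

Whether the sequence is actually drawn as a single joined glyph depends on the font and the rendering engine; the stored text is the same four code points either way.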