October 21, 2014

Malayalam opentype specification – part 1

Post by Rajeesh Nambiar. Crossposted from his blog

This post is the promised follow-up from last November, documenting the intricacies of the OpenType specification for Indic languages, specifically Malayalam. There is an initiative to document similar details in the IndicFontbook, and this series might make its way into it. A Malayalam Unicode font supporting traditional orthography is required to correctly display most of the examples in this article; some can be obtained from here.

Malayalam has a complex script, which in general means that the shape and position of a glyph are determined in relation to the surrounding glyphs. For example, a single glyph can be formed from a combination of independent glyphs in a specific sequence, forming a conjunct: ക + ്‌ ‌+ ത + ്‌ + ര => ക്ത്ര in traditional orthography. Note that in almost all cases, a change in glyph shaping and positioning such as this one is due to the Virama diacritic “ ്‌ ”. The important rules of glyph formation are:

  1. When Virama is used to combine two Consonants, it usually forms a Conjunct, such as ക + ്‌ ‌+ ത => ക്ത. This is known as C₁ conjoining, as a half form of the first consonant is joined with the second consonant.
  2. The notable exceptions to point 1 occur when the following Consonant is one of യ, ര, ല, വ. In those cases, they take the ‘Mark’ shapes of യ, ര, ല, വ =>  ്യ, ്ര,  ്ല,  ്വ. This is known as C₂ conjoining, as a modified form of the second consonant is attached to the first consonant.
  3. When Virama is used to combine a Consonant with a Vowel, the Vowel forms a Vowel Mark, such as ാ, ി, ീ.
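
The first two rules can be sketched in a few lines of Python. This is only an illustration of the rules as stated above, not how a shaping engine actually works: the helper names `is_consonant` and `conjoining_kind` are hypothetical, the consonant range check is a simplifying assumption, and exceptions such as the doubled യ്യ are ignored.

```python
import unicodedata

VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA
# Consonants that take a 'Mark' shape when they follow a Virama (rule 2).
C2_CONSONANTS = {"\u0d2f", "\u0d30", "\u0d32", "\u0d35"}  # YA, RA, LA, VA


def is_consonant(ch):
    # Simplifying assumption: treat U+0D15 (KA) .. U+0D39 (HA) as the consonants.
    return "\u0d15" <= ch <= "\u0d39"


def conjoining_kind(first, second):
    """Classify first + Virama + second as 'C1' or 'C2' (hypothetical helper)."""
    if not (is_consonant(first) and is_consonant(second)):
        raise ValueError("both arguments must be Malayalam consonants")
    return "C2" if second in C2_CONSONANTS else "C1"


if __name__ == "__main__":
    ka, ta, ra = "\u0d15", "\u0d24", "\u0d30"
    # The conjunct from the example above is stored as five code points.
    for ch in ka + VIRAMA + ta + VIRAMA + ra:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    print(conjoining_kind(ka, ta))  # KA + Virama + TA
    print(conjoining_kind(ta, ra))  # TA + Virama + RA
```

Writing the characters as escape sequences keeps the example independent of the editor's font and of any invisible joiner characters.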

OpenType organizes this glyph forming and shaping logic as a sequence of ‘Lookup tables (or rules)’ to be defined in the font. This first part gives an overview of the relevant lookup rules used for glyph processing by a shaping engine such as HarfBuzz or Uniscribe.

Only the OpenType features applicable to Malayalam are discussed. The features (or lookups) are applied in the following order:

  1. akhn (Akhand – used for conjuncts like ക്ക, ക്ഷ, ല്ക്ക, യ്യ, വ്വ, ല്ല etc)
  2. pref (Pre-base form – used for pre base form of Ra –  ്‌ + ര =   ്ര)
  3. blwf (Below base form – used for below base form of La – virama+La – ്‌ + ല =  ്ല)
  4. half (Half form – not used in the mlm2 spec by Rachana and Meera, but used in the mlym spec and might be useful later. Ignore it for now)
  5. pstf (Post base form – used for post base forms of Ya and Va – ്‌ +യ =  ്യ, ്‌ + വ = ്വ. Note that  യ്യ & വ്വ are under akhn rule)
  6. pres (Pre-base substitution – mostly used for ligatures involving pref Ra – like ക്ര, പ്ര, ക്ത്ര, ഗ്ദ്ധ്ര  etc)
  7. blws (Below base substitution – used for ligatures involving blwf La – like ക്ല, പ്ല, ത്സ്ല etc. Note that  ല്ല is under akhn rule)
  8. psts (Post base substitution – used for ligatures involving post base Matras – like കു, ക്കൂ, മൃ etc)
  9. abvm (Above base Mark  positioning – used for dot Reph – ൎ)

The last three features (pres, blws, psts) are presentation forms; they have lower priority in glyph formation and usually produce the large number of secondary glyphs. The final one (abvm) is not a GSUB (glyph substitution) lookup but a GPOS (glyph positioning) lookup; it is used to position the dot reph correctly above the glyphs.
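
As a rough aid, the ordering above can be written down as a small Python table. This is just a sketch of the list in this post, not something read from a font or a shaping engine; the role labels (`"basic"`, `"presentation"`, `"positioning"`) and the helper names are assumptions made for illustration.

```python
# (feature tag, OpenType table, role) in the order the features are listed above.
FEATURES = [
    ("akhn", "GSUB", "basic"),
    ("pref", "GSUB", "basic"),
    ("blwf", "GSUB", "basic"),
    ("half", "GSUB", "basic"),
    ("pstf", "GSUB", "basic"),
    ("pres", "GSUB", "presentation"),
    ("blws", "GSUB", "presentation"),
    ("psts", "GSUB", "presentation"),
    ("abvm", "GPOS", "positioning"),
]


def applied_before(tag_a, tag_b):
    """Return True if tag_a comes earlier than tag_b in the list above."""
    order = [tag for tag, _, _ in FEATURES]
    return order.index(tag_a) < order.index(tag_b)


def tags_with_role(role):
    """Return the feature tags that carry the given (illustrative) role label."""
    return [tag for tag, _, r in FEATURES if r == role]


if __name__ == "__main__":
    print(applied_before("akhn", "pres"))
    print(tags_with_role("presentation"))
    print([tag for tag, table, _ in FEATURES if table == "GPOS"])
```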

  • akhn: Use this for conjuncts (കൂട്ടക്ഷരങ്ങള്‍) like ക്ക, ട്ട, ണ്ണ, ക്ഷ, യ്യ, വ്വ, ല്ല, മ്പ. This rule has the highest priority, so akhn glyphs won’t be broken by the shaping engine.
  • pref: Used only for pre-base form of Ra ര –  ്ര
  • blwf: Used only for below base form of La ല –  ്ല
  • pstf: Used for the post base forms of Ya, Va യ, വ – ്യ, ്വ
  • pres: One of the presentation forms, mostly used for ligatures/glyphs with pref Ra ര – like ക്ര, പ്ര, ക്ത്ര, ഗ്ദ്ധ്ര etc. This could also be used together with the ‘half’ forms in certain situations, but that is for later.
  • blws: Used for ligatures/glyphs with blwf La ല – like ക്ല, പ്ല, ത്സ്ല etc.
  • psts: Used by a large number of ligatures/glyphs due to the post base Matras (ു,ൂ,ൃ etc) – like  കു, ക്കൂ, മൃ etc. Other Matras (ാ,ി,ീ,െ,േ,ൈ,ൊ,ോ,ൌ,ൗ) are implicitly handled by the shaping engine based on their Unicode properties (pre-base, post-base etc), as they don’t form a different glyph together with a consonant, so there is no need to define lookup rules for those matras in the font.
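
The split between the two groups of matras can be illustrated with a short Python sketch. It only restates the grouping described in the psts item: the list names and `needs_psts_lookup` are hypothetical, the psts group holds just the three matras named explicitly (the text says "etc", so it is likely incomplete), and placing U+0D46 in the implicit group is an assumption.

```python
import unicodedata

# Matras named above as forming ligatures with the base glyph (psts lookups).
PSTS_MATRAS = ["\u0d41", "\u0d42", "\u0d43"]  # U, UU, VOCALIC R

# Matras described above as handled by the shaping engine without font lookups.
IMPLICIT_MATRAS = [
    "\u0d3e", "\u0d3f", "\u0d40",  # AA, I, II
    "\u0d46",                      # E (assumed to belong with this group)
    "\u0d47", "\u0d48",            # EE, AI
    "\u0d4a", "\u0d4b", "\u0d4c",  # O, OO, AU
    "\u0d57",                      # AU LENGTH MARK
]


def needs_psts_lookup(matra):
    """Hypothetical helper: does the font need a psts rule for this matra?"""
    return matra in PSTS_MATRAS


if __name__ == "__main__":
    for ch in PSTS_MATRAS + IMPLICIT_MATRAS:
        label = "psts lookup" if needs_psts_lookup(ch) else "implicit"
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  ->  {label}")
```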

I will discuss these lookup rules, and how they fit into the glyph shaping sequence, with detailed examples in the next episodes.