SMC Monthly Newsletter: May 2024
Tributes to Dr. V. Sasikumar
Swathanthra Malayalam Computing mourns the demise of Dr. V. Sasikumar, a free software activist and science writer.
Blog post: https://blog.smc.org.in/obitury-dr-sasikumar/
Dr. Sasikumar was a pioneer of the free software movement in Kerala. His friends fondly called him 'Kerala Stallman'. From January 1979 to December 2007 he was a scientist at the National Center for Earth Science Studies, where his research covered lightning and the sea coast of Kerala. After resigning from the job in 2007, he immersed himself fully in the mission of spreading free software. He was the first director of the Free Software Foundation of India. He wrote many books on the earth, including ഭൂമിയുടെ ആവരണം, ആകാശത്തിലെ അത്ഭുതക്കാഴ്ചകൾ and മിന്നലും ഇടിയും. He shared his thoughts and ideas through the blogs Free as in Freedom, ചിതറിയ ചിന്തകൾ and കേരളചിന്തകള്. In his 2007 article, The Story of Free Software in Kerala, India, he describes the early days of free software in India, and in Kerala in particular.
Dr. Sasikumar was always a mentor and friend of Swathanthra Malayalam Computing.
Text Normalization and Indic languages
Kavya Manohar writes about how the normalization routine in the popular ASR engine Whisper removes essential characters, such as vowel signs, from Indian-language text when evaluating performance.
Text Normalization in natural language processing (NLP) refers to the conversion of different written forms of text to one standardised form. The definition of the standard form depends largely on the problem at hand.
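As a rough illustration of the kind of normalization at issue, the Python sketch below removes every Unicode combining mark from a string. It is a simplified stand-in written for this newsletter, not Whisper's actual routine, and the function name `strip_marks` is hypothetical. Malayalam vowel signs and the virama are classified as marks in Unicode, so they disappear along with Latin accents.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop every character whose Unicode general category starts with 'M'.

    Simplified stand-in for an aggressive normalizer (an assumption made
    for illustration, not Whisper's code).
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


# Latin text: only the accent is lost.
print(strip_marks("café"))
# Malayalam text: the vowel sign is lost, so the written word itself changes.
print(strip_marks("മണി"))
```

Removing accents is often an acceptable standard form for English-style scoring. For Malayalam the same rule deletes part of the spelling.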
She also explains the implications.
During the Whisper fine-tuning event hosted by Hugging Face in December 2022, researchers and practitioners worldwide collaborated to enhance the basic Whisper model using target language datasets, aiming to elevate the speech recognition system for all languages.
The winning models in various Indian languages showcased an unbelievably low word error rate (WER), with figures like 8% for Tamil, 11.49% for Malayalam, 10.05% for Hindi, and 11.11% for Bengali. The leaderboard’s promising results initially led the speech research community to believe that we had reached the peak of what is possible.
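To see how normalization can deflate such figures, here is a minimal Python sketch of WER as word-level edit distance divided by the number of reference words. The helper names `wer` and `drop_marks`, and the two short Malayalam strings, are made up for illustration. `drop_marks` is an assumed mark-stripping normalizer, not the one used on the leaderboard.

```python
import unicodedata


def drop_marks(text: str) -> str:
    """Assumed normalizer: remove all Unicode combining marks."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / len(ref)


reference = "മണി കെട്ടും"
hypothesis = "മണു കട്ടി"  # hypothetical output with the wrong vowel signs

print(wer(reference, hypothesis))
print(wer(drop_marks(reference), drop_marks(hypothesis)))
```

Both words in the hypothesis are wrong, yet once the vowel signs are stripped from both sides the two strings become identical and the error vanishes from the score.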
Organization Update
Although we have been a registered organization under the Travancore-Cochin Literary, Scientific, and Charitable Societies Registration Act since 2010, our offline activities have not maintained their previous momentum. To reignite engagement, we are conducting an internal audit of our financial spending to ensure compliance. This will help us restart activities associated with our official organization.
Type Design Competition by Rachana Institute of Typography
Rachana Institute of Typography, in association with KaChaTaThaPa Foundation and Sayahna Foundation, is launching a Malayalam font design competition for students, professionals, and amateur designers.
Timelines, regulations, prizes and more details are available at the URLs below:
English: https://sayahna.net/fcomp-en
Malayalam: https://sayahna.net/fcomp-ml
Large Language Models
OpenAI has released the GPT-4o model, which comes with a larger vocabulary. Its tokenizer, tiktoken, can be used to analyse how Malayalam text is tokenized. Santhosh Thottingal prepared a Python notebook that counts the tokens produced by various OpenAI models.
The following is the Malayalam tokenization for the GPT-4o model:
Example string: "പൂച്ചയ്ക്കാരു മണികെട്ടും?"
o200k_base: 11 tokens
token integers: [6263, 7929, 19997, 61818, 9899, 1591, 75509, 5925, 33267, 5990, 30]
tokens: ['പ', 'ൂ', 'ച്ച', 'യ്ക്ക', 'ാര', 'ു', ' മണ', 'ിക', 'െട്ട', 'ും', '?']
Compared to previous OpenAI models, the number of tokens has reduced. However, the tokens do not make any sense from a distributional semantics perspective. It is unknown whether large-scale data helps the model figure out words and relations from these tokens.
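One rough way to see the problem, using only the token strings listed above and Python's standard library, is to count how many tokens begin with a dependent sign (a Unicode combining mark) and are therefore cut off from the consonant they belong to. This is an illustrative check, not part of the notebook mentioned above, and the helper name `starts_with_mark` is hypothetical.

```python
import unicodedata

# Token strings copied from the o200k_base listing above.
tokens = ['പ', 'ൂ', 'ച്ച', 'യ്ക്ക', 'ാര', 'ു', ' മണ', 'ിക', 'െട്ട', 'ും', '?']


def starts_with_mark(token: str) -> bool:
    """True if the first character is a combining mark, such as a vowel sign."""
    return unicodedata.category(token[0]).startswith("M")


orphaned = [t for t in tokens if starts_with_mark(t)]
print(f"{len(orphaned)} of {len(tokens)} tokens start with a dependent sign")
```

In this listing, six of the eleven tokens start with a vowel sign, so the token boundaries fall inside written syllables and not between words or morphemes.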