SMC Monthly Newsletter: February 2024

payyans as a C library

payyans, a tool that converts Malayalam and other languages written in ASCII encoding to Unicode, has been ported to Go. The source code and development are here: https://github.com/smc/payyans-go. A web version is available at https://payyans.smc.org.in

payyans was initially released in 2008 by Santhosh Thottingal and has helped in converting many documents written in ASCII Malayalam to Unicode.

Many tools have been built on top of the payyans logic, but each of them reimplemented the same logic from scratch. The Go port provides a single standard library for payyans that can be linked from any other language using C bindings. Alen Paul Varghese and Subin Siby are working on this new port.
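To give a feel for the kind of logic that gets reimplemented, here is a minimal Python sketch of an ASCII-font to Unicode conversion. It is not the payyans code: the mapping table TOY_MAP and the function name ascii_to_unicode are hypothetical and exist only for illustration. The sketch assumes two steps: a longest-match table lookup, and moving pre-base vowel signs (which legacy fonts store before the consonant, in visual order) to after the consonant, as Unicode expects.

```python
# Hypothetical toy mapping from legacy-font ASCII codes to Unicode Malayalam.
# Illustrative only; this is not the rule set that payyans ships.
TOY_MAP = {
    "a": "\u0d2e",  # MA
    "e": "\u0d32",  # LA
    "b": "\u0d2f",  # YA
    "f": "\u0d33",  # LLA
    "m": "\u0d3e",  # vowel sign AA
    "w": "\u0d02",  # anusvara
    "s": "\u0d46",  # vowel sign E (pre-base)
}

# Vowel signs E, EE and AI are drawn to the left of the consonant, so
# legacy fonts store them first. Unicode stores them after the consonant.
PREBASE_SIGNS = {"\u0d46", "\u0d47", "\u0d48"}


def ascii_to_unicode(text, mapping=TOY_MAP):
    """Convert legacy ASCII-encoded text to Unicode using a mapping table."""
    keys = sorted(mapping, key=len, reverse=True)  # try longest keys first
    out = []
    pending = ""  # pre-base signs waiting for the next character
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                char = mapping[key]
                i += len(key)
                break
        else:
            char = text[i]  # unmapped characters pass through unchanged
            i += 1
        if char in PREBASE_SIGNS:
            pending += char
        else:
            out.append(char + pending)  # the sign now follows its base
            pending = ""
    out.append(pending)  # keep any sign left without a base
    return "".join(out)


print(ascii_to_unicode("aebmfw"))  # six mapped characters, no reordering
print(ascii_to_unicode("sa"))      # vowel sign E moves after MA
```

A real converter needs a complete map per font plus handling for conjuncts and split vowel signs, which is the part a single shared library saves every tool from rewriting.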

Malayalam in Google Gemma LLM model

Last month, Google released the Gemma LLM. We analysed the Malayalam tokens in the 7B model. Of the 256,000 tokens in its vocabulary, which covers all languages, 197 are Malayalam. That is about 0.077% of the vocabulary. The full list of Malayalam tokens can be seen here: https://gist.github.com/santhoshtr/ec3959a5fb6dc3c5810552e03093f1f7
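The arithmetic behind that share, and one way such a count could be made, can be sketched in Python. This is an assumption about the method, not a description of how our analysis was done: here a token counts as Malayalam if it contains at least one character from the Malayalam Unicode block (U+0D00 to U+0D7F), and the sample vocabulary is made up.

```python
def is_malayalam(token):
    """True if the token contains any character in the Malayalam block."""
    return any("\u0d00" <= ch <= "\u0d7f" for ch in token)


def malayalam_share(vocab):
    """Return (count, percentage) of Malayalam tokens in a vocabulary."""
    count = sum(1 for token in vocab if is_malayalam(token))
    return count, 100 * count / len(vocab)


# Made-up sample vocabulary, only to exercise the functions.
sample = ["hello", "മല", "▁the", "ം", "नमस्ते"]
print(malayalam_share(sample))  # (2, 40.0)

# The figures reported for the Gemma 7B vocabulary.
print(f"{100 * 197 / 256000:.3f}%")  # 0.077%
```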

Meanwhile, Andrej Karpathy published a video tutorial on tokenization: "Let's build the GPT Tokenizer" (video).

In a tweet about the video, he said:

We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.

Indic language LLMs

There are many experiments with large language models for Indic languages. Here is a collection of them:

Events

Kerala University

Santhosh Thottingal gave a talk at Kerala University titled “മലയാളത്തിന്റെ ഡിജിറ്റൽ സൗന്ദര്യം” (roughly, "The Digital Beauty of Malayalam"). Slides are available here: https://santhoshtr.github.io/malayalam-digital-aesthetics/

If the talk video becomes available, we will post it in the SMC Telegram group.

International Mother Language Day

Manoj Karingamadathil spoke at KKTM College, Kodungallur, as part of International Mother Language Day and the inauguration of Malayalappacha (a UGC-CARE listed academic journal in Malayalam). News link.

Global Science Festival Kerala

Three presentations from the OpenDataKerala community were given as part of the Citizen Science Congress at Global Science Festival Kerala 2024:

Citizen Science Initiative Analyzing Galactic Data on the Zooniverse Platform by Athul R T

iNaturalist, citizen biodiversity monitoring platform by Manoj Karingamadathil

Elevating Geographic Insights: Harnessing Citizen Science Through OpenStreetMaps by Akhil Krishnan

MBIFL

Talks on the Olam dictionary and Grandhapura were part of the Mathrubhumi International Festival of Letters: https://www.mbifl.com/schedule.html (February 9). Jisso Jose, Kailash Nadh and Shiju Alex were the speakers.

Workshop on OpenStreetMap at ICFOSS

Jaisen Nedumpala conducted an introductory workshop on OpenStreetMap for government employees at ICFOSS, Thiruvananthapuram, on January 18.

Google Summer of Code

GSoC has announced the list of participating organizations.

Zendalona

Zendalona, an organization from Kerala providing accessibility solutions for the visually impaired, is one of the participating organizations in GSoC. Nalin Sathyan and Sathyan Mash are the mentors. Nalin was a GSoC student with SMC in 2014 and a mentor in 2016.

More information: https://zendalona.com/gsoc/

Prav

The instant chat application Prav is participating in GSoC this year under the XMPP Foundation: https://wiki.xmpp.org/web/Google_Summer_of_Code_2024#Student_Proposal

Other