SMC Monthly Report: September 2024
Events
SMC Website
SMC's new website was launched in Bengaluru. Riya Sabu volunteered to initiate the new website.
payyans, a tool to convert Malayalam and other languages written in ASCII encoding to Unicode, has been ported to Go. You can find the source code and follow development here: https://github.com/smc/payyans-go. A web version is available at https://payyans.smc.org.in
payyans was initially released in 2008 by Santhosh Thottingal and has helped in converting many documents written in ASCII Malayalam to Unicode.
Many tools have been built on top of the payyans logic. The problem is that each of those tools reimplemented the same logic again and again. The Go port provides a single standard library for payyans which can be linked from any other language using C bindings. Alen Paul Varghese and Subin Siby are working on this new port.
Last month Google released the Gemma LLM. We analysed the Malayalam tokens in the 7B model. There are 197 Malayalam tokens in it, out of 256000 tokens in total for all languages, so Malayalam makes up about 0.077% of the vocabulary. The full list of Malayalam tokens can be seen here: https://gist.github.com/santhoshtr/ec3959a5fb6dc3c5810552e03093f1f7
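For readers who want to try a similar count on another vocabulary, here is a minimal Python sketch. It is not the script used for the analysis above. It assumes that a token counts as "Malayalam" if it contains at least one code point from the Unicode Malayalam block (U+0D00 to U+0D7F), and it uses a small made-up vocabulary (`toy_vocab`) in place of the real Gemma token list. The function names are hypothetical.

```python
def is_malayalam_token(token: str) -> bool:
    """Return True if the token has at least one code point in the
    Unicode Malayalam block (U+0D00 to U+0D7F).

    Assumption: this is our own criterion for illustration; the
    analysis in the report may have used a different one.
    """
    return any("\u0d00" <= ch <= "\u0d7f" for ch in token)


def malayalam_share(vocab):
    """Return the Malayalam tokens in vocab and their percentage share."""
    malayalam = [tok for tok in vocab if is_malayalam_token(tok)]
    return malayalam, 100 * len(malayalam) / len(vocab)


# Made-up toy vocabulary standing in for a real tokenizer vocabulary.
toy_vocab = ["the", "▁മല", "യാളം", "ing", "ക", "▁data", "123", "്"]
tokens, pct = malayalam_share(toy_vocab)
print(f"{len(tokens)} of {len(toy_vocab)} toy tokens are Malayalam ({pct:.1f}%)")

# The same arithmetic with the figures quoted in the report.
reported_pct = 100 * 197 / 256000
print(f"197 of 256000 tokens is {reported_pct:.3f}%")
```

To run this against a real model, `toy_vocab` would be replaced with the list of token strings exported from that model's tokenizer.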
Meanwhile, Andrej Karpathy published a video tutorial on tokenization: "Let's build the GPT Tokenizer" (video)
In this tweet, he said:
We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
There are many experiments with large language models for Indic languages. Here is a collection of them:
Santhosh Thottingal gave a talk at Kerala University titled “മലയാളത്തിന്റെ ഡിജിറ്റൽ സൗന്ദര്യം” ("The Digital Aesthetics of Malayalam"). Slides are available here: https://santhoshtr.github.io/malayalam-digital-aesthetics/
If the talk video becomes available, we will post it in the SMC Telegram group.
Manoj Karingamadathil spoke at KKTM College, Kodungallur, as part of International Mother Language Day and the inauguration of Malayalappacha (a UGC CARE listed academic journal in Malayalam). News link.
Three presentations from the OpenDataKerala community were given as part of the Citizen Science Congress at Global Science Festival Kerala 2024:
Citizen Science Initiative Analyzing Galactic Data on the Zooniverse Platform by ATHUL R T
iNaturalist, citizen biodiversity monitoring platform by Manoj Karingamadathil
Elevating Geographic Insights: Harnessing Citizen Science Through OpenStreetMaps by AKHIL KRISHNAN
Talks on the Olam dictionary and Grandhapura were part of the Mathrubhumi International Festival of Letters. https://www.mbifl.com/schedule.html (February 9). Jisso Jose, Kailash Nadh and Shiju Alex were the speakers.
Jaisen Nedumpala conducted an introductory workshop on OpenStreetMap for government employees at ICFOSS, Thiruvananthapuram, on January 18.
GSoC has announced the list of participating organizations.
Zendalona, an organization from Kerala providing accessibility solutions for the visually impaired, is one of the participating organizations in GSoC. Nalin Sathyan and Sathyan Mash are the mentors. Nalin was a GSoC student with SMC in 2014 and a mentor in 2016.
More information: https://zendalona.com/gsoc/
The instant chat application Prav is participating in GSoC this year under the XMPP Foundation: https://wiki.xmpp.org/web/Google_Summer_of_Code_2024#Student_Proposal