March 2, 2019

Malayalam Corpus

On this Open Data Day, we are starting a project to build a freely licensed corpus of Malayalam content to support research in Malayalam computing.

The corpus, available at https://gitlab.com/smc/corpus/, is licensed under Creative Commons Attribution-ShareAlike and is available for anybody to use and improve.

You can help us to enhance this collection of content in different ways:

  1. If you know of or have a text collection with a compatible license (CC BY-SA), we can add it to this collection. Just create an issue and let us know about it, and we will help. We are looking for content on diverse topics.
  2. We are also collecting person names, place names, etc. in Malayalam. You can see the existing words by browsing the words folder. If you would like to expand that collection, create an issue with details or create a merge request.

Make sure to respect the copyright of the content. We are trying to provide a corpus of freely licensed content.

We plan to add some scripts and tools to work with the content as well. Currently only text data is present, but in the future we might extend this to handwriting images and voice samples too.
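
As an illustration of the kind of small helper that could be useful when preparing a word list for a merge request, here is a minimal sketch in Python. It is not part of the corpus repository; the function names `is_malayalam_word` and `clean_word_list` are hypothetical, and the sketch assumes a word list is simply one word per line. It keeps only entries written in Malayalam script (the Unicode block U+0D00 to U+0D7F, plus the zero-width joiner and non-joiner) and drops duplicates.

```python
import unicodedata

# Malayalam Unicode block: U+0D00 to U+0D7F (letters, signs, and digits).
MALAYALAM_START = 0x0D00
MALAYALAM_END = 0x0D7F
# Zero-width non-joiner and joiner, which appear in some Malayalam spellings.
JOINERS = {"\u200c", "\u200d"}


def is_malayalam_word(word):
    """Hypothetical helper: True if the word contains only Malayalam-block
    characters or joiners, and at least one Malayalam-block character."""
    has_malayalam = False
    for ch in word:
        if MALAYALAM_START <= ord(ch) <= MALAYALAM_END:
            has_malayalam = True
        elif ch not in JOINERS:
            return False
    return has_malayalam


def clean_word_list(lines):
    """Hypothetical helper: strip whitespace, normalize to NFC, keep only
    Malayalam-script entries, and remove duplicates while preserving order."""
    seen = set()
    cleaned = []
    for line in lines:
        word = unicodedata.normalize("NFC", line.strip())
        if is_malayalam_word(word) and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned


if __name__ == "__main__":
    # Assumed input format: one candidate word per line.
    candidates = ["കേരളം", " കൊച്ചി ", "Kochi", "കേരളം", ""]
    for word in clean_word_list(candidates):
        print(word)
```

A check like this only looks at the script of each entry. It cannot tell whether a word is a real name or whether its source is compatibly licensed, so those still need a human review.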