March 11, 2019

Malayalam Named Entity Recognition using morphology analyser

A detailed note by Santhosh Thottingal.

Named Entity Recognition (NER), the task of identifying and classifying real-world objects such as persons, places, and organizations in a given text, is a well-known NLP problem. For Malayalam, several research papers have been published on this topic, but none of them offers a functional or reproducible system.

The morphological characteristics of Malayalam have always made this problem challenging. When a named entity appears inside an inflected or agglutinated complex word, the first step is to analyse the word and arrive at its root words.

As the Malayalam morphology analyser, mlmorph, is progressing well, I attempted to build a first version of Malayalam NER on top of it. Since mlmorph already gives the POS tagging and analysis, there is not much left to do for NER: we just need to look for the tags corresponding to proper nouns and report them.
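
To make the idea concrete, here is a minimal Python sketch of the tag lookup. It does not call mlmorph itself. It assumes each word has already been analysed into a string of the form root followed by angle-bracket tags, and that proper nouns carry a tag named `np`. The tag name, the sample analysis strings, and the function name `find_entities` are assumptions for illustration only and may not match the actual mlmorph output.

```python
import re

# Assumed tag name(s) marking proper nouns; the real mlmorph tag set may differ.
PROPER_NOUN_TAGS = {"np"}

# Matches tags written as <tag> inside an analysis string.
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def find_entities(analyses):
    """Return (surface word, root) pairs for words analysed as proper nouns.

    `analyses` is a list of (word, analysis_string) pairs in text order.
    """
    entities = []
    for word, analysis in analyses:
        tags = TAG_PATTERN.findall(analysis)
        if PROPER_NOUN_TAGS.intersection(tags):
            # Simplification: treat everything before the first tag as the root.
            root = analysis.split("<", 1)[0]
            entities.append((word, root))
    return entities


# Hypothetical analyses, written by hand for this example.
sample = [
    ("കേരളത്തിൽ", "കേരളം<np><locative>"),
    ("പോയി", "പോകുക<v><past>"),
]
entities = find_entities(sample)
```

With the hand-written sample above, only the first word is reported, together with its root form. A real implementation would take the analysis strings from the analyser rather than from a fixed list.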

You can try the system at https://morph.smc.org.in/ner

Malayalam named entity recognition example using https://morph.smc.org.in/ner

Known Limitations

  • The recognition is limited by the current lexicon of mlmorph. To recognize out-of-lexicon entities, a POS guesser would be needed. But this is a general problem, not one limited to NER: a morphology analyser should also have a POS guesser. In other words, as mlmorph improves, this system also improves automatically.
  • Currently the recognition is at the word level. But sometimes an entity is written as multiple consecutive words. To resolve that, we will need to write a wrapper on top of the word-level detection system.
  • The current system is a JavaScript wrapper on top of the mlmorph analyse API. I think NER deserves its own API.
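
One possible shape for the multi-word wrapper mentioned in the second limitation is sketched below in Python. It assumes the word-level step has already produced a flag for each word saying whether it was recognized as an entity, and it simply joins runs of consecutive flagged words. The function name `merge_consecutive` and the input format are hypothetical, and this naive rule would also merge two distinct entities that happen to be adjacent.

```python
def merge_consecutive(flags):
    """Join runs of consecutive entity words into multi-word entities.

    `flags` is a list of (word, is_entity) pairs in text order.
    """
    spans = []
    current = []
    for word, is_entity in flags:
        if is_entity:
            current.append(word)
        elif current:
            # A non-entity word closes the run collected so far.
            spans.append(" ".join(current))
            current = []
    if current:
        # Flush a run that reaches the end of the text.
        spans.append(" ".join(current))
    return spans


# Placeholder words stand in for the output of the word-level detector.
merged = merge_consecutive(
    [("A", True), ("B", True), ("went", False), ("C", True)]
)
```

Here the first two words form one entity and the last word forms another, so the result has two entries.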
This was originally written by Santhosh Thottingal and published at thottingal.in