Malayalam Named Entity Recognition using a morphology analyser
A detailed note by Santhosh Thottingal.
Named Entity Recognition (NER), the task of identifying and classifying real-world objects such as persons, places and organizations in a given text, is a well-known NLP problem. Several research papers have been published on this topic for Malayalam, but none of them resulted in a functional system or in reproducible research.
The morphological characteristics of Malayalam have always made this problem hard to solve. When a named entity appears inside an inflected or agglutinated complex word, the first step is to analyse that word and arrive at its root words.
As the Malayalam morphology analyser (mlmorph) is progressing well, I attempted to build a first version of Malayalam NER on top of it. Since mlmorph already provides POS tagging and analysis, there is not much left to do for NER: we just need to look for the tags corresponding to proper nouns and report them.
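The tag lookup described above can be sketched roughly as follows. This is only an illustration, not the code behind the actual system: it assumes that an analysis arrives as a string of the form `root<tag><tag>...`, and the tag names in `ENTITY_TAGS` are hypothetical placeholders, so they should be replaced with whatever proper-noun tags mlmorph really emits.

```python
import re

# Hypothetical tag names used only for illustration; the real mlmorph
# tag set may differ and should be substituted here.
ENTITY_TAGS = {"np", "place", "org"}

# Assumed analysis format: one or more "root<tag><tag>..." segments.
SEGMENT = re.compile(r"([^<>]+)((?:<[^<>]+>)+)")
TAG = re.compile(r"<([^<>]+)>")


def find_entities(analysis):
    """Return (root, tag) pairs for segments that carry an entity tag."""
    entities = []
    for root, tag_run in SEGMENT.findall(analysis):
        for tag in TAG.findall(tag_run):
            if tag in ENTITY_TAGS:
                # Report the root once, with the first matching tag.
                entities.append((root, tag))
                break
    return entities


# Hand-written analysis strings in the assumed format, not real analyser output.
print(find_entities("കേരളം<np><locative>"))
print(find_entities("വീട്<n><locative>"))
```

Because the entity is reported as the root rather than the surface form, an inflected word and its bare form map to the same entity, which is the main benefit of doing NER on top of morphological analysis.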
You can try the system at https://morph.smc.org.in/ner
Known Limitations
- The recognition is limited by the current lexicon of mlmorph. To recognize out-of-lexicon entities, a POS guesser would be needed. This is a general problem, not one specific to NER: a morphology analyser should have a POS guesser anyway. In other words, as mlmorph improves, this system improves automatically.
- Currently the recognition is at word level, but entities are sometimes written as multiple consecutive words. To resolve that, we will need to write a wrapper on top of the word-level detection system.
- The current system is a JavaScript wrapper on top of the mlmorph analyse API. I think NER deserves its own API.
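One possible shape for the multi-word wrapper mentioned in the list above is sketched here. It is a naive heuristic and an assumption on my part, not an existing mlmorph feature: it takes word-level results as `(word, tag)` pairs, where `tag` is `None` for non-entities, and merges adjacent words that share the same entity tag. The function name `group_entities` and the tag labels are hypothetical.

```python
def group_entities(tagged_words):
    """Merge runs of adjacent words that share the same entity tag.

    tagged_words: list of (word, tag) pairs; tag is None for non-entities.
    Returns a list of (entity_text, tag) pairs.
    """
    groups = []
    current_words, current_tag = [], None
    for word, tag in tagged_words:
        if tag is not None and tag == current_tag:
            # Same entity type as the previous word: extend the run.
            current_words.append(word)
            continue
        if current_words:
            # The run ended: emit it as one entity.
            groups.append((" ".join(current_words), current_tag))
        if tag is not None:
            current_words, current_tag = [word], tag
        else:
            current_words, current_tag = [], None
    if current_words:
        groups.append((" ".join(current_words), current_tag))
    return groups


# Word-level results written by hand for illustration.
words = [
    ("Santhosh", "person"),
    ("Thottingal", "person"),
    ("lives", None),
    ("in", None),
    ("Kerala", "place"),
]
print(group_entities(words))
```

This merges any two neighbouring entities of the same type, so two different people named back to back would be joined into one. A real wrapper would probably need additional cues, such as punctuation or conjunctions, to split those cases.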
This was originally written by Santhosh Thottingal and published at thottingal.in