Amazon Releases MASSIVE Language Dataset

Amazon has announced the release of a 51-language dataset to help developers of Natural Language Understanding (NLU) systems such as Alexa and other chatbots.

The dataset is named MASSIVE, a rather tortuous and recursive acronym for Multilingual Amazon SLURP for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation. The acronym-within-an-acronym SLURP means SLU Resource Package. SLU in turn stands for Spoken Language Understanding.
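To make the two tasks in the name concrete: intent classification assigns an utterance a single label describing what the user wants, while slot filling marks the spans that carry the parameters of the request. The sketch below is illustrative only. The bracketed `[slot : value]` notation, the example sentence, the `alarm_set` intent label, and the `LabeledUtterance` and `parse_annotated` names are all assumptions made for this example, not a description of the dataset's actual schema.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed annotation style for illustration: "[slot_name : slot value]".
SLOT_PATTERN = re.compile(r"\[\s*([^:\]]+?)\s*:\s*([^\]]+?)\s*\]")


@dataclass
class LabeledUtterance:
    """Hypothetical container for one labeled utterance."""
    text: str                      # plain utterance with annotations removed
    intent: str                    # one label for the whole utterance
    slots: list[tuple[str, str]]   # (slot name, slot value) pairs, in order


def parse_annotated(annotated: str, intent: str) -> LabeledUtterance:
    """Split a bracket-annotated utterance into plain text and slot pairs."""
    slots = [(m.group(1), m.group(2)) for m in SLOT_PATTERN.finditer(annotated)]
    # Replace each bracketed span with its value to recover the plain text.
    text = SLOT_PATTERN.sub(lambda m: m.group(2), annotated)
    return LabeledUtterance(text=text, intent=intent, slots=slots)


example = parse_annotated(
    "wake me up at [time : nine am] on [date : friday]",
    intent="alarm_set",  # made-up intent label
)
print(example.intent, example.slots)
```

An NLU model trained on data of this kind would be expected to predict both the intent label and the slot spans from the plain text alone.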

MASSIVE contains over 19,000 utterances in English which have been translated by humans into “50 typologically diverse languages from 29 genera, including low-resource languages”. The hope is that the dataset will help chatbot developers and researchers extend their work to many more languages. Prem Natarajan, vice president of Alexa AI Natural Understanding, says:

“We are very excited to share this large multilingual dataset with the worldwide language research community. We hope that this dataset will enable researchers across the world to drive new advances in multilingual language understanding that expand the availability and reach of conversational-AI technologies.”

To accompany the dataset release, Amazon has also announced a Massively Multilingual NLU (MMNLU-22) competition and workshop intended to “help researchers scale natural-language-understanding technology to every language on Earth”.

The full MASSIVE dataset is available for download, and tools for working with it are available on GitHub.