Publication . Conference object . Article . Preprint . 2020

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Wilhelmina Nekoto; Vukosi Marivate; Tshinondiwa Matsila; Timi E. Fasubaa; Tajudeen Kolawole; Taiwo Fagbohungbe; Solomon Oluwole Akinola; +41 Authors
Open Access
Published: 05 Oct 2020
Publisher: Association for Computational Linguistics
Country: United Kingdom
Abstract

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.

Comment: Findings of EMNLP 2020; updated benchmarks

Subjects by Vocabulary

Microsoft Academic Graph classification: Focus (linguistics), Data science, Task (project management), Participatory action research, Languages of Africa, Machine translation, Process (engineering), Computer science

Subjects

Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning

License: cc-by