Publication . Conference object . Preprint . Article . Part of book or chapter of book . 2020 . Embargo end date: 01 Jan 2020

FinEst BERT and CroSloEngual BERT: less is more in multilingual models

Matej Ulčar; Marko Robnik-Šikonja
Open Access
Abstract

Large pretrained masked language models have become the state-of-the-art solution for many NLP problems. Research has, however, focused mostly on the English language. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian, and English. We evaluate their performance on several downstream tasks (NER, POS tagging, and dependency parsing), using multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual situations.

Comment: 10 pages, accepted at TSD 2020 conference
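
As an illustration of how such a model might be used, below is a minimal sketch of masked-token prediction with one of the released checkpoints through the Hugging Face Transformers library. The hub ID (EMBEDDIA/crosloengual-bert) and the example sentence are assumptions made for illustration; the record above does not state where the weights are hosted.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumed Hugging Face hub ID for CroSloEngual BERT; FinEst BERT
# would load analogously under its own ID.
MODEL_ID = "EMBEDDIA/crosloengual-bert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# The trilingual model covers Croatian, Slovenian, and English;
# here one token is masked in an English sentence.
text = f"Ljubljana is the capital of {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and decode the highest-scoring token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))

The downstream evaluations described in the abstract (NER, POS tagging, dependency parsing) would instead fine-tune the same checkpoint, for example via AutoModelForTokenClassification.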

Subjects by Vocabulary

Microsoft Academic Graph classification: English language, Artificial intelligence, Language model, Estonian, Natural language processing, Downstream (software development), Computer science, Dependency grammar

Subjects

Computation and Language (cs.CL), FOS: Computer and information sciences, Computer Science - Computation and Language, Contextual embeddings, BERT model, Less-resourced languages, NLP

Funded by
EMBEDDIA: Cross-Lingual Embeddings for Less-Represented Languages in European News Media
  • Funder: European Commission (EC)
  • Project Code: 825153
  • Funding stream: H2020 | RIA
Related to Research communities
Digital Humanities and Cultural Heritage
Download from
ZENODO
Conference object . 2020
Data sources: ZENODO