FinEst BERT and CroSloEngual BERT: less is more in multilingual models
Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. Research has mostly focused on the English language, though. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian, and English. We evaluate their performance on several downstream tasks, NER, POS-tagging, and dependency parsing, using multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual settings.
Comment: 10 pages, accepted at TSD 2020 conference
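As an illustration of how such a trilingual masked language model can be queried for downstream work, here is a minimal sketch using the Hugging Face transformers library. The model identifier EMBEDDIA/crosloengual-bert is an assumption about where the released checkpoint is hosted and may differ from the actual published name.

```python
# Minimal sketch: load a trilingual BERT-like model and fill a masked token.
# The model id below is an assumed Hugging Face location for CroSloEngual BERT.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "EMBEDDIA/crosloengual-bert"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Slovenian example: "Ljubljana is the capital of [MASK]."
text = f"Ljubljana je glavno mesto {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and print the five most likely fillers.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```

The same pattern applies to the Finnish-Estonian-English model by swapping in its checkpoint id; for the NER, POS-tagging, and dependency-parsing evaluations described above, the pretrained encoder would instead be fine-tuned with a task-specific head.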
Microsoft Academic Graph classification: English language, Artificial intelligence, Business, Language model, Estonian language, Natural language processing, Downstream (software development), Computer science, Dependency grammar
Computation and Language (cs.CL), FOS: Computer and information sciences, Computer Science - Computation and Language, Contextual embeddings, BERT model, Less-resourced languages, NLP
- University of Ljubljana, Slovenia

- Funder: European Commission (EC)
- Project Code: 825153
- Funding stream: H2020 | RIA