publication . Article . 2014

Domain adaptation of statistical machine translation with domain-focused web crawling

Pavel Pecina; Antonio Toral; Vassilis Papavassiliou; Prokopis Prokopidis; Aleš Tamchyna; Andy Way; Josef van Genabith
Open Access
  • Published: 03 Dec 2014 Journal: Language Resources and Evaluation, volume 49, pages 147-193 (issn: 1574-020X, eissn: 1574-0218)
  • Publisher: Springer Science and Business Media LLC
  • Country: Czech Republic
In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for the automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity, supported by the results of a large-scale evaluation carried out for two domains (environment and labour legislation) and two language pairs (English–French and English–Greek), in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains. We show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU is substantial, at 15.30 points absolute.
Sustainable Development Goals (SDG)
16. Peace & justice
free text keywords: Library and Information Sciences, Linguistics and Language, Education, Language and Linguistics, Original Paper, Statistical machine translation, Domain adaptation, Web crawling, Optimisation, Adaptation (computer science), Evaluation of machine translation, Artificial intelligence, Domain (software engineering), Crawling, Machine translation, Computer science, Transfer-based machine translation, BLEU, Web crawler, Phrase, Natural language processing
  • Digital Humanities and Cultural Heritage
Funded by
Knowledge Helper for Medical and Other Information users
  • Funder: European Commission (EC)
  • Project Code: 257528
  • Funding stream: FP7 | SP1 | ICT
CSET CNGL: Next Generation Localisation (CNGL)
  • Funder: Science Foundation Ireland (SFI)
  • Project Code: 07/CE/I1142
  • Funding stream: SFI Centre for Science Engineering and Technology (CSET)
Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies
  • Funder: European Commission (EC)
  • Project Code: 248064
  • Funding stream: FP7 | SP1 | ICT
Automatic building of Machine Translation
  • Funder: European Commission (EC)
  • Project Code: 324414
  • Funding stream: FP7 | SP3 | PEOPLE