publication . Article . 2014

Domain adaptation of statistical machine translation with domain-focused web crawling

Pavel Pecina; Antonio Toral; Vassilis Papavassiliou; Prokopis Prokopidis; Aleš Tamchyna; Andy Way; Josef van Genabith
Open Access English
  • Published: 03 Dec 2014
  • Journal: Language Resources and Evaluation, volume 49, issue 1, pages 147-193 (ISSN: 1574-020X)
  • Publisher: Springer Nature
  • Country: Czech Republic
In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for the automatic acquisition of monolingual and parallel text and its exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity, supported by the results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English-French and English-Greek...
Free text keywords: Linguistics and Language, Library and Information Sciences, Statistical machine translation, Domain adaptation, Web crawling, Optimisation, Natural language processing, Computer science, Crawling, Evaluation of machine translation, Computational linguistics, Machine translation, Web crawler, Artificial intelligence, Phrase, Transfer-based machine translation
Funded by
Automatic building of Machine Translation
  • Funder: European Commission (EC)
  • Project Code: 324414
  • Funding stream: FP7 | SP3 | PEOPLE
Knowledge Helper for Medical and Other Information users
  • Funder: European Commission (EC)
  • Project Code: 257528
  • Funding stream: FP7 | SP1 | ICT
SFI| CSET CNGL: Next Generation Localisation (CNGL)
  • Funder: Science Foundation Ireland (SFI)
  • Project Code: 07/CE/I1142
  • Funding stream: SFI Centre for Science Engineering and Technology (CSET)
Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies
  • Funder: European Commission (EC)
  • Project Code: 248064
  • Funding stream: FP7 | SP1 | ICT
Digital Humanities and Cultural Heritage