- Digital Humanities and Cultural Heritage
- 2013-2022
- Publications
- Science Foundation Ireland
- SFI|SFI Centre for Science Engineer...
- KHRESMOI
- CZ
- IE
Publication: Article, 2014, Czech Republic
Publisher: Springer Science and Business Media LLC
Publicly funded. Funded by: EC | KHRESMOI; EC | ABU-MATRAN; SFI | CSET CNGL: Next Generation Localisation (CNGL); EC | PANACEA
Authors: Pecina, Pavel; Toral, Antonio; Papavassiliou, Vassilis; Prokopidis, Prokopis; Tamchyna, Aleš; Way, Andy; van Genabith, Josef
pmc: PMC4479164
pmid: 26120290
In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for the automatic acquisition of monolingual and parallel text and its exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity, supported by the results of a large-scale evaluation carried out for two domains (environment and labour legislation), two language pairs (English-French and English-Greek), and both translation directions (into and from English). In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains. We show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data to adapt the language and translation models. The average observed improvement in BLEU is substantial, at 15.30 points absolute.
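To make the retuning step more concrete: in a phrase-based SMT system, candidate translations are typically ranked by a weighted sum of feature values (a log-linear model), and tuning adjusts the weights. The following is a minimal sketch of that idea only. The feature names, values, weights, and candidate strings are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]


def loglinear_score(features: Features, weights: Features) -> float:
    """Weighted sum of feature values (log-probabilities and penalties)."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(candidates: List[Tuple[str, Features]], weights: Features) -> str:
    """Return the candidate translation with the highest model score."""
    return max(candidates, key=lambda c: loglinear_score(c[1], weights))[0]


# Hypothetical candidates for one source sentence. "tm" and "lm" stand for
# translation-model and language-model log-probabilities (made-up numbers).
CANDIDATES = [
    ("general-domain wording", {"tm": -2.0, "lm": -3.0, "word_penalty": -4.0}),
    ("in-domain wording", {"tm": -1.0, "lm": -5.0, "word_penalty": -4.0}),
]

# Hypothetical weights before and after retuning on in-domain data.
GENERAL_WEIGHTS = {"tm": 0.2, "lm": 0.5, "word_penalty": 0.1}
RETUNED_WEIGHTS = {"tm": 0.6, "lm": 0.2, "word_penalty": 0.1}

print(best_hypothesis(CANDIDATES, GENERAL_WEIGHTS))
print(best_hypothesis(CANDIDATES, RETUNED_WEIGHTS))
```

In this toy setting the models themselves are unchanged; only the weights differ, which is enough to change which candidate wins. Real tuning procedures search for weights that maximise a quality metric such as BLEU on a held-out in-domain set.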
Sources: Europe PubMed Central (Article, 2014; full text: http://europepmc.org/articles/PMC4479164; data source: PubMed Central); Biblio at Institute of Formal and Applied Linguistics (Article, 2015)
DOI: 10.1007/s10579-014-9282-3
Access routes: Green, hybrid
Citations: 12; popularity: Top 10%; influence: Top 10%; impulse: Average (powered by BIP!)
Publication: Article, 2014, France, Czech Republic, Ireland
Publisher: Elsevier BV
Publicly funded. Funded by: EC | KHRESMOI; SFI | CSET CNGL: Next Generation Localisation (CNGL)
Authors: Pecina, Pavel; Tamchyna, Aleš; Urešová, Zdeňka; Hlaváčová, Jaroslava; Hajič, Jan; Rosa, Rudolf; Kelly, Liadh; Novák, Michal; Leveling, Johannes; Goeuriot, Lorraine; Popel, Martin; Dušek, Ondřej; Jones, Gareth; Mareček, David
Objective: We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve the effectiveness of cross-lingual IR.
Methods and data: Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech-English, German-English, and French-English. MT quality is evaluated on data sets created within the Khresmoi project, and IR effectiveness is tested on the CLEF eHealth 2013 data sets.
Results: The search query translation results achieved in our experiments are outstanding: our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in a direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech-English, from 23.03 to 40.82 for German-English, and from 32.67 to 40.82 for French-English. This is a 55% improvement on average. In terms of IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French-English. For Czech-English and German-English, the increased MT quality does not lead to better IR results.
Conclusions: Most of the MT techniques employed in our experiments improve MT of medical search queries. Intelligent training data selection in particular proves very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with IR performance: better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.
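The abstract above names BM25 as the retrieval model and mentions query expansion with multiple translation variants. As a rough illustration, here is a minimal sketch of the textbook Okapi BM25 scoring function with a non-negative idf variant. It is not claimed to match Lucene's exact implementation or the authors' configuration; the parameter defaults (k1 = 1.2, b = 0.75), the toy documents, and the queries are assumptions made for the example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenised document against the query terms with BM25.

    Assumes docs is a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection and queries (not from the CLEF eHealth data).
DOCS = [
    "treatment of heart attack in adults".split(),
    "symptoms of seasonal flu".split(),
    "myocardial infarction treatment guidelines".split(),
]
single = bm25_scores(["heart", "attack"], DOCS)
# Query expansion: append a second translation variant of the same query.
expanded = bm25_scores(["heart", "attack", "myocardial", "infarction"], DOCS)
print(single)
print(expanded)
```

With the single variant, only the first document matches; adding the second variant lets the third document receive a non-zero score as well, which is the intuition behind using several translation variants for query expansion.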
Sources: DCU Online Research Access Service (Article, 2014, peer-reviewed); Artificial Intelligence in Medicine (Article, 2014, peer-reviewed; license: Elsevier TDM; data source: Crossref); Biblio at Institute of Formal and Applied Linguistics (Article, 2014)
DOI: 10.1016/j.artmed.2014.01.004
Citations: 23; popularity: Top 10%; influence: Top 10%; impulse: Top 10% (powered by BIP!)
For further information contact us at helpdesk@openaire.eu