research data . Dataset . 2019

doclevel-MT-benchmark-discoMT2019

Tiedemann, Jörg; Scherrer, Yves
Open Access English
  • Published: 01 Nov 2019
  • Publisher: Zenodo
Abstract
This release contains data sets for experiments with document-level machine translation. The data sets have been used in previous studies and are provided here for replicability and comparison with other systems. They are taken from the English-German news translation task at WMT 2019 and from the English-German bitext in the OpenSubtitles collection v2016 from OPUS. All data sets are sentence-aligned: each line on the source side corresponds to the line with the same number on the target side. Document boundaries are marked with empty lines on both sides of the parallel corpus. The data set has been used in the following publication:

@inproceedings{scherrer-tiedemann-loaiciga-2019,
    title = "Analysing concatenation approaches to document-level NMT in two different domains",
    author = {Scherrer, Yves and Tiedemann, J{\"o}rg and Lo{\'a}iciga, Sharid},
    booktitle = "Proceedings of the Third Workshop on Discourse in Machine Translation",
    month = nov,
    year = "2019",
    address = "Hong-Kong",
    publisher = "Association for Computational Linguistics",
}

Please cite that paper if you use the data set in your own work.
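The file layout described above (one sentence per line, aligned by line number, with an empty line on both sides at each document boundary) can be read with a few lines of Python. The following is a minimal sketch, not part of the release: the function name `read_parallel_documents` is hypothetical, and treating an empty line on only one side as an error is an assumption based on the description, not documented behaviour of the data.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, List, Tuple

# One document = list of (source sentence, target sentence) pairs.
Document = List[Tuple[str, str]]


def read_parallel_documents(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Document]:
    """Group line-aligned sentence pairs into documents.

    Assumption: a document boundary is an empty line on BOTH sides,
    as stated in the data set description.
    """
    current: Document = []
    pairs = zip_longest(src_lines, tgt_lines, fillvalue=None)
    for line_no, (src, tgt) in enumerate(pairs, start=1):
        if src is None or tgt is None:
            # One side ran out of lines before the other.
            raise ValueError(f"line {line_no}: files differ in length")
        src, tgt = src.rstrip("\n"), tgt.rstrip("\n")
        src_empty, tgt_empty = not src.strip(), not tgt.strip()
        if src_empty and tgt_empty:
            # Boundary: emit the document collected so far, if any.
            if current:
                yield current
                current = []
        elif src_empty != tgt_empty:
            # Assumption: a one-sided empty line means misalignment.
            raise ValueError(f"line {line_no}: empty line on one side only")
        else:
            current.append((src, tgt))
    if current:
        # The last document may not be followed by an empty line.
        yield current


# Small in-memory example; real use would pass two open file objects.
example_src = ["Hello.\n", "How are you?\n", "\n", "Good night.\n"]
example_tgt = ["Hallo.\n", "Wie geht es dir?\n", "\n", "Gute Nacht.\n"]
example_docs = list(read_parallel_documents(example_src, example_tgt))
```

With real files, the two iterables would be file objects opened with `encoding="utf-8"`; the actual file names depend on the downloaded archive and are not assumed here. Keeping sentence pairs grouped per document makes it straightforward to build the concatenated context windows studied in the cited paper.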
Subjects
free text keywords: natural language processing, machine translation, language technology, NLP
Communities
  • Digital Humanities and Cultural Heritage
Funded by
EC | MeMAD
Methods for Managing Audiovisual Data: Combining Automatic Efficiency with Human Accuracy
  • Funder: European Commission (EC)
  • Project Code: 780069
  • Funding stream: H2020 | RIA
EC | FoTran
Found in Translation – Natural Language Understanding with Cross-Lingual Grounding
  • Funder: European Commission (EC)
  • Project Code: 771113
  • Funding stream: H2020 | ERC | ERC-COG
Download from
  • Zenodo (Open Access)