research data . Dataset . 2019 . Embargo end date: 31 Oct 2019

OAGSX Title Generation Dataset

Çano, Erion
Open Access
  • Published: 01 Nov 2019
  • Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
OAGSX is a title generation dataset consisting of 34,408,509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection, which was released under the ODC-BY license. This data (OAGSX Title Generation Dataset) is released under the CC-BY license. If using it, please also consider citing the following paper: Çano Erion, Bojar Ondřej. Two Huge Title a...
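Because the records are stored as JSON lines, each text file can be read one line at a time without loading the whole file into memory. The following is a minimal sketch in Python, not part of the dataset release. The function name `iter_records` is hypothetical, and the field names `title` and `abstract` in the sample are assumptions made for illustration; the actual keys should be checked against the downloaded files.

```python
import io
import json
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines rather than failing on them.
            continue
        yield json.loads(line)


# Hypothetical in-memory sample standing in for one dataset file.
# The keys "title" and "abstract" are assumed, not confirmed by the source.
sample = io.StringIO(
    '{"title": "a sample title", "abstract": "a sample abstract ."}\n'
    "\n"
    '{"title": "another title", "abstract": "another abstract ."}\n'
)

records = list(iter_records(sample))
print(len(records))
```

For a real file, an open file handle (for example one returned by `open(path, encoding="utf-8")`) can be passed to `iter_records` in place of the sample, since iterating over a file handle yields lines lazily.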
Funded by
European Live Translator
  • Funder: European Commission (EC)
  • Project Code: 825460
  • Funding stream: H2020 | RIA
Digital Humanities and Cultural Heritage