research data . Dataset . 2019 . Embargo end date: 08 Mar 2019

OAGK Keyword Generation Dataset

Çano, Erion;
Open Access
  • Published: 01 Apr 2019
  • Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. This data is derived from the OAG data collection, which was released under the ODC-BY licence. This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence. If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej...
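Since records are stored as JSON lines, each line of a data file can be parsed independently. The following is a minimal sketch of how such a file might be read in Python. The field names `title`, `abstract` and `keywords` and the sample record are assumptions made for illustration; this description does not state the actual field names, so check them against a real record before relying on them.

```python
import io
import json
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed record (a dict) per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines, which are not valid JSON documents.
            continue
        yield json.loads(line)


# Hypothetical in-memory sample standing in for a data file.
# The field names below are assumed, not taken from the dataset.
sample = io.StringIO(
    '{"title": "a sample title", "abstract": "a sample abstract .", '
    '"keywords": "keyword one , keyword two"}\n'
    "\n"
    '{"title": "another title", "abstract": "another abstract .", '
    '"keywords": "keyword three"}\n'
)

records = list(iter_records(sample))
for record in records:
    print(record["title"], "|", record["keywords"])
```

To read a real file, pass an open file handle instead of the sample, for example `with open(path, encoding="utf-8") as f: ...`, where `path` points to one of the downloaded text files. Iterating line by line avoids loading all 2.2 million records into memory at once.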
Funded by
European Live Translator
  • Funder: European Commission (EC)
  • Project Code: 825460
  • Funding stream: H2020 | RIA
Digital Humanities and Cultural Heritage