research data . Dataset . 2020 . Embargo end date: 16 Jul 2020

OdiEnCorp 2.0

Parida, Shantipriya; Bojar, Ondřej
Open Access
  • Published: 08 Apr 2020
  • Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Data
-----
We have collected English-Odia parallel data for NLP research on the Odia language. The data for the parallel corpus was extracted from existing parallel corpora such as OdiEnCorp 1.0 and PMIndia, and from books that contain both English and Odia text, such as grammar books and bilingual literature. We also included parallel text from multiple public websites such as Odia Wikipedia, the Odia digital library, and Odisha Government websites. The parallel corpus covers many domains: the Bible, other literature, Wiki data relating to many topics, Government policies, and general conversation. We have processed the raw data collected from the books,...
Funded by
Real time network, text, and speaker analytics for combating organized crime
  • Funder: European Commission (EC)
  • Project Code: 833635
  • Funding stream: H2020 | RIA
Digital Humanities and Cultural Heritage