A significant percentage of COVID-19 survivors experience ongoing multisystemic symptoms that often affect daily living, a condition known as Long Covid or post-acute sequelae of SARS-CoV-2 infection. However, identifying scientific articles relevant to Long Covid is challenging because there is no standardized or consensus terminology. We developed an iterative human-in-the-loop machine learning framework that combines data programming with active learning into a robust ensemble model, demonstrating higher specificity and considerably higher sensitivity than other methods. This dataset contains the source code (Python and shell scripts) used to create the Long Covid collection, along with a snapshot of processed data and predictions. This research was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.
Using a novel database of 189,000+ Colombian firms and 500,000+ firm executives' names, I study the effect of financial factors, CEOs' centrality (corporate power), and political connections on access to a government bailout program launched to subsidize wages in the first stages of the COVID-19 crisis. Natural language processing algorithms and complex network metrics are used to unveil the ownership and control links of political and economic elites and to gauge their closeness to the Colombian President. I find that firm size factors and firm age, rather than political connections or being run by prominent CEOs or shareholders, explain access to the program. In addition, I find that the program's impacts on salaries and liquidity are positive, but they do not increase with firm size and age. These findings suggest a preference for protecting systemically important firms (without ex-post economic efficiency) rather than the special interests of elites.
The data file is a spreadsheet used to record queries made via CQPweb (https://cqpweb.lancs.ac.uk).
Search Terms
For clarity, in the ensuing descriptions, we use bold font for search terms and italic font for collocates and other quotations. Based on clinical descriptions of COVID-19 (reviewed by Cevik et al., 2020), we identified the following search terms: 1) “cough”, 2) “fever”, 3) “pneumonia”. To avoid confusion with years when influenza pandemics may have occurred, we added 4) “influenza” and 5) “epidemic”. Any combination of terms 1 to 3 co-occurring with term 4 alone, or with terms 4 and 5 together, would be indicative of a respiratory outbreak caused by, or at least attributed to, influenza. By contrast, any combination of terms 1 to 3 co-occurring with term 5 alone, or without either of terms 4 and 5, would suggest a respiratory disease that was not confidently identified as influenza at the time. Such an outbreak would provide a candidate coronavirus epidemic for further investigation.
Newspapers
Newspapers and years searched were as follows: Belfast Newsletter (1828-1900), The Era (1838-1900), Glasgow Herald (1820-1900), Hampshire & Portsmouth Telegraph (1799-1900), Ipswich Journal (1800-1900), Liverpool Mercury (1811-1900), Northern Echo (1870-1900), Pall Mall Gazette (1865-1900), Reynold’s Daily (1850-1900), Western Mail (1869-1900) and The Times (1785-2009). The search in The Times was extended to 2009 in order to provide a comparison with the 20th century. Searches were performed using Lancaster University’s instance of the CQPweb (Corpus Query Processor) corpus analysis software (https://cqpweb.lancs.ac.uk/; Hardie, 2012). CQPweb’s database is populated from the newspapers listed using optical character recognition (OCR), so some errors may be present, particularly for older publications (McEnery et al., 2019).
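The search-term logic described under Search Terms can be written down as a small decision rule. The sketch below is illustrative only and is not part of the deposited data: the function name and the three category labels are hypothetical, and it assumes that the input is the set of search terms found to be elevated together in one newspaper-year.

```python
# Hypothetical sketch of the search-term decision rule; names and labels are
# illustrative assumptions, not taken from the dataset.
SYMPTOM_TERMS = {"cough", "fever", "pneumonia"}  # terms 1 to 3


def classify_year(elevated_terms):
    """Classify one newspaper-year from its set of co-occurring search terms."""
    elevated = set(elevated_terms)
    if not (SYMPTOM_TERMS & elevated):
        # None of terms 1 to 3 present: no respiratory signal to interpret.
        return "no respiratory signal"
    if "influenza" in elevated:
        # Term 4 alone, or terms 4 and 5 together.
        return "influenza-attributed outbreak"
    # Term 5 alone, or neither term 4 nor term 5.
    return "candidate coronavirus epidemic"


print(classify_year({"cough", "fever", "epidemic"}))
```

The rule treats "epidemic" without "influenza" the same way as no qualifier at all, which mirrors the description: both cases suggest a respiratory disease that was not confidently identified as influenza.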
Statistics
The occurrence of each of the five search terms was calculated in CQPweb per million words within the annual output of each publication. This was compared to a background distribution constituting the corresponding words per million for each search term over the total year range for each newspaper. Within the annual distributions, for each search term and each newspaper, we determined the years lying in the top 1% (i.e. p<0.05 after application of a Bonferroni correction), following Gabrielatos et al. (2012). These are deemed to be years when that search term was in statistically significant usage above its background level for the newspaper in which it occurs. For years when search terms were significantly elevated, we also calculated collocates at range n. Collocates, in corpus linguistics, are other words found at statistically significant usage, over their own background levels, in a window from n positions to the left to n positions to the right of the search term. In other words, they are found in significant proximity to the search term. A default value of n=10 was used throughout, unless specified. Collocation analysis therefore helps to show how a search term associates with other words within a corpus, providing information about the context in which that search term is used. CQPweb provides a log ratio method for quantifying the strength of collocation.
COVID-19 is the first known coronavirus pandemic. Nevertheless, the seasonal circulation of the four milder coronaviruses of humans – OC43, NL63, 229E and HKU1 – raises the possibility that these viruses are the descendants of more ancient coronavirus pandemics. This proposal arises by analogy to the observed descent of seasonal influenza subtypes H2N2 (now extinct), H3N2 and H1N1 from the pandemic strains of 1957, 1968 and 2009, respectively.
Recent historical revisionist speculation has focussed on the influenza pandemic of 1889-1892, based on molecular phylogenetic reconstructions that show the emergence of human coronavirus OC43 around that time, probably by zoonosis from cattle. If the “Russian influenza”, as The Times named it in early 1890, was not influenza but caused by a coronavirus, the origins of the other three milder human coronaviruses may also have left a residue of clinical evidence in the 19th-century medical literature and popular press. In this paper, we search digitised 19th-century British newspapers for evidence of previously unsuspected coronavirus pandemics. We conclude that there is little or no corpus linguistic signal in the UK national press for large-scale outbreaks of unidentified respiratory disease for the period 1785 to 1890.
To view the data, open the file in Microsoft Excel. To reproduce the data from scratch, a login to CQPweb (https://cqpweb.lancs.ac.uk) is needed. This is free of charge but requires authorization, which can be applied for at the URL given.
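The calculations described under Statistics can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure run inside CQPweb: the function names and the sample figures are hypothetical, the "top 1%" is read here as a simple rank cut-off over the annual values (rounded up to at least one year), and the log ratio is taken to be the binary logarithm of the ratio of relative frequencies inside and outside the collocation window. Zero counts are not handled.

```python
import math


def per_million(count, total_words):
    """Occurrences of a term per million words of annual output."""
    return count * 1_000_000 / total_words


def top_fraction_years(freq_by_year, fraction=0.01):
    """Years whose frequency ranks in the top `fraction` of the annual values.

    Assumption: a rank-based cut-off, rounded up so at least one year is kept.
    """
    ranked = sorted(freq_by_year, key=freq_by_year.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return sorted(ranked[:keep])


def log_ratio(count_in_window, window_words, count_elsewhere, words_elsewhere):
    """Binary log of the ratio of relative frequencies (all counts must be > 0)."""
    return math.log2(
        (count_in_window * words_elsewhere) / (count_elsewhere * window_words)
    )


# Made-up figures for illustration only.
annual = {1847: per_million(240, 2_000_000), 1848: per_million(80, 2_000_000)}
print(top_fraction_years(annual))
print(log_ratio(8, 1000, 1, 1000))
```

Each additional point of log ratio corresponds to a doubling of the collocate's relative frequency near the search term compared with elsewhere, which makes the measure easy to read as an effect size.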
Project: UKRI | Learning from COVID-19: A... (EP/V048597/1)
The peer-reviewed publication for this dataset was presented at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite this when using the dataset. This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim. The claims have been obtained from online fact-checking sources, existing datasets and research challenges. It combines data sources with different foci, thus enabling a comprehensive approach that covers different media (Twitter, Facebook, general websites, academia), information domains (health, scholarly, media), information types (news, claims) and applications (information retrieval, veracity evaluation). The processing of the claims included an extensive de-duplication process eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, accounting for different degrees of similarity between the remaining claims (excluding respectively claims with a 90% and 99% probability of being similar, as obtained through the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019). The processing of the content also involved removing claims that make only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English. The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis.
The following types were identified:
(1) Multimodal;
(2) Social media references;
(3) Claims including questions;
(4) Claims including numerical content;
(5) Named entities, including:
- PERSON: People, including fictional
- ORGANIZATION: Companies, agencies, institutions, etc.
- GPE: Countries, cities, states
- FACILITY: Buildings, highways, etc.
These entities have been detected using a RoBERTa base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013) using spaCy. The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels.
The data sources used are:
- The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/
- CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID
- MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID
- CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data
- TREC Health Misinformation track https://trec-health-misinfo.github.io/
- TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html
The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True).
The entries in the dataset contain the following information:
- Claim. Text of the claim.
- Claim label. The labels are: False, and True.
- Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals.
- Original information source. Information about which general information source was used to obtain the claim.
- Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.
Funding
This work was supported by the UK Engineering and Physical Sciences Research Council (grant no. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by the UK Research and Innovation (grant no. EP/V030302/1, EP/V020579/1).
References
- Arana-Catania M., Kochkina E., Zubiaga A., Liakata M., Procter R., He Y. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022. https://arxiv.org/abs/2205.02596
- Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.
- Fabio Crestani, Mounia Lalmas, Cornelis J Van Rijsbergen, and Iain Campbell. 1998. “Is this document relevant?... probably”: a survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.
- Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.
- Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 708–718.
- Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA, 23.
- Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885.
- Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A multilingual and multimodal data repository for combating COVID-19 disinformation.
- Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 misinformation on social media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.
- Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM New York, NY, USA.
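The de-duplication description for this dataset names BM25 as the first step in scoring claim similarity. As a rough illustration of that step only, the sketch below implements one common Okapi BM25 variant with a non-negative IDF. It is not the authors' pipeline: the whitespace tokenizer, the function names, the parameter defaults (k1=1.5, b=0.75) and the sample claims are all assumptions, and the MonoT5 re-ranking and BERTScore stages are not reproduced.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer (an assumption made for this sketch)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with one Okapi BM25 variant."""
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Made-up claims for illustration only.
claims = [
    "masks reduce covid transmission",
    "vitamin c cures covid",
    "masks reduce the transmission of covid",
]
print(bm25_scores(claims[0], claims))
```

In a de-duplication setting, the highest-scoring claims for each query claim would be the candidates passed on to a stronger pairwise model, so BM25 acts as a cheap filter rather than the final similarity judgement.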
The objectives of this research were to examine the impact of the rapid shift to online teaching due to COVID-19 on Fine Arts performance (Music and Drama) students and faculty, and then to investigate the efficacy of integrating online strategies and resources to support traditional face-to-face teaching. As teaching and learning performance in drama and music aims at embodied communication and interpretation within live performance, our disciplines are highly impacted by online learning. The findings from this qualitative data, combined with research on the experience of other Fine Arts institutions and surveys of available resources, will be applied to further research. This dataset is restricted. Please consult the access guidelines document to learn why this is, under what conditions access will be allowed, and how to request access.
Covid-19 has strengthened state surveillance worldwide. We increasingly have to produce personal documents, and the human face, on which the latest surveillance technologies are also based, has become the central subject of identification.
Contributing metadata to the COVID-19 collection of the Ethnic and Migrant Minorities (EMM) Survey Registry as a data producer. A training video for COVID-19 survey producers, intended to encourage contributions to the COVID-19 collection of the EMM Survey Registry. Target audience for the video: survey producers (academic and non-academic) of COVID-19 surveys with EMM respondents.
The use of machine translation models is increasing in the translation industry. So-called "universal" models attain good performance when evaluated over a wide set of domains, but their performance is often limited when tested on specific domains. Translations must be adapted to the style, subject matter and vocabulary of different domains, especially new ones (COVID-19 texts, for example). Training a new model for each domain requires time, specialized technological tools and large data sets. Such resources are generally not available. In this master's thesis, we propose to evaluate a novel transfer learning technique for domain adaptation. The technique can adapt quickly to any new domain, without additional training, and in an unsupervised manner. Given a sample of sentences from the new domain, the model computes a vector representation of the domain that it then uses to guide its translations. To compute this domain embedding, we test five different techniques. Our experiments show that a model using such an embedding succeeds in extracting the information it contains to guide its translations. We obtain overall better results than a translation model trained on the same data but without the embedding. Our model is more advantageous than other domain adaptation techniques since it is unsupervised, requires no additional training to adapt, and adapts very quickly (within seconds) from only a small set of sentences.