Advanced search in Research products
The following results are related to Digital Humanities and Cultural Heritage. Are you interested in viewing more results? Visit OpenAIRE - Explore.
24 Research products, page 1 of 3

Filters:
  • Digital Humanities and Cultural Heritage
  • Research data
  • Research software
  • 2018-2022
  • GB
  • ZENODO
  • Spiral - Imperial College Digital Repository
  • Digital Humanities and Cultural Heritage

Sorted by: Date (most recent)
  • Research data . Image . 2022
    Open Access English
    Authors: 
    Fort, Molly; Gibson, Adam;
    Publisher: Zenodo
    Project: EC | IPERION HS (871034), UKRI | Multimodal hyperspectral,... (2579355)

    The following data sets were collected to support the potential uses of open-source data in the context of digital humanities and heritage sciences:
    - Photographs
    - X-Ray Fluorescence
    - Hyperspectral Imaging
    - Multispectral Imaging
    This experiment was conducted by the UCL Institute for Sustainable Heritage in collaboration with the Centre for Digital Humanities. Imaging methods including photography, multispectral imaging, hyperspectral imaging and X-ray fluorescence mapping have been collected, along with the complete readout metadata of the instrumentation. We have collected this as an example of typical, unprocessed imaging datasets that would be found in standard imaging conditions. This data is not optimised, nor do we claim it to be of perfect quality; our aim is to provide users with access to a range of imaging data sets. We have included the data with minimal processing, as it is read straight from our systems, with the accompanying metadata provided from capture alone. We hope that you find the data helpful, and we welcome you to use the data in any way you wish, for any and all analysis and development purposes. To help us build upon this research, we ask that in return you share your experiences of using open-source data and of using our data, including successes and issues. If you would be willing to engage with us in this endeavor, please feel free to contact us so that we may follow up with you. E: molly.fort.21@ucl.ac.uk
    Object paradata:
    - Object: postcard, c. early 1900s
    - Language: English
    - Materials: colour print on card, metallic leafing
    - Front transcription: ‘Greetings’ ‘May your Birthday bring you Peace & perfect Happiness, Golden hopes & Love of Friends, And every Happiness this world can send.’
    - Object dimensions: 138 mm × 88 mm
    The postcard is an item of ephemera donated to the UCLDH Digitisation Suite by Prof Melissa Terras for teaching and training purposes in 2015.

  • Open Access English
    Authors: 
    Arana-Catania, Miguel; Kochkina, Elena; Zubiaga, Arkaitz; Liakata, Maria; Procter, Rob; He, Yulan;
    Publisher: Zenodo
    Project: UKRI | Learning from COVID-19: A... (EP/V048597/1)

    The peer-reviewed publication for this dataset was presented at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite it when using the dataset.
    This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim. The claims have been obtained from online fact-checking sources, existing datasets and research challenges. It combines different data sources with different foci, thus enabling a comprehensive approach that combines different media (Twitter, Facebook, general websites, academia), information domains (health, scholarly, media), information types (news, claims) and applications (information retrieval, veracity evaluation).
    The processing of the claims included an extensive de-duplication process eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, accounting for different degrees of similarity between the remaining claims (excluding, respectively, claims with a 90% and 99% probability of being similar, as obtained through the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019). The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English.
    The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including: PERSON (people, including fictional); ORGANIZATION (companies, agencies, institutions, etc.); GPE (countries, cities, states); FACILITY (buildings, highways, etc.). These entities were detected using a RoBERTa-base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013) using spaCy. The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels.
    The data sources used are:
    - The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/
    - CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID
    - MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID
    - CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data
    - TREC Health Misinformation track https://trec-health-misinfo.github.io/
    - TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html
    The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True). The entries in the dataset contain the following information:
    - Claim. Text of the claim.
    - Claim label. The labels are False and True.
    - Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institution sites, and peer-reviewed scientific journals.
    - Original information source. Information about which general information source was used to obtain the claim.
    - Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.
    Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant nos. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by UK Research and Innovation (grant nos. EP/V030302/1, EP/V020579/1).
    References
    - Arana-Catania, M., Kochkina, E., Zubiaga, A., Liakata, M., Procter, R., and He, Y. 2022. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022. https://arxiv.org/abs/2205.02596
    - Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.
    - Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell. 1998. "Is this document relevant? ... probably": A survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.
    - Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers Inc.
    - Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 708–718.
    - Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.
    - Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
    - Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA, 23.
    - Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 Healthcare Misinformation Dataset. arXiv preprint arXiv:2006.00885.
    - Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation.
    - Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 Misinformation on Social Media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.
    - Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM, New York, NY, USA.
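    The named-entity step described above (a RoBERTa-base English model trained on OntoNotes 5.0, run through spaCy) roughly corresponds to spaCy's en_core_web_trf pipeline. Below is a minimal sketch under that assumption, with an illustrative claim string; note that spaCy's label names are ORG and FAC rather than ORGANIZATION and FACILITY.

        import spacy

        # Assumes the transformer pipeline is installed:
        #   python -m spacy download en_core_web_trf
        nlp = spacy.load("en_core_web_trf")

        claim = "The World Health Organization published new guidance for hospitals in Geneva."
        doc = nlp(claim)
        for ent in doc.ents:
            # Keep only the entity types listed in the dataset description.
            if ent.label_ in {"PERSON", "ORG", "GPE", "FAC"}:
                print(ent.text, ent.label_)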

  • Open Access
    Authors: 
    Zhu, Lixing; He, Yulan; Zhou, Deyu;
    Publisher: Zenodo
    Project: UKRI | Twenty20Insight (EP/T017112/1), UKRI | Turing AI Fellowship: Eve... (EP/V020579/1)

    topical_wordvec_models. You first need to create a save folder for training. Download the [saved model](https://topicvecmodels.s3.eu-west-2.amazonaws.com/save/47/model) and place it in ./save/47/ to run the trained model. To construct the training set, please refer to https://github.com/somethingx02/topical_wordvec_model. Trained [wordvecs](https://topicvecmodels.s3.eu-west-2.amazonaws.com/save/47/aggrd_all_wordrep.txt) are also available.
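    A minimal sketch of the setup step described above (the model URL comes from the record; using Python's standard library for the download is just one option):

        import os
        import urllib.request

        # Create the folder layout the trained model expects, then fetch the model file.
        save_dir = os.path.join("save", "47")
        os.makedirs(save_dir, exist_ok=True)

        model_url = "https://topicvecmodels.s3.eu-west-2.amazonaws.com/save/47/model"
        urllib.request.urlretrieve(model_url, os.path.join(save_dir, "model"))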

  • Open Access
    Authors: 
    Lixing Zhu; Pergola, Gabriele; Gui, Lin; Deyu Zhou; Yulan He;
    Publisher: Zenodo
    Project: UKRI | Turing AI Fellowship: Eve... (EP/V020579/1), UKRI | Twenty20Insight (EP/T017112/1), UKRI | Learning from COVID-19: A... (EP/V048597/1)

    Transformer encoder-decoder for emotion detection in dialogues

  • Open Access
    Authors: 
    Xingwei Tan; Pergola, Gabriele; Yulan He;
    Publisher: Zenodo
    Project: UKRI | Learning from COVID-19: A... (EP/V048597/1), UKRI | Twenty20Insight (EP/T017112/1), UKRI | Turing AI Fellowship: Eve... (EP/V020579/1)

    This is the code for the EMNLP 2021 main-track long paper "Extracting Event Temporal Relations via Hyperbolic Geometry". The paper proposes two hyperbolic approaches to event temporal relation extraction, an event-centric natural language understanding task.
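    The record does not include the paper's implementation details; as an illustration of the basic primitive that hyperbolic approaches of this kind build on, here is a standard Poincaré-ball distance in PyTorch (the function, example values and use of PyTorch are assumptions, not taken from the paper):

        import torch

        def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
            # Geodesic distance between points u and v inside the unit Poincare ball:
            # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
            sq_u = torch.clamp((u * u).sum(dim=-1), 0.0, 1.0 - eps)
            sq_v = torch.clamp((v * v).sum(dim=-1), 0.0, 1.0 - eps)
            sq_diff = ((u - v) ** 2).sum(dim=-1)
            return torch.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

        # Example: embeddings of two events; smaller distances suggest closer relations.
        print(poincare_distance(torch.tensor([0.1, 0.2]), torch.tensor([0.4, -0.3])))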

  • Open Access
    Authors: 
    van Strien, Daniel;
    Publisher: Zenodo
    Project: UKRI | Living with Machines (AH/S01179X/1)

    The dataset contains images derived from Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/): "[The Newspaper Navigator dataset] consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project." (source: https://news-navigator.labs.loc.gov/)
    One of these categories is 'photographs'. This dataset contains a sample of these images with additional labels indicating whether the photograph has one or more of the following labels: "human", "animal", "human-structure" and "landscape".
    The data is organised as follows:
    - The images themselves can be found in `images.zip`.
    - `newspaper-navigator-sample-metadata.csv` contains metadata about each image drawn from the Newspaper Navigator dataset.
    - `multi_label.csv` contains the labels for the images as a CSV file.
    - `annotations.csv` contains the labels for the images with additional metadata.
    This dataset was created for use in an under-review Programming Historian tutorial (http://programminghistorian.github.io/ph-submissions/lessons/computer-vision-deep-learning-pt2). The primary aim of the data was to provide a realistic example dataset for teaching computer vision for working with digitised heritage material. The data is shared here since it may be useful for others. This data documentation is a work in progress and will be updated when the Programming Historian tutorial is released publicly.
    The metadata CSV file contains the following columns: filepath, pub_date, page_seq_num, edition_seq_num, batch, lccn, box, score, ocr, place_of_publication, geographic_coverage, name, publisher, url, page_url, month, year, iiif_url.
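    A minimal sketch of loading the files listed above with pandas; the join key is an assumption (the metadata file lists a filepath column, but the structure of multi_label.csv is not documented in this record), so check the actual headers first:

        import pandas as pd

        # File names are taken from the dataset description above.
        metadata = pd.read_csv("newspaper-navigator-sample-metadata.csv")
        labels = pd.read_csv("multi_label.csv")

        # Hypothetical join on 'filepath'; adjust to whatever key the label file actually uses.
        merged = labels.merge(metadata, on="filepath", how="left")
        print(merged[["filepath", "pub_date", "place_of_publication"]].head())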

  • Research software . 2021
    Open Access
    Authors: 
    Joseph Padfield;
    Publisher: Zenodo
    Project: UKRI | Practical applications of... (AH/T011084/1), EC | SSHOC (823782), EC | IPERION HS (871034)

    This code demonstrates generic public examples of a Simple IIIF Discovery system, based on tools used within the National Gallery to provide access to images from multiple institutions and present them together in IIIF-compatible viewers. The system is based on a website requesting the details of IIIF images from a Simple IIIF Discovery end-point via a simple keyword search. The website does not need to understand the complexities of any underlying APIs, just the simple structure of the results returned by the end-point; it then only needs to reformat the IIIF results and feed them into an IIIF-based viewer of choice. This version includes a number of updates to the original demonstrator related to improving the user interface, including a toggle option to jump between IIIF viewers, and an updated administration process for creating new end-points and related web pages, which is now all achieved via simple JSON config files. A working demo of this system can be seen at: https://research.ng-london.org.uk/ss-iiif
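    As an illustration of the flow just described (keyword search against an end-point, then reformatting the results for a viewer), here is a rough sketch; the end-point URL, query parameter and JSON field names are hypothetical placeholders, as the record does not document the actual API:

        import requests

        # Hypothetical Simple IIIF Discovery end-point; the real URL and response
        # shape are defined by the deployment, not by this record.
        ENDPOINT = "https://example.org/simple-iiif-discovery/search"

        resp = requests.get(ENDPOINT, params={"q": "sunflowers"}, timeout=30)
        resp.raise_for_status()

        # Reformat the results into IIIF Image API URLs that an IIIF-compatible
        # viewer (Mirador, OpenSeadragon, etc.) can load.
        image_urls = [item["iiif_image"] for item in resp.json().get("results", [])]
        print(image_urls[:5])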

  • Open Access English
    Authors: 
    Ardanuy, Mariona Coll; Beelen, Kaspar; Lawrence, Jon; McDonough, Katherine; Nanni, Federico; Rhodes, Joshua; Tolfo, Giorgia; Wilson, Daniel C.S.;
    Publisher: Zenodo
    Project: UKRI | Living with Machines (AH/S01179X/1)

    Supplementary material for the station-to-station GitHub repository, containing the underlying code and materials for the paper 'Station to Station: Linking and Enriching Historical British Railway Data', accepted at CHR2021 (Computational Humanities Research). Mariona Coll Ardanuy, Kaspar Beelen, Jon Lawrence, Katherine McDonough, Federico Nanni, Joshua Rhodes, Giorgia Tolfo, and Daniel C.S. Wilson. "Station to Station: Linking and Enriching Historical British Railway Data." In Computational Humanities Research (CHR2021). 2021.

  • Open Access English
    Authors: 
    van Strien, Daniel;
    Publisher: Zenodo
    Project: UKRI | Living with Machines (AH/S01179X/1)

    Model description
    This model is intended to predict, from the title of a book, whether it is 'fiction' or 'non-fiction'. It was trained on data created from the Digitised printed books (18th-19th Century) book collection. The datasets in this collection are derived from 49,455 digitised books (65,227 volumes), mainly from the 19th century. The collection is dominated by English-language books and includes books in several other languages in much smaller numbers. The model was originally developed for use as part of the Living with Machines project to 'segment' this large dataset of books into different categories based on a 'crude' classification of genre, i.e. whether the title was `fiction` or `non-fiction`. The model's training data (discussed more below) primarily consists of 19th-century book titles from the British Library Digitised printed books (18th-19th century) collection. These books have been catalogued according to British Library cataloguing practices. The model is likely to perform worse on book titles from earlier or later periods. While the model is multilingual, non-English book titles appear much less frequently in the training data.
    How to use
    To use this within fastai, first install version 2 of the fastai library, following the documentation instructions. Once you have fastai installed, you can use the model as follows:
        from fastai.text.all import load_learner
        learn = load_learner("20210928-model.pkl")
        learn.predict("Oliver Twist")
    Limitations and bias
    The model was developed based on data from the British Library's Digitised printed books (18th-19th Century) collection. This dataset is not representative of books from the period covered, with biases towards certain types (e.g. travel) and a likely absence of books that were difficult to digitise. The formatting of the British Library books corpus titles may differ from other collections, resulting in worse performance on those collections; it is recommended to evaluate the model's performance before applying it to your own data. This model will likely not perform well on contemporary book titles without further fine-tuning.
    Training data
    The training data for this model will be available from the British Library Research Repository shortly. The training data was created using the Zooniverse platform; British Library cataloguers carried out the majority of the annotations used as training data. More information on the process of creating the training data will be available soon.
    Training procedure
    Model training was carried out using the fastai library, version 2.5.2. The notebook used for training the model will be available at: https://github.com/Living-with-machines/bl-books-genre-prediction
    Eval result
    The model was evaluated on a held-out test set:
                      precision  recall  f1-score  support
        Fiction            0.91    0.88      0.90      296
        Non-fiction        0.94    0.95      0.95      554
        accuracy                             0.93      850
        macro avg          0.93    0.92      0.92      850
        weighted avg       0.93    0.93      0.93      850

  • Open Access English
    Authors: 
    Joseph Padfield;
    Publisher: Zenodo
    Project: EC | SSHOC (823782), UKRI | Practical applications of... (AH/T011084/1), UKRI | ARTICT | Art Through the ... (EP/R032785/1), UKRI | Persistent Identifiers as... (AH/T011092/1)

    This is a simple set of processes for creating a standard set of webpages based on a simple set of JSON files. This project is intended to work alongside other projects to provide a simple way of creating a set of consistent webpages, which can be delivered as part of your own GitHub project using GitHub Pages. It has been extended to allow more complex features such as presenting IIIF (https://iiif.io) viewers, timelines, and ordered lists and galleries. Various updates to the related JavaScript libraries have been added, along with an option to use the whole system to create dynamic as well as static websites. For more detail and the most current version of the code see: https://github.com/jpadfield/simple-site
