Advanced search in Research products
The following results are related to Digital Humanities and Cultural Heritage. Interested in viewing more results? Visit OpenAIRE - Explore.
138,392 Research products, page 1 of 13,840

  • Digital Humanities and Cultural Heritage
  • Research data
  • Research software
  • Other research products
  • 2018-2022
  • Open Access
  • Dataset

Showing 10 results per page, sorted by date (most recent).
  • Open Access
    Authors: 
    Haneca, K.; Ervynck, A.;
    Publisher: Zenodo

    This Zenodo record contains the source data on which the analyses in the chapter "Kenniswinst archeologieregelgeving 2016-2021" are based. This chapter is part of the research report "Ameels V, Carpentier F, De Ketelaere S, Ervynck A, Geuens J, Haneca K, Pieters M, Van Looveren J & Verhelst A 2023: Evaluatie archeologie 2016-2021, Onderzoeksrapport Agentschap Onroerend Erfgoed, Brussel.": 1. An overview of all submitted final reports (n = 576) and notes with final completion (n = 26) for the period 1 April 2016 through 31 December 2021: eindverslagen_notas_2016_2021_overzicht.csv (comma-separated values), eindverslagen_notas_2016_2021_overzicht.xlsx (Microsoft Excel Open XML Spreadsheet), eindverslagen_notas_2016_2021_overzicht.shp (ESRI shapefile). (An overview of all submitted final reports, from 1 April 2016 to today, can also be downloaded as a GIS file at https://geo.onroerenderfgoed.be/downloads.) 2. The results of the content screening (periodisation, material categories, ...) of these documents: eindverslagen_notas_2016_2021_screening.csv (comma-separated values), eindverslagen_notas_2016_2021_screening.xlsx (Microsoft Excel Open XML Spreadsheet), eindverslagen_notas_2016_2021_screening.shp (ESRI shapefile)

  • Open Access English
    Authors: 
    Li, Weixuan;
    Publisher: Zenodo

    Abraham Bredius’ seminal work Künstler-Inventare contains over 150 inventories of artists’ possessions in the Dutch Republic. However, this rich source has never been fully transformed into datasets, partially because Bredius was selective in his transcriptions of artists’ inventories with a focus on paintings. To fill this gap and to overcome Bredius’ biases, the Virtual Interior project assembled and transcribed 46 inventories of artists’ and art dealers’ homes in 17th-century Amsterdam. The selection of the inventories was based on three criteria: 1) The artists’ and art dealers’ inventories in my sample were drawn up during an artist’s lifetime or shortly after his death; 2) the inventories listed goods arranged by room; 3) the original documents of the inventories can still be traced in the Amsterdam City Archives. Forty-six inventories were selected and transcribed from the original sources or copied from existing publications. The majority of my samples can be found in the notarial archives (Archive nr. 5075) and the rest in the Archive of the Chamber of Insolvency (Desolate Boedel kamer, Archive nr. 5072), both preserved in the Amsterdam City Archives. Starting with the published inventories, Bart Reuvekamp from our project team traced them back to the original sources in the archive, transcribed centuries-old handwritten pages, and compiled the listed objects in digital form. In this way, our database is able to encompass all belongings present in painters’ workshops, filling in the blanks that Bredius and other authors neglected or chose to leave out – the missing information that once hindered our comprehension of how artists organized their studios. 
    The dataset is organized as follows:
    - 1_Intro_and_data_explanation.xlsx introduces the dataset and explains the columns in the following files.
    - 2_Inventory_list.csv hosts the details of the inventories in this dataset.
    - 3_Inventory_items.csv contains the full transcriptions and categorical data of the objects in the inventories.
    - 4_Inventory_relationships.csv captures all the people mentioned in the inventories in the pre- or postscript and/or as debtors or creditors.
    - 5_Inventory_category_reference.csv provides a reference table for the ‘object_type’ and ‘object_category’ columns in 3_Inventory_items.csv.
    - 6_Inventory_subjectmatter_reference.csv offers a reference table for the ‘painting_subject’ and ‘painting_genre’ columns in 3_Inventory_items.csv.
    NB: This Künstler-Inventare database is a provisional version. Correction and final data processing were not fully finished in Version 1; the transcriptions might contain errors and need to be treated with caution. This project is financed by an NWO Smart Culture - Big Data / Digital Humanities grant.
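    The reference-table layout described above lends itself to a simple lookup join. A minimal sketch in Python: only the ‘object_type’ and ‘object_category’ column names come from the dataset description; the `item_id` and `transcription` columns and the sample rows are hypothetical stand-ins for the real files.

    ```python
    import csv
    from io import StringIO

    # Hypothetical miniatures of 3_Inventory_items.csv and
    # 5_Inventory_category_reference.csv; only object_type/object_category
    # are column names documented in the dataset description.
    items_csv = """item_id,transcription,object_type
    1,een schilderij van een landschap,painting
    2,een eiken tafel,table
    """
    reference_csv = """object_type,object_category
    painting,art
    table,furniture
    """

    def load_reference(f):
        """Map object_type -> object_category from the reference table."""
        return {row["object_type"]: row["object_category"] for row in csv.DictReader(f)}

    def join_items(items_f, reference):
        """Attach the coarse object_category to every inventory item."""
        rows = []
        for row in csv.DictReader(items_f):
            row["object_category"] = reference.get(row["object_type"], "unknown")
            rows.append(row)
        return rows

    reference = load_reference(StringIO(reference_csv.replace("    ", "")))
    items = join_items(StringIO(items_csv.replace("    ", "")), reference)
    print([(r["item_id"], r["object_category"]) for r in items])
    # [('1', 'art'), ('2', 'furniture')]
    ```

    The same lookup works for the subject-matter reference table (6_Inventory_subjectmatter_reference.csv) against the ‘painting_subject’ and ‘painting_genre’ columns.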

  • Open Access
    Authors: 
    Keshav Santhanam, Jon Saad-Falcon, Martin Franz, Omar Khattab, Avirup Sil, Radu Florian, Md Arafat Sultan, Salim Roukos, Matei Zaharia, Christopher Potts;
    Publisher: Zenodo

    XOR TyDi query and document files for running PLAID-ColBERTv2 experiments from "Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking".

  • Open Access
    Authors: 
    Berger, Michael; Bolte, Henrike; Führer, Veronika; Hausleitner, Felix; Hutterer, Sarah; Lüthi, Tim; Nancu, Mihaela; Passoni, Erica; Pataki, Katalin; Schröcksnadel, Sophie; +3 more
    Publisher: Zenodo

    This is ground truth for a vast collection of sermons by Nikolaus von Dinkelsbühl (ca. 1360 to 17 March 1433), translated and reorganised by a German redactor in the 15th century, which had never been edited until now. The manuscript consists of 361 folios of parchment and paper. The text covers various topics such as fasting and other religious practices. As one of the leading intellectuals of his time, Nikolaus von Dinkelsbühl also contributed to the development of the University of Vienna. The manuscript was probably produced in the vicinity of Klosterneuburg in Austria and is still kept there today (shelfmark: Cod. 48). Data collection and ground truth creation: The edition at hand was produced by an international team of researchers from various fields in the context of the Vienna HTR Winter School 2022 with the help of the Transkribus Expert Client. We uploaded the images of the manuscript to the Transkribus platform, applied the line recognition tool, and manually copied the transcribed text lines into the recognised line boxes. Various models were trained with the ground truth (20% of the entire codex) created by the team. Images of Klosterneuburg, Augustiner-Chorherrenstift, Cod. 48 are available at: https://manuscripta.at/diglit/AT5000-48/0001

  • Open Access
    Authors: 
    Berg, Johanna; Aasa, Carl Ollvik; Appelgren Thorell, Björn; Aits, Sonja;
    Publisher: Zenodo

    Electronic health records (EHRs) are a rich source of information for medical research and public health monitoring. Information systems based on EHR data could also assist in patient care and hospital management. However, much of the data in EHRs is in the form of unstructured text, which is difficult to process for analysis. Natural language processing (NLP), a form of artificial intelligence, has the potential to enable automatic extraction of information from EHRs, and several NLP tools adapted to the style of clinical writing have been developed for English and other major languages. In contrast, the development of NLP tools for less widely spoken languages such as Swedish has lagged behind. A major bottleneck in the development of NLP tools is the restricted access to EHRs due to legitimate patient privacy concerns. To overcome this issue, we have created a citizen science platform for collecting artificial Swedish EHRs with the help of Swedish physicians and medical students. These artificial EHRs describe imagined but plausible emergency care patients in a style that closely resembles EHRs used in emergency departments in Sweden. In the pilot phase, we collected a first batch of 50 artificial EHRs, which has passed review by an experienced Swedish emergency care physician. We make this dataset publicly available as the OpenChart-SE corpus (version 1) under an open-source license for the NLP research community. The project is now open for general participation, and Swedish physicians and medical students are invited to submit EHRs on the project website (https://github.com/Aitslab/openchart-se), where additional batches of quality-controlled EHRs will be released periodically.
    Dataset content:
    - OpenChart-SE, version 1 corpus (txt files and dataset.csv): contains 50 artificial EHRs (note that the numbering starts at 5, as charts 1-4 were test cases not suitable for publication). The EHRs are available in two formats: structured in a .csv file and as separate text files for annotation. Flaws in the data were deliberately not cleaned up, so that the corpus simulates what could be encountered when working with data from different EHR systems. All charts were checked for medical validity by a resident in Emergency Medicine at a Swedish hospital before publication.
    - Codebook.xlsx: contains information about each variable used. It is in XLSForm format, which can be re-used in several different applications for data collection.
    - suppl_data_1_openchart-se_form.pdf: the OpenChart-SE mock emergency care EHR form.
    - suppl_data_3_openchart-se_dataexploration.ipynb: a Jupyter notebook with the code and results from the analysis of the OpenChart-SE corpus.
    More details about the project and information on the upcoming preprint accompanying the dataset can be found on the project website (https://github.com/Aitslab/openchart-se).
    Acknowledgements: We thank all citizen scientists for contributing artificial EHRs and all members of our research groups, who provided helpful comments throughout the development of this project. This study was supported by a grant to Science for Life Laboratory from the Knut and Alice Wallenberg (KAW) Foundation (S.A. 2020.0182), which was distributed through the SciLifeLab and KAW National COVID-19 Research Program. The project is conducted in the AI Lund research environment at Lund University.
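    Loading the structured corpus file could look like the following sketch. The `chart_id` and `text` column names and the sample rows are assumptions for illustration, not the dataset's documented schema (Codebook.xlsx is the authoritative reference); only the fact that numbering starts at 5 comes from the description above.

    ```python
    import csv
    from io import StringIO

    # Tiny hypothetical stand-in for dataset.csv; column names are assumed.
    sample = "chart_id,text\n" \
             "5,Patient söker för bröstsmärta sedan i morse.\n" \
             "6,Inkommer med ambulans efter fall i hemmet.\n"

    def load_charts(f):
        """Return {chart_id: chart text}, keeping chart IDs as integers."""
        return {int(row["chart_id"]): row["text"] for row in csv.DictReader(f)}

    charts = load_charts(StringIO(sample))
    # Numbering starts at 5: charts 1-4 were unpublished test cases.
    assert min(charts) >= 5
    print(len(charts))  # 2
    ```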

  • Research data . 2022
    Open Access
    Authors: 
    City of Vancouver;
    Publisher: City of Vancouver Open Data Portal

    Significant changes to the open data catalogue, including new datasets added, datasets renamed or retired, quarterly or annual updates to high-impact datasets, and changes to data structure or definition. Smaller changes, such as adding or editing records or renaming a field in an existing dataset, are not included. Note: This log is published in the interest of transparency into the work of the open data program. You can subscribe to updates for a specific dataset by creating an account on the portal and then clicking the Follow button on the Information tab of any dataset. You can also get updates by subscribing to our email newsletter. Data currency: New records will be added whenever a significant change is made to the open data catalogue.

  • Open Access
    Authors: 
    Dhrangadhariya, Anjani; Müller, Henning;
    Publisher: Zenodo

    This upload contains four main zip files:
    - ds_cto_dict.zip: contains the four distant supervision dictionaries (P: participant.txt; I: intervention.txt and intervetion_syn.txt; O: outcome.txt) generated from clinicaltrials.gov using the methodology described in Distant-CTO (https://aclanthology.org/2022.bionlp-1.34/). These dictionaries were used to create distant supervision labelling functions as described in the 'Labelling sources' subsection of the Methodology. The data was derived from https://clinicaltrials.gov/.
    - handcrafted_dictionaries.zip: contains three files: 1) gender_sexuality.txt, a list of possible genders and sexual orientations found across the web (this list is not yet comprehensive); 2) endpoints_dict.txt, outcome names and the names of questionnaires used to measure outcomes, assembled from PROM questionnaires and PROMs; 3) comparator_dict, a list of idiosyncratic comparator terms such as sham, saline, placebo, etc., compiled from the literature search (this list is not yet comprehensive either).
    - test_ebm_correctedlabels.tsv: EBM-PICO is a widely used dataset with PICO annotations at two levels: span-level (coarse-grained) and entity-level (fine-grained). Span-level annotations encompass the full information about each class, while entity-level annotations divide the PICO classes into fine-grained subclasses. For example, the coarse-grained Participant span is further divided into participant age, gender, condition and sample size of the randomised controlled trial. The dataset comes pre-divided into a training set (n=4,933) annotated through crowd-sourcing and an expert-annotated gold test set (n=191) for evaluation. The EBM-PICO annotation guidelines caution about variable annotation quality. Abaho et al. developed a framework to post-hoc correct EBM-PICO outcome annotation inconsistencies, and Lee et al. studied annotation span disagreements, suggesting variability across the annotators. Low annotation quality in the training dataset is excusable, but errors in the test set can lead to faulty evaluation of downstream ML methods. We evaluated 1% of the EBM-PICO training set tokens to gauge the possible reasons for fine-grained labelling errors and used this exercise to conduct an error-focused PICO re-annotation of the EBM-PICO gold test set. The file test_ebm_correctedlabels.tsv holds this error-corrected gold test set, which can be used as a complementary evaluation set alongside the original EBM-PICO test set.
    - error_analysis.zip: contains three .tsv files, one per PICO class, identifying possible errors in about 1% (about 12,962 tokens) of the EBM-PICO training set.
    Objective: PICO (Participants, Interventions, Comparators, Outcomes) analysis is vital but time-consuming for conducting systematic reviews (SRs). Supervised machine learning can help automate it, but a lack of large annotated corpora limits the quality of automated PICO recognition systems. The largest currently available PICO corpus is manually annotated, an approach that is often too expensive for the scientific community to apply. Depending on the specific SR question, PICO criteria are extended to PICOC (C: Context), PICOT (T: Timeframe), and PIBOSO (B: Background, S: Study design, O: Other), meaning static hand-labelled corpora need to undergo costly re-annotation as the downstream requirements change. We aim to test the feasibility of designing a weak supervision system to extract these entities without hand-labelled data.
    Methodology: We decompose PICO spans into their constituent entities and re-purpose multiple medical and non-medical ontologies and expert-generated rules to obtain multiple noisy labels for these entities. The labels obtained from these sources are then aggregated using simple majority voting and generative modelling approaches. The resulting programmatic labels are used as weak signals to train a weakly-supervised discriminative model, and we observe the performance changes. We also explore mistakes in the currently available PICO corpus that could have led to inaccurate evaluation of several automation methods.
    Results: We present Weak-PICO, a weakly-supervised PICO entity recognition approach using medical and non-medical ontologies, dictionaries and expert-generated rules. Our approach does not use hand-labelled data.
    Conclusion: Weak supervision using Weak-PICO for PICO entity recognition shows encouraging results, and the approach can potentially extend readily to more clinical entities.
    All the datasets can be opened with text editors or Google Sheets. The .zip files can be opened with the archive utility on macOS or the unzip utility on Linux; all current Windows and Apple operating systems support ZIP files without additional third-party software.
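    The simple-majority aggregation step mentioned in the Methodology can be sketched as follows. This is a toy illustration, not the authors' code; the convention that -1 marks an abstaining labelling function is an assumption borrowed from common weak-supervision practice.

    ```python
    from collections import Counter

    ABSTAIN = -1  # a labelling function that did not fire for this token

    def majority_vote(votes):
        """Aggregate one token's noisy labels by simple majority.

        votes: labels emitted by the labelling functions for one token.
        Returns ABSTAIN if every function abstained or the top labels tie.
        """
        counts = Counter(v for v in votes if v != ABSTAIN)
        if not counts:
            return ABSTAIN
        top = counts.most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            return ABSTAIN  # unresolved tie
        return top[0][0]

    # Rows = tokens, columns = labelling functions
    # (e.g. 0 = Outcome, 1 = Intervention in this toy example).
    label_matrix = [
        [1, 1, ABSTAIN],              # two functions agree -> 1
        [ABSTAIN, ABSTAIN, ABSTAIN],  # nothing fired -> ABSTAIN
        [0, 1, ABSTAIN],              # tie -> ABSTAIN
    ]
    print([majority_vote(row) for row in label_matrix])  # [1, -1, -1]
    ```

    The generative-modelling alternative named above would instead learn per-function accuracies and weight the votes accordingly; majority voting is the simplest baseline aggregator.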

  • Open Access
    Authors: 
    Pereira, Leonardo Santiago Benitez;
    Publisher: Zenodo

    A collection of 300 support tickets manually labeled for semantic similarity, obtained from an IT support company in the Florianópolis (Brazil) region. Each ticket is represented by an unstructured text field typed by the user who opened the call. The labeling was performed in 2022 by three IT support professionals. The corpus contains tickets in several languages, mainly English, German, Portuguese and Spanish. All Personally Identifiable Information (PII) and other sensitive information were removed and substituted by a tag indicating the original content; for instance, the sentence "this text was written by Leonardo" is converted to "this text was written by [NAME]". The removal was performed in three steps: first, the automated machine-learning-based tool AWS Comprehend PII Removal was used; then a sequence of custom regular expressions was applied; finally, the entire corpus was manually verified.
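    The second, regex-based step of that pipeline might look like the sketch below. The patterns and tags here are illustrative assumptions (the corpus's documented tag is [NAME], produced by the ML step); a real pipeline would use a larger, audited expression set after the AWS Comprehend pass.

    ```python
    import re

    # Illustrative patterns only; not the project's actual expressions.
    PII_PATTERNS = [
        (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
        (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
        (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "[IP]"),
    ]

    def scrub(text):
        """Replace matched PII spans with a tag naming the original content type."""
        for pattern, tag in PII_PATTERNS:
            text = pattern.sub(tag, text)
        return text

    ticket = "Please reset the account for joao@example.com, phone +55 48 9999-0000."
    print(scrub(ticket))
    # Please reset the account for [EMAIL], phone [PHONE].
    ```

    Regexes alone cannot catch free-form names, which is why the pipeline described above brackets this step with an ML-based detector before it and manual verification after it.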

  • Open Access
    Authors: 
    Moens, Jan; De Groote, Koen;
    Publisher: Zenodo

    Appendix to 'MOENS J. & DE GROOTE K. 2022: Ieper - De Meersen. Deel 2. De studie van het leer, Onderzoeksrapporten agentschap Onroerend Erfgoed 248, Brussel': the complete inventory of the leather finds (as an .xlsx file).

  • Open Access German
    Authors: 
    Heitz, Caroline; Stapfer, Regine;
    Publisher: Zenodo

    The variable system was created to compile the 'MET-Pottery-Dataset' within SNSF project No 100011_156205, 'Mobilities, Entanglements and Transformations in Neolithic Societies of the Swiss Plateau (3900-3500 BC)' (the 'MET-project'), conducted at the Institute of Archaeological Sciences, University of Bern between 2014 and 2018 (https://data.snf.ch/grants/grant/156205; https://boris.unibe.ch/77649/). It represents the largest and temporally most highly resolved collection of morphological pottery data for the Central European Neolithic. It offers data on 1,046 ceramic vessels of different styles originating from 44 archaeological features of wetland and dryland sites in the northern Alpine Space and adjacent regions. Most of the archaeological contexts (anthropogenic layers of settlements, pits, and ditches) are dated independently of typology, using dendrochronology or C14 dates. The dataset includes a spreadsheet of nominal and numeric morphological variables, the collection of the vessels' semi-profile silhouettes, and the typological drawings from which all data was collected. Within the MET-project, the data was used to elaborate a new mixed methods research (MMR) methodology for investigating social relations beyond problematic concepts of homogeneous 'archaeological cultures'. It is highly relevant for further methodological morphology-based research on pottery.

Advanced search in Research products
Research products
arrow_drop_down
Searching FieldsTerms
Any field
arrow_drop_down
includes
arrow_drop_down
Include:
The following results are related to Digital Humanities and Cultural Heritage. Are you interested to view more results? Visit OpenAIRE - Explore.
138,392 Research products, page 1 of 13,840
  • Open Access
    Authors: 
    , Haneca; , Ervynck;
    Publisher: Zenodo

    In deze zenodo-record zijn de brongegevens terug te vinden waarop de analyses in het hoofdstuk "Kenniswinst archeologieregelgeving 2016-2021" zijn gebaseerd. Dit hoofdstuk maakt deel uit van het onderzoeksrapport "Ameels V, Carpentier F, De Ketelaere S, Ervynck A, Geuens J, Haneca K, Pieters M, Van Looveren J & Verhelst A 2023: Evaluatie archeologie 2016-2021, Onderzoeksrapport Agentschap Onroerend Erfgoed, Brussel.": 1. Een overzicht van alle ingediende eindverslagen (n = 576) en nota's met eindafwerking (n = 26) over de periode 1 april 2016 t.e.m. 31 december 2021 eindverslagen_notas_2016_2021_overzicht.csv (comma-separated values) eindverslagen_notas_2016_2021_overzicht.xlsx (Microsoft Excel Open XML Spreadsheet) eindverslagen_notas_2016_2021_overzicht.shp (ESRI shapefile) (Een overzicht van alle ingediende eindverslagen - vanaf 1 april 2016 tot vandaag - kan je ook downloaden als GIS-bestand op https://geo.onroerenderfgoed.be/downloads.) 2. De resultaten van de inhoudelijke screening (periodisering, materiaalcategorieën, ...) van deze documenten: eindverslagen_notas_2016_2021_screening.csv (comma-separated values) eindverslagen_notas_2016_2021_screening.xlsx (Microsoft Excel Open XML Spreadsheet) eindverslagen_notas_2016_2021_screening.shp (ESRI shapefile)

  • Open Access English
    Authors: 
    Li, Weixuan;
    Publisher: Zenodo

    Abraham Bredius’ seminal work Künstler-Inventare contains over 150 inventories of artists’ possessions in the Dutch Republic. However, this rich source has never been fully transformed into datasets, partially because Bredius was selective in his transcriptions of artists’ inventories with a focus on paintings. To fill this gap and to overcome Bredius’ biases, the Virtual Interior project assembled and transcribed 46 inventories of artists’ and art dealers’ homes in 17th-century Amsterdam. The selection of the inventories was based on three criteria: 1) The artists’ and art dealers’ inventories in my sample were drawn up during an artist’s lifetime or shortly after his death; 2) the inventories listed goods arranged by room; 3) the original documents of the inventories can still be traced in the Amsterdam City Archives. Forty-six inventories were selected and transcribed from the original sources or copied from existing publications. The majority of my samples can be found in the notarial archives (Archive nr. 5075) and the rest in the Archive of the Chamber of Insolvency (Desolate Boedel kamer, Archive nr. 5072), both preserved in the Amsterdam City Archives. Starting with the published inventories, Bart Reuvekamp from our project team traced them back to the original sources in the archive, transcribed centuries-old handwritten pages, and compiled the listed objects in digital form. In this way, our database is able to encompass all belongings present in painters’ workshops, filling in the blanks that Bredius and other authors neglected or chose to leave out – the missing information that once hindered our comprehension of how artists organized their studios. 
The dataset is organized as follows: 1_Intro_and_data_explanation.xlsx introduces the dataset and explains the columns in the following files 2_Inventory_list.csv hosts the details of the inventories in this dataset 3_Inventory_items.csv contains the full transcriptions and categorical data of the objects in the inventories 4_Inventory_relationships.csv captures all the people mentioned in the inventories in the pre- or postscript and/or as debtors or creditors 5_Inventory_category_reference.csv provides a reference table for the ‘object_type’ and ‘object_category’ columns in 3_Inventory_items.csv 6_Inventory_subjectmatter_reference.csv offers a reference table for the ‘painting_subject’ and ‘painting_genre’ columns in 3_Inventory_items.csv NB: This Künstler-Inventare database is a provisional version. The correction and final data process has not been fully finished in Version 1. The transcriptions might contain errors and need to be treated with caution. This project is financed by NWO Smart Culture - Big Data / Digital Humanities grant

  • Open Access
    Authors: 
    Keshav Santhanam, Jon Saad-Falcon, Martin Franz, Omar Khattab, Avirup Sil, Radu Florian, Md Arafat Sultan, Salim Roukos, Matei Zaharia, Christopher Potts;
    Publisher: Zenodo

    XOR TyDi query and document files for running PLAID-ColBERTv2 experiments from "Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking".

  • Open Access
    Authors: 
    Berger, Michael; Bolte, Henrike; Führer, Veronika; Hausleitner, Felix; Hutterer, Sarah; Lüthi, Tim; Nancu, Mihaela; Passoni, Erica; Pataki, Katalin; Schröcksnadel, Sophie; +3 more
    Publisher: Zenodo

    This is ground truth for the vast collection of sermons of Nikolaus von Dinkelsbühl (ca. 1360 to 17th March 1433), translated and reorganised by a German redactor, from the 15th century has never been edited until now. It consists of 361 folios of parchment and paper. The text speaks about various topics such as fasting and other religious practices. Being one of the leading intellectuals of his time, Nikolaus von Dinkelsbühl also contributed to the development of the University of Vienna. The manuscript was probably produced in the vicinity of Klosterneuburg in Austria and is still kept there today (Shelfmark: Cod. 48). Data collection and ground truth creation: The edition at hand was produced by an international team of researchers from various fields in the context of the Vienna HTR Winter School 2022 with the help of Transkribus Expert Client. We uploaded the images of the manuscript into the Transkribus platform, applied the line recognition tool and manually copied the transcribed text lines into the recognised line boxes. Various models were trained with the ground truth (20% of the entire codex) created by the team. Images of the Klosterneuburg, Augustiner-Chorherrenstift, Cod. 48 are available at: https://manuscripta.at/diglit/AT5000-48/0001

  • Open Access
    Authors: 
    Berg, Johanna; Aasa, Carl Ollvik; Appelgren Thorell, Björn; Aits, Sonja;
    Publisher: Zenodo

    Electronic health records (EHRs) are a rich source of information for medical research and public health monitoring. Information systems based on EHR data could also assist in patient care and hospital management. However, much of the data in EHRs is in the form of unstructured text, which is difficult to process for analysis. Natural language processing (NLP), a form of artificial intelligence, has the potential to enable automatic extraction of information from EHRs and several NLP tools adapted to the style of clinical writing have been developed for English and other major languages. In contrast, the development of NLP tools for less widely spoken languages such as Swedish has lagged behind. A major bottleneck in the development of NLP tools is the restricted access to EHRs due to legitimate patient privacy concerns. To overcome this issue we have generated a citizen science platform for collecting artificial Swedish EHRs with the help of Swedish physicians and medical students. These artificial EHRs describe imagined but plausible emergency care patients in a style that closely resembles EHRs used in emergency departments in Sweden. In the pilot phase, we collected a first batch of 50 artificial EHRs, which has passed review by an experienced Swedish emergency care physician. We make this dataset publicly available as OpenChart-SE corpus (version 1) under an open-source license for the NLP research community. The project is now open for general participation and Swedish physicians and medical students are invited to submit EHRs on the project website (https://github.com/Aitslab/openchart-se), where additional batches of quality-controlled EHRs will be released periodically. Dataset content OpenChart-SE, version 1 corpus (txt files and and dataset.csv) The OpenChart-SE corpus, version 1, contains 50 artificial EHRs (note that the numbering starts with 5 as 1-4 were test cases that were not suitable for publication). 
The EHRs are available in two formats, structured as a .csv file and as separate textfiles for annotation. Note that flaws in the data were not cleaned up so that it simulates what could be encountered when working with data from different EHR systems. All charts have been checked for medical validity by a resident in Emergency Medicine at a Swedish hospital before publication. Codebook.xlsx The codebook contain information about each variable used. It is in XLSForm-format, which can be re-used in several different applications for data collection. suppl_data_1_openchart-se_form.pdf OpenChart-SE mock emergency care EHR form. suppl_data_3_openchart-se_dataexploration.ipynb This jupyter notebook contains the code and results from the analysis of the OpenChart-SE corpus. More details about the project and information on the upcoming preprint accompanying the dataset can be found on the project website (https://github.com/Aitslab/openchart-se). Acknowledgement We thank all citizen scientists for contributing artificial EHRs and all members of our research groups, who provided helpful comments throughout the development of this project. This study was supported by a grant to Science for Life Laboratory from the Knut and Alice Wallenberg (KAW) Foundation (S.A. 2020.0182), which was distributed through the SciLifeLab and KAW National COVID-19 Research Program. The project is conducted in the AI Lund research environment at Lund University.

  • Research data . 2022
    Open Access
    Authors: 
    City of Vancouver;
    Publisher: City of Vancouver Open Data Portal

    ​Significant changes to the open data catalogue, including new datasets added, datasets renamed or retired, quarterly or annual updates to high-impact datasets, changes to data structure or definition. Smaller changes, such as adding or editing records or renaming a field in an existing dataset are not included. NoteThis log is published in the interest of transparency into the work of the open data program. You can subscribe to updates for a specific dataset by creating an account on the portal then clicking on the Follow button on the Information tab of any dataset. You can get updates by subscribing to our email newsletter. Data currency​New records will be added whenever a significant change is made to the open data catalogue.

  • Open Access
    Authors: 
    Dhrangadhariya, Anjani; Müller, Henning;
    Publisher: Zenodo

    This upload contains four main zip files.

    ds_cto_dict.zip: contains the four distant-supervision dictionaries (P: participant.txt; I: intervention.txt and intervetion_syn.txt; O: outcome.txt) generated from clinicaltrials.gov using the methodology described in Distant-CTO (https://aclanthology.org/2022.bionlp-1.34/). These dictionaries were used to create distant-supervision labelling functions, as described in the 'Labelling sources' subsection of the Methodology. The data was derived from https://clinicaltrials.gov/.

    handcrafted_dictionaries.zip: contains three files: 1) gender_sexuality.txt: a list of possible genders and sexual orientations found across the web; the list is not exhaustive. 2) endpoints_dict.txt: outcome names and the names of questionnaires used to measure outcomes, assembled from PROM questionnaires and PROMs. 3) comparator_dict: a list of idiosyncratic comparator terms (e.g. sham, saline, placebo) compiled from a literature search; the list is not exhaustive.

    test_ebm_correctedlabels.tsv: EBM-PICO is a widely used dataset with PICO annotations at two levels: span-level (coarse-grained) and entity-level (fine-grained). Span-level annotations encompass the full information about each class, while entity-level annotations capture finer-grained information, with each PICO class further divided into subclasses. For example, the coarse-grained Participant span is subdivided into participant age, gender, condition and sample size of the randomised controlled trial. The dataset comes pre-divided into a training set (n = 4,933) annotated through crowd-sourcing and an expert-annotated gold test set (n = 191) for evaluation. The EBM-PICO annotation guidelines caution about variable annotation quality. Abaho et al. developed a framework to post-hoc correct inconsistencies in the EBM-PICO outcome annotations, and Lee et al. studied annotation span disagreements, suggesting variability across annotators. Low annotation quality in the training set is excusable, but errors in the test set can lead to faulty evaluation of downstream ML methods. We evaluated 1% of the EBM-PICO training-set tokens to gauge possible reasons for the fine-grained labelling errors and used this exercise to conduct an error-focused PICO re-annotation of the EBM-PICO gold test set. The file test_ebm_correctedlabels.tsv contains the error-corrected EBM-PICO gold test set, which can be used as a complementary evaluation set alongside the original EBM-PICO test set.

    error_analysis.zip: contains three .tsv files, one per PICO class, identifying possible errors in about 1% (about 12,962 tokens) of the EBM-PICO training set.

    Objective: PICO (Participants, Interventions, Comparators, Outcomes) analysis is vital but time-consuming for conducting systematic reviews (SRs). Supervised machine learning can help automate it, but the lack of large annotated corpora limits the quality of automated PICO recognition systems. The largest currently available PICO corpus is manually annotated, an approach often too expensive for the scientific community to apply. Depending on the specific SR question, the PICO criteria are extended to PICOC (C: Context), PICOT (T: Timeframe), or PIBOSO (B: Background, S: Study design, O: Other), meaning that static hand-labelled corpora need to undergo costly re-annotation to meet downstream requirements. We aim to test the feasibility of designing a weak-supervision system that extracts these entities without hand-labelled data.

    Methodology: We decompose PICO spans into their constituent entities and re-purpose multiple medical and non-medical ontologies and expert-generated rules to obtain multiple noisy labels for these entities. The labels obtained from these sources are then aggregated using simple majority voting and generative modelling approaches. The resulting programmatic labels are used as weak signals to train a weakly supervised discriminative model, and we observe the resulting performance changes. We also explore mistakes in the currently available EBM-PICO corpus that could have led to inaccurate evaluation of several automation methods.

    Results: We present Weak-PICO, a weakly supervised PICO entity recognition approach using medical and non-medical ontologies, dictionaries and expert-generated rules. Our approach does not use hand-labelled data.

    Conclusion: Weak supervision using Weak-PICO for PICO entity recognition shows encouraging results, and the approach can readily extend to more clinical entities.

    All the datasets can be opened with a text editor or Google Sheets. The .zip files can be opened with Archive Utility on macOS or the unzip utility on Linux. (All Windows and Apple operating systems support ZIP files without additional third-party software.)
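The combination described above (dictionary-derived labelling functions whose noisy votes are aggregated by simple majority voting) can be sketched as follows. This is an illustrative sketch, not the authors' code: the mini-dictionaries, label ids and tie-handling policy are assumptions standing in for the released participant.txt, intervention.txt and outcome.txt dictionaries.

```python
from collections import Counter

# Label ids (assumed for illustration): abstain / Participant / Intervention / Outcome
ABSTAIN, P, I, O = -1, 0, 1, 2

# Hypothetical mini-dictionaries standing in for the released dictionary files
participant_terms = {"patients", "adults", "women"}
intervention_terms = {"aspirin", "placebo", "metformin"}
outcome_terms = {"mortality", "pain", "survival"}

def make_dict_lf(terms, label):
    """Build a labelling function that votes `label` for tokens found in `terms`."""
    return lambda token: label if token.lower() in terms else ABSTAIN

lfs = [
    make_dict_lf(participant_terms, P),
    make_dict_lf(intervention_terms, I),
    make_dict_lf(outcome_terms, O),
]

def majority_vote(token):
    """Aggregate labelling-function votes for one token; no votes or a tie -> ABSTAIN."""
    votes = [v for v in (lf(token) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    top = Counter(votes).most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return ABSTAIN  # tied vote: abstain rather than guess
    return top[0][0]

tokens = "adults received aspirin to reduce mortality".split()
labels = [majority_vote(t) for t in tokens]
```

In the full approach these programmatic labels are also aggregated with a generative model, which weights each labelling source by its estimated accuracy instead of counting every vote equally.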

  • Open Access
    Authors: 
    Pereira, Leonardo Santiago Benitez;
    Publisher: Zenodo

    Collection of 300 support tickets manually labeled for semantic similarity, obtained from an IT support company in the Florianópolis (Brazil) region. Each ticket is represented by an unstructured text field typed by the user who opened the call. The labeling process was performed in 2022 by three IT support professionals. The corpus contains tickets in many languages, mainly English, German, Portuguese and Spanish. All Personally Identifiable Information (PII) and sensitive information were removed and substituted by a tag indicating the original content; for instance, the sentence "this text was written by Leonardo" is converted to "this text was written by [NAME]". The removal was performed in three steps: first, the automated machine-learning-based tool AWS Comprehend PII removal was used; then, a sequence of custom regular expressions was applied; last, the entire corpus was manually verified.
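The second, regular-expression step of such a pipeline can be sketched as below. The patterns and tags ([EMAIL], [PHONE], [IP]) are illustrative assumptions, not the expressions actually applied to this corpus:

```python
import re

# Illustrative regex-based PII scrubbing step; patterns are assumptions,
# not the custom expressions used on the actual corpus.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),   # e-mail addresses
    (re.compile(r"\+?\d[\d ()-]{7,}\d"), "[PHONE]"),       # phone-like digit runs
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "[IP]"),  # IPv4 addresses
]

def scrub(text: str) -> str:
    """Replace each matched PII span with a tag indicating the original content type."""
    for pattern, tag in PII_PATTERNS:
        text = pattern.sub(tag, text)
    return text
```

Regex passes like this catch well-structured identifiers that a statistical NER tool may miss, which is why they complement the automated first step; the final manual pass then covers what both automated steps overlook.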

  • Open Access
    Authors: 
    Moens, Jan; De Groote, Koen;
    Publisher: Zenodo

    Appendix to 'MOENS J. & DE GROOTE K. 2022: Ieper - De Meersen. Deel 2. De studie van het leer, Onderzoeksrapporten agentschap Onroerend Erfgoed 248, Brussel': - complete inventory of the leather finds (as an .xlsx file)

  • Open Access German
    Authors: 
    Heitz, Caroline; Stapfer, Regine;
    Publisher: Zenodo

    The variable system was created to compile the 'MET-Pottery-Dataset' within SNSF project No 100011_156205, 'Mobilities, Entanglements and Transformations in Neolithic Societies of the Swiss Plateau (3900-3500 BC)' ('MET project' for short), conducted at the Institute of Archaeological Sciences, University of Bern, between 2014 and 2018 (https://data.snf.ch/grants/grant/156205; https://boris.unibe.ch/77649/). The dataset represents the largest and temporally most highly resolved collection of morphological pottery data for the Central European Neolithic. It offers data on 1,046 ceramic vessels of different styles originating from 44 archaeological features of wetland and dryland sites in the northern Alpine Space and adjacent regions. Most of the archaeological contexts (anthropogenic layers of settlements, pits, and ditches) are dated independently of typology, using dendrochronology or C14 dates. The dataset includes a spreadsheet of nominal and numeric morphological variables, the collection of the vessels' semi-profile silhouettes, and the typological drawings from which all data were collected. Within the MET project, the data were used to elaborate a new mixed-methods research (MMR) methodology for investigating social relations beyond problematic concepts of homogeneous 'archaeological cultures'. The dataset is highly relevant for further methodological, morphology-based research on pottery.