6 Research products, page 1 of 1
- Other research product . 2017 . Open Access . English . Authors: Grill, Pablo; Claassen, Mathias; Rosá, Aiala; Correa, Hernán . Country: Argentina
This paper presents a series of semi-supervised learning algorithms which were designed to classify words or expressions with temporal meanings. The algorithms use a set of pre-tagged temporal expressions and a set of semantic classes which were defined within a research project on the lexical coding of temporal meaning in Spanish. The algorithms in this article are mostly based on word embeddings, but they also make use of other methods. The results obtained strongly depend on the temporal classes considered, but, for some classes, results have reached 90% precision or above. Sociedad Argentina de Informática e Investigación Operativa
- Other research product . 2016 . Open Access . English . Authors: Rio Riande, María Gimena del; González Blanco García, Elena; Martínez Cantón, Clara; Curado Malta, Mariana . Country: Argentina
This paper presents work in progress on the POSTDATA project, which aims to solve the interoperability issues that exist among digital poetry repertoires. These repertoires hold poetry metrics data that is locked in their own databases and is not freely available to be compared or used by intelligent machines that could infer over the data. The POSTDATA project will use Linked Open Data (LOD) technologies to overcome the interoperability problems. POSTDATA is developing a metadata application profile (MAP) for the digital poetry repertoires, a construct that enhances interoperability. This development follows the method for the development of MAPs (Me4MAP). A MAP for the digital poetry repertoires will allow these repertoires to structure their data with a common model in order to publish it as Linked Open Data. This paper presents how this MAP has been developed so far. Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
- Other research product . 2016 . Open Access . English . Authors: Argerich, Luis; Cano, Matías J.; Torre Zaffaroni, Joaquín . Country: Argentina
In this paper we propose the application of feature hashing to create word embeddings for natural language processing. Feature hashing has been used successfully to create document vectors in related tasks like document classification. In this work we show that feature hashing can be applied to obtain word embeddings in time linear in the size of the data. The results show that this algorithm, which does not need training, is able to capture the semantic meaning of words. We compare the results against GloVe, showing that they are similar. As far as we know, this is the first application of feature hashing to the word embeddings problem, and the results indicate this is a scalable technique with practical results for NLP applications. Sociedad Argentina de Informática e Investigación Operativa (SADIO)
- Other research product . 2021 . Open Access . English . Authors: Mechaca C., Ana L.; Marmanillo, Walter G.; Xamena, Eduardo; Ramirez-Orta, Juan; Maguitman, Ana Gabriela; Milios, Evangelos E. . Country: Argentina
Digital Humanities researchers often make use of software that helps them find non-trivial relationships among characters in historical text. Usually, the source texts that contain such information come from OCR-acquired volumes that carry a large number of errors. This work explains the development of a web platform for OCR post-processing and ground-truth generation. This platform employs machine learning to predict the correct texts accurately from noisy OCR strings. The method used for this task involves transformers for character-based denoising language models. An active learning workflow is proposed, as the users can feed their corrections to the platform, generating new annotated data for re-training the underlying machine learning correction models. Sociedad Argentina de Informática e Investigación Operativa
- Other research product . 2019 . Open Access . English . Authors: Xamena, Eduardo; Marmanillo, Walter Gabriel; Mechaca, Ana Lidia . Country: Argentina
Large amounts of ancient documents regarding Argentinian history have become available in recent years. This makes it possible to find interesting and useful aggregated information. This work proposes the application of Natural Language Processing, Text Mining and Visualization tools over Argentinian ancient document repositories. Conceptual maps and entity networks make up the first target of this preliminary paper. The first step is the normalization of OCR-acquired books of General Güemes. Exploratory analyses reveal the presence of manifold spelling errors, due to the OCR acquisition process of the volumes. We propose smart automatic ways of overcoming this issue in the process of normalization. In addition, a first topic landscape of a subset of volumes is obtained and analysed via Topic Modelling tools. Sociedad Argentina de Informática e Investigación Operativa
- Other research product . 2015 . Open Access . English . Authors: Garciarena Ucelay, María José; Villegas, María Paula; Cagnina, Leticia; Errecalde, Marcelo Luis . Country: Argentina
Author Profiling is the task of predicting characteristics of the author of a text, such as age, gender, personality, native language, etc. This is a task of growing importance due to its potential applications in security, crime detection and marketing, among others. An interesting point is to study the robustness of a classifier when it is trained on one dataset and tested on others with different characteristics. This is commonly called cross domain experimentation. Although various cross domain studies have been done for English-language datasets, such work has only recently begun for Spanish. In this context, this work presents a study of cross domain classification for the author profiling task in Spanish. The experimental results showed that, using corpora with different levels of formality, we can obtain robust classifiers for the author profiling task in Spanish. Red de Universidades con Carreras en Informática (RedUNCI) XII Workshop Bases de Datos y Minería de Datos (WBDDM)