Publication . Article . 2021

Semantic Data Set Construction from Human Clustering and Spatial Arrangement

Olga Majewska; Diana McCarthy; Jasper J. F. van den Bosch; Nikolaus Kriegeskorte; Ivan Vulić; Anna Korhonen
Open Access
English
Published: 11 Oct 2021
Journal: Computational Linguistics, volume 47, issue 1, pages 69-116 (ISSN: 0891-2017, eISSN: 1530-9312)
Country: United Kingdom
Abstract

Research into representation learning models of lexical semantics usually utilizes some form of intrinsic evaluation to ensure that the learned representations reflect human semantic judgments. Lexical semantic similarity estimation is a widely used evaluation method, but efforts have typically focused on pairwise judgments of words in isolation, or are limited to specific contexts and lexical stimuli. There are limitations with these approaches that either do not provide any context for judgments, and thereby ignore ambiguity, or provide very specific sentential contexts that cannot then be used to generate a larger lexical resource. Furthermore, similarity between more than two items is not considered. We provide a full description and analysis of our recently proposed methodology for large-scale data set construction that produces a semantic classification of a large sample of verbs in the first phase, as well as multi-way similarity judgments made within the resultant semantic classes in the second phase. The methodology uses a spatial multi-arrangement approach proposed in the field of cognitive neuroscience for capturing multi-way similarity judgments of visual stimuli. We have adapted this method to handle polysemous linguistic stimuli and much larger samples than previous work. We specifically target verbs, but the method can equally be applied to other parts of speech. We perform cluster analysis on the data from the first phase and demonstrate how this might be useful in the construction of a comprehensive verb resource. We also analyze the semantic information captured by the second phase and discuss the potential of the spatially induced similarity judgments to better reflect human notions of word similarity. We demonstrate how the resultant data set can be used for fine-grained analyses and evaluation of representation learning models on the intrinsic tasks of semantic clustering and semantic similarity.
In particular, we find that stronger static word embedding methods still outperform lexical representations emerging from more recent pre-training methods, both on word-level similarity and clustering. Moreover, thanks to the data set’s vast coverage, we are able to compare the benefits of specializing vector representations for a particular type of external knowledge by evaluating FrameNet- and VerbNet-retrofitted models on specific semantic domains such as “Heat” or “Motion.”
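
The intrinsic similarity evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single 2D arrangement of words, converts it into pairwise Euclidean distances, and correlates those with cosine distances from a word embedding model using Spearman's rank correlation. All names here (`pairwise_distances`, `evaluate`, `toy_coords`, `toy_vectors`) and all data values are hypothetical illustrations, not the authors' code or data; the published methodology aggregates judgments across multiple arrangements and annotators, which this sketch does not attempt to reproduce.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Euclidean distance for every unordered word pair in one arrangement."""
    return {
        (a, b): math.dist(coords[a], coords[b])
        for a, b in combinations(sorted(coords), 2)
    }


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(coords, vectors):
    """Correlate human arrangement distances with model cosine distances."""
    human = pairwise_distances(coords)
    pairs = sorted(human)
    human_d = [human[p] for p in pairs]
    model_d = [1 - cosine(vectors[a], vectors[b]) for a, b in pairs]
    return spearman(human_d, model_d)


# Hypothetical toy data: where one annotator placed four verbs in the
# arena, and made-up 2-dimensional "embeddings" for the same verbs.
toy_coords = {
    "boil": (0.0, 0.0), "simmer": (1.0, 0.0),
    "run": (10.0, 10.0), "walk": (11.0, 10.0),
}
toy_vectors = {
    "boil": (1.0, 0.1), "simmer": (0.9, 0.2),
    "run": (0.1, 1.0), "walk": (0.2, 0.9),
}

print(f"Spearman correlation: {evaluate(toy_coords, toy_vectors):.3f}")
```

A higher correlation indicates that the model's similarity ranking of word pairs agrees more closely with the distances implied by the human arrangement. Rank correlation is used here because only the relative ordering of the pairs is compared, not the raw distance scale.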

Subjects by Vocabulary

Microsoft Academic Graph classification: Lexical semantics, Feature learning, Natural language processing, Semantic data model, Cluster analysis, Computer science, Set (abstract data type), Artificial intelligence

Subjects

Artificial Intelligence, Computer Science Applications, Linguistics and Language, Language and Linguistics

Funded by
EC| LEXICAL
Project
LEXICAL
Lexical Acquisition Across Languages
  • Funder: European Commission (EC)
  • Project Code: 648909
  • Funding stream: H2020 | ERC | ERC-COG
Validated by funder
Related to Research communities
Digital Humanities and Cultural Heritage
Download from
Computational Linguistics
Article
License: cc-by-nc-nd
Providers: UnpayWall