This study applies relative entropy to a naturalistic large-scale corpus to quantify the differences among L2 (second language) learners at different proficiency levels. We chose lemmas, tokens, POS trigrams, and conjunctions to represent lexicon and grammar, and used relative entropy to detect patterns of language proficiency development across L2 groups. The results show that the information-distribution discrimination of lexical and grammatical differences increases steadily from lower-level to higher-level L2 learners. This result is consistent with the assumption that, in the course of second language acquisition, L2 learners develop towards more complex and diverse language use. This study also applies time-series statistical methods to the data on L2 differences yielded by traditional frequency-based methods applied to the same corpus, for comparison with the relative-entropy results. The results from the traditional methods, however, rarely show any regularity. Compared with the algorithms of traditional approaches, relative entropy performs much better at detecting L2 proficiency development. In this sense, we have developed an effective and practical algorithm for stably detecting and predicting developments in L2 learners’ language proficiency.
Funder: H2020 European Research Council
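The abstract does not spell out the computation, but relative entropy between two corpora is standardly the Kullback–Leibler divergence between their feature frequency distributions. A minimal sketch, assuming additive smoothing to handle features attested in only one group (the token counts below are invented for illustration):

```python
import math
from collections import Counter

def relative_entropy(p_counts, q_counts, smoothing=1e-9):
    """Kullback-Leibler divergence D(P || Q) between two frequency
    distributions, with tiny additive smoothing so that features seen
    in only one corpus do not yield infinite divergence."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(vocab)
    q_total = sum(q_counts.values()) + smoothing * len(vocab)
    d = 0.0
    for w in vocab:
        p = (p_counts.get(w, 0) + smoothing) / p_total
        q = (q_counts.get(w, 0) + smoothing) / q_total
        d += p * math.log2(p / q)
    return d

# Hypothetical token counts for a lower-level and a higher-level L2 group.
low = Counter({"go": 10, "good": 8, "very": 7, "thing": 5})
high = Counter({"go": 4, "proceed": 6, "notably": 5, "aspect": 7, "good": 3})
print(relative_entropy(low, high))
```

The same divergence can be computed over lemma, POS-trigram, or conjunction counts; a larger value indicates a greater information-distribution difference between the two groups.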
Publisher: Springer Science and Business Media LLC
Project: EC | WIDE (742545)
Abstract Scientific writings, as one essential part of human culture, have evolved over centuries into their current form. Knowing how scientific writings evolved is particularly helpful in understanding how trends in scientific culture developed. It also allows us to better understand how scientific culture was interwoven with human culture generally. The availability of massive digitized texts and the progress in computational technologies today provide us with a convenient and credible way to discern the evolutionary patterns in scientific writings by examining diachronic linguistic changes. The linguistic changes in scientific writings reflect the genre shifts that accompanied historical changes in science and scientific writing. This study investigates a general evolutionary linguistic pattern in scientific writings by merging two credible computational methods: relative entropy, and word-embedding-based concreteness and imageability. It thus creates a novel quantitative methodology and applies it to the examination of diachronic changes in the Philosophical Transactions of the Royal Society (PTRS, 1665–1869). The data from the two computational approaches can be mapped onto each other well, supporting the argument that this journal followed an evolutionary trend of increasing professionalization and specialization. However, they also show that language use in this journal was greatly influenced by historical events and other socio-cultural factors. This study, as a “culturomic” approach, demonstrates that the linguistic evolutionary patterns in scientific discourse were repeatedly interrupted by external factors, even though this scientific discourse would likely have cumulatively developed into a professional and specialized genre. The approaches proposed by this study can make a substantial contribution to full-text analysis in scientometrics.
Abstract Hyphenated compounds have largely been neglected in the studies of compounding, which have seldom analysed compounds in context. In this study, we argue that hyphen use in compounds is strongly motivated. Hyphenation is used when words form a unit, which reduces the possibility of parsing them into separate units or other forms. The current study adopts a new perspective on contextual factors, namely, which part of speech (PoS) the compound as a whole belongs to and how people correctly parse a compound into a unit. This process can be observed and analysed by considering examples. This study therefore holds that hyphenation might have gradually become a compounding technique that differs from general compounding principles. To better understand hyphenated compounds and the motivation for using hyphenation, we conduct a quantitative investigation into their distribution frequency to explore how English hyphenated compounds have been used over the last 200 years. Diachronic change in the distribution frequency of compounds has seldom been considered. This question is explored by using frequency data obtained from three databases that contain hyphenated compounds. Diachronic analysis shows that the frequencies of tokens and types in hyphenated compounds have been increasing, and that changes in both frequencies follow the S-curve model. Historical evidence shows that hyphenation in compounds, as an orthographic form, does not seem to disappear easily. Familiarity and economy, as suggested in the cognitive studies of compounding, cannot adequately explain this phenomenon. The three databases that we used provide cross-verification suggesting that hyphenation has evolved into a compounding technique. Language users probably unconsciously take advantage of the discriminative learning model to remind themselves that these combinations should be parsed differently. Thus the hyphenation compounding technique facilitates communication efficiency.
Overall, this study significantly enhances our understanding of the nature of compounding, the motivations for using hyphenation, and its cognitive processing.
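The S-curve model invoked above is the standard logistic growth curve used for language change. A minimal sketch, with carrying capacity, growth rate, and inflection year as illustrative placeholders rather than the study's fitted values:

```python
import math

def logistic(t, K=1.0, r=0.05, t0=1900.0):
    """Logistic (S-curve) growth: frequency approaches a ceiling K,
    with growth rate r and inflection point at year t0."""
    return K / (1.0 + math.exp(-r * (t - t0)))

# Slow early growth, fastest change near the inflection year,
# saturation afterwards -- the characteristic S shape.
years = [1800, 1850, 1900, 1950, 2000]
freqs = [logistic(y) for y in years]
```

Fitting such a curve to the token and type frequencies per decade (e.g., by nonlinear least squares) is one way the S-curve claim can be checked against the three databases.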
Pseudowords have long served as key tools in psycholinguistic investigations of the lexicon. A common assumption underlying the use of pseudowords is that they are devoid of meaning: Comparing words and pseudowords may then shed light on how meaningful linguistic elements are processed differently from meaningless sound strings. However, pseudowords may in fact carry meaning. On the basis of a computational model of lexical processing, linear discriminative learning (LDL; Baayen et al., Complexity, 2019, 1–39), we compute numeric vectors representing the semantics of pseudowords. We demonstrate that quantitative measures gauging the semantic neighborhoods of pseudowords predict reaction times in the Massive Auditory Lexical Decision (MALD) database (Tucker et al., 2018). We also show that the model successfully predicts the acoustic durations of pseudowords. Importantly, model predictions hinge on the hypothesis that the mechanisms underlying speech production and comprehension interact. Thus, pseudowords emerge as an outstanding tool for gauging the resonance between production and comprehension. Many pseudowords in the MALD database contain inflectional suffixes. Unlike many contemporary models, LDL captures the semantic commonalities of forms sharing inflectional exponents without using the linguistic construct of morphemes. We discuss methodological and theoretical implications for models of lexical processing and morphological theory. The results of this study, complementing those on real words reported in Baayen et al. (Complexity, 2019, 1–39), thus provide further evidence for the usefulness of LDL both as a cognitive model of the mental lexicon, and as a tool for generating new quantitative measures that are predictive for human lexical processing.
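At its core, LDL estimates a linear mapping from a form matrix to a semantic matrix by least squares; a pseudoword's form vector can then be projected into semantic space. A toy sketch with invented dimensions (the real study uses much larger form-cue and embedding matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 "words" with form vectors (e.g., indicator features
# for sublexical cues) and semantic vectors (e.g., embeddings).
C = rng.standard_normal((5, 8))   # form matrix: words x form cues
S = rng.standard_normal((5, 4))   # semantic matrix: words x dims

# The core LDL operation: a linear mapping F from form to meaning,
# estimated by least squares.
F, *_ = np.linalg.lstsq(C, S, rcond=None)

# A pseudoword's form vector lands somewhere in semantic space,
# where its distances/angles to real words can be measured.
pseudo_form = rng.standard_normal(8)
pseudo_sem = pseudo_form @ F
```

Neighborhood measures such as the correlation or angle between `pseudo_sem` and the rows of `S` are the kind of quantitative predictor the study relates to reaction times and durations.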
Using computational simulations, this work demonstrates that it is possible to learn a systematic relation between words' sound and their meanings. The sound-meaning relation was learned from a corpus of phonologically transcribed child-directed speech by using the linear discriminative learning (LDL) framework (Baayen, Chuang, Shafaei-Bajestan, & Blevins, 2019), which implements linear mappings between words' form vectors and semantic vectors. Presented with the form vectors of 16 nonwords, taken from a study on word learning (Fitneva, Christiansen, & Monaghan, 2009), the network generated the estimated semantic vectors of the nonwords. As half of these nonwords were created to phonologically resemble English nouns and the other half were phonologically similar to English verbs, we assessed whether the estimated semantic vectors for these nonwords reflect this word category difference. In 7 different simulations, linear discriminant analysis (LDA) successfully discriminated between noun-like nonwords and verb-like nonwords, based on their semantic relation to the words in the lexicon. Furthermore, how well LDA categorized a nonword correlated well with a phonological typicality measure (i.e., the degree of its form being noun-like or verb-like) and with children's performance in an entity/action discrimination task. On the one hand, the results suggest that children can infer the implicit meaning of a word directly from its sound. On the other hand, this study shows that nonwords do land in semantic space, such that children can capitalize on their semantic relations with other elements in the lexicon to decide whether a nonword is more likely to denote an entity or an action.
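The discrimination step can be illustrated in miniature. The study uses full linear discriminant analysis; the sketch below substitutes a simpler nearest-centroid discriminant (a special case of LDA under equal spherical covariances), with invented centroids and noise, purely to show the shape of the classification task:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy semantic vectors for nonwords: half scattered around a "noun"
# region of semantic space, half around a "verb" region.
noun_centroid = np.array([1.0, 0.0, 0.0])
verb_centroid = np.array([0.0, 1.0, 0.0])
nouns = noun_centroid + 0.1 * rng.standard_normal((8, 3))
verbs = verb_centroid + 0.1 * rng.standard_normal((8, 3))

def classify(x, c1=noun_centroid, c2=verb_centroid):
    """Nearest-centroid discriminant: a simplified stand-in for the
    LDA used in the study to separate noun-like from verb-like
    nonword semantics."""
    return "noun" if np.linalg.norm(x - c1) < np.linalg.norm(x - c2) else "verb"

accuracy = (sum(classify(x) == "noun" for x in nouns)
            + sum(classify(x) == "verb" for x in verbs)) / 16
```

How confidently a nonword falls on one side of the boundary is the graded quantity that the study correlates with phonological typicality and with children's entity/action judgments.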
This study addresses the question of whether there is anything special about learning a third language, as compared to learning a second language, simply by virtue of the third language being the third language acquired, and independently of the specific properties of that third language. We used computational modeling to explore this question for the learning of a small vocabulary of some 400 words, with English as L1, German or Mandarin as L2, and Mandarin or, alternatively, Dutch as L3. For computational modeling, we made use of the mathematical framework of linear discriminative learning, which we extended with the learning rule of Widrow-Hoff to enable the modeling of incremental learning of the mappings between form and meaning when words' meanings are represented by vectors of real numbers (embeddings) rather than by abstract symbolic units. A series of simulation experiments covering single-language learning, bilingual learning, and finally trilingual learning clarified that, within the framework of discrimination learning, within-language homophones give rise to frailty in comprehension, which in turn gives rise in production to semantic errors in L1 and language intrusions in L2 and L3. Our model correctly predicts production to lag behind comprehension in learning, and it clarified that, within the boundaries of discrimination learning, the properties of the L3 crucially determine whether L3 learning appears to involve a language that is 'dormant' with respect to L1 and L2. Surprisingly, qualitatively different patterns of acquisition of the L3, and of its interactions with L1 and L2, can arise in our simulations without any changes in the mathematics driving learning. Our simulations also show that when words' forms incorporate not only segmental but also suprasegmental information, the nature of the errors that arise in production changes.
In the general discussion, we reflect on the implications of our findings for the question of what is special about multilingualism.
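The Widrow-Hoff extension mentioned above updates the form-to-meaning mapping incrementally, one learning event at a time, instead of solving for it in closed form. A minimal sketch with invented dimensions:

```python
import numpy as np

def widrow_hoff(cues, targets, eta=0.01, epochs=200):
    """Incremental Widrow-Hoff (delta-rule) learning of a linear
    mapping from cue vectors to target (semantic) vectors: the
    weights are nudged by the prediction error after every single
    learning event."""
    n_cues, n_dims = cues.shape[1], targets.shape[1]
    W = np.zeros((n_cues, n_dims))
    for _ in range(epochs):
        for c, s in zip(cues, targets):
            error = s - c @ W           # prediction error for this event
            W += eta * np.outer(c, error)
    return W

# Toy data: 5 "words", 8 form cues, 3 semantic dimensions.
rng = np.random.default_rng(2)
C = rng.standard_normal((5, 8))
S = rng.standard_normal((5, 3))
W = widrow_hoff(C, S)
```

Because learning is trial-by-trial, the order and volume of L1, L2, and L3 exposure can shape the mapping, which is what makes this rule suitable for modeling staged multilingual acquisition.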
Nonwords are often used to clarify how lexical processing takes place in the absence of semantics. This study shows that nonwords are not semantically vacuous. We used Linear Discriminative Learning (Baayen et al., 2019) to estimate the meanings of nonwords in the MALD database (Tucker et al., 2018) from the speech signal. We show that measures gauging nonword semantics significantly improve model fit for both acoustic durations and RTs. Although nonwords do not evoke meanings that afford conscious reflection, they do make contact with the semantic space, and the angles and distances of nonwords with respect to actual words co-determine articulation and lexicality decisions.
The initial stage of language comprehension is a multi-label classification problem. Listeners or readers, presented with an utterance, need to discriminate between the intended words and the tens of thousands of other words they know. We propose to address this problem by pairing a network trained with the learning rule of Rescorla and Wagner (1972) with a second network trained independently with the learning rule of Widrow and Hoff (1960). The first network has to recover from sublexical input features the meanings encoded in the language signal, resulting in a vector of activations over the lexicon. The second network takes this vector as input and further reduces uncertainty about the intended message. Classification performance for a lexicon with 52,000 entries is good. The model also correctly predicts several aspects of human language comprehension. By rejecting the traditional linguistic assumption that language is a (de)compositional system, and by instead espousing a discriminative approach (Ramscar, 2013), a more parsimonious yet highly effective functional characterization of the initial stage of language comprehension is obtained.
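The first network's learning rule can be sketched compactly. Rescorla-Wagner updates cue-outcome association weights by the prediction error on each learning event; the bigram cues and two-word lexicon below are invented to keep the example small (the actual model uses richer sublexical features and 52,000 outcomes):

```python
from collections import defaultdict

def rescorla_wagner(events, eta=0.1, epochs=50):
    """Rescorla-Wagner learning: for every event, each outcome's
    prediction (sum of the weights of the cues present) is compared
    with whether the outcome actually occurred, and the weights of
    the present cues are adjusted by the prediction error."""
    weights = defaultdict(float)              # (cue, outcome) -> weight
    outcomes = {o for _, outs in events for o in outs}
    for _ in range(epochs):
        for cues, outs in events:
            for o in outcomes:
                predicted = sum(weights[(c, o)] for c in cues)
                error = (1.0 if o in outs else 0.0) - predicted
                for c in cues:
                    weights[(c, o)] += eta * error
    return weights

# Hypothetical sublexical cues (letter bigrams) paired with word outcomes.
events = [
    ({"#h", "ha", "an", "nd", "d#"}, {"hand"}),
    ({"#h", "ha", "at", "t#"},       {"hat"}),
]
w = rescorla_wagner(events)
# Cues shared by competing words ("#h", "ha") are discounted, while
# distinctive cues ("an", "at") become strong predictors of their word.
```

The resulting activation vector over the lexicon is what the second, Widrow-Hoff-trained network takes as input to further reduce uncertainty.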