In this post we present two resources: a semantic lexicon and a typological database.
First, we are developing a new corpus tool for Swedish that tags the words in a corpus for semantic fields. In collaboration with Lancaster University, we are creating a Swedish version of the "USAS tags", which were developed in conjunction with the British National Corpus and then further developed for WMatrix, a corpus tool for English. This corpus tool tags each word in a corpus with the semantic fields to which it belongs. The semantic fields are based on the Longman Lexicon of Contemporary English and have a multi-level structure with 21 top-level fields, such as 'Money & Trade' and 'Quantity & Dimensions', which can be subdivided further into more fine-grained fields. The ambition is to create a tool that can search for words in a particular semantic field in a larger body of text, and that can also show which semantic fields are overrepresented in one corpus in comparison with a second corpus. The English tool has been used in research on, e.g., discourse analysis and metaphor. As a starting point, the dictionary that forms the basis of the WMatrix tagger has been automatically translated into Swedish using the freely available English-Swedish online dictionary, the People's dictionary, and an automatic part-of-speech tagger. The procedure has previously been tried for several other European languages (Piao et al. 2016). The semantically tagged dictionary that results from this procedure is largely incomplete and in some cases outright incorrect (due to problems in the English-Swedish dictionary, errors in the wording, or errors in the automatic translation). The dictionary is therefore being checked manually. So far, about 4,500 of 18,000 words have been checked. A demo and a download of the current version are available here.
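To make the idea of comparing semantic fields across two corpora concrete, here is a minimal Python sketch. It is not the tool's actual implementation: the toy lexicon, the function names and the example sentences are all hypothetical, and the log-likelihood statistic is an assumption on our part, chosen because it is a commonly used measure for this kind of corpus comparison.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: word -> semantic field (not real USAS entries).
LEXICON = {
    "krona": "Money & Trade",
    "pris": "Money & Trade",
    "betala": "Money & Trade",
    "meter": "Quantity & Dimensions",
    "stor": "Quantity & Dimensions",
}

def field_counts(tokens):
    """Count tokens per semantic field; words missing from the lexicon are skipped."""
    return Counter(LEXICON[t] for t in tokens if t in LEXICON)

def log_likelihood(a, b, n1, n2):
    """Log-likelihood for a field seen a times in n1 tokens vs. b times in n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in corpus 2
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total

def overrepresented(study, reference):
    """Fields relatively more frequent in `study` than in `reference`, highest score first."""
    c1, c2 = field_counts(study), field_counts(reference)
    n1, n2 = len(study), len(reference)
    scores = []
    for field, a in c1.items():
        b = c2.get(field, 0)
        if a / n1 > b / n2:
            scores.append((field, log_likelihood(a, b, n1, n2)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

study = "pris krona betala stor hus".split()
reference = "meter stor stor hus bil krona".split()
result = overrepresented(study, reference)
print(result)
```

In this toy example only 'Money & Trade' is reported, since it covers three of five tokens in the first text but only one of six in the second. A real comparison would of course need much larger corpora and a full, manually checked lexicon.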
Secondly, we introduce DiACL, the Diachronic Atlas of Comparative Linguistics. This is a database of lexical and typological/morphosyntactic data for historical, comparative and phylogenetic linguistics. It contains data from 500 languages in 18 families, divided into three macro-areas: Eurasia, the Pacific and South America. The database has the following contents: 1) a lexical data set with basic vocabulary, 2) a lexical data set with cultural vocabulary, and 3) typological/morphosyntactic data, covering the main types of word order, clause syntax, and nominal and verbal morphology. The database contains data from contemporary and historical languages and, where possible, reconstructed languages. Data have been collected from lexicons and grammars and through new fieldwork (especially in the Caucasus and South America). All data are referenced to scientifically reliable sources.
Johan Frid, with good help from Anna W Gustafsson and Gerd Carling.