Research in computational linguistics has been carried out at the department (or its precursors) since the early 1960s. A major infrastructural achievement was the Stockholm–Umeå Corpus (SUC), containing one million words with manually corrected part-of-speech tags, the first version of which was developed in 1989–97. Distribution of SUC was outsourced to Språkbanken in 2008, but the corpus is still maintained within the department, and the latest version (SUC 3.0) was released in 2012. Corpora and tools currently distributed include Stagger (a part-of-speech tagger with models for Swedish and Icelandic), Spacos (a system for aligning words in parallel corpora), the Stockholm Internet Corpus (SIC), and SUC-CORE (a subset of SUC annotated for coreference relations between noun phrases). See information about Computational Linguistics resources. The section carries out research related to first-language acquisition, massively parallel corpora, and user-generated content; see information about research in Computational Linguistics.
Research in sign language has been carried out at the department since 1972. Much of the current research focuses on lexicography and corpora for sign language, particularly Swedish Sign Language (SSL). A breakthrough in the representation of sign-language data occurred in the mid-1990s when playback of video on personal computers became possible, replacing videotapes. A related development has been the availability of systems for fine-grained, searchable annotation of videos with signing. These advances have been instrumental in work on the SSL lexicon, which started in 1988, and on the SSL corpus, which started in 2003. The latter includes semi-spontaneous dialogues as well as monologues with narratives and elicitation tasks. Both the SSL lexicon and corpus are distributed by the section; for more information, see Sign Language resources.
As a type K center in Swe-Clarin, we plan to continue and expand upon the collaborations we already have with neighbouring disciplines that deal with primary data in the form of natural language, including the following:
The section for Computational Linguistics is collaborating with the Department of Economics at Stockholm University on a data set consisting of high-school essays (Swedish: gymnasiets nationella prov i svenska) with a rich set of metadata, including the original grading by the teachers as well as an independent, blind re-grading. Previous research at the Department of Economics has shown that the Swedish grading system suffers from discrimination based on social class, gender, ethnicity and age. This is of great interest in political economy, since incorrect grades cause major problems and inefficiency in society, apart from the negative consequences for the individual. Within this collaboration, we developed a system for automatic grading based on machine learning, which could improve efficiency in a practical setting where one seeks to identify essays that may have been incorrectly graded. A spin-off of this work is a collaboration with the project Alla kan skriva, which develops tools for language education, funded by Vinnova, the Swedish Post and Telecom Authority (PTS) and the Swedish Academy.
Together with the National Edition of August Strindberg's Collected Works and the Department of Literature and History of Ideas, the section is collaborating on building a linguistically annotated corpus of Strindberg's literary fiction. This is a continuation of earlier work on the Stockholm University Strindberg Corpus; see information about Computational Linguistics resources.
The section for Sign Language is cooperating with the Institute for Language and Folklore (Institutet för språk och folkminnen) in its support for sign language as a minority language with special status.
Contact: Mats Wirén, mats.wiren (at) ling.su.se