Swedish in a Multilingual Setting, SMS

The CLARIN Knowledge Centre for Swedish in a Multilingual Setting (CLARIN-SMS) offers expertise in linguistic processing of text, especially for Swedish and/or when multiple languages are involved. In addition, CLARIN-SMS offers expertise in the application of language technology to Swedish Sign Language.

CLARIN-SMS is primarily directed at researchers in the humanities and social sciences with a need for analysis, annotation or mining of Swedish or multilingual text, and additionally at researchers with a need for corpora or tools for Swedish Sign Language.

CLARIN-SMS makes corpora and tools for linguistic processing available to researchers in the humanities and social sciences. The resources include monolingual (mainly Swedish) and multilingual corpora across several domains, and tools for basic processing of text, including tokenization, morphological analysis, part-of-speech tagging, syntactic parsing, and named entity recognition. CLARIN-SMS offers special expertise in the following areas:

– Processing of parallel and comparable corpora, including alignment and machine translation
– Cross-linguistically consistent annotation within the framework of Universal Dependencies
– Computation and evaluation of measures of text complexity
– Language technology for Swedish Sign Language

Tools and Resources

Sapis - StilLett API Service
A RESTful web service including tools for measuring text complexity and for text simplification. Sapis user manual (PDF).
https://www.ida.liu.se/~larah03/transmap/Corpus/
A parallel corpus with some 4000 English original sentences from different sources and their Swedish translations.
https://www.ida.liu.se/divisions/hcs/nlplab/resources/swectors/
A set of static Swedish word vectors and the code used for generating them.
https://www.ida.liu.se/divisions/hcs/nlplab/resources/ges/
Gold alignments for 1164 English-Swedish sentence pairs for the purpose of testing word alignment software. Source data from Europarl v.2.
https://www.ling.su.se/english/nlp
Several corpora as well as tools for part-of-speech tagging, word alignment, and more.
Svensk Diakronisk korpus (Swedish Diachronic Corpus)
A corpus of texts covering the time period from Old Swedish to present day, with a wide variety of text types and freely available for download and search.
Contact: Eva Pettersson, eva.pettersson@lingfil.uu.se
SweGram
SWEGRAM is a tool for text analysis in Swedish and English. You can upload one or several texts and annotate them at different linguistic levels with morphological and syntactic information. The annotated texts can then be used to extract statistics about text properties such as text length, number of words, readability measures, parts of speech, and much more.
Contact: Beáta Megyesi, beata.megyesi@lingfil.uu.se
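As an illustration of the kind of readability measure mentioned for SWEGRAM above, the sketch below computes LIX, an index widely used for Swedish text: the average number of words per sentence plus the percentage of words longer than six letters. The source does not say which measures SWEGRAM itself computes, so this is not a description of that tool. The function name `lix` is hypothetical, and the sentence splitting on punctuation is a deliberate simplification.

```python
import re


def lix(text: str) -> float:
    """Return the LIX readability index of `text`.

    LIX = (words / sentences) + 100 * (long words / words),
    where a long word has more than six letters.
    """
    # Simplified sentence segmentation (an assumption of this sketch):
    # split on runs of '.', '!', '?' or ':' and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    # A word is a maximal run of Unicode letters (digits are ignored).
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


sample = "Det här är en mening. Universitetet erbjuder språkteknologiska verktyg."
print(round(lix(sample), 1))  # 9 words, 2 sentences, 4 long words -> 48.9
```

A real pipeline would typically take sentence and token boundaries from a tokenizer rather than from regular expressions, since abbreviations and numbers break the punctuation heuristic used here.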
Universal Dependencies
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 300 contributors producing nearly 200 treebanks in over 100 languages.
Contact: Joakim Nivre, joakim.nivre@lingfil.uu.se
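UD treebanks are distributed in the CoNLL-U format, where each token is one line of ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The following minimal sketch reads one sentence and exposes the three annotation layers named above. The names `Token` and `parse_conllu_sentence` are hypothetical, and the example sentence is hand-written for illustration rather than taken from any treebank.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: str
    form: str
    lemma: str
    upos: str     # universal part-of-speech tag
    feats: dict   # morphological features, e.g. {"Case": "Nom"}
    head: str     # ID of the syntactic head ("0" marks the root)
    deprel: str   # dependency relation to the head


def parse_conllu_sentence(block: str) -> list[Token]:
    """Parse one CoNLL-U sentence block into a list of Token tuples."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 fields, got {len(cols)}: {line!r}")
        tid, form, lemma, upos, _xpos, feats, head, deprel, _deps, _misc = cols
        if "-" in tid or "." in tid:
            continue  # skip multiword-token ranges and empty nodes
        feat_dict = {} if feats == "_" else dict(
            pair.split("=", 1) for pair in feats.split("|")
        )
        tokens.append(Token(tid, form, lemma, upos, feat_dict, head, deprel))
    return tokens


# Hand-written illustrative annotation of "Hon läser boken." (not treebank data).
SAMPLE = "\n".join([
    "# text = Hon läser boken.",
    "\t".join(["1", "Hon", "hon", "PRON", "_", "Case=Nom|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "läser", "läsa", "VERB", "_", "Tense=Pres|VerbForm=Fin", "0", "root", "_", "_"]),
    "\t".join(["3", "boken", "bok", "NOUN", "_", "Definite=Def|Number=Sing", "2", "obj", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])

for tok in parse_conllu_sentence(SAMPLE):
    print(tok.form, tok.upos, tok.head, tok.deprel)
```

Because every UD treebank uses the same ten columns and the same tag inventory, a reader like this one can be applied unchanged to treebanks in other languages, which is the practical payoff of cross-linguistically consistent annotation.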
SOU corpus
This repository contains cleaned and further processed versions of Swedish Government Official Reports (Statens offentliga utredningar, SOU). The documents are based on the HTML versions from Riksdagens öppna data (http://data.riksdagen.se/) and cover the years 1994 to 2020.
Contact: Sara Stymne, sara.stymne@lingfil.uu.se
Data sets for causality recognition
Three data sets of Swedish text annotated for the presence of causality. The sets are annotated with two different tasks in mind, namely causality recognition and causality ranking with respect to a query prompt containing at least a cause or an effect.
Contact: Sara Stymne, sara.stymne@lingfil.uu.se