The CLARIN Knowledge Centre for Swedish in a Multilingual Setting (CLARIN-SMS) offers expertise in linguistic processing of text, especially for Swedish and/or when multiple languages are involved. In addition, CLARIN-SMS offers expertise in the application of language technology to Swedish Sign Language.
CLARIN-SMS is primarily directed at researchers in the humanities and social sciences with a need for analysis, annotation or mining of Swedish or multilingual text, and additionally at researchers with a need for corpora or tools for Swedish Sign Language.
CLARIN-SMS makes resources in the form of tools for linguistic processing and corpora available in service of the humanities and social sciences. The resources include monolingual (mainly Swedish) and multilingual corpora across several domains, and tools for basic processing of text, including tokenization, morphological analysis, part-of-speech tagging, syntactic parsing, and named entity recognition. CLARIN-SMS offers special expertise in the following areas:
– Processing of parallel and comparable corpora, including alignment and machine translation
– Cross-linguistically consistent annotation within the framework of Universal Dependencies
– Computation and evaluation of measures of text complexity
– Language technology for Swedish Sign Language
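As an illustration of what a measure of text complexity can look like, the sketch below computes LIX, a readability index commonly used for Swedish: the average sentence length in words plus the percentage of words longer than six letters. This is a minimal sketch and not the implementation used by any CLARIN-SMS tool. The function name `lix`, the regular-expression tokenization and the sentence splitting on `.`, `!` and `?` are simplifying assumptions made for this example.

```python
import re


def lix(text: str) -> float:
    """Return the LIX readability index of `text` (simplified sketch).

    LIX = (words / sentences) + 100 * (long words / words),
    where a long word has more than six characters.
    """
    # Assumption: a sentence ends at '.', '!' or '?'; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of Unicode word characters (so å, ä, ö are kept;
    # digits and underscores also count as word characters here).
    words = re.findall(r"\w+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


if __name__ == "__main__":
    # Made-up example: 5 words, 2 sentences, 2 long words -> 2.5 + 40.0
    print(lix("Katten sover. Utredningen publicerades igår."))  # 42.5
```

Higher values indicate text that is harder to read. A production tool would need more careful tokenization and sentence segmentation than this sketch provides.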
Tools and Resources
- Sapis - StilLett API Service
A RESTful web service that includes tools for measuring text complexity and for text simplification. Sapis User manual (pdf).
- https://www.ida.liu.se/~larah03/transmap/Corpus/
A parallel corpus with some 4000 English original sentences from different sources and their Swedish translations.
- https://www.ida.liu.se/divisions/hcs/nlplab/resources/swectors/
A set of static Swedish word vectors and the code used for generating them.
- https://www.ida.liu.se/divisions/hcs/nlplab/resources/ges/
Gold alignments for 1164 English-Swedish sentence pairs for the purpose of testing word alignment software. Source data from Europarl v.2.
- https://www.ling.su.se/english/nlp
Several corpora as well as tools for part-of-speech tagging, word alignment and more.
- Svensk Diakronisk korpus (Swedish Diachronic Corpus)
A corpus of texts spanning the period from Old Swedish to the present day. It covers a wide variety of text types and is freely available for download and search.
Contact: Eva Pettersson, eva.pettersson@lingfil.uu.se
- SweGram
SWEGRAM aims to provide a tool for text analysis in Swedish and English. You can upload one or several texts and annotate them at different linguistic levels with morphological and syntactic information. The annotated texts can then be used to extract statistics about the text properties with respect to text length, number of words, readability measures, part-of-speech, and much more.
Contact: Beáta Megyesi, beata.megyesi@lingfil.uu.se
- Universal Dependencies
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 300 contributors producing nearly 200 treebanks in over 100 languages.
Contact: Joakim Nivre, joakim.nivre@lingfil.uu.se
- SOU corpus
This repository contains cleaned and further processed versions of Swedish Government Official Reports - Statens offentliga utredningar (SOU). The documents are based on html versions from Riksdagens öppna data (http://data.riksdagen.se/) and cover the years 1994 to 2020.
Contact: Sara Stymne, sara.stymne@lingfil.uu.se
- Data sets for causality recognition
Three data sets of Swedish text annotated for the presence of causality. The sets are annotated with two different tasks in mind, namely causality recognition and causality ranking with respect to a query prompt containing at least a cause or an effect.
Contact: Sara Stymne, sara.stymne@lingfil.uu.se
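Universal Dependencies treebanks, listed among the resources above, are distributed in the CoNLL-U format: one token per line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines starting with `#`, and a blank line after each sentence. The following minimal sketch shows how such a file could be read with the Python standard library. The names `Token` and `read_conllu` are hypothetical, and the sample sentence and its annotation were written for this example rather than taken from a treebank.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """The subset of CoNLL-U columns kept by this sketch."""
    id: str
    form: str
    lemma: str
    upos: str
    head: str
    deprel: str


def read_conllu(text: str) -> List[List[Token]]:
    """Parse CoNLL-U text into a list of sentences (lists of tokens)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append(Token(cols[0], cols[1], cols[2], cols[3], cols[6], cols[7]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample (assumption: annotation simplified for the example).
SAMPLE = (
    "# text = Hunden sover.\n"
    "1\tHunden\thund\tNOUN\t_\tDefinite=Def|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tsover\tsova\tVERB\t_\tMood=Ind|Tense=Pres\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        for tok in sentence:
            print(tok.id, tok.form, tok.upos, tok.head, tok.deprel)
```

Because the column layout is the same for every language, the same reader works for any UD treebank, which is what makes cross-linguistic comparison practical.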
Participants
CLARIN-SMS is a distributed Knowledge Centre which includes the following partners:
Linköping University, Department of Computer and Information Science. Contact: Lars Ahrenberg, lars.ahrenberg@liu.se
Stockholm University, Department of Linguistics. Contact: Mats Wirén, mats.wiren@ling.su.se
Uppsala University, Department of Linguistics and Philology. Contact: Eva Pettersson, eva.pettersson@lingfil.uu.se
Help Desk Contact
Arne Jönsson, arne.jonsson@liu.se
Publications
Lars Ahrenberg (2015). Converting an English–Swedish Parallel Treebank to Universal Dependencies. Proc. Third International Conference on Dependency Linguistics (DepLing 2015), Association for Computational Linguistics, pages 10–19. ACL Anthology W15-2103.
Marco Kuhlmann and Stephan Oepen (2016). Towards a Catalogue of Linguistic Graph Banks. Computational Linguistics, 42, 4, 819–827. ISSN 0891-2017, E-ISSN 1530-9312.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman (2016). Universal Dependencies v1: A Multilingual Treebank Collection. Proc. Tenth International Conference on Language Resources and Evaluation (LREC 2016).
Robert Östling (2018). Part of Speech Tagging: Shallow or Deep Learning? Northern European Journal of Language Technology, Volume 5, Article 1.
Robert Östling, Carl Börstell, Moa Gärdenfors and Mats Wirén (2017). Universal Dependencies for Swedish Sign Language. Proc. 21st Nordic Conference on Computational Linguistics, pages 303–308. Linköping.
Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, and Sara Stymne (2018). 82 treebanks, 34 models: Universal Dependency Parsing with Multi-Treebank Models. Proc. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 113–123.