The Department of Linguistics at Stockholm University participates in Swe-Clarin through two sections, Computational Linguistics and Sign Language. We are currently pursuing two projects supported by Swe-Clarin.
The first project, started in February 2016, deals with syntactic annotation of the Swedish Sign Language Corpus (SSLC, http://www.ling.su.se/teckenspråksresurser/teckenspråkskorpusar/svensk-teckenspråkskorpus/svensk-teckenspråkskorpus) and the mapping of this annotation to Universal Dependencies (UD, http://universaldependencies.org), a standard for the design of multilingual treebanks which in turn enables the application of multilingual language technology tools. Work on creating SSLC began in 2003. The corpus consists of 24 hours of video containing monologues and dialogues with 42 signers, divided into 300 files, of which 85 are annotated in the latest distribution. All annotation is done in ELAN (https://tla.mpi.nl/tools/tla-tools/elan/) and has so far included glossing of signs, parts of speech for the glosses, and translation into Swedish.
The aim of the new syntactic annotation is to provide an analysis of clause structure. The annotation of clauses is based on semantic and prosodic criteria, the most important semantic criterion being that a clause contains (at least) one predicate with its arguments. (Because sign language has serial verbs, multiple predicates sometimes appear in a sentence. This contrasts with spoken language, where a common criterion for a clause is that it contains at most one predicate, or finite verb, with pseudo-coordination as an exception.) The prosodic criterion is that a clause should correspond to a prosodic unit, but this is not a clear-cut requirement since prosodic structure in sign language, as in spoken language, does not always agree with the syntactic structure. (Prosody in sign language primarily involves non-manual signals such as eye movements, head movements, body posture and sign duration, and serves the same function as the prosody of spoken language, that is, to signal meaning, typically relating to pragmatics.)
So far, about 25 files have been annotated with syntactic structure. In a second step, we are developing principles for mapping this annotation to a syntactic annotation along the lines of UD. Our starting point is the set of syntactic UD categories for Swedish, and further work will clarify to what extent this annotation needs language-specific additions. The work is a collaboration between Computational Linguistics and Sign Language, and has so far resulted in one publication: Börstell, C., Wirén, M., Mesch, J. & Gärdenfors, M. (2016). Towards an Annotation of Syntactic Structure in the Swedish Sign Language Corpus. Proc. 7th Workshop on the Representation and Processing of Sign Languages: Corpus Mining. Paper presented at Language Resources and Evaluation Conference (LREC) (pp. 19-24). Paris: ELRA. http://www.lrec-conf.org/proceedings/lrec2016/index.html.
Figure: The syntactic annotation in ELAN.
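For readers who want to process ELAN annotation with their own scripts, the following is a minimal sketch of reading one tier from an ELAN annotation file (.eaf, an XML format) using only the Python standard library. The sample document, the tier name "Gloss" and the gloss values are invented for illustration and do not reflect the actual tier layout of SSLC; the helper name read_tier is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-made miniature of an .eaf document. Tier name and gloss values are
# invented for illustration; real SSLC files have their own tier layout.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="480"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="950"/>
  </TIME_ORDER>
  <TIER TIER_ID="Gloss">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>PRO1</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>SEE</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tier(root, tier_id):
    """Return (start_ms, end_ms, value) for each time-aligned annotation on a tier."""
    # Map time-slot ids to millisecond values; unaligned slots have no TIME_VALUE.
    times = {}
    for slot in root.iter("TIME_SLOT"):
        raw = slot.get("TIME_VALUE")
        times[slot.get("TIME_SLOT_ID")] = int(raw) if raw is not None else None

    annotations = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            value = ann.findtext("ANNOTATION_VALUE", default="")
            start = times[ann.get("TIME_SLOT_REF1")]
            end = times[ann.get("TIME_SLOT_REF2")]
            annotations.append((start, end, value))
    return annotations


root = ET.fromstring(SAMPLE_EAF)
glosses = read_tier(root, "Gloss")
for start, end, value in glosses:
    print(start, end, value)
```

The sketch only handles time-aligned annotations. Tiers whose annotations refer to a parent annotation instead of to time slots would need additional handling.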
The second project, started in June 2016, aims to construct an annotated corpus of Strindberg's collected works. The project is a collaboration with Litteraturbanken at the University of Gothenburg and the National Edition of August Strindberg's Collected Works at Stockholm University. The National Edition consists of 72 volumes, published between 1981 and 2012, with about 6 million words. Our corpus project is based on electronic versions of the texts provided by Litteraturbanken, with the goal of freely distributing the corpus in three versions:
1. A raw-text version without annotation, with the simplest possible structure for encoding chapters, paragraphs, headings, etc., by using blank lines. This version is intended for those who want to work with the text directly, for example, with their own scripts or with a profiling tool such as Sketch Engine (https://www.sketchengine.co.uk/).
2. A CoNLL version with one word per line and annotation in columns. To begin with, the annotation is based on a standard processing chain of tokenisation, part-of-speech tagging and dependency parsing, with functionality for handling archaic features of the language. This version is intended for those who want to work with the annotated text without having to use the XML version or a search interface.
3. An XML version including an XML schema encoding the text and annotation, which can be packaged with an independent search engine and/or Korp.
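As a rough illustration of how the CoNLL version could be used with one's own scripts, the sketch below reads a column-based file with one word per line into sentences. The column layout (id, form, lemma, pos, head, deprel), the sample sentences and the assumption that sentences are separated by blank lines are illustrative only, since the final format of the corpus is not specified here; the function name read_conll is hypothetical.

```python
# Assumed column layout, for illustration only; the released corpus may differ.
COLUMNS = ("id", "form", "lemma", "pos", "head", "deprel")

# Invented sample data: tab-separated columns, blank line between sentences.
SAMPLE = (
    "1\tJag\tjag\tPRON\t2\tnsubj\n"
    "2\tläser\tläsa\tVERB\t0\troot\n"
    "\n"
    "1\tHan\than\tPRON\t2\tnsubj\n"
    "2\tskriver\tskriva\tVERB\t0\troot\n"
)


def read_conll(text, columns=COLUMNS):
    """Split column-based text into sentences, each a list of token dicts."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Skip comment lines, which some CoNLL variants allow.
            continue
        fields = line.split("\t")
        current.append(dict(zip(columns, fields)))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
for sentence in sentences:
    print(" ".join(f"{tok['form']}/{tok['pos']}" for tok in sentence))
```

All values are kept as strings, so a script that needs numeric token ids or head indices would convert them explicitly.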
In addition to Strindberg's texts from the National Edition, we hope to make the literary commentary available along with the corpus.
The work continues a previous, smaller-scale project (before the advent of Swe-Clarin) which resulted in a corpus of Strindberg's autobiographical works, called the Stockholm University Strindberg Corpus (SUSC). See Nilsson Björkenstam, K., Gustafson Capkova, S. & Wirén, M. (2014). The Stockholm University Strindberg Corpus: Content and Possibilities. In Roland Lysell (Ed.), Strindberg on International Stages / Strindberg in Translation. Cambridge: Cambridge Scholars Publishing. http://su.diva-portal.org/smash/get/diva2:705835/FULLTEXT01.pdf.