Book of Abstracts
December 6
1 Mats Wirén Stockholm University, Sweden
SweLL - an upcoming infrastructure for Swedish as a Second Language
The purpose of SweLL is to set up an infrastructure for the collection, digitization, normalization, and annotation of learner production, as well as to make available a linguistically annotated corpus of approx. 600 L2 learner texts. I will introduce the reasoning behind the metadata and error taxonomy adopted in the project, and give an overview of the tools under development.
______________________________________________________________
2 Jūratė Ruzaitė Vytautas Magnus University, Kaunas (Lithuania)
Development of a Lithuanian learner corpus: A work in progress report
The Lithuanian Learner Corpus (LLC) is a corpus of primarily written learner language, but it also includes a spoken language component. The corpus is currently under construction and consists of raw texts produced by learners of different proficiency levels. Thus the data are not yet linguistically annotated or error-tagged, but this will be done in future work. In my presentation, I will focus on the design of the corpus, the initial steps of data collection, and the metadata used.
______________________________________________________________
3 Ari Huhta University of Jyväskylä (Finland)
Towards a learner corpus of English and Swedish in Finland
The corpus of learner language at the U. of Jyväskylä comprises written performances from Finnish learners of English as a foreign language and immigrant learners of Finnish as a second language. Most of the corpus consists of raw texts; some Finnish texts have been partially annotated. Our plan is to annotate and error-tag more of the corpus in the future. In the presentation, I describe the creation of the corpus, including how the performances were linked with the Common European Framework levels of proficiency.
______________________________________________________________
4 Nives Mikelic Preradovic University of Zagreb
CroLTeC - Croatian Learner Text Corpus
The CroLTeC learner corpus consists of 6,213 essays, of which 1,217 were born-digital, while 4,996 were scanned, transcribed and converted into XML. CroLTeC has a total of 1,054,287 tokens, and the essays have been collected at all six levels of language learning at Croaticum - Center for Croatian as Second and Foreign Language at the Faculty of Humanities and Social Sciences in Zagreb.
It contains essays from 755 students with 36 different mother tongues, the most prominent being Spanish, English, German, Polish, Chinese, French and Arabic.
CroLTeC is available online through the TEITOK platform (Janssen, 2014).
All CroLTeC essays contain metadata about the title, number and type of essay (homework, part of an exam or field class, etc.). The corpus will also be searchable by the age, sex, language proficiency level and mother tongue of the learner.
The corpus is still under construction. The data are being lemmatized and annotated with MSD tags, as well as error-tagged. The error-tagging scheme is partially based on the scheme of Solar, the Slovene developmental corpus (http://eng.slovenscina.eu/korpusi/solar), and on the error coding of the Cambridge Learner Corpus, with some additions made for Croatian.
______________________________________________________________
5 Egon W. Stemle Eurac Research, Italy
Learner Corpus Infrastructure (LCI) @ Eurac Research
Learner corpora form a fundamental basis for a substantial part of the research activities of the Institute for Applied Linguistics. The project aims at enhancing the research potential of the Institute by creating an increasingly efficient infrastructure for the collection, processing and maintenance of learner corpora.
______________________________________________________________
6 Elżbieta Kaczmarska University of Warsaw (Poland)
Towards a learner corpus of Czech for Polish speaking students
With more than 50 new students every year, Czech is the most popular language taught at the Institute of Western and Southern Slavic Studies of the University of Warsaw. It is also the second most popular Slavic language chosen by Polish students (at universities and private schools in Wrocław, Katowice, Sosnowiec, Opole, Racibórz, Poznań and elsewhere). To make the teaching more efficient, we intend to build a corpus of Czech texts written by Polish learners by extending the L1 Polish – L2 Czech subcorpus of CzeSL (Czech as a Second Language), a learner corpus built at Charles University in Prague. This requires not only collecting new texts and cooperating with the CzeSL team, but also applying some subtle changes to the annotation system of CzeSL, considering the common mistakes made by Poles speaking Czech. The paper will present the requirements and needs of users of the future L1 Polish – L2 Czech learner corpus. In conclusion, some remarks about the possibility of using the CzeSL annotation system for Polish as L2 will be presented, with a special focus on the language of Polish repatriates.
______________________________________________________________
7 Inga Znotiņa Riga Stradins University (Riga, Latvia), Liepaja University (Liepaja, Latvia)
Learner corpus of the second Baltic language: annotation and data comparability
The learner corpus of the second Baltic language (Esam for short) offers some internal comparability, as it includes bi-directional data: Latvian-Lithuanian and Lithuanian-Latvian beginner learner texts. The texts are lemmatized, POS-tagged, annotated for sentence types, and (currently only partly) error-tagged. The corpus is rather small – approx. 52,000 words. I would like to discuss options and limitations for using its data in larger-scale research.
______________________________________________________________
December 7
8 Sylviane Granger & Magali Paquot Université catholique de Louvain
Towards standardization of metadata for L2 corpora
Although there have been a large number of learner-corpus-based studies in recent years, the results are often inconclusive and at times seemingly contradictory because the data they are based on are not comparable. This highlights the importance of drawing up a standardized system of metadata for L2 texts. In our presentation we will take stock of a range of current metadata sets and make suggestions for minimal and maximal design principles.
______________________________________________________________
9 Kari Tenfjord University of Bergen
Metadata in ASK
Based on our experience from developing the corpus tool (Corpuscle) and designing the ASK corpus (The Norwegian learner corpus), as well as our experience from research based on that corpus, we will present some viewpoints on how detailed the metadata coding of sociolinguistic and other extratextual aspects of the texts should be to ensure that data in different corpora can meaningfully be compared. We will also comment on metadata usage in ASK.
______________________________________________________________
10 Elena Volodina University of Gothenburg, Sweden
Legal issues in learner essay collection
We have been working with lawyers towards the possibility of making the SweLL essay collection freely available for research - with the metadata that we want to associate with our essays preserved. I will describe the problems, dilemmas, (unpopular) decisions and our finalized “data handling flow” in connection with that.
______________________________________________________________
11 Iria del Río University of Lisbon, Portugal
Error annotation in the COPLE2 corpus
We present the error annotation system of the COPLE2 corpus. COPLE2 is a learner corpus of Portuguese as a second/foreign language available online through the TEITOK platform. It contains written and oral productions of Portuguese learners with different native languages (15) and proficiencies (A1-C1). The corpus encodes rich metadata concerning the type of text and the student’s profile. Students’ modifications to the text and teachers’ corrections are also encoded. The texts are lemmatized and POS-tagged. We are currently performing error tagging and have already annotated 50% of the corpus. Our tagging system is a two-step architecture that takes advantage of the TEITOK environment and the information already encoded in the corpus to reduce manual effort in the annotation process. We perform a first manual annotation using a simple scheme that allows for the automatic generation (fully or partially) of a fine-grained classification. We will discuss the architecture of our system as well as the design of our tagset, which is still under development.
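The two-step idea described above can be sketched as follows. This is a hypothetical illustration, not the COPLE2 implementation: the tag names, the `Token` fields and the refinement rule are all assumptions; the point is only that a coarse manual tag plus information already encoded in the corpus (such as POS) can be expanded automatically into a finer-grained code.

```python
# Sketch of a two-step error-tagging pipeline: a coarse manual tag is
# refined automatically using annotation already present in the corpus.
from dataclasses import dataclass

@dataclass
class Token:
    form: str     # learner's original form
    target: str   # corrected form
    pos: str      # POS tag already encoded in the corpus
    coarse: str   # coarse manual error tag (hypothetical labels)

def fine_grained_tag(tok: Token) -> str:
    """Derive a fine-grained error code from the coarse manual tag
    plus information already encoded for the token."""
    if tok.coarse == "GRA":          # grammar error: refine by POS
        return f"GRA.{tok.pos}"      # e.g. GRA.VERB, GRA.DET
    return tok.coarse                # other tags pass through unchanged

tok = Token(form="fazem", target="fazemos", pos="VERB", coarse="GRA")
print(fine_grained_tag(tok))  # GRA.VERB
```

The division of labour is the design point: the annotator supplies only the cheap coarse decision, and the expensive fine-grained classification is generated (fully or partially) from data the corpus already contains.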
______________________________________________________________
12 Julia Prentice University of Gothenburg, Sweden
Error taxonomy and other considerations in the SweLL project
One of the aims of the SweLL project is to make an annotated learner corpus of Swedish available to SLA researchers. The linguistic annotation of such a corpus is a problem that has to be considered from a computational and corpus-linguistic perspective as well as from an SLA perspective. Taking the latter perspective, I will discuss some of the questions that the development of a taxonomy for the annotation of the SweLL corpus has raised so far.
______________________________________________________________
13 Maarten Janssen Universidade de Coimbra, Portugal
TEITOK - using an XML-based framework for learner corpora
TEITOK is a corpus management system based on TEI/XML. It has been used for learner corpora in Portuguese, Latvian, and Croatian. I will show how it can be used to combine textual annotation, linguistic annotation and error annotation in a single framework, using a graphical user interface, all in a widely adopted representation format.
______________________________________________________________
14 Dan Rosén Språkbanken; University of Gothenburg, Sweden
The SweLL normalization editor for learner texts
We will present ongoing work on designing and developing an editor for normalizing (error correcting) learner texts. The main idea is to regard going from the learner text to the target hypothesis as a parallel corpus where words from the source texts are linked to the target texts. We label these links using an error taxonomy to indicate why this normalization is motivated. The editor supports links from multiple source words to multiple target words.
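The alignment model described above can be sketched as a small data structure. This is an illustrative sketch only: the field names and the example labels are assumptions, not the SweLL editor's actual format. What it shows is the core idea that each link connects one or more learner tokens to one or more target tokens and carries an error-taxonomy label motivating the normalization.

```python
# Sketch of a parallel-corpus view of normalization: labeled links
# between learner (source) tokens and target-hypothesis tokens.
from dataclasses import dataclass, field

@dataclass
class Link:
    source: list[int]  # indices into the learner token sequence
    target: list[int]  # indices into the target-hypothesis tokens
    label: str         # error-taxonomy label (hypothetical names)

@dataclass
class Normalization:
    source_tokens: list[str]
    target_tokens: list[str]
    links: list[Link] = field(default_factory=list)

n = Normalization(
    source_tokens=["han", "gick", "till", "hem"],
    target_tokens=["han", "gick", "hem"],
    links=[
        Link([0], [0], "COR"),       # word kept as-is
        Link([1], [1], "COR"),
        Link([2, 3], [2], "S-Del"),  # two source words -> one target word
    ],
)
print(n.links[2].label)  # S-Del
```

Many-to-many links of this kind are what let the editor record normalizations that merge, split, or reorder words rather than only substituting one word for another.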
______________________________________________________________
15 Nadezda Okinina Eurac Research
Transc&Anno
I will present Transc&Anno, a web-based collaboration tool that allows the transcription of text images and their shallow on-the-fly annotation. Transc&Anno is being developed to address the needs of learner corpus research and to facilitate the digitisation of learner corpora. It provides an intuitive transcription and annotation environment that allows the user to perform the transcription and annotation tasks quickly and easily. Transc&Anno was created on top of the FromThePage transcription tool, developed entirely with standard web technologies – Ruby on Rails, JavaScript, HTML, and CSS.
______________________________________________________________
16 Adriane Boyd (University of Tübingen Germany)
MERLIN: Lessons Learned
The MERLIN corpus is a written learner corpus for Czech, German, and Italian that was designed to illustrate the Common European Framework of Reference for Languages (CEFR). I will give an overview of the annotation architecture and tools developed for a wide range of manual and automatic annotations and present some of the lessons learned during the project.
______________________________________________________________
17 Andreas Nolda University of Szeged, Hungary, and Hagen Hirschmann, Humboldt University Berlin
Towards a German–Hungarian learner corpus:
Standards and software for annotating learner data in ‘Dulko’
In the ‘Dulko’ project (Deutsch-ungarisches Lernerkorpus), an annotated German–Hungarian learner corpus will be built at the University of Szeged. It aims at being compatible with the Falko learner corpora at Humboldt University of Berlin, while substantially extending and generalising the conception of target hypotheses and error annotation in Falko. We shall in particular present a complete toolchain using EXMARaLDA, TreeTagger, and ANNIS, as well as a systematic and tightly integrated error tagset.
______________________________________________________________
December 8
18 Alexandr Rosen Charles University, Prague (Czech Republic)
Trying to make a learner corpus user happy: from annotation to search tools
A collection of essays written by non-native learners of Czech (about 1M words) has been used to build several corpora, with or without metadata, in a dedicated multi-level format or structured according to a commonly used search tool. The manual or automatic annotation includes corrections, lemmas and standard morphosyntactic tags assigned to the source or the corrected target, and error labels based on a formally defined or grammar-based taxonomy. An overview of the annotation and the corpora themselves will be presented. The results suggest a question or two: to what extent do they meet the expectations of existing or prospective users, and what can be done to better suit their needs?
______________________________________________________________
19 Paul Meurer, Silje Ragnhildstveit and Kari Tenfjord University of Bergen
The ASK corpus
It is of great importance to keep in mind that there are two research competences that meet in the learner corpus enterprise, corpus linguistics and SLA. Since we cannot expect users of learner corpora to be experts on corpus tools and search languages, it is all the more important to have a user-friendly interface that helps and supports the user in formulating queries. We will demonstrate how these ideas are implemented in the interface of ASK.
There is considerable variation between different corpora concerning the annotation of the data. We will show how the integrated annotation system in ASK can be used to do individual searchable annotations of the material. This feature might help to increase compatibility and to allow sensible searches across corpora.
______________________________________________________________
20 Therese Lindström Tiedemann University of Helsinki
Case study: Studying the Swedish passive.
I will present ongoing work on Swedish as a second language based on SweLL, Topling and COCTAILL. The main idea is to discuss the possibilities and problems there can be in studying this kind of construction and variation through a learner corpus, and what I, as a linguist, need in order to study it better or more easily.
Swedish has three different passive constructions plus the possibility of using an impersonal construction with the pronoun “man”. When (and how) do learners learn the different constructions? How well does this relate to the L1 usage? What kinds of errors do they make? (How do they learn to distinguish them?)
______________________________________________________________
21 Simon Smith Coventry University, UK
SkELL for Chinese: a corpus-based Chinese learning platform
SkELL, based on Sketch Engine, currently exists for a number of languages, but not Chinese. In a funding bid currently in preparation, we propose to extend the tool to Chinese.
Segmentation (tokenization) of Chinese is non-trivial and ambiguous, even for human experts. A range of segmentation algorithms already exists, but they tend to prefer breaking the text into longer words. On the SkELL Chinese platform, we will provide a feature whereby learners can control the typical segment length, depending on their particular needs at the point of use.
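The effect of such a length control can be illustrated with a toy forward-maximum-matching segmenter, where the caller caps the maximum segment length. This is a minimal sketch under stated assumptions: the lexicon and the cap parameter are illustrative, and the abstract does not say how SkELL for Chinese would actually implement the feature.

```python
# Toy forward-maximum-matching segmenter with a caller-controlled cap
# on segment length (max_len), illustrating how the same text can be
# segmented into longer or shorter words.
def segment(text, lexicon, max_len=4):
    """Greedily match the longest lexicon word of at most max_len
    characters; fall back to a single character when nothing matches."""
    out, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + j] in lexicon or j == 1:
                out.append(text[i:i + j])
                i += j
                break
    return out

lex = {"中国", "人民", "中国人", "人"}
print(segment("中国人民", lex, max_len=4))  # ['中国人', '民']
print(segment("中国人民", lex, max_len=2))  # ['中国', '人民']
```

The example shows the ambiguity in question: with a longer cap the greedy match prefers 中国人 ("Chinese person"), while a shorter cap yields 中国 + 人民 ("China" + "people"), which may be more useful to a learner studying shorter vocabulary items.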
______________________________________________________________
22 David Alfter University of Gothenburg, Sweden
SweLLex+ - productive L2 learner vocabulary and more
I will present the graded word list SweLLex, which has been compiled from learner essays, as well as a lexical complexity analysis module built on top of this resource.
______________________________________________________________