Mini workshop on collocations and sentence selection

09:00 to 12:00

Workshop location: University of Gothenburg, Humanisten, C450



09.00   Ildikó Pilán          

Title: HitEx: Automatic selection of corpus examples for language learning exercises [slides]

Abstract: I will present HitEx, a system for automatically identifying corpus sentences that are suitable as exercise item candidates for learners of Swedish as a second language. I will describe a diverse set of selection criteria that rely on Natural Language Processing methods. I will focus on two fundamental aspects in particular: determining the difficulty level of sentences for learners, and the dependence of the extracted sentences on their original context.


10.00   Iztok Kosem       

Title: Semi-automated lexicography in Slovenia: collocations dictionary and beyond

Abstract: In recent years, Slovenian lexicography has undergone significant changes in its methodological approach to dictionary making. These changes stem from the work on the Slovene Lexical Database (Gantar et al. 2016) and are grounded in presenting lexicographers with lexical data automatically extracted from the corpus, on which analysis, validation and clean-up are then conducted.

The current project benefiting from this approach is the Collocations Dictionary of Slovene (CDS), which aims to present collocational information initially for a test sample of 5000 words, and later for virtually every word for which enough collocational information is available. In my presentation, I will present the entire workflow of making entries in the CDS, from automatic extraction of collocations and their examples, through post-processing of the extracted data, to lexicographic analysis. I will also show how the data from the CDS feeds into other projects, such as a synonym dictionary and a Slovenian-Hungarian dictionary.
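For readers unfamiliar with automatic collocation extraction, the first step of such a workflow can be pictured with a minimal sketch. The abstract does not say which association measure or tools the CDS uses. The sketch below assumes the logDice score, 14 + log2(2·f(x,y) / (f(x) + f(y))), and takes a ready-made list of (headword, collocate) pairs as input. The function names, the frequency threshold and the English example pairs are all hypothetical.

```python
import math
from collections import Counter


def log_dice(pair_freq, freq_x, freq_y):
    """logDice association score: 14 + log2(2 * f(x, y) / (f(x) + f(y)))."""
    return 14 + math.log2(2 * pair_freq / (freq_x + freq_y))


def collocation_candidates(pairs, min_pair_freq=2):
    """Rank (headword, collocate) pairs by logDice, best first.

    `pairs` is a list of (headword, collocate) tuples, e.g. taken from one
    grammatical relation in a parsed corpus. `min_pair_freq` is an
    illustrative threshold, not a value taken from the CDS project.
    """
    pair_counts = Counter(pairs)
    head_counts = Counter(head for head, _ in pairs)
    collocate_counts = Counter(coll for _, coll in pairs)

    scored = [
        (head, coll, log_dice(n, head_counts[head], collocate_counts[coll]))
        for (head, coll), n in pair_counts.items()
        if n >= min_pair_freq  # drop rare pairs before lexicographic review
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical toy data: adjective-noun pairs.
toy_pairs = [("strong", "tea")] * 3 + [("strong", "wind"), ("hot", "tea")]
for head, coll, score in collocation_candidates(toy_pairs):
    print(f"{head} {coll}: {score:.2f}")
```

In a workflow like the one described here, a ranked list of this kind would be the input to post-processing and lexicographic analysis rather than a finished dictionary entry.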

The CDS is breaking new ground in Slovenia in many ways, and this is what the final part of my presentation will focus on. I will show how, and why, we are devising senses to be presented via short indicators only, and the ways in which we intend to introduce crowdsourcing into the workflow in order to facilitate clean-up processes. Furthermore, we are in the middle of planning tests of new collocation extraction techniques as part of the recently approved research project Collocations in Slovene.


Gantar, P., Kosem, I., & Krek, S. (2016). Discovering Automated Lexicography: The Case of the Slovene Lexical Database. International Journal of Lexicography, 29(2), 200–225.


11.00   Kristina Koppel  

Title: Example sentences in Estonian learners’ dictionaries [slides]

Abstract: The focus of this presentation is on the example sentences in Estonian learners’ dictionaries. Firstly, I will introduce the dictionaries compiled at the Institute of the Estonian Language which are aimed at learners of Estonian as a foreign or a second language. Secondly, I will give an overview of the automatic compilation of the Estonian Collocations Dictionary (ECD) database (Kallas et al. 2015), an ongoing project aimed at language learners at the B2-C1 level.

Automatic compilation of the ECD also included the extraction of authentic example sentences. The largest corpus for Estonian, the Estonian National Corpus, which we used to extract the ECD database, was built according to the needs of lexicographers, researchers and language specialists, not those of language learners. I will introduce a function of the corpus query system Sketch Engine (Kilgarriff et al. 2004) called Good Dictionary Examples (GDEX) (Kilgarriff et al. 2008). GDEX filters corpus sentences by evaluating the complexity of each sentence, e.g. the frequency of its words and the length of the sentence. I will introduce the parameters of sentences in Estonian learners’ dictionaries (Koppel 2017) and future projects where they will be implemented.
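The kind of filtering described above can be illustrated with a minimal sketch. This is not the GDEX implementation or its configuration format. It only combines the two criteria named in the abstract, sentence length and word frequency, into a single score. The class and function names, the weights, the length thresholds and the word list are all assumptions made for illustration.

```python
from dataclasses import dataclass

PUNCTUATION = ".,;:!?\"'()"


@dataclass(frozen=True)
class ExampleScoringConfig:
    # Illustrative values only; these are not GDEX defaults.
    min_tokens: int = 6
    max_tokens: int = 20
    length_weight: float = 0.5
    frequency_weight: float = 0.5


def tokenize(sentence):
    """Very rough whitespace tokenizer that strips surrounding punctuation."""
    tokens = (raw.strip(PUNCTUATION).lower() for raw in sentence.split())
    return [token for token in tokens if token]


def length_score(n_tokens, cfg):
    """1.0 inside the preferred length range, decreasing outside it."""
    if cfg.min_tokens <= n_tokens <= cfg.max_tokens:
        return 1.0
    if n_tokens < cfg.min_tokens:
        return n_tokens / cfg.min_tokens
    return cfg.max_tokens / n_tokens


def frequency_score(tokens, common_words):
    """Share of tokens found in a list of frequent (learner-friendly) words."""
    if not tokens:
        return 0.0
    return sum(token in common_words for token in tokens) / len(tokens)


def score_sentence(sentence, common_words, cfg=ExampleScoringConfig()):
    """Weighted combination of the length and frequency criteria."""
    tokens = tokenize(sentence)
    return (cfg.length_weight * length_score(len(tokens), cfg)
            + cfg.frequency_weight * frequency_score(tokens, common_words))


def rank_sentences(sentences, common_words, cfg=ExampleScoringConfig()):
    """Return candidate sentences ordered from best to worst score."""
    return sorted(sentences,
                  key=lambda s: score_sentence(s, common_words, cfg),
                  reverse=True)


# Hypothetical word list and candidate sentences.
common = {"the", "cat", "sat", "on", "mat"}
candidates = ["Ok.", "The cat sat on the mat."]
print(rank_sentences(candidates, common)[0])
```

A frequency list derived from a learner-oriented corpus would take the place of the toy word set, and the weights and thresholds would need tuning for the target language and proficiency level.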


Kallas, Jelena; Kilgarriff, Adam; Koppel, Kristina; Kudritski, Elgar; Langemets, Margit; Michelfeit, Jan; Tuulik, Maria; Viks, Ülle (2015). Automatic generation of the Estonian Collocations Dictionary database. Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom. Ljubljana/Brighton: Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd, 1−20.

Kilgarriff, Adam, Pavel Rychlý, Pavel Smrž, David Tugwell 2004. The Sketch Engine. In G. Williams, S. Vessier (Eds.). Proceedings of the 11th EURALEX International Congress. Lorient, France: Université de Bretagne Sud, 105–115.

Kilgarriff, Adam, Milos Husák, Katy McAdam, Michael Rundell, Pavel Rychlý 2008. GDEX: Automatically finding good dictionary examples in a corpus. In E. Bernal, J. DeCesaris (Eds.). Proceedings of the 13th EURALEX International Congress. Barcelona: Institut Universitari de Linguistica Aplicada, Universitat Pompeu Fabra, 425–432.

Koppel, Kristina (2017). Heade näitelausete automaattuvastamine eesti keele õppesõnastike jaoks. [Automatic detection of good dictionary examples in Estonian learner’s dictionaries.] In Eesti Rakenduslingvistika aastaraamat, 13, pp. 53−71.