NCN workshop on Interaction design in the context of CLARIN
Venue: University of Gothenburg, Sweden
3–4 May 2017
On 3–4 May 2017 the Nordic CLARIN Network (NCN) is organizing the next in its series of thematic workshops at the University of Gothenburg. This time the theme is Interaction design in the context of CLARIN. Suitable presentation/demo topics under this heading include (but are not limited to):
- visualization in a broad sense, including visualization of linguistic structures, networks, statistics, time series, geomapping, and neighbouring topics such as (the study of) literature through large digitized collections using topic modelling
- interface design for language technology based research tools, e.g. corpus query interface design
- workflow organization/interface design for optimal combination of automatic and manual annotation (or resource building in general), including solutions involving crowdsourcing or “incidental” data collection/correction (à la reCAPTCHA)
Participation in the workshop is by invitation only.
If you would like to participate with a presentation and/or a demo, please contact your national NCN coordinator:
- Denmark: Bente Maegaard (bmaegaard æt hum.ku.dk)
- Finland: Krister Lindén (krister.linden æt helsinki.fi)
- Iceland: Eiríkur Rögnvaldsson (eirikur æt hi.is)
- Norway: Koenraad De Smedt (desmedt æt uib.no)
- Sweden: Lars Borin (lars.borin æt svenska.gu.se)
Preliminary program
Wednesday, 3 May (room: Humanisten / D411 map)
12.00–13.00 | Lunch, served in the workshop room |
13.00–13.15 | Welcome, opening of the workshop |
13.15–13.45 | Interactive transcription and transliteration: a win-win opportunity for HSS and speech research [slides] |
Jens Edlund (KTH, Stockholm) We propose a speech technology based method to transcribe speech recordings and transliterate handwriting that brings significant and concrete benefit to two fields: HSS (including speech research) and speech technology. The method – interactive speech based transcription and transliteration – is intended as a collaborative effort between these areas, and a collaboration in which both parties win. In HSS research, a large share of potential research material consists of manuscripts that cannot easily be transliterated automatically with optical character recognition (OCR) for a host of reasons ranging from handwriting variability through poor print quality to degraded paper quality. Similarly, our archives are brimming with spoken sources that cannot be transcribed automatically using automatic speech recognition (ASR) for an equally diverse range of reasons ranging from dialectal variation through poor audio quality to noisy recording environments. Transliterating these texts and transcribing these audio recordings is such a slow and painstaking endeavour that it is quite often simply not undertaken, and when it is, it is at great cost. The method proposed here combines speech technology, such that the person transcribing or transliterating initially speaks the text, which is then automatically transcribed using ASR, with techniques for efficient editing of ASR results into a system that significantly lowers the time and effort required to transcribe/transliterate, improves the ergonomics for the researchers working with the materials, and at the same time delivers speech data that can be used to improve speech technology. |
|
13.45–14.15 | An interface for anonymization of personal stories [slides] |
Niklas Blomstrand, Lars Ahrenberg (Linköping University) What makes document anonymization difficult and how could a system be designed to facilitate this process? Through user-centered design we have tried to understand archivists’ perspective on document anonymization to find a solution that fits their needs. Our work has resulted in an effect map that describes the desired effects of the system and how they are to be achieved. From this model a prototype for the system has been developed. At the workshop, we will present the prototype as well as the effect map that it is based on. |
|
14.15–14.45 | Sparv and text classification [slides] |
Dan Rosén, Anne Schumacher (Språkbanken, University of Gothenburg) Sparv is the toolkit that annotates all corpora at Språkbanken. The annotations include segmentation, part-of-speech tagging, compound analysis, dependency tree parsing, named-entity recognition and word sense disambiguation. In contrast to these token- and sentence-oriented analyses, Sparv has recently become capable of training models for classifying entire texts. Models can be trained to predict, e.g., a text's author, the author's political affiliation, or the section in which a newspaper article appeared. Training is supervised, i.e. requires labelled data, which we have at Språkbanken. In this talk we will give an overview of Sparv and show initial work on visualising trained text classification models given some example text. |
|
14.45–15.15 | Easy to use, easy to read – First steps towards designing the user interface of a text analysis toolkit |
Simon Cavedoni, Emil Fritz, Jakob Säll (Linköping University) We explore the demands on a digital service for web editors’ work on producing easy-to-read material. As part of the TeCST (Text Complexity and Simplification Toolkit) project we have approached the toolkit from an interaction design perspective, in order to improve the usability of the service and to visualize the provided measures of text complexity in an intuitive way. Two workshops were arranged with web editors and text producers, with the purpose of exploring the participants’ view on text analysis and simplification as well as investigating how a toolkit could support their work. This resulted in a specification of requirements for the design as a whole, as well as a proposed way to visualize text complexity. |
|
15.15–15.45 | Coffee/tea break |
15.45–16.15 | Automatic selection of corpus examples for language learning exercises with explicit user control [slides] |
Ildikó Pilán (Språkbanken, University of Gothenburg) We present HitEx, a system aiming at the automatic identification of sentences from corpora that are suitable as exercise item candidates for learners of Swedish as a second language. The system is based on a diverse set of selection criteria and, with the help of a transparent graphical interface, gives users (i.e. researchers, teachers, assessors and course book writers) some control over the sentence selection process. |
|
16.15–16.45 | Correct-annotator: An Annotation Tool for Learner Corpora [slides] |
Felix Hultin (Stockholm University) Correct-annotator is a browser-based, single-page tool for annotating language errors in Swedish learner corpora. Having grown out of the research project SweLL, Correct-annotator attempts to significantly ease the workflow of an annotator by allowing the annotator to edit and correct the text, from which the system induces potential language error annotations. This differs from previous annotation tools in that annotation is made on text transformations rather than on a static text, which can significantly lower the number of user operations involved in annotation. |
|
16.45–17.15 | General discussion |
19.00 | Workshop dinner, Restaurant "Familjen", http://www.restaurangfamiljen.se/ |
Thursday, 4 May (room: Lennart Torstenssonsgatan 6 / K332 map)
09.30–10.00 | Interactive Visualizations and use of the Language Mill at Kielipankki |
Jussi Piitulainen (FIN-CLARIN) Mylly the Mill is the Kielipankki instance of the Chipster platform (chipster.csc.fi). It provides a graphical interface for applying specially installed tools to data that users either upload to Mylly or create in Mylly, and for inspecting (or downloading) the results. The tool scripts have access to the computational resources, including data and software installed for command line users, in a cluster computing system. We already have the Aalto speech recognizer, the Turku dependency parser, and many of the Helsinki transducer tools. Development is in progress between UHEL and CSC. |
|
10.00–10.30 | Interactive Visualizations in INESS [slides] |
Paul Meurer (Uni Research Computing), Victoria Rosén and Koenraad De Smedt (University of Bergen) Grammatical structures tend to be quite complex and varied, so that their adequate visualization is an indispensable feature of treebanks. We present a range of interactive visualization and navigation techniques in INESS (a service integrated in CLARINO), starting with the interactive dynamic rendering and highlighting of relations between c-structures and f-structures. Other treebanking functions in INESS also rely on visual interactions, including text preprocessing, choosing discriminants, previewing of disambiguation choices, browsing, querying, displaying search results with highlighting, monitoring treebank status and versioning, and alignment of parallel treebanks. |
|
10.30–11.00 | Coffee/tea |
11.00–11.30 | A new treebank interface for the digital humanities [slides] |
Tinna Frímann Jökulsdóttir, Anton Karl Ingason (University of Iceland) We describe PaCQL (Parsed Corpus Query Language), a novel query language for carrying out research on parsed historical corpora, an important task for the digital humanities. PaCQL implements and enhances many of the most important features of earlier software that is designed for computational research in historical syntax and combines such functionality with a search engine which employs a fast in-memory index that cuts down waiting time in many realistic research scenarios. A web interface is provided with an automatically created summary of the main quantitative findings, including a visualization and output formats that are suitable for further processing in statistical packages like R and SPSS. The primary goal of this project is to contribute to the development of software tools which are designed from the ground up specifically with the needs of the digital humanities in mind. |
|
11.30–12.00 | The Swedish-Finnish Korp collaboration – past, present and future [slides] |
Johan Roxendal (Språkbanken, University of Gothenburg), Jyrki Niemi (FIN-CLARIN) The DRY principle of software development states that we shouldn’t unnecessarily repeat ourselves. This holds equally true, on the macro level, for the software tools developed by the CLARIN members. By sharing code and encouraging cooperation around that code we can sow for others to reap and reap what others sow. But it’s not always as easy as that. We present the past, present and future of cooperation around the Korp codebase and hope to inspire an increase in such cooperation between CLARIN members. |
|
12.00–13.00 | Lunch |
13.00–13.30 | Learner modelling through interactional data collection [link] |
David Alfter (Språkbanken, University of Gothenburg) Personalization and adaptation of language learning platforms are among the goals if we want to improve and possibly accelerate the language learning experience for learners. We collect data from learners both through an explicit questionnaire and continually as they interact with the learning platform by means of different exercise types. In addition to the use of this data for personalization and adaptation, we also gain data about the learner that can be linked with other data about the user. |
|
13.30–14.00 | D3.js in NLP: An Introduction with Case Studies [slides] |
Felix Hultin (Stockholm University) D3.js is a JavaScript library used for creating, often interactive, data visualizations in the browser. Having naturally grown out of and adapted itself to modern web standards, such as HTML, SVG, CSS and JavaScript, it has become a widely popular tool in web development, used by organisations such as The New York Times, The Guardian and OpenStreetMap. Seeing that web browsers are used more than any other piece of software and that web standards do not easily change, D3.js is here to stay. D3.js has, however, not been fully explored in the context of NLP. In the many D3.js visualization trackers, there are only a few NLP-related visualizations, and the library is hardly mentioned in the academic literature of NLP. In this presentation, I intend to show how D3.js works and to give some case studies from my own NLP-related projects, more specifically in parsing of context-free grammars and visualizing phonotactic structures, to illustrate how it can be used for NLP-related purposes. |
|
14.00–14.30 | General discussion |
14.30–15.00 | Closing / Coffee, tea |
If you have questions about the local arrangements, including the program, please contact us through this contact form.