The objective of Digitalt kulturarv – ett digitalt folkminnesarkiv (Digital Cultural Heritage – a Digital Folklore Archive) is to digitize folklore records from the collections of the Institute for Language and Folklore (ISOF) and make them accessible. Within the project, a database of fully 16,000 complete records has been built. In addition to the text material, which consists of records that have been either transcribed or scanned using OCR or HTR, the database contains metadata such as year of recording, categories and location, as well as information about the people who made the recordings and the informants (i.e. name, year of birth, sex).
We have created two platforms within the project: Sägenkartan (Map of Legends), which is aimed at a wider audience, and Database of Intangible Cultural Heritage, which is intended for research.
Sägenkartan is map-based. Dots on the map represent parishes from which we have records. If the user clicks on a dot, a list of legends from that area appears, with information about what the records contain and who the informants are. It is also possible to study the records in full text or as an image. Users interested in a certain legend/myth category can filter the search results to show, for example, only tales about brownies or witches on the map. With Sägenkartan, we wanted to create a simple and user-friendly platform for making the older folklore collections at ISOF accessible.
While Sägenkartan only shows a selection of records from the database, the whole corpus will be available through the platform aimed at research, Database of Intangible Cultural Heritage. The research version is based on Elasticsearch, which enables searching in several different fields in the database simultaneously. For example, it is possible to search for a particular word in the text and limit the results so that only records from female informants born between 1848 and 1854 in the province of Värmland are displayed.
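The combined search described above could be expressed as an Elasticsearch bool query along the following lines. This is only a sketch: the field names (`text`, `informant.sex`, `informant.birth_year`, `province`) are hypothetical and not taken from the project's actual index, and if informants were stored as nested objects a nested query would be needed instead.

```python
import json

def build_query(word, sex, born_from, born_to, province):
    """Build a bool query: full-text match on the record text plus metadata filters."""
    return {
        "query": {
            "bool": {
                # The search word must occur in the record text.
                "must": [{"match": {"text": word}}],
                # Filters narrow the hits without affecting relevance scores.
                "filter": [
                    {"term": {"informant.sex": sex}},
                    {"range": {"informant.birth_year": {"gte": born_from, "lte": born_to}}},
                    {"term": {"province": province}},
                ],
            }
        }
    }

# Hypothetical field names and values, chosen to mirror the example in the text.
query = build_query("spöke", "female", 1848, 1854, "Värmland")
print(json.dumps(query, ensure_ascii=False, indent=2))
```

The resulting dictionary would be sent as the body of a search request; the filters are kept in the `filter` clause because they are yes/no conditions rather than ranked matches.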
The interface not only shows a list of search results but also visualizes statistics from the metadata. For example, a map illustrates the geographical distribution of the records. In contrast to Sägenkartan, this map is based on polygons of different colours: the more results found in a parish, the darker the colour. Alternatively, the colours can illustrate the percentage of the total number of records found in each parish.
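The two colouring modes can be sketched as a small function that turns counts per parish into a shade between 0 and 1. The sketch assumes that the percentage mode means the share of a parish's records that match the search; another possible reading is the share of all hits that fall in the parish. The function name and the parish names are invented for illustration.

```python
def parish_shades(hits, totals=None):
    """Return a shade in [0, 1] per parish; higher values mean darker polygons.

    hits:   parish -> number of records matching the search
    totals: parish -> number of all records from that parish (optional)
    """
    if totals is None:
        # Absolute mode: scale hit counts by the largest count.
        peak = max(hits.values(), default=0)
        return {p: (n / peak if peak else 0.0) for p, n in hits.items()}
    # Relative mode: share of each parish's records that match the search.
    return {p: (n / totals[p] if totals.get(p) else 0.0) for p, n in hits.items()}

# Invented numbers for two hypothetical parishes.
hits = {"Parish A": 10, "Parish B": 5}
totals = {"Parish A": 100, "Parish B": 10}
print(parish_shades(hits))          # absolute mode
print(parish_shades(hits, totals))  # relative mode
```

With these invented numbers, Parish A is darkest in absolute mode but Parish B is darkest in relative mode, which is why the two views can tell different stories.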
Below the map, different visualizations provide an overview of the metadata connected to the search result. Here, we use the Elasticsearch function called aggregation, which enables us to get information about categories from a search result and then calculate the distribution of texts per category. For example, the word ‘spöke’ (ghost) appears mostly in the category ‘Död och de döda’ (Death and the dead).
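A category distribution of this kind could be obtained with a terms aggregation, roughly as sketched below. The field names (`text`, `category`), the aggregation name `per_category`, the second category label and all counts in the sample response are assumptions made for illustration. Only the general shape of the request and of the `buckets` list follows Elasticsearch's aggregation format.

```python
def category_request(word):
    """Request body: match a word and count the hits per category."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits themselves
        "query": {"match": {"text": word}},
        "aggs": {"per_category": {"terms": {"field": "category", "size": 20}}},
    }

def category_shares(response):
    """Turn the aggregation buckets into a share of texts per category."""
    buckets = response["aggregations"]["per_category"]["buckets"]
    total = sum(b["doc_count"] for b in buckets)
    return {b["key"]: b["doc_count"] / total for b in buckets} if total else {}

# Invented response with made-up counts, shaped like a terms aggregation result.
sample_response = {
    "aggregations": {
        "per_category": {
            "buckets": [
                {"key": "Död och de döda", "doc_count": 30},
                {"key": "Övrigt", "doc_count": 10},
            ]
        }
    }
}
print(category_shares(sample_response))
```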
We use several different visualizations: for example, line charts of records per year or of the birth years of the people recording and of the informants, and column charts illustrating the distribution by sex.
Another feature is a column chart of the most common terms appearing in a given search result. We have run Latent Dirichlet Allocation (LDA) topic modelling on all texts in the database, and we use our topic model to calculate the frequency of individual terms, which we then visualize. This gives us a picture of the content of a search result consisting of many texts. If we continue with our ghost example, we can see that the most common terms connected to the word ‘spöke’ (ghost) are ‘väg’, ‘far’, ‘hadd’, ‘natt’, ‘såg’ and ‘död’.
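As a rough illustration of the counting step, the sketch below tallies how often terms from a given vocabulary occur in the texts of a search result. It is not the project's implementation: it assumes the topic model supplies the vocabulary of meaningful terms, and the tokenization, the function name and the two sample sentences are invented.

```python
import re
from collections import Counter

def top_terms(texts, vocabulary, n=6):
    """Count vocabulary terms across the texts and return the n most frequent."""
    counts = Counter()
    for text in texts:
        # Lowercase and split into word tokens; \w also matches å, ä and ö.
        tokens = re.findall(r"\w+", text.lower())
        counts.update(t for t in tokens if t in vocabulary)
    return counts.most_common(n)

# Invented sample texts and a hypothetical vocabulary taken from a topic model.
texts = [
    "Far såg ett spöke på väg hem",
    "Det var natt och far såg ingenting",
]
vocabulary = {"väg", "far", "natt", "såg", "död"}
print(top_terms(texts, vocabulary))
```

Restricting the count to a vocabulary keeps function words such as ‘ett’ and ‘och’ out of the chart without a separate stop-word list.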
Below the visualization modules there is also a list of hits, where the user can read individual records, see a list of people connected to the result, and see a list of highlighted sentences containing the word used in the search.
Topic modelling data is also used in a module that visualizes the words and the connections between them as networks. For this, we use the X-Pack module for Elasticsearch, which can collect fields from the search result and find relationships between them, based on the number of texts containing the same data. One example is the network produced by a search for the words ‘kärring’ and ‘käring’ (old woman/hag, spelt in two different ways).
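The idea of linking terms by the number of texts they share can be illustrated in plain Python. This sketch does not use the X-Pack API. It simply counts, for every pair of terms, how many documents contain both, and keeps the pairs above a threshold. The function name, the threshold and the sample term lists are assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(doc_terms, min_docs=2):
    """Return term pairs that appear together in at least min_docs documents."""
    edges = Counter()
    for terms in doc_terms:
        # Each unordered pair is counted once per document.
        for a, b in combinations(sorted(set(terms)), 2):
            edges[(a, b)] += 1
    return {pair: n for pair, n in edges.items() if n >= min_docs}

# Invented term lists for three hypothetical records.
docs = [
    ["kärring", "mjölk", "ko"],
    ["kärring", "mjölk", "smör"],
    ["kärring", "trollkäring"],
]
print(cooccurrence_edges(docs))
```

Each remaining pair would be drawn as an edge in the network, with the count available as an edge weight.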
The network analysis makes it possible to see semantic clusters within the network. In a search for ‘kärring’, for example, the word is connected to ‘mjölk’ (milk), ‘smör’ (butter), ‘trollkäring’ (witch) and ‘ko’ (cow). All words in the network are clickable. If you click on the word ‘mjölk’, for example, you see a list of records containing the words ‘kärring’ and ‘mjölk’. There we can see that some of them are about ‘bjäran’ and ‘mjölkhare’, a creature that witches used to steal milk.
The project is still under development, and a web-based platform will be launched in the spring of 2018.
Trausti Dagsson and Fredrik Skott