Kimmo Kettunen
The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910. This collection contains approximately 1.95 million pages in Finnish and Swedish. The Finnish part of the collection consists of about 2.39 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi.
The digi.kansalliskirjasto.fi web service is used by, for example, genealogists, heritage societies, researchers, and lay history enthusiasts. There is also a growing desire to make the material more widely available for educational use. In 2014 the service had slightly over 10 million page loads. User statistics show that about 90% of Digi usage comes from Finland, while a share of 10% or more comes from outside Finland.
This presentation discusses some ways to approximate the overall lexical quality of the Digi collection. Our methods are oriented toward corpus analysis and do not involve elaborate statistics or language modelling. We mainly use basic word frequency counting together with morphological analysis, and use these to approximate the current state of the collection. We also briefly discuss other development plans for the collection.
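The general idea of combining word frequency counts with a recognition check can be illustrated with a minimal sketch. It is not the pipeline used for the Digi collection: the function names, the toy lexicon, and the sample text are all hypothetical, and a plain set lookup stands in for a real morphological analyzer. The sketch counts word frequencies and reports the share of running words (tokens) and distinct words (types) that the stand-in analyzer recognizes.

```python
import re
from collections import Counter

# Runs of Unicode letters, optionally joined by hyphens; digits and
# punctuation are dropped. This tokenization is an assumption of the sketch.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def recognition_rate(text, is_recognized):
    """Return (token-level rate, type-level rate, frequency counter).

    is_recognized is any callable mapping a word to True/False. In a real
    setting it would wrap a morphological analyzer; here it is a placeholder.
    """
    freqs = Counter(tokenize(text))
    total_tokens = sum(freqs.values())
    if total_tokens == 0:
        return 0.0, 0.0, freqs
    known_tokens = sum(n for word, n in freqs.items() if is_recognized(word))
    known_types = sum(1 for word in freqs if is_recognized(word))
    return known_tokens / total_tokens, known_types / len(freqs), freqs


# Hypothetical toy lexicon standing in for a morphological analyzer.
TOY_LEXICON = {"sanomalehti", "on", "vanha", "ja", "kirjasto"}

# Invented sample text with two forms the toy lexicon does not contain.
sample = "Sanomalehti on vanha ja sanomalehti on wanha ja kirjafto"
token_rate, type_rate, freqs = recognition_rate(sample, TOY_LEXICON.__contains__)

print(f"recognized tokens: {token_rate:.1%}")
print(f"recognized types:  {type_rate:.1%}")
print("unrecognized:", sorted(w for w in freqs if w not in TOY_LEXICON))
```

A recognition rate of this kind is only a rough proxy for quality: unrecognized words may be OCR errors, but they may equally be legitimate words, such as older spellings or proper names, that the analyzer's lexicon does not cover.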