Research Internship – Developing an automated solution for measuring the quality of digitized texts – Koninklijke Bibliotheek – Den Haag
The digitized collection of the KB National Library of the Netherlands is mostly available through Delpher and DBNL. All scanned images of texts have been processed with the optical character recognition (OCR) software ABBYY. However, this software does not always perform very well for historical material and the OCR quality of Delpher is not as high as its users would like it to be.
At the KB we are exploring ways to improve the quality of the OCR. We are also searching for a way to measure the OCR quality. At the moment, the quality of our digitized text is measured manually, using samples of the data. An automated approach for measuring the OCR quality could lead to better insight into the bottlenecks for digitization, as well as into which parts of the collection need special attention in terms of improving the OCR quality.
During this research internship or graduation assignment we would like to examine the possibilities of an automated approach for measuring the quality of our digitized texts. The possibilities that suit our collection can then be used to develop an automated ‘pipeline’ for measuring the quality of texts for both Delpher and DBNL.
A literature review of existing (theoretical or practical) solutions for measuring the quality of digitized texts. To measure the quality of digitized texts, it is important to look at character recognition as well as segmentation. The first part of the internship will consist of a literature review in which existing solutions for measuring the quality of digitized text are explored.
Developing and evaluating a ‘pipeline’ for automated measuring of the quality of digitized texts. Methods found in the literature review and self-proposed ideas will be tested on the collection of the KB. We expect that a combination of methods is needed for a reliable and consistent measurement of quality. Appropriate methods shall be selected and will then be used to develop an automated ‘pipeline’ for measuring the quality of the digitized texts from Delpher and DBNL.
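As an illustration of the kind of measure such a pipeline might include, the sketch below computes a character error rate (CER): the edit distance between an OCR output and a manually corrected reference text, divided by the length of the reference. This is only one commonly used measure and not a method prescribed by the KB; it assumes a reference transcription is available, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_text: str) -> float:
    """Edit distance normalised by the length of the reference text.

    Assumption for this sketch: the reference is a manually
    corrected transcription of the same passage.
    """
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_text) / len(reference)


if __name__ == "__main__":
    # One wrong character ("1" instead of "i") in an 11-character word.
    print(character_error_rate("Koninklijke", "Koninkl1jke"))
```

A measure like this depends on reference transcriptions, which are typically only available for samples; approaches that work without a reference would have to be identified in the literature review.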
- is in the final stage of their studies in Software Engineering, Computer Science, Artificial Intelligence, Data Science, (Digital) Humanities or a related field
- can work technically independently, but with substantive support of our Data Science and Digitisation Team
- can handle existing tooling, or knows how to gain knowledge about this
- has expertise in the field of NLP, machine learning and statistical models, for example for the evaluation of output
- has basic knowledge of the Dutch language. Although this is not a necessity, some understanding will be helpful
- A research internship at the Data Science Team of the Research Department of the National Library of the Netherlands (18 fte) for a maximum of 6 months, tailored to your needs and the requirements of your university and supervisor
- A workplace at the offices of the KB in downtown The Hague, only a 3-minute walk from Central Station
- Substantive support from both our Data Science Team and the Digitisation Team
- Access to all data of the KB, tooling we developed before, and hardware we have available to run our own experiments
- Reimbursement for travel costs and a compensation, in line with our regular internship compensation.
The KB is a nationally and internationally renowned institution: with more than 500 employees, we are one of the major Dutch heritage and science institutions and have an important coordinating role in the network of public libraries. Tasks include preserving, collecting and making available all publications published in or about the Netherlands and building the national digital library. We also think it is important to train young colleagues.
We regularly offer internships for students of various courses and disciplines, both academic and higher professional (e.g. book science, literature study, Artificial Intelligence, Data Science, Software Engineering, (Digital) Humanities, IT, HRM, finance, facility management, communication, etc.). For example, we assist HBO students in their work placement, but also academics who want to carry out their (graduation) research or graduation project at the KB.
For more information about this internship or to apply, please contact Mirjam Cuper (Data Scientist) at 06-38298534 or send an e-mail to .