Explore the future

Machine Learning Replaces Manual Transcription of Old Documents

New technologies make it possible not only to digitize documents but also to transcribe old, handwritten Danish documents. For research projects and the public sector, this creates great value, because analysis can then be based on a solid data set that goes far back in time.

Research institutions and the public sector often need to analyze data that has not yet been digitized or transcribed. Many documents are scanned and stored electronically, but the information in them isn't available as structured data and therefore cannot be used in data analysis.

Especially in research projects, it's crucial to see how associations in the data develop over time, which makes historical data relevant. In studies of socioeconomic conditions, the sources may be church books, health records, or elementary school grade sheets, which must be digitized into a data format that allows them to be combined with anonymized register data. The same is true in genetics: the DNA strand is digitized, but not in a format that can be linked with other relevant data.

Manual transcription is flawed. Traditionally, researchers have sent data abroad for manual transcription, where selected data is entered manually in tabular format. Although labor is cheap, the risk of mistyping is high, especially when the text is in Danish.

At the same time, there is no way to identify or validate whether an entry has been typed correctly. This creates a source of error of unknown size, which can ultimately lead research projects to false conclusions without anyone being aware of it.

"We use neural networks and microtasking when establishing the training data set, and because we develop an app for this purpose, it's possible to use crowdsourcing."

Machine learning

Machine learning is a subcategory of artificial intelligence that includes neural networks, whose name is inspired by the biological neural networks in the brain. Together with other image recognition techniques, such as computer vision, neural networks are used to recognize where a data item is located in the scanned document.

Use Microtasking and Machine Learning

The alternative to manually typing in the documents is to use machine learning and microtasking. We develop an app that is able to identify and extract selected data elements from the document. The selected data elements are translated into digital data and stored in a structured form. This is possible because the machine learning algorithm uses neural networks.

The neural network must be trained to recognize handwritten and/or typed words and letters. The app displays the selected data element from the document, and the user types in what they see. To ensure the quality of data entry, every data element is entered by at least two individuals.
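The double-entry check described above could be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual logic: the function name `reconcile`, the `min_agreement` parameter, and the rule of rejecting tied answers are all assumptions made for the example.

```python
from collections import Counter


def reconcile(entries, min_agreement=2):
    """Return the value the typists agree on, or None if there is no agreement.

    `entries` holds what each person typed for the same data element.
    Assumption for this sketch: a value is accepted only if at least
    `min_agreement` people typed it and no other value is equally common.
    """
    # Ignore surrounding whitespace and skip empty submissions.
    normalized = [e.strip() for e in entries if e.strip()]
    if not normalized:
        return None

    ranked = Counter(normalized).most_common(2)
    value, count = ranked[0]

    if count < min_agreement:
        return None  # not enough people agree yet
    if len(ranked) > 1 and ranked[1][1] == count:
        return None  # two different values are equally common
    return value
```

For instance, `reconcile(["Jensen", "Jensen"])` accepts the value, while `reconcile(["Jensen", "Jansen"])` returns `None`, which in a setup like this would mean the data element is shown to another person.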

Each time a word from the document is manually entered, it is linked with its source, i.e. the image of the data element is paired with the entered value. In this way, a training set is established, which expands over time and eventually allows the artificial intelligence to take over the rest of the process. The transcription of the document will then take place without human involvement. The result is a quality-assured data set, which researchers can use in their analysis.
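One way to picture how such a training set grows is sketched below in Python. It is only an illustration under stated assumptions: the `TrainingExample` class, the `crop_id` identifier for the image of a data element, and the `add_example` helper are hypothetical names, and the sketch assumes each value has already passed quality assurance before it is added.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    crop_id: str  # hypothetical identifier of the image of one data element
    label: str    # the quality-assured value typed in for that image


def add_example(training_set, crop_id, agreed_value):
    """Link an image of a data element with its entered value.

    `training_set` is a dict keyed by crop_id, so each image is stored once.
    Returns True if a new example was added, False if it was already present.
    """
    if crop_id in training_set:
        return False
    training_set[crop_id] = TrainingExample(crop_id, agreed_value)
    return True


# Usage: the set expands as more data elements are entered.
training_set = {}
add_example(training_set, "page12_row3_surname", "Jensen")
add_example(training_set, "page12_row3_birthyear", "1874")
```

The resulting pairs of image and value are what a neural network would be trained on; the training step itself is outside the scope of this sketch.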

Using an app for microtasking also opens up crowdsourcing. Citizens who want to invest their time in data creation get access to the app and can contribute by entering data. It's an efficient, fast, and cost-saving way to establish your training data set - and a necessary step before machine learning completes the transcription task.

THE STRENGTH LIES IN THE PERSONAL MEETING

If you are curious about the possibilities of using microtasking and neural networks to establish your data for analytics, contact Christian Emil Westermann.

Book a meeting

About Rooftop Analytics

Rooftop Analytics is able to transcribe handwritten documents using machine learning. We make use of computer vision, neural networks, and microtasking. We incorporate logic, which enhances quality assurance, and we train the algorithm to the extent that it can read the documents without human involvement.


Rooftop Analytics uses open source applications for developing and documenting the algorithm code.

We are a team with unique competences in data science and see extraordinary opportunities in data.