Machine Learning Replaces Manual Transcription of Handwritten Documents
New technologies make it possible not only to digitize documents but also to transcribe handwritten Danish documents. For research projects and the public sector, this creates great value, because analysis can be based on a solid data set that goes far back in time.
Research institutions and the public sector often need to perform analytics on data that has not yet been digitized or transcribed. Many documents are scanned and stored electronically, but the information in them isn't available as structured data and therefore cannot be used for data analysis.
Especially in research projects, it's crucial to be able to see how associations in the data develop over time, which is why historical data becomes relevant. For socioeconomic research, this may be church books, health records, or elementary school grade sheets, which must be digitized into a data format that allows them to be combined with anonymized register data. The same is true in genetics: the DNA strand is digitized, but not in a format that can be linked with other relevant data.
Manual transcription is flawed. Traditionally, researchers have sent data abroad for manual transcription, where selected data is entered manually in tabular format. Although labor is cheap, the risk of mistyping is high, especially when the text is in Danish.
At the same time, it's not possible to identify or validate whether the data has been typed correctly. This introduces a source of error of unknown size, which can ultimately lead research projects to false conclusions without anyone being aware of it.
Machine learning is a subcategory of artificial intelligence that includes neural networks, whose name is inspired by the biological neural networks in the brain. Together with other image recognition techniques, such as computer vision, it is used to recognize where the data item is located in the scanned document.
Use Microtasking and Machine Learning
The alternative to manually typing in the documents is to use machine learning and microtasking. We develop an app that is able to identify and extract selected data elements from the document. The selected data elements are translated into digital data and stored in a structured form. This is possible when the machine learning algorithm uses neural networks.
The neural network must be trained to recognize the handwritten and/or typed words and letters. The app displays the selected data element from the document, and the user types in what they see. To ensure the quality of data entry, every data element is entered by at least two individuals.
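The double-entry check described above can be illustrated with a minimal sketch. It assumes each data element arrives as a list of typed strings, one per person; the function name reconcile and the rule "accept only on full agreement" are illustrative assumptions, not a description of the actual app.

```python
from collections import Counter


def reconcile(entries, min_entries=2):
    """Return the agreed value for one data element, or None if it needs review.

    `entries` holds what each person typed for the same image snippet
    (hypothetical input format, assumed for this sketch).
    """
    # Normalize whitespace so spacing differences do not count as disagreement.
    cleaned = [" ".join(e.split()) for e in entries]
    if len(cleaned) < min_entries:
        return None  # not enough independent entries yet
    value, count = Counter(cleaned).most_common(1)[0]
    # Accept only when every entry agrees; otherwise flag for another pass.
    return value if count == len(cleaned) else None
```

A stricter or looser rule, for example a majority vote among three entries, would fit the same shape; full agreement is simply the most conservative choice.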
Each time a word from the document is manually entered, the source is linked with it, i.e. the image of the data element is paired with the entered value. In this way, a training set is established, which expands over time and eventually allows the artificial intelligence to take over the rest of the process. The transcription of the document will then take place without human involvement. The result is a quality-assured data set, which researchers can use in their analysis.
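As a rough sketch of how such a training set could be assembled, the example below pairs each image snippet with the value the transcribers agreed on. The names TrainingExample, snippet_id, and build_training_set are hypothetical, and the input format (a mapping from snippet identifier to the typed entries) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    snippet_id: str  # hypothetical reference to the cropped image of the data element
    label: str       # the value the transcribers agreed on


def build_training_set(entries_by_snippet):
    """Keep only snippets where at least two people typed the same value."""
    examples = []
    for snippet_id, entries in sorted(entries_by_snippet.items()):
        # Collapse whitespace differences before comparing the entries.
        normalized = {" ".join(e.split()) for e in entries}
        if len(entries) >= 2 and len(normalized) == 1:
            examples.append(TrainingExample(snippet_id, normalized.pop()))
    return examples
```

Snippets with conflicting or too few entries are left out, so they can be shown to further users instead of being used to train the network on an uncertain label.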
Using an app for microtasking also opens up the possibility of crowdsourcing. Citizens who want to contribute to data creation get access to the app and can help by entering data. It's an efficient, fast, and cost-saving way to establish a training data set, and a necessary step before machine learning can complete the transcription task.
About Rooftop Analytics
Rooftop Analytics is able to transcribe handwritten documents using machine learning. We make use of computer vision, neural networks, and microtasking. We incorporate logic that strengthens quality assurance, and we train the algorithm until it can read the documents without human involvement.