Written by Simon Knudde
The world is digitalising; everybody knows it. Looking back over the past decades, one once-popular item that has all but disappeared is the encyclopaedia, a compilation of the knowledge of mankind. These books were expensive and typically outdated before they were even published. Their era ended when Wikipedia offered a digital, crowdsourced, free, and up-to-date alternative. The encyclopaedia business was disrupted, and the volumes now serve merely as decoration on our bookshelves.
Why am I talking about encyclopaedias and Wikipedia, you ask? Because I want to talk about information. Specifically, how to store, organise, and retrieve it.
In a broader context, books and newspapers used to be the go-to when looking for information. Today, we “google it”. What made Google and Wikipedia so attractive, and more importantly, why do we use them over books? The answer is simple. Thanks to their search engines, we no longer need to waste time searching: the algorithm retrieves the information for us.
At work, digitalisation is underway. Today, over 55% of companies have started their journey towards fully digitalised documents, yet billions are still being spent on paper [1][2]. Start-ups offering document management systems are flourishing. The efficiency gains are so evident that migrating to a digital document management system is an obvious go-to for every major company.
But how does it work? How can you turn physical paper documents into a digital way of working? What are the pitfalls to avoid?
The first step is to digitalise your existing documents. And let me tell you, simply scanning them into PDFs will not bring you much. Turning them into text data that algorithms can use, on the other hand, is what you should be focussing on. For this, one needs to perform intelligent character recognition (ICR), a widespread AI technology combining computer vision, in the form of optical character recognition (OCR), with natural language processing (NLP). The algorithm identifies text in images and compares the recognised text with a language corpus to recreate the text of the scanned documents. ICR can be found in many tools or even downloaded as a stand-alone from open-source repositories, such as Tesseract, for custom use.
Once ICR has been performed, your paper-based documents can be integrated into your databases and live alongside your new, digitally native documents.
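To make the two stages more concrete, here is a minimal sketch in Python. It assumes the open-source `pytesseract` wrapper and Pillow are installed alongside the Tesseract engine for the recognition step. The corpus comparison is illustrated with the standard library's `difflib`, which is a simplification chosen for this example and not how production ICR engines work internally. The function names `ocr_image` and `correct_tokens` are hypothetical.

```python
import difflib
import re


def ocr_image(path, lang="eng"):
    """Recognise the text in a scanned image.

    Assumption: pytesseract, Pillow and the Tesseract binary are installed.
    They are imported lazily so the rest of the sketch runs without them.
    """
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path), lang=lang)


def correct_tokens(text, vocabulary, cutoff=0.8):
    """Replace tokens that are not in the vocabulary by their closest match.

    `vocabulary` is a set of lowercase words standing in for the language
    corpus. Tokens without a sufficiently similar match are left untouched.
    Corrected tokens come back in lowercase.
    """
    candidates = sorted(vocabulary)

    def fix(match):
        word = match.group(0)
        if word.lower() in vocabulary:
            return word
        close = difflib.get_close_matches(word.lower(), candidates, n=1, cutoff=cutoff)
        return close[0] if close else word

    # Letters and digits are matched together because OCR often confuses
    # them (for example "0" for "o" or "1" for "l").
    return re.sub(r"[A-Za-z0-9]+", fix, text)


# Simulated raw OCR output containing two typical misreadings.
raw_text = "the c0ntract renewa1 date"
cleaned = correct_tokens(raw_text, {"the", "contract", "renewal", "date"})
```

In practice the vocabulary would come from a much larger word list, and the similarity cutoff would need tuning so that rare but valid words, names, and reference numbers are not "corrected" by mistake.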
But where is the value creation? Let’s cite a few examples:
Keyword extraction: identify which words make a document stand out from the others. This allows advanced analytics to build a network of relationships between documents (e.g. grouping all documents related to a specific client or event).
Date extraction: recreate the timeline of events and document generation, or automatically generate notifications for deadlines (e.g. contract renewals).
Search engine: have all your information at your fingertips in a matter of seconds.
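As a rough illustration of the keyword extraction and search examples, the sketch below scores words with TF-IDF (a common weighting scheme, used here as an assumption since no specific method is prescribed) and answers queries from a small inverted index. The document names and contents are made up, and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(documents, k=3):
    """Return the k highest-scoring TF-IDF words for each document.

    `documents` maps a document name to its text.
    """
    tokens = {name: tokenize(text) for name, text in documents.items()}
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))  # count each word once per document
    n_docs = len(documents)

    result = {}
    for name, words in tokens.items():
        if not words:
            result[name] = []
            continue
        counts = Counter(words)
        scores = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        result[name] = ranked[:k]
    return result


def build_index(documents):
    """Map every word to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


# Made-up example documents.
documents = {
    "contract_acme.txt": "Service contract with Acme. The contract renews yearly.",
    "invoice_acme.txt": "Invoice for Acme. Payment due within thirty days.",
    "memo.txt": "Internal memo about the office move.",
}
keywords = top_keywords(documents, k=1)
index = build_index(documents)
acme_documents = search(index, "acme")
```

Words that appear in every document get a score of zero, which is what makes the remaining keywords useful for telling documents apart. A real search engine would add ranking, stemming, and typo tolerance on top of this.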
OCR coupled with NLP thus allows you to improve your way of working. Take automatic notification generation: it reduces the risk of missing an important contract deadline and automates the labour-intensive task of monitoring all ongoing contracts. Do you want to explore more possibilities of automating key administrative processes in your organisation? Have a look at our Intelligent Automation video series!
To summarise, this first article tackled ICR as a way to turn paper-based documents into purely digital ones, a crucial step in the transition from the old way of working towards a future-proof system where all archives are integrated into modern, high-performing systems.
Click here to read the 2nd article of our series, where we will elaborate further on search engine implementation. Namely, we will discuss the important considerations to keep in mind when implementing a search engine, illustrated by a use case.
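The notification example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: dates are written either as YYYY-MM-DD or as day/month/year with slashes, every date found in a contract is treated as a potential deadline, and the names `extract_dates` and `upcoming_deadlines` are hypothetical. A real system would need NLP to tell a renewal date from, say, a signature date.

```python
import re
from datetime import date, timedelta

# Assumption: only these two numeric formats occur in the documents.
DATE_PATTERN = re.compile(
    r"\b(\d{4})-(\d{2})-(\d{2})\b"       # 2024-03-15
    r"|\b(\d{1,2})/(\d{1,2})/(\d{4})\b"  # 15/03/2024, read as day/month/year
)


def extract_dates(text):
    """Return all valid dates found in the text, in order of appearance."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        if match.group(1):
            year, month, day = (int(match.group(i)) for i in (1, 2, 3))
        else:
            day, month, year = (int(match.group(i)) for i in (4, 5, 6))
        try:
            found.append(date(year, month, day))
        except ValueError:
            continue  # skip impossible dates such as 31/02/2024
    return found


def upcoming_deadlines(contracts, today, window_days=30):
    """List (date, contract name) pairs falling within the next window_days."""
    limit = today + timedelta(days=window_days)
    alerts = []
    for name, text in contracts.items():
        for found in extract_dates(text):
            if today <= found <= limit:
                alerts.append((found, name))
    return sorted(alerts)


# Made-up contracts for illustration.
contracts = {
    "acme": "Renewal due 2024-03-15.",
    "globex": "Renewal due 01/09/2024.",
}
alerts = upcoming_deadlines(contracts, today=date(2024, 3, 1))
```

Running such a check every day and sending the resulting list to the contract owner is the kind of monitoring task that otherwise has to be done by hand.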
[1] IDC Document Processes Survey, May 2019
[2] Usage of digital documentation tools 2020 by department, Kimberly Mlitz, October 2021