Written by Mathieu Xhonneux
Businesses traditionally rely on the collection and analysis of quantitative data to measure the performance of their operations. Thanks to the fast-paced digitalization of these processes, unprecedented volumes of data are now generated and exchanged between companies. Besides conventional structured data – think of data stored in Excel sheets or SQL databases – tremendous amounts of actionable information sit in unstructured textual records – emails, forms, invoices, websites, … – and often go unexploited. In particular, with nearly 80% of its data stored in the form of unstructured text [1], the medical sector is sitting on a gold mine. Written medical records, prescriptions, doctor's notes, pathology reports, … all this textual data could be leveraged to greatly accelerate the development of drugs, improve public health policies, or support scientific research.
Until very recently, the bulk processing of unstructured text had long been considered too expensive and time-consuming. Recent achievements in the field of natural language processing (NLP), with ChatGPT as the most prominent example, have unlocked many possibilities. By combining computational linguistics – rule-based modelling of human language – with machine learning models, NLP algorithms enable computers to automate various tasks on unstructured text (data mining, email classification, chatbots, …). Whereas a human previously needed an hour to screen five documents, NLP algorithms can now analyze and extract key insights from thousands of documents in a matter of seconds. In practice, however, implementing NLP techniques is never straightforward. Developing a production-ready NLP solution requires solving many problems: raw data has to be parsed and cleaned, domain-specific vocabulary must be integrated by training custom machine learning models, incorrect classifications caused by semantic ambiguities need to be detected, and so on.
A biomedical NLP pipeline identifies all nouns present in a sentence and maps them to records of a medical database
For a client active in the pharmaceutical sector, BrightWolves designed and deployed in production a custom NLP pipeline that extracts biomedical concepts from unstructured patient records. In a fraction of a second, the pipeline identifies relevant keywords in a health record and maps them to a universal medical database, effectively transforming the unstructured records into actionable structured data. Starting from our client's specifications, our team crafted the different stages of this custom NLP pipeline:
1. We identified appropriate language models for the processing of biomedical texts (such as BioBERT [2] or SciBERT [3]), and integrated them into the well-known spaCy framework [4].
2. Based on the output of the language model, we designed custom heuristics to identify and extract relevant biomedical concepts.
3. The processed and structured data is finally inserted automatically into the client's back-end database.
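To give a flavour of the concept-extraction step, the toy sketch below maps terms found in free text to entries of a small, hypothetical medical code table. The production pipeline relies on a transformer language model (such as BioBERT) running inside spaCy; this stand-alone version uses only plain string matching, and the `CONCEPT_TABLE` lookup, the `extract_concepts` function and the sample record are all illustrative inventions, not the client's actual data or code.

```python
import re

# Hypothetical mini "medical database": surface forms -> (code, canonical name).
# The codes shown are real ICD-10 / ATC identifiers, used purely as examples.
CONCEPT_TABLE = {
    "hypertension": ("I10", "Essential (primary) hypertension"),
    "high blood pressure": ("I10", "Essential (primary) hypertension"),
    "diabetes": ("E11", "Type 2 diabetes mellitus"),
    "aspirin": ("B01AC06", "Acetylsalicylic acid"),
}

def extract_concepts(text: str) -> list[dict]:
    """Find known terms in `text` and map them to database records."""
    matches = []
    lowered = text.lower()
    # Try longer terms first so "high blood pressure" is preferred
    # over any shorter overlapping surface form.
    for term in sorted(CONCEPT_TABLE, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            code, name = CONCEPT_TABLE[term]
            matches.append({"span": term, "code": code,
                            "concept": name, "start": m.start()})
    # Return concepts in the order they appear in the record.
    return sorted(matches, key=lambda c: c["start"])

record = "Patient with high blood pressure, prescribed aspirin daily."
for concept in extract_concepts(record):
    print(concept["span"], "->", concept["code"], concept["concept"])
```

In the real pipeline, this dictionary lookup is replaced by model-based entity recognition, which also handles spelling variants and context-dependent ambiguities that exact matching cannot.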
The final solution achieves high classification accuracy, which enables our client to automatically integrate the NLP-processed records into its operations.
Beyond the pharmaceutical industry, the effectiveness of NLP in extracting data can be leveraged in many other sectors. Do you have operations that could be improved by smart use of data analytics? If you're interested in exploring how our expertise can help you mine the gold hiding in your data, please get in touch with Sven Van Hoorebeeck for a conversation.
Sources