V² Labs: Origins and NLP for Chart Review

Introduction

Vedant and I have been working together for the past 3 years. At the end of 2017, we discussed creating a venture to combine our interests in healthcare and technology. We've had our own ups and downs throughout this time. We've weathered Step exams, clinical rotations, job changes, and multiple ideas to arrive at where we are now. This is our story.

Part 1: Origin Story

Our first few ideas revolved around utilizing big data to improve the insurance negotiation process. It was a vague amalgamation of core concepts from value-based contracts, social determinants of health, & big data dashboards. We spoke to several family & friends who were in the insurance/Accountable Care Organization (ACO) sphere - the conversations were helpful, but we didn't really have a concrete solution. Given our intermittent scheduling, the idea never gained much traction. Ultimately, it kind of died… and our efforts stalled for a while. We decided to start afresh at the beginning of my research year.

When we resumed our meetings at the beginning of this past year, I knew 2 things:

  1. This year was a perfect opportunity for us to actually get something off the ground (I was in my final year of my MSDS at Duke, and he was on a flexible research year).
  2. The insurance idea was bleh.

Brainstorming with a blank slate, we approached the initial few sessions with the goal of finding a new idea that we both loved. We set Labor Day as the deadline. Our goal was to generate as many ideas as possible with absolutely no limitation. Ultimately, we’d have to prepare a pitch deck to convince the other person to support our best idea.

Part 2: Brainstorming

Our initial ideas included a chatbot for chronic pain management and a streamlined discharge planning process. We had been exposed to these problems peripherally through our professional and personal lives, but we didn't have the day-to-day experience to definitively understand the unique challenges and current workflows. Plus, as with our insurance ideas, we didn't have the technological or professional capital to know where to start. Instead of being able to iterate and tinker on our own, we'd have to talk to domain experts, try to figure out their pain points, and gauge whether they'd be interested in adopting the technology. We placed these ideas low on our list.

By the end of our month of brainstorming, Vedant had finished his onboarding as a research fellow and had started his work in database management. His job was to update databases in preparation for the major conference abstract submissions, AKA - “chart review.”

After he had learned the nuances of the data types, locations, and parameters, the work became extremely monotonous and repetitive. He had heard complaints from his classmates, residents, and fellows over the past 3 years about how mind-numbing chart review was, but this was his first time experiencing it himself. They were 100% right: you could spend 8 hours a day doing chart review and pretty much hate your life at the end. It definitely required a clinical eye to parse the more nuanced language, but a lot of the work could be achieved by "Ctrl-F"-ing keywords and finding the variable of interest right next to them.

While the medium has changed from paper to electronic charts over the years, the associated frustration with chart review has not changed.


I can’t stress this enough - chart review SUCKS!

We were talking about the challenges of chart review when I realized that Natural Language Processing (NLP) could help with this process. Theoretically, it's easy to understand NLP's huge potential to unlock insights from unstructured data. So much information-rich clinical data is in free-text form, and extracting insights directly from clinical notes could help fill the gaps not captured by structured data like ICD-10 and CPT codes. However, NLP is still in its infancy and remains a growing research area. When applied to completely unstructured clinical notes, it runs into the problem of too many variations in structure, syntax, and terminology - language IS immensely complex.

Although unstructured data is still too messy for NLP to capture meaningful insights from a large corpus of clinical notes without some deep learning, we realized that semi-structured data (e.g., procedure, radiology, and pathology notes) could be more amenable to simpler NLP techniques. Furthermore, given that researchers were typically interested in specific pre-defined variables that were relatively constant across notes, we felt we could develop a rules-based algorithm that would achieve reasonable performance on the task.

Once we discussed this idea, Vedant felt we had something special. He pitched hard to convince me that we should build an automated chart review process for clinical research. Initially, I was skeptical. We didn’t have any data to confidently say that there’d be a market in the clinical research space, but Vedant knew there’d be at least one customer if we could build this out: himself.

Part 3: Natural Language Processing

This couldn’t be an article about natural language processing without at least one reference to Watson playing Jeopardy (New York Times).

In the world of natural language processing, rules-based systems are always the starting point, and sometimes the ending point as well. Take the famous IBM Watson Jeopardy! example: under the hood, Watson was powered by thousands and thousands of rules, hard-coded and tested by engineers to achieve the highest accuracy possible on training questions. There is additional complexity built on top, but if you break the system down to its most basic unit, you get a series of if-then statements. For example, given the clue:

POETS & POETRY: He was a bank clerk in the Yukon before he published “Songs of a Sourdough” in 1907.

The Watson system has to perform some complicated NLP to understand what the question is asking for and align that with the architecture of relations it has stored in its backend (Wikipedia pages). Ultimately, the system ends up with a semantic relation like this:

authorOf(focus, “Songs of a Sourdough”)
temporalLink(publish(…),1907)

This then allows Watson to search through its database for candidate answers that match this semantic relation, using a huge series of rules that allow it to parse through the Wikipedia text and assign likelihoods to given matches. Watson then returns the answer with the highest likelihood, if it is above the chosen confidence threshold. (1)

Obviously, systems like that are only as good as their human-encoded rules, and that's why researchers are now moving towards larger neural network models that can pick up on intrinsic similarities in large corpora of text, even when those similarities might not be evident to us puny humans.

The most popular framework for this today is the transformer, best known through BERT (Bidirectional Encoder Representations from Transformers), which has been improved and replicated many times with different plays on the wording of the original name. At a high level, the original transformer uses an encoder and a decoder: the encoder converts the input text into numerical vectors the computer can work with, and the decoder produces the prediction for the given task. BERT uses only the encoder and reads all of the text at once, in both directions, as opposed to earlier language models that simply read from left to right. This additional context allows the model to learn more information about a specific token (word), a feature that has been shown to be extremely useful for language modeling. We'll dive deeper into BERT and language models in a later post.
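
To see this bidirectionality in action, here is a minimal sketch using the Hugging Face transformers library (our choice of tooling for illustration; any masked-language-model setup would do). It asks a pretrained BERT to fill in a hidden word using context from both sides:

```python
# Minimal sketch using the Hugging Face `transformers` library (assumed
# installed via `pip install transformers torch`); an illustration of
# masked-language modeling, not part of our chart review pipeline.
from transformers import pipeline

# BERT is pretrained to predict a hidden token using context from BOTH
# directions - the bidirectionality described above.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Words on either side of [MASK] steer the prediction.
for pred in fill_mask("The patient was admitted with chest [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```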

BERT-based models: clearly, researchers are not known for their creative naming skills.

While this approach can (and has) led to massively improved performance, not understanding how a system makes its decisions has been a continual hindrance to the application of Machine Learning (ML) in healthcare, and language neural networks are no exception. Just as rigorous and transparent science is required of clinical trials, the same lens is cast on ML, and a "black box" model behind the predictions is met with skepticism. That said, some institutions are still moving forward with neural networks, but they ensure there is a "human in the loop": someone with clinical expertise who can validate the model's outputs before they affect a patient. We see this willingness to experiment in clinical care as well, such as when we accept medications with proven benefit even when we don't know the mechanism of action. In fact, many older drugs were in clinical use before we knew their exact effects on biological pathways. We want to move as quickly as we can while ensuring that patient safety comes first, so the healthcare industry is grappling with this balance between innovation and safety.

The use of transformer models is also restricted by their hunger for data. To get effective predictions, you either need a large amount of data to train your own version of the model or you take pretrained weights and fine-tune them for your task, both of which require time and thought to achieve a reasonable solution. While there is continued research around simplifying this workflow, it still seems a ways away from being implemented in production for large health systems. For these reasons, many rules-based language systems are still in use and development in healthcare, even if they are marketed as "AI" or "Machine Learning."

Part 4: Complexities in Clinical Notes

As we mentioned before, this task of parameter identification and value extraction is perfect for rules-based algorithms: akin to a "Ctrl-F" command, we simply have to define rules that tell the program when and where to look for parameters and how to extract their associated values. Easy, right? Well, not all the time.

When we looked at clinical notes for Prostatic Artery Embolization, we found varying degrees of complexity in the rules we needed to create and the information we were able to extract. The simplest cases were those where the parameter names and values were formatted similarly across all of our notes. For us, this included parameters such as the Barbeau test category (where the value was always entered as A, B, C, or D) and the type of access (radial, ulnar, femoral). Having discrete categories for these values not only made our rules simpler, it also meant there was very little variability in the way they were entered in the notes.

This is worth emphasizing: if you are truly interested in getting value from your unstructured data, it is worth spending the time to standardize note entry. Templates can be useful here, and sharing the research goals from the NLP side can help physicians see where they deviate from the agreed-upon standard. Just like everywhere in data science, the better the data, the better the output.

That's why we achieved the best performance on our lab pathology notes: these were essentially structured fields entered into free-text notes, so they always followed the convention of [lab test] [value] %. With no variability in the formatting, we were able to easily extract the values for the labs we wanted, and were even able to extract additional labs that weren't pre-defined (a simplified sketch follows the figure below).

Sample urinalysis report: structured, machine-generated text like this is the simplest use case; freehand clinical notes are the most complex.
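
As an illustration, here is a minimal sketch of the kind of rule that handles these machine-generated lab lines. The regex and function names are our own simplification for this post, not our production rule set:

```python
import re

# Simplified rule for lab lines that follow "[lab test] [value] %",
# e.g. "Neutrophils 62.1 %". Assumes the consistent machine-generated
# formatting described above.
LAB_PATTERN = re.compile(
    r"(?P<test>[A-Za-z][A-Za-z ]*?)\s+(?P<value>\d+(?:\.\d+)?)\s*%"
)

def extract_labs(note_text: str) -> dict:
    """Return {lab test: value} for every line matching the convention."""
    return {
        m.group("test").strip(): float(m.group("value"))
        for m in LAB_PATTERN.finditer(note_text)
    }

print(extract_labs("Neutrophils 62.1 %\nLymphocytes 28.4 %"))
# -> {'Neutrophils': 62.1, 'Lymphocytes': 28.4}
```

Because the formatting never varies, one pattern covers every lab line, which is also what let us pick up labs we hadn't pre-defined.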

The next level of complexity we encountered had to do with extracting numerical values that were defined in different ways. In our case, this was the left and right embolization volume. The good news is that these were usually defined explicitly in the summary section of the note; the bad news is that there was a good amount of variability in how these values were entered.

The simplest case was:

“Selective catheterization of right prostatic artery which is a branch off the internal pudendal artery with 7ml of 300-500 um embolic particles”

In this case, we were able to pick up that the volume would be for the right artery, and since "7ml" is the only word in the sentence that contains "ml", we can extract it as the right embolization volume. (Note: a rule that relies on all instances of the characters "ml" could be fragile or sufficient, depending on how confident you are in the formatting of your notes. As I mentioned before, we knew the formatting would be consistent for these parameters, which is why we chose the "ml" demarcation. However, if the units could differ or if there could be other instances of "ml", you would have to think further about this type of rule.)
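
As a sketch, the simple rule boils down to something like this (our own simplified pattern, assuming the consistent formatting described above):

```python
import re

# Our simplified pattern: any number immediately followed by 'ml'.
ML_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*ml", re.IGNORECASE)

def extract_embolization_volume(sentence: str):
    """Simple rule sketch: detect the side keyword, then grab every
    'ml' value in the sentence. In the simple case there is only one."""
    side = "right" if "right" in sentence.lower() else "left"
    volumes = [float(v) for v in ML_PATTERN.findall(sentence)]
    return side, volumes

print(extract_embolization_volume(
    "Selective catheterization of right prostatic artery which is a "
    "branch off the internal pudendal artery with 7ml of 300-500 um "
    "embolic particles"))
# -> ('right', [7.0])
```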

However, we also ran into cases like:

“Embolization of the right prostatic artery with 10 ml 100-300 (2ml proximal) and 300-500 um (2 ml proximal 6ml distal) (inferior vesical branch)”

In this case, the total right embolization volume is 10ml, but you can see that our rule from above would also pick up the additional proximal and distal volumes. The sum of these additional values actually adds up to the total embolization volume (2+2+6=10), so we decided to build that check into our rule. We accomplished this by extracting all of the "ml" values, sorting them in descending order, and then checking whether the sum of the remaining values adds up to the first (max) value we captured. If they agree, as they do in this case, we are set. Otherwise, we return all candidates to the user and let them validate and select the correct value (which we will show in our next post). As you can see, this is already getting messy, and there were a few other edge cases in the embolization volume extraction that we had to work through to create the best rule possible.
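
Here is a minimal sketch of that sum check, simplified from our actual rule, which handles more edge cases than shown:

```python
def resolve_total_volume(volumes):
    """Sort the extracted 'ml' values in descending order; if the rest
    add up to the largest value, treat the largest as the total.
    Otherwise, return all candidates for the user to validate."""
    ordered = sorted(volumes, reverse=True)
    total, components = ordered[0], ordered[1:]
    if not components or abs(sum(components) - total) < 1e-6:
        return total
    return ordered  # ambiguous: surface everything for human review

print(resolve_total_volume([10.0, 2.0, 2.0, 6.0]))  # -> 10.0 (2+2+6=10)
print(resolve_total_volume([10.0, 2.0, 3.0]))       # -> [10.0, 3.0, 2.0]
```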

The final layer of complexity we encountered had to do with open-ended values. For us, this came in the form of the left and right prostatic artery origins. Sometimes these values would be present in the summary lines alongside the embolization volume, as in the snippets shown above. These cases were the simplest, since we just created a dictionary of potential artery locations beforehand and extracted any matching names. However, there were instances where the artery origins were not present in the summary section, which meant we had to dig through the note itself to find them. Searching for the dictionary values alone didn't suffice, since there could be multiple mentions of the "internal pudendal" artery that weren't associated with the origin. We also had to build context dependence into this search, since sometimes the side of the origin (right or left) was not present in the sentence itself but was instead established earlier in the paragraph. For these instances, we had to store and track the currently referenced artery and assign any following origin to that side of the body. For example:

A 5Fr catheter was exchanged and contrast injection given illustrating the anterior and posterior divisions of the right internal iliac artery. The prostatic artery was noted to be a branch off the internal pudendal artery.

Here, the second sentence gives us the prostatic artery origin, but not whether it's for the right or left side. The preceding sentence contains that information, and since we read through the note line by line, our variable tracking which side is being discussed will have been set to "right" by that sentence; that side is then applied to the origin when it is found in the next sentence.
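
A minimal sketch of that stateful pass might look like the following. The sentence splitting and origin dictionary are deliberately simplified for illustration (the real dictionary was built with clinical input, and the real rules also check that a match is actually describing an origin):

```python
# Simplified, illustrative dictionary of candidate artery origins.
ORIGIN_DICTIONARY = ["internal pudendal", "inferior vesical", "obturator"]

def extract_origins(note_text: str) -> dict:
    """Scan sentence by sentence, tracking which side ('left'/'right')
    was last referenced, and assign any origin found to that side."""
    origins = {}
    current_side = None
    for sentence in note_text.split("."):  # naive sentence split
        lowered = sentence.lower()
        if "right" in lowered:
            current_side = "right"
        elif "left" in lowered:
            current_side = "left"
        for origin in ORIGIN_DICTIONARY:
            if origin in lowered and current_side:
                origins[current_side] = origin
    return origins

note = ("A 5Fr catheter was exchanged and contrast injection given "
        "illustrating the anterior and posterior divisions of the right "
        "internal iliac artery. The prostatic artery was noted to be a "
        "branch off the internal pudendal artery.")
print(extract_origins(note))  # -> {'right': 'internal pudendal'}
```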

These challenges are just a few examples of what we ran into, but we learned a ton from this iterative process of developing rules, finding gaps, improving our rules, and repeating. Sometimes it was a linguistic reason for the gap, other times it was a clinical reason; either way, we worked closely together to ensure that the rules we built would allow us to achieve the highest accuracy possible on our notes.

With all this in mind, no NLP system is perfect, and sensitivity to errors is paramount in healthcare. Even though there is no formal tracking of human error in data collection for research, we knew an automated system like this would need strong performance to actually be used, and we also recognized that there would have to be a human in the loop to validate and correct data points before they were used in an actual study. The eventual goal would be a fully automated system, but the stakes in healthcare are too high not to build in a layer of transparency and checking.

For these reasons, we decided to integrate our rules-based algorithm into a local application that would allow users to upload their notes, run our algorithm, and then go through the notes and correct any mistakes that were made before ultimately exporting the data. We both felt that this was critical to get buy-in, which will be the focus of our next post.

Part 5: Next Steps…

We worked on this initial proof of concept for 6 months, continuously tinkering with the approach and learning the nuances of NLP. After months of iteration, we pitched the idea to mentors and warm contacts to gauge interest in using our code-based approach to automate the chart review process.

Our goal with this first post was to share some of our work with the world, in the hopes that our approach and learnings can inspire others to build similar solutions. The truth is that, now more than ever, the healthcare industry needs interdisciplinary contributors to come together, share insights, and prototype scrappy solutions that solve challenges for real patients and physicians at the ground level. That's why we are creating V² Labs: to continue sharing insights and knowledge that we feel can empower clinicians, students, data scientists, and all of the other changemakers who are passionate about solving these issues. We are in the process of building our website and honing our content, so if you have thoughts or questions, please feel free to reach out to either of us through LinkedIn or Twitter, or send us an email at thev2labs@gmail.com. Thanks for reading!

References

1. Jeopardy! as a Modern Turing Test: Did Watson Really Win?
