Measuring the quality of a model’s output is difficult, as the definition of ‘accuracy’ is often determined by the use case, and any metric will be gamed. This is more than a technical problem.
Writing this blog has inspired me to re-explore the broader field of machine learning. Recently, I was reading Filip Piekniewski’s blog post about vision processing for autonomous vehicles and this quote really got me:
“…it is horribly difficult to measure [rare situations] because rare and dangerous situations are above everything else rare (but not nonexistent!)…”
That gets to the heart of a real challenge in discussing multilabel classification of text, an application to which Textician’s NoNLP™ technology is especially suited. I want to explore this issue in the context of a specific use case: reimbursement coding for medical records. Grab the strap – your self-driving train is departing.
In multilabel classification, you are answering the question, “Which subset T of tags from dictionary D best characterizes the input?” In our case, the input is unstructured text and, more specifically, doctors’ notes, lab summaries, pathology reports, etc. This is a horribly inconsistent corpus. And, just to make it fun, T is typically a list of fewer than 10 items, while D is four or five orders of magnitude larger! This is a difficult problem to say the least!
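To make the shape of the problem concrete, here is a toy sketch in Python. The codes, the tiny dictionary, and the keyword "classifier" are all hypothetical placeholders for illustration – not a description of how NoNLP™ actually works:

```python
# A toy sketch of the multilabel framing: D is a huge code dictionary,
# T a small subset of it per document. The codes here are illustrative only.
# In reality |D| is in the tens of thousands, while T for a single
# encounter is usually fewer than 10 items.

D = {"E11", "E11.9", "W58.01", "I10", "J45.909"}  # tiny stand-in dictionary

def classify(text: str) -> set:
    """Placeholder for a multilabel classifier: text in, subset of D out."""
    # A real model would score every code in D against the note.
    return {"E11.9", "I10"} if "diabetes" in text.lower() else set()

T = classify("Patient with type II diabetes, hypertension well controlled.")
assert T.issubset(D) and len(T) < 10
```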
But it’s almost as difficult to evaluate the results in a useful way… that is, in a way that you can compare models and modeling technologies to determine which has the “best” result. Why?
If T is limited to one item, then there are a lot of metrics, including the ever-popular F1 score. It’s pretty easy to test: count the true positives, false positives, and false negatives, and calculate. Perhaps that’s OK.
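For the single-label case, the calculation really is that simple. A minimal sketch (the example counts are invented):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 90 correct hits, 10 spurious, 30 missed:
score = f1(tp=90, fp=10, fn=30)  # precision 0.90, recall 0.75 -> F1 ~ 0.818
```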
But F1 in multilabel classification – typically averaged across D – breaks down rather rapidly. The fundamental problem is that a macro average presumes an equal proportion of occurrences of each element of D. That’s decidedly not the case in medical coding: the ICD-10 code E11.9 by itself is somewhat more prevalent than the entire W58 group! But a micro average swamps the rare elements to the point that they are irrelevant.
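A small illustration of how the two averages diverge; the per-label confusion counts are invented, but the imbalance mirrors the E11.9-versus-W58 disparity above:

```python
# Hypothetical per-label confusion counts: one very common code handled
# well, one rare code mostly missed.
labels = {
    "E11.9": {"tp": 900, "fp": 100, "fn": 100},   # common
    "W58.01": {"tp": 1, "fp": 0, "fn": 9},        # rare
}

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Macro: average the per-label F1s -- the rare label counts as much
# as the common one, dragging the score down.
macro = sum(f1(**c) for c in labels.values()) / len(labels)

# Micro: pool the raw counts first -- the rare label all but vanishes.
tp = sum(c["tp"] for c in labels.values())
fp = sum(c["fp"] for c in labels.values())
fn = sum(c["fn"] for c in labels.values())
micro = f1(tp, fp, fn)
# macro ~ 0.54, micro ~ 0.90: same model, very different story.
```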
F1 is thus easily gamed. If we include the long tail of D – Filip’s “rare (but not nonexistent)” codes – we heavily distort the average away from the most frequent codes. However, the opposite – simply averaging F1 for the most frequent N codes – disregards those rare (but important) codes altogether. In a true embodiment of Goodhart’s Law, most computer-assisted coders, all of which are based on rules-based NLP technology, use the latter approach to their advantage: why not keep your average up by disregarding rare codes for which you haven’t coded any rules, especially when the rare code is unlikely to be in the test set? Why not? Because it makes a difference in the real value of the application. (See below.)
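The trick is easy to demonstrate with made-up numbers: give the untargeted rare codes their honest F1 of zero, then quietly drop them from the average:

```python
# Per-label F1 for a hypothetical dictionary: a few frequent codes the
# rules handle well, plus a long tail of rare codes with no rules (F1 = 0).
# All codes and scores are invented for illustration.
frequent = {"E11.9": 0.92, "I10": 0.90, "J45.909": 0.88}
rare_tail = {f"RARE{i}": 0.0 for i in range(97)}  # hypothetical rare codes

total = sum(frequent.values()) + sum(rare_tail.values())
full_macro = total / (len(frequent) + len(rare_tail))   # ~ 0.027
top_n_macro = sum(frequent.values()) / len(frequent)    # ~ 0.90
# Same model, two "accuracies" -- the second one goes in the brochure.
```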
The Worse Problem
The measurement problem gets worse when you consider that medical codes are hierarchical, with groups of codes that are mutually exclusive. If the model returns a positive for both W58 (“bitten by a crocodile”) and W58.1 (“bitten by a crocodile at home while engaged in a leisure activity”), one of those is a false positive. If the model returns W58.00 (“bitten by a crocodile while engaged in a sports activity”) and W58.01 (“bitten by a crocodile while engaged in a leisure activity”), it’s a judgment call which is correct if the patient was playing a game of backyard soccer… with one code “correct” and the other a false positive. Both of these examples would count against any formal measure of accuracy and, given their rarity, the likelihood of having enough training data to fix the model is nil.
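One way to surface the first kind of inconsistency is to scan a prediction set for ancestor/descendant pairs. This sketch assumes that truncating a code string yields its ancestors – only approximately true for real ICD-10-CM, where the hierarchy has its own rules:

```python
# ICD-10-style codes are hierarchical: "W58.01" sits under "W58".
# Minimal consistency check (a sketch): flag predictions where a code
# and one of its ancestors are both returned -- one of them must be
# a false positive.

def ancestors(code: str):
    """Yield approximate ancestors by truncation: W58.01 -> W58.0 -> W58."""
    while "." in code or len(code) > 3:
        code = code[:-1].rstrip(".")
        yield code

def redundant_pairs(predicted: set) -> set:
    """All (ancestor, descendant) pairs present together in a prediction."""
    return {(a, c) for c in predicted for a in ancestors(c) if a in predicted}

pairs = redundant_pairs({"W58", "W58.01"})  # flags the parent/child clash
```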
The Even Worse Problem
As we discussed in an earlier post, it’s often best if the Context is expressed in probabilities rather than binary values for each element. And that is the case for our multilabel classifier: we can be 90% confident that E11 (“Type II Diabetes”) applies and 65% confident that E11.9 (“Type II Diabetes without complications”) applies. Suppose the latter is deemed “correct”; is the former 90% of a false positive?
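One possible accounting – an assumption on my part, not a standard metric – is to let each probability contribute fractionally to the confusion counts:

```python
# Soft confusion counts (a sketch): each prediction contributes its
# probability rather than a hard 0/1. If E11.9 is the "correct" code,
# the 0.90 on E11 really does add 0.9 of a false positive.

gold = {"E11.9"}
predicted = {"E11": 0.90, "E11.9": 0.65}

soft_tp = sum(p for code, p in predicted.items() if code in gold)      # 0.65
soft_fp = sum(p for code, p in predicted.items() if code not in gold)  # 0.90
soft_fn = sum(1 - predicted.get(code, 0.0) for code in gold)           # 0.35
```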
Perhaps – despite mathematicians’ best efforts – it doesn’t matter, because that’s not the worst of it.
The Worst Problem
The above is all about mathematics – it’s a difficult and interesting set of considerations. But the worst problem is money, because superimposed on all of the above is an objective function in dollars and hundreds of dollars. (This is healthcare – there’s no sense in cents.) It turns out that the reimbursements for more specific codes like E11.9 are higher than for nonspecific codes like E11. At the same time, there are severe penalties for “over-coding” – for using a more specific code than justified. So, as noted in the previous section, the more specific (but less confident) code may be correct relative to the objective.
Or not. Because in US healthcare, there is another complication: denials. The payer (e.g., the insurer) can deny a claim, and they are, generally, more likely to deny a claim for a specific code (which is rarer but carries a higher reimbursement). So, theoretically, the objective function could be shaped such that the nonspecific code is the correct one, even though the specific code applies.
And then denials can be appealed, and a certain number are then approved. And denied reimbursements can be recoded and resubmitted. That objective function sure gets pretty complex… and detached from “accuracy” of the coding model!
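To see just how detached that objective can get, here is a back-of-the-envelope expected-value sketch. Every probability and dollar figure below is invented for illustration:

```python
# Expected dollar value of submitting a code, with denials, appeals, and
# over-coding penalties folded in. All numbers are made-up placeholders;
# the point is only that the revenue-optimal code need not be the "most
# accurate" one.

def expected_value(reimbursement: float,
                   p_denial: float,
                   p_appeal_success: float,
                   appeal_cost: float,
                   p_overcoding_audit: float = 0.0,
                   overcoding_penalty: float = 0.0) -> float:
    paid_first_pass = (1 - p_denial) * reimbursement
    # A denied claim can be appealed; some fraction of appeals succeed.
    paid_on_appeal = p_denial * p_appeal_success * (reimbursement - appeal_cost)
    penalty = p_overcoding_audit * overcoding_penalty
    return paid_first_pass + paid_on_appeal - penalty

# Specific code: pays more, denied more often, carries over-coding risk.
specific = expected_value(900, p_denial=0.50, p_appeal_success=0.40,
                          appeal_cost=100, p_overcoding_audit=0.05,
                          overcoding_penalty=2000)   # -> 510.0
# Nonspecific code: pays less but sails through.
nonspecific = expected_value(600, p_denial=0.05, p_appeal_success=0.40,
                             appeal_cost=100)        # -> 580.0
# With these (invented) numbers, the nonspecific code wins -- even though
# the specific code "applies".
```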
OK – we’re off the rails. Autonomous locomotives apparently have a way to go (pun intended). Maybe there’s a need for machine learning to optimize the overall objective function of medical reimbursement, with the codes as a somewhat fungible intermediate. Perhaps the “accuracy” of the codes is fungible as well. Hmmm….