One concept that really struggles in the statistics-to-English translation dictionary is error. While some might think the word means there is a mistake that demands fixing, most understand that it is somehow connected to margin of error or some other concept telling us the results need to be taken with a grain of salt. Even armed with that understanding, people can misread what they are looking at or how the concept affects what they do with the analysis. Further, how we define what error is acceptable can shape how we process the analysis we are asked to use for improvement.
Before I begin: in this essay, I am focusing exclusively on one type of measurement error, the error associated with analyzing results based upon survey data. There are other ways to insert error into data collection: poor question design, issues with sample construction, and outreach mode, just to name a few. These would also be good topics for essays, but not today.
[I wrote a whole section explaining what error and margin of error are. I cut it for length and because it felt like a long footnote rather than something that propelled the essay forward. In this essay, I am assuming that the reader understands what sample error is and how to read margin of error numbers. If you don’t, or want a refresher, or just like reading my essays, I will post it separately.]
There is a lot of math and a few assumptions made when presenting a margin of error in data results. The essential takeaway from my perspective is that we talk about error in an abstract way, since we don’t really know how much error is present in any number. We will never know the population number, so we cannot know how accurate the sampled number is; we can only assume that we are likely to be close. But as you talk about margin of error with audiences, there are a few things to keep in mind.
- Just because we state that there is a range, it does NOT mean all answers within that range are equally possible. For example, in some research we discover that the observed score is 57% with a margin of error of +/- 4%. The population number would fall somewhere between 53% and 61%, but it is not equally likely to be any of those numbers. It is still more likely to be 57% than anything else, though we cannot discount the possibility of it being closer to either end of the range. For many people, though, their desire to be happy or depressed will lead them to assume the best or worst of the data and treat the number at one end as the ‘real’ number.
- Error works both ways. Often people will assume that error always depresses scores. So, they will assume that if the sampled number is 57%, it is really 61%, not realizing (or honestly accepting) that it could just as easily be 53%, and it is most likely 57%. Mathematically, half of the time the sample will generate a number greater than the population number, and half the time it will generate a number lower than the population number. Given that you don’t know which, you have to take the number at face value and not always assume the best or worst of the data.
- The only way to shrink your error range is to do more surveys, but that is easier said than done: margin of error shrinks with the square root of the sample size, so cutting it in half requires roughly quadrupling the number of surveys you collect (see the sketch after this list). While a small margin of error is nice, most don’t want to double or triple their surveying budget to get it.
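To make that square-root relationship concrete, here is a minimal sketch using the textbook margin-of-error formula for a proportion. The 57% score is carried over from the example above; the sample sizes are invented for illustration.

```python
# Margin of error for a sample proportion: MoE = z * sqrt(p * (1 - p) / n).
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Margin of error for a sample proportion at ~95% confidence (z = 1.96)."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.57  # observed score from the example above
for n in (150, 600, 2400):
    print(f"n = {n:5d}: +/- {margin_of_error(p, n) * 100:.1f}%")

# n =   150: +/- 7.9%
# n =   600: +/- 4.0%
# n =  2400: +/- 2.0%
```

Because the sample size sits under a square root, halving the margin of error means quadrupling the number of completed surveys, which is why the budget conversation ends most of these discussions.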
All of this brings me to my first point, which is error is like dust; you will always have it. You will likely have less dust if you dust daily versus weekly, but you will always have some floating around. Likewise, short of surveying the entire population, there will always be error in the data. The main difference between error and dust, though, is that dust you can often see, but error is only assumed. In a bipolar world where everything is perfect or garbage, though, you will have to learn to accept the good rather than curse the darkness.
The issue here is that most patient experience data is reported without a margin of error, so it is impossible to know whether the data is improving, regressing, or simply fluctuating within the margin of error. This is problematic (and I have mentioned this before), because leaders will be more than happy to distribute carrots and sticks based upon movement in the data that is not statistically significant. Some organizations will create control charts for their PX data, but they don’t usually share them, since almost all variation falls within the margin of error, which at best makes them weak motivators and at worst undercuts any argument for improvement. When I worked in the vendor space, I hated building control charts for individual hospital units because once people saw the range of variance, they would get frustrated and say the data was useless. The best use of this data is, frankly, to track direction. While seeing the data improve every month for three or four months may not be statistically significant, it can motivate staff to continue doing the things that take us closer to the light at the end of the tunnel. By the end of the year, those incremental 0.3% to 0.7% improvements can add up to 5% to 6% improvements year over year, which are statistically significant (a rough sketch of that math follows below).
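As a rough sketch, and not the method any particular vendor uses, a two-proportion z-test shows why the monthly blips stay inside the noise while the year-over-year gain does not. The assumed sample size of 600 surveys a month is invented for illustration.

```python
# Two-proportion z-test: is the difference between two sampled scores
# bigger than what random chance would produce?
import math

def two_prop_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z statistic for the difference between two sample proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# One month to the next: 57.0% -> 57.5% on ~600 surveys each month.
print(f"monthly z = {two_prop_z(0.570, 600, 0.575, 600):.2f}")  # ~0.18

# Year over year: 57% -> 62% on a full year's pooled surveys (~7,200 each).
print(f"yearly  z = {two_prop_z(0.57, 7200, 0.62, 7200):.2f}")  # ~6.11

# |z| must exceed 1.96 to clear the usual 95% bar, so the monthly change is
# indistinguishable from noise even though the yearly trend is clearly real.
```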
The second issue with error I will address involves an important concept that drives any conversation about significance. With one datapoint, margin of error gives you a sense of its relationship to the real, but unknown, population number. When you start comparing datapoints and judging whether the numbers are statistically significantly different, we enter a different but related concept of error. When we decide whether two datapoints are statistically the same or different, we can make two kinds of error.
- Type 1 Error (aka α-error aka false positive): This error occurs when we say an observed difference is statistically significant when it is not.
- Type 2 Error (aka β-error aka false negative): This error occurs when we say an observed difference is NOT statistically significant, but it is.
So, for example, you might see survey results that identify a difference as statistically significant at a 95% confidence level. This means the author is 95% sure that the observed difference is real, and there is only a 5% likelihood that it is the product of random chance. Other differences in the data points may be present, but if they don’t rise to the level of 95% confidence, we cannot differentiate them from random chance. While you will most often see the confidence level set at 95%, as this is the social science standard, you could set your threshold at 85%, 75% or even 99%.1 Each of these will obviously change the certainty that a difference is or is not significant, just as it will change the size of difference needed to earn that classification (illustrated in the sketch below).
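To illustrate that last sentence, here is a small sketch of how the required difference grows with the confidence level, reusing the hypothetical 600-survey samples and 57% baseline from earlier. The z values are the standard two-tailed critical values.

```python
import math

# Standard error of the difference between two samples of 600 at a 57% baseline.
se = math.sqrt(0.57 * 0.43 * (1 / 600 + 1 / 600))

# Two-tailed critical z values for each confidence level.
for conf, z in [(75, 1.15), (85, 1.44), (95, 1.96), (99, 2.58)]:
    print(f"{conf}% confidence: difference must exceed {z * se * 100:.1f} points")

# 75% confidence: difference must exceed 3.3 points
# 85% confidence: difference must exceed 4.1 points
# 95% confidence: difference must exceed 5.6 points
# 99% confidence: difference must exceed 7.4 points
```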
What may be less obvious is that the decision you make regarding your confidence level will drive your ability to manage Type 1 and Type 2 error. Let us look at two extremes.
- If we want to make absolutely sure that everything we label as statistically significant really is, then we would set our threshold at 99.999% (called in some circles the “five nines”). Here we are almost never wrong when we tag a difference as statistically significant. We have virtually no Type 1 error. Yay! BUT it also means we are likely to miss things that are statistically significant. If something rises to the threshold of 96% or 99% or even 99.99% confidence, that looks good, but it fails our standard. So, while we have virtually eliminated Type 1 error, we have opened the door wide to Type 2 error, because we will call things not significant when they really are.
- If we don’t want to miss any significant differences, we would cast our confidence level much lower. Setting it at 50% would essentially be saying that significance is a coin flip. So, perhaps we set it at 75%. That way we feel good that we won’t miss any interesting or important differences. We have all but eliminated Type 2 error. But we are now accepting every difference that meets this very low threshold, meaning we are likely to call things significant even when the chances of them being random are still 1 in 4. So, in rigorously protecting against Type 2 error, we have allowed Type 1 error to run rampant. (The simulation after this list puts numbers on the trade-off.)
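Here is a quick simulation of those two extremes, under made-up assumptions: two surveys of 600 each, and a true 6-point gap whenever a real difference exists. Watch Type 1 error vanish and Type 2 error balloon as the threshold climbs.

```python
import math
import random

random.seed(42)

def z_stat(p1: float, p2: float, n: int = 600) -> float:
    """Simulate two surveys of size n and return the |z| of their difference."""
    x1 = sum(random.random() < p1 for _ in range(n)) / n
    x2 = sum(random.random() < p2 for _ in range(n)) / n
    pooled = (x1 + x2) / 2
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    return abs(x1 - x2) / se

TRIALS = 1000
for conf, z_crit in [(75, 1.15), (95, 1.96), (99.999, 4.42)]:
    # Type 1: flagging a difference when the true scores are identical (57% vs 57%).
    type1 = sum(z_stat(0.57, 0.57) > z_crit for _ in range(TRIALS)) / TRIALS
    # Type 2: missing a real 6-point difference (57% vs 63%).
    type2 = sum(z_stat(0.57, 0.63) <= z_crit for _ in range(TRIALS)) / TRIALS
    print(f"{conf}% threshold: Type 1 ~{type1:.0%}, Type 2 ~{type2:.0%}")

# Typical output: at 75%, Type 1 runs near 25% while Type 2 stays modest;
# at "five nines," Type 1 is essentially zero but Type 2 approaches certainty.
```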
I often describe this as squeezing a balloon. You can control one side of the balloon by squeezing it in, but that just means the other side will expand. It is difficult to squeeze all parts of the balloon at the same time. By managing one type of error, you run the risk of allowing the other type to expand. Statisticians look to create the best balance between the two, often called the crossover error rate, to manage both to the best of their ability.
From an audience’s perspective, though, this balloon-squeezing is not visible. They only see the consequences. Most of the time, they are happy just seeing the asterisk indicating significance; many people want the sausage but don’t need to see it being made. But how you set these standards can have ripple effects on how that data is used. It can leave you ignoring elements that are valuable, or it can have you forever chasing ghosts.
The best healthcare example of this is how this math affects Best Practice Advisories (BPAs) in the electronic health record. For those who are not familiar, BPAs are notifications that pop up in an electronic health record when a patient meets some set of criteria. So, if a patient has certain lab or test results, the BPA will fire to notify the care team that they should check for sepsis, a catheter-associated urinary tract infection (CAUTI), a central line-associated bloodstream infection (CLABSI), or several other potential problems. What some may not know is that these BPAs rest on the same statistical math any predictive model uses, meaning they are susceptible to the same Type 1 and Type 2 errors. The logic dictates that if you have an elevated level of X, a decreased presence of Y, and severe joint pain, it is probable that you have this health problem. But how “elevated,” “decreased,” and “severe” are defined will determine the likelihood of creating false positives or false negatives. The alert may flag a problem that doesn’t exist, or it may miss a problem that does. When you mix in the general disdain that care teams have for BPAs (ask a doctor how they feel when the machine is telling them how to practice medicine, or just imagine the response), you have a recipe for problems.
The doctors say, “Don’t bother me unless you are absolutely sure” (aka set that confidence threshold high to eliminate false positives) so they can trust the notice. So, the clinical informatics (CI) team ratchets the threshold up, and there are no false positives. But, in doing so, the system misses patients in the early stages of an issue and only notifies the care team when the situation gets worse. The hospital says, “Catch any hint of an issue so we can avoid litigation or an insurance company denying a claim,” so the CI team sets the threshold really low, nothing gets missed, and we avoid false negatives. We don’t miss any problems, but now the care team must sort through a hundred notifications to find the one real problem. So they ignore all of the BPAs, and the doctors say, “Don’t bother me unless you are absolutely sure.” The wheel turns, and because of a high threshold or BPA fatigue, early interventions are missed and people die. (The toy example below shows how the threshold choice cuts both ways.)
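A toy example, and emphatically not any vendor’s actual BPA logic: give every patient a hypothetical risk score and fire the alert above a threshold. The score distributions are invented, but the whipsaw between the two failure modes is the real point.

```python
import random

random.seed(7)

# Hypothetical risk scores: most patients are fine, a few are developing sepsis.
healthy = [random.gauss(0.30, 0.12) for _ in range(950)]
septic = [random.gauss(0.55, 0.12) for _ in range(50)]

for threshold in (0.40, 0.80):
    false_alarms = sum(s > threshold for s in healthy)   # Type 1: alert fatigue
    missed_cases = sum(s <= threshold for s in septic)   # Type 2: missed sepsis
    print(f"threshold {threshold}: {false_alarms} false alarms, "
          f"{missed_cases} missed cases (of 50)")

# Roughly: threshold 0.40 -> ~190 false alarms but only a handful of misses;
# threshold 0.80 -> almost no false alarms but nearly every case is missed.
```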
Part of the problem is that AI has the potential to make this worse because it works faster and in a black box.2 Plus, people don’t consider error rates when talking about AI. AI operates on the same principle of basing decisions on available data. It can learn through iteration, feeding its successes and failures into future decision-making so it can improve quickly. But at its heart there are still criteria, or boundaries, for how much Type 1 and Type 2 error to accept. It embraces the logic that more variables make it more accurate, without considering that adding more variables also means inserting more potential error into the system. [For those interested, Nate Silver describes this in The Signal and the Noise, where predicted outcomes can change based upon the number of decimal places one uses for the input variables.]
Given that all of this takes place in a black box, the end-user is not even aware of the boundaries established and therefore doesn’t know if the model favors false positives over false negatives. Indeed, I have not heard any conversation in mainstream circles about how AI manages its own false outcomes. This invisible error is the most troubling part of it, in my eyes. Going back to a point I made in the previous essay, this may work well with large populations. But the average family member is more interested in the care their mother is receiving and less interested in how mistakes in her care will bend the mortality curve moving forward.
For the second straight essay on statistics, it seems like I am denigrating the very thing that put food on my table. Again, my problem is not with the math or the process. It is with the fact that many people demand things from analysis without considering how those demands ripple outward. They expect the data to be highly sensitive to minor adjustments in performance, but when the data doesn’t behave that way, people either blame the messenger, blame the staff, or claim that the whole process is garbage. Someday I will write about my vision for properly used data, but for the moment, I will simply say that by not understanding the process and its limitations, these people get out over their skis and fall on their faces. In those situations, they rarely take responsibility for themselves but instead rage against a process they don’t really understand.
1Then there are the poor unfortunate souls who will argue for 95% confidence on a one-tailed test. While I understand the underlying logic, they would be better served to just report their results at 90% confidence, but then again, maybe I am just a crabby old man asking them to acknowledge the existence of a full bell curve. IYKYK.
2By “black box” I mean a situation where the mechanics of the work are not visible. Data gets fed into one side and the other side spits out an answer, but without knowing the process for turning input into output, we have no way of evaluating how it works, or how to problem-solve if the output is wrong. This is probably most pithily demonstrated in Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, where the supercomputer Deep Thought was asked for the answer to the ultimate question of life, the universe, and everything, and it replied, “42.”