There is probably nothing I am of two minds on more than setting benchmarks on service-line norms.1 I am old enough to remember the ‘before time,’ when a score was a score was a score. These were simpler times, when hospitals would set goals and any difference in the patients and staff of an individual unit got washed out. Soon, though, individual units or unit managers would complain that some units had an easier path to success than others. The stereotype was that OB units, full of happy moms, would give glowing reviews; oncology units, full of older patients with dim prognoses, would give sour reviews. By setting goals based upon similar patient types across the country, rather than one uniform target, a hospital would go from treating all units uniformly to treating all units fairly.
So clients demanded, and vendors provided, service-line norms: an OB unit could compare itself against only other OB units, and an ortho clinic against only other ortho clinics. While three things challenged the accuracy of these norms, on the surface they showed a difference in different patients’ satisfaction and therefore were very popular. Like so many things, their value, or overvaluation, is driven by how they are used. So, overall, the advice is to approach them thoughtfully and not get overextended. This essay will explore this balancing act in the context of four debates raging in my head.
Before I do, let me quickly answer the question my previous paragraph raised: what are the three things that challenge the accuracy of service-line benchmarks?
- The first problem is that they are based upon unit designations provided by the client. No vendor has the time or manpower to independently assign service lines to hospital units or clinics. So, when a client sets up with a vendor, it will self-identify the service line of each unit and clinic. While the vendor may provide a list of options, not all hospitals will take this task with the same amount of seriousness or precision. Plus, the primary contact for the vendor may not even know how every clinic or unit is defined. This is not to say that the designations are garbage, but when unit classification is in the eye of the beholder, this data may not be as clearly defined as one might imagine.
- The second issue is that even within each designation there can be a lot of wiggle room. Some units may also take overflow from other units, so their definition gets watered down. Some units may change completely, but no one realized that changing a unit from med/surg to cardiac might also require notifying the vendor downstream to redirect that data to the appropriate norms. Further, just because a unit is called an inpatient behavioral health unit does not necessarily say much about the kinds of patients they treat.
- Finally, each hospital’s workflow may mean patients getting turfed in different ways. In all hospitals a patient may be moved from unit to unit during their hospital stay, but the data will reflect only the unit the patient was discharged from. Moreover, the process for these transfers will be different. Some hospitals will discharge out of their ICU units. Some hospitals will move a patient from ICU to a step-down unit before they get discharged. Some hospitals do both. Some hospitals make more aggressive use of swing beds than others. So, it is not only difficult to create standard definitions of units, but also to create standard definitions of patients.
None of this is to say that these comparisons are not accurate, but only that, like other things discussed here, one should not assume a level of clarity that is not warranted.
Similarities vs Differences
As referenced above, the trend towards service line norms was driven by the sense that some patient populations are just different from others. This makes intuitive sense. The type of patient and even the type of clinician can lead to a different vibe on some units versus others. Like all feelings, this can be overdramatized by stereotypes, but like many stereotypes, there is at least a grain of truth in them.
By parsing out these differences, one can clean out some of the noise in these comparisons. When mediocre OB units still outpace the average hospital score, and a high-performing oncology unit is still only treading water comparatively, it is obvious that a more accurate comparison would allow managers and senior leaders to better evaluate their units’ performance.
The problem, though, is that by highlighting these differences, one fails to identify any useful similarities. While there are differences between an ICU and a step-down unit, the differences are primarily ones of degree, rather than process. In fact, by embracing the similarities, these units can work to create a seamless transition for their patients. Focusing on the differences implies that we cannot learn from or duplicate the work done by others. I remember sitting down with two groups in a hospital system in Indiana.
- One group was the physicians, and in talking about what common behaviors they could adopt and coach each other on, they could not get past the fact that they were all different. Some were primary care, some were specialists, some worked in big clinics, some worked in small clinics. They struggled to discuss what must-haves they could all sign on to, because they were just too different. As a result, they could not create any system standard that they could all abide by.
- The other group was the clinic managers from the same clinics. They had no problem discussing the things that were common between all practices. They all said that they had common workflows for scheduling, referrals, and prescriptions and likewise common problems with each. It was easy for them to form a common cause on some must-haves and hold each other accountable.
It was clearly the debate over the half-full/half-empty glass. While having specialty norms can be helpful to highlight the differences, having these norms also sends the implicit message that it is difficult or impossible to work together or find common themes.
This is especially challenging when you look at the differences articulated between service lines. When the difference between various clinic service lines is only 0.2% to maybe 1.1%, one must wonder if that precision is worth building silos around the pediatricians and the podiatrists.
Motivation vs Despair
By calling out the differences in patients by service line, one can alleviate stress put on folks by a one-size-fits-all model. Even the language of educating frontline staff on what the goal is (“I am not asking you to be better than 75% of all units, only better than 75% of oncology units”) can help people focus on the donut and not the hole. After all, the data itself says that 25% of oncology units meet or exceed this goal. If they can do it, we can do it.
On the other side, though, are those who have lofty goals, relative to everyone else. For every oncology unit, there is an OB unit whose goal may be higher than everyone else’s and even higher than the hospital goal. For these folks, that tailored goal may feel more like a punishment. “We are already outperforming the hospital, and the reward is to climb even higher. We must be awesome because those other units stink.”
In this situation, like so many others, the goal may feel reasonable in its basis, but unreasonable in its expectation. Even if the goal is grounded in logic, it may feel much more like a stick than a carrot as it is implemented.
Hospital vs Unit
While the service-line benchmark may have evolved out of a desire for more granularity in how individual units fare, it has made it harder to talk about how a hospital performs. Even if every unit meets its benchmark goal, it does not mean that the hospital will meet its benchmark goal. Conversely, even if most units do not meet their benchmark goal, the hospital can still meet its goal. This may seem counterintuitive, but it happens more frequently than one might think. There are a couple of reasons why this can happen.
- Different benchmarks. A hospital benchmark will be based upon ALL the hospitals in the database: big hospitals with multiple specialty units, medical and medical/surgical units, and perhaps multiple ICUs, alongside small hospitals with only one broad med/surg unit. I know of a critical access hospital that has three discharge units: ICU, OB and Med/Surg. So, while each individual unit is only compared to like units, the overall hospital is compared against all hospitals, whether or not they serve the same population or have the same specialties.
- Different reality or participation. I promised an essay on response rates, which is where this issue will be explored more fully. The shorter version for this essay is that even if that critical access hospital only compared itself to other critical access hospitals with only OB, ICU and med/surg units, their scores may not be comparable because of the contribution each of those units makes to the total discharge volume. Consider the stereotype that OB patients are happy, med/surg patients are crabby and ICU is a mixed bag. If the mix of the three units is 33%/33%/33% at one hospital and the mix is 20%/5%/75% at another, the overall performance will look different even if the performance at each unit is exactly the same.
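The case-mix arithmetic in that second bullet can be sketched in a few lines. The unit-level scores below are hypothetical, chosen only to match the stereotype in the text (happy OB, crabby med/surg, ICU in between); the two discharge mixes are the ones from the example.

```python
# Hypothetical unit-level scores matching the stereotype in the text.
unit_scores = {"OB": 90.0, "Med/Surg": 70.0, "ICU": 80.0}

def overall_score(mix):
    """Discharge-volume-weighted hospital score; mix maps unit -> share of discharges."""
    return sum(unit_scores[unit] * share for unit, share in mix.items())

# Same unit performance at both hospitals, different discharge mixes.
even_mix   = {"OB": 1/3, "Med/Surg": 1/3, "ICU": 1/3}
skewed_mix = {"OB": 0.20, "Med/Surg": 0.05, "ICU": 0.75}

print(f"even mix:   {overall_score(even_mix):.1f}")    # 80.0
print(f"skewed mix: {overall_score(skewed_mix):.1f}")  # 81.5
```

Every unit performs identically at both hospitals, yet the overall numbers differ purely because of who gets discharged from where.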
Mixed into this is the fact that each reported number is only understandable relative to its goal. So an OB unit performing over the hospital target, but below its own target, may be the reason why a hospital is not meeting its goal. Meanwhile, the large medical unit may be meeting its own goal while sitting below the hospital goal; it may be the reason the hospital is doing as well as it is, yet it will appear to be failing relative to OB and get blamed for the hospital’s failure.
Standardized vs Real-World
One main complaint about how HCAHPS scores are reported is that CMS applies weights to ‘standardize’ the data so all hospitals are comparable. So a hospital’s reported scores may be much different from its raw data because of the methodology used, the mix of OB/surg/med patients, the patients’ self-described health status, etc. The underlying reason was to make the best apples-to-apples comparison of hospitals in a market and not disadvantage a safety-net hospital or advantage a more boutique one.
This logic is reasonable, since you want hospitals in a similar market to be pushed through the same sausage press as everyone else. Creating a standardized measure makes for an easier evaluation by a layperson, especially if you assume that all hospitals essentially perform the same services. But increasingly, not all hospitals offer essentially the same services. As hospitals become parts of systems and those systems move services around, a hospital may look different than it did when it was an independent hospital. For example, OB services may not be available at every hospital in a system. Given that the percentage of OB patients is part of the weighting criteria, having NO OB service, or being 100% OB services, can create a skewed weighting. At some point, creating a standardized model runs afoul of the real world. A birthing hospital should not have its scores depressed to make them comparable to a full-service hospital, since it is not trying to compete with a full-service hospital.
On a more micro level, all these same system decisions can affect a hospital’s ability to compare an OB unit that has a NICU with an OB unit that transfers all of its high-risk patients to another hospital within its system.
In the end, there is certainly energy in a hospital or system craving a more nuanced and precise comparison measure, but it is unclear if the juice is worth the squeeze. Spending the time to:
- retrain staff and leaders on the differences in targets based upon differences in service lines,
- restructure or redesign monthly reports to illustrate success and failure,
- make sure senior leaders understand these relative numbers to deliver the correct carrot or stick to a unit manager,
all in the service of changing how the entire hospital/system speaks about success and failure is not insignificant. If an organization is doing this to make it easier to motivate staff to meaningful targets, it may be effort worth expending. But, if ultimately this makes it more difficult to count wins and losses and just becomes a way for leaders to explain away their performance and hide from the hard work of improvement, it may simply be adding a new coat of paint to a house on fire.
1Service-line norms are benchmarks based upon patient experience data collected and bucketed by the type of experience a patient had. On the inpatient side, it could be med/surg, ICU or orthopedics, etc. On the clinic side, it can be broad, like specialty and primary care, or more specific, like orthopedic, pediatrics, internal medicine, etc. This allows setting goals based upon similar units, rather than a larger dataset of patients that cross the care spectrum.