Key Concepts: Data Quality and Data Profiles
Thursday, October 23, 2014
Disease Surveillance With An Enterprise Data Warehouse
We could easily justify a 30-page paper on this topic, but I don’t have time for that, so I’m going to offer a few thoughts and observations while it’s hot. I hope that those of you with similar experiences will share them by contacting me or submitting a comment.
Here’s a summary of the current options for monitoring data that could help identify disease outbreaks. Additional details about these options appear later in the blog and, as you can see in those details, the options are not great.
1. Monitoring chief complaint/reason for admission data in Admit, Discharge, and Transfer (ADT) data streams.
2. Monitoring coded data that is collected in Electronic Health Records (EHRs).
3. Monitoring billing data.
Federal Meaningful Use regulations require that EHRs be able to submit syndromic data to a surveillance system, but I have a feeling that requirement is going to run into data quality problems, for the reasons described below.
One of the key concepts that underlies what I’m about to discuss in this blog is data quality. Poor data quality translates into imprecise decision-making and imprecise responses to a situation. The equation for data quality is:
Data Quality = Completeness x Validity
The higher your data quality, the more precise your understanding of the situation at hand and the more precise your decisions and reaction can be to a situation.
“Completeness” is exactly as the word implies—how complete and granular is the data you have access to, about a patient’s situation? The metaphor that I like to use is that of a low vs. high-resolution picture. The higher the resolution of the picture, the more detail you can see and understand about the subject in the picture. A low-resolution picture leaves you guessing about the finer details. Highly “complete” data is equivalent to higher resolution.
“Validity” is a little more difficult to describe, but in short, it relates to the context of the situation in which the data is collected and the accuracy of the data. If a nurse measures a patient’s temperature and enters it into an EMR, that satisfies the concept of Completeness—the data has been captured. If that nurse enters the wrong temperature, we have a violation of the concept of data Validity. Timeliness of data is also a dimension of Validity. In the case of charting a patient’s temperature, if a nurse enters the correct temperature in the patient’s chart, but enters it four hours after actually taking the patient’s temperature, that data lacks temporal Validity. To be valid, the data must be timely, relative to the decision-making or action associated with the data.
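As a rough illustration of the equation above, here is a minimal Python sketch that scores a single patient record. Everything in it is an assumption made for illustration: the list of expected fields, the plausibility ranges (used as a crude stand-in for accuracy, which cannot really be checked from the data alone), and the one-hour charting delay used for timeliness. It is not a validated metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Tuple

# Hypothetical set of fields we expect in a basic vitals profile.
EXPECTED_FIELDS = ("temperature_c", "heart_rate", "respiratory_rate", "systolic_bp")

# Assumed plausibility ranges, a rough proxy for accuracy (illustrative only).
PLAUSIBLE_RANGES: Dict[str, Tuple[float, float]] = {
    "temperature_c": (30.0, 45.0),
    "heart_rate": (20.0, 250.0),
    "respiratory_rate": (4.0, 80.0),
    "systolic_bp": (40.0, 300.0),
}


@dataclass
class Observation:
    value: float
    measured_at: datetime  # when the measurement was taken
    recorded_at: datetime  # when it was entered in the chart


def completeness(record: Dict[str, Observation]) -> float:
    """Fraction of the expected fields that were captured at all."""
    captured = sum(1 for name in EXPECTED_FIELDS if name in record)
    return captured / len(EXPECTED_FIELDS)


def validity(record: Dict[str, Observation],
             max_delay: timedelta = timedelta(hours=1)) -> float:
    """Fraction of captured fields that are both plausible and timely."""
    captured = [(name, record[name]) for name in EXPECTED_FIELDS if name in record]
    if not captured:
        return 0.0
    valid = 0
    for name, obs in captured:
        low, high = PLAUSIBLE_RANGES[name]
        plausible = low <= obs.value <= high
        timely = (obs.recorded_at - obs.measured_at) <= max_delay
        if plausible and timely:
            valid += 1
    return valid / len(captured)


def data_quality(record: Dict[str, Observation],
                 max_delay: timedelta = timedelta(hours=1)) -> float:
    """Data Quality = Completeness x Validity."""
    return completeness(record) * validity(record, max_delay)
```

For example, a record with only two of the four expected vitals, one of which was charted four hours after it was measured, would score 0.5 for completeness and 0.5 for validity under these assumptions, for a data quality of 0.25.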
In addition to Data Quality, the other key concept is the notion of a “Data Profile” for a patient and disease type. A simple data profile for a patient is pretty straightforward: name, gender, age, height, weight, and address are the basics of a first pass at a data profile. To round out the data quality of that data profile, we would also like to collect current medications, known allergies, past surgeries, family history of disease, and chronic conditions. The next pass through the data profile is the collection of data associated with vitals and labs: temperature, heart rate, respiratory rate, blood pressure, and basic blood and urine labs. Each additional pass adds new data points about a patient (i.e., a higher resolution picture) and contributes to that patient’s data profile and, hopefully, the data quality about that patient, as well.
Diseases also have a data profile, based upon commonly acknowledged symptoms and, hopefully, very discrete lab results or other diagnostics, such as those from imaging. The first symptom of Ebola is a fever, followed days later by increasingly worsening fever, bleeding, and vomiting. Unfortunately, the initial data profile of a patient with Ebola (data profile = fever) overlaps with hundreds of other disease states. Without more complete and valid data, that is, a higher resolution picture, the initial clinical data picture of Ebola looks like a multitude of diseases. To raise the threshold of concern for declaring a patient as “Ebola possible”, the next data point beyond fever, absent a pathology assessment for the virus, is a data point about the patient’s socio-physical environment prior to the onset of fever, i.e., does this patient have a fever AND has this patient been exposed to another patient or group of patients who were infectious with Ebola? I’m oversimplifying a bit, but the precision and accuracy of the data profile come from a series of these data profile AND statements. Any time we see a pattern of Boolean statements as a tool for describing a situation, we should immediately see an opportunity for computer-assisted decision-making.
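The chain of AND statements described above can be sketched in a few lines of Python. This is only a sketch: the field names (`temperature_c`, `known_exposure`), the 38.0 C fever cutoff, and the profile name are assumptions for illustration, not clinical criteria.

```python
from typing import Any, Callable, Dict, List

Patient = Dict[str, Any]
Criterion = Callable[[Patient], bool]


def has_fever(patient: Patient) -> bool:
    # Assumed cutoff of 38.0 C, for illustration only.
    return float(patient.get("temperature_c", 0.0)) >= 38.0


def has_known_exposure(patient: Patient) -> bool:
    # Hypothetical flag: contact with a patient known to be infectious.
    return bool(patient.get("known_exposure", False))


# A disease data profile expressed as a list of criteria that must ALL hold.
EBOLA_POSSIBLE: List[Criterion] = [has_fever, has_known_exposure]


def matches_profile(patient: Patient, profile: List[Criterion]) -> bool:
    """True only when every criterion in the profile is satisfied (logical AND)."""
    return all(criterion(patient) for criterion in profile)


def criteria_met(patient: Patient, profile: List[Criterion]) -> int:
    """How many criteria are satisfied so far, i.e. the 'resolution' of the match."""
    return sum(1 for criterion in profile if criterion(patient))
```

A patient with only a fever satisfies one of the two criteria and does not match; adding the exposure data point completes the profile. Each additional criterion narrows the set of diseases the picture could represent.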
Every healthcare system in the US should possess a generalized data profile alerting engine that is fed by an Enterprise Data Warehouse (EDW) that could, in turn, feed analytic output to the EHR at the point of care, as well as any number of other downstream data consumers, such as state and federal government. Within that alerting engine, a healthcare system would be capable of creating any number of profiles for “Patients Like This” (I notice that Epic has trademarked that phrase, but I’ve been using it, too, since about 2001). Those of you who are familiar with Theradoc can see the conceptual overlap with what I’m calling a patient profile alerting engine. But there are also significant differences in the concept and, especially, the implementation models.
The profiler would sit in the background, passively watching the stream of data into the EDW until the profiler reached a tipping point in its predictive algorithm, at which time it would declare, “This patient has a [%] probability of [this disease or condition].” Declaring the likelihood that a patient has an infectious disease is not good enough, however. We must also use the alerting engine and our data to recommend the action and intervention to take following the prediction. In the case of Ebola, the alerting engine would notify the immediate clinical team and the Infectious Disease SWAT team to isolate and sterilize everyone within contact. At Intermountain, we called these alerting engines that were embedded in the EHR and EDW, “Medical Logic Modules.” Trained teams of informaticists, clinicians, data engineers, and pharmacists would define the logic inside these modules, monitor them over time, and adjust as necessary depending upon the evolution of the data profile for the disease or condition.
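To make the idea concrete, here is a minimal Python sketch of a profiler that watches a stream of events, accumulates each patient's profile, and emits an alert with recommended actions once a module's score crosses its threshold. The class names, event shape, scoring weights, and the 0.7 threshold are all hypothetical; the score function is a toy placeholder, not a clinical model and not the actual logic used at Intermountain.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, List, Set, Tuple

Profile = Dict[str, Any]


@dataclass
class LogicModule:
    condition: str
    score: Callable[[Profile], float]  # returns a probability in [0, 1]
    threshold: float                   # the "tipping point"
    actions: List[str]                 # what to recommend when it fires


@dataclass
class Alert:
    patient_id: str
    condition: str
    probability: float
    actions: List[str]


def toy_ebola_score(profile: Profile) -> float:
    # Placeholder weights chosen for illustration; NOT a validated model.
    score = 0.0
    if profile.get("temperature_c", 0.0) >= 38.0:
        score += 0.2
    if profile.get("known_exposure"):
        score += 0.6
    return score


def watch(events: Iterable[Dict[str, Any]],
          modules: List[LogicModule]) -> Iterator[Alert]:
    """Passively consume events and yield an alert the first time a
    patient's accumulated profile crosses a module's threshold."""
    profiles: Dict[str, Profile] = {}
    already_alerted: Set[Tuple[str, str]] = set()
    for event in events:
        patient_id = event["patient_id"]
        profile = profiles.setdefault(patient_id, {})
        profile.update(event["data"])  # partial data is still useful
        for module in modules:
            key = (patient_id, module.condition)
            if key in already_alerted:
                continue
            probability = module.score(profile)
            if probability >= module.threshold:
                already_alerted.add(key)
                yield Alert(patient_id, module.condition, probability, module.actions)
```

Note that the profiler scores the profile after every event, so it does not need to wait for a complete picture. The threshold and actions live in the module definition, which is where a team of informaticists and clinicians would tune them over time.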
Here’s a diagram of the conceptual architecture:
So, that’s the long-term goal—a patient profile alerting system attached to an EDW, configured with Medical Logic Modules, and passing that data to an EHR so that the analytics of the EDW is facilitating better decisions at the point of care. For the moment, I will avoid mentioning the contractual restrictions and technical barriers that some of the major EHR vendors place on their APIs that make this sort of architecture very challenging to implement. If these vendors understood the commercial value of open APIs (technically and contractually), they would understand that it would increase their market share and IP, not detract from it.
So… back to the topic at hand… What do we do in the near term? Is there anything we could do, realistically, to better track the progress of diseases like Ebola, given our existing ecosystem of information systems in healthcare? As you can see from the options that I list below, we could do better than nothing and the existing state of affairs, but any current option has significant shortcomings.
Here are the options that are available right now, listed with their pros and cons:
1. Monitoring ADT messages: This option would use the chief complaint/reason for admission data in an ADT message. The advantage of this option is that it is real-time, upon presentation of the patient at a healthcare facility. The disadvantage of this option is the lack of codified, computable data in the data stream, thus requiring some form of natural language processing. Better than nothing, but far from precise. When a patient presents, the chief complaint is usually captured in a free-text field in the registration system. A handful of forward-thinking healthcare organizations are starting to codify these chief complaints with SNOMED, using an interface terminology on the front end of that coded data so that registrars can still enter a lay term for the chief complaint that is mapped to a medically meaningful and computable code in the background.
The other disadvantage of this approach is simple: a patient’s complaint does not equal a clinical diagnosis. A complaint of fever and vomiting from a patient might suggest Ebola, but only a clinician and a lab test are capable of declaring with certainty that the diagnosis is Ebola. Until that clinically valid data is available, the best you can do is be concerned, but not definite. On a technology level, the other disadvantage of this option is that the chief complaint is frequently a single field in a database, so registrars will usually capture only one complaint, such as “headache” or “vomiting”, when there might be several symptoms. Sometimes, the registrar will overload the data capture field with multiple complaints, separated by commas.
If this were a good option, then HIEs or common interface engines would be capable of analyzing these ADT messages for this type of content, but that is not widespread in the industry because of the shortcomings of the approach.
2. Analyzing Coded EHR and Other Clinical Data: In this option, we could monitor coded data (SNOMED or ICD) for diagnoses, lab tests and results, and diagnostic imaging. This is the most precise option available, but it is unlikely to ever be a real-time option in healthcare. The way current EHRs are designed, the data that is entered into an EMR for a patient’s encounter is an all-or-nothing commitment; that is, until the clinician “closes” the encounter with all associated data and notes, that data is not made available for analytics and decision support. This is an unfortunate design and behavior because, as I mentioned earlier, there is at least some decision-making value in an incomplete, low-resolution data profile. The alerting engine could be running in the background, looking at these clinical encounters that are only partially complete from a data perspective, whether the clinician is finished with their documentation or not. In any case, as currently designed and practiced, it can be many hours, several days, or even a few weeks before a clinician closes a clinical encounter in an EHR. That’s not exactly opportune for managing a rapidly expanding disease outbreak.
3. Analyzing Coded Data From Billing Systems: This has all the problems of option 2, and more. It’s not unusual for revenue cycle processes and systems to take over 30 days to drop a bill. But in the absence of an EMR, this data is certainly better than nothing for profiling.
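For option 1, the overloaded free-text chief complaint field can be split and mapped to computable concepts along the following lines. This is a minimal Python sketch under stated assumptions: the lay-term dictionary is a tiny hypothetical stand-in for an interface terminology, and the concept labels (`FEVER`, `VOMITING`, `HEADACHE`) are placeholders rather than real SNOMED codes. Real free text would need far more robust natural language processing.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lay-term lookup standing in for an interface terminology.
# The values are placeholder concept labels, NOT real SNOMED codes.
LAY_TERM_TO_CONCEPT: Dict[str, str] = {
    "fever": "FEVER",
    "high temperature": "FEVER",
    "vomiting": "VOMITING",
    "throwing up": "VOMITING",
    "headache": "HEADACHE",
}


def parse_chief_complaint(text: str) -> Tuple[List[str], List[str]]:
    """Split an overloaded chief complaint field on common separators and
    return (coded concepts, fragments that could not be mapped)."""
    coded: List[str] = []
    uncoded: List[str] = []
    for part in re.split(r"[,;/]", text):
        term = part.strip().lower()
        if not term:
            continue
        concept = LAY_TERM_TO_CONCEPT.get(term)
        if concept is None:
            uncoded.append(term)       # left for a human or NLP step
        elif concept not in coded:
            coded.append(concept)      # avoid duplicate concepts
    return coded, uncoded
```

The unmapped fragments are returned rather than discarded, since they are exactly the part of the data stream that still needs natural language processing or a registrar's attention.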
There are other options emerging, but they are years away from being integrated into the workflow of current EHRs and clinical processes. There have been several recent studies showing the potential for analyzing social data, such as Twitter and Facebook. Analyzing consumer purchasing data is also a possibility. Several years ago, before Wal-Mart became more secretive and declared their analytics systems as their most valuable corporate asset except people, I saw a demonstration of their application that could track the sale of products across all of their stores in near real time; as I recall, sales data was available for analysis in their EDW within 9 minutes of a transaction at any of their stores. One of the more interesting parts of the demonstration was their ability to see the progression of influenza across the US, as inferred by the sale of over-the-counter medications.
Our options are not great right now, but with a well-designed and flexible data warehouse, at least healthcare delivery organizations have the beginnings of an option that can improve in precision as we integrate more and more data, increasing the completeness and resolution of the picture for syndromic surveillance.
If anyone has any other ideas, options, or thoughts on this topic, I would love to hear from you.