Data modeling is the single greatest cause of long term
failure and under-performance in data warehousing and business intelligence
systems, across all industries. It is
compounded as a problem in healthcare due to the complexity of the data and the
analytic environment.
Errors in data modeling strategy gestate slowly, usually
emerging in 2-3 years after go-live when the demands for more flexible, complex
and real time analytics increase in the maturation of a data driven
culture. The most debilitating and
common mistake in data modeling is the assumption that a star schema (sometimes
referred to as a dimensional model) can meet all of the analytic needs of the
healthcare enterprise. Consider a
traditional library of books as a metaphor.
An encyclopedia can serve as a general reference for simple research in
elementary and middle school, but its content is soon not sufficient for
satisfying the more in-depth and complex research of a college or graduate
student. Initially, to organizations who
are early on the path of data driven decisions and research, star schemas often
appear useful and valuable, but over longer periods of time, typically 2-3
years, as the data driven culture matures and the analytic needs become more
complex, star schemas suffer from a number shortcomings including the loss of
source system data fidelity and query performance problems.
Star schemas have the unusual effect of
requiring increasingly complex SQL statements that would be less complex under
more appropriate data modeling environments.
These complex SQL statements take longer to develop, require higher
skills on the part of the data analysts, and lead to programming errors. Often times, these programming errors go
undetected, resulting in the most risky analytic errors—false positives and
false negatives-- in reports that are intended to inform significant business
and clinical quality decisions. In summary, star schemas are useful in
supporting relatively simple analytic needs such as counts and summations of
events and financial transactions, but they should not be used as the single--
or even the dominant-- strategy for data modeling in healthcare BI systems.
Retaining the fidelity and detail of the source system data
as you pull that data into the data warehouse is critically important. Often times, inexperienced data architects
will overemphasize the “Transform” in the ETL process and in so doing, will
lose the ability to transform and evolve data structures within the data
warehouse to meet new analytic needs that they didn’t anticipate in the
original design. If uncertain about the
organization’s analytic requirements, don’t over analyze and become lost in
requirements gathering. Start simple,
with the known analytic needs in today’s US-based healthcare enterprise, for
example:
• JCAHO
• HEDIS
• NCQA
• PQRS
• MU
• Professional
accreditation, such as STS, ACS and Trauma
• State-specific
reporting such as Cancer, HIV and Behavioral Health
• Organizational
goals regarding Clinical Service Lines or Quality Improvement Programs
• Disease
registries for both large and small n diseases and conditions
It is critically important to configure transaction systems
in a manner that will support the analytic needs of the organization;
otherwise, the best data modeling strategy is of little or no value in the BI
system. Many times, electronic health
records are rushed into deployment with physician adoption being the most
important criteria. While this is an
important goal, pausing in the planning phase to configure the EHR transaction
system so that it supports the analytic needs of the physicians and
organization will actually improve physician adoption, not detract from
it. The same analytic needs described
above should be addressed in the configuration of the transaction system EHR--
work backwards from these known requirements into the data collection strategy
when configuring the EHR and your analytic capabilities in the data warehouse
will benefit enormously.
The best data modeling strategy for a BI system is one which
is tailored to meet the specific analytic environment or need, and thus adapts
to the data analyst rather forcing the data analyst to adapt to the data
model. In the most successful data
warehouses and BI systems, a review of the data modeling strategy would reveal
a combination of 1st, 2nd, and 3rd normal form, star, and snowflake models,
arranged in database schemas according to specific analytic subject areas,
analogous to the books that are organized by sections and subject areas in a
library. For example, the analytic needs associated with hospital case mix data
can usually be satisfied with a 1st normal form data model augmented by a star
schema. The analytic needs for a chronic
disease management environment can generally be satisfied with the combination
of a 1st and 2nd normal form data model.
The same data modeling strategy can be used to support HEDIS, NCQA, and
JCAHO reporting needs. The analytic
needs associated with managing a physicians’ group, where productivity, claims
processing, and accounts receivable reports are important, can usually be
satisfied with a 2nd normal form data model augmented by a star schema.
When in doubt about the best data modeling strategy to use
for a particular analytic need, the best option is usually a 1st normal form. These “flat” data models can often times be
millions of records deep and over 100 columns wide. Quite often, this simple approach is
sometimes counterintuitive to many data architects who lack analytic
experience, but these unassuming data models meet a very broad number of
analytic scenarios. It is also worth
noting that the growing use of and familiarity with spreadsheet-based analysis
(e.g., Excel) among data driven cultures is particularly well-suited to simple,
flat tables of rows and columns in the data warehouse where relational joins to
other tables are less important than filters, sorts, and pivot tables to the
data analysts’ needs.
In conclusion, the
most valuable and extensible data warehouse and BI systems in all industries,
especially healthcare, use a data modeling strategy which leverages numerous
data modeling techniques applied and arranged in adaptive fashion to meet the
needs of data analysts in specific subject areas of reporting and analytics.
2 comments:
I could have really used this a year ago! : )
What are you thoughts on the architecture from the book "The Corporate Information Factory"?
1) Central Data Warehouse aggregates data from source applications in third normal form.
2) Information then flows into subject area specific data marts in star schema form to support easy drilling and reports.
Good article. We use a flat data model for most data, and don't use cubes. The flat data model retains all the original data detail, so more flexibility is allowed and storage requirements are small.
We go a step further and don't use SQL at all. In our experience, SQL is well suited for updating data, but poorly suited for analysis. We use a separate data base engine that we wrote and optimized for use with flat data. We figured out how to make it fast. Not as fast as cubes, but fast enough, a second or two for most analyses. Ask yourself, does google use SQL for doing their search? No. They wrote something special. The same logic applies to data warehouse for better performance.
http://www.ehdp.com/vitalnet/ has more information.
Daniel Goldman
Post a Comment