Wednesday, September 7, 2011

Data Warehouse Data Modeling For CIOs

Data modeling is the single greatest cause of long term failure and under-performance in data warehousing and business intelligence systems, across all industries.  It is compounded as a problem in healthcare due to the complexity of the data and the analytic environment.

Errors in data modeling strategy gestate slowly, usually emerging in 2-3 years after go-live when the demands for more flexible, complex and real time analytics increase in the maturation of a data driven culture.  The most debilitating and common mistake in data modeling is the assumption that a star schema (sometimes referred to as a dimensional model) can meet all of the analytic needs of the healthcare enterprise.   Consider a traditional library of books as a metaphor.  An encyclopedia can serve as a general reference for simple research in elementary and middle school, but its content is soon not sufficient for satisfying the more in-depth and complex research of a college or graduate student.  Initially, to organizations who are early on the path of data driven decisions and research, star schemas often appear useful and valuable, but over longer periods of time, typically 2-3 years, as the data driven culture matures and the analytic needs become more complex, star schemas suffer from a number shortcomings including the loss of source system data fidelity and query performance problems.  

Star schemas have the unusual effect of requiring increasingly complex SQL statements that would be less complex under more appropriate data modeling environments.  These complex SQL statements take longer to develop, require higher skills on the part of the data analysts, and lead to programming errors.  Often times, these programming errors go undetected, resulting in the most risky analytic errors—false positives and false negatives-- in reports that are intended to inform significant business and clinical quality decisions. In summary, star schemas are useful in supporting relatively simple analytic needs such as counts and summations of events and financial transactions, but they should not be used as the single-- or even the dominant-- strategy for data modeling in healthcare BI systems.

Retaining the fidelity and detail of the source system data as you pull that data into the data warehouse is critically important.   Often times, inexperienced data architects will overemphasize the “Transform” in the ETL process and in so doing, will lose the ability to transform and evolve data structures within the data warehouse to meet new analytic needs that they didn’t anticipate in the original design.   If uncertain about the organization’s analytic requirements, don’t over analyze and become lost in requirements gathering.  Start simple, with the known analytic needs in today’s US-based healthcare enterprise, for example:

•             JCAHO
•             HEDIS
•             NCQA
•             PQRS
•             MU
•             Professional accreditation, such as STS, ACS and Trauma
•             State-specific reporting such as Cancer, HIV and Behavioral Health
•             Organizational goals regarding Clinical Service Lines or Quality Improvement  Programs
•             Disease registries for both large and small n diseases and conditions

It is critically important to configure transaction systems in a manner that will support the analytic needs of the organization; otherwise, the best data modeling strategy is of little or no value in the BI system.  Many times, electronic health records are rushed into deployment with physician adoption being the most important criteria.  While this is an important goal, pausing in the planning phase to configure the EHR transaction system so that it supports the analytic needs of the physicians and organization will actually improve physician adoption, not detract from it.  The same analytic needs described above should be addressed in the configuration of the transaction system EHR-- work backwards from these known requirements into the data collection strategy when configuring the EHR and your analytic capabilities in the data warehouse will benefit enormously.

The best data modeling strategy for a BI system is one which is tailored to meet the specific analytic environment or need, and thus adapts to the data analyst rather forcing the data analyst to adapt to the data model.  In the most successful data warehouses and BI systems, a review of the data modeling strategy would reveal a combination of 1st, 2nd, and 3rd normal form, star, and snowflake models, arranged in database schemas according to specific analytic subject areas, analogous to the books that are organized by sections and subject areas in a library. For example, the analytic needs associated with hospital case mix data can usually be satisfied with a 1st normal form data model augmented by a star schema.  The analytic needs for a chronic disease management environment can generally be satisfied with the combination of a 1st and 2nd normal form data model.  The same data modeling strategy can be used to support HEDIS, NCQA, and JCAHO reporting needs.  The analytic needs associated with managing a physicians’ group, where productivity, claims processing, and accounts receivable reports are important, can usually be satisfied with a 2nd normal form data model augmented by a star schema.

When in doubt about the best data modeling strategy to use for a particular analytic need, the best option is usually a 1st normal form.  These “flat” data models can often times be millions of records deep and over 100 columns wide.  Quite often, this simple approach is sometimes counterintuitive to many data architects who lack analytic experience, but these unassuming data models meet a very broad number of analytic scenarios.  It is also worth noting that the growing use of and familiarity with spreadsheet-based analysis (e.g., Excel) among data driven cultures is particularly well-suited to simple, flat tables of rows and columns in the data warehouse where relational joins to other tables are less important than filters, sorts, and pivot tables to the data analysts’ needs.

 In conclusion, the most valuable and extensible data warehouse and BI systems in all industries, especially healthcare, use a data modeling strategy which leverages numerous data modeling techniques applied and arranged in adaptive fashion to meet the needs of data analysts in specific subject areas of reporting and analytics.


Glen McCallum said...

I could have really used this a year ago! : )

What are you thoughts on the architecture from the book "The Corporate Information Factory"?
1) Central Data Warehouse aggregates data from source applications in third normal form.
2) Information then flows into subject area specific data marts in star schema form to support easy drilling and reports.

Anonymous said...

Good article. We use a flat data model for most data, and don't use cubes. The flat data model retains all the original data detail, so more flexibility is allowed and storage requirements are small.

We go a step further and don't use SQL at all. In our experience, SQL is well suited for updating data, but poorly suited for analysis. We use a separate data base engine that we wrote and optimized for use with flat data. We figured out how to make it fast. Not as fast as cubes, but fast enough, a second or two for most analyses. Ask yourself, does google use SQL for doing their search? No. They wrote something special. The same logic applies to data warehouse for better performance. has more information.

Daniel Goldman

Russian Nuclear Subs Prowling Again

This article in the Washington Post describes the increased presence of Russian nuclear-armed submarines off the coast of the US, reflectin...