Tuesday, January 20, 2009

Data Warehouse Data Modeling

This is a fairly tech-ish blog entry, but given the rise of importance and number of projects related to data warehousing and business intelligence in healthcare-- which is a good thing-- I am concerned about the continued rate of costly and poorly informed decisions in the execution of those projects. A blog entry on the topic is probably worthwhile, even if a bit techie. The essence of this blog was originally posted among members of the Healthcare Data Warehousing Association (http://www.hdwa.org/).

I would caution those who are beginning the data modeling journey, and those who are relatively new to the journey and can still save themselves-- There is no single data modeling dogma which will solve your data modeling needs in an Enterprise Data Warehouse. Neither Ralph Kimball, Claudia Imhoff, nor Bill Inmon-- the oft cited "experts" in the field-- have ever had to actually live with the theories they espouse, especially in healthcare, in an operational setting over a long period of time.

Your DW data model should address three general needs and "metamodels", as follows:

(1) Data models that are essentially copies and/or denormalized versions of the production source systems, including all of their bizarre naming conventions. Some people call these "Operational Data Stores"-- whatever you want to call them is fine with me, as long as you stick to the basic principle of, "Copy the source system data. Maintain the truth and fidelity of that data, for better or worse, as you bring it into the DW." Maintaining the fidelity of the source system data is critical because, like lost fidelity in an audio recording, you can't go back once it's lost. But, as long as you have the original recording, you can do whatever you want to with the recording downstream. Data Analysts who are familiar with the typically arcane nature of the source systems will love you for giving them a better and easier place to run their queries than what they've had for the past 10-15 years in the source system environments, and they don't have to learn anything new about the data structures since you were so kind as to keep them intact.

(2) Data models that represent specialized data marts, spun off from the source system data models. These specialized data marts are driven by unique requirements-- let's say a financial dashboard or a diabetes registry or a full-fledged clinical quality improvement project-- and each of these unique data marts will also have their own unique and optimal way to model the data, including naming conventions that are customized to the Data Analysts in those unique environments. In 80% or more of the cases, most Data Analysts want simple rows and columns of data--there's a reason Excel is so beloved and there's a reason that the success of multidimensional cubes is not more widespread. In some cases, the analysis requirements might best be served by a star schema data model; or a snowflake; or a single wide and deep table with lots of redundancy, such as a row for every patient's encounter; or even a second or third normal form data model. The bottom line: Go with what works, with the simplest data model, and remember that most people will be very happy with simple rows and columns of data that they can suck into Excel for manipulation.

(3) A Master Reference data model area where you can store things for look-up (e.g., for data quality checks) like ICD-9 codes; CPT codes; SNOMED terms; LOINC terms; Drug codes; the Master Patient Index and Master Provider Index; local standards for department, facility, service line, charge codes; etc.

If you need a decent starting place to grease the wheels of your mind for a generalized healthcare data model, take a look at HL7 RIM and i2b2. They are at different ends of the data modeling spectrum-- RIM is pretty much 3NF and i2b2 is pretty much a star schema with a pinch of EAV-- but for gawd sake don't assume that either of these models will be the single data modeling approach which will satisfy all the needs that you will face in a healthcare data warehouse. Many healthcare organizations drank the Kimball Kool-Aid early in their evolution of BI and DW, and they are now discovering that they painted themselves in a corner that's going to cost a lot of time and money to unpaint.

In summary: (1) Maintain the fidelity of your source system data; (2) Design your subsequent data marts using the data model that makes sense for the analysts and reporting needs associated with that data mart; (3) Provide a robust Master Reference data model area which can be joined to your source system "Operational Data Stores" and specialized data marts.

Don't paint yourselves in a corner. Trust your common sense. Kimball, Imhoff, and Inmon know far less than you do about this topic.

No comments:

SpaceX Inspirations

SpaceX launched a two-astronaut crew yesterday, on a mission to dock with the International Space Station. It was the first human spaceflig...