
The Busy DBA: Data warehousing, part 1


Data Warehousing for the Busy DBA

Part 1: What is a data warehouse?

So, one of the enterprise architects has convinced management that we need a Data Warehouse.  You’re the DBA, you know about databases, right – we need you to go and design it.  Can we have an estimate by next Thursday for the board meeting?

This series of articles walks through key data warehouse concepts.  It is aimed at a database professional – a DBA or SQL developer – who is now getting involved in a data warehouse project.  Therefore the articles assume some background in database development or administration.

The term ‘data warehouse’ gets used quite loosely at times.  To start, here is a definition of what a data warehouse is (and is not) and the reasoning behind that definition.  We’ll also look at some alternatives to a data warehouse if your requirements aren’t congruent with an actual data warehouse project.  A data warehouse is more than just a reporting database.  A full-fledged data warehouse will have most or all of these requirements:

Multiple Data Sources

The system will load data from more than one source.  These sources could contribute metric (numerical or statistical) data into the same fact table, or they could be external sources of reference data that is not held in the source systems but is needed for reporting.

Clean and Conforming Data

All of the content in any given data item should be correct and behave the same, no matter which source it was loaded from.  To achieve this, the data is transformed into a common format as it is loaded.  Failure to do this means that data in a given data item will need to be treated differently depending on the source, which creates a non-obvious leaky abstraction.
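
As an illustration, a minimal conforming transformation might map raw values from two hypothetical source encodings onto a single conformed form.  The names, codes and mapping table here are purely illustrative:

```python
# Sketch: conforming a 'country' attribute loaded from two hypothetical
# source systems that encode it differently.  Values are illustrative.
CONFORM_COUNTRY = {
    # source A uses ISO alpha-2 codes
    "GB": "United Kingdom",
    "US": "United States",
    # source B uses free-text names with inconsistent casing
    "united kingdom": "United Kingdom",
    "u.s.a.": "United States",
}

def conform_country(raw_value: str) -> str:
    """Map a raw source value onto the single conformed form."""
    key = raw_value.strip()
    return CONFORM_COUNTRY.get(key) or CONFORM_COUNTRY.get(key.lower()) or "Unknown"

print(conform_country("GB"))       # conformed regardless of which source it came from
print(conform_country("U.S.A."))
```

Downstream queries then see one well-behaved value, regardless of which source loaded the row.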

It will not take many incorrect reports produced by naïve end users before the system’s credibility starts to erode.  ‘User training’ is not a good solution.  The system will quickly get a reputation for being incorrect or difficult to use, which will largely relegate its usefulness to reporting teams.  As often as not the reporting teams will be comfortable with their own stove-piped extracts anyway.

Star Schemas

Star schemas are not strictly necessary, but you’re doing it the hard way if you don’t make use of them.  They are effective for efficient queries over aggregate data, as there is typically only one large table involved in the join, and the most expensive query plan operator will be a single table scan.  Partitioning works well on star schemas as there are no joins between large partitioned tables.  Many DBMS platforms also have special query plan operators for efficient star schema queries.
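
To make the shape concrete, here is a toy star schema built in SQLite (table and column names are illustrative): one fact table joined to one small dimension, with the aggregate query filtering on the dimension and scanning the fact table.

```python
import sqlite3

# Minimal star schema sketch: one fact table (fact_sales) joined to one
# small dimension (dim_date).  Names and data are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sales (date_key INTEGER REFERENCES dim_date, amount REAL);
    INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO fact_sales VALUES (20240101, 100.0), (20240101, 50.0), (20240201, 75.0);
""")

# A typical star-schema query: filter on the dimension, aggregate the fact.
# Only the fact table is large, so the dominant cost is a single scan of it.
rows = con.execute("""
    SELECT d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    WHERE d.year = 2024
    GROUP BY d.month
    ORDER BY d.month
""").fetchall()
print(rows)  # [(1, 150.0), (2, 75.0)]
```

In a real warehouse `fact_sales` would hold millions of rows and join to several such dimensions, but the query shape stays the same.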

Slowly changing dimensions (SCDs) are by far the simplest structure for capturing historical state; it is much easier to do this in a star schema with SCDs than in a normalised structure.  I can see almost no reason to recommend a 3NF structure for historic data.
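
A Type 2 SCD keeps history by expiring the current dimension row and appending a new version whenever a tracked attribute changes.  A minimal sketch, with illustrative field names:

```python
import datetime

# Sketch of a Type 2 slowly changing dimension: a change to a customer's
# attributes closes the current row and inserts a new version.  The
# customer/city fields are illustrative.
def apply_scd2(rows, customer_id, new_city, today):
    """Expire the current row for customer_id and append the new version."""
    for row in rows:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            row["valid_to"] = today          # expire the old version
    rows.append({
        "customer_id": customer_id,
        "city": new_city,
        "valid_from": today,
        "valid_to": None,                    # None marks the current row
    })

dim_customer = [
    {"customer_id": 1, "city": "Leeds",
     "valid_from": datetime.date(2020, 1, 1), "valid_to": None},
]
apply_scd2(dim_customer, 1, "York", datetime.date(2024, 6, 1))
# History is preserved: an as-of-2022 query sees Leeds, a current query sees York.
```

Facts loaded before the change continue to join to the old row, so historic reports stay correct without any extra machinery.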

Historical Data

Typically a system of this sort has requirements to examine trends in data over time.  A data warehouse system will have historical data and many requirements to run queries over that data.  If you have requirements for real-time data it is likely that requirements for operational reporting are being conflated with the concept of a data warehouse.  This is a known anti-pattern, which will be discussed in more depth later.

Analytical Queries Across Data in Aggregate

Typical data warehouse queries are aggregate, producing statistical or financial metrics across a large volume of data.  Normally a table scan is the most efficient way to execute this sort of query; if a query hits more than a few percent of the rows in a fact table, it is usually faster to scan it (using sequential I/O) than to use bookmark or index lookup operations.

One key implication of this is that you will typically want to optimise your system for fast table scans, as opposed to fast random access operations.  This has significant implications on storage architecture.

The System is Fronted by Ad-hoc Reporting Tools

One of the benefits of a data warehouse is the ability to provide a self-service query facility with clean data that is safe to be used by relatively naïve end users.  This places two key requirements on the data:

  • The data must be well behaved – clean, conformed and consistent.  The data must be ‘safe’ enough to allow non-technical staff to query it with a reasonable expectation of getting a correct result.
  • The data must be structured so it plays nicely with the reporting tool being used on the project.

Practically, this implies a data cleansing process, and (most likely) star schemas, although some tools (e.g. MS Access) are designed to use relational data sources.

The Project Will be Highly Political

Data warehouse projects tend to be highly political and characterised by systemic ‘responsibility without authority’ issues.  You will need to put fingers in a lot of pies in the course of a data warehouse project, and they may not always be welcome.  The #1 failure mode of data warehouse projects is weak business sponsorship.  You will find a lot of obstacles, mostly political but some technical – usually requirements for data that isn’t actually recorded anywhere, is recorded at the wrong grain, or is of unusably poor quality.

A strong business sponsor is an absolute necessity for a data warehousing project of non-trivial scope to succeed.  The project will be politicised and painful; the business will feel the pain.  You will have many issues that you depend on third parties to fix, as you have no access to or authority over the affected systems.  These blockages will frequently hold up the project or limit what can be achieved.  Although they are caused by third parties, they will typically be seen as the responsibility of the data warehouse team.

If the project has any profile and budget, these delays will quickly burn through a reasonable-sounding contingency, especially if a large team or (worse yet) a consulting firm has been brought in to do the work.  Data warehouse projects have a reputation for being time-consuming and expensive, and the principal driver of this cost is the zillion and one holdups caused by external stakeholders.
