Data warehousing for the busy DBA
Part 2: What a data warehouse isn’t
Many data requirements can be fulfilled without going to the trouble of a data warehouse. In some cases the requirements are in conflict with the main requirements of a data warehouse system.
Data warehouses are not real-time systems
Real-time data warehousing is almost never a genuine requirement. Real-time analytic and aggregate reporting requirements are actually quite rare with very few exceptions. There are two very good reasons not to implement a real-time system unless you really need it:
- They are much more complex and harder to get right. Real-time systems can be made to work on simple data – market feeds, account transactions and balances or web server logs. Complex real-time systems are much harder to build and make stable. Unless there is a genuine requirement (see below) this is an anti-pattern.
- Stable data that can be reported by an ‘as-at’ position is much easier to work with. If you have figures changing in real-time you will have the same report generating different results if run twice. This can generate a whole class of really time-wasting reconciliation bunfights and can erode user confidence in the system.
Data warehouses are not operational reporting systems
One commonly confused requirement is to do operational reports off the data warehouse system. This is often cited as a requirement for real-time data. Generally this requirement is in conflict with the other requirements of a data warehouse, to the extent that it is a known anti-pattern.
Operational reports tend to be exactly that – operational. They are not analytic in nature; they tend to be detail level exception reports or status reports on a process. Normally they act as to-do lists or warnings of the need to rectify something. Typically, the process around an operational report will look something like:
- Run the report, using it to check work that needs to be done, enter the work into the source system, and then re-run the report to see if there’s anything left that needs clearing, or
- Enter data into the system and then run the report straight away including that data.
- Generate a to-do list or a work list for somebody.
These reports are not analytical in nature and are typically detail reports showing lists of specific items. Typically, operational reports are tied to specific processes, and they are not ad-hoc in nature. Normally the best way to fulfil an operational reporting requirement is to run the reports off a replicated copy of the production system.
Data warehouses are not tied to a single system
If you make your warehouse model too tightly coupled to a single system then you will run into major issues if you try to bring data in from other sources. The data warehouse should have a model of its own.
If your requirement is to provide reporting off a single system you can use warehouse-like architecture to implement the facility, but a stovepiped data mart is much easier to implement. If your reporting requirement is tied to a single system it might be worth taking a step back from the concept of a data warehouse and implementing something that will just fulfil that requirement. It will be much quicker and easier, and much more politically acceptable to other stakeholders.
A data warehouse is more than just a central data store
Many projects simply build relatively straightforward feeds from their source systems and drop it into a reporting database with no conformation process and (often) relatively little transformation. This fails a key requirement of a data warehouse in that there is no ‘single source of the truth’. This type of system still requires bespoke queries and extracts to get data from.
A system built like this places a large burden on the reporting layer and will not be able to support an ad-hoc reporting facility. It is unlikely to have much effect on the status quo. It is often called a ‘data warehouse’, but the result is disappointing as it doesn’t materially improve the users’ access to the data.
It is suggested that you resist building this type of system and stick with providing ad-hoc extracts as an alternative. Don’t waste your time with a project like this. Let the business stick with their SAS and MS Access extracts. At best, use the database as a staging area for these extract processes but try to dispel any expectation that the business will get self-service access to the data.
Qlikview is not a data warehouse
Various ad-hoc reporting tools have some ability to put metadata over an arbitrary collection of data sources, or extract the data into a cube format. Examples of this type of tool include Qlikview, MS Powerpivot, Cognos TM1 and Business Objects. These tools all work well with a data warehouse but they are often sold by the vendor as a substitute for one.
The strength of such tools is that you can build a reasonable ad-hoc reporting facility within the limitations of the data. However they do not address conformed data or data cleansing issues and they have limited ability to do significant transformation on the data. They offer no facility for reconciliation controls against source data, so it can be hard to troubleshoot or prove that the data shown in the tool is correct.
The best approach with these tools is either to build a data warehouse and use these tools as a front-end, or to use the tool to extract directly from source and accept the limitations. Do enough to get that on your CV and then go pimp yourself as a B.I. consultant before the business work out how crappy the underlying data is.