• Data Warehouse

  • Data Warehouse

    A data warehouse is a central repository that integrates data from a variety of sources. It can contain current as well as historical data and acts as a source of information for management reporting.

    Depending upon the development approach used, it might contain a series of data marts or a single integrated schema.

    It will generally contain some form of staging area for receiving the incoming sources, a data vault or atomic data store, and finally a business intelligence layer from which the business draws its data.

    A scheduler of some form will be used to control the incoming feeds, and a business intelligence reporting tool (such as Business Objects) will be used to provide management reports.

    Generally the BI tool will be the main method for viewing the information.

  • Kimball vs Inmon

    Ralph Kimball and William Inmon represent two different ways of developing a data warehouse. The approach you take will depend on the nature of your organisation and the nature of your data. Each person is described here with further links and their methods are briefly compared.

    Ralph Kimball
    Kimball suggests that a data warehouse should be built as a series of data marts, each of which represents a functional area or part of the business.

    The benefit of this is that the data marts are relatively quick to develop and access to them is easier to control. Change should also be easier, as each data mart is owned by a different part of the business.

    The data marts are based upon star schemas, with a central fact table surrounded by a series of dimension tables. The development process would be:


    1. Choose the business process
    2. Declare the grain
    3. Identify the dimensions
    4. Identify the facts

    Subsequently the star schema may be normalised (or snowflaked) to remove redundancy.
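
    As a rough illustration of the star schema idea, the sketch below creates a central fact table and its surrounding dimension tables using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_date, dim_product, dim_store) are hypothetical examples chosen for this sketch, not part of any particular design.

      # Minimal star schema sketch: one fact table at the declared grain,
      # surrounded by dimension tables. Names are illustrative only.
      import sqlite3

      ddl = """
      CREATE TABLE dim_date (
          date_key  INTEGER PRIMARY KEY,   -- surrogate key, e.g. 20240131
          full_date TEXT,
          month     INTEGER,
          year      INTEGER
      );
      CREATE TABLE dim_product (
          product_key  INTEGER PRIMARY KEY,
          product_name TEXT,
          category     TEXT
      );
      CREATE TABLE dim_store (
          store_key  INTEGER PRIMARY KEY,
          store_name TEXT,
          region     TEXT
      );
      -- Grain: one row per product, per store, per day.
      CREATE TABLE fact_sales (
          date_key    INTEGER REFERENCES dim_date(date_key),
          product_key INTEGER REFERENCES dim_product(product_key),
          store_key   INTEGER REFERENCES dim_store(store_key),
          quantity    INTEGER,
          amount      REAL
      );
      """

      conn = sqlite3.connect(":memory:")
      conn.executescript(ddl)
      conn.close()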

    William Inmon
    Inmon defined a single enterprise-wide data warehouse containing a single version of the truth. He encourages an iterative approach rather than a big bang, and suggests that satellite databases can subsequently be created to serve departments within the company.



    Aspect            Inmon                                        Kimball
    Building DW       Time consuming                               Quicker
    Maintenance       Easy                                         Difficult
    Cost              High initially, lower later project costs    Low initially, subsequent projects roughly equal cost
    Time              Longer start up                              Shorter start up
    Skill             Specialised                                  Generalist
    Data Integration  Enterprise wide                              Individual business areas



  • Staging

    The Staging Area
    This is the area in the data warehouse where data is landed from the various data sources for future processing.

    The data structure is likely to be more representative of the raw data being received rather than the later processed form.

    It is likely that some form of scheduler is involved either pulling the data to the staging area or pushing data from a remote source.

    Data may come in via FTP, secure FTP, direct database access, XML via SOAP or some other, more exotic, route.

    The data that arrives may be cleansed at this point, and deleted once it has been successfully processed. Many of the attributes in the records received may move no further than staging, as they are not required.
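
    To make the landing step concrete, here is a small sketch of a pull into the staging area over plain FTP, using only the Python standard library. The host, credentials, file name and staging path are placeholders invented for the example; a real feed might equally be pushed by the source system or fetched over SFTP.

      # Sketch: land one source file in the staging area, stamped with the load date.
      import datetime
      import ftplib
      import pathlib

      STAGING_DIR = pathlib.Path("/data/warehouse/staging")   # placeholder path

      def pull_feed(host, user, password, remote_file):
          stamp = datetime.date.today().strftime("%Y%m%d")
          target = STAGING_DIR / f"{stamp}_{remote_file}"
          with ftplib.FTP(host) as ftp:                        # plain FTP for simplicity
              ftp.login(user, password)
              with open(target, "wb") as fh:
                  ftp.retrbinary(f"RETR {remote_file}", fh.write)
          return target

      # A scheduler (cron, Control-M, Airflow, ...) would call something like:
      # pull_feed("ftp.example.com", "feed_user", "secret", "sales_extract.csv")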

  • ETL

    Extract
    Extract data from the remote systems and validate it to make sure that it meets the expected standards. Data quality checks may also be performed at this stage.

    Transform
    Transform the data as required; for instance, columns in the data may need to be pivoted into rows, or the data aggregated.

    Load
    Load the data into the target tables. At this point final data quality may need to be checked: did all of the data load, and were there any unexpected values?
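
    The three steps can be pictured with a small end-to-end sketch. The file name, table name and column names below are assumptions made for the example: a CSV with product and monthly amount columns is extracted, the month columns are pivoted into rows, and the result is loaded into a target table with a simple row-count check.

      # Minimal ETL sketch using only the Python standard library.
      import csv
      import sqlite3

      def extract(path):
          # Extract: read the raw file, rejecting rows that fail a basic validation rule.
          with open(path, newline="") as fh:
              for row in csv.DictReader(fh):
                  if row["product"]:
                      yield row

      def transform(rows):
          # Transform: pivot the month columns into one row per product per month.
          for row in rows:
              for month in ("jan", "feb", "mar"):
                  yield (row["product"], month, float(row[month]))

      def load(records, conn):
          # Load: insert into the target table, then check how much arrived.
          records = list(records)
          conn.execute("CREATE TABLE IF NOT EXISTS fact_monthly_sales "
                       "(product TEXT, month TEXT, amount REAL)")
          conn.executemany("INSERT INTO fact_monthly_sales VALUES (?, ?, ?)", records)
          loaded = conn.execute("SELECT COUNT(*) FROM fact_monthly_sales").fetchone()[0]
          print(f"loaded {loaded} rows from {len(records)} transformed records")

      # Usage, assuming a sales_extract.csv with product, jan, feb and mar columns:
      # load(transform(extract("sales_extract.csv")), sqlite3.connect("warehouse.db"))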

  • Data Vault

    DV or ADS
    A data vault or atomic data store is an intermediate area in the data warehouse where all of the data received can be stored before final processing.

    So no matter how the final loading is carried out, the raw data is always retained. If at a later date we find that our transformation or loading rules do not match our needs, we still have all of the raw data.
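
    A toy version of the idea might look like the sketch below: every incoming record is written away untouched, together with its source and load timestamp, so it can be re-processed later if the rules change. The table layout is an assumption for illustration and is far simpler than a real data vault model.

      # Sketch: keep every raw record verbatim so it can be re-processed later.
      import datetime
      import json
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("""CREATE TABLE ads_raw_record (
          source  TEXT,   -- which feed the record came from
          load_ts TEXT,   -- when it was landed
          payload TEXT    -- the untouched raw record, stored as JSON
      )""")

      def store_raw(source, record):
          conn.execute("INSERT INTO ads_raw_record VALUES (?, ?, ?)",
                       (source, datetime.datetime.now().isoformat(), json.dumps(record)))

      store_raw("sales_feed", {"product": "widget", "jan": "10.5", "feb": "12.0"})

      # Later re-processing simply re-reads the stored payloads:
      for source, load_ts, payload in conn.execute("SELECT * FROM ads_raw_record"):
          record = json.loads(payload)   # apply the new transformation rules here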

  • BI Layer

    BIW
    The business intelligence warehouse is the final layer in the data warehouse. It is intended to be the layer the business uses to obtain its data (mostly through BI tools).

    If the business is accessing the data vault as well, there is probably a problem with the warehouse design. Direct access to the ADS or data vault should be discouraged; instead, investigate the underlying user need.

  • Reporting Tools

    Business Objects (SAP)
    Universes are created as a metadata layer above the warehouse, providing a business view of the data. The user community is then able to create reports by selecting objects.

    A web interface is available and reports can be scheduled and delivered in various forms.

    Cognos (IBM)
    Cognos is a multi-application, web-based system; Cognos Connection is the main access point. It allows the user to browse the other functions of the suite, manage security, manage the server, and search for and schedule reports.

    Query Studio allows simple reports to be created and formatted; diagrams can also be added.

    Report Studio allows the user to create reports in two modes, Professional or Express. Professional offers a wide range of report functionality and data access. Express is aimed at the non-technical user, with a simplified interface, but still allows dynamic data access.

    Analysis Studio allows analysis of large data sets, including OLAP-type functionality, so that an anomaly or event can be better understood.

    Event Studio is a notification tool; it is used to create agents that report on events in real time.

  • Data Validity

    Data Quality
    So you have created your data warehouse, built your feeds, scheduled the data loads and started populating the warehouse. But how do you know that what you have loaded is meaningful?

    You need to create a set of data quality rules and implement them via some kind of data quality or validation tool. You especially need to concentrate on your reference data or dimensions.

    If slowly changing dimensions, especially the higher-order types that keep history, have defects, then those problems will propagate out into the rest of the warehouse. These problems are costly and time-consuming to fix.
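
    As a minimal sketch of what such rules might look like, the example below runs two checks against an assumed Type 2 slowly changing dimension called dim_customer with valid_from, valid_to and is_current columns; the table and column names are invented for the illustration, and a real DQM tool would offer far more than this.

      # Two hand-rolled data quality rules for a Type 2 slowly changing dimension.
      import sqlite3

      RULES = {
          # A business key should not have more than one current row.
          "one_current_row_per_customer": """
              SELECT customer_id FROM dim_customer
              WHERE is_current = 1
              GROUP BY customer_id HAVING COUNT(*) > 1
          """,
          # History rows for the same key should not overlap in time.
          "no_overlapping_history": """
              SELECT DISTINCT a.customer_id FROM dim_customer a
              JOIN dim_customer b
                ON a.customer_id = b.customer_id
               AND a.rowid <> b.rowid
               AND a.valid_from < b.valid_to
               AND b.valid_from < a.valid_to
          """,
      }

      def run_rules(conn):
          # Return the names of any rules that found offending rows.
          return [name for name, sql in RULES.items() if conn.execute(sql).fetchall()]

      # Usage: failures = run_rules(sqlite3.connect("warehouse.db"))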

    I have included some DQM tools here for reference; they are by no means a comprehensive list, but they should provide a start.

    IBM InfoSphere Quality Stage
    Quality Stage allows rule set up and data quality management for very large data sets.

    Informatica Data Explorer
    Data Explorer finds relationships in the data, discovers anomalies, and detects data quality issues using validation rules.

    Business Objects Data Integrator
    Data Integrator is a combined ETL and data quality tool; it can be used to obtain data from remote sources, transform it and look for inconsistencies.

  • OLAP Cubes

    Hyper Cube
    OLAP cubes or hyper cubes are just another way of thinking about the data that we need to manipulate.

    If you think of the data as a hyper cube with a dimension on each axis, it becomes simple to think of operations like slicing or dicing the cube, or rotating it to get a better view of a different dimension.
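
    A toy illustration of that mental model: the cube below is just a dictionary of cells keyed by (month, product, region), and slicing or rolling up are small operations over those keys. The values are made-up sample figures.

      # Toy hyper cube: cells keyed by (month, product, region) with a sales measure.
      cube = {
          ("2024-01", "widget", "north"): 120,
          ("2024-01", "widget", "south"): 80,
          ("2024-02", "gadget", "north"): 95,
          ("2024-02", "widget", "north"): 130,
      }

      def slice_cube(cube, product):
          # Slice: hold the product dimension fixed, keeping the other two axes.
          return {(m, r): v for (m, p, r), v in cube.items() if p == product}

      def rollup(cube, axis):
          # Roll up: total the measure over all but the chosen dimension.
          totals = {}
          for key, value in cube.items():
              totals[key[axis]] = totals.get(key[axis], 0) + value
          return totals

      print(slice_cube(cube, "widget"))   # a 2-D slice for one product
      print(rollup(cube, axis=2))         # totals per region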

  • Terms

    ADS The atomic data store, an intermediate area where all historical raw data is kept for future reference.

    BI Business Intelligence.

    Data mart A Kimball inspired mini schema within the data warehouse based upon the idea of a star schema and usually related to a specific part of the business or functional area.

    Data Vault A similar idea to the ADS, a large historical data storage vault.

    DQM Data Quality Management, automated quality checking of the warehouse data.

    ETL Extract, Transform, Load. Describes the three main operations needed to get the data and deliver it to the BI layer.

    Snowflaking A Kimball inspired method in which a star schema is normalised to produce a snowflake schema pattern.

    Star Schema A Kimball inspired schema approach where dimension tables are clustered around a single fact table.