DATA WAREHOUSING FUNDAMENTALS
Data warehousing is a Techno functional module. It bridges
gap between Technical and Functional activities
Any database has two types of systems:
1) OLTP
(Online Transaction Processing) system
2) OLAP
(Online Analytical Processing) system
DWH is a database containing the collection of current and
historical data in a single platform drawn from different OLTP systems so that
this data can be used for effective reporting and analysis
OLTP
*
Mainly used for Transaction purpose
*
INSERT, UPDATE and DELETE operations are
implemented
*
Data is volatile
*
Normalized data
*
Contains current data (~ 1 yr)
OLAP
*
Mainly used for Reporting and Analysis
*
Only SELECT operation is used for reporting and
analysis
*
Non-volatile data
*
De-normalized data
*
Contains historical data (~ 5 to 25 yrs)
Traditional
Definitions of DWH
• DWH
is a copy of the transactional data specially structured for reporting and
analysis
• DWH
is a Subject oriented, Integrated, Non-volatile and Time-variant data in order
to support the effective decision making of the enterprise
OLTP supports two types of data
1) Master
Data is the data that is created only once, but changed occasionally. e.g.,
Emp#. An emp joins in a company only once, but his salary, designation,… may
change occasionally
2) Transactional
Data is generated due to the business activities. e.g., Price, Quantity.
Purchasing a product is a transaction
Granularity in DWH
Granularity specifies the level of detail with in DWH to be
represented. It is specified by 3 major factors:
1) Existing
level of detail in the OLTP systems
2) Requirements
of the Business users
3) Storage
capacity of DWH
Ageing/Backup Process
As the time passes by; Current data becomes Historical
and Historical becomes Most Historical
So it is not a good idea to keep the most historical data in
the online storage which is less frequently used
Better is to move the data to offline in order to
accommodate incoming data from the OLTP systems
Such a process is called “Ageing Process”
Architecture of a DWH
Data warehousing architecture includes the following
components:
1) Source
Systems
2) Staging
Layer
3) DWH
4) Data
marts
5) Reports
6) Business
users
Objectives of DWH
1) Technical
Objective of constructing a DWH is to separate the transactions from
reporting & analysis
** If transactions and reporting are performed on same
system a Deadlock might occur, which may lead to ineffective reporting **
2) The business users should be self capable of creating
their queries to generate the reports
The technical details of the database can be hidden from the
business users
Once the reports are produced, the business users can
analyze the data multi-dimensionally
Phases of DWH implementations
1) Modeling
the DWH
2) Migrating
the data from OLTP systems to DWH with the help of ETL Process
3) Utilizing
the DWH for producing 2-dimensional and multi-dimensional reports
Multi-Dimensional Modeling
*
A model acts as a blue print for the entire DWH
*
Data modeling is the process of converting the
business requirements into the technical specifications of the database
*
Multi-Dimensional modeling is the principle that
is used to model data marts and DWH
*
Multi-Dimensional modeling is based on the
concept called Star-Schema
STAR SCHEMA
Star Schema consists of two types of tables
1) Dimensional
Tables
2) Fact
Tables
*
Dimensional tables contain the Master data from
OLTP systems, also called as Dimensions
*
Fact tables contain the Transactional data from
OLTP systems, also called as Measures
*
Fact table in a Star Schema contains two
sections:
*
i) Key Section
*
ii) Facts/Measures Section
All the Measures that are generated due to business
activities in the OLTP systems should be recorded in the Fact section of the
Fact table
A Star Schema
contains multiple Dimensional tables and a single Fact table
** If a schema
contains multiple Fact tables, it is called as “Galaxy” or “Congestion” **
*
Among the tables of the Star schema, Dimensional
tables are independent and Fact tables are dependent in nature
*
To have the relationship between dimensional and
fact tables, the primary keys of dimensional tables should be
migrated/transferred as foreign keys to the keys section of the fact table
*
The Keys section of the fact table is a Composite Primary key which is a
collection of different foreign keys from different dimensional tables.
Types of Star
Schema’s
OLAP supports two types of Star schema’s:
1) Star
Flake Star Schema
2) Snow
Flake Star Schema
STAR FLAKE
*
Dimensional table’s are De-Normalized
*
Fact table’s are Normalized
*
Effective for reporting
SNOW FLAKE
*
Dimensional table’s are Normalized
*
Fact table’s are Normalized
*
|
Effective for Transactions
Initial and Delta Extracts
*
Data extracted from OLTP systems into OLAP
systems for the first time is treated as Initial Extract
*
Delta Extract specifies the changes
*
Changes are in the form of Inserts, Updates and
Deletes
*
In the case of Delta Extract, only the records
that has changed after the previous extract should be extracted into OLAP
*
If the DWH maintains only the current data,
Inserts of the OLTP systems are treated as Inserts and Updates are treated as
Updates -- This is treated as Type I
change
*
If OLAP systems maintain both current and
historical data, Inserts are treated as Inserts and Updates are treated as
Inserts – This is treated as Type II change
*
The Primary Keys of the OLTP systems should not
be used as the Primary keys in DWH. It generates its own keys called as Warehouse
Keys or Surrogate Keys for the purpose of internally identifying the
records
*
In the above example EmpNo.’s 102 & 103
becomes duplicate. Hence WHEmpNo is kept as PK
Types of Dimension Tables
Master Data loaded into the Dimension tables is changed
occasionally. These Dimension tables are treated as Slowly Changing Dimension
tables
1) SCD Type 1 -- Maintains Current Data
I – I
U – U
|
2) SCD Type 2 -- Maintains Current Data & Full History
i) Flag Current Data
I – I
U – U
U – I
*
In Flag type, the previous version of the Flag
should be changed from New to Old
*
The demerits of this type is, we cannot know the
exact older version record
*
From above scenario, we cannot say which is the
older among 3002 & 3008. In this situations we go for Version type
*
ii) Version Number
*
I
– I
*
U
– I
*
In this type all the updates are treated as
Inserts
*
This type can identify the old version records
*
iii) Effective Date Range
I
– I
U
– U
U
– I
*
By this type, we can track changes w.r.t Date
3) SCD Type 3 – Maintains Current Data & one time
History
I – I
U – U
Types of Fact tables
1) Transaction
Grain Fact table
2) Periodic
Snapshot Fact table
3) Accumulating
Snapshot Fact table
4) Fact
less Fact table
1) Transaction Grain Fact table
*
The
Grain specifies the level of details, transactions can be recorded
*
The
Transaction Grain Fact table in OLAP system maintains the same level of details
in OLTP system
*
For this
type of tables, the data should be extracted from OLTP to OLAP on daily basis
*
Since
the data is present at the lowest possible level of detail; Analysis can be
done to the lowest possible extent
*
Operational
Data Source (ODS) is the best example of this type
2) Periodic Snapshot Fact table
*
The data
is loaded into the OLAP tables periodically
*
The user
can have the correct estimate of the number of records that can be loaded
*
When
ever the data is loaded, the data for that particular period is summarized and
loaded as a single record
*
These
type of tables regenerate the current data
*
The
records are only inserted into these tables; but not updated
*
Telephone
bill database is the best example for this type of tables
3) Accumulating Snapshot Fact
table
*
All the
values of the records of the Accumulating snapshot fact tables are not known in
advance. We need to wait for the time to pass in order to make the events to
occur to realize these particular values
*
These
contain date columns
*
Data has
to be loaded daily
*
Data in
these tables represent the current data
*
Shipping
Products is an example of Accumulating Fact table
4) Fact less Fact table
*
Some
times regular star schema cannot provide the possible analysis. Such cases need
accompanying star schema where the fact table will have only the key section
but not measures; such type of tables are called Fact less Fact tables
(Coverage tables)
*
Log-in
and Log-out times of an employee doesn’t require any measures. This is an
example of Fact less Fact table
Types of OLAP
1) ROLAP
2) MOLAP
3) HOLAP
4) DOLAP
*
Reports can be created from 2-dimensional or
multi-dimensional structures
*
2-D reports are created from tables
*
Multidimensional reports are created from cubes
*
ROLAP (Relational OLAP)
*
The dimension tables and the fact tables in
Relational OLAP exists as relational tables
*
Since all the data will be present in the
relational database, the current data as well as historical data can
participate in reporting process
*
Drilling the data can be done to the detailed
level
*
Reporting is slow with ROLAP
*
2) MOLAP (Multi-dimensional OLAP)
*
In MOLAP, the data is physically stored in
multi-dimensional databases called Cubes
*
Since the data is directly existing in
multidimensional database, reporting is extremely fast
*
Data in multidimensional data marts is first
migrated from OLTP systems to relational DWH and from relational DWH to
multidimensional data marts
*
Current data will not participate in the
reporting process with MOLAP
*
Drilling the data cannot be done to the detailed
level
*
3) HOLAP (Hybrid OLAP)
*
HOLAP is the combination of ROLAP and MOLAP
*
With HOLAP drilling the data can be done to the
detailed level with fast reporting
*
4) DOLAP (Desktop OLAP)
*
When ever the executives present the information
at the time of seminars; critical information is stored in the desktop
databases like MS-Access, Excel sheets and XML files. Such type of structures
are called DLOP structures
Types of Dimensions
Dimension
A dimension table typically has two types of
columns, primary keys to fact tables and textual\descriptive data.
Eg: Time, Customer
Types of Dimensions
- Slowly Changing
Dimensions
- Rapidly Changing
Dimensions
- Junk Dimensions
- Inferred
Dimensions
- Conformed
Dimensions
- Degenerate
Dimensions
- Role Playing
Dimensions
- Shrunken
Dimensions
- Static
Dimensions
Slowly Changing Dimensions
Attributes of a dimension that would undergo changes over time. It
depends on the business requirement whether particular attribute history of
changes should be preserved in the data warehouse. This is called a slowly
changing attribute and a dimension containing such an attribute is called a
slowly changing dimension.
Rapidly Changing Dimensions
A dimension attribute that changes frequently is a rapidly
changing attribute. If you don’t need to track the changes, the rapidly
changing attribute is no problem, but if you do need to track the changes,
using a standard slowly changing dimension technique can result in a huge
inflation of the size of the dimension. One solution is to move the attribute
to its own dimension, with a separate foreign key in the fact table. This new
dimension is called a rapidly changing dimension.
Junk Dimensions
A junk dimension is a single table with a combination of different
and unrelated attributes to avoid having a large number of foreign keys in the
fact table. Junk dimensions are often created to manage the foreign keys
created by rapidly changing dimensions.
Inferred Dimensions
While loading fact records, a dimension record may not yet be
ready. One solution is to generate a surrogate key with null for all the other
attributes. This should technically be called an inferred member, but is often
called an inferred dimension.
Conformed Dimensions
A dimension that is used in multiple locations is called a
conformed dimension. A conformed dimension may be used with multiple fact
tables in a single database, or across multiple data marts or data warehouses.
Degenerate Dimensions
A degenerate dimension is when the dimension attribute is stored
as part of fact table, and not in a separate dimension table. These are
essentially dimension keys for which there are no other attributes. In a data
warehouse, these are often used as the result of a drill through query to
analyze the source of an aggregated number in a report. You can use these
values to trace back to transactions in the OLTP system.
Role Playing Dimensions
A role-playing dimension is one where the same dimension key —
along with its associated attributes — can be joined to more than one foreign
key in the fact table. For example, a fact table may include foreign keys for
both ship date and delivery date. But the same date dimension attributes apply
to each foreign key, so you can join the same dimension table to both foreign
keys. Here the date dimension is taking multiple roles to map ship date as well
as delivery date, and hence the name of role playing dimension.
Shrunken Dimensions
A shrunken dimension is a subset of another dimension. For
example, the orders fact table may include a foreign key for product, but the
target fact table may include a foreign key only for productcategory, which is
in the product table, but much less granular. Creating a smaller dimension
table, with productcategory as its primary key, is one way of dealing with this
situation of heterogeneous grain. If the product dimension is snowflaked, there
is probably already a separate table for productcategory, which can serve as
the shrunken dimension.
Static Dimensions
Static
dimensions are not extracted from the original data source, but are created
within the context of the data warehouse. A static dimension can be loaded
manually — for example with status codes — or it can be generated by a
procedure, such as a date or time dimension.
No comments:
Post a Comment