Data Warehouse

Schema

A schema defines the structure of data, including fields or columns, constraints, and the relationships between them. It serves as a blueprint for organizing data, outlining how information is arranged and connected within a system. In the context of databases or data models, a schema establishes the rules and format for storing and accessing data.
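
For instance, a relational schema can be declared in SQL. A minimal sketch using Python's built-in sqlite3 module (table and column names are illustrative):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # A schema defines columns, constraints, and a relationship (foreign key).
    conn.executescript("""
        CREATE TABLE department (
            dept_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE employee (
            emp_id   INTEGER PRIMARY KEY,
            name     TEXT NOT NULL,
            hired_on DATE,
            dept_id  INTEGER REFERENCES department(dept_id)
        );
    """)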

Data Type

A data type specifies the nature of data, such as simple types like number, string, date, or float, and complex types like array, struct, or map. It defines the kind of values that can be stored in a field, determining how data is processed and interpreted. Data types are fundamental for ensuring data integrity and enabling appropriate operations on stored information.
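
For example, Python type hints (3.9+) can express both simple and complex types. A minimal sketch with illustrative field names:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Order:                    # struct: a named group of typed fields
        order_id: int               # simple type: number
        customer: str               # simple type: string
        placed_on: date             # simple type: date
        total: float                # simple type: float
        items: list[str]            # complex type: array of strings
        attributes: dict[str, str]  # complex type: map from string to string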

Metadata

Metadata provides additional information about data, including descriptions, functional fields, and authorship details. It includes technical information such as when data was created, last modified, and the context in which it exists. Metadata acts as a dictionary or guide, offering crucial details that help users understand and manage the primary data effectively.
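
As a minimal illustration, metadata for a table might be recorded as a simple dictionary (all names and values here are hypothetical):

    # Metadata describes the table; it is not the table's data itself.
    table_metadata = {
        "name": "fact_sales",
        "description": "Daily sales at store/product grain",
        "owner": "data-engineering-team",      # authorship
        "created_at": "2024-01-15T09:30:00Z",  # technical information
        "last_modified": "2024-06-01T12:00:00Z",
        "columns": {"revenue": "Sale amount in EUR"},
    }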

Data formats

Data Warehouse

Definition

A data warehouse is a centralized repository that stores large volumes of structured and integrated data from multiple sources. It is designed to support business intelligence, reporting, and data analysis by providing a consolidated view of organizational data.

Bottom-Up Approach to Building a Data Warehouse

The Bottom-up approach is a method for constructing a data warehouse that starts with individual, department-specific data marts and integrates them into a comprehensive enterprise-wide data warehouse.

Definition

The Bottom-up approach begins by identifying and developing small, targeted data marts that serve the immediate needs of specific business units or departments. These data marts are built incrementally and can be integrated later into a larger, more comprehensive Enterprise Data Warehouse or EDW.

Instructions

  1. Start Small:

    Begin by analyzing the data requirements of individual business units or departments.

  2. Develop Targeted Marts:

    Create Data Marts that are specific to a single area of the organization. This allows for quick delivery of insights and "quick wins" for specific teams.

  3. Implement Incrementally:

    Build and deploy these data marts one at a time, focusing on immediate needs.

  4. Integrate Over Time:

    As more data marts are created, they can be gradually integrated to form a larger, more holistic data warehouse. This approach allows for faster implementation and is more agile, since changes can be made to individual marts without disrupting the entire system (see the sketch after this list).
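
A minimal sketch of this progression, using Python's sqlite3 as a stand-in for a real platform (all table names are illustrative): two department-specific marts are built first and later integrated through shared keys.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Steps 1-3: independent, department-specific data marts.
        CREATE TABLE sales_mart     (date_key INTEGER, product_id INTEGER,
                                     revenue REAL);
        CREATE TABLE inventory_mart (date_key INTEGER, product_id INTEGER,
                                     on_hand INTEGER);

        -- Step 4: integrate over time through shared keys and dimensions.
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date DATE);
        CREATE VIEW enterprise_view AS
            SELECT s.date_key, s.product_id, s.revenue, i.on_hand
            FROM sales_mart s
            JOIN inventory_mart i USING (date_key, product_id);
    """)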

Top-Down Approach to Building a Data Warehouse

The Top-down approach is a strategic methodology for building a data warehouse that starts with a comprehensive, enterprise-wide view and then develops smaller, integrated components.

Definition

The Top-down approach begins with a strategic and comprehensive methodology that considers the entire organization's data requirements. It involves building a centralized Enterprise Data Warehouse or EDW first, which then serves as the foundation for developing various data marts for different departments.

Instructions

  1. Strategic Planning:

    Begin with a strategic plan that considers the data needs of the entire organization.

  2. Build the Core:

    Design and build a centralized Enterprise Data Warehouse or EDW that provides a holistic view of the company's data.

  3. Develop Marts:

    Create data marts from the central EDW to serve the specific needs of different departments.

  4. Ensure Consistency:

    This approach ensures data consistency across the organization by centralizing the data model and integration processes. While it provides a comprehensive view, it can be more rigid and time-consuming to implement (see the sketch after this list).
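
A minimal sketch of the opposite progression, again using sqlite3 with illustrative names: the central EDW tables are modeled first, and department marts are then derived from them as views.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Steps 1-2: the centralized EDW is designed and built first.
        CREATE TABLE edw_sales (date_key INTEGER, store_id INTEGER,
                                product_id INTEGER, revenue REAL);

        -- Step 3: department-specific marts are carved out of the EDW.
        CREATE VIEW finance_mart AS
            SELECT date_key, SUM(revenue) AS total_revenue
            FROM edw_sales GROUP BY date_key;
        CREATE VIEW merchandising_mart AS
            SELECT product_id, SUM(revenue) AS product_revenue
            FROM edw_sales GROUP BY product_id;
    """)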

Build approach

Operational Data Store or ODS

An ODS is designed to integrate and consolidate data from various operational systems in near real-time. It focuses on transactional data and serves as an interim repository for data before it is loaded into the data warehouse.

Enterprise Data Warehouse or EDW

A comprehensive and centralized repository that integrates data from various sources across an entire organization. It serves as the single source of truth for business intelligence and decision-making.

Data Mart

A smaller, more department-specific subset of an Enterprise Data Warehouse. It is designed to serve the needs of a particular business unit, department, or user group. Data marts are optimized for specific business functions or departments, allowing for faster query performance and analysis.

Preparation of a data warehouse

Storage and analysis of a data warehouse

Fact table

A fact table in a data warehouse data model consists of one or more numeric facts of importance to a business.

Remark

Facts must be consistent with the granularity (grain) of the fact table.

Types of facts

Facts are commonly classified as additive (summable across all dimensions, such as revenue), semi-additive (summable across some dimensions but not others, such as an account balance, which cannot be meaningfully summed over time), and non-additive (not summable at all, such as ratios and percentages).

Features of a fact table

The grain of a fact table is the level of detail it captures, usually given as the number of records per key within the table; in short, the grain of the fact table is the grain of the star schema (see below).
In a data warehouse, dimensions represent the points of interest for the business and serve as entry points into fact tables. Once defined in the model, a dimension can be reused multiple times with different fact tables: in a model containing multiple fact tables representing different data marts, a dimension common to those fact tables (a conformed dimension) ties the marts together.
A factless fact table captures a many-to-many relationship between dimensions but contains no numeric or textual facts. Factless fact tables are often used to record events or coverage information.
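
For example, a factless fact table recording class attendance might hold nothing but dimension keys. A sketch with illustrative names:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Factless fact table: only dimension keys, no measures.
        -- Each row records an event: a student attended a class on a date.
        CREATE TABLE fact_attendance (
            date_key    INTEGER,
            student_key INTEGER,
            class_key   INTEGER
        );
    """)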

Normalisation

Normalization is a process in database design that organizes data into separate tables to minimize redundancy and improve data integrity.
The purpose of normalization is to ensure that each piece of data is stored in only one place, reducing the risk of inconsistencies.
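
A minimal sketch of a normalized design (illustrative names): customer attributes live in one place, and orders merely reference them.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Before normalization, orders(order_id, customer_name,
        -- customer_city, amount) would repeat customer data on every row.
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                               name TEXT, city TEXT);
        CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                               customer_id INTEGER
                                   REFERENCES customer(customer_id),
                               amount REAL);
    """)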

Denormalisation

Denormalization involves combining related data from multiple tables into a single table to improve query performance.
The purpose of denormalization is to speed up queries by reducing the need for joins.
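
The denormalized counterpart of the sketch above: customer attributes are copied onto each order row, trading redundancy for join-free reads.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- One wide table; queries need no joins, but customer data
        -- is duplicated and must be kept consistent by the load process.
        CREATE TABLE orders_wide (
            order_id      INTEGER PRIMARY KEY,
            customer_name TEXT,
            customer_city TEXT,
            amount        REAL
        );
    """)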

Star schema

A star schema is a data warehouse schema with a single fact table and many denormalized dimension tables.
The fact table contains the primary keys of all the dimension tables (as foreign keys) together with numeric columns of additive facts.
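
A minimal star schema sketch in sqlite3 (illustrative names): denormalized dimensions around a single fact table.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Denormalized dimension tables.
        CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY,
                                  full_date DATE, month TEXT, year INTEGER);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                                  name TEXT, category TEXT, brand TEXT);

        -- The single fact table: dimension keys plus additive numeric facts.
        CREATE TABLE fact_sales (
            date_key    INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity    INTEGER,  -- additive fact
            revenue     REAL      -- additive fact
        );
    """)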

Grain of a star schema

In Data warehousing, grain refers to the level of detail available in a given fact table as well as to the level of detail provided by a star schema.
It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.

Snowflake schema

A snowflake schema contains normalized dimension tables in a tree-like structure with multiple nesting levels.
A snowflake schema is easier to maintain, but queries require more joins because of the nested levels.
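
The same product dimension, snowflaked (illustrative names): the category is normalized into its own table, so reaching it from a fact row takes an extra join.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Snowflaked dimension: category normalized out of dim_product.
        CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY,
                                   name TEXT);
        CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY,
                                   name TEXT,
                                   category_key INTEGER
                                       REFERENCES dim_category(category_key));
        -- A query now joins fact -> dim_product -> dim_category.
    """)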

Data Warehouse Processing Architecture

Data Lake

A data lake is a central repository that stores large volumes of raw and unprocessed data from various sources. It typically uses a flat architecture, storing data in its native format, such as CSV, JSON, or Parquet.
Data lakes offer scalability and flexibility, allowing organizations to collect and store vast amounts of data for future processing and analysis.
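
A minimal sketch of landing raw data in its native formats, assuming pandas is installed (to_parquet additionally requires pyarrow or fastparquet); the paths are illustrative.

    import os
    import pandas as pd

    raw = pd.DataFrame({"event": ["click", "view"],
                        "ts": ["2024-06-01", "2024-06-02"]})

    # A flat layout: files land as-is, with no upfront schema or modeling.
    os.makedirs("lake/events", exist_ok=True)
    raw.to_csv("lake/events/2024-06-01.csv", index=False)
    raw.to_json("lake/events/2024-06-01.json", orient="records")
    raw.to_parquet("lake/events/2024-06-01.parquet")  # needs pyarrow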

Data Lake use cases

Data lakehouse

A data lakehouse, also known as a unified analytics platform, is an evolution of the data lake concept.
It combines the strengths of a data lake with the capabilities of a data warehouse.
A data lakehouse provides a unified and structured view of data by introducing schema enforcement and indexing on top of the raw data stored in the data lake.
It enables organizations to perform both batch and real-time analytics on the same platform, offering a more streamlined and efficient data processing and analysis experience.

Data lakehouse use cases

Data Lake features

Data Visualisation

Data visualization is the representation of information and data using charts, graphs, maps, and other visual tools.
These visualizations make it easy to see patterns, trends, and outliers in a data set. Data visualization also presents data in an accessible manner to the general public or to specific audiences without technical knowledge.
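
A minimal example with matplotlib (the figures are made up for illustration):

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    revenue = [120, 135, 128, 150]  # illustrative values, in kEUR

    # A simple line chart makes the upward trend immediately visible.
    plt.plot(months, revenue, marker="o")
    plt.title("Monthly revenue")
    plt.xlabel("Month")
    plt.ylabel("Revenue (kEUR)")
    plt.show()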

Data transformations

Aggregations

Aggregations, in the context of databases and data analysis, involve the grouping and summarization of data. Common aggregation functions in SQL and other data analysis tools include COUNT, SUM, AVG, MIN, and MAX; many dialects also offer statistical aggregates such as variance and standard deviation, and string aggregation such as GROUP_CONCAT or STRING_AGG.
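
A minimal GROUP BY sketch in sqlite3 (illustrative data):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (region TEXT, amount REAL);
        INSERT INTO sales VALUES ('north', 10), ('north', 20), ('south', 5);
    """)

    # Group the rows by region and summarize each group.
    for row in conn.execute("""SELECT region, COUNT(*), SUM(amount),
                                      AVG(amount), MIN(amount), MAX(amount)
                               FROM sales GROUP BY region"""):
        print(row)  # e.g. ('north', 2, 30.0, 15.0, 10.0, 20.0)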

Slowly Changing Dimension

In the context of data warehousing and database design, Slowly Changing Dimensions (SCD) refer to the way in which historical data is managed when there are changes to the dimension attributes over time.
There are several types of Slowly Changing Dimensions, commonly denoted as SCD Type 1, SCD Type 2, SCD Type 3, and SCD Type 4.

SCD Type 1

In SCD Type 1, when a change occurs, the existing dimension record is simply updated with the new data. There is no tracking of historical values, so only the current information is retained.
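
A minimal Type 1 sketch in sqlite3 (illustrative names): the change simply overwrites the old value.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE dim_customer
                    (customer_id INTEGER PRIMARY KEY, city TEXT)""")
    conn.execute("INSERT INTO dim_customer VALUES (1, 'Oslo')")

    # SCD Type 1: overwrite in place; the old value ('Oslo') is lost.
    conn.execute("UPDATE dim_customer SET city = 'Bergen' "
                 "WHERE customer_id = 1")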

SCD Type 2

In SCD Type 2, a new record is added to the dimension table for each change. The original record is marked as inactive, and the new record contains the updated information. This way, a history of changes is preserved.
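
A minimal Type 2 sketch (illustrative names): the current row is expired and a new version is inserted, preserving history.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_customer (
            surrogate_key INTEGER PRIMARY KEY,
            customer_id   INTEGER,
            city          TEXT,
            valid_from    DATE,
            valid_to      DATE,
            is_current    INTEGER
        );
        INSERT INTO dim_customer VALUES (1, 1, 'Oslo', '2023-01-01', NULL, 1);
    """)

    # SCD Type 2: mark the existing record inactive...
    conn.execute("""UPDATE dim_customer
                    SET valid_to = '2024-06-01', is_current = 0
                    WHERE customer_id = 1 AND is_current = 1""")
    # ...and add a new record containing the updated information.
    conn.execute("""INSERT INTO dim_customer
                        (customer_id, city, valid_from, valid_to, is_current)
                    VALUES (1, 'Bergen', '2024-06-01', NULL, 1)""")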

SCD Type 3

In SCD Type 3, a limited amount of historical data is maintained. Instead of creating a new record for each change, only certain attributes are updated, and additional columns are added to the table to capture this limited history.
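
A minimal Type 3 sketch (illustrative names): an extra column captures one previous value, and a change shifts the current value into it.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE dim_customer (
                        customer_id   INTEGER PRIMARY KEY,
                        city          TEXT,
                        previous_city TEXT)""")
    conn.execute("INSERT INTO dim_customer VALUES (1, 'Oslo', NULL)")

    # SCD Type 3: keep limited history in additional columns.
    conn.execute("""UPDATE dim_customer
                    SET previous_city = city, city = 'Bergen'
                    WHERE customer_id = 1""")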

SCD Type 4

SCD Type 4 is a hybrid approach that combines elements of both SCD Type 1 and SCD Type 2. It typically involves creating a separate table to store historical changes while maintaining a current record in the main dimension table.
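
A minimal Type 4 sketch (illustrative names): the main dimension keeps only current values, while a separate table accumulates history.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Main dimension: current record only (updated as in Type 1).
        CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY,
                                   city TEXT);
        -- Separate history table: one row per past version (as in Type 2).
        CREATE TABLE dim_customer_history (customer_id INTEGER, city TEXT,
                                           valid_from DATE, valid_to DATE);
    """)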

ELT

ELT (Extract, Load, Transform) is a variation of Extract, Transform, Load (ETL). In ETL, transformation takes place on an intermediate server before the data is loaded into the target; in ELT, raw data is loaded directly into the target and transformed there.
With an ELT approach, a data extraction tool is used to obtain data from one or more sources, and the extracted data is stored in a staging area or database. Any required business rules and data integrity checks can be run on the data in the staging area before it is loaded into the data warehouse, but all data transformations occur in the data warehouse after the data is loaded.
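
A minimal end-to-end ELT sketch, using sqlite3 as a stand-in for the target warehouse (data and names are illustrative):

    import sqlite3

    warehouse = sqlite3.connect(":memory:")  # stand-in for the target

    # Extract: obtain raw records from a source.
    raw_rows = [("2024-06-01", "north", "10.0"),
                ("2024-06-01", "south", "oops")]  # one bad row

    # Load: land the raw data unchanged in a staging table.
    warehouse.execute("""CREATE TABLE stg_sales
                         (sale_date TEXT, region TEXT, amount TEXT)""")
    warehouse.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", raw_rows)

    # Transform: apply typing and integrity checks inside the warehouse.
    warehouse.execute("""
        CREATE TABLE sales AS
        SELECT sale_date, region, CAST(amount AS REAL) AS amount
        FROM stg_sales
        WHERE amount GLOB '[0-9]*'  -- drop rows failing the integrity check
    """)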

ELT Process
