Data Warehouse
Schema
A schema defines the structure of data, including fields or columns, constraints, and the relationships between them. It serves as a blueprint for organizing data, outlining how information is arranged and connected within a system. In the context of databases or data models, a schema establishes the rules and format for storing and accessing data.
Data Type
A data type specifies the nature of data, such as simple types like number, string, date, or float, and complex types like array, struct, or map. It defines the kind of values that can be stored in a field, determining how data is processed and interpreted. Data types are fundamental for ensuring data integrity and enabling appropriate operations on stored information.
Metadata
Metadata provides additional information about data, including descriptions, functional fields, and authorship details. It includes technical information such as when data was created, last modified, and the context in which it exists. Metadata acts as a dictionary or guide, offering crucial details that help users understand and manage the primary data effectively.
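As a small illustration of these three ideas, the sketch below defines a schema with simple and complex data types and attaches field-level metadata. It uses PySpark's type classes; the table and field names (order_id, customer_id, and so on) are made up for illustration.

    # A minimal sketch, assuming PySpark is installed; all names are illustrative.
    from pyspark.sql.types import (
        StructType, StructField, StringType, IntegerType,
        DateType, DoubleType, ArrayType,
    )

    # Schema: the blueprint for the data (fields, types, nullability constraints).
    order_schema = StructType([
        StructField("order_id", IntegerType(), nullable=False),
        StructField(
            "customer_id", StringType(), nullable=False,
            # Metadata: descriptive information attached to the field.
            metadata={"description": "Natural key from the CRM system", "owner": "sales"},
        ),
        StructField("order_date", DateType(), nullable=True),
        StructField("amount", DoubleType(), nullable=True),
        # Complex data type: an array of strings.
        StructField("tags", ArrayType(StringType()), nullable=True),
    ])

    print(order_schema.simpleString())           # compact view of the structure
    print(order_schema["customer_id"].metadata)  # metadata attached to one field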
Data formats
-
Structured
Highly organized and fits into a predefined model or schema. It typically includes data in tables with rows and columns, such as relational databases. Structured data is easily queryable and suitable for analysis using SQL or other structured query languages.
-
Semi-structured
Contains some structure but doesn't conform to a rigid schema. It might contain tags, metadata, or other markers that provide some level of organization. Examples include JSON, XML, and NoSQL databases. Although it doesn't adhere to a strict schema, it has some level of self-describing structure.
-
Unstructured
Lacks a predefined data model or organization. It is raw and doesn't fit neatly into databases or spreadsheets. Examples include text documents, images, videos, social media posts, emails, and audio files. Analyzing unstructured data often involves techniques like natural language processing or machine learning.
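The sketch below contrasts how the three formats are typically handled in Python; the inline CSV, JSON, and free-text samples are invented for illustration, and pandas is assumed to be installed.

    # A minimal sketch of the three data formats, using made-up inline examples.
    import io, json
    import pandas as pd

    # Structured: rows and columns that fit a fixed schema (e.g. CSV / a table).
    csv_text = "order_id,amount\n1,9.99\n2,24.50\n"
    table = pd.read_csv(io.StringIO(csv_text))

    # Semi-structured: self-describing keys/tags, but no rigid schema (e.g. JSON).
    doc = json.loads('{"order_id": 3, "items": [{"sku": "A1", "qty": 2}]}')

    # Unstructured: raw content with no data model (e.g. free text, images, audio).
    review_text = "Great product, arrived two days late but works fine."

    print(table.dtypes)              # pandas infers column types from the structure
    print(doc["items"][0]["sku"])    # navigate the self-describing keys
    print(len(review_text.split()))  # unstructured data needs NLP-style processing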
Data Warehouse
Definition
A data warehouse is a centralized repository that stores large volumes of structured and integrated data from multiple sources. It is designed to support business intelligence, reporting, and data analysis by providing a consolidated view of organizational data.
Bottom-Up Approach to Building a Data Warehouse
The Bottom-up approach is a method for constructing a data warehouse that starts with individual, department-specific data marts and integrates them into a comprehensive enterprise-wide data warehouse.
Definition
The Bottom-up approach begins by identifying and developing small, targeted data marts that serve the immediate needs of specific business units or departments. These data marts are built incrementally and can be integrated later into a larger, more comprehensive Enterprise Data Warehouse or EDW.
Instructions
-
Start Small
Begin by analyzing the data requirements of individual business units or departments.
-
Develop Targeted Marts:
Create Data Marts that are specific to a single area of the organization. This allows for quick delivery of insights and "quick wins" for specific teams.
-
Implement Incrementally:
Build and deploy these data marts one at a time, focusing on immediate needs.
-
Integrate Over Time:
As more data marts are created, they can be gradually integrated to form a larger, more holistic data warehouse. This approach allows for faster implementation and is more agile, as changes can be made to individual marts without disrupting the entire system.
Top-Down Approach to Building a Data Warehouse
The Top-down approach is a strategic methodology for building a data warehouse that starts with a comprehensive, enterprise-wide view and then develops smaller, integrated components.
Definition
The Top-down approach begins with a strategic and comprehensive methodology that considers the entire organization's data requirements. It involves building a centralized Enterprise Data Warehouse or EDW first, which then serves as the foundation for developing various data marts for different departments.
Instructions
-
Strategic Planning:
Begin with a strategic plan that considers the data needs of the entire organization.
-
Build the Core:
Design and build a centralized Enterprise Data Warehouse or EDW that provides a holistic view of the company's data.
-
Develop Marts:
Create data marts from the central EDW to serve the specific needs of different departments.
-
Ensure Consistency:
This approach ensures data consistency across the organization by centralizing the data model and integration processes. While it provides a comprehensive view, it can be more rigid and time-consuming to implement.
Data warehouse components
Operational Data Store or ODS
Designed to integrate and consolidate data from various operational systems in near real-time. ODS focuses on transactional data and serves as an interim repository for data before it gets loaded into the data warehouse.
Enterprise Data Warehouse or EDW:
A comprehensive and centralized repository that integrates data from various sources across an entire organization. It serves as the single source of truth for business intelligence and decision-making.
Data Mart
A smaller, more department-specific subset of an Enterprise Data Warehouse. It's designed to serve the needs of a particular business unit, department, or user group. Data marts are optimized for specific business functions or departments, allowing for faster query performance and analysis.
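As a rough illustration of the relationship between an EDW and a data mart, the sketch below derives a department-specific, pre-aggregated mart from a central sales table using pandas; the table and column names are hypothetical.

    # A minimal sketch, assuming an EDW sales table is already loaded as a DataFrame.
    import pandas as pd

    edw_sales = pd.DataFrame({
        "order_date": ["2024-01-05", "2024-01-06", "2024-01-06"],
        "department": ["electronics", "electronics", "furniture"],
        "region":     ["north", "south", "north"],
        "amount":     [120.0, 80.0, 300.0],
    })

    # Data mart: a department-specific subset of the EDW, pre-aggregated for
    # the questions that one business unit actually asks.
    electronics_mart = (
        edw_sales[edw_sales["department"] == "electronics"]
        .groupby(["order_date", "region"], as_index=False)["amount"].sum()
    )
    print(electronics_mart)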
Preparation of a data warehouse
-
Data Integration
Process of bringing together data from multiple sources and transforming it into a single, consistent format.
-
Data Modeling
Design the structure of the data warehouse. This includes defining the entities, attributes, and relationships between the entities.
-
Data Cleaning
Identify and correct errors and inconsistencies in the data. This is an important step in the data warehousing process, as it ensures that the data is accurate and reliable.
Storage and analysis of a data warehouse
-
Data Loading
Populate the data warehouse with data from the source systems. This can be done on a batch or real-time basis.
-
Metadata
Describe the structure of the data warehouse, the data types, and the relationships between the entities.
-
Data Access and Reporting
Data access and reporting tools allow users to query the data warehouse and generate reports. These tools are typically used by business analysts, data scientists, and other decision-makers.
Fact table
A fact table in a data warehouse data model consists of one or more numeric facts of importance to the business.
Remark
Facts must be consistent with the granularity.
Types of facts
-
Additive
Can be summed up through all of the dimensions in the fact table.
-
Semi-Additive
Can be summed up for some of the dimensions in the fact table, but not the others.
-
Non-Additive
Cannot be summed up for any of the dimensions present in the fact table.
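The sketch below illustrates the difference between additive and semi-additive facts with pandas; the sales and balance data are made up. A non-additive fact, such as a ratio or a unit price, cannot be meaningfully summed over any dimension at all.

    # A minimal sketch of additive vs. semi-additive facts; names are illustrative.
    import pandas as pd

    sales = pd.DataFrame({          # sales_amount is an additive fact
        "date":  ["2024-01-01", "2024-01-01", "2024-01-02"],
        "store": ["A", "B", "A"],
        "sales_amount": [100.0, 50.0, 70.0],
    })
    balances = pd.DataFrame({       # account balance is a semi-additive fact
        "date":    ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"],
        "account": ["X", "X", "Y", "Y"],
        "balance": [1000.0, 900.0, 200.0, 250.0],
    })

    # Additive: summing over every dimension (date and store) is meaningful.
    print(sales["sales_amount"].sum())

    # Semi-additive: summing balances across accounts on one date is fine,
    # but summing across dates is not; take the latest balance per account instead.
    print(balances[balances["date"] == "2024-01-02"]["balance"].sum())
    print(balances.sort_values("date").groupby("account")["balance"].last())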
Features of a fact table
-
Grain
The grain is the level of detail available in a given fact table, and therefore the level of detail provided by the star schema built on it. It is usually given as the number of records per key within the table.
-
Dimensions
In a data warehouse, dimensions represent the points of interest for the business and serve as entry points into the fact tables. Once a dimension is in your model, it can be reused multiple times with different fact tables.
-
Conformed dimensions
Consider a model containing multiple fact tables, representing different data marts. A dimension that is common to these fact tables can be shared (conformed) across them.
-
Factless fact table
A factless fact table captures a many-to-many relationship between dimensions but contains no numeric or text facts. Factless fact tables are often used to record events or coverage information.
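A small sketch of a factless fact table follows: it contains only dimension keys, so questions are answered by counting rows. The attendance example and its column names are hypothetical.

    # A minimal sketch of a factless fact table: dimension keys only, no measures.
    import pandas as pd

    attendance = pd.DataFrame({   # one row = "this student attended this course on this date"
        "date_key":    [20240105, 20240105, 20240106],
        "student_key": [1, 2, 1],
        "course_key":  [10, 10, 10],
    })

    # "How many attendances per course?" is simply a row count.
    print(attendance.groupby("course_key").size())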
Normalisation
Normalization is a process in database design that organizes data into
separate tables to minimize redundancy and improve data integrity.
The purpose of normalization is to ensure that each piece of data is stored in only one place, reducing the risk of inconsistencies.
Denormalisation
Denormalization involves combining related data from multiple tables
into a single table to improve query performance.
The purpose of denormalization is to speed up queries by reducing the need for joins.
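The sketch below shows both directions on a tiny, made-up orders table with pandas: normalising it into separate orders and customers tables, then denormalising by joining them back together.

    # A minimal sketch of normalisation vs. denormalisation; data is made up.
    import pandas as pd

    flat = pd.DataFrame({            # denormalised: customer data repeated per order
        "order_id":      [1, 2, 3],
        "customer_id":   ["C1", "C1", "C2"],
        "customer_name": ["Ada", "Ada", "Grace"],
        "amount":        [10.0, 20.0, 15.0],
    })

    # Normalise: each piece of data is stored once, in its own table.
    customers = flat[["customer_id", "customer_name"]].drop_duplicates()
    orders = flat[["order_id", "customer_id", "amount"]]

    # Denormalise: join the tables back together to avoid joins at query time.
    denormalised = orders.merge(customers, on="customer_id")
    print(denormalised)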
Star schema
A star schema is a data warehouse schema with a single fact table and many denormalized dimension tables.
The fact table contains foreign keys referencing the primary key of each dimension table, plus columns of additive, numeric facts.
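A minimal star-schema sketch using Python's built-in sqlite3 module is shown below; the fact and dimension tables (fact_sales, dim_date, dim_product) and their contents are invented for illustration.

    # A minimal star-schema sketch with sqlite3; all names and rows are made up.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
        -- Fact table: foreign keys to every dimension plus additive numeric facts.
        CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, amount REAL);
    """)
    conn.executemany("INSERT INTO dim_date VALUES (?,?,?)",
                     [(20240101, 2024, 1), (20240201, 2024, 2)])
    conn.executemany("INSERT INTO dim_product VALUES (?,?)",
                     [(1, "books"), (2, "games")])
    conn.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
                     [(20240101, 1, 9.5), (20240101, 2, 30.0), (20240201, 1, 12.0)])

    # A typical star join: the fact table joined directly to each dimension.
    rows = conn.execute("""
        SELECT d.month, p.category, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d    ON f.date_key = d.date_key
        JOIN dim_product p ON f.product_key = p.product_key
        GROUP BY d.month, p.category
    """).fetchall()
    print(rows)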
Grain of a star schema
In Data warehousing, grain refers to the level of detail available in a
given fact table as well as to the level of detail provided by a star
schema.
It is usually given as the number of records per key within the table.
In general, the grain of the fact table is the grain of the star schema.
Snowflake schema
A snowflake schema contains normalized dimension tables in a tree-like structure with many nesting levels.
A snowflake schema is easier to maintain, but queries require more joins because of the nested levels.
Data Warehouse Processing Architecture
-
Scalability
The solution should demonstrate its scalability by processing huge volumes of data and streaming them to different destinations, at high speed and in various formats. The data stream should be processed and presented in the required format, at the right time and location, with minimal impact on the existing infrastructure.
-
Extensibility
The architecture should be extensible: new functionality can be implemented in an existing service without regression or alteration. Newer technologies, such as artificial intelligence, can be added to an existing service by extending the service's APIs.
-
Security
Data security is a critical aspect of the data governance strategy. Data security controls at the source include establishing data access controls and data encryption. Data security controls at the perimeter include data security policies and monitoring access to the data.
-
Simplicity
The solution should be simple and straightforward, and users should be able to work with the data in an efficient and effective manner.
Data Lake
A data lake is a central repository that stores large volumes of raw and
unprocessed data from various sources. It typically uses a flat
architecture, storing data in its native format, such as CSV, JSON, or
Parquet.
Data lakes offer scalability and flexibility, allowing organizations to
collect and store vast amounts of data for future processing and
analysis.
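The sketch below imitates the "flat architecture, native format" idea with nothing but the Python standard library: raw JSON and CSV files are landed unchanged in a partitioned folder layout, and any schema is applied later at read time. The paths and file names are hypothetical.

    # A minimal sketch of landing raw data in its native format in a data lake layout.
    import json
    from pathlib import Path

    lake_root = Path("datalake/raw")

    # Land the data exactly as it arrives, partitioned by source and ingest date.
    events_dir = lake_root / "clickstream" / "ingest_date=2024-01-05"
    events_dir.mkdir(parents=True, exist_ok=True)
    raw_event = {"user": "u1", "page": "/home", "ts": "2024-01-05T10:00:00Z"}
    (events_dir / "events.json").write_text(json.dumps(raw_event))

    # CSV from another source goes in untouched as well; structure is applied
    # later, at read time ("schema on read").
    orders_dir = lake_root / "orders" / "ingest_date=2024-01-05"
    orders_dir.mkdir(parents=True, exist_ok=True)
    (orders_dir / "orders.csv").write_text("order_id,amount\n1,9.99\n")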
Data Lake use cases
-
Data exploration and analytics
Data scientists and analysts can explore and analyze diverse datasets within the data lake. This supports ad-hoc queries, data discovery, and the extraction of valuable insights.
-
Big Data processing
Well-suited for processing big data workloads, including large-scale batch processing and parallel data processing. This is essential for applications in industries like finance, healthcare, and e-commerce.
-
Machine Learning and AI
Provide a rich source of data for training machine learning models and implementing artificial intelligence applications. The diverse and voluminous data in the lake can be used to improve model accuracy and robustness.
-
Log and Event data
Store logs, events, and other streaming data. This is crucial for monitoring system performance, tracking user activities, and troubleshooting issues.
-
IoT Storage
Internet of Things or IoT devices generate vast amounts of data. Data lakes are ideal for storing and analyzing this data to gain insights into device behavior, predict maintenance needs, and optimize operations.
Data Lakehouse
A data lakehouse, also known as a unified analytics platform, is an evolution of the data lake concept.
It combines the strengths of a data lake with the capabilities of a data warehouse.
A data lakehouse provides a unified and structured view of data by introducing schema enforcement and indexing on top of the raw data stored in the data lake.
It enables organizations to perform both batch and real-time analytics on the same platform, offering a more streamlined and efficient data processing and analysis experience.
Data Lakehouse use cases
-
Analyze data in near real-time
Can accommodate streaming data alongside batch processing. This is beneficial for applications such as fraud detection, IoT analytics, and monitoring.
-
Unified analytics
Allow organizations to perform unified analytics by integrating structured and unstructured data in a single platform. This supports comprehensive business intelligence and analytics, providing a holistic view of the data.
-
Operational Analytics
Can support operational analytics by integrating data from sensors, machines, and production systems. This facilitates real-time monitoring, predictive maintenance, and process optimization.
-
Advanced analytics and ML
Enable organizations to apply advanced analytics and machine learning algorithms to their data. The unified architecture simplifies data preparation and model training by leveraging both structured and unstructured data.
-
360 degree view
Can consolidate data from various touchpoints, providing a 360-degree view of the customer. This supports personalized marketing, customer service, and experience optimization.
Delta Lake features
-
ACID Transactions
Delta Lake provides ACID (Atomicity, Consistency, Isolation, and Durability) transactions, which ensure that data updates are always consistent and complete. This makes it well-suited for critical workloads where data integrity is paramount.
-
Scalable Metadata
Delta Lake's metadata is designed to scale to petabyte-scale tables with billions of partitions and files. It uses a tiered storage approach to efficiently store and manage metadata across multiple levels.
-
Time Travel
Delta Lake enables time travel, allowing users to query and analyze historical versions of data as they evolve over time. This is particularly useful for auditing purposes, identifying data lineage, and performing rollbacks.
-
Unified Batch and Streaming
Delta Lake supports both batch and streaming workloads, providing a unified table format for both types of data processing. This eliminates the need to manage separate tables for batch and streaming data, simplifying data management and processing.
-
Schema Evolution and Enforcement
Delta Lake supports schema evolution, allowing users to add or modify table schema without disrupting existing data. It also provides schema enforcement, ensuring that data always adheres to the defined schema.
-
Audit History
Delta Lake maintains a detailed audit history of all data changes, providing a comprehensive record of data lineage and modifications. This is valuable for auditing purposes and identifying data quality issues.
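As a rough illustration of a few of the Delta Lake features above (ACID commits, time travel, audit history, schema enforcement), the sketch below uses the open-source deltalake Python package (delta-rs); it assumes that package is installed and that the local ./sales_delta path is writable, and the data is made up.

    # A minimal sketch with the `deltalake` package; not a production setup.
    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    v0 = pd.DataFrame({"order_id": [1, 2], "amount": [9.5, 30.0]})
    write_deltalake("./sales_delta", v0)                 # version 0 (an ACID commit)

    v1 = pd.DataFrame({"order_id": [3], "amount": [12.0]})
    write_deltalake("./sales_delta", v1, mode="append")  # version 1 (another commit)

    # Time travel: read the table as it was at an earlier version.
    print(DeltaTable("./sales_delta", version=0).to_pandas())

    # Audit history: every commit is recorded in the transaction log.
    print(DeltaTable("./sales_delta").history())

    # Schema enforcement: appending data that does not match the table schema
    # is rejected instead of silently corrupting the table.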
Data Visualisation
Data visualization is the representation of information and data using
charts, graphs, maps, and other visual tools.
These visualizations allow us to easily spot patterns, trends, and outliers in a data set. Data visualization also presents data in an accessible manner to the general public or to specific audiences without technical knowledge.
-
Storytelling
People are drawn to colors and patterns in clothing, arts and culture, architecture, and more. Data is no different—colors and patterns allow us to visualize the story within the data.
-
Accessibility
Information is shared in an accessible, easy-to-understand manner for a variety of audiences.
-
Visualize relationships
It's easier to spot the relationships and patterns within a data set when the information is presented in a graph or chart.
-
Exploration
More accessible data means more opportunities to explore, collaborate, and inform actionable decisions.
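A minimal example with matplotlib (assumed installed) is shown below; the monthly sales figures are invented, but even this small line chart makes the underlying trend easier to see than the raw numbers.

    # A minimal sketch of a line chart with matplotlib; the numbers are made up.
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    sales = [120, 135, 128, 170]

    plt.plot(months, sales, marker="o")   # a simple line chart makes the trend visible
    plt.title("Monthly sales")
    plt.xlabel("Month")
    plt.ylabel("Sales (units)")
    plt.show()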
Data transformations
-
Filtering
Selecting a subset of rows or columns based on specified criteria.
-
Sorting
Arranging data in a specific order, often based on one or more columns.
-
Aggregation
Combining multiple rows into a single summary row, usually involving mathematical operations such as sum, average, or count.
-
Joining
Combining data from multiple sources based on a common key or attribute.
-
Mapping
Replacing values in a column with corresponding values from a lookup table.
-
Derivation
Creating new columns or calculations based on existing data.
-
Normalization
Ensuring data consistency by organizing it into a standardized format, often by breaking data down into smaller, more manageable tables.
-
Cleaning
Handling missing or incorrect data, dealing with duplicates, and ensuring data quality.
-
Validation
Checking data integrity and accuracy to ensure it meets predefined standards.
-
Data Type Conversion
Converting data from one type to another, such as changing a string to a date or a number.
-
Splitting
Breaking down a column into multiple columns, usually when dealing with composite data.
-
Concatenation
Combining data from multiple columns into a single column.
-
Duplication Removal
Eliminating duplicate records from the dataset.
-
Pivoting or Unpivoting
Transforming data from a wide format to a long format (or vice versa)
-
Standardization
Ensuring consistency in units, formats, or naming conventions across the dataset.
-
Enrichment
Adding additional information to the data from external sources.
-
Masking or Anonymization
Protecting sensitive information by replacing, encrypting, or anonymizing certain data.
-
Splitting by Business Logic
Dividing data into subsets based on business rules or criteria.
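The sketch below chains several of the transformations listed above (duplication removal, cleaning, type conversion, joining, enrichment, derivation, filtering, aggregation, sorting) with pandas; the data, column names, and the 0.92 exchange rate are all made up.

    # A minimal sketch of a few common transformations with pandas.
    import pandas as pd

    orders = pd.DataFrame({
        "order_id":   [1, 2, 2, 3],
        "customer":   ["C1", "C2", "C2", "C1"],
        "amount":     ["10.0", "20.5", "20.5", None],   # strings plus a missing value
        "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
    })
    customers = pd.DataFrame({"customer": ["C1", "C2"], "segment": ["retail", "b2b"]})

    cleaned = (
        orders
        .drop_duplicates()                              # duplication removal
        .dropna(subset=["amount"])                      # cleaning: drop missing amounts
        .astype({"amount": "float64"})                  # data type conversion
        .assign(order_date=lambda d: pd.to_datetime(d["order_date"]))
    )
    enriched = cleaned.merge(customers, on="customer")  # joining / enrichment
    enriched["amount_eur"] = enriched["amount"] * 0.92  # derivation (assumed FX rate)

    summary = (enriched[enriched["amount"] > 5]         # filtering
               .groupby("segment", as_index=False)["amount"].sum()   # aggregation
               .sort_values("amount", ascending=False))              # sorting
    print(summary)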
Aggregations
Aggregations in the context of databases and data analysis involve the grouping and summarization of data. Here is a list of common aggregation functions used in SQL and other data analysis tools:
-
Count:
Counts the number of rows in a result set
-
Min:
Finds the minimum value in a column
-
Max:
Finds the maximum value in a column
-
Sum:
Calculates the sum of values in a numeric column
-
Avg:
Computes the average value of a numeric column
-
Count Distinct:
Counts the number of distinct (non-duplicate) values in a result set
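The pandas sketch below computes the same aggregations over a made-up sales table, grouped by region; the column names are illustrative.

    # A minimal sketch of the aggregation functions above, using pandas.
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["north", "north", "south", "south", "south"],
        "product": ["A", "B", "A", "A", "B"],
        "amount":  [10.0, 20.0, 5.0, 15.0, 25.0],
    })

    summary = sales.groupby("region").agg(
        order_count=("amount", "count"),            # Count
        min_amount=("amount", "min"),               # Min
        max_amount=("amount", "max"),               # Max
        total_amount=("amount", "sum"),             # Sum
        avg_amount=("amount", "mean"),              # Avg
        distinct_products=("product", "nunique"),   # Count Distinct
    )
    print(summary)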
Slowly Changing Dimension
In the context of data warehousing and database design, Slowly Changing
Dimensions (SCD) refer to the way in which historical data is managed
when there are changes to the dimension attributes over time.
There are several types of Slowly Changing Dimensions, commonly denoted
as SCD Type 1, SCD Type 2, and SCD Type 3.
SCD Type 1
In SCD Type 1, when a change occurs, the existing dimension record is simply updated with the new data. There is no tracking of historical values, so only the current information is retained.
SCD Type 2
In SCD Type 2, a new record is added to the dimension table for each change. The original record is marked as inactive, and the new record contains the updated information. This way, a history of changes is preserved.
SCD Type 3
In SCD Type 3, a limited amount of historical data is maintained. Instead of creating a new record for each change, only certain attributes are updated, and additional columns are added to the table to capture this limited history.
SCD Type 4
SCD Type 4 is a hybrid approach that combines elements of both SCD Type 1 and SCD Type 2. It typically involves creating a separate table to store historical changes while maintaining a current record in the main dimension table.
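The sketch below contrasts SCD Type 1 and SCD Type 2 on a tiny, hypothetical customer dimension using pandas: Type 1 overwrites the attribute in place, while Type 2 expires the old row and adds a new current row with validity dates.

    # A minimal sketch of SCD Type 1 vs. Type 2; data and column names are made up.
    import pandas as pd

    dim_customer = pd.DataFrame({
        "customer_key": [1],
        "customer_id":  ["C1"],
        "city":         ["Oslo"],
        "valid_from":   [pd.Timestamp("2023-01-01")],
        "valid_to":     [pd.NaT],     # NaT marks the currently active row
        "is_current":   [True],
    })

    # Type 1: overwrite in place, no history kept.
    scd1 = dim_customer.copy()
    scd1.loc[scd1["customer_id"] == "C1", "city"] = "Bergen"

    # Type 2: expire the old row and add a new current row.
    change_date = pd.Timestamp("2024-06-01")
    scd2 = dim_customer.copy()
    mask = scd2["customer_id"] == "C1"
    scd2.loc[mask, "valid_to"] = change_date
    scd2.loc[mask, "is_current"] = False
    new_row = pd.DataFrame({
        "customer_key": [2], "customer_id": ["C1"], "city": ["Bergen"],
        "valid_from": [change_date], "valid_to": [pd.NaT], "is_current": [True],
    })
    scd2 = pd.concat([scd2, new_row], ignore_index=True)
    print(scd2)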
ELT
ELT (Extract, Load, Transform) is a variation of Extract, Transform, Load (ETL), a data integration process in which transformation takes place on an intermediate server before the data is loaded into the target. In contrast, ELT allows raw data to be loaded directly into the target and transformed there.
With an ELT approach, a data extraction tool is used to obtain data from
a source or sources, and the extracted data is stored in a staging area
or database. Any required business rules and data integrity checks can
be run on the data in the staging area before it is loaded into the data
warehouse. All data transformations occur in the data warehouse after
the data is loaded.
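The sketch below walks through that flow with Python's built-in sqlite3 module standing in for a cloud warehouse: raw rows are extracted and loaded into a staging table untouched, and the transformation runs afterwards inside the target with SQL. All table and column names are hypothetical.

    # A minimal ELT sketch: load raw data first, transform inside the target.
    import sqlite3
    import pandas as pd

    warehouse = sqlite3.connect(":memory:")

    # Extract + Load: raw, untransformed rows land in a staging table as-is.
    raw_orders = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount":   ["10.0", "20.5", "bad"],     # still untyped / unvalidated
        "country":  ["no", "SE", "no"],
    })
    raw_orders.to_sql("stg_orders", warehouse, index=False)

    # Transform: runs inside the warehouse, only for the analysis needed now.
    warehouse.executescript("""
        CREATE TABLE orders AS
        SELECT order_id,
               CAST(amount AS REAL) AS amount,
               UPPER(country)       AS country
        FROM stg_orders
        WHERE amount GLOB '[0-9]*';              -- basic data integrity check
    """)
    print(warehouse.execute("SELECT * FROM orders").fetchall())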
ELT Process
-
A more recent technology made possible by high-speed, cloud-based servers:
ELT is a relatively new technology made possible because of modern, cloud-based server technologies. Cloud-based data warehouses offer near-endless storage capabilities and scalable processing power. For example, platforms like Amazon Redshift and Google BigQuery make ELT pipelines possible because of their processing capabilities.
-
Ingest anything and everything as the data becomes available:
ELT paired with a data lake lets you immediately ingest an ever-expanding pool of raw data as it becomes available. There's no requirement to transform the data into a special format before saving it in the data lake.
-
Transforms only the data you need:
ELT transforms only the data required for a particular analysis. Although it can slow down the process of analyzing the data, it offers more flexibility because you can transform the data in different ways on the fly to produce different types of metrics, forecasts, and reports. Conversely, with ETL, the entire ETL pipeline and the structure of the data in the OLAP warehouse may require modification if the previously decided upon structure doesn't allow for a new type of analysis.
-
ELT has more specific use cases than ETL:
It's important to note that the tools and systems of ELT are still evolving, so they're not as reliable as ETL paired with an OLAP database. Although it takes more effort to set up, ETL provides more accurate insights when dealing with massive pools of data. Also, developers who know how to use ELT technology are generally more difficult to find than ETL developers.