In this post, I have explained what are Slowly Changing Dimensions (SCD) and how you can implement it in your projects in the efficient manner.
What is Slowly Changing Dimension ?
A Dimension table stores attributes and descriptive details about the measures calculated in the Fact tables. With time it is expected that some changes will happen to these attributes stored in the Dimension tables. Hence the word "Changing".
However those changes may not be very frequent. Thereby adding the word "Slowly".
The Dimension table which is expecting some changes over time is called Slowly Changing Dimension.
Slowly Changing Dimension Types
There are 3 standard type of Slowly Changing Dimension tables.
- SCD-1: It overwrite the existing data with current information. So no history is maintained. One row is available at any time for the individual entities.
- SCD-2: It enters new row when ever a new information arrives for existing entity. It maintains all the history of changes in the dimension table. This results in table size growing rapidly.
- SCD-3: It updates new value for selective columns in the existing rows. It maintains current and previous (or initial) value for some identified columns. Partial history is maintained for the dimension table.
Now there are some hybrid SCD types as well. Like SCD-4 which is a combination of SCD-1 and SCD-2. In this you maintain 2 Dimension tables. The first dimension table holds the actual records corresponding to SCD-1 type. An additional table is created typically called as history dimension table which works as SCD-2 dimension table.
SCD-0 type are those tables for which we do not expect any changes to come. In some cases , we even ignore any changes coming to SCD-0 table.Common example is Date Dimension.
Slowly Changing Dimension – SQLs
We will consider employee table and we will see what happens when some changes are expected in the data that already resides in the table. We will also see how new records are handled.
Structure of Employee table:
create table dim_employee_stg (emp_id varchar(20) NOT NULL, -- emp_id can be alphanumeric else we can create it as numeric alsoemp_name varchar(100) NOT NULL,emp_dob date NOT NULL,emp_dept_id integer,emp_city varchar(50),emp_joining_date date NOT NULL,emp_email varchar(100) NOT NULL,emp_contact varchar(10) -- 10 digit mobile number. Do not create it as Integer as it may be more than permissible range for INT.);
We will refer to above mentioned table as staging table i.e. _stg table
SCD – Table Design consideration:
When building SCD table , we should consider adding following 3 technical columns in addition to existing business columns. This should be added to almost all the dimension tables in most of the cases.
- Surrogate Key – It is good practice to have a unique column in the dimension table. The unique column is generally sourced to FACT tables to establish primary – foreign key relationship. This is generally the auto-increment/identity column in the table and is selected as PRIMARY Key as well.
- Name: emp_sk
- Record Insert Timestamp – This technical column helps in determining when actually the data is loaded into the table. It is typically loaded with default value of current timestamp.
- Name: record_insert_ts
- Record Update Timestamp – This technical column helps in determining when actually the data is updated in the table. It is typically loaded with default value of current timestamp when we run any UPDATE statement on the table data.
- Name: record_update_ts
- Also there are 3 ways to manage it when record is loaded for the first time.
- NULL : Set Update timestamp as NULL when you are loading record for the first time. Since no UPDATE has happened the value is set to NULL. This may require additional check to handle NULL values while using in Filter condition. Also in table design we may want to keep this column as NOT NULL so this option may not work for you. Example: Default value NULL
- Same as INSERT timestamp : We can initialise this column value as same as INSERT timestamp. With this approach we can easily check if INSERT & UPDATE timestamp both are same it means no UPDATE has come for this record.. Example: Default value CURRENT_TIMESTAMP
- Set it to FUTURE timestamp: We can initialise this column value as a very distant future timestamp. With this approach we can easily check if UPDATE timestamp is a static future date it means no UPDATE has come for this record. This is preferred method and I have seen and used this in most of my projects. Example: Default value 2099-12-31 00:00:00
- Batch ID: Additionally, you may want to add one more technical column to support data lineage. This is batch id coming from your scheduler.
- Name: batch_id
- Other technical columns could be : Source Info , ETL user info. You may add more columns if required as per the project.
However above mentioned technical columns should be sufficient to cover most the project scenarios.
Now let us see table design considerations specific to SCD Type tables. Following updates are expected in the table data:
- emp_dept_id: Employee may change the department so this value is expected to change.
- emp_city: Employee may change the office location when joining the new department.
- emp_contact: Employee may change the contact number when moving from one city to another.
SCD – Type 1 Table Design consideration:
We will add emp_sk , record_insert_ts and record_update_ts for SCD-1 table. This should be sufficient as per our requirement. So the SCD-1 table definition should be:
create table dim_employee_scd1(emp_sk bigint generated always as identity NOT NULL PRIMARY KEY, -- added SK keyemp_id varchar(20) NOT NULL, -- can add UNIQUE constraint to this columnemp_name varchar(100) NOT NULL,emp_dob date NOT NULL,emp_dept_id integer,emp_city varchar(50),emp_joining_date date NOT NULL,emp_email varchar(100) NOT NULL,emp_contact varchar(10),record_insert_ts timestamp default current_timestamp,record_update_ts timestamp default '2099-12-31 00:00:00');
There will be two queries executed for incoming data. One for the UPDATE of existing records and other for the INSERT of new records into the table.
SQL to load data into SCD-1 table.
updatedim_employee_scd1 tgtsetemp_dept_id = stg.emp_dept_id,emp_city = stg.emp_city,emp_contact = stg.emp_contact,record_update_ts = current_timestampfromdim_employee_stg stgwheretgt.emp_id = stg.emp_id -- to run only for matching existing records.;INSERT into dim_employee_scd1 (emp_id,emp_name,emp_dob,emp_dept_id,emp_city,emp_joining_date,emp_email,emp_contact)select stg.emp_id, stg.emp_name, stg.emp_dob, stg.emp_dept_id, stg.emp_city, stg.emp_joining_date, stg.emp_email, stg.emp_contactfrom dim_employee_stg stgleft outer joindim_employee_scd1 tgtonstg.emp_id = tgt.emp_idwheretgt.emp_id is null; -- to get only the new records that does not exists in the Target table already.
SCD – Type 2 Table Design consideration:
We will add emp_sk , record_insert_ts , record_update_ts for SCD-2 table. This should be sufficient as per our requirement. So the SCD-2 table definition should be similar to SCD-1 table. However the loading logic will change.
create table dim_employee_scd2(emp_sk bigint generated always as identity NOT NULL PRIMARY KEY, -- added SK keyemp_id varchar(20) NOT NULL, -- can add UNIQUE constraint to this columnemp_name varchar(100) NOT NULL,emp_dob date NOT NULL,emp_dept_id integer,emp_city varchar(50),emp_joining_date date NOT NULL,emp_email varchar(100) NOT NULL,emp_contact varchar(10),record_insert_ts timestamp default current_timestamp,record_update_ts timestamp default '2099-12-31 00:00:00');
There will be two queries executed for incoming data. One for the UPDATE of existing records and other for the INSERT of new records into the table.
In SCD-2 , the record with record_update_ts as '2099-12-31 00:00:00' is considered as active record because no update has come for this one.
It is still applicable as-is. Once the update come for any record we “close” the record by updating the value for this column by current timestamp. In some cases depending on business requirement, you may want to close the record from yesterday so use previous day while running update statement.
SQL to load data into SCD-2 table.
updatedim_employee_scd2 tgtsetrecord_update_ts = current_timestamp -- in case of daily job you may want to set it to yesterday date as well.fromdim_employee_stg stgwheretgt.emp_id = stg.emp_id -- to run only for matching existing records.and record_update_ts = '2099-12-31 00:00:00' -- to close only the active record and not update all the existing history of rows.;INSERT into dim_employee_scd2 (emp_id,emp_name,emp_dob,emp_dept_id,emp_city,emp_joining_date,emp_email,emp_contact)select stg.emp_id, stg.emp_name, stg.emp_dob, stg.emp_dept_id, stg.emp_city, stg.emp_joining_date, stg.emp_email, stg.emp_contactfrom dim_employee_stg stg ; -- no need to check for DELTA records. Load everything coming from Source.
The latest record will have record_update_ts as '2099-12-31 00:00:00'. You should always use this filter while reading from SCD2 table for active records.
SCD – Type 3 Table Design consideration:
We will add emp_sk , record_insert_ts , record_update_ts for SCD-3 table. Additionally we will add previous_value column for columns for which we are expecting change.
So we will add 3 more columns:
- prev_emp_dept_id
- prev_emp_city
- prev_emp_contact
The previous value will be same as current value for new records.
This should be sufficient as per our requirement. So the table definition for SCD-3 will look like.
create table dim_employee_scd3(emp_sk bigint generated always as identity NOT NULL PRIMARY KEY, -- added SK keyemp_id varchar(20) NOT NULL, -- can add UNIQUE constraint to this columnemp_name varchar(100) NOT NULL,emp_dob date NOT NULL,emp_dept_id integer,prev_emp_dept_id integer,emp_city varchar(50),prev_emp_city varchar(50),emp_joining_date date NOT NULL,emp_email varchar(100) NOT NULL,emp_contact varchar(10),prev_emp_contact varchar(10),record_insert_ts timestamp default current_timestamp,record_update_ts timestamp default '2099-12-31 00:00:00');
There will be two queries executed for incoming data. One for the UPDATE of existing records and other for the INSERT of new records into the table.
In SCD-3 , we will run update for the existing records coming from source. The previous value is updated by current value. The current value is updated by new value coming from Source.
SQL to load data into SCD-3 table.
updatedim_employee_scd3 tgtsetprev_emp_dept_id = tgt.emp_dept_id, -- you can add case statament to check if value for this column has changed. or else just overwrite everytime with the same value.emp_dept_id = stg.emp_dept_id,prev_emp_city = tgt.emp_city,emp_city = stg.emp_city,prev_emp_contact = tgt.emp_contact,emp_contact = stg.emp_contact,record_update_ts = current_timestamp -- in case of daily job you may want to set it to yesterday date as well.fromdim_employee_stg stgwheretgt.emp_id = stg.emp_id -- to run only for matching existing records;INSERT into dim_employee_scd3 (emp_id,emp_name,emp_dob,emp_dept_id,prev_emp_dept_id,emp_city,prev_emp_city,emp_joining_date,emp_email,emp_contact,prev_emp_contact)select stg.emp_id, stg.emp_name, stg.emp_dob, stg.emp_dept_id, stg.emp_dept_id, stg.emp_city, stg.emp_city, stg.emp_joining_date, stg.emp_email, stg.emp_contact, stg.emp_contactfrom dim_employee_stg stg left outer joindim_employee_scd3 tgtonstg.emp_id = tgt.emp_idwheretgt.emp_id is null; -- to get only the new records that does not exists in the Target table already. Also load previous value with current value.
The previous value will have same value as current value for new records.
Testing the Queries for sample data
Day 1 Data:
insert into dim_employee_stg values ('P101','Mark','1985-02-09',20,'Amsterdam','2022-05-14','mark@xyzcompany.com','9876543210');insert into dim_employee_stg values ('P102','Jerry','1989-05-13',20,'Amsterdam','2022-03-24','jerry@xyzcompany.com','9988776655');insert into dim_employee_stg values ('P103','Sophia','1982-08-18',20,'Amsterdam','2022-02-10','sophia@xyzcompany.com','9999022022');insert into dim_employee_stg values ('P104','Bernard','1983-11-22',20,'Amsterdam','2022-05-14','bernard@xyzcompany.com','9898987654');insert into dim_employee_stg values ('P105','Amanda','1988-12-26',20,'Amsterdam','2022-05-14','amanda@xyzcompany.com','9999988888');
Staging Table:
SCD-1: It will be insert only for Day 1. Since no records exists in the table to update.
SCD-2: It will be insert only for Day 1. Since no records exists in the table to update.
SCD-3: It will be insert only for Day 1. Since no records exists in the table to update. Also the previous value will equal to current value for identified columns.
Day 2 Data
Staging Table
--day2:delete from dim_employee_stg;insert into dim_employee_stg values ('P101','Mark','1985-02-09',25,'Paris','2022-05-14','mark@xyzcompany.com','9876543210');insert into dim_employee_stg values ('P103','Sophia','1982-08-18',28,'Berlin','2022-02-10','sophia@xyzcompany.com','9999022022');insert into dim_employee_stg values ('P106','Marie','1993-05-26',20,'Amsterdam','2022-06-24','marie@xyzcompany.com','9988776543');
SCD-1 : The old value for 2 employee record is updated with new value. The value city & emp_dept_id is updated.
SCD-2 : The old row is closed by updating record_update_ts. New row is added for existing and new records. You can also call update_ts as end_ts depending on business requirement.
SCD-3 : The previous value is now changed and you can see the current & previous value is not same for 2 rows for which update has come.
Day 3 Data
Staging Table
--day3:delete from dim_employee_stg;insert into dim_employee_stg values ('P101','Mark','1985-02-09',30,'London','2022-05-14','mark@xyzcompany.com','9999888877');insert into dim_employee_stg values ('P107','Jamie','1994-08-12',20,'Amsterdam','2022-06-24','jamie@xyzcompany.com','9988776522');
SCD-1 : The old value for 1 employee record is updated with new value. The value city & emp_dept_id is updated.
SCD-2 : The old row is closed by updating record_update_ts. New row is added for existing and new records.
SCD-3 : The previous value is now changed and you can see the current & previous value is not same for 1 row for which update has come.
So In this post, we saw definition of SCD and how do we implement SCD 1, SCD 2, SCD 3 using SQL.
Note: I have executed these queries in Postgresql 14. You may have to change some query syntax depending on the RDBMS you are using.