data modeling
byMadison Schott, Analytics Engineer and Blogger
Slowly changing dimensions are a key aspect of database design that directly affects how an analytics team can operate. They determine what can and cannot be measured over time. I’ve experienced first hand how choosing the wrong slowly changing dimension can impact a business.
At one of my previous companies, we discovered months into our data stack rebuild that historical records weren’t being properly tracked. We had no way to see how products were being changed over time! This ended up creating a huge gap in metrics that we were trying to report on. By reading this article and understanding the difference in the types of slowly changing dimensions, you will be able to avoid a costly mistake like this one.
What are slowly changing dimensions?
Slowly changing dimensions refer to how data in your data warehouse changes over time. Slowly changing dimensions have the same natural key but other data columns that may or may not change over time depending on the type of dimensions that it is.
Slowly changing dimensions are important in data analytics to track how a record is changing over time. The way the database is designed directly reflects whether historical attributes can be tracked or not, determining different metrics available for the business to use.
For example, if data is constantly being overwritten for one natural key, the business will never be able to see how changes in that row’s attributes affect key performance indicators. If a company continually iterates on a product and its different features, but doesn’t track how those features have changed, it will have no idea how customer retention, revenue, customer acquisition cost, or other marketing analytics were directly impacted by those changes.
Types of slowly changing dimensions
Type 0
Type 0 refers to dimensions that never change. You can think of these as mapping tables in your data warehouse that will always remain the same, such as states, zipcodes, and county codes. Date_dim tables that you may use to simplify joins are also considers type 0 dimensions. In addition to mapping tables, other pieces of data like social security number and date of birth are considered type 0 dimensions.
Type 1
Type 1 refers to data that is overwritten by new data without keeping a historical record of that old piece of data. With this type, there is no way to keep track of changes over time. I’ve seen many companies use this type of dimension accidentally, not realizing that they can never get the old values back. When implementing this dimension, make sure you do not need to track the trends in that data column over time.
A good example of this is customer addresses. You don’t need to keep track of how a customer’s address has changed over time, you just need to know you are sending an order to the right place.
Type 2
Type 2 dimensions are always created as a new record. If a detail in the data changes, a new row will be added to the table with a new primary key. However, the natural key would remain the same in order to map a record change to one another. Type 2 dimensions are the most common approach to tracking historical records.
There are a few different ways you can handle type 2 dimensions from an analytics perspective. The first is by adding a flag column to show which record is currently active. This is the approach Fivetran takes with data tables that have CDC implemented. Instead of deleting any historic records, they will add a new one with the _FIVETRAN_DELETED column set to FALSE. The old record will then be set to TRUE for this _FIVETRAN_DELETED column. Now, when querying this data, you can use this column to filter for records that are active while still being able to get historical records if needed.
You can also handle type 2 dimensions by adding a timestamp column or two to show when a new record was created or made active and when it was made ineffective. Instead of checking for whether a record is active or not, you can find the most recent timestamp and assume that is the active data row. You can then piece together the timestamps to get a full picture of how a row has changed over time.
Here's how the data might look in tabular format for the `DimProductType` table:
I’ve seen this type of dimension used often with ever-changing product packs. A company may offer a bundle of products on their website for a discounted price. However, sometimes a certain sku of a product is sold out or unavailable, and they have to adjust what is in that bundle. They can add or remove products by adding a timestamp or flag column denoting which products are in the pack and when. This way, the business can look at how a change in products in the pack have affected sales.
Type 3
Type 3 dimensions track changes in a row by adding a new column. Instead of adding a new row with a new primary key like with type 2 dimensions, the primary key remains the same and an additional column is appended. This is good if you need your primary key ro remain unique and only have one record for each natural key. However, you can really only track one change in a record rather than multiple changes over time. Think of this as a dimension you’d want to use for one-time changes.
For example, let’s say your warehouse location is changing. Because you don’t expect the address of your warehouse to change more than once, you add a `current_address` column with the address of your new warehouse. You then change the original address column name to be `previous_address` and store your old address information.
Type 4
Type 4 dimensions exist as records in two different tables- a current record table and a historical record table. All of the records that are active in a given moment will be in one table and then all of the records considered historical will exist in a separate history table. This is a great way of keeping track of records that have many changes over time.
This is actually the dimension I am most familiar with because of how data was tracked at one of my previous companies. Because order information was constantly changing, we used history tables to track these changing order details. This was particularly helpful for keeping track of what was in a user’s cart at any given moment. The history tables allowed us to see what customers added or removed from their order and then compare it to the actual order that they placed.
How to implement slowly changing dimensions in a data warehouse
It is best to consider slowly changing dimensions from the very beginning of creating your database. However, many of us do not have the luxury of redesigning a database from scratch or being involved in this process from the very beginning, unless you are working for a startup who is just beginning to build out its data infrastructure.
I recommend starting by assessing what currently exists in your database. Which dimensions can you find? Be sure to document the different types and how they relate to one another. If you have dimensions that are type 0 and shouldn’t be, start with those. You want to make sure you are adding historical tracking as soon as possible if it is needed.
Determining where type 2, 3, or 4 is really a matter of preference and what makes sense to your business. Would you rather use a flag column or a timestamp column? How often will these dimensions be changing? Does it make more sense to add a historical table that can be used to store a lot of records? This is a process that should involve data engineers, analytics engineers, and data analysts.
Keep in mind that you will have to consider how you want to handle previous records that haven’t been tracked over time. You may decide to forgo all of this historical data and build with the future in mind, or you may want to try to piece together old records and create your own version of a flag or timestamp column.
We tried doing this at one of my previous companies when we realized we weren’t tracking changes in products over time. It was pretty messy, I won’t lie. It involved parsing through Slack messages, Excel spreadsheets, and asking for time estimations from stakeholders. While the data wasn’t entirely accurate, it was able to give us a rough estimate of certain changes that we could use in analysis. However, we had to take these results with a grain of salt due to the lack of accuracy.
Techniques for maintaining slowly changing dimensions
Maintaining slowly changing dimensions can be simple as long as you keep the right things in mind from the start of any new data collection process. Whenever creating a new table, or ingesting a new data source, think about how historical records will be tracked over time. Does the way the data is being ingested consider historical tracking? What happens if a record is updated? Is it being overwritten or is a new record being created in its place?
Here are some specific things to keep in mind:
ETL (Extract, Transform, Load) process for SCD maintenance
ETL extracts data in bulk in order to move and process changes in data. Because of this, it tends to be slower at capturing changes. ETL is most commonly associated with type 3 and type 4 changing dimensions since it is adding new rows of data rather than comparing to existing rows.
While ETL can still be effective at capturing changes, it often requires more processing power and greater storage due to the excess records being created. However, it can be helpful for creating a greater audit trail through multiple records rather than having one record that is continually updated.
Change Data Capture (CDC) for SCD maintenance
CDC is a set of design patterns for keeping track of historical changes in your data. This process occurs in real-time, ensuring records are updated as they are ingested. CDC is critical for handling ever-changing data and ensures data quality is high.
You always want to make sure you have this enabled on your database if using type 2 changing dimensions within your data warehouse. To use CDC, make sure you enable it on the database level and then at the table level. Only tables using type 2 changing dimensions need this enabled.
Ingestion tools like Fivetran require you to have this enabled on your tables due to its real-time batch processing. Their connectors are optimized using CDC, creating the FIVETRAN_DELETED and timestamp columns as mentioned earlier to create historical records for each data record.
Tracking history with effective dates
If using timestamp columns to track changes, it is important that you use the same column value from your data source to do so. Using different timestamp values could lead to inconsistencies and holes in your data’s history. I always recommend using a timestamp that represents a true creation and deletion time for your records. Ideally, this is a column generated by your database itself.
Keep your data up to date
Slowly changing dimensions are an integral part of a well-designed database. When you consider the pros and cons of each type for your data tables from the very start, you make analytics that much easier. Keeping track of historical records allows the data team to present a full picture of changing metrics, enabling data-driven business decisions. Whether you decide to use ETL, CDC, or date tracking, considering slowly changing dimensions will allow you to create a thorough analytics-forward culture within your organization.
ThoughtSpot exists to help you empower that data-driven culture. Our AI-powered analytics and natural language search experience is designed to help front-line decision-makers find the answers they need in real time—no technical training required. Try it for yourself with our 14-day free trial.