Handling SCD Type 2 with Hive 3.x and HDFS: A Comprehensive Guide
Harnessing the power of open source software running in your own data center
Slowly Changing Dimensions (SCD) have always posed unique challenges in data warehousing. The modern capabilities of Hive 3.x have made handling SCD Type 2 more efficient and scalable, especially when coupled with the robust storage mechanism of HDFS. In this post, we'll dive deep into how you can leverage the power of Hive 3.x to implement SCD Type 2.
Introduction to SCD Type 2:
SCD Type 2 allows us to keep a full history of data changes. When an attribute changes, instead of updating the record, a new record is added with a new effective date, and the previous record's end date is populated.
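For example, suppose a hypothetical employee 101 moves from Sales to Marketing effective 2023-04-01. After the change, the dimension holds two rows for that employee:
employee_id  department  effective_date  end_date    is_current
101          Sales       2021-01-01      2023-04-01  FALSE
101          Marketing   2023-04-01      NULL        TRUE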
Prerequisites:
Apache Hive 3.x with ACID v2 enabled.
HDFS for data storage.
Familiarity with SQL and basic Hive concepts.
Step-by-step Implementation:
1. Setting Up Your Hive Environment:
Before diving in, ensure ACID operations are activated:
SET hive.support.concurrency = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- hive.enforce.bucketing was removed in Hive 2.0; bucketing is now always enforced
-- the compactor settings below typically belong in hive-site.xml on the metastore side
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
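You can confirm that a property took effect in the current session by issuing SET with the property name alone, which echoes its current value:
SET hive.txn.manager;
SET hive.support.concurrency;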
2. Table Creation:
Start by setting up a partitioned and bucketed employees
table, optimized for transactional operations.
CREATE TABLE employees (
employee_id INT,
employee_name STRING,
department STRING,
salary FLOAT,
effective_date DATE,  -- when this version of the record became active
end_date DATE,        -- NULL while the version is current
is_current BOOLEAN    -- TRUE only for the active version
)
PARTITIONED BY (year INT)
CLUSTERED BY (employee_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
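To illustrate, you could seed the table with an initial version row. The values below are made up; with dynamic partitioning in nonstrict mode, the partition column year is supplied as the last value:
-- Hypothetical seed row for illustration
INSERT INTO employees VALUES
(101, 'Alice', 'Sales', 75000.0, DATE '2021-01-01', NULL, TRUE, 2021);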
3. Handling Incoming Data:
Staging new data is a crucial step. Create an incoming_data
table to stage this data before merging.
CREATE TABLE incoming_data (
employee_id INT,
employee_name STRING,
department STRING,
salary FLOAT,
effective_date DATE
)
STORED AS ORC;
Load the new data. Note that LOAD DATA moves the files as-is, so the files at the source path must already be in ORC format matching the table definition:
LOAD DATA INPATH '/path/to/hdfs/new_data' INTO TABLE incoming_data;
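Before merging, a quick sanity check on the staged batch can catch bad loads early, for example:
SELECT COUNT(*) AS rows_staged,
       COUNT(DISTINCT employee_id) AS distinct_employees,
       MIN(effective_date) AS earliest,
       MAX(effective_date) AS latest
FROM incoming_data;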
4. Merging Data Using ACID Operations:
With Hive 3.x, you can use the MERGE statement, which is optimized for such operations. A single MERGE pass cannot both close out a matched row and insert its replacement, so the source query presents each changed employee twice: once keyed on employee_id (to close the current version), and once with a NULL merge key so the replacement row falls through to the INSERT branch. This pattern assumes incoming_data carries at most one row per employee per batch. Note also that the join deliberately matches on the key alone; restricting it to the partition of the new effective_date would fail to close current rows that live in an earlier year's partition.
MERGE INTO employees AS target
USING (
  -- Rows keyed on employee_id: used to close out the current version
  SELECT i.employee_id AS merge_key, i.* FROM incoming_data i
  UNION ALL
  -- Changed rows duplicated with a NULL key: these never match,
  -- so they insert the new current version
  SELECT CAST(NULL AS INT) AS merge_key, i.*
  FROM incoming_data i
  JOIN employees e
    ON i.employee_id = e.employee_id AND e.is_current = TRUE
) AS source
ON target.employee_id = source.merge_key
WHEN MATCHED AND target.is_current = TRUE THEN
UPDATE SET
  end_date = source.effective_date,
  is_current = FALSE
WHEN NOT MATCHED THEN
INSERT VALUES (
  source.employee_id,
  source.employee_name,
  source.department,
  source.salary,
  source.effective_date,
  NULL,
  TRUE,
  year(source.effective_date)
);
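After a merge, it is worth verifying the invariants SCD Type 2 depends on. For example (employee_id 101 is an illustrative key), list one employee's full history, then confirm that no employee has more than one current row:
SELECT employee_id, department, salary, effective_date, end_date, is_current
FROM employees
WHERE employee_id = 101
ORDER BY effective_date;

SELECT employee_id, COUNT(*) AS current_versions
FROM employees
WHERE is_current = TRUE
GROUP BY employee_id
HAVING COUNT(*) <> 1;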
5. Ongoing Operations:
Repeat steps 3 and 4 whenever you have new incoming data.
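Between runs, clear the staging table so each merge processes only the latest delta:
TRUNCATE TABLE incoming_data;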
Notes:
The MERGE statement first checks for matching records. For a match, it sets the current record's end_date and flips is_current to FALSE; the replacement version, like any brand-new employee, is inserted as a current record with a NULL end_date.
Proper use of bucketing can help improve the performance of the MERGE statement.
Keep your ORC files optimized. Over time, as more updates and deletes are performed, you may accumulate many small delta files or files with many "deleted" rows. Periodic compaction (both minor and major) helps maintain query performance.
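With the compactor enabled, compactions are scheduled automatically, but you can also trigger and monitor them manually; for a partitioned table the partition must be named (year=2023 below is illustrative):
ALTER TABLE employees PARTITION (year=2023) COMPACT 'minor';
ALTER TABLE employees PARTITION (year=2023) COMPACT 'major';
SHOW COMPACTIONS;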
As always, test thoroughly on a subset of your data before implementing on the full dataset, especially when working with production data.
Using Hive 3.x with its ACID v2 features is a more efficient way to handle SCD Type 2 on large datasets compared to previous versions of Hive.
Conclusion:
Hive 3.x brings a fresh perspective to handling SCD Type 2, making the process more efficient and straightforward. By leveraging the improved ACID operations and the robustness of HDFS, data engineers can now seamlessly maintain historical data changes without compromising on performance.
Let me know your comments or feedback. Thanks for reading!