Lesson 124
SCD 4 : how it works>
scd type 4 is an extension to scd type 2.
In scd type 2, all latest events of records and old events(history) of records available in single table.
In scd type 4, all latest events of records are available in active table,
all old events of records (history) are available in history table.
adv:
when you need only latest, you will contact active table.
when you need history, you will contact history table.
info --> source table.
its delta is info_delta.txt
targets --> 1. info_active 2. info_history
-----------------------
source table : info
create table info(id int, name varchar(100), sal int, company varchar(100))
info
------------------------------
id name sal company
-------------------------------
101 aa 60000 ibm
102 bb 70000 infy
------------------------------------
after incremental load:
delta ---> info_delta.txt
Day1:
info_delta.txt
-------------------------------------
id name sal company
-------------------------------
101 aa 60000 ibm
102 bb 70000 infy
After scd4 job.
info_active
-------------------------------------------
skey id name sal company status
---------------------------------------------------
1 101 aa 60000 ibm 1
2 102 bb 70000 infy 1
--------------------------------------------------
info_history
-------------------------------------------------
skey id name sal company status
-----------------------------------------------------
---- 0 records --------
Day2:
source : info (table)
info
----------------------------------------
id name sal company
---------------------------------------
101 aa 70000 ibm (updated)
102 bb 80000 infy(updated)
103 cc 90000 ibm (inserted)
info_delta.txt
-------------------------------------
id name sal company
-------------------------------
101 aa 70000 ibm (updated)
102 bb 80000 infy(updated)
103 cc 90000 ibm (inserted)
After scd4 job.
info_active
-------------------------------------------
skey id name sal company status
---------------------------------------------------
1 101 aa 70000 ibm 1
2 102 bb 80000 infy 1
3 103 cc 90000 ibm 1
--------------------------------------------------
info_history
-------------------------------------------------
skey id name sal company status
-----------------------------------------------------
1 101 aa 60000 ibm 0
2 102 bb 70000 infy 0
----------------------------------------------------
Day3:
source : info (table)
info
----------------------------------------
id name sal company
---------------------------------------
101 aa 99000 ibm (updated)
102 bb 80000 infy
103 cc 100000 ibm (updated)
104 dd 200000 ibm (inserted)
info_delta.txt
-------------------------------------
id name sal company
-------------------------------
101 aa 99000 ibm (updated)
103 cc 100000 ibm (updated)
104 dd 200000 ibm (inserted)
After scd4 job.
info_active
-------------------------------------------
skey id name sal company status
---------------------------------------------------
1 101 aa 99000 ibm 1
2 102 bb 80000 infy 1
3 103 cc 100000 ibm 1
4 104 dd 200000 ibm 1
--------------------------------------------------
info_history
-------------------------------------------------
skey id name sal company status
-----------------------------------------------------
1 101 aa 60000 ibm -1
2 102 bb 70000 infy 0
3 101 aa 70000 ibm 0
4 103 cc 90000 ibm 0
----------------------------------------------------
how to query once scd4 implemented.
ex 1:
i want latest of 101
select * from info_active where id=101 ;
ex 2:
i want all previous events of 101
select * from info_history where id=101;
ex 3:
i want recent 2nd event of 101
select * from info_history where id=101 and status=0
ex 4:
i want recent all 3 events of 101 (recent 1, recent 2 , recent 3)
select * from info_active where id=101
union all
select * from info_history where id=101 and status >= -1
################################################################################
Above is explanatory notes about " scd type 4 " works.
--> type 4 is extension to type 2.
################################################################################
DataFlow for scd type 4 implementation:
---------------------------------------
two tables required in target database.
Azure sql:
1. active table --> info_active
2. history table --> info_history
create table info_active(id int,
name varchar(100),
sal int,
company varchar(100))
create table info_history(skey int identity,
id int,
name varchar(100),
sal int,
company varchar(100),
status int)
delta file : which is output of incremental load job:
------------------------------------------------------
info_delta.txt
-------------------
id,name,sal,company
....
....
-----------------------
DataFlow for scd4:
in this dataflow 3 subflows are involved.
Flow1: delta-->alterrow2 --> upsertAction (on active table)
Flow2: active --> exists1(with delta)
-->derivedColumn1-->InsertHistory
Flow3: history --> exists2(with active)
--> derivedColumn2 -->
alterrow2 --> UpdateHistory
----------------------------------------------------------
lets check each independent flow:
Flow1:
1. source :
output stream : delta
data set : ds_delta (info_delta.txt)
projection :
id as integer
sal as integer
2. AlterRow:
output stream : alterrow1
AlterRow condition:
action : upsert
condition : 1 == 1
3. sink:
output stream : upsertActive
sink Settings :
action : upsert
list of key columns: id
--------------------------------------------------------
in same dataflow:
Flow2:
4. Source :
output stream : active
data set : ds_active (info_active table)
5. exists :
output stream : exists1
left stream : active
right stream : delta
exists condition : active@id == delta@id
6. derived column:
output stream : derivedColumn1
Add Columns:
column expression
----------------------------
status 0
----------------------------
7. sink:
output stream : InsertHistory
data set : ds_history (info_history table)
-------------------------------------------------------------
in same dataflow ,
Flow3:
8. Source:
output stream : history
data set : ds_history
9. exists :
output stream : exists2
left stream : history
right stream : active
exists codition:
history@id == active@id
10. Derived Column:
output stream : derivedColumn2
Add Columns:
column expression
---------------------------
status status-1
-----------------------------
11. Alter row :
output stream : AlterRow2
alterrow condition:
action : update
condition : 1 == 1
12. Sink :
output stream : UpdateHistory
dataset : ds_history
sink settings:
action : update
list of key columns: skey
---------------------------------
now 3 flows we have.
flow1 --> sink : upsertActive
flow2 --> sink : insertHistory
flow3 --> sink : updateHistory
expected functionality,
at first common records of active and history should updated in history table.(status = status-1).
next, common records from active of active and delta should be inserted into history table with status 0.
then finally , all delta records should be upserted into active table.
all above operations should be done in a sequence , but not parallely.
solution:
go to dataflow settings.
and set order of sinks in a sequence.
DataFlow Settings:
sink name write Order
-------------------------------------
UpdateHistory 1
InsertHistory 2
UpsertActive 3
-----------------------------------------
scd type 0--> captures only insertions (but not updations)
action : insert on target
scd type 1 --> captures insertions and updations
action : upsert on target
problem : no history
scd type 2 ---> catures insertions and updations along with history.
-->it maintains record version for every inserted record.
---> for active records status 1, for history records status less than or equal to zero.
problem:
more space occupancy bcoz of long history
and retrieval slower .
action: Update and insert in sequence (not as combination)
scd type 3 :
it captures insertions and updations along with recent history of specified number of events of given specific attibutes(columns).
each record in target available only 1 time.
Action : upsert
scd type 4:
it captures insertions and updations along with history.
but all active records into seperate table,
all hitorical records into seperate table.
ex: active into info_active
history into info_history.
Action :
3 actions in a sequence.
1. update on history
2. insert into history
3. upsert on active
------------------------------------
Complete and Continue