Lesson 124


SCD 4 : how it works>


 scd type 4 is an extension to scd type 2. 


In scd type 2, all latest events of records and old events(history) of records available in single table. 


In scd type 4, all latest events of records are available in active table,

all old events of records (history) are available in history table. 


adv:

when you need only latest, you will contact active table.

    when you need history, you will contact history table. 



info --> source table. 

its delta is  info_delta.txt


targets -->  1. info_active  2. info_history


-----------------------


 source table :  info

 

 create table info(id int, name varchar(100), sal int, company varchar(100))

info

------------------------------

id name sal company

-------------------------------

101 aa 60000 ibm

102 bb 70000 infy

------------------------------------


 after incremental load:

   delta --->  info_delta.txt


Day1:  

info_delta.txt

-------------------------------------

id name sal company

-------------------------------

101 aa 60000 ibm

102 bb 70000 infy


After scd4 job. 


info_active

-------------------------------------------

skey id name sal company status

---------------------------------------------------

1 101 aa 60000 ibm 1

2 102 bb 70000 infy 1

--------------------------------------------------


     info_history

-------------------------------------------------

skey id name sal company status

-----------------------------------------------------

  ---- 0 records --------


Day2:


source : info (table)


info

----------------------------------------

id name sal company

---------------------------------------

101 aa 70000 ibm (updated)

102 bb 80000 infy(updated)

103   cc 90000 ibm (inserted)


info_delta.txt

-------------------------------------

id name sal company

-------------------------------

101 aa 70000 ibm (updated)

102 bb 80000 infy(updated)

103   cc 90000 ibm (inserted)



After scd4 job. 


info_active

-------------------------------------------

skey id name sal company status

---------------------------------------------------

1 101 aa 70000 ibm 1

2 102 bb 80000 infy 1

3 103   cc 90000 ibm 1


--------------------------------------------------


     info_history

-------------------------------------------------

skey id name sal company status

-----------------------------------------------------

1 101 aa 60000 ibm 0

2 102 bb 70000 infy  0

----------------------------------------------------



Day3:


source : info (table)


info

----------------------------------------

id name sal company

---------------------------------------

101 aa 99000 ibm (updated)

102 bb 80000 infy

103   cc 100000 ibm (updated)

104 dd 200000 ibm (inserted)


info_delta.txt

-------------------------------------

id name sal company

-------------------------------

101 aa 99000 ibm (updated)

103   cc 100000 ibm (updated)

104 dd 200000 ibm (inserted)


After scd4 job. 


info_active

-------------------------------------------

skey id name sal company status

---------------------------------------------------

1 101 aa 99000 ibm 1

2 102 bb 80000 infy 1

3 103   cc 100000 ibm 1

4 104 dd 200000 ibm 1

--------------------------------------------------


     info_history

-------------------------------------------------

skey id name sal company status

-----------------------------------------------------

1 101 aa 60000 ibm -1

2 102 bb 70000 infy  0

3 101 aa 70000 ibm 0

4 103   cc 90000 ibm 0

----------------------------------------------------


how to query once scd4 implemented. 


 ex 1:

i want latest of 101

select * from info_active where id=101 ;


 ex 2:

i want all previous events of 101


   select * from info_history where id=101;


  ex 3:

i want recent 2nd event of 101

select * from info_history where id=101 and status=0


  ex 4:

i want recent all 3 events of 101 (recent 1, recent 2 , recent 3)


select * from info_active where id=101

union all

select * from info_history where id=101 and status >= -1


################################################################################

Above is explanatory notes about " scd type 4 " works. 

 --> type 4 is extension to type 2. 

################################################################################


DataFlow for scd type 4 implementation:

---------------------------------------

two tables required in target database. 


Azure sql:

 1. active table --> info_active

 2. history table --> info_history


create table info_active(id int, 

name varchar(100),

sal int, 

company varchar(100))



create table info_history(skey int identity,

id int, 

name varchar(100),

sal int, 

company varchar(100),

status int)


delta file : which is output of incremental load job:

------------------------------------------------------

info_delta.txt

-------------------

id,name,sal,company

....

....

-----------------------


DataFlow for scd4:

 in this dataflow 3 subflows are involved. 


 Flow1: delta-->alterrow2 --> upsertAction (on active table)

 Flow2: active --> exists1(with delta)

-->derivedColumn1-->InsertHistory

 Flow3: history --> exists2(with active)

--> derivedColumn2 -->

alterrow2 --> UpdateHistory

 ----------------------------------------------------------

 lets check each independent flow:


 Flow1:

 1. source :

output stream : delta

data set : ds_delta (info_delta.txt)

projection :

id as integer

sal as integer

 2. AlterRow:

output stream : alterrow1

AlterRow condition:

action : upsert

condition : 1 == 1

 3. sink:

output stream : upsertActive

sink Settings :

action : upsert

list of key columns: id

 --------------------------------------------------------

 in same dataflow:

 Flow2:


 4. Source :

output stream : active

data set : ds_active (info_active table)

 5. exists :

output stream : exists1

left stream : active

right stream : delta

exists condition : active@id == delta@id

 6. derived column:

output stream : derivedColumn1

Add Columns:

column expression

----------------------------

status 0

----------------------------

 7. sink:

output stream : InsertHistory

data set : ds_history  (info_history table)

 -------------------------------------------------------------

 in same dataflow ,

 Flow3:


 8. Source:

output stream : history

data set : ds_history 

 9. exists :

output stream : exists2

left stream : history

right stream : active

exists codition:

history@id == active@id


 10. Derived Column:

output stream : derivedColumn2

Add Columns:

column expression

---------------------------

status status-1

-----------------------------

 11. Alter row :

output stream : AlterRow2

alterrow condition:

action : update 

condition : 1 == 1

 12. Sink :

output stream : UpdateHistory

dataset : ds_history

sink settings:

action : update

list of key columns: skey


---------------------------------

  now 3 flows we have. 

 flow1 --> sink : upsertActive

 flow2 --> sink : insertHistory

 flow3 --> sink : updateHistory


 expected functionality, 

  at first common records of active and history should updated in history table.(status = status-1).

  next, common records from active of active and delta should be inserted into history table with status 0. 

  then finally , all delta records should be upserted into active table. 


  all above operations should be done in a sequence , but not parallely. 


 solution:

go to dataflow settings.

and set order of sinks in a sequence. 

DataFlow Settings:

sink name write Order

-------------------------------------

UpdateHistory 1

InsertHistory 2

UpsertActive 3

-----------------------------------------






 



 




































scd type 0--> captures only insertions (but not updations)

  action : insert on target

scd type 1 --> captures insertions and updations 

action : upsert on target

problem : no history

scd type 2 ---> catures insertions and updations along with history. 

-->it maintains record version for every inserted record.

---> for active records status 1, for history records status less than or equal to zero. 

problem:

more space occupancy bcoz of long history

and retrieval slower . 

    action:  Update and insert in sequence (not as combination)

scd type 3 :

it captures insertions and updations along with recent history of specified number of events of given specific attibutes(columns). 

    each record in target available only 1 time. 

   Action : upsert 

scd type 4:

it captures insertions and updations along with history.

but all active records into seperate table,

all hitorical records into seperate table. 


ex: active into  info_active

 history into info_history.


Action :

3 actions in a sequence.

1. update on history

2. insert into history

3. upsert on active

------------------------------------


Complete and Continue