Lesson 120
workflow:
two pipelines.
1.pipeline1 ---> incremental loading. --> which produces delta
2. pipleline2 ---> scd type 2 implementation . --> which reflects delta into target table along with history.
how to keep these two pipelines in one single pipeline.
-----------------------------------------------
pipeline1---->pipeline2
pipeline1(incremental load)
source : table
sink : datalake (delta.txt)
pipleline2(scdtype2)
source : previous pipeline output: (delta.txt)
sink : table
once pipeline1 executed in datalake delta.txt will be created.
but metadata updation will take generall 3 to 4 seconds time.
meanwhile if pipeline to started execution, pipeline2's input is delta.txt,
but information(metadata) is not available in master node of datalake.
so, pipeline2 treats, delta.txt is not available. so pipeline2 execution will be failed.
solution:
pipeline1----> wait(10 secs) ----> pipeline2 (scd2).
once pipeline1 output is produced,
before executing pipeline2, it waits for 10 secs, so that, metadata of delta.txt is available(ready). so that pipeline2 execution will be continued.
---------------------------------------------------
scd type 2 implementation:
-------------------------
source :
data lake file : delta.txt (output of incremental loading).
delta.txt
--------------------------
id,name,sal,company
101,aaa,80000,ibm
102,bbb,85000,infy
103,ccc,70000,wipro
--------------------------
sink(target) : azure sql table for scd type 2.
create table scd2(skey int identity , id int , name varchar(100), sal int, company varchar(100), status int)
insert into scd2 values
(101, 'aaa', 50000, 'ibm', 1),
(103, 'ccc', 60000, 'ibm' , 1)
------------------------
after scd type 2 job,
expected functionality.
delta records should be inserted with status 1 into target scd2 table.
before insertion of delta,
delta common records status should be updated to 0.
scd2
------------------------------------------------
skey id name sal company status
--------------------------------------------------------
1 101 'aaa' 50000 'ibm' 0
2 103 'ccc' 60000 'ibm' 0
3 101 aaa 80000 ibm 1
4 102 bbb 85000 infy 1
5 103 ccc 70000 wipro 1
--------------------------------------------------------------
-------------------------------
day2
delta.txt
id,name,sal,company
101,aaa,90000,infy
104,ddd,100000,wipro
after scd type 2 job;
scd2
------------------------------------------------
skey id name sal company status
--------------------------------------------------------
1 101 'aaa' 50000 'ibm' 0
2 103 'ccc' 60000 'ibm' 0
3 101 aaa 80000 ibm 0
4 102 bbb 85000 infy 1
5 103 ccc 70000 wipro 1
6 101 aaa 90000 infy 1
7 104 ddd 100000 wipro 1
------------------------------------------------------------
--------------------------------------------------------------
day3:
id,name,sal,company
101,aaa,100000,infy
104,ddd,200000,wipro
after scd type 2 job:
scd2
------------------------------------------------
skey id name sal company status
--------------------------------------------------------
1 101 'aaa' 50000 'ibm' 0
2 103 'ccc' 60000 'ibm' 0
3 101 aaa 80000 ibm 0
4 102 bbb 85000 infy 1
5 103 ccc 70000 wipro 1
6 101 aaa 90000 infy 0
7 104 ddd 100000 wipro 0
8 101 aaa 100000 infy 1
9 104 ddd 200000 wipro 1
-------------------------------------------------------------
------------------------------------------------------------
DataFlow for scd type 2:
---------------------------
1. source:
output stream : delta
data set : ds_delta
projection:
id as integer
sal as integer
2. source:
output stream : scd2
data set : ds_scd2
3. Derived column( to delta)
output stream : derived1
Add columns:
Column expression
------------------------
status 1
-------------------------
4. sink:
output stream : insertAction
data set : ds_scd2
5. exists (to scd2) :
output stream : exists
left stream : scd2
right stream : delta
exists type : 1. exists(select) 2. does not exist
exists condition:
left right
-------------------------------
scd2@id == delta@id
---------------------------
6. derived column( to exists)
output stream : derived2
Add columns
Columns expression
---------------------------
status 0
-------------------------
7. Alter row (to derived2)
output stream : alter1
Action : update
condition : 1==1
8. sink: (to alter1)
output stream : updateAction
data set : ds_scd2
sink settings:
action : update
list of columns: skey
Go to settings of DataFlow:
set sequence of flows to be executied
(note: if you dont set sequence ---> all flows will be executed parallely).
sequence 1 . updateAction
sequence 2 . insertAction.
----------------
create pipeline , and call this dataflow.
---------------------------------------------------------
what will be there in scd2 table.
---> all latest records with status 1 ,
all history records with status 0.
------------------------------------------------------------------------
How to implement status with new versioning feature. to improve retrieval fast in scd type2.
-------------------------------------------------------------------------
expected status and version:
skey id name sal company status
--------------------------------------------------------
1 101 aaa 60000 ibm -3
2 101 aaa 70000 ibm -2
3 101 aaa 80000 ibm -1
4 101 aaa 90000 ibm 0
5 101 aaa 100000 infy 1
--------------------------------------------------------
Advantage of this model:
all latest records are with status 1.
all old records are with status<1
how to retrieve specific information .
ex:
i want recent 2 info of 101
(2-n) = 2-2 = 0
select * from scd2 where id=101 and status=0
ex:
i want recent 4 event of 101
select * from scd2 where id=101 and status = -2
ex:
i want all recent 3 events of 101
select * from scd2 where id=101 and status >= -1
-------------------------------------------------------------------------
how to implement:
dataflow:
------------
1. source:
output stream : delta
data set : ds_delta
projection:
id as integer
sal as integer
2. source:
output stream : scd2
data set : ds_scd2
3. Derived column( to delta)
output stream : derived1
Add columns:
Column expression
------------------------
status 1
-------------------------
4. sink:
output stream : insertAction
data set : ds_scd2
Flow : delta--> derived1 ---> sink(insert)
5. exists (to scd2) :
output stream : exists
left stream : scd2
right stream : delta
exists type : 1. exists(select) 2. does not exist
exists condition:
left right
-------------------------------
scd2@id == delta@id
---------------------------
6. derived column( to exists)
output stream : derived2
Add columns
Columns expression
---------------------------
status status-1
-------------------------
(meaning ---> status = status-1)
7. Alter row (to derived2)
output stream : alter1
Action : update
condition : 1==1
8. sink: (to alter1)
output stream : updateAction
data set : ds_scd2
sink settings:
action : update
list of columns: skey
Go to settings of DataFlow:
set sequence of flows to be executied
(note: if you dont set sequence ---> all flows will be executed parallely).
sequence 1 . updateAction
sequence 2 . insertAction.
----------------
create pipeline , and call this dataflow.
---------------------------------------------------------
Complete and Continue