Lesson 120




workflow:

two pipelines.

1.pipeline1 ---> incremental loading. --> which produces delta


2. pipleline2 ---> scd type 2 implementation . --> which reflects delta into target table along with history. 


 how to keep these two pipelines in one single pipeline. 


-----------------------------------------------


 pipeline1---->pipeline2


 pipeline1(incremental load)

source : table

sink : datalake (delta.txt)

 pipleline2(scdtype2)

source : previous pipeline output: (delta.txt)

sink : table


 once pipeline1 executed in datalake delta.txt will be created. 

  but metadata updation will take generall 3 to 4 seconds time. 

  meanwhile if pipeline to started execution, pipeline2's input is delta.txt, 

 but information(metadata)  is not available  in master node of datalake. 

  so, pipeline2 treats, delta.txt is not available. so pipeline2 execution will be failed. 


solution:


  pipeline1----> wait(10 secs) ----> pipeline2 (scd2). 


   once pipeline1 output is produced, 

before executing pipeline2,  it waits for 10 secs, so that, metadata of delta.txt is available(ready).  so that pipeline2 execution will be continued. 


---------------------------------------------------


scd type 2 implementation:

 -------------------------


source :

   data lake file : delta.txt (output of incremental loading). 



delta.txt

--------------------------

id,name,sal,company

101,aaa,80000,ibm

102,bbb,85000,infy

103,ccc,70000,wipro

--------------------------


sink(target) : azure sql table for scd type 2. 


   create table scd2(skey int identity , id int , name varchar(100), sal int, company varchar(100), status int)


insert into scd2 values

(101, 'aaa', 50000, 'ibm', 1),

(103, 'ccc', 60000, 'ibm' , 1)


------------------------





after scd type 2 job, 

  expected functionality. 

 delta records should be inserted with status 1 into target scd2 table. 

 before insertion of delta, 

  

  delta common records status should be updated to 0. 


scd2

------------------------------------------------

skey id name sal company status

--------------------------------------------------------

1 101 'aaa' 50000 'ibm' 0

2 103 'ccc' 60000 'ibm' 0

3 101 aaa 80000 ibm 1

4 102 bbb 85000 infy 1

5 103 ccc 70000 wipro 1

--------------------------------------------------------------



-------------------------------


day2


delta.txt

id,name,sal,company

101,aaa,90000,infy

104,ddd,100000,wipro


after scd type 2 job;


scd2

------------------------------------------------

skey id name sal company status

--------------------------------------------------------

1 101 'aaa' 50000 'ibm' 0

2 103 'ccc' 60000 'ibm' 0

3 101 aaa 80000 ibm 0

4 102 bbb 85000 infy 1

5 103 ccc 70000 wipro 1

6 101 aaa 90000 infy 1

7 104 ddd 100000 wipro 1

------------------------------------------------------------

--------------------------------------------------------------



day3:


id,name,sal,company

101,aaa,100000,infy

104,ddd,200000,wipro


after scd type 2 job:


scd2

------------------------------------------------

skey id name sal company status

--------------------------------------------------------

1 101 'aaa' 50000 'ibm' 0

2 103 'ccc' 60000 'ibm' 0

3 101 aaa 80000 ibm 0

4 102 bbb 85000 infy 1

5 103 ccc 70000 wipro 1

6 101 aaa 90000 infy 0

7 104 ddd 100000 wipro 0

8 101 aaa 100000 infy 1

9 104 ddd 200000 wipro 1

-------------------------------------------------------------


------------------------------------------------------------







DataFlow for scd type 2:

---------------------------


1. source:

output stream : delta

data set : ds_delta

projection:

id as integer

sal as integer

2. source:

output stream : scd2

data set : ds_scd2


3. Derived column( to delta)

output stream : derived1

Add columns:

Column expression

------------------------

status 1

-------------------------

4. sink:

output stream : insertAction

data set : ds_scd2

5. exists (to scd2) :

output stream : exists

left stream : scd2

right stream : delta

exists type : 1. exists(select) 2. does not exist

exists condition:

left right

-------------------------------

scd2@id == delta@id

---------------------------

6. derived column( to exists)

output stream : derived2

Add columns

Columns expression

---------------------------

status 0

-------------------------

7. Alter row (to derived2)

output stream : alter1

Action : update

condition : 1==1

8. sink: (to alter1)

output stream : updateAction 

data set : ds_scd2

sink settings:

action : update

list of columns: skey

Go to settings of DataFlow:

  set sequence of flows to be executied

  (note: if you dont set sequence ---> all flows will be executed parallely). 

   sequence 1 .  updateAction

   sequence 2 .  insertAction. 

----------------

 create pipeline , and call this dataflow. 

 ---------------------------------------------------------

what will be there in scd2 table.

---> all latest records with status 1 , 

all history records with status 0. 

 ------------------------------------------------------------------------


How to implement status with new versioning feature. to improve retrieval fast in scd type2. 

-------------------------------------------------------------------------



  expected status and version:


skey id name sal company status

--------------------------------------------------------

1 101 aaa   60000 ibm -3

2 101 aaa   70000 ibm -2

3 101 aaa 80000 ibm -1

4 101 aaa 90000 ibm 0

5 101 aaa 100000 infy 1

--------------------------------------------------------


Advantage of this model:

   all latest records are with status 1. 

all old records are with status<1


how to retrieve specific information . 


  ex:

   i want recent 2 info of 101

(2-n) = 2-2 = 0


  select * from scd2 where id=101 and status=0


  ex: 

i want recent 4 event of 101


  select * from scd2 where id=101 and status = -2


  ex:

i want all recent 3 events of 101


  select * from scd2 where id=101 and status >= -1   

-------------------------------------------------------------------------


how to implement:


dataflow:

------------

1. source:

output stream : delta

data set : ds_delta

projection:

id as integer

sal as integer

2. source:

output stream : scd2

data set : ds_scd2


3. Derived column( to delta)

output stream : derived1

Add columns:

Column expression

------------------------

status 1

-------------------------

4. sink:

output stream : insertAction

data set : ds_scd2


 Flow : delta--> derived1 ---> sink(insert)

5. exists (to scd2) :

output stream : exists

left stream : scd2

right stream : delta

exists type : 1. exists(select) 2. does not exist

exists condition:

left right

-------------------------------

scd2@id == delta@id

---------------------------

6. derived column( to exists)

output stream : derived2

Add columns

Columns expression

---------------------------

status status-1

-------------------------

(meaning ---> status = status-1)

7. Alter row (to derived2)

output stream : alter1

Action : update

condition : 1==1

8. sink: (to alter1)

output stream : updateAction 

data set : ds_scd2

sink settings:

action : update

list of columns: skey

Go to settings of DataFlow:

  set sequence of flows to be executied

  (note: if you dont set sequence ---> all flows will be executed parallely). 

   sequence 1 .  updateAction

   sequence 2 .  insertAction. 

----------------

 create pipeline , and call this dataflow. 

 ---------------------------------------------------------



Complete and Continue