Lesson 7: ADF Introduction *** Difference between ADF Version 1 and Version 2 *** DataFlows and Power Query as New Features in Version 2 *** DataFlows are for Data Transformations *** Power Query is for Data Preparation and Data Wrangling Activities *** Building Blocks of ADF *** Pipeline *** Activity *** DataSets *** Linked Service *** Integration Runtime *** Auto Integration Runtime *** Self-Hosted Integration Runtime *** SSIS Integration Runtime
ADF session 7 notes:
--------------------
ADF --> Azure Data Factory.
--> is a cloud data orchestration service (in general, we call it a cloud ETL/ELT service).
ADF has two versions:
1. version 1 ---> only extract and load operations are available.
   to perform transformations, additional azure services are used:
   1. HDInsight (Hadoop framework on the azure cloud)
      --> Pig (a dataflow language -- to perform transformations)
      ---> MapReduce
   2. azure functions (C# coding is used)
   3. stored procedures (to process structured data), etc.
   4. databricks (Spark --> Scala/Python/R/SQL)
2. version 2
   ---> DataFlows ---> used to perform transformations (zero coding)
   ---> Power Query ---> used for data preparation (data wrangling)
   with adf v2 ---> extract, load, and transform.
   extract and load are done by the adf pipeline's core activities.
   all transformations are done with "DataFlows".
Building Blocks of ADF:
----------------------
1. pipelines
2. activities
3. datasets
4. linked services
5. integration runtime
1. pipeline:
   ---> is a sequence of activities
        (the control flow of activities)
2. activity?
   ---> is an operation on data / a data object
3. dataset?
   a dataset is a snapshot/alias of a physical data object
   ---> it is a pointer to a physical data object.
oracle --> db name: salesdb ---> table: sales
azure data lake ---> container: mydata
                     storage account: mydatalake
task1:
extract data from the sales table and dump it into the azure data lake (mydata)
we need to create two datasets in adf:
data set1 ---> ds_oracle_salesdb_sales
data set2 ---> ds_datalake_gen2_mydatalake_mydata
copy data (activity)
source : ds_oracle_salesdb_sales
sink: ds_datalake_gen2_mydatalake_mydata
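a minimal sketch of how task1 maps onto adf objects -- two datasets plus one copy data activity inside a pipeline. written here as python dicts mirroring adf's json resource format; the dataset names come from the notes above, the other property values (types, locations) are illustrative assumptions, not an exact adf schema.

import json

# dataset pointing at the oracle source table (sales in salesdb)
ds_source = {
    "name": "ds_oracle_salesdb_sales",
    "properties": {
        "type": "OracleTable",
        "linkedServiceName": {"referenceName": "ls_oracle_salesdb",
                              "type": "LinkedServiceReference"},
        "typeProperties": {"tableName": "sales"},
    },
}

# dataset pointing at the data lake sink (mydata container in mydatalake)
ds_sink = {
    "name": "ds_datalake_gen2_mydatalake_mydata",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "ls_datalakegen2_mydatalake",
                              "type": "LinkedServiceReference"},
        "typeProperties": {"location": {"type": "AzureBlobFSLocation",
                                        "fileSystem": "mydata"}},
    },
}

# pipeline with a single copy data activity: source dataset -> sink dataset
pipeline = {
    "name": "pl_copy_sales_to_datalake",
    "properties": {
        "activities": [
            {
                "name": "copy_sales",
                "type": "Copy",
                "inputs": [{"referenceName": ds_source["name"], "type": "DatasetReference"}],
                "outputs": [{"referenceName": ds_sink["name"], "type": "DatasetReference"}],
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
------------------------------------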
task2:
extract data from an oracle table to the data lake and perform transformations;
the transformed data should be loaded into a synapse table.
How many datasets are required ? ---> 3 datasets.
dataset1 ---> ds_oracle_salesdb_sales
dataset2 --> ds_datalake_gen2_mydatalake_mydata
dataset3 --> ds_synapse_dwhdb_sales
4. Linked service:
---------------------
is a connection string.
if the storage service is a database,
the linked service has the following information:
server name (oracle)
database : salesdb
user :
password :
if the storage service is a data lake,
the linked service has the following information:
1. data lake(gen1/gen2/blob service)
2. storage account name
3. authentication
naming standards:
1. oracle --> salesdb --> 100 tables
ls_oracle_salesdb
2. datalake gen2 --> storage account : mystorage101 --> 15 containers
ls_datalakegen2_mystorage101
3. azure sql ---> db : mydb --> 200 tables
ls_azuresql_mydb
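a minimal sketch of the connection details a linked service holds for the two storage types above, again as python dicts in the style of adf's json definitions. the server/account names, secrets, and exact property keys are illustrative assumptions.

# linked service for an oracle database: server, database, user, password
ls_oracle_salesdb = {
    "name": "ls_oracle_salesdb",
    "properties": {
        "type": "Oracle",
        "typeProperties": {
            "connectionString": "Host=oracle-prod01;Port=1521;Sid=salesdb;"
                                "User Id=etl_user;Password=<secret>",
        },
    },
}

# linked service for data lake gen2: service type, storage account, authentication
ls_datalakegen2_mystorage101 = {
    "name": "ls_datalakegen2_mystorage101",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://mystorage101.dfs.core.windows.net",
            "accountKey": "<secret>",   # could also be a service principal / managed identity
        },
    },
}
------------------------------------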
task:
in oracle, under salesdb, there are two tables (sales, products)
extract those two tables and load them into the data lake storage account mystorage101
join the records of sales and products, and load the joined data into synapse (syndb, sales table); perform aggregations and load the aggregated data into the oracle salesdb summary table.
how many linked services are required? ---> 3 linked services (one per storage service: oracle salesdb, datalake gen2 mystorage101, synapse syndb).
task2:
oracle ---> salesdb ---> sales table
oracle ----> productsdb ---> products table.
extract sales and products tables, and dump into datalake
3 linked services are required:
ls_oracle_salesdb
ls_oracle_productsdb
ls_datalakegen2_mystorage101
---------------------------------
Integration runtime:
------------------
is a prerequisite for a linked service.
responsibilities:
1. establish connection.
2. data transportation.
3. activity execution.
3 types of Integration runtimes:
1. Auto Integration Runtime.
2. Self Hosted Integration Runtime
3. SSIS Integration Runtime.
1. Auto IR
-------------
old:
if the storage service is part of azure:
1. azure blob
2. azure data lake gen1
3. azure data lake gen2
4. azure sql
5. azure synapse
6. azure cosmos
new:
if the storage is on the azure cloud or another cloud (data is publicly exposed) and adf has a connector for that service:
1. azure blob -- auto ir
2. azure data lake gen1 -- auto ir
3. azure data lake gen2 -- auto ir
4. azure sql -- auto ir
5. azure synapse -- auto ir
6. azure cosmos -- auto ir
7. snowflake -- auto ir
8. amazon s3 -- auto ir
9. redshift -- auto ir
2. self hosted IR:
--------------------
old: if the storage service is not part of azure
new: if the storage is on-premises, or if the storage is on a cloud but adf does not have a connector for it
storage        mode      ir
---------------------------------------
oracle         onprem    self-hosted ir
db2            onprem    self-hosted ir
bluemix        cloud     self-hosted ir
sql server     onprem    self-hosted ir
------------------------------------
in azure,
I created a VM (image is Windows 2000)
on that VM, I have installed "sql server"
---> a self-hosted ir is required
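a minimal sketch of how a linked service gets pointed at a self-hosted ir: the "connectVia" block names the ir that makes the connection and moves the data. the ir name, vm name, and sql server details are illustrative assumptions.

# linked service for the sql server installed on the vm, routed through a self-hosted ir
ls_sqlserver_on_vm = {
    "name": "ls_sqlserver_salesdb",
    "properties": {
        "type": "SqlServer",
        "connectVia": {
            "referenceName": "ir_selfhosted_vm01",   # the self-hosted ir installed near the data
            "type": "IntegrationRuntimeReference",
        },
        "typeProperties": {
            "connectionString": "Server=myvm01;Database=salesdb;"
                                "User Id=etl_user;Password=<secret>",
        },
    },
}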
------------------------------------------
3. ssis ir
------------
ssis --> sql server integration services --> microsoft's on-premises etl tool.
ssis ir --> to run existing ssis packages in azure adf.
--------------
the client has 300 existing ssis packages --> without redeveloping them as adf pipelines, we can execute them in adf using the ssis ir
and the client wants 200 new functionalities --> those have to be developed as new adf pipelines (200)
summary:
ir-->
1. connection establishment with storage service
2. data transportation.
3. activity execution.
3 types of ir:
1. auto ir
2. self hosted ir
3. ssis ir.
---------------------------------------------------