Lesson 7: ADF Introduction *** Difference between ADF Version 1 and Version 2 *** DataFlows and Power Query as New Features in Version 2 *** DataFlows are for Data Transformations *** Power Query is for Data Preparation and Data Wrangling Activities *** Building Blocks of ADF *** Pipeline *** Activity *** DataSets *** Linked Service *** Integration Runtime *** Auto Integration Runtime *** Self-Hosted Integration Runtime *** SSIS Integration Runtime
ADF session 7 notes:
--------------------
ADF --> Azure Data Factory.
--> is a cloud data orchestration service (in general, we call it a cloud ETL/ELT service).
ADF has two versions:
1. version 1 ---> only extract and load operations are available.
   to perform transformations, additional azure services are used:
   1. HDInsight (Hadoop framework on the azure cloud)
      --> Pig (a dataflow language -- to perform transformations)
      ---> MapReduce
   2. azure functions (C# coding is used)
   3. stored procedures (to process structured data), etc.
   4. databricks (Spark --> Scala/Python/R/SQL)
2. version 2
   ---> DataFlows ---> used to perform transformations (zero coding)
   ---> Power Query ---> used for data preparation (data wrangling)
   with adf v2 ---> extract, load, and transform.
   extract and load are done by the adf pipeline's core activities.
   all transformations are done with "DataFlows".
Building Blocks of ADF:
----------------------
1. pipelines
2. activities
3. datasets
4. linked services
5. integration runtime
1. pipeline:
   ---> is a sequence of activities
        (the control flow of activities)
2. activity?
   ---> is an operation on data / a data object
3. dataset?
   a dataset is a snapshot/alias of a physical data object
   ---> it is a pointer to a physical data object.
oracle --> db name: salesdb ---> table: sales
azure data lake ---> container: mydata
                     storage account: mydatalake
task1:
extract data from the sales table and dump it into the azure data lake (mydata)
we need to create two datasets in adf:
data set1 ---> ds_oracle_salesdb_sales
data set2 ---> ds_datalake_gen2_mydatalake_mydata
copy data (activity)
source : ds_oracle_salesdb_sales
sink: ds_datalake_gen2_mydatalake_mydata
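a minimal sketch of how task1 maps onto adf objects -- two datasets plus one copy data activity inside a pipeline. written here as python dicts mirroring adf's json resource format; the dataset names come from the notes above, the other property values (types, locations) are illustrative assumptions, not an exact adf schema.

import json

# dataset pointing at the oracle source table (sales in salesdb)
ds_source = {
    "name": "ds_oracle_salesdb_sales",
    "properties": {
        "type": "OracleTable",
        "linkedServiceName": {"referenceName": "ls_oracle_salesdb",
                              "type": "LinkedServiceReference"},
        "typeProperties": {"tableName": "sales"},
    },
}

# dataset pointing at the data lake sink (mydata container in mydatalake)
ds_sink = {
    "name": "ds_datalake_gen2_mydatalake_mydata",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "ls_datalakegen2_mydatalake",
                              "type": "LinkedServiceReference"},
        "typeProperties": {"location": {"type": "AzureBlobFSLocation",
                                        "fileSystem": "mydata"}},
    },
}

# pipeline with a single copy data activity: source dataset -> sink dataset
pipeline = {
    "name": "pl_copy_sales_to_datalake",
    "properties": {
        "activities": [
            {
                "name": "copy_sales",
                "type": "Copy",
                "inputs": [{"referenceName": ds_source["name"], "type": "DatasetReference"}],
                "outputs": [{"referenceName": ds_sink["name"], "type": "DatasetReference"}],
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
------------------------------------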
task2:
extract data from an oracle table to the data lake and perform transformations;
the transformed data should be loaded into a synapse table.
How many datasets are required ? ---> 3 datasets.
dataset1 ---> ds_oracle_salesdb_sales
dataset2 --> ds_datalake_gen2_mydatalake_mydata
dataset3 --> ds_synapse_dwhdb_sales
4. Linked service:
---------------------
is a connection string.
if the storage service is a database,
the linked service has the following information:
server name (oracle)
database : salesdb
user :
password :
if the storage service is a data lake,
the linked service has the following information:
1. data lake(gen1/gen2/blob service)
2. storage account name
3. authentication
naming standards:
1. oracle --> salesdb --> 100 tables
ls_oracle_salesdb
2. datalake gen2 --> storage account : mystorage101 --> 15 containers
ls_datalakegen2_mystorage101
3. azure sql ---> db : mydb --> 200 tables
ls_azuresql_mydb
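a minimal sketch of the connection details a linked service holds for the two storage types above, again as python dicts in the style of adf's json definitions. the server/account names, secrets, and exact property keys are illustrative assumptions.

# linked service for an oracle database: server, database, user, password
ls_oracle_salesdb = {
    "name": "ls_oracle_salesdb",
    "properties": {
        "type": "Oracle",
        "typeProperties": {
            "connectionString": "Host=oracle-prod01;Port=1521;Sid=salesdb;"
                                "User Id=etl_user;Password=<secret>",
        },
    },
}

# linked service for data lake gen2: service type, storage account, authentication
ls_datalakegen2_mystorage101 = {
    "name": "ls_datalakegen2_mystorage101",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://mystorage101.dfs.core.windows.net",
            "accountKey": "<secret>",   # could also be a service principal / managed identity
        },
    },
}
------------------------------------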
task:
in oracle, under salesdb, there are two tables (sales, products)
extract those two tables and load them into the data lake storage account mystorage101
join the records of sales and products, and load the joined data into synapse (syndb, sales table); perform aggregations and load the aggregated data into the oracle salesdb summary table.
how many linked services are required? ---> 3 linked services (one per storage service: oracle salesdb, datalake gen2 mystorage101, synapse syndb).
task2:
oracle ---> salesdb ---> sales table
oracle ----> productsdb ---> products table.
extract sales and products tables, and dump into datalake
3 linked services are required:
ls_oracle_salesdb
ls_oracle_productsdb
ls_datalakegen2_mystorage101
---------------------------------
Integration runtime:
------------------
is a prerequisite for a linked service.
responsibilities:
1. establish connection.
2. data transportation.
3. activity execution.
3 types of Integration runtimes:
1. Auto Integration Runtime.
2. Self Hosted Integration Runtime
3. SSIS Integration Runtime.
1. Auto IR
-------------
old:
if the storage service is part of azure:
1. azure blob
2. azure data lake gen1
3. azure data lake gen2
4. azure sql
5. azure synapse
6. azure cosmos
new:
if the storage is on the azure cloud or another cloud (data is publicly exposed) and adf has a connector for that service:
1. azure blob -- auto ir
2. azure data lake gen1 -- auto ir
3. azure data lake gen2 -- auto ir
4. azure sql -- auto ir
5. azure synapse -- auto ir
6. azure cosmos -- auto ir
7. snowflake -- auto ir
8. amazon s3 -- auto ir
9. redshift -- auto ir
2. self hosted IR:
--------------------
old: if the storage service is not part of azure
new: if the storage is on-premises, or if the storage is on a cloud but adf does not have a connector for it
storage        mode      ir
---------------------------------------
oracle         onprem    self-hosted ir
db2            onprem    self-hosted ir
bluemix        cloud     self-hosted ir
sql server     onprem    self-hosted ir
------------------------------------
in azure,
I created a VM (image is Windows 2000)
on that VM, I have installed "sql server"
---> a self-hosted ir is required
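a minimal sketch of how a linked service gets pointed at a self-hosted ir: the "connectVia" block names the ir that makes the connection and moves the data. the ir name, vm name, and sql server details are illustrative assumptions.

# linked service for the sql server installed on the vm, routed through a self-hosted ir
ls_sqlserver_on_vm = {
    "name": "ls_sqlserver_salesdb",
    "properties": {
        "type": "SqlServer",
        "connectVia": {
            "referenceName": "ir_selfhosted_vm01",   # the self-hosted ir installed near the data
            "type": "IntegrationRuntimeReference",
        },
        "typeProperties": {
            "connectionString": "Server=myvm01;Database=salesdb;"
                                "User Id=etl_user;Password=<secret>",
        },
    },
}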
------------------------------------------
3. ssis ir
------------
ssis --> sql server integration services --> microsoft's on-premises etl tool.
ssis ir --> to run existing ssis packages in azure adf.
--------------
the client has 300 existing ssis packages --> without redeveloping them as adf pipelines, we can execute them in adf using the ssis ir
and the client wants 200 new functionalities --> those have to be developed as new adf pipelines (200)
summary:
ir-->
1. connection establishment with storage service
2. data transportation.
3. activity execution.
3 types of ir:
1. auto ir
2. self hosted ir
3. ssis ir.
---------------------------------------------------