Azure Data Factory (ADF) now has built-in functionality that supports ingesting data from xls and xlsx files. These files could be located in different places, including as Amazon S3, Amazon S3 Compatible Storage, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP/SFTP, Google Cloud Storage, HDFS, HTTP and Oracle Cloud Storage. Prior to ADF supporting such functionality, data engineers needed to apply workarounds, such as using PowerShell scripts or Azure Functions to convert the excel file into CSV.
In this post, I will develop an ADF pipeline to load an excel file from Azure Data Lake Gen 2 into an Azure SQL Database.
Step 1 – About the source file:
I have an excel workbook titled ‘2018-2020.xlsx’ sitting in Azure Data Lake Gen2 under the “excel dataset” folder. In this workbook, there are two sheets, “Data” and “Note”. The “Data” sheet contains exchange rates per date for different currencies, while the “Note” sheet has the full list of currencies with their codes and names.
Step 2 – Target database:
I provisioned an Azure SQL database called One51Training. This database will host the Exchange Rate data. As you can see, there are no tables created yet. I will configure the ADF pipeline to create one table per sheet. The table structure will reflect both the header and columns within each sheet.
Step 3 Building the data pipeline:
Now is the time to build and configure the ADF pipeline. I’ll be using the Copy Activity for the data transfer. To do this, the following ADF components are needed:
Linked Services: Contains the source connection details and credentials.
Datasets: Represents a named logical view of the source data.
Pipeline: It is the logical workflow of data transfer activities.
The following procedure outlines the required configuration:
In addition, I created a parameter to hold the sheet’s name. As I mentioned earlier, the excel file has two sheets, the first one has the rates, and the second one has the currency names and codes.
One of the advantages of using parameters is reusability, and I will leverage that in this case, as ADF will iterate through all sheets available in the Excel file.
[
{“SheetName”:”Data”},
{“SheetName”:”Note”}
]
Use the Pipeline parameter as Foreach Items setting
Edit the “ForEach” loop, add a Copy activity followed by configuring both Source and Sink:
For the Sink configuration, I am using the Auto-Create table option and adding a Drop table statement as a Pre-Copy script:
Add dynamic content:
@{Concat(‘DROP TABLE IF EXISTS ‘,item().SheetName)}
One51 helps Organisations with their data ingestion and data integration needs by using an approach and the technology that make this type of solutions easy to implement and a more cost-effective alternative to the traditional custom ETL development.
Contact us to know more about our Services.