Client: Butter & Bread Data
Duration: 5 months
Tools: Snowflake, Apache Airflow, Azure Data Lake Storage Gen2 (ADLS), REST APIs
Goal: To extract government data from publicly accessible APIs and transform nested XML structures into a tabular format in Snowflake using Airflow.
Description
The Dutch government publishes extensive open datasets on topics such as buildings, addresses, land plots, the Consumer Price Index (CPI), company information, and insolvency records. These datasets offer significant value to companies, yet they are typically delivered as XML, a nested format that does not map neatly onto relational tables. Tools like Azure Data Factory (ADF) have limited support for these APIs and for complex XML structures, which makes extraction challenging.
To address these needs, I built custom pipelines with Apache Airflow for products hosted by both Eneco and Butter & Bread Data. Airflow, an open-source orchestrator known for its flexibility and scalability, integrated cleanly with the Snowflake API and let me coordinate the entire extraction and transformation process. Each pipeline followed the same structure: download the files, unpack them and iteratively extract records from the nested XML, apply quality checks, and load the results into Snowflake tables. The pipelines were also designed for larger-than-memory datasets, processing files of up to 50 GB while keeping memory usage under 4 GB; a simplified sketch of this structure follows below.
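The sketch below is a minimal illustration of the pattern, not the project's actual code. It assumes Airflow 2.x with the TaskFlow API, the Snowflake provider, and the requests library; the DAG id, source URL, XML element names, column names, and Snowflake table are hypothetical placeholders. The extraction task streams the XML with the standard library's iterparse and clears processed records so memory stays bounded even for multi-gigabyte files.

```python
"""Minimal sketch of one ingestion DAG, assuming Airflow 2.x (TaskFlow API) with
the Snowflake provider and requests installed. URLs, element names, table names,
and the DAG id are hypothetical placeholders, not the project's configuration."""
import csv
import xml.etree.ElementTree as ET
from datetime import datetime
from pathlib import Path

import requests
from airflow.decorators import dag, task
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

SOURCE_URL = "https://example.overheid.nl/open-data/addresses.xml"  # placeholder
WORK_DIR = Path("/tmp/gov_xml_ingest")


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def gov_xml_ingest():

    @task
    def download() -> str:
        """Stream the source file to disk in chunks so it never sits fully in memory."""
        WORK_DIR.mkdir(parents=True, exist_ok=True)
        xml_path = WORK_DIR / "source.xml"
        with requests.get(SOURCE_URL, stream=True, timeout=300) as resp:
            resp.raise_for_status()
            with open(xml_path, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
        return str(xml_path)

    @task
    def extract_to_csv(xml_path: str) -> str:
        """Iteratively flatten nested XML records into a CSV with bounded memory.

        iterparse yields elements as their closing tags are reached; clearing the
        root after each processed record drops completed subtrees, so files far
        larger than RAM can be handled.
        """
        csv_path = WORK_DIR / "records.csv"
        context = ET.iterparse(xml_path, events=("start", "end"))
        _, root = next(context)  # first event is the start of the root element
        with open(csv_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["record_id", "street", "city"])  # hypothetical columns
            for event, elem in context:
                if event == "end" and elem.tag.endswith("Address"):  # hypothetical tag
                    writer.writerow([
                        elem.findtext("id"),
                        elem.findtext("street"),
                        elem.findtext("city"),
                    ])
                    root.clear()  # release all records processed so far
        return str(csv_path)

    @task
    def load_to_snowflake(csv_path: str) -> None:
        """Stage the flattened CSV and COPY it into a Snowflake table."""
        hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
        hook.run(f"PUT file://{csv_path} @%RAW_ADDRESSES OVERWRITE = TRUE")
        hook.run("COPY INTO RAW_ADDRESSES FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")

    load_to_snowflake(extract_to_csv(download()))


gov_xml_ingest()
```

Flattening to an intermediate CSV and bulk-loading with COPY INTO keeps the Snowflake side simple; in practice, the quality checks and the unpacking of multi-file archives mentioned above would slot in as additional tasks between extraction and load.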
Over the course of this project, I developed pipelines for more than 12 data sources, creating standardized and reusable components for future data ingestion tasks. This experience gave me valuable insights into XML-based government data structures and strengthened my expertise in designing scalable, efficient data pipelines tailored to specific project requirements.