Data Engineering Roadmap

Things to Learn:

Bash Scripting
Airflow and event-based triggering
How to ingest large volumes of data: multithreading and other techniques
Why pipelines fail and how to handle pipeline failures
Checking the data type of each column in every row
How to inspect metadata and use it to drive workflow triggers
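
As a starting point for the row-level type checking and failure handling items above, here is a minimal sketch in plain Python. The schema, column names, and the helper names `check_row` and `split_rows` are hypothetical and exist only for illustration. A real pipeline would more likely rely on the validation features of its processing framework. The idea is to route bad rows to a separate list instead of letting one malformed record fail the whole run.

```python
from typing import Any

# Hypothetical schema for illustration: column name -> expected Python type.
SCHEMA: dict[str, type] = {"id": int, "name": str, "amount": float}


def check_row(row: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return the problems found in one row; an empty list means the row is valid."""
    problems = []
    for column, expected in schema.items():
        if column not in row:
            problems.append(f"missing column {column!r}")
        elif type(row[column]) is not expected:
            # Exact type match, so that True is not silently accepted as an int.
            actual = type(row[column]).__name__
            problems.append(
                f"column {column!r}: expected {expected.__name__}, got {actual}"
            )
    return problems


def split_rows(rows, schema):
    """Separate valid rows from invalid ones, keeping the reasons for each rejection."""
    good, bad = [], []
    for row in rows:
        problems = check_row(row, schema)
        if problems:
            bad.append((row, problems))
        else:
            good.append(row)
    return good, bad


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "a", "amount": 1.5},
        {"id": "2", "name": "b", "amount": 2.0},  # id arrives as a string
        {"id": 3, "name": "c"},                   # amount is missing
    ]
    good, bad = split_rows(sample, SCHEMA)
    print(f"{len(good)} valid, {len(bad)} rejected")
    for row, problems in bad:
        print(row, problems)
```

Keeping the rejected rows together with their reasons makes it possible to write them to a quarantine location and investigate later, which is one common way to handle partial pipeline failures.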

<aside> Milestone 1

Week 1 - 2: Big Data Fundamentals / Data Lake Storage

Week 3:

Understanding how distributed processing works internally

Week 4: Apache Spark Core APIs

Week 5:

Getting started with DataFrames and Spark SQL

Week 6:

More Spark DataFrame transformations

Week 7: Apache Spark Caching

Week 8:

Spark Architecture and Aggregate functions

Week 9:

Internals of Spark (calculating initial number of partitions, parallelism, partitioning, bucketing)

Week 10 - 11: Performance Tuning in Spark

Week 12 - 13:

Week 14: Best Industry Practices for the project

Git & GitHub
CI/CD

Week 15: Apache Hive

</aside>

<aside> Milestone 2

Week 16: Azure Cloud Fundamentals

Week 17:

Azure Storage & Databricks fundamentals

Week 18:

Delta Lake

Week 19:

Delta Architecture

Week 20:

Databricks Unity Catalog

Week 21 - 22:

Azure Data Factory

Week 23:

Azure Synapse

Week 24:

Capstone project using Azure Cloud

</aside>

<aside> Milestone 3

Week 25:

Data Modeling & System Design

Week 26 - 27:

Spark Structured Streaming

Week 28:

Databricks Auto Loader

Week 29:

Apache Kafka

Week 30 - 32: Big Data on AWS Cloud

</aside>