Things to Learn:
- Bash Scripting
- Airflow and event triggering
- How to ingest very large datasets: multithreading and more
- Why pipelines fail and how to handle failures
- Validating the data type of each row
- How to inspect metadata and use it to drive triggers in a workflow
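
As a starting point for the row-level type check item above, here is a minimal sketch in plain Python. The schema, column names, and the `find_type_errors` helper are all hypothetical, chosen only for illustration; a real pipeline would more likely lean on its framework's own schema enforcement.

```python
from datetime import date

# Hypothetical schema: column name -> expected Python type (assumption, not from any real dataset).
EXPECTED_TYPES = {"id": int, "name": str, "signup_date": date}


def find_type_errors(rows, schema):
    """Return (row_index, column, problem) for every value that does not match the schema."""
    errors = []
    for index, row in enumerate(rows):
        for column, expected in schema.items():
            if column not in row:
                errors.append((index, column, "missing"))
            elif not isinstance(row[column], expected):
                actual = type(row[column]).__name__
                errors.append((index, column, f"expected {expected.__name__}, got {actual}"))
    return errors


# Made-up sample rows: the second has a string id, the third lacks signup_date.
rows = [
    {"id": 1, "name": "a", "signup_date": date(2024, 1, 5)},
    {"id": "2", "name": "b", "signup_date": date(2024, 1, 6)},
    {"id": 3, "name": "c"},
]

errors = find_type_errors(rows, EXPECTED_TYPES)
for row_index, column, problem in errors:
    print(f"row {row_index}, column {column}: {problem}")
```

Collecting every error instead of stopping at the first makes it easier to decide later whether to reject the whole batch or quarantine only the bad rows.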
<aside>
<img src="/icons/bullseye_gray.svg" alt="/icons/bullseye_gray.svg" width="40px" /> Milestone 1
Week 1 - 2: Big Data Fundamentals / Data Lake Storage
Week 3:
- Understanding how distributed processing works internally
Week 4: Apache Spark Core APIs
Week 5:
- Getting started with DataFrames and Spark SQL
Week 6:
- More Spark DataFrame transformations
Week 7: Apache Spark Caching
Week 8:
- Spark Architecture and Aggregate functions
Week 9:
- Internals of Spark (calculating initial number of partitions, parallelism, partitioning, bucketing)
Week 10 - 11: Performance Tuning in Spark
Week 12 - 13:
Week 14: Industry best practices for the project
Week 15: Apache Hive
</aside>
<aside>
<img src="/icons/bullseye_gray.svg" alt="/icons/bullseye_gray.svg" width="40px" /> Milestone 2
Week 16: Azure Cloud Fundamentals
Week 17:
- Azure Storage and Databricks fundamentals
Week 18:
Week 19:
Week 20:
Week 21 - 22:
Week 23:
Week 24:
- Capstone project using Azure Cloud
</aside>
<aside>
<img src="/icons/bullseye_gray.svg" alt="/icons/bullseye_gray.svg" width="40px" /> Milestone 3
Week 25:
- Data Modeling & System Design
Week 26 - 27:
- Spark Structured Streaming
Week 28:
Week 29:
Week 30 - 32: Big Data on AWS Cloud
</aside>