BADM 558 - Big Data Infrastructure
Program-level details: See program/curriculum.md
Status: Draft. Initial outline; pending instructor review.
| Credits: 4 | Term: Spring 2027 (Weeks 1-8) | Instructor: Ashish |
Course Vision
Students master cloud-based big data infrastructure and modern data engineering tools. Using GCP services, dbt, and both BigQuery and Snowflake, students build scalable data pipelines, work with cloud storage (GCS), distributed processing (Spark via Dataproc), and data warehouses. By course end, students understand how to architect data systems for large-scale applications using the modern data stack.
Bridge Module: Linux CLI & Cloud Orientation (Pre-Course, ~8 hours)
Complete before Week 1. Available in Canvas as a self-paced module with self-check quizzes. Designed for students with no prior command-line experience.
| Unit | Topics | Format | Self-Check |
|---|---|---|---|
| 1. The Command Line from Scratch (2 hrs) | What is a terminal, navigating directories, listing files, creating/moving/deleting files and folders | Jupyter-based terminal exercises (no local install needed) | Quiz: navigate to a directory, create a file, move it |
| 2. File Permissions & SSH (1.5 hrs) | Read/write/execute permissions, chmod, connecting to remote machines via SSH, key pairs | Guided walkthrough + practice exercises | Quiz: set permissions on a file, SSH into a provided server |
| 3. Shell Scripting Basics (1.5 hrs) | Variables, loops, conditionals in bash, writing a simple automation script | Write-along exercises | Quiz: write a script that processes 3 CSV files |
| 4. GCP Account Setup (2 hrs) | Create GCP account, navigate the console, understand regions/services/free tier, set up IAM, configure gcloud CLI | Step-by-step guided walkthrough with screenshots | Checkpoint: successfully run gsutil ls from your terminal |
| 5. Cloud Cost Awareness (1 hr) | Free tier limits, setting billing alerts, estimating costs, shutting down resources to avoid charges | Walkthrough + cost calculator exercise | Quiz: estimate monthly cost for a given GCS + Compute Engine scenario |
Readiness check: Students who pass all 5 self-check quizzes (70% threshold) are cleared for Week 1. Students who don’t pass receive targeted resources and can retake.
Note: This replaces the previous 2-3 hour Linux CLI module. Career pivoters from non-technical backgrounds need more scaffolding to be confident with the command line and cloud console before diving into GCP, Spark, and dbt.
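Unit 5's cost estimate can be sketched in a few lines of Python. The rates below are illustrative assumptions, not current GCP prices; students should use the GCP pricing calculator for real figures.

```python
# Rough monthly cost estimate for a small GCS + Compute Engine setup.
# Both rates are illustrative assumptions, NOT current GCP prices;
# always check the GCP pricing calculator for real numbers.
GCS_STANDARD_PER_GB_MONTH = 0.020   # assumed $/GB-month, standard storage
VM_PER_HOUR = 0.017                 # assumed $/hour for a small instance

def estimate_monthly_cost(storage_gb: float, vm_hours: float) -> float:
    """Return an estimated monthly cost in USD, rounded to cents."""
    storage_cost = storage_gb * GCS_STANDARD_PER_GB_MONTH
    compute_cost = vm_hours * VM_PER_HOUR
    return round(storage_cost + compute_cost, 2)

# Example: 50 GB in GCS plus a VM running 40 hours/month
print(estimate_monthly_cost(50, 40))  # 1.68
```

The same shape extends to egress and per-query charges once students meet BigQuery in Week 6.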
Learning Outcomes (L-C-E Framework)
Literacy (Foundational Awareness)
- L1: Explain cloud computing benefits (scalability, cost, flexibility) and describe major cloud providers
- L2: Identify when big data infrastructure is needed versus a traditional database
- L3: Recognize GCP service categories (compute, storage, database, analytics) and modern data stack components (dbt, BigQuery, Snowflake)
Competency (Applied Skills)
- C1: Build data pipelines using GCP (GCS, Compute Engine, Cloud Functions)
- C2: Use Apache Spark for distributed processing of large datasets
- C3: Design and query data warehouses using both BigQuery and Snowflake
- C4: Build and test dbt models with documentation and lineage tracking
- C5: Implement basic data security and IAM practices
Expertise (Advanced Application)
- E1: Architect an end-to-end big data solution (ingest -> store -> process -> analyze)
- E2: Optimize data pipelines for cost and performance
- E3: Implement monitoring, logging, and error handling for production systems
Week-by-Week Breakdown
| Week | Topic | Lectures | Project Work | Studio Session | Assessment |
|---|---|---|---|---|---|
| 1 | Cloud fundamentals + GCP overview | 3 videos | GCP account setup, GCS bucket creation | GCP fundamentals - regions, services, free tier | Assignment 1 |
| 2 | GCS + data lakes + Compute Engine essentials | 2 videos | Data ingestion to GCS, launch instances | GCS + Compute Engine workshop - buckets, permissions, instance types, firewall rules | Assignment 2 + M1 |
| 3 | dbt fundamentals | 2 videos | Build dbt models for team dataset | dbt workshop - models, tests, documentation, lineage graphs | Assignment 3 + M2 |
| 4 | Spark fundamentals + RDD/DataFrames | 3 videos | Dataproc cluster setup, first job | Spark basics - distributed processing, lazy evaluation | Assignment 4 |
| 5 | Spark SQL + data processing | 3 videos | Process team dataset at scale | Spark SQL - DataFrame operations at scale | Assignment 5 + M3 |
| 6 | Data warehousing: BigQuery + Snowflake | 2 videos | Load warehouse, platform comparison | BigQuery vs. Snowflake - columnar storage, SQL queries, platform comparison | Assignment 6 |
| 7 | Data pipelines + orchestration | 2 videos | Wire up end-to-end pipeline | Cloud Functions + Cloud Composer - serverless automation | Assignment 7 + M4 |
| 8 | Security, monitoring, synthesis | 1 video | Final integration + reflection | GCP security + monitoring - Cloud Logging, alerts, cost | Assignment 8 + Final deliverable + Oral defense |
Team Project: End-to-End Data Pipeline (Team of 3)
One major team project runs across all 8 weeks. Teams progressively build a complete big data pipeline — from cloud setup and data ingestion through Spark processing, warehousing, and orchestration — culminating in a live demo and oral defense.
Pipeline Components:
- Ingestion: Data sources ingested into GCS (public datasets, APIs, logs)
- Transformation: dbt models with tests, documentation, and lineage
- Processing: Spark job (via Dataproc) cleaning + transforming data at scale
- Storage: Results in data lake (GCS) and warehouse (BigQuery/Snowflake)
- Orchestration: Scheduled pipeline via Cloud Functions + Cloud Composer
- Reporting: Dashboard or SQL queries showing results
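The six components above form a single flow: ingest, transform, store, report. This plain-Python sketch makes the stage boundaries concrete; every function here is a hypothetical stand-in (ingest() for landing files in GCS, transform() for dbt/Spark cleaning, load() for the warehouse layer), not actual GCP code.

```python
# Conceptual sketch of the pipeline stages, using plain-Python stand-ins.

def ingest(source_rows):
    """Ingestion: land raw records unchanged (copy, don't mutate source)."""
    return [dict(r) for r in source_rows]

def transform(raw_rows):
    """Transformation: drop bad rows and normalize types."""
    cleaned = [r for r in raw_rows if r.get("amount") is not None]
    for r in cleaned:
        r["amount"] = float(r["amount"])
    return cleaned

def load(rows):
    """Storage/reporting: the aggregate a warehouse query would serve."""
    return {"row_count": len(rows), "total": sum(r["amount"] for r in rows)}

raw = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": None}, {"id": 3, "amount": "4.5"}]
print(load(transform(ingest(raw))))  # {'row_count': 2, 'total': 15.0}
```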
Final Deliverables:
- GCS bucket with proper structure (raw, processed, output folders)
- dbt project with staging/mart models, schema tests, documentation, lineage graph
- PySpark application (via Dataproc) processing a 1 GB+ dataset, with performance analysis
- Output loaded to both BigQuery and Snowflake (platform comparison write-up)
- Cloud Functions for scheduled ingestion
- Cloud Composer DAG orchestrating the full pipeline
- Cloud Monitoring + alerts
- Cost estimation (monthly running cost)
- Architecture diagram + operational runbook
- GitHub repo with all code + infrastructure templates
- Peer evaluation of team contributions
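The schema tests in the dbt deliverable are, at bottom, queries that must return zero rows. This stdlib-only sketch mimics what `dbt test` checks for `not_null` and `unique` on a staging model; the real project declares these tests in `schema.yml`.

```python
# Row-level stand-ins for two built-in dbt schema tests.

def not_null_failures(rows, column):
    """Rows that would fail dbt's not_null test on `column`."""
    return [r for r in rows if r.get(column) is None]

def unique_failures(rows, column):
    """Rows that would fail dbt's unique test on `column`."""
    seen, dupes = set(), []
    for r in rows:
        value = r.get(column)
        if value in seen:
            dupes.append(r)
        seen.add(value)
    return dupes

rows = [{"order_id": 1}, {"order_id": 2}, {"order_id": 2}, {"order_id": None}]
print(len(not_null_failures(rows, "order_id")))  # 1
print(len(unique_failures(rows, "order_id")))    # 1
```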
Rubric (5 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Pipeline Architecture | Well-designed, scalable, modular with clear data flow | Functional, some inefficiency | Basic or problematic design |
| Data Engineering | Clean dbt models with tests + efficient Spark transformations | Functional models and code | Minimal tests, inefficient code |
| Platform Comparison | Thoughtful BigQuery vs. Snowflake analysis with performance data | Basic comparison | Minimal or missing |
| Automation & Monitoring | Fully automated via Cloud Composer, comprehensive Cloud Monitoring + alerts | Mostly automated, basic monitoring | Manual steps, minimal visibility |
| Production Readiness | Error handling, rollback plan, cost-optimized, tested | Mostly robust | Lacks error handling |
Weekly Assignments (Individual)
Hands-on labs and deliverables that build skills toward the team project. Each assignment is due at the end of its week.
| Week | Assignment | Focus |
|---|---|---|
| 1 | GCP account setup + GCS bucket creation | Cloud fundamentals, IAM, bucket policies |
| 2 | Data ingestion demo — upload 2+ sources to GCS | GCS organization, data formats, access control |
| 3 | dbt model review — build and test staging/mart models | dbt fundamentals, schema tests, documentation |
| 4 | Dataproc cluster setup + first Spark job submission | PySpark environment, RDD/DataFrame basics |
| 5 | Spark data processing — clean and aggregate a large dataset | Spark SQL, DataFrame operations, optimization |
| 6 | Warehouse query lab — load data to BigQuery and Snowflake | Columnar storage, SQL queries, platform comparison |
| 7 | Pipeline orchestration — wire up Cloud Functions + Cloud Composer | Serverless automation, scheduling, error handling |
| 8 | Code review + reflection | Peer code review, cost analysis, lessons learned |
Project Milestones (Team)
Progressive checkpoints ensuring teams are on track for the final pipeline.
| Milestone | Due | Deliverable |
|---|---|---|
| M1: Cloud Infrastructure | End of Week 2 | GCS bucket with ingested data, IAM roles, cost estimate |
| M2: Transformation Layer | End of Week 3 | Working dbt project with tests and lineage graph |
| M3: Processing Layer | End of Week 5 | PySpark application (Dataproc) processing team dataset, performance benchmarks |
| M4: Integration & Orchestration | End of Week 7 | End-to-end pipeline running on schedule with Cloud Monitoring |
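M4's scheduled pipeline is, at heart, a DAG of tasks. This stdlib sketch (task names hypothetical) resolves a valid run order from declared dependencies; the actual milestone expresses the same structure with Airflow operators in a Cloud Composer DAG.

```python
# M4's pipeline reduced to its essence: tasks plus dependencies.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

deps = {
    "ingest_to_gcs": set(),
    "spark_clean": {"ingest_to_gcs"},
    "dbt_models": {"spark_clean"},
    "load_bigquery": {"dbt_models"},
    "load_snowflake": {"dbt_models"},
}

# static_order() yields each task only after all of its upstream deps.
order = list(TopologicalSorter(deps).static_order())
print(order[0])  # ingest_to_gcs -- the only task with no upstream deps
```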
Final Project Deliverable (Week 8, Team)
Complete end-to-end data pipeline with all components integrated, architecture diagram, operational runbook, and GitHub repo. Graded on the full rubric above.
Oral Defense (Week 8, Team)
Each team presents their pipeline architecture and gives a live demo. All team members must answer questions individually. Evaluates technical depth, design rationale, and ability to explain decisions without AI assistance.
AI Tools Integration
Weeks 1-3 (Cloud Setup + dbt):
- Use Claude/ChatGPT to:
  - Explain GCP services + when to use each
  - Debug IAM permission issues
  - Generate dbt model templates and tests
  - Review bucket policies
Weeks 4-6 (Spark Processing + Warehousing):
- Use AI to:
  - Suggest Spark DataFrame transformations
  - Debug PySpark errors
  - Compare BigQuery vs. Snowflake features
  - Generate cost calculations
Weeks 7-8 (Pipeline):
- Use AI to:
  - Generate Cloud Functions code
  - Debug Cloud Composer DAG workflows
  - Suggest monitoring strategies
  - Review error handling patterns
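As a reference point for the Weeks 7-8 prompts, a GCS-triggered background Cloud Function can be as small as the sketch below. The `(event, context)` signature follows the 1st-gen background-function style, where the event carries the object name; the prefix-based routing logic is a hypothetical example, not project code.

```python
# Minimal sketch of a GCS-triggered background Cloud Function.

def on_file_uploaded(event, context=None):
    """Route a newly uploaded object to a pipeline stage by its prefix."""
    name = event.get("name", "")  # GCS object path, e.g. "raw/x.csv"
    if name.startswith("raw/"):
        return {"action": "trigger_spark_job", "object": name}
    if name.startswith("processed/"):
        return {"action": "load_to_warehouse", "object": name}
    return {"action": "ignore", "object": name}

# Simulated trigger payload (what GCS sends on object finalize)
print(on_file_uploaded({"name": "raw/sales_2027_01.csv"})["action"])  # trigger_spark_job
```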
Studio Session Topics:
- Week 1: Cloud fundamentals + GCP services overview
- Week 2: GCS + Compute Engine essentials workshop
- Week 3: dbt fundamentals — models, tests, documentation
- Week 4: Spark architecture + RDD/DataFrame concepts (Dataproc)
- Week 5: Spark SQL optimization + cost implications
- Week 6: Data warehouse comparison — BigQuery vs. Snowflake
- Week 7: Orchestration + scheduling (Cloud Functions, Cloud Composer, Airflow)
- Week 8: Team presentations + operational best practices
Assessment Summary
| Component | Weight | Notes |
|---|---|---|
| Weekly assignments | 30% | 8 individual labs/deliverables |
| Project milestones | 25% | 4 team checkpoints (M1-M4) |
| Final project deliverable | 15% | Week 8, team |
| Oral defense | 25% | Week 8, team (individual Q&A) |
| Studio participation | 5% | Weekly attendance + engagement |
No traditional exam. One major team project with weekly individual assignments building toward it.
AI Usage Levels (AIAS)
| Assessment | AIAS Level | AI Permitted |
|---|---|---|
| Weekly assignments (Weeks 1-3) | 2 | AI for GCP setup, dbt template generation, IAM debugging — with attribution |
| Weekly assignments (Weeks 4-6) | 2 | AI for PySpark debugging, SQL optimization — with attribution |
| Weekly assignments (Weeks 7-8) | 3 | AI as collaborator for Cloud Functions code, monitoring strategy — with full disclosure |
| Project milestones | 3 | AI as collaborator for pipeline design and implementation — with full disclosure |
| Final project deliverable | 3 | AI as collaborator — with full disclosure of all AI-assisted components |
| Oral defense | 0 | No AI |
| Studio participation | 1 | AI for exploration during exercises |
Technology Stack
- Cloud: GCP (GCS, Compute Engine, Cloud Functions, Cloud SQL, BigQuery, Cloud Composer)
- Data Warehousing: BigQuery + Snowflake (students learn both platforms)
- Data Transformation: dbt (models, tests, documentation)
- Processing: Apache Spark (PySpark) via Cloud Dataproc, Python
- Languages: Python, SQL
- Infrastructure: Terraform (optional)
- Monitoring: Cloud Monitoring, Cloud Logging
- IDE: VS Code with GitHub Copilot; Google Colab (browser alternative for non-Spark work)
- Notebooks: Jupyter with Spark kernel (via VS Code or JupyterHub)
- Version Control: GitHub
Prerequisites
- Completion of BADM 554 (SQL) + FIN 550 (Python) (or equivalent)
- Comfortable with Linux command line (pre-course module required)
| Course Sequence: ← FIN 550 — Big Data Analytics in Finance | Next: Agentic AI for Analytics → |