BADM 558 - Big Data Infrastructure
Program-level details: See program/curriculum.md
Status: Draft. Initial outline; pending instructor review.
| Credits: 4 | Term: Spring 2027 (Weeks 1-8) | Instructor: Ashish |
Course Vision
Students master cloud-based big data infrastructure and modern data engineering tools. Using GCP services, dbt, and both BigQuery and Snowflake, students build scalable data pipelines, work with cloud storage (GCS), distributed processing (Spark via Dataproc), and data warehouses. By course end, students understand how to architect data systems for large-scale applications using the modern data stack.
Bridge Module: Linux CLI & Cloud Orientation (Pre-Course, ~8 hours)
Complete before Week 1. Available in Canvas as a self-paced module with self-check quizzes. Designed for students with no prior command-line experience.
| Unit | Topics | Format | Self-Check |
|---|---|---|---|
| 1. The Command Line from Scratch (2 hrs) | What is a terminal, navigating directories, listing files, creating/moving/deleting files and folders | Jupyter-based terminal exercises (no local install needed) | Quiz: navigate to a directory, create a file, move it |
| 2. File Permissions & SSH (1.5 hrs) | Read/write/execute permissions, chmod, connecting to remote machines via SSH, key pairs | Guided walkthrough + practice exercises | Quiz: set permissions on a file, SSH into a provided server |
| 3. Shell Scripting Basics (1.5 hrs) | Variables, loops, conditionals in bash, writing a simple automation script | Write-along exercises | Quiz: write a script that processes 3 CSV files |
| 4. GCP Account Setup (2 hrs) | Create GCP account, navigate the console, understand regions/services/free tier, set up IAM, configure gcloud CLI | Step-by-step guided walkthrough with screenshots | Checkpoint: successfully run gsutil ls from your terminal |
| 5. Cloud Cost Awareness (1 hr) | Free tier limits, setting billing alerts, estimating costs, shutting down resources to avoid charges | Walkthrough + cost calculator exercise | Quiz: estimate monthly cost for a given GCS + Compute Engine scenario |
Readiness check: Students who pass all 5 self-check quizzes (70% threshold) are cleared for Week 1. Students who don’t pass receive targeted resources and can retake.
Note: This replaces the previous 2-3 hour Linux CLI module. Career pivoters from non-technical backgrounds need more scaffolding to be confident with the command line and cloud console before diving into GCP, Spark, and dbt.
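Unit 5's cost estimate can be sketched in a few lines of Python. The rates below are illustrative assumptions, not current GCP prices; students should use the GCP pricing calculator for real figures.

```python
# Rough monthly cost estimate for a small GCS + Compute Engine setup.
# Both rates are illustrative assumptions, NOT current GCP prices;
# always check the GCP pricing calculator for real numbers.
GCS_STANDARD_PER_GB_MONTH = 0.020   # assumed $/GB-month, standard storage
VM_PER_HOUR = 0.017                 # assumed $/hour for a small instance

def estimate_monthly_cost(storage_gb: float, vm_hours: float) -> float:
    """Return an estimated monthly cost in USD, rounded to cents."""
    storage_cost = storage_gb * GCS_STANDARD_PER_GB_MONTH
    compute_cost = vm_hours * VM_PER_HOUR
    return round(storage_cost + compute_cost, 2)

# Example: 50 GB in GCS plus a VM running 40 hours/month
print(estimate_monthly_cost(50, 40))  # 1.68
```

The same shape extends to egress and per-query charges once students meet BigQuery in Week 6.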
Learning Outcomes (L-C-E Framework)
Literacy (Foundational Awareness)
- L1: Explain cloud computing benefits (scalability, cost, flexibility) and describe major cloud providers
- L2: Identify when big data infrastructure is needed versus a traditional database
- L3: Recognize GCP service categories (compute, storage, database, analytics) and modern data stack components (dbt, BigQuery, Snowflake)
Competency (Applied Skills)
- C1: Build data pipelines using GCP (GCS, Compute Engine, Cloud Functions)
- C2: Use Apache Spark for distributed processing of large datasets
- C3: Design and query data warehouses using both BigQuery and Snowflake
- C4: Build and test dbt models with documentation and lineage tracking
- C5: Implement basic data security and IAM practices
Expertise (Advanced Application)
- E1: Architect an end-to-end big data solution (ingest -> store -> process -> analyze)
- E2: Optimize data pipelines for cost and performance
- E3: Implement monitoring, logging, and error handling for production systems
Week-by-Week Breakdown
| Week | Topic | Lectures | Project Work | Studio Session | Assessment |
|---|---|---|---|---|---|
| 1 | Cloud fundamentals + GCP overview | 3 videos | GCP account setup, GCS bucket creation | GCP fundamentals - regions, services, free tier | Assignment 1 |
| 2 | GCS + data lakes + Compute Engine essentials | 2 videos | Data ingestion to GCS, launch instances | GCS + Compute Engine workshop - buckets, permissions, instance types, firewall rules | Assignment 2 + M1 |
| 3 | dbt fundamentals | 2 videos | Build dbt models for team dataset | dbt workshop - models, tests, documentation, lineage graphs | Assignment 3 + M2 |
| 4 | Spark fundamentals + RDD/DataFrames | 3 videos | Dataproc cluster setup, first job | Spark basics - distributed processing, lazy evaluation | Assignment 4 |
| 5 | Spark SQL + data processing | 3 videos | Process team dataset at scale | Spark SQL - DataFrame operations at scale | Assignment 5 + M3 |
| 6 | Data warehousing: BigQuery + Snowflake | 2 videos | Load warehouse, platform comparison | BigQuery vs. Snowflake - columnar storage, SQL queries, platform comparison | Assignment 6 |
| 7 | Data pipelines + orchestration | 2 videos | Wire up end-to-end pipeline | Cloud Functions + Cloud Composer - serverless automation | Assignment 7 + M4 |
| 8 | Security, monitoring, synthesis | 1 video | Final integration + reflection | GCP security + monitoring - Cloud Logging, alerts, cost | Assignment 8 + Final deliverable + Oral defense |
Team Project: End-to-End Data Pipeline (Team of 3)
One major team project runs across all 8 weeks. Teams progressively build a complete big data pipeline — from cloud setup and data ingestion through Spark processing, warehousing, and orchestration — culminating in a live demo and oral defense.
Pipeline Components:
- Ingestion: Data sources ingested into GCS (public datasets, APIs, logs)
- Transformation: dbt models with tests, documentation, and lineage
- Processing: Spark job (via Dataproc) cleaning + transforming data at scale
- Storage: Results in data lake (GCS) and warehouse (BigQuery/Snowflake)
- Orchestration: Scheduled pipeline via Cloud Functions + Cloud Composer
- Reporting: Dashboard or SQL queries showing results
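The six components above form a single flow: ingest, transform, store, report. This plain-Python sketch makes the stage boundaries concrete; every function here is a hypothetical stand-in (ingest() for landing files in GCS, transform() for dbt/Spark cleaning, load() for the warehouse layer), not actual GCP code.

```python
# Conceptual sketch of the pipeline stages, using plain-Python stand-ins.

def ingest(source_rows):
    """Ingestion: land raw records unchanged (copy, don't mutate source)."""
    return [dict(r) for r in source_rows]

def transform(raw_rows):
    """Transformation: drop bad rows and normalize types."""
    cleaned = [r for r in raw_rows if r.get("amount") is not None]
    for r in cleaned:
        r["amount"] = float(r["amount"])
    return cleaned

def load(rows):
    """Storage/reporting: the aggregate a warehouse query would serve."""
    return {"row_count": len(rows), "total": sum(r["amount"] for r in rows)}

raw = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": None}, {"id": 3, "amount": "4.5"}]
print(load(transform(ingest(raw))))  # {'row_count': 2, 'total': 15.0}
```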
Final Deliverables:
- GCS bucket with proper structure (raw, processed, output folders)
- dbt project with staging/mart models, schema tests, documentation, lineage graph
- PySpark application (via Dataproc) processing a 1 GB+ dataset, with performance analysis
- Output loaded to both BigQuery and Snowflake (platform comparison write-up)
- Cloud Functions for scheduled ingestion
- Cloud Composer DAG orchestrating the full pipeline
- Cloud Monitoring + alerts
- Cost estimation (monthly running cost)
- Architecture diagram + operational runbook
- GitHub repo with all code + infrastructure templates
- Peer evaluation of team contributions
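The schema tests in the dbt deliverable are, at bottom, queries that must return zero rows. This stdlib-only sketch mimics what `dbt test` checks for `not_null` and `unique` on a staging model; the real project declares these tests in `schema.yml`.

```python
# Row-level stand-ins for two built-in dbt schema tests.

def not_null_failures(rows, column):
    """Rows that would fail dbt's not_null test on `column`."""
    return [r for r in rows if r.get(column) is None]

def unique_failures(rows, column):
    """Rows that would fail dbt's unique test on `column`."""
    seen, dupes = set(), []
    for r in rows:
        value = r.get(column)
        if value in seen:
            dupes.append(r)
        seen.add(value)
    return dupes

rows = [{"order_id": 1}, {"order_id": 2}, {"order_id": 2}, {"order_id": None}]
print(len(not_null_failures(rows, "order_id")))  # 1
print(len(unique_failures(rows, "order_id")))    # 1
```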
Rubric (5 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Pipeline Architecture | Well-designed, scalable, modular with clear data flow | Functional, some inefficiency | Basic or problematic design |
| Data Engineering | Clean dbt models with tests + efficient Spark transformations | Functional models and code | Minimal tests, inefficient code |
| Platform Comparison | Thoughtful BigQuery vs. Snowflake analysis with performance data | Basic comparison | Minimal or missing |
| Automation & Monitoring | Fully automated via Cloud Composer, comprehensive Cloud Monitoring + alerts | Mostly automated, basic monitoring | Manual steps, minimal visibility |
| Production Readiness | Error handling, rollback plan, cost-optimized, tested | Mostly robust | Lacks error handling |
Weekly Assignments (Individual)
Hands-on labs and deliverables that build skills toward the team project. Each assignment is due at the end of its week.
| Week | Assignment | Focus |
|---|---|---|
| 1 | GCP account setup + GCS bucket creation | Cloud fundamentals, IAM, bucket policies |
| 2 | Data ingestion demo — upload 2+ sources to GCS | GCS organization, data formats, access control |
| 3 | dbt model review — build and test staging/mart models | dbt fundamentals, schema tests, documentation |
| 4 | Dataproc cluster setup + first Spark job submission | PySpark environment, RDD/DataFrame basics |
| 5 | Spark data processing — clean and aggregate a large dataset | Spark SQL, DataFrame operations, optimization |
| 6 | Warehouse query lab — load data to BigQuery and Snowflake | Columnar storage, SQL queries, platform comparison |
| 7 | Pipeline orchestration — wire up Cloud Functions + Cloud Composer | Serverless automation, scheduling, error handling |
| 8 | Code review + reflection | Peer code review, cost analysis, lessons learned |
Project Milestones (Team)
Progressive checkpoints ensuring teams are on track for the final pipeline.
| Milestone | Due | Deliverable |
|---|---|---|
| M1: Cloud Infrastructure | End of Week 2 | GCS bucket with ingested data, IAM roles, cost estimate |
| M2: Transformation Layer | End of Week 3 | Working dbt project with tests and lineage graph |
| M3: Processing Layer | End of Week 5 | PySpark application (Dataproc) processing team dataset, performance benchmarks |
| M4: Integration & Orchestration | End of Week 7 | End-to-end pipeline running on schedule with Cloud Monitoring |
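M4's scheduled pipeline is, at heart, a DAG of tasks. This stdlib sketch (task names hypothetical) resolves a valid run order from declared dependencies; the actual milestone expresses the same structure with Airflow operators in a Cloud Composer DAG.

```python
# M4's pipeline reduced to its essence: tasks plus dependencies.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

deps = {
    "ingest_to_gcs": set(),
    "spark_clean": {"ingest_to_gcs"},
    "dbt_models": {"spark_clean"},
    "load_bigquery": {"dbt_models"},
    "load_snowflake": {"dbt_models"},
}

# static_order() yields each task only after all of its upstream deps.
order = list(TopologicalSorter(deps).static_order())
print(order[0])  # ingest_to_gcs -- the only task with no upstream deps
```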
Final Project Deliverable (Week 8, Team)
Complete end-to-end data pipeline with all components integrated, architecture diagram, operational runbook, and GitHub repo. Graded on the full rubric above.
Oral Defense (Week 8, Team)
Each team presents their pipeline architecture and gives a live demo. All team members must answer questions individually. Evaluates technical depth, design rationale, and ability to explain decisions without AI assistance.
AI Tools Integration
Weeks 1-3 (Cloud Setup + dbt):
- Use Claude/ChatGPT to:
  - Explain GCP services + when to use each
  - Debug IAM permission issues
  - Generate dbt model templates and tests
  - Review bucket policies
Weeks 4-6 (Spark Processing + Warehousing):
- Use AI to:
  - Suggest Spark DataFrame transformations
  - Debug PySpark errors
  - Compare BigQuery vs. Snowflake features
  - Generate cost calculations
Weeks 7-8 (Pipeline):
- Use AI to:
  - Generate Cloud Functions code
  - Debug Cloud Composer DAG workflows
  - Suggest monitoring strategies
  - Review error handling patterns
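As a reference point for the Weeks 7-8 prompts, a GCS-triggered background Cloud Function can be as small as the sketch below. The `(event, context)` signature follows the 1st-gen background-function style, where the event carries the object name; the prefix-based routing logic is a hypothetical example, not project code.

```python
# Minimal sketch of a GCS-triggered background Cloud Function.

def on_file_uploaded(event, context=None):
    """Route a newly uploaded object to a pipeline stage by its prefix."""
    name = event.get("name", "")  # GCS object path, e.g. "raw/x.csv"
    if name.startswith("raw/"):
        return {"action": "trigger_spark_job", "object": name}
    if name.startswith("processed/"):
        return {"action": "load_to_warehouse", "object": name}
    return {"action": "ignore", "object": name}

# Simulated trigger payload (what GCS sends on object finalize)
print(on_file_uploaded({"name": "raw/sales_2027_01.csv"})["action"])  # trigger_spark_job
```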
Studio Session Topics:
- Week 1: Cloud fundamentals + GCP services overview
- Week 2: GCS + Compute Engine essentials workshop
- Week 3: dbt fundamentals — models, tests, documentation
- Week 4: Spark architecture + RDD/DataFrame concepts (Dataproc)
- Week 5: Spark SQL optimization + cost implications
- Week 6: Data warehouse comparison — BigQuery vs. Snowflake
- Week 7: Orchestration + scheduling (Cloud Functions, Cloud Composer, Airflow)
- Week 8: Team presentations + operational best practices
Assessment Summary
| Component | Weight | Notes |
|---|---|---|
| Weekly assignments | 30% | 8 individual labs/deliverables |
| Project milestones | 25% | 4 team checkpoints (M1-M4) |
| Final project deliverable | 15% | Week 8, team |
| Oral defense | 25% | Week 8, team (individual Q&A) |
| Studio participation | 5% | Weekly attendance + engagement |
No traditional exam. One major team project with weekly individual assignments building toward it.
AI Usage Levels (AIAS)
| Assessment | AIAS Level | AI Permitted |
|---|---|---|
| Weekly assignments (Weeks 1-3) | 2 | AI for GCP setup, dbt template generation, IAM debugging — with attribution |
| Weekly assignments (Weeks 4-6) | 2 | AI for PySpark debugging, SQL optimization — with attribution |
| Weekly assignments (Weeks 7-8) | 3 | AI as collaborator for Cloud Functions code, monitoring strategy — with full disclosure |
| Project milestones | 3 | AI as collaborator for pipeline design and implementation — with full disclosure |
| Final project deliverable | 3 | AI as collaborator — with full disclosure of all AI-assisted components |
| Oral defense | 0 | No AI |
| Studio participation | 1 | AI for exploration during exercises |
Technology Stack
- Cloud: GCP (GCS, Compute Engine, Cloud Functions, Cloud SQL, BigQuery, Cloud Composer)
- Data Warehousing: BigQuery + Snowflake (students learn both platforms)
- Data Transformation: dbt (models, tests, documentation)
- Processing: Apache Spark (PySpark) via Cloud Dataproc, Python
- Languages: Python, SQL
- Infrastructure: Terraform (optional)
- Monitoring: Cloud Monitoring, Cloud Logging
- IDE: VS Code with GitHub Copilot; Google Colab (browser alternative for non-Spark work)
- Notebooks: Jupyter with Spark kernel (via VS Code or JupyterHub)
- Version Control: GitHub
Prerequisites
- Completion of BADM 554 (SQL) + FIN 550 (Python) (or equivalent)
- Comfortable with Linux command line (pre-course module required)
| Course Sequence: ← FIN 550 — Big Data Analytics in Finance | Next: Agentic AI for Analytics → |