Last updated: April 05, 2026

BADM 558 - Big Data Infrastructure

Program-level details: See program/curriculum.md

Status: Draft (initial outline; pending instructor review)
Credits: 4
Term: Spring 2027 (Weeks 1-8)
Instructor: Ashish

Course Vision

Students master cloud-based big data infrastructure and modern data engineering tools. Using GCP services, dbt, and both BigQuery and Snowflake, students build scalable data pipelines, work with cloud storage (GCS), distributed processing (Spark via Dataproc), and data warehouses. By course end, students understand how to architect data systems for large-scale applications using the modern data stack.

Bridge Module: Linux CLI & Cloud Orientation (Pre-Course, ~8 hours)

Complete before Week 1. Available in Canvas as a self-paced module with self-check quizzes. Designed for students with no prior command-line experience.

| Unit | Topics | Format | Self-Check |
|------|--------|--------|------------|
| 1. The Command Line from Scratch (2 hrs) | What is a terminal, navigating directories, listing files, creating/moving/deleting files and folders | Jupyter-based terminal exercises (no local install needed) | Quiz: navigate to a directory, create a file, move it |
| 2. File Permissions & SSH (1.5 hrs) | Read/write/execute permissions, chmod, connecting to remote machines via SSH, key pairs | Guided walkthrough + practice exercises | Quiz: set permissions on a file, SSH into a provided server |
| 3. Shell Scripting Basics (1.5 hrs) | Variables, loops, conditionals in bash, writing a simple automation script | Write-along exercises | Quiz: write a script that processes 3 CSV files |
| 4. GCP Account Setup (2 hrs) | Create GCP account, navigate the console, understand regions/services/free tier, set up IAM, configure gcloud CLI | Step-by-step guided walkthrough with screenshots | Checkpoint: successfully run `gsutil ls` from your terminal |
| 5. Cloud Cost Awareness (1 hr) | Free tier limits, setting billing alerts, estimating costs, shutting down resources to avoid charges | Walkthrough + cost calculator exercise | Quiz: estimate monthly cost for a given GCS + Compute Engine scenario |
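Unit 5's cost estimate boils down to simple arithmetic. The sketch below shows the shape of such a calculator in Python; the rates are illustrative placeholders, not real GCP pricing, so always check the current price list before relying on a number.

```python
def estimate_monthly_cost(gcs_gb, vm_hours,
                          gcs_rate_per_gb=0.020,     # assumed $/GB-month, placeholder only
                          vm_rate_per_hour=0.0475):  # assumed $/hour, placeholder only
    """Rough monthly cost for GCS storage plus a Compute Engine VM.

    Rates are illustrative; look up current GCP pricing before
    trusting any number this returns.
    """
    storage = gcs_gb * gcs_rate_per_gb
    compute = vm_hours * vm_rate_per_hour
    return round(storage + compute, 2)

# Example scenario: 100 GB in GCS plus a VM running 200 hours/month
print(estimate_monthly_cost(100, 200))  # → 11.5
```

The exercise in Canvas walks through the same arithmetic with the official pricing calculator.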

Readiness check: Students who pass all 5 self-check quizzes (70% threshold) are cleared for Week 1. Students who don’t pass receive targeted resources and can retake.

Note: This replaces the previous 2-3 hour Linux CLI module. Career pivoters from non-technical backgrounds need more scaffolding to be confident with the command line and cloud console before diving into GCP, Spark, and dbt.

Learning Outcomes (L-C-E Framework)

Literacy (Foundational Awareness)

Competency (Applied Skills)

Expertise (Advanced Application)

Week-by-Week Breakdown

| Week | Topic | Lectures | Project Work | Studio Session | Assessment |
|------|-------|----------|--------------|----------------|------------|
| 1 | Cloud fundamentals + GCP overview | 3 videos | GCP account setup, GCS bucket creation | GCP fundamentals - regions, services, free tier | Assignment 1 |
| 2 | GCS + data lakes + Compute Engine essentials | 2 videos | Data ingestion to GCS, launch instances | GCS + Compute Engine workshop - buckets, permissions, instance types, firewall rules | Assignment 2 + M1 |
| 3 | dbt fundamentals | 2 videos | Build dbt models for team dataset | dbt workshop - models, tests, documentation, lineage graphs | Assignment 3 + M2 |
| 4 | Spark fundamentals + RDD/DataFrames | 3 videos | Dataproc cluster setup, first job | Spark basics - distributed processing, lazy evaluation | Assignment 4 |
| 5 | Spark SQL + data processing | 3 videos | Process team dataset at scale | Spark SQL - DataFrame operations at scale | Assignment 5 + M3 |
| 6 | Data warehousing: BigQuery + Snowflake | 2 videos | Load warehouse, platform comparison | BigQuery vs. Snowflake - columnar storage, SQL queries, platform comparison | Assignment 6 |
| 7 | Data pipelines + orchestration | 2 videos | Wire up end-to-end pipeline | Cloud Functions + Cloud Composer - serverless automation | Assignment 7 + M4 |
| 8 | Security, monitoring, synthesis | 1 video | Final integration + reflection | GCP security + monitoring - Cloud Logging, alerts, cost | Assignment 8 + Final deliverable + Oral defense |
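Week 4's key idea, lazy evaluation, can be previewed without Spark: Python generator pipelines likewise build a plan and do no work until a terminal operation forces them, mirroring Spark's transformation/action split. The data below is made up for illustration.

```python
# Generator pipelines mimic Spark's transformation/action split:
# building the pipeline computes nothing; iteration (the "action") does.
rows = range(10)                             # source, like an RDD
doubled = (x * 2 for x in rows)              # transformation: deferred
evens = (x for x in doubled if x % 4 == 0)   # another deferred transformation

result = list(evens)                         # "action": pipeline finally runs
print(result)  # → [0, 4, 8, 12, 16]
```

In Spark the same deferral is what lets the engine optimize the whole plan before touching any data.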

Team Project: End-to-End Data Pipeline (Team of 3)

One major team project runs across all 8 weeks. Teams progressively build a complete big data pipeline — from cloud setup and data ingestion through Spark processing, warehousing, and orchestration — culminating in a live demo and oral defense.

Pipeline Components:

  1. Ingestion: Data sources ingested into GCS (public datasets, APIs, logs)
  2. Transformation: dbt models with tests, documentation, and lineage
  3. Processing: Spark job (via Dataproc) cleaning + transforming data at scale
  4. Storage: Results in data lake (GCS) and warehouse (BigQuery/Snowflake)
  5. Orchestration: Scheduled pipeline via Cloud Functions + Cloud Composer
  6. Reporting: Dashboard or SQL queries showing results
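To make the component ordering concrete, here is a toy Python runner (not any GCP API) that executes the six stages above in dependency order. Real scheduling belongs to Cloud Composer; the stage names and callables here are purely illustrative.

```python
# Toy orchestrator: runs the six pipeline components in order and
# records what ran. Cloud Composer does this for real pipelines.
STAGES = ["ingestion", "transformation", "processing",
          "storage", "orchestration", "reporting"]

def run_pipeline(stage_fns):
    """Execute each stage in order; unregistered stages are no-ops."""
    log = []
    for name in STAGES:
        stage_fns.get(name, lambda: None)()  # hypothetical per-stage callable
        log.append(name)
    return log

executed = run_pipeline({"ingestion": lambda: print("pulling sources into GCS")})
print(executed)
```

The fixed ordering is the point: downstream stages assume their upstream outputs already exist, which is exactly the dependency graph a Composer DAG encodes.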

Final Deliverables:

Rubric (5 dimensions):

| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|-----------|---------------|----------------|-----------------|
| Pipeline Architecture | Well-designed, scalable, modular with clear data flow | Functional, some inefficiency | Basic or problematic design |
| Data Engineering | Clean dbt models with tests + efficient Spark transformations | Functional models and code | Minimal tests, inefficient code |
| Platform Comparison | Thoughtful BigQuery vs. Snowflake analysis with performance data | Basic comparison | Minimal or missing |
| Automation & Monitoring | Fully automated via Cloud Composer, comprehensive Cloud Monitoring + alerts | Mostly automated, basic monitoring | Manual steps, minimal visibility |
| Production Readiness | Error handling, rollback plan, cost-optimized, tested | Mostly robust | Lacks error handling |

Weekly Assignments (Individual)

Hands-on labs and deliverables that build skills toward the team project. Each assignment is due at the end of its week.

| Week | Assignment | Focus |
|------|------------|-------|
| 1 | GCP account setup + GCS bucket creation | Cloud fundamentals, IAM, bucket policies |
| 2 | Data ingestion demo — upload 2+ sources to GCS | GCS organization, data formats, access control |
| 3 | dbt model review — build and test staging/mart models | dbt fundamentals, schema tests, documentation |
| 4 | Dataproc cluster setup + first Spark job submission | PySpark environment, RDD/DataFrame basics |
| 5 | Spark data processing — clean and aggregate a large dataset | Spark SQL, DataFrame operations, optimization |
| 6 | Warehouse query lab — load data to BigQuery and Snowflake | Columnar storage, SQL queries, platform comparison |
| 7 | Pipeline orchestration — wire up Cloud Functions + Cloud Composer | Serverless automation, scheduling, error handling |
| 8 | Code review + reflection | Peer code review, cost analysis, lessons learned |
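The schema tests in the Week 3 assignment (dbt's `not_null` and `unique` generic tests) amount to simple column checks. A plain-Python sketch of the same logic, using a made-up column of IDs:

```python
def not_null(values):
    """dbt-style not_null test: passes only if no value is None."""
    return all(v is not None for v in values)

def unique(values):
    """dbt-style unique test: passes only if no value repeats."""
    return len(values) == len(set(values))

ids = [1, 2, 3, 4]
print(not_null(ids), unique(ids))  # → True True
print(unique([1, 1, 2]))           # → False
```

In dbt you declare these under a column in `schema.yml` and the framework generates and runs the equivalent SQL for you.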

Project Milestones (Team)

Progressive checkpoints ensuring teams are on track for the final pipeline.

| Milestone | Due | Deliverable |
|-----------|-----|-------------|
| M1: Cloud Infrastructure | End of Week 2 | GCS bucket with ingested data, IAM roles, cost estimate |
| M2: Transformation Layer | End of Week 3 | Working dbt project with tests and lineage graph |
| M3: Processing Layer | End of Week 5 | PySpark application (Dataproc) processing team dataset, performance benchmarks |
| M4: Integration & Orchestration | End of Week 7 | End-to-end pipeline running on schedule with Cloud Monitoring |
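For the M3 performance benchmarks, teams typically compare wall-clock time and throughput across cluster sizes. The harness below shows the bookkeeping in plain Python; the squaring workload is a stand-in, not a Spark job, and the function names are ours.

```python
import time

def benchmark(fn, n_rows):
    """Time fn over n_rows synthetic records and report throughput."""
    start = time.perf_counter()
    fn(range(n_rows))
    elapsed = time.perf_counter() - start
    return {"rows": n_rows, "seconds": elapsed,
            "rows_per_sec": n_rows / elapsed if elapsed else float("inf")}

# Stand-in workload; a real benchmark would time a Dataproc job instead
stats = benchmark(lambda rows: sum(x * x for x in rows), 100_000)
print(f"{stats['rows_per_sec']:.0f} rows/sec")
```

Record the same numbers for 2- and 4-worker clusters and the comparison milestone writes itself.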

Final Project Deliverable (Week 8, Team)

Complete end-to-end data pipeline with all components integrated, architecture diagram, operational runbook, and GitHub repo. Graded on the full rubric above.

Oral Defense (Week 8, Team)

Each team presents their pipeline architecture and gives a live demo. All team members must answer questions individually. Evaluates technical depth, design rationale, and ability to explain decisions without AI assistance.

AI Tools Integration

Weeks 1-3 (Cloud Setup + dbt):

  1. Use Claude/ChatGPT to:
    • Explain GCP services + when to use each
    • Debug IAM permission issues
    • Generate dbt model templates and tests
    • Review bucket policies

Weeks 4-6 (Spark Processing + Warehousing):

  1. Use AI to:
    • Suggest Spark DataFrame transformations
    • Debug PySpark errors
    • Compare BigQuery vs. Snowflake features
    • Generate cost calculations

Weeks 7-8 (Pipeline):

  1. Use AI to:
    • Generate Cloud Functions code
    • Debug Cloud Composer DAG workflows
    • Suggest monitoring strategies
    • Review error handling patterns
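Whether AI-generated or hand-written, a Cloud Functions handler for this pipeline tends to be small. A minimal sketch of a background function fired by a GCS object-finalize trigger is below; the `bucket` and `name` fields are what GCS events carry, while the function name and the downstream step are placeholders.

```python
def on_new_file(event, context):
    """Background Cloud Function fired when an object lands in GCS.

    `event` carries object metadata; `context` carries event ID and
    timestamp. The downstream step is a placeholder for your pipeline.
    """
    bucket = event["bucket"]
    name = event["name"]
    print(f"New object gs://{bucket}/{name}; triggering downstream step")
    return f"processed gs://{bucket}/{name}"

# Local smoke test with a fake event dict (no GCP needed):
print(on_new_file({"bucket": "demo-bucket", "name": "data.csv"}, None))
```

Testing locally with a fake event like this is also a cheap way to verify AI-generated handler code before deploying it.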

Studio Session Topics:

Assessment Summary

| Component | Weight | Notes |
|-----------|--------|-------|
| Weekly assignments | 30% | 8 individual labs/deliverables |
| Project milestones | 25% | 4 team checkpoints (M1-M4) |
| Final project deliverable | 15% | Week 8, team |
| Oral defense | 25% | Week 8, team (individual Q&A) |
| Studio participation | 5% | Weekly attendance + engagement |

No traditional exam. One major team project with weekly individual assignments building toward it.

AI Usage Levels (AIAS)

| Assessment | AIAS Level | AI Permitted |
|------------|------------|--------------|
| Weekly assignments (Weeks 1-3) | 2 | AI for GCP setup, dbt template generation, IAM debugging — with attribution |
| Weekly assignments (Weeks 4-6) | 2 | AI for PySpark debugging, SQL optimization — with attribution |
| Weekly assignments (Weeks 7-8) | 3 | AI as collaborator for Cloud Functions code, monitoring strategy — with full disclosure |
| Project milestones | 3 | AI as collaborator for pipeline design and implementation — with full disclosure |
| Final project deliverable | 3 | AI as collaborator — with full disclosure of all AI-assisted components |
| Oral defense | 0 | No AI |
| Studio participation | 1 | AI for exploration during exercises |

Technology Stack

Prerequisites


Course Sequence: FIN 550 — Big Data Analytics in Finance | Next: Agentic AI for Analytics