BADM 558 - Big Data Infrastructure

Program-level details: See program/CURRICULUM.md

Credits: 4 | Term: Spring 2027 (Weeks 1-8)

Course Vision

Students master cloud-based big data infrastructure and modern data engineering tools. Using AWS services, dbt, Redshift, and Snowflake, students build scalable data pipelines spanning cloud storage (S3), distributed processing (Spark), and data warehousing. By course end, students can architect data systems for large-scale applications using the modern data stack.

Pre-Course Requirement

Linux CLI Module (2-3 hours, self-paced, asynchronous): Complete before Week 1. Covers SSH, file operations, permissions, and basic shell scripting. Available via DataCamp or course-provided materials.

Learning Outcomes (L-C-E Framework)

Literacy (Foundational Awareness)

Competency (Applied Skills)

Expertise (Advanced Application)

Week-by-Week Breakdown

| Week | Topic | Lectures | Project Work | Studio Session | Assessment |
| --- | --- | --- | --- | --- | --- |
| 1 | Cloud fundamentals + AWS overview | 3 videos | Project 1A: AWS account setup | AWS fundamentals - regions, services, free tier | AWS setup quiz |
| 2 | S3 + data lakes + EC2 compute essentials | 2 videos | Project 1 work: Upload data to S3, launch instances | S3 + EC2 workshop - buckets, permissions, instance types, security groups | Data upload + instance launch |
| 3 | dbt fundamentals | 2 videos | Project 1 work: Build dbt models | dbt workshop - models, tests, documentation, lineage graphs | dbt model review |
| 4 | Spark fundamentals + RDD/DataFrames | 3 videos | Project 2A: Spark cluster setup | Spark basics - distributed processing, lazy evaluation | Spark job submission |
| 5 | Spark SQL + data processing | 3 videos | Project 2 work: Process large dataset | Spark SQL - DataFrame operations at scale | Code review |
| 6 | Data warehousing: Redshift + Snowflake | 2 videos | Project 2 work: Load warehouse | Redshift vs. Snowflake - columnar storage, SQL queries, platform comparison | Warehouse query |
| 7 | Data pipelines + orchestration | 2 videos | Project 3A: Build end-to-end pipeline | Lambda + Step Functions - serverless automation | Mid-course checkpoint |
| 8 | Security, monitoring, synthesis | 1 video | Project 3 complete + reflection | AWS security + monitoring - logging, alerts, cost | Final presentations + team oral defense |

Projects (3 total)

Project 1: Cloud Data Lake + dbt Setup (Weeks 1-3, Individual, 20% of grade)

Problem Statement: Build a cloud data lake on AWS and implement dbt for data transformation. Ingest data from multiple sources into S3, organize it with consistent naming and partitioning conventions, and build a dbt project with tested, documented models.
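
As a concrete starting point for the ingestion step, a raw-zone upload script might enforce a date-partitioned key convention. A minimal sketch using boto3; the bucket name, key layout, and source names are placeholder assumptions, not a required convention:

```python
import datetime
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

def upload_raw(local_path: str, source: str, bucket: str = "badm558-datalake") -> str:
    """Upload a local file into the raw zone under a date-partitioned key.

    Hypothetical layout: raw/<source>/year=YYYY/month=MM/day=DD/<filename>
    """
    today = datetime.date.today()
    filename = local_path.rsplit("/", 1)[-1]
    key = (
        f"raw/{source}/year={today.year}/"
        f"month={today.month:02d}/day={today.day:02d}/{filename}"
    )
    s3.upload_file(local_path, bucket, key)  # boto3 handles multipart uploads as needed
    return key

# Example: upload_raw("data/orders.csv", source="orders_api")
```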

Deliverables:

Rubric (5 dimensions):

| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
| --- | --- | --- | --- |
| S3 Organization | Well-structured with clear partitioning | Organized, some redundancy | Messy or hard to navigate |
| dbt Models | Clean models with tests, docs, and lineage | Functional models, basic tests | Minimal or untested models |
| Data Ingestion | Multiple sources, automated process | 2 sources, some manual steps | Limited sources or manual only |
| Security | Proper IAM roles, bucket policies | Basic security | Missing security |
| Cost Awareness | Monitored costs, optimized | Aware of costs | No cost tracking |
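
For the tested, documented models the rubric calls for, `dbt build` runs models and their tests in one pass; it can also be scripted from Python. A minimal sketch assuming dbt-core 1.5+ and a hypothetical `staging` model folder:

```python
from dbt.cli.main import dbtRunner  # programmatic CLI entry point, dbt-core >= 1.5

runner = dbtRunner()

# "build" runs the selected models and then executes their tests;
# --select narrows the run to the staging models (folder name is a placeholder).
result = runner.invoke(["build", "--select", "staging"])

if not result.success:
    raise SystemExit("dbt build failed; check logs/ and target/run_results.json")
```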

Project 2: Spark Data Processing + Warehousing (Weeks 4-6, Individual, 35% of grade)

Problem Statement: Process a large dataset with Apache Spark: clean, transform, and aggregate the data, then load the results into both Redshift and Snowflake to compare the two platforms.
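
The core Spark work might look like the following sketch (column names, paths, and the aggregation itself are illustrative assumptions, not the assigned dataset):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("badm558-project2").getOrCreate()

# Read raw data from the Project 1 data lake (hypothetical path and schema).
orders = spark.read.parquet("s3a://badm558-datalake/raw/orders/")

cleaned = (
    orders
    .dropDuplicates(["order_id"])                     # remove duplicate records
    .filter(F.col("amount").isNotNull())              # drop rows missing key fields
    .withColumn("order_date", F.to_date("order_ts"))  # normalize timestamp to date
)

# Cache the aggregate since it feeds multiple downstream writes.
daily = (
    cleaned.groupBy("order_date")
    .agg(F.count("*").alias("order_count"), F.sum("amount").alias("revenue"))
    .cache()
)

# Write partitioned Parquet back to the lake, ready for warehouse loading.
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3a://badm558-datalake/processed/daily_orders/"
)
```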

Dataset Options:

Deliverables:

Rubric (5 dimensions):

| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
| --- | --- | --- | --- |
| Spark Code | Efficient transformations, proper caching | Functional code | Inefficient or problematic |
| Data Quality | Handles missing values, edge cases | Handles common cases | Errors or data loss |
| Platform Comparison | Thoughtful Redshift vs. Snowflake analysis | Basic comparison | Minimal or missing |
| Performance | Optimized partitioning, justifies choices | Adequate performance | Unoptimized or slow |
| Documentation | Clear explanation of transformations | Adequate explanation | Minimal docs |
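
One way to stage the platform comparison: once the Spark output lands in S3 as Parquet, each warehouse can ingest the same files with a COPY statement. A hedged sketch; the cluster, role ARN, stage, credentials, and table names below are all placeholders:

```python
import boto3
import snowflake.connector

# Redshift: COPY Parquet from S3 via the Data API (no JDBC driver needed).
redshift = boto3.client("redshift-data")
redshift.execute_statement(
    ClusterIdentifier="badm558-cluster",
    Database="analytics",
    DbUser="student",
    Sql=(
        "COPY analytics.daily_orders "
        "FROM 's3://badm558-datalake/processed/daily_orders/' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read' "
        "FORMAT AS PARQUET;"
    ),
)

# Snowflake: COPY the same files from an external stage pointing at the bucket.
conn = snowflake.connector.connect(
    account="myorg-myaccount", user="student", password="...",
    warehouse="COMPUTE_WH", database="ANALYTICS", schema="PUBLIC",
)
conn.cursor().execute(
    "COPY INTO daily_orders FROM @datalake_stage/processed/daily_orders/ "
    "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;"
)
```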

Project 3: End-to-End Data Pipeline (Weeks 7-8, Team of 3-4, 35% of grade)

Problem Statement: Design and implement a complete big data pipeline as a team. Automate daily or weekly data ingestion, processing, and loading, and build a simple dashboard or reporting layer.

Pipeline Components:

  1. Ingestion: Scheduled pull of source data into S3 (from an API, CloudWatch logs, or another source); see the Lambda sketch after this list
  2. Processing: Spark job that cleans and transforms the data
  3. Storage: Results stored in data lake (S3) or warehouse (Redshift/Snowflake)
  4. Reporting: Simple dashboard or SQL queries showing results
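
For the scheduled trigger, one common pattern is an EventBridge rule invoking a small Lambda that starts the Step Functions state machine. A minimal sketch; the state machine ARN is a placeholder:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    """Invoked on a schedule by an EventBridge rule; kicks off the pipeline."""
    response = sfn.start_execution(
        # Placeholder ARN; point this at the team's actual state machine.
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:badm558-pipeline",
        input=json.dumps({"run_date": event.get("time", "")}),  # scheduled events carry a "time" field
    )
    return {"executionArn": response["executionArn"]}
```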

Deliverables:

Rubric (5 dimensions):

| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
| --- | --- | --- | --- |
| Pipeline Architecture | Well-designed, scalable, modular | Functional, some inefficiency | Basic or problematic design |
| Automation | Fully automated, scheduled correctly | Mostly automated | Manual steps remaining |
| Monitoring | Comprehensive logging + alerts | Basic monitoring | Minimal visibility |
| Oral Defense | Clear explanation, confident demo, handles Q&A well | Adequate presentation | Unclear or unprepared |
| Production Readiness | Error handling, rollback plan, tested | Mostly robust | Lacks error handling |

AI Tools Integration

Weeks 1-3 (Cloud Setup + dbt):

  1. Use Claude/ChatGPT to:
    • Explain AWS services + when to use each
    • Debug IAM permission issues
    • Generate dbt model templates and tests
    • Review bucket policies

Weeks 4-6 (Spark Processing + Warehousing):

  1. Use AI to:
    • Suggest Spark DataFrame transformations
    • Debug PySpark errors
    • Compare Redshift vs. Snowflake features
    • Generate cost calculations

Weeks 7-8 (Pipeline):

  1. Use AI to:
    • Generate Lambda function code
    • Debug Step Functions workflows
    • Suggest monitoring strategies
    • Review error handling patterns

Studio Session Topics:

Assessment Summary

| Component | Weight | Notes |
| --- | --- | --- |
| Project 1 (Data Lake + dbt) | 20% | Weeks 1-3, individual |
| Project 2 (Spark + Warehousing) | 35% | Weeks 4-6, individual |
| Project 3 (End-to-End Pipeline) | 35% | Weeks 7-8, team (includes oral defense) |
| Studio participation | 10% | Weekly attendance |
No traditional exam; assessment is project-based with an infrastructure focus.

Technology Stack

Prerequisites


Last Updated: February 2026