BADM 558 - Big Data Infrastructure
Program-level details: See program/CURRICULUM.md
Credits: 4 | Term: Spring 2027 (Weeks 1-8)
Course Vision
Students master cloud-based big data infrastructure and modern data engineering tools. Using AWS services, dbt, and both Redshift and Snowflake, students build scalable data pipelines and work with cloud storage (S3), distributed processing (Spark), and cloud data warehouses. By course end, students understand how to architect data systems for large-scale applications using the modern data stack.
Pre-Course Requirement
Linux CLI Module (2-3 hours, self-paced async): Complete before Week 1. Covers SSH, file operations, permissions, and basic shell scripting. Available via DataCamp or course-provided materials.
Learning Outcomes (L-C-E Framework)
Literacy (Foundational Awareness)
- L1: Explain cloud computing benefits (scalability, cost, flexibility) and describe major cloud providers
- L2: Understand when big data infrastructure is needed versus when a traditional database is sufficient
- L3: Recognize AWS service categories (compute, storage, database, analytics) and modern data stack components (dbt, Snowflake)
Competency (Applied Skills)
- C1: Build data pipelines using AWS (S3, EC2, Lambda)
- C2: Use Apache Spark for distributed processing of large datasets
- C3: Design and query data warehouses using both Redshift and Snowflake
- C4: Build and test dbt models with documentation and lineage tracking
- C5: Implement basic data security and IAM practices
Expertise (Advanced Application)
- E1: Architect an end-to-end big data solution (ingest -> store -> process -> analyze)
- E2: Optimize data pipelines for cost and performance
- E3: Implement monitoring, logging, and error handling for production systems
Week-by-Week Breakdown
| Week | Topic | Lectures | Project Work | Studio Session | Assessment |
|---|---|---|---|---|---|
| 1 | Cloud fundamentals + AWS overview | 3 videos | Project 1A: AWS account setup | AWS fundamentals - regions, services, free tier | AWS setup quiz |
| 2 | S3 + data lakes + EC2 compute essentials | 2 videos | Project 1 work: Upload data to S3, launch instances | S3 + EC2 workshop - buckets, permissions, instance types, security groups | Data upload + instance launch |
| 3 | dbt fundamentals | 2 videos | Project 1 work: Build dbt models | dbt workshop - models, tests, documentation, lineage graphs | dbt model review |
| 4 | Spark fundamentals + RDD/DataFrames | 3 videos | Project 2A: Spark cluster setup | Spark basics - distributed processing, lazy evaluation | Spark job submission |
| 5 | Spark SQL + data processing | 3 videos | Project 2 work: Process large dataset | Spark SQL - DataFrame operations at scale | Code review |
| 6 | Data warehousing: Redshift + Snowflake | 2 videos | Project 2 work: Load warehouse | Redshift vs. Snowflake - columnar storage, SQL queries, platform comparison | Warehouse query |
| 7 | Data pipelines + orchestration | 2 videos | Project 3A: Build end-to-end pipeline | Lambda + Step Functions - serverless automation | Mid-course checkpoint |
| 8 | Security, monitoring, synthesis | 1 video | Project 3 complete + reflection | AWS security + monitoring - logging, alerts, cost | Final presentations + team oral defense |
Projects (3 total)
Project 1: Cloud Data Lake + dbt Setup (Weeks 1-3, Individual, 20% of grade)
Problem Statement: Build a cloud data lake on AWS and implement dbt for data transformation. Ingest data from multiple sources into S3, organize with proper conventions, and build a dbt project with tested, documented models.
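For orientation, the ingestion step can be as simple as a short boto3 script along the lines of the sketch below; the bucket name, local file, key prefixes, and API URL are placeholders, not course requirements.

```python
# Minimal ingestion sketch for Project 1 (pip install boto3 requests).
# Bucket name, file names, prefixes, and the API URL are placeholders.
import boto3
import requests

BUCKET = "badm558-<netid>-datalake"   # hypothetical bucket name
s3 = boto3.client("s3")

# Land a downloaded public dataset in the raw/ zone of the data lake
s3.upload_file("nyc_taxi_2024.csv", BUCKET, "raw/nyc_taxi/nyc_taxi_2024.csv")

# Pull an API source and land the raw response alongside it
resp = requests.get("https://api.example.com/weather?city=Chicago")  # placeholder URL
s3.put_object(
    Bucket=BUCKET,
    Key="raw/weather/chicago_2024-01-01.json",
    Body=resp.content,
)
```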
Deliverables:
- AWS S3 bucket created with proper structure (raw, processed, output folders)
- At least 2 data sources ingested:
  - Public dataset (Kaggle, government data, etc.)
  - API data (weather, stocks, social media, etc.)
- dbt project with:
  - Staging and mart models
  - Schema tests (not null, unique, accepted values)
  - Documentation and lineage graph
- Data organization with clear naming conventions
- Basic access control (IAM roles, bucket policies)
- Cost estimation document (a worked example follows this list)
- GitHub repo with infrastructure code
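For the cost estimation deliverable, a back-of-the-envelope sketch of the expected level of detail is shown below. The prices are rough approximations of S3 Standard rates in us-east-1 and must be checked against the current AWS pricing pages; the data volumes are invented.

```python
# Illustrative S3 cost estimate; prices and volumes are assumptions,
# not official figures. Always verify against current AWS pricing.
storage_gb = 25                    # assumed raw + processed data volume
price_per_gb_month = 0.023         # assumed S3 Standard rate, USD per GB-month
put_requests = 5_000               # assumed monthly PUT requests
price_per_1k_puts = 0.005          # assumed PUT request price, USD per 1,000

monthly_cost = (storage_gb * price_per_gb_month
                + (put_requests / 1_000) * price_per_1k_puts)
print(f"Estimated monthly S3 cost: ${monthly_cost:.2f}")   # ~ $0.60
```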
Rubric (5 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| S3 Organization | Well-structured with clear partitioning | Organized, some redundancy | Messy or hard to navigate |
| dbt Models | Clean models with tests, docs, and lineage | Functional models, basic tests | Minimal or untested models |
| Data Ingestion | Multiple sources, automated process | 2 sources, some manual steps | Limited sources or manual only |
| Security | Proper IAM roles, bucket policies | Basic security | Missing security |
| Cost Awareness | Monitored costs, optimized | Aware of costs | No cost tracking |
Project 2: Spark Data Processing + Warehousing (Weeks 4-6, Individual, 35% of grade)
Problem Statement: Process a large dataset using Apache Spark: clean, transform, and aggregate the data, then load the results into both Redshift and Snowflake to compare the two platforms.
Dataset Options:
- 1GB+ public dataset (Wikipedia, Amazon reviews, GitHub, Twitter, etc.)
- Student's own data (CSV, Parquet, JSON)
- Synthetic/generated big data
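Whichever dataset you choose, the core of the PySpark application will have roughly the shape sketched below; the bucket, paths, and column names are illustrative (they assume the Amazon reviews option) and are not a required schema.

```python
# Illustrative PySpark skeleton for Project 2; paths and columns are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("badm558-project2").getOrCreate()

# Read raw data from the S3 data lake built in Project 1
reviews = spark.read.json("s3a://<your-bucket>/raw/amazon_reviews/")

# Clean: drop records missing key fields, normalize types
clean = (reviews
         .dropna(subset=["product_id", "star_rating"])
         .withColumn("star_rating", F.col("star_rating").cast("int")))

# Aggregate: average rating and review count per product
summary = (clean
           .groupBy("product_id")
           .agg(F.avg("star_rating").alias("avg_rating"),
                F.count("*").alias("n_reviews")))

# Write results back to the processed zone as Parquet
summary.write.mode("overwrite").parquet("s3a://<your-bucket>/processed/review_summary/")
```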
Deliverables:
- PySpark application (Python + Spark DataFrame API)
- Data cleaning + transformation logic
- Aggregations and analytical queries
- Output loaded to both Redshift and Snowflake, with a platform comparison write-up (see the loading sketch after this list)
- Performance analysis:
  - Execution time with different cluster sizes
  - Cost estimates for 1-node, 5-node, 10-node clusters
  - Optimization recommendations
- Jupyter notebook explaining approach + results
- GitHub repo with PySpark code + data schema
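One possible shape for the warehouse-loading step is Spark's generic JDBC writer against Redshift, sketched below; connection details are placeholders, and in practice the dedicated spark-redshift and Snowflake Spark connectors (which stage data through S3) are generally preferred for bulk loads.

```python
# Hedged sketch: writing a Project 2 summary DataFrame to Redshift over JDBC.
# Assumes the Redshift JDBC driver jar is on the Spark classpath; endpoint,
# credentials, and table name are placeholders. Snowflake loading would use
# its own Spark connector instead.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("badm558-warehouse-load").getOrCreate()

# Stand-in for the summary DataFrame produced by the processing step
summary = spark.createDataFrame(
    [("B000123", 4.2, 37)], ["product_id", "avg_rating", "n_reviews"])

(summary.write
    .format("jdbc")
    .option("url", "jdbc:redshift://<cluster-endpoint>:5439/dev")
    .option("dbtable", "analytics.review_summary")
    .option("user", "<username>")
    .option("password", "<password>")
    .mode("overwrite")
    .save())
```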
Rubric (5 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Spark Code | Efficient transformations, proper caching | Functional code | Inefficient or problematic |
| Data Quality | Handles missing values, edge cases | Handles common cases | Errors or data loss |
| Platform Comparison | Thoughtful Redshift vs. Snowflake analysis | Basic comparison | Minimal or missing |
| Performance | Optimized partitioning, justifies choices | Adequate performance | Unoptimized or slow |
| Documentation | Clear explanation of transformations | Adequate explanation | Minimal docs |
Project 3: End-to-End Data Pipeline (Weeks 7-8, Team of 3-4, 35% of grade)
Problem Statement: Design and implement a complete big data pipeline as a team: automate daily or weekly data ingestion, processing, and loading, and build a simple dashboard or reporting layer on top.
Pipeline Components:
- Ingestion: Scheduled pull of data into S3 from an API, CloudWatch logs, or another source (see the Lambda sketch after this list)
- Processing: Spark job cleaning + transforming data
- Storage: Results stored in data lake (S3) or warehouse (Redshift/Snowflake)
- Reporting: Simple dashboard or SQL queries showing results
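The ingestion component can be as small as a single scheduled Lambda function (triggered, for example, by an EventBridge rule). A minimal sketch follows; the API URL, bucket name, and key pattern are placeholders.

```python
# Hedged sketch of a scheduled ingestion Lambda for Project 3.
# Bucket name and API URL are placeholders, not course-provided resources.
import datetime
import json
import urllib.request

import boto3

s3 = boto3.client("s3")
BUCKET = "badm558-team-pipeline"   # placeholder bucket name

def lambda_handler(event, context):
    """Pull one day of data from a (placeholder) API and land it in the raw zone."""
    today = datetime.date.today().isoformat()
    with urllib.request.urlopen("https://api.example.com/daily-metrics") as resp:  # placeholder URL
        payload = resp.read()
    key = f"raw/daily_metrics/{today}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    return {"statusCode": 200, "body": json.dumps({"written": key})}
```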
Deliverables:
- AWS Lambda function(s) for scheduled ingestion
- Spark job for data processing
- AWS Step Functions workflow orchestrating the pipeline
- CloudWatch monitoring + alarms (see the alarm sketch after this list)
- Cost estimation (monthly running cost)
- Architecture diagram
- Operational runbook (how to monitor + troubleshoot)
- GitHub repo with all code + infrastructure templates
- Team oral defense: present pipeline architecture + live demo (20% of Project 3 grade)
- Peer evaluation of team contributions
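For the monitoring deliverable, a minimal example of one boto3-defined alarm is sketched below; the function name, SNS topic ARN, and threshold are placeholders, and teams will likely want several alarms plus structured logging.

```python
# Illustrative boto3 snippet: alarm when the ingestion Lambda reports any errors.
# Function name and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="badm558-ingest-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "badm558-ingest"}],
    Statistic="Sum",
    Period=3600,               # evaluate hourly
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:badm558-alerts"],  # placeholder ARN
)
```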
Rubric (5 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Pipeline Architecture | Well-designed, scalable, modular | Functional, some inefficiency | Basic or problematic design |
| Automation | Fully automated, scheduled correctly | Mostly automated | Manual steps remaining |
| Monitoring | Comprehensive logging + alerts | Basic monitoring | Minimal visibility |
| Oral Defense | Clear explanation, confident demo, handles Q&A well | Adequate presentation | Unclear or unprepared |
| Production Readiness | Error handling, rollback plan, tested | Mostly robust | Lacks error handling |
AI Tools Integration
Weeks 1-3 (Cloud Setup + dbt):
- Use Claude/ChatGPT to:
  - Explain AWS services + when to use each
  - Debug IAM permission issues
  - Generate dbt model templates and tests
  - Review bucket policies
Weeks 4-6 (Spark Processing + Warehousing):
- Use AI to:
  - Suggest Spark DataFrame transformations
  - Debug PySpark errors
  - Compare Redshift vs. Snowflake features
  - Generate cost calculations
Weeks 7-8 (Pipeline):
- Use AI to:
  - Generate Lambda function code
  - Debug Step Functions workflows
  - Suggest monitoring strategies
  - Review error handling patterns
Studio Session Topics:
- Week 1: Cloud fundamentals + AWS services overview
- Week 2: S3 + EC2 essentials workshop
- Week 3: dbt fundamentals — models, tests, documentation
- Week 4: Spark architecture + RDD/DataFrame concepts
- Week 5: Spark SQL optimization + cost implications
- Week 6: Data warehouse comparison — Redshift vs. Snowflake
- Week 7: Orchestration + scheduling (Lambda, Step Functions, Airflow)
- Week 8: Team presentations + operational best practices
Assessment Summary
| Component | Weight | Notes |
|---|---|---|
| Project 1 (Data Lake + dbt) | 20% | Weeks 1-3, individual |
| Project 2 (Spark + Warehousing) | 35% | Weeks 4-6, individual |
| Project 3 (End-to-End Pipeline) | 35% | Weeks 7-8, team (includes oral defense) |
| Studio participation | 10% | Weekly attendance |
No traditional exam; assessment is entirely project-based with an infrastructure focus.
Technology Stack
- Cloud: AWS (S3, EC2, Lambda, RDS, Redshift, Step Functions)
- Data Warehousing: Redshift + Snowflake (students learn both platforms)
- Data Transformation: dbt (models, tests, documentation)
- Processing: Apache Spark (PySpark), Python
- Languages: Python, SQL
- Infrastructure: CloudFormation or Terraform (optional)
- Monitoring: CloudWatch, X-Ray
- Notebook: Jupyter with Spark kernel
Prerequisites
- Completion of BADM 554 (SQL) + FIN 550 (Python) (or equivalent)
- Comfortable with Linux command line (pre-course module required)
Last Updated: February 2026