BADM 558 - Big Data Infrastructure
Program-level details: See program/CURRICULUM.md
Credits: 4 | Term: Spring 2027 (Weeks 1-8)
Course Vision
Students master cloud-based big data infrastructure and modern data engineering tools. Using AWS services, dbt, and both Redshift and Snowflake, students build scalable data pipelines and work with cloud storage (S3), distributed processing (Spark), and cloud data warehouses. By course end, students understand how to architect data systems for large-scale applications using the modern data stack.
Pre-Course Requirement
Linux CLI Module (2-3 hours, self-paced async): Complete before Week 1. Covers SSH, file operations, permissions, and basic shell scripting. Available via DataCamp or course-provided materials.
Learning Outcomes (L-C-E Framework)
Literacy (Foundational Awareness)
- L1: Explain cloud computing benefits (scalability, cost, flexibility) and describe major cloud providers
- L2: Understand when big data infrastructure is needed versus when a traditional database is sufficient
- L3: Recognize AWS service categories (compute, storage, database, analytics) and modern data stack components (dbt, Snowflake)
Competency (Applied Skills)
- C1: Build data pipelines using AWS (S3, EC2, Lambda)
- C2: Use Apache Spark for distributed processing of large datasets
- C3: Design and query data warehouses using both Redshift and Snowflake
- C4: Build and test dbt models with documentation and lineage tracking
- C5: Implement basic data security and IAM practices
Expertise (Advanced Application)
- E1: Architect an end-to-end big data solution (ingest -> store -> process -> analyze)
- E2: Optimize data pipelines for cost and performance
- E3: Implement monitoring, logging, and error handling for production systems
Week-by-Week Breakdown
| Week | Topic | Lectures | Project Work | Studio Session | Assessment |
|---|---|---|---|---|---|
| 1 | Cloud fundamentals + AWS overview | 3 videos | Project 1A: AWS account setup | AWS fundamentals - regions, services, free tier | AWS setup quiz |
| 2 | S3 + data lakes + EC2 compute essentials | 2 videos | Project 1 work: Upload data to S3, launch instances | S3 + EC2 workshop - buckets, permissions, instance types, security groups | Data upload + instance launch |
| 3 | dbt fundamentals | 2 videos | Project 1 work: Build dbt models | dbt workshop - models, tests, documentation, lineage graphs | dbt model review |
| 4 | Spark fundamentals + RDD/DataFrames | 3 videos | Project 2A: Spark cluster setup | Spark basics - distributed processing, lazy evaluation | Spark job submission |
| 5 | Spark SQL + data processing | 3 videos | Project 2 work: Process large dataset | Spark SQL - DataFrame operations at scale | Code review |
| 6 | Data warehousing: Redshift + Snowflake | 2 videos | Project 2 work: Load warehouse | Redshift vs. Snowflake - columnar storage, SQL queries, platform comparison | Warehouse query |
| 7 | Data pipelines + orchestration | 2 videos | Project 3A: Build end-to-end pipeline | Lambda + Step Functions - serverless automation | Mid-course checkpoint |
| 8 | Security, monitoring, synthesis | 1 video | Project 3 complete + reflection | AWS security + monitoring - logging, alerts, cost | Final presentations + team oral defense |
Projects (3 total)
Project 1: Cloud Data Lake + dbt Setup (Weeks 1-3, Individual, 20% of grade)
Problem Statement: Build a cloud data lake on AWS and implement dbt for data transformation. Ingest data from multiple sources into S3, organize with proper conventions, and build a dbt project with tested, documented models.
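For orientation, the ingestion step can be as simple as a short boto3 script along the lines of the sketch below; the bucket name, local file, key prefixes, and API URL are placeholders, not course requirements.

```python
# Minimal ingestion sketch for Project 1 (pip install boto3 requests).
# Bucket name, file names, prefixes, and the API URL are placeholders.
import boto3
import requests

BUCKET = "badm558-<netid>-datalake"   # hypothetical bucket name
s3 = boto3.client("s3")

# Land a downloaded public dataset in the raw/ zone of the data lake
s3.upload_file("nyc_taxi_2024.csv", BUCKET, "raw/nyc_taxi/nyc_taxi_2024.csv")

# Pull an API source and land the raw response alongside it
resp = requests.get("https://api.example.com/weather?city=Chicago")  # placeholder URL
s3.put_object(
    Bucket=BUCKET,
    Key="raw/weather/chicago_2024-01-01.json",
    Body=resp.content,
)
```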
Deliverables:
- AWS S3 bucket created with proper structure (raw, processed, output folders)
- At least 2 data sources ingested:
  - Public dataset (Kaggle, government data, etc.)
  - API data (weather, stocks, social media, etc.)
- dbt project with:
  - Staging and mart models
  - Schema tests (not null, unique, accepted values)
  - Documentation and lineage graph
- Data organization with clear naming conventions
- Basic access control (IAM roles, bucket policies)
- Cost estimation document (a worked example follows this list)
- GitHub repo with infrastructure code
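For the cost estimation deliverable, a back-of-the-envelope sketch of the expected level of detail is shown below. The prices are rough approximations of S3 Standard rates in us-east-1 and must be checked against the current AWS pricing pages; the data volumes are invented.

```python
# Illustrative S3 cost estimate; prices and volumes are assumptions,
# not official figures. Always verify against current AWS pricing.
storage_gb = 25                    # assumed raw + processed data volume
price_per_gb_month = 0.023         # assumed S3 Standard rate, USD per GB-month
put_requests = 5_000               # assumed monthly PUT requests
price_per_1k_puts = 0.005          # assumed PUT request price, USD per 1,000

monthly_cost = (storage_gb * price_per_gb_month
                + (put_requests / 1_000) * price_per_1k_puts)
print(f"Estimated monthly S3 cost: ${monthly_cost:.2f}")   # ~ $0.60
```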
Rubric (5 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| S3 Organization | Well-structured with clear partitioning | Organized, some redundancy | Messy or hard to navigate |
| dbt Models | Clean models with tests, docs, and lineage | Functional models, basic tests | Minimal or untested models |
| Data Ingestion | Multiple sources, automated process | 2 sources, some manual steps | Limited sources or manual only |
| Security | Proper IAM roles, bucket policies | Basic security | Missing security |
| Cost Awareness | Monitored costs, optimized | Aware of costs | No cost tracking |
Project 2: Spark Data Processing + Warehousing (Weeks 4-6, Individual, 35% of grade)
Problem Statement: Process a large dataset using Apache Spark: clean, transform, and aggregate the data, then load the results into both Redshift and Snowflake to compare the two platforms.
Dataset Options:
- 1GB+ public dataset (Wikipedia, Amazon reviews, GitHub, Twitter, etc.)
- Student's own data (CSV, Parquet, JSON)
- Synthetic/generated big data
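Whichever dataset you choose, the core of the PySpark application will have roughly the shape sketched below; the bucket, paths, and column names are illustrative (they assume the Amazon reviews option) and are not a required schema.

```python
# Illustrative PySpark skeleton for Project 2; paths and columns are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("badm558-project2").getOrCreate()

# Read raw data from the S3 data lake built in Project 1
reviews = spark.read.json("s3a://<your-bucket>/raw/amazon_reviews/")

# Clean: drop records missing key fields, normalize types
clean = (reviews
         .dropna(subset=["product_id", "star_rating"])
         .withColumn("star_rating", F.col("star_rating").cast("int")))

# Aggregate: average rating and review count per product
summary = (clean
           .groupBy("product_id")
           .agg(F.avg("star_rating").alias("avg_rating"),
                F.count("*").alias("n_reviews")))

# Write results back to the processed zone as Parquet
summary.write.mode("overwrite").parquet("s3a://<your-bucket>/processed/review_summary/")
```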
Deliverables:
- PySpark application (Python + Spark DataFrame API)
- Data cleaning + transformation logic
- Aggregations and analytical queries
- Output loaded to both Redshift and Snowflake, with a platform comparison write-up (see the loading sketch after this list)
- Performance analysis:
  - Execution time with different cluster sizes
  - Cost estimates for 1-node, 5-node, 10-node clusters
  - Optimization recommendations
- Jupyter notebook explaining approach + results
- GitHub repo with PySpark code + data schema
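One possible shape for the warehouse-loading step is Spark's generic JDBC writer against Redshift, sketched below; connection details are placeholders, and in practice the dedicated spark-redshift and Snowflake Spark connectors (which stage data through S3) are generally preferred for bulk loads.

```python
# Hedged sketch: writing a Project 2 summary DataFrame to Redshift over JDBC.
# Assumes the Redshift JDBC driver jar is on the Spark classpath; endpoint,
# credentials, and table name are placeholders. Snowflake loading would use
# its own Spark connector instead.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("badm558-warehouse-load").getOrCreate()

# Stand-in for the summary DataFrame produced by the processing step
summary = spark.createDataFrame(
    [("B000123", 4.2, 37)], ["product_id", "avg_rating", "n_reviews"])

(summary.write
    .format("jdbc")
    .option("url", "jdbc:redshift://<cluster-endpoint>:5439/dev")
    .option("dbtable", "analytics.review_summary")
    .option("user", "<username>")
    .option("password", "<password>")
    .mode("overwrite")
    .save())
```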
Rubric (5 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Spark Code | Efficient transformations, proper caching | Functional code | Inefficient or problematic |
| Data Quality | Handles missing values, edge cases | Handles common cases | Errors or data loss |
| Platform Comparison | Thoughtful Redshift vs. Snowflake analysis | Basic comparison | Minimal or missing |
| Performance | Optimized partitioning, justifies choices | Adequate performance | Unoptimized or slow |
| Documentation | Clear explanation of transformations | Adequate explanation | Minimal docs |
Project 3: End-to-End Data Pipeline (Weeks 7-8, Team of 3-4, 35% of grade)
Problem Statement: Design and implement a complete big data pipeline as a team: automate daily or weekly data ingestion, processing, and loading, and build a simple dashboard or reporting layer on top.
Pipeline Components:
- Ingestion: Scheduled pull of data into S3 from an API, CloudWatch logs, or another source (see the Lambda sketch after this list)
- Processing: Spark job cleaning + transforming data
- Storage: Results stored in data lake (S3) or warehouse (Redshift/Snowflake)
- Reporting: Simple dashboard or SQL queries showing results
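The ingestion component can be as small as a single scheduled Lambda function (triggered, for example, by an EventBridge rule). A minimal sketch follows; the API URL, bucket name, and key pattern are placeholders.

```python
# Hedged sketch of a scheduled ingestion Lambda for Project 3.
# Bucket name and API URL are placeholders, not course-provided resources.
import datetime
import json
import urllib.request

import boto3

s3 = boto3.client("s3")
BUCKET = "badm558-team-pipeline"   # placeholder bucket name

def lambda_handler(event, context):
    """Pull one day of data from a (placeholder) API and land it in the raw zone."""
    today = datetime.date.today().isoformat()
    with urllib.request.urlopen("https://api.example.com/daily-metrics") as resp:  # placeholder URL
        payload = resp.read()
    key = f"raw/daily_metrics/{today}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    return {"statusCode": 200, "body": json.dumps({"written": key})}
```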
Deliverables:
- AWS Lambda function(s) for scheduled ingestion
- Spark job for data processing
- AWS Step Functions workflow orchestrating the pipeline
- CloudWatch monitoring + alarms (see the alarm sketch after this list)
- Cost estimation (monthly running cost)
- Architecture diagram
- Operational runbook (how to monitor + troubleshoot)
- GitHub repo with all code + infrastructure templates
- Team oral defense: present pipeline architecture + live demo (20% of Project 3 grade)
- Peer evaluation of team contributions
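For the monitoring deliverable, a minimal example of one boto3-defined alarm is sketched below; the function name, SNS topic ARN, and threshold are placeholders, and teams will likely want several alarms plus structured logging.

```python
# Illustrative boto3 snippet: alarm when the ingestion Lambda reports any errors.
# Function name and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="badm558-ingest-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "badm558-ingest"}],
    Statistic="Sum",
    Period=3600,               # evaluate hourly
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:badm558-alerts"],  # placeholder ARN
)
```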
Rubric (5 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Pipeline Architecture | Well-designed, scalable, modular | Functional, some inefficiency | Basic or problematic design |
| Automation | Fully automated, scheduled correctly | Mostly automated | Manual steps remaining |
| Monitoring | Comprehensive logging + alerts | Basic monitoring | Minimal visibility |
| Oral Defense | Clear explanation, confident demo, handles Q&A well | Adequate presentation | Unclear or unprepared |
| Production Readiness | Error handling, rollback plan, tested | Mostly robust | Lacks error handling |
AI Tools Integration
Weeks 1-3 (Cloud Setup + dbt):
- Use Claude/ChatGPT to:
  - Explain AWS services + when to use each
  - Debug IAM permission issues
  - Generate dbt model templates and tests
  - Review bucket policies
Weeks 4-6 (Spark Processing + Warehousing):
- Use AI to:
  - Suggest Spark DataFrame transformations
  - Debug PySpark errors
  - Compare Redshift vs. Snowflake features
  - Generate cost calculations
Weeks 7-8 (Pipeline):
- Use AI to:
  - Generate Lambda function code
  - Debug Step Functions workflows
  - Suggest monitoring strategies
  - Review error handling patterns
Studio Session Topics:
- Week 1: Cloud fundamentals + AWS services overview
- Week 2: S3 + EC2 essentials workshop
- Week 3: dbt fundamentals — models, tests, documentation
- Week 4: Spark architecture + RDD/DataFrame concepts
- Week 5: Spark SQL optimization + cost implications
- Week 6: Data warehouse comparison — Redshift vs. Snowflake
- Week 7: Orchestration + scheduling (Lambda, Step Functions, Airflow)
- Week 8: Team presentations + operational best practices
Assessment Summary
| Component | Weight | Notes |
|---|---|---|
| Project 1 (Data Lake + dbt) | 20% | Weeks 1-3, individual |
| Project 2 (Spark + Warehousing) | 35% | Weeks 4-6, individual |
| Project 3 (End-to-End Pipeline) | 35% | Weeks 7-8, team (includes oral defense) |
| Studio participation | 10% | Weekly attendance |
No traditional exam; assessment is entirely project-based with an infrastructure focus.
Technology Stack
- Cloud: AWS (S3, EC2, Lambda, RDS, Redshift, Step Functions)
- Data Warehousing: Redshift + Snowflake (students learn both platforms)
- Data Transformation: dbt (models, tests, documentation)
- Processing: Apache Spark (PySpark), Python
- Languages: Python, SQL
- Infrastructure: CloudFormation or Terraform (optional)
- Monitoring: CloudWatch, X-Ray
- Notebook: Jupyter with Spark kernel
Prerequisites
- Completion of BADM 554 (SQL) + FIN 550 (Python) (or equivalent)
- Comfortable with Linux command line (pre-course module required)
Last Updated: February 2026