Last updated: March 26, 2026

FIN 550 - Big Data Analytics in Finance (ML I)

Program-level details: See program/curriculum.md

Status: In Development. Xing is completing Modules 3/4/8 after spring break. Structure stable.

Proposed MSBAi name: Predictive Analytics for Business — pending formal rename approval

Credits: 4 | Term: Fall 2026 (Weeks 9-16) | Instructors: Xing/Mathias

Course Vision

Students learn core machine learning methods — regression, classification, regularization, tree-based models, and neural networks — applied to real-world prediction problems drawn primarily from financial markets and institutions. Finance provides the running context because it offers the richest, most granular, and most publicly available datasets for learning these skills. The methods are universal; the applications are financial.

This is ML I in the MSBAi sequence. FIN 550 focuses on supervised learning: building, evaluating, and selecting models for prediction. BADM 576 (ML II) extends to unsupervised learning, NLP, time series, deep learning, and deployment.

Domain perspective: Each MSBAi core course brings a distinct business lens. FIN 550 brings the finance and accounting perspective — students work with stock returns, firm financial statements, corporate bonds, mutual funds, and market text data. No prior finance knowledge is assumed; financial concepts are introduced as needed.

Prerequisites

Learning Outcomes (L-C-E Framework)

Literacy (Foundational Awareness)

Competency (Applied Skills)

Expertise (Advanced Application)

Module-by-Module Breakdown

Each module includes asynchronous video lectures (concepts + data example), a live studio session, and either a team exercise or project milestone.

Module 1: Returns, Risk, and Event Studies (Xing)

| Component | Content |
| --- | --- |
| ML Skill | Hypothesis testing, statistical event analysis, abnormal return estimation |
| Finance Context | Stock returns (simple, cumulative), portfolio construction (equal-weighted, value-weighted), Sharpe ratios |
| Video — Concepts | Returns and risk fundamentals, portfolio construction, market reactions to corporate events |
| Video — Data Example | Event study on dividend increase announcements: estimating abnormal returns around event dates |
| Studio Session | Event study on stock split announcements (guided, live) |
| Team Exercise | Event study on dividend decrease announcements |

For career pivoters: No finance background assumed. Module introduces “returns” as percentage change in price and builds from there. All formulas implemented in Python — conceptual understanding matters more than memorization.
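The core computation behind every event study in this module can be sketched in a few lines: fit a market model over an estimation window, then measure abnormal returns (actual minus expected) around the event date. A minimal sketch on simulated data; the window indices and return parameters are illustrative, not course material.

```python
import numpy as np

def abnormal_returns(stock_ret, market_ret, est_window, event_window):
    """Market-model event study: fit alpha/beta on the estimation window,
    then compute abnormal returns around the event date."""
    # Fit r_stock = alpha + beta * r_market on the estimation window
    beta, alpha = np.polyfit(market_ret[est_window], stock_ret[est_window], 1)
    expected = alpha + beta * market_ret[event_window]
    ar = stock_ret[event_window] - expected   # abnormal returns (AR)
    return ar, ar.cumsum()                    # AR and cumulative AR (CAR)

# Toy example: 120 days of simulated returns, event window near the end
rng = np.random.default_rng(0)
mkt = rng.normal(0.0005, 0.01, 120)
stk = 0.0002 + 1.2 * mkt + rng.normal(0, 0.005, 120)
ar, car = abnormal_returns(stk, mkt, slice(0, 100), slice(105, 116))
```

In the studio and team exercises, the simulated series would be replaced by actual stock and market returns around the announcement dates.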


Module 2: Linear Regression and Firm Characteristics (Mathias)

| Component | Content |
| --- | --- |
| ML Skill | Linear regression, multiple regression, Fama-MacBeth cross-sectional regression, interpreting coefficients |
| Finance Context | Predicting stock returns using firm characteristics (size, market-to-book) from Compustat accounting data |
| Video — Concepts | Linear regression for prediction, Fama-MacBeth regressions, key firm characteristics and corporate variables |
| Video — Data Example | Cross-sectional stock return prediction using three firm characteristics |
| Studio Session | Hands-on regression with Compustat data |
| Team Exercise | Predict cross-sectional stock returns using multiple firm characteristics |
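The Fama-MacBeth procedure is just two steps: a cross-sectional OLS per period, then a time-series average of the slopes. A minimal sketch on a synthetic panel (dimensions and coefficients are made up for illustration):

```python
import numpy as np

def fama_macbeth(returns, chars):
    """Fama-MacBeth: one cross-sectional OLS per period, then average
    the coefficients over time.  returns: (T, N); chars: (T, N, K)."""
    T, N, K = chars.shape
    coefs = np.empty((T, K + 1))
    for t in range(T):
        X = np.column_stack([np.ones(N), chars[t]])   # intercept + characteristics
        coefs[t], *_ = np.linalg.lstsq(X, returns[t], rcond=None)
    mean = coefs.mean(axis=0)
    # Simple Fama-MacBeth t-stats from the time series of slope estimates
    tstat = mean / (coefs.std(axis=0, ddof=1) / np.sqrt(T))
    return mean, tstat

# Toy panel: 60 months, 200 firms, 2 characteristics (e.g. size, market-to-book)
rng = np.random.default_rng(1)
chars = rng.normal(size=(60, 200, 2))
rets = 0.01 + 0.005 * chars[..., 0] + rng.normal(0, 0.05, (60, 200))
premia, tstats = fama_macbeth(rets, chars)
```

With real Compustat characteristics the structure is identical; only the data loading changes.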

Module 3: Model Evaluation and Cross-Validation (Xing)

| Component | Content |
| --- | --- |
| ML Skill | Train/test splits, k-fold cross-validation, leave-one-out CV, hyperparameter tuning, model selection |
| Finance Context | Evaluating prediction quality on financial data from Modules 1-2 |
| Video — Concepts | Validation-set approach, leave-one-out CV, k-fold CV, hyperparameter tuning for model selection |
| Video — Data Example 1 | Predicting housing prices using linear regression with polynomial features — comparing CV methods |
| Video — Data Example 2 | Improving stock return prediction from Module 2 using cross-validation |
| Studio Session | Cross-validation workshop: apply CV to Module 2 models, compare results |
| Team Exercise | Evaluate and improve regression models using systematic cross-validation |

Placement rationale: Cross-validation is taught in Module 3 (not Module 8) so students apply rigorous evaluation to every model they build from this point forward.
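The housing-price data example above can be sketched with scikit-learn: use k-fold CV to choose among polynomial degrees rather than trusting in-sample fit. The data here is synthetic (a noisy quadratic standing in for housing prices), so the numbers are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data: a noisy quadratic relationship (stand-in for housing prices)
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, (200, 1))
y = 1.0 + 0.5 * X[:, 0] - 0.8 * X[:, 0] ** 2 + rng.normal(0, 0.3, 200)

# 5-fold CV mean R^2 for each candidate polynomial degree
scores = {}
for degree in (1, 2, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores[degree] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# Model selection: pick the degree with the best cross-validated score
best_degree = max(scores, key=scores.get)
```

The same loop structure applies to the Module 2 stock-return models: swap the pipeline for the cross-sectional regression and compare CV scores across specifications.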


Module 4: Logistic Regression and Variable Selection (Xing)

| Component | Content |
| --- | --- |
| ML Skill | Logistic regression, stepwise selection (forward/backward), Lasso regularization, Ridge regression |
| Finance Context | Predicting corporate bond defaults, binary business outcomes |
| Video — Concepts | Variable selection methods (stepwise, Lasso), logistic regression for binary prediction, regularization |
| Video — Data Example 1 | Stock return prediction with variable selection — comparing stepwise vs. Lasso using corporate characteristics |
| Video — Data Example 2 | Corporate bond default prediction using logistic regression and Lasso with bond- and firm-level characteristics |
| Studio Session | Predicting dividend increases using logistic regression and Lasso |
| Team Exercise | Predict corporate bond defaults with an expanded set of firm- and bond-level characteristics |
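A minimal sketch of the default-prediction example: L1-penalized (Lasso-style) logistic regression zeroes out irrelevant characteristics while fitting the binary outcome. The data is simulated (only 2 of 10 characteristics truly matter), so coefficients and the penalty strength are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy firm/bond panel: only the first 2 of 10 characteristics matter
rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 10))
logit = -2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1]          # true default model
default = (rng.uniform(size=2000) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, default, test_size=0.3, random_state=0)

# L1 penalty performs variable selection inside the logistic fit
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
n_selected = int(np.sum(clf.coef_[0] != 0))            # surviving characteristics
```

Stepwise selection would instead add or drop characteristics one at a time and compare validation scores; the Lasso path does the comparison implicitly through the penalty.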

Module 5: Factor Models and Portfolio Performance (Mathias)

| Component | Content |
| --- | --- |
| ML Skill | Multi-factor regression, performance attribution, model comparison across factor sets |
| Finance Context | CAPM, Fama-French multi-factor models, mutual fund alpha and beta, index fund comparison |
| Video — Concepts | CAPM and multi-factor models, mutual funds vs. index funds, performance attribution through alpha and beta |
| Video — Data Example | Mutual fund performance attribution — do funds earn their fees? Do returns persist? |
| Studio Session | Predicting mutual fund returns using fund characteristics (idiosyncratic volatility, fund size) |
| Team Exercise | Mutual fund alpha and beta estimation in alternative samples, including performance attribution for index funds |

For career pivoters: Factor models are introduced as “multi-variable regression where the variables are market-wide risk factors.” The finance concepts (alpha, beta, CAPM) are explained as business context, not prerequisites.
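That "multi-variable regression" framing fits in a few lines: regress the fund's excess returns on the factor returns, and the intercept is alpha while the slopes are the factor betas. A minimal sketch on simulated data; the factor count, betas, and sample length are made up.

```python
import numpy as np

def alpha_beta(fund_excess, factor_returns):
    """Regress fund excess returns on factor returns; the intercept is
    alpha, the slopes are factor betas (CAPM with one market factor)."""
    T = len(fund_excess)
    X = np.column_stack([np.ones(T), factor_returns])
    coef, *_ = np.linalg.lstsq(X, fund_excess, rcond=None)
    return coef[0], coef[1:]            # (alpha, betas)

# Toy fund: beta 0.9 on the market, 0.3 on a value factor, zero true alpha
rng = np.random.default_rng(4)
factors = rng.normal(0.004, 0.04, (240, 2))   # 20 years of monthly factor returns
fund = 0.9 * factors[:, 0] + 0.3 * factors[:, 1] + rng.normal(0, 0.01, 240)
alpha, betas = alpha_beta(fund, factors)
```

With real fund and Fama-French factor data, a positive and statistically significant alpha is the "does the fund earn its fees" question from the data example.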


Module 6: Text Analytics for Business Data (Xing)

| Component | Content |
| --- | --- |
| ML Skill | Bag-of-words, n-grams, dictionary-based sentiment measurement, text-as-features for prediction |
| Finance Context | 10-K/10-Q filings, measuring company sentiment, detecting exposure to topics (tariffs, geographic risk) |
| Video — Concepts | Processing text: bag-of-words, n-grams, sentiment lexicons (Loughran-McDonald), exposure measurement |
| Video — Data Examples | Measuring company sentiment and tariff exposure from SEC filings |
| Studio Session | Hands-on text processing with financial filings |
| Team Exercise | Extract a predictive text signal from 10-K filings and use it in a regression model |

NLP boundary note: This module covers practical “text → numeric features → prediction.” Transformer models (BERT), word embeddings, and text classification as standalone NLP tasks are covered in BADM 576 (ML II, Week 3).
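The "text → numeric features" step can be as simple as counting dictionary hits. A minimal sketch of dictionary-based sentiment; the word lists here are a tiny illustrative stand-in for the actual Loughran-McDonald lexicon, which contains thousands of words.

```python
# Tiny illustrative word lists (the real Loughran-McDonald lists are far larger)
NEGATIVE = {"loss", "litigation", "impairment", "decline", "adverse"}
POSITIVE = {"growth", "improvement", "strong", "gain", "favorable"}

def sentiment_score(text):
    """(positive - negative) word count, scaled by document length."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(len(words), 1)

filing = "Revenue growth was strong, offset by an impairment loss and litigation."
score = sentiment_score(filing)
```

The resulting score is a single numeric feature per filing, which plugs directly into the regression models from Modules 2-4.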


Module 7: Tree-Based and Ensemble Methods (Xing)

| Component | Content |
| --- | --- |
| ML Skill | Classification and regression trees, bagging, random forests, gradient boosting, neural networks |
| Finance Context | Housing price prediction, county-level economic data, mortgage unpaid balance prediction |
| Video — Concepts | Tree-based models, ensemble methods (bagging, random forest, boosting), intro to neural networks |
| Video — Data Example | Ames housing price prediction using tree-based models and neural networks |
| Studio Session | Predicting county-level housing prices using Census demographic and housing data |
| Team Exercise | Predict mortgage unpaid balances using Freddie Mac loan-level data with tree-based methods |
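The value of ensembles shows up when they are compared against the linear baseline from Modules 2-3. A minimal sketch on synthetic housing-style data with a nonlinear interaction and a threshold effect that a linear model cannot capture (features and coefficients are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy housing-style data with an interaction and a threshold effect
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, (1500, 4))
y = 10 * X[:, 0] * X[:, 1] + 5 * (X[:, 2] > 0.5) + rng.normal(0, 0.5, 1500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Linear baseline vs. a bagged-tree ensemble (random forest)
lin = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

r2_lin = r2_score(y_te, lin.predict(X_te))
r2_rf = r2_score(y_te, rf.predict(X_te))
```

Having built the baseline first, students can see exactly which part of the improvement comes from the ensemble's ability to model interactions and thresholds.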

Module 8: Trading Strategies and Business Case Synthesis (Mathias + Xing)

| Component | Content |
| --- | --- |
| ML Skill | End-to-end empirical pipeline, model comparison, business case communication, oral defense |
| Finance Context | Trading strategies, calendar-time portfolios, momentum and value anomalies, portfolio sorting |
| Video — Concepts | Trading strategy construction, calendar-time portfolios, return anomalies |
| Studio Session | Cross-sectional return prediction with momentum and value anomalies using portfolio sorting techniques |
| Tutorial Project Presentations | Team oral defense of trading signal system (see Final Project below) |
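The portfolio-sorting technique from the studio session reduces to a per-period decile sort. A minimal sketch of a long-short decile spread on a simulated panel (the signal strength and noise level are illustrative):

```python
import numpy as np

def decile_long_short(signal, next_ret):
    """Sort stocks into deciles on the signal each period; return the
    spread between top- and bottom-decile equal-weighted returns."""
    spreads = []
    for s, r in zip(signal, next_ret):             # loop over periods
        cut = np.quantile(s, [0.1, 0.9])
        spreads.append(r[s >= cut[1]].mean() - r[s <= cut[0]].mean())
    return np.array(spreads)

# Toy panel: 60 months x 500 stocks; signal weakly predicts next-month return
rng = np.random.default_rng(6)
sig = rng.normal(size=(60, 500))
ret = 0.002 * sig + rng.normal(0, 0.05, (60, 500))
spread = decile_long_short(sig, ret)
mean_spread = spread.mean()
```

With a momentum or value signal in place of the random one, the same sort produces the anomaly portfolios studied in the studio session, and the spread series feeds the calendar-time analysis.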

Tutorial Project: Trading Signal System (Team, runs alongside modules)

Throughout the course, student teams (3 members) build a trading signal system — the signature applied project for FIN 550. Teams:

  1. Identify a signal from a big data source (text, alternative data, financial variables)
  2. Predict stock returns using the signal with methods learned in the course
  3. Construct a dynamic portfolio strategy based on the signal
  4. Analyze abnormal returns for the strategy

The tutorial project integrates skills from every module. Teams present progress during studio sessions and submit the final deliverable + oral defense in Module 8.
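The four steps chain together into one short pipeline. A hypothetical end-to-end skeleton with placeholder random data standing in for a real big-data signal source; every number here is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# 1. Identify a signal (random numbers standing in for, e.g., a text score)
signal = rng.normal(size=(60, 300))                        # months x stocks
returns = 0.003 * signal + rng.normal(0, 0.06, (60, 300))  # next-month returns

# 2. Predict returns with the signal (per-month cross-sectional slope)
slopes = [np.polyfit(s, r, 1)[0] for s, r in zip(signal, returns)]

# 3. Construct a portfolio: long top quintile, short bottom quintile each month
strat = np.array([
    r[s >= np.quantile(s, 0.8)].mean() - r[s <= np.quantile(s, 0.2)].mean()
    for s, r in zip(signal, returns)
])

# 4. Analyze abnormal returns: mean and t-statistic of the strategy series
t_stat = strat.mean() / (strat.std(ddof=1) / np.sqrt(len(strat)))
```

Each module replaces one placeholder with a real component: text features (Module 6) for step 1, regularized or tree-based predictors (Modules 4 and 7) for step 2, and factor-model attribution (Module 5) for step 4.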


Assessment

One major team project (the Trading Signal System) runs alongside the course, with weekly assignments building foundational skills and project milestones scaffolding toward the final deliverable. No traditional exam.

Weekly Assignments (35%)

Individual exercises tied to each module. Students apply that module’s methods to financial datasets and submit a Jupyter notebook + short write-up via GitHub.

| Assignment | Module | Description |
| --- | --- | --- |
| A1: Event Study Analysis | 1 | Conduct an event study on dividend decrease announcements; estimate abnormal returns around event dates |
| A2: Regression Analysis | 2 | Predict cross-sectional stock returns using multiple firm characteristics from Compustat data |
| A3: Cross-Validation Workshop | 3 | Apply k-fold CV to Module 2 regression models; compare validation approaches and report metrics |
| A4: Classification Exercise | 4 | Predict corporate bond defaults using logistic regression + Lasso with firm- and bond-level characteristics |
| A5: Factor Model Analysis | 5 | Estimate mutual fund alpha and beta in alternative samples; performance attribution for index funds |
| A6: Text Feature Extraction | 6 | Extract a predictive text signal from 10-K filings and use it in a regression model |
| A7: Ensemble Methods | 7 | Predict mortgage unpaid balances using Freddie Mac data with tree-based methods; compare with regression |

Rubric (per assignment, 4 dimensions):

| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
| --- | --- | --- | --- |
| Methodology | Correct application of module methods, justified choices | Reasonable approach with minor gaps | Flawed or missing methodology |
| Model Evaluation | Rigorous evaluation, explains metrics clearly | Evaluation applied but basic | Missing or weak evaluation |
| Code Quality | Clean, documented Jupyter notebook, reproducible | Adequate code, some comments | Messy or undocumented |
| Written Analysis | Connects methodology to findings; explains business implications | Adequate explanation | Minimal or unclear |

Project Milestones (25%)

Team milestones (team of 3) that scaffold toward the final Trading Signal System deliverable. Submitted at defined checkpoints; each milestone receives formative feedback.

| Milestone | Due | Deliverable |
| --- | --- | --- |
| M1: Project Proposal | End of Module 2 | Signal hypothesis, data source identification, team roles, preliminary EDA |
| M2: Signal Construction + Regression Baseline | End of Module 4 | Constructed signal, baseline regression model, initial cross-validation results |
| M3: Model Expansion + Text Integration | End of Module 6 | Classification/text features added to pipeline, model comparison (structured vs. structured + text) |
| M4: Peer Review | Module 7 | Cross-team review of another team's pipeline; written feedback on methodology, evaluation, and gaps |

Rubric (per milestone, 3 dimensions):

| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
| --- | --- | --- | --- |
| Progress | Substantial, on-track work building on prior modules | Adequate progress with some gaps | Behind schedule or superficial |
| Technical Quality | Methods applied correctly, evaluation included | Functional but basic analysis | Errors or missing components |
| Team Collaboration | Clear evidence of shared work, complementary contributions | Adequate collaboration | Uneven contribution |

Final Project Deliverable (15%)

Complete the Trading Signal System and submit the final analysis package.

Deliverables:

Rubric (4 dimensions):

| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
| --- | --- | --- | --- |
| Signal Design | Creative, well-justified signal from interesting data source | Reasonable signal choice | Generic or unjustified |
| Model Pipeline | Multiple methods compared systematically, strong evaluation | Functional pipeline | Single method or weak evaluation |
| Portfolio Analysis | Rigorous abnormal return analysis, addresses look-ahead bias | Adequate portfolio construction | Flawed methodology |
| Written Analysis | Clear business case, honest about limitations and out-of-sample risks | Adequate explanation | Minimal or unclear writeup |

Oral Defense (20%)

Team oral defense: 15-min presentation + 10-min Q&A. Each team member must demonstrate individual understanding of the full pipeline — not just their assigned section.

Rubric (3 dimensions):

| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
| --- | --- | --- | --- |
| Presentation Clarity | Confident delivery, logical structure, clear visualizations | Adequate delivery and structure | Unclear or disorganized |
| Technical Depth | Articulates trade-offs, explains model choices, addresses limitations | Demonstrates understanding of key methods | Superficial or rehearsed-only understanding |
| Q&A Handling | Handles unexpected questions well, thinks on feet | Answers most questions adequately | Struggles with questions outside prepared remarks |

Studio Participation (5%)

Weekly engagement in live studio sessions. Students work through guided exercises, discuss approaches, and share progress on project milestones.

AI Tools Integration

Modules 1-3 (Regression and Evaluation):

Modules 4-6 (Classification and Text):

Modules 7-8 (Ensembles and Synthesis):

AI Attribution: Students document all AI tool usage in project submissions — which tools, what prompts, how outputs were modified. See design/assessment_strategy.md for attribution log template.

Assessment Summary

| Component | Weight | Notes |
| --- | --- | --- |
| Weekly Assignments (7 assignments) | 35% | Individual, one per module |
| Project Milestones (4 milestones) | 25% | Team of 3, scaffolded checkpoints |
| Final Project Deliverable | 15% | Team of 3, complete trading signal system |
| Oral Defense | 20% | Team of 3, 15-min presentation + 10-min Q&A |
| Studio Participation | 5% | Weekly engagement |

No traditional exam. Weekly assignments build skills; the team project integrates them. Oral defense verifies understanding.

AI Usage Levels (AIAS)

| Assessment | AIAS Level | AI Permitted |
| --- | --- | --- |
| Weekly Assignments | 2 | AI for debugging, interpreting outputs, finance concept explanation — with attribution |
| Project Milestones | 2 | AI for code assistance, data exploration — with attribution |
| Final Project Deliverable | 3 | AI as collaborator for business case drafting and model comparison — with full disclosure |
| Oral Defense | 0 | No AI |
| Studio Participation | 1 | AI for concept exploration during exercises |

Technology Stack


Pedagogical Notes for Faculty

Design suggestions grounded in program research — not requirements. Adapt to your course and teaching style. Full references in reference/articles/.

The scenic route (cognitive friction): FIN 550 is where the tension between AI efficiency and learning depth is sharpest. Students can use Copilot to generate a random forest in seconds — but if they haven’t first built a baseline regression by hand (Modules 1-3), the ensemble result has no prediction to surprise them. The dopamine gap research shows the brain learns through prediction errors: the gap between what you expected and what happened. A student who struggles with linear regression before seeing how gradient boosting improves on it learns more than one who skips straight to XGBoost. The AIAS progression (0→1→2→3 across modules) already scaffolds this; the key is framing the early manual work as the investment that makes the later AI-assisted work register. → Machulla (2026), Schultz et al. (1997)

The IKEA effect (completion matters): The Trading Signal System runs across all 8 modules — this is the longest sustained project in the program. The IKEA effect research shows that labor leads to love only when it leads to completion. Each milestone (M1→M4) should feel like a working thing: a testable hypothesis, a running model, a pipeline that produces output. The oral defense in Module 8 is the ultimate completion signal. For career pivoters with no finance background, the moment they can explain their trading signal system to a panel is transformative — it’s when “I’m not a finance person” becomes “I built this.” → Norton, Mochon & Ariely (2012)

Variable uncertainty and calibrated difficulty: The dopamine system is most engaged at ~50% uncertainty — when the student genuinely doesn’t know if they’ll succeed. Assignments that are too easy (certain success) or too hard (certain failure) produce flat responses. For career pivoters, Module 5 (Factor Models) is a known difficulty spike — the finance theory is unfamiliar. Consider front-loading the finance concepts students need (the “For career pivoters” notes are already in the right spirit) so the challenge is the ML application, not the domain vocabulary. The goal: students should feel “I might be able to do this” — not “I definitely can” or “I definitely can’t.” → Machulla (2026), Fiorillo, Tobler & Schultz (2003)

Three AI iterations before milestone submission: For milestones M2-M4 (AIAS 2-3), consider requiring students to iterate with AI at least 3 times before submitting. This builds the habit of using AI as a thinking partner: first attempt → AI critique → revised attempt → AI alternative → final version with documented rationale. Produces richer AI Attribution Logs and prevents the “paste Copilot output, submit” pattern. → Means (2026, “Practice Gap”)

The Push-Back Protocol for model interpretation: When students use AI to interpret regression outputs or explain cross-validation results (Modules 2-4), there’s a risk they accept AI’s explanation without verifying it against the data. The Push-Back Protocol (demand evidence → surface assumptions → request alternatives → stress-test → synthesize) is especially valuable here: ML model interpretation is exactly the kind of task where AI sounds confident but can be wrong. Consider making one weekly assignment a structured push-back exercise. → Means (2025, “Push-Back Protocol”)

Peer review as milestone M4: The cross-team review at M4 is a strong design choice. The IKEA effect research suggests students learn as much from evaluating others’ pipelines as from building their own — exposure to different approaches calibrates their sense of quality. Consider structuring the peer review with the same rubric dimensions used for the final deliverable (signal design, model pipeline, portfolio analysis, written analysis) so students internalize the evaluation criteria before their own defense. → Norton et al. (2012), Vendrell & Johnston (2026)

Attack your assessments: Before the semester, have a confident AI user attempt each weekly assignment and the trading signal project using current AI tools. AI is already strong at generating scikit-learn pipelines and interpreting regression output. Where can AI complete the task without genuine understanding of the finance context or model assumptions? Those are the spots to add pre-AI phases or shift weight toward the oral defense. → Furze (2026)


Course Sequence: BDI 513 — Data Storytelling | Next: BADM 558 — Big Data Infrastructure