FIN 550 - Big Data Analytics in Finance (ML I)
Program-level details: See program/curriculum.md
Status: In Development. Xing is completing Modules 3/4/8 after spring break; the overall structure is stable. Proposed MSBAi name: Predictive Analytics for Business (pending formal rename approval)
| Credits: 4 | Term: Fall 2026 (Weeks 9-16) | Instructor: Xing/Mathias |
Course Vision
Students learn core machine learning methods — regression, classification, regularization, tree-based models, and neural networks — applied to real-world prediction problems drawn primarily from financial markets and institutions. Finance provides the running context because it offers the richest, most granular, and most publicly available datasets for learning these skills. The methods are universal; the applications are financial.
This is ML I in the MSBAi sequence. FIN 550 focuses on supervised learning: building, evaluating, and selecting models for prediction. BADM 576 (ML II) extends to unsupervised learning, NLP, time series, deep learning, and deployment.
Domain perspective: Each MSBAi core course brings a distinct business lens. FIN 550 brings the finance and accounting perspective — students work with stock returns, firm financial statements, corporate bonds, mutual funds, and market text data. No prior finance knowledge is assumed; financial concepts are introduced as needed.
Prerequisites
- Python programming (from BADM 554 or equivalent)
- Statistics foundation required: Students must complete the following Coursera pre-program courses (or equivalent) before starting FIN 550:
  - Exploring and Producing Data for Business Decision Making (University of Illinois)
  - Inferential and Predictive Statistics for Business (University of Illinois)
  - These are Gies College of Business courses on Coursera covering descriptive statistics, probability, sampling, hypothesis testing, and regression
Learning Outcomes (L-C-E Framework)
Literacy (Foundational Awareness)
- L1: Understand supervised learning and explain when regression vs. classification applies
- L2: Explain how financial data (returns, firm characteristics, corporate filings) is structured and used for prediction
- L3: Recognize overfitting, describe train/test/validation splits, and explain why cross-validation matters
Competency (Applied Skills)
- C1: Build and evaluate regression models (linear, Fama-MacBeth, Lasso) using financial and accounting data
- C2: Build and evaluate classification models (logistic regression, decision trees) for business outcomes
- C3: Extract predictive features from structured data (firm characteristics) and unstructured text (corporate filings)
- C4: Apply cross-validation, regularization, and hyperparameter tuning to improve model performance
Expertise (Advanced Application)
- E1: Design and execute a multi-step empirical analysis from data to portfolio strategy to performance evaluation
- E2: Apply ensemble methods (random forest, gradient boosting) and neural networks, choosing appropriately among model types
- E3: Translate predictive model results into actionable business recommendations through an oral defense
Module-by-Module Breakdown
Each module includes asynchronous video lectures (concepts + data example), a live studio session, and either a team exercise or project milestone.
Module 1: Returns, Risk, and Event Studies (Xing)
| Component | Content |
|---|---|
| ML Skill | Hypothesis testing, statistical event analysis, abnormal return estimation |
| Finance Context | Stock returns (simple, cumulative), portfolio construction (equal-weighted, value-weighted), Sharpe ratios |
| Video — Concepts | Returns and risk fundamentals, portfolio construction, market reactions to corporate events |
| Video — Data Example | Event study on dividend increase announcements: estimating abnormal returns around event dates |
| Studio Session | Event study on stock split announcements (guided, live) |
| Team Exercise | Event study on dividend decrease announcements |
For career pivoters: No finance background assumed. Module introduces “returns” as percentage change in price and builds from there. All formulas implemented in Python — conceptual understanding matters more than memorization.
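The event-study mechanics above can be sketched in a few lines. This is a minimal market-model event study on synthetic data; the return-generating process, event date, and window lengths are illustrative assumptions, not course datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns: market factor plus firm-specific noise (illustrative).
n_days = 250
mkt = rng.normal(0.0005, 0.01, n_days)
firm = 0.0002 + 1.2 * mkt + rng.normal(0, 0.015, n_days)

event_day = 200          # hypothetical announcement date
est_win = slice(0, 150)  # estimation window for the market model

# Market model: regress firm returns on market returns in the estimation window.
beta, alpha = np.polyfit(mkt[est_win], firm[est_win], 1)

# Abnormal return = actual return minus the market-model expected return.
evt_win = slice(event_day - 2, event_day + 3)   # [-2, +2] days around the event
ar = firm[evt_win] - (alpha + beta * mkt[evt_win])
car = ar.sum()  # cumulative abnormal return over the event window
```

Students implement the same pipeline on real announcement dates; the only change is swapping the simulated series for CRSP returns.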
Module 2: Linear Regression and Firm Characteristics (Mathias)
| Component | Content |
|---|---|
| ML Skill | Linear regression, multiple regression, Fama-MacBeth cross-sectional regression, interpreting coefficients |
| Finance Context | Predicting stock returns using firm characteristics (size, market-to-book) from Compustat accounting data |
| Video — Concepts | Linear regression for prediction, Fama-MacBeth regressions, key firm characteristics and corporate variables |
| Video — Data Example | Cross-sectional stock return prediction using three firm characteristics |
| Studio Session | Hands-on regression with Compustat data |
| Team Exercise | Predict cross-sectional stock returns using multiple firm characteristics |
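The two-step Fama-MacBeth procedure can be sketched as follows on synthetic data; the characteristics (size, market-to-book) and the return-generating process are invented for illustration, not drawn from Compustat.

```python
import numpy as np

rng = np.random.default_rng(1)
n_months, n_firms = 60, 100

# Synthetic panel (illustrative only): small size and low market-to-book
# earn higher returns by construction.
size = rng.normal(0, 1, (n_months, n_firms))
mb = rng.normal(0, 1, (n_months, n_firms))
ret = -0.002 * size - 0.001 * mb + rng.normal(0, 0.05, (n_months, n_firms))

# Step 1: run a separate cross-sectional regression each month.
monthly_coefs = []
for t in range(n_months):
    X = np.column_stack([np.ones(n_firms), size[t], mb[t]])
    coefs, *_ = np.linalg.lstsq(X, ret[t], rcond=None)
    monthly_coefs.append(coefs)

# Step 2: average the monthly slopes; t-stats come from their time series.
monthly_coefs = np.array(monthly_coefs)
avg = monthly_coefs.mean(axis=0)
tstat = avg / (monthly_coefs.std(axis=0, ddof=1) / np.sqrt(n_months))
```

The time-series standard error in step 2 is what distinguishes Fama-MacBeth from a single pooled regression: it accounts for cross-sectional correlation within each month.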
Module 3: Model Evaluation and Cross-Validation (Xing)
| Component | Content |
|---|---|
| ML Skill | Train/test splits, k-fold cross-validation, leave-one-out CV, hyperparameter tuning, model selection |
| Finance Context | Evaluating prediction quality on financial data from Modules 1-2 |
| Video — Concepts | Validation-set approach, leave-one-out CV, k-fold CV, hyperparameter tuning for model selection |
| Video — Data Example 1 | Predicting housing prices using linear regression with polynomial features — comparing CV methods |
| Video — Data Example 2 | Improving stock return prediction from Module 2 using cross-validation |
| Studio Session | Cross-validation workshop: apply CV to Module 2 models, compare results |
| Team Exercise | Evaluate and improve regression models using systematic cross-validation |
Placement rationale: Cross-validation is taught in Module 3 (not Module 8) so students apply rigorous evaluation to every model they build from this point forward.
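The degree-selection exercise from the housing example might look like this minimal scikit-learn sketch; the quadratic "housing" data and the candidate polynomial degrees are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)

# Synthetic data: the target is quadratic in a single feature.
X = rng.uniform(0, 3, (200, 1))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.2, 200)

# Compare polynomial degrees by mean out-of-fold R^2 under 5-fold CV.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    d: cross_val_score(
        make_pipeline(PolynomialFeatures(d), LinearRegression()),
        X, y, cv=cv, scoring="r2",
    ).mean()
    for d in (1, 2, 5)
}
best_degree = max(scores, key=scores.get)
```

Degree 1 underfits the curvature, while high degrees gain little or lose ground out-of-fold, which is exactly the pattern CV is meant to expose.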
Module 4: Logistic Regression and Variable Selection (Xing)
| Component | Content |
|---|---|
| ML Skill | Logistic regression, stepwise selection (forward/backward), Lasso regularization, Ridge regression |
| Finance Context | Predicting corporate bond defaults, binary business outcomes |
| Video — Concepts | Variable selection methods (stepwise, Lasso), logistic regression for binary prediction, regularization |
| Video — Data Example 1 | Stock return prediction with variable selection — comparing stepwise vs. Lasso using corporate characteristics |
| Video — Data Example 2 | Corporate bond default prediction using logistic regression and Lasso with bond- and firm-level characteristics |
| Studio Session | Predicting dividend increases using logistic regression and Lasso |
| Team Exercise | Predict corporate bond defaults with an expanded set of firm- and bond-level characteristics |
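A hedged sketch of the default-prediction workflow: logistic regression with an L1 (Lasso-style) penalty on synthetic bond data. The number of features, the coefficients, and the penalty strength `C` are illustrative choices, not course specifications.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic bond data (illustrative): only the first two of ten
# characteristics actually drive the default probability.
n = 2000
X = rng.normal(0, 1, (n, 10))
logit = -2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The L1 penalty zeroes out irrelevant features (variable selection).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
n_kept = int((clf.coef_ != 0).sum())
```

Comparing `n_kept` across values of `C` previews the regularization-path discussion; stepwise selection can be run on the same data for contrast.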
Module 5: Factor Models and Portfolio Performance (Mathias)
| Component | Content |
|---|---|
| ML Skill | Multi-factor regression, performance attribution, model comparison across factor sets |
| Finance Context | CAPM, Fama-French multi-factor models, mutual fund alpha and beta, index fund comparison |
| Video — Concepts | CAPM and multi-factor models, mutual funds vs. index funds, performance attribution through alpha and beta |
| Video — Data Example | Mutual fund performance attribution — do funds earn their fees? Do returns persist? |
| Studio Session | Predicting mutual fund returns using fund characteristics (idiosyncratic volatility, fund size) |
| Team Exercise | Mutual fund alpha and beta estimation in alternative samples, including performance attribution for index funds |
For career pivoters: Factor models are introduced as “multi-variable regression where the variables are market-wide risk factors.” The finance concepts (alpha, beta, CAPM) are explained as business context, not prerequisites.
Module 6: Text Analytics for Business Data (Xing)
| Component | Content |
|---|---|
| ML Skill | Bag-of-words, n-grams, dictionary-based sentiment measurement, text-as-features for prediction |
| Finance Context | 10-K/10-Q filings, measuring company sentiment, detecting exposure to topics (tariffs, geographic risk) |
| Video — Concepts | Processing text: bag-of-words, n-grams, sentiment lexicons (Loughran-McDonald), exposure measurement |
| Video — Data Examples | Measuring company sentiment and tariff exposure from SEC filings |
| Studio Session | Hands-on text processing with financial filings |
| Team Exercise | Extract a predictive text signal from 10-K filings and use it in a regression model |
NLP boundary note: This module covers practical “text → numeric features → prediction.” Transformer models (BERT), word embeddings, and text classification as standalone NLP tasks are covered in BADM 576 (ML II, Week 3).
Module 7: Tree-Based and Ensemble Methods (Xing)
| Component | Content |
|---|---|
| ML Skill | Classification and regression trees, bagging, random forests, gradient boosting, neural networks |
| Finance Context | Housing price prediction, county-level economic data, mortgage unpaid balance prediction |
| Video — Concepts | Tree-based models, ensemble methods (bagging, random forest, boosting), intro to neural networks |
| Video — Data Example | Ames housing price prediction using tree-based models and neural networks |
| Studio Session | Predicting county-level housing prices using Census demographic and housing data |
| Team Exercise | Predict mortgage unpaid balances using Freddie Mac loan-level data with tree-based methods |
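One way to sketch the regression-versus-ensemble comparison the module builds toward; the interaction and threshold effects in the synthetic target are chosen so a linear baseline demonstrably underfits (all data and coefficients are illustrative).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)

# Synthetic housing-style data with an interaction and a threshold effect
# that a plain linear model cannot capture.
n = 1000
X = rng.uniform(0, 1, (n, 4))
y = 10 * X[:, 0] * X[:, 1] + 5 * (X[:, 2] > 0.5) + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lin = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

r2_lin = r2_score(y_te, lin.predict(X_te))
r2_rf = r2_score(y_te, rf.predict(X_te))
```

Having built the linear baseline first, students can see exactly where the ensemble's gain comes from, which echoes the pedagogical note below on cognitive friction.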
Module 8: Trading Strategies and Business Case Synthesis (Mathias + Xing)
| Component | Content |
|---|---|
| ML Skill | End-to-end empirical pipeline, model comparison, business case communication, oral defense |
| Finance Context | Trading strategies, calendar-time portfolios, momentum and value anomalies, portfolio sorting |
| Video — Concepts | Trading strategy construction, calendar-time portfolios, return anomalies |
| Studio Session | Cross-sectional return prediction with momentum and value anomalies using portfolio sorting techniques |
| Project Presentations | Team oral defense of trading signal system (see Tutorial Project below) |
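The portfolio-sorting technique from the studio session might be sketched as below; the "signal" and its predictive strength are synthetic assumptions, not a real anomaly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# Synthetic cross-section: a momentum-like signal with modest
# predictive power for next-period returns (illustrative only).
n_firms = 500
signal = rng.normal(0, 1, n_firms)
next_ret = 0.02 * signal + rng.normal(0, 0.08, n_firms)

df = pd.DataFrame({"signal": signal, "ret": next_ret})

# Sort firms into deciles on the signal, then compare average returns
# of the top and bottom deciles: the classic long-short portfolio.
df["decile"] = pd.qcut(df["signal"], 10, labels=False)
decile_ret = df.groupby("decile")["ret"].mean()
long_short = decile_ret.iloc[-1] - decile_ret.iloc[0]  # long D10, short D1
```

In the Tutorial Project, the same sort is repeated each period and the long-short series is evaluated with the factor models from Module 5.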
Tutorial Project: Trading Signal System (Team, runs alongside modules)
Throughout the course, student teams (3 members) build a trading signal system — the signature applied project for FIN 550. Teams:
- Identify a signal from a big data source (text, alternative data, financial variables)
- Predict stock returns using the signal with methods learned in the course
- Construct a dynamic portfolio strategy based on the signal
- Analyze abnormal returns for the strategy
The tutorial project integrates skills from every module. Teams present progress during studio sessions and submit the final deliverable + oral defense in Module 8.
Assessment
One major team project (the Trading Signal System) runs alongside the course, with weekly assignments building foundational skills and project milestones scaffolding toward the final deliverable. No traditional exam.
Weekly Assignments (35%)
Individual exercises tied to each module. Students apply that module’s methods to financial datasets and submit a Jupyter notebook + short write-up via GitHub.
| Assignment | Module | Description |
|---|---|---|
| A1: Event Study Analysis | 1 | Conduct an event study on dividend decrease announcements; estimate abnormal returns around event dates |
| A2: Regression Analysis | 2 | Predict cross-sectional stock returns using multiple firm characteristics from Compustat data |
| A3: Cross-Validation Workshop | 3 | Apply k-fold CV to Module 2 regression models; compare validation approaches and report metrics |
| A4: Classification Exercise | 4 | Predict corporate bond defaults using logistic regression + Lasso with firm- and bond-level characteristics |
| A5: Factor Model Analysis | 5 | Estimate mutual fund alpha and beta in alternative samples; performance attribution for index funds |
| A6: Text Feature Extraction | 6 | Extract a predictive text signal from 10-K filings and use it in a regression model |
| A7: Ensemble Methods | 7 | Predict mortgage unpaid balances using Freddie Mac data with tree-based methods; compare with regression |
Rubric (per assignment, 4 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Methodology | Correct application of module methods, justified choices | Reasonable approach with minor gaps | Flawed or missing methodology |
| Model Evaluation | Rigorous evaluation, explains metrics clearly | Evaluation applied but basic | Missing or weak evaluation |
| Code Quality | Clean, documented Jupyter notebook, reproducible | Adequate code, some comments | Messy or undocumented |
| Written Analysis | Connects methodology to findings; explains business implications | Adequate explanation | Minimal or unclear |
Project Milestones (25%)
Team milestones (team of 3) that scaffold toward the final Trading Signal System deliverable. Submitted at defined checkpoints; each milestone receives formative feedback.
| Milestone | Due | Deliverable |
|---|---|---|
| M1: Project Proposal | End of Module 2 | Signal hypothesis, data source identification, team roles, preliminary EDA |
| M2: Signal Construction + Regression Baseline | End of Module 4 | Constructed signal, baseline regression model, initial cross-validation results |
| M3: Model Expansion + Text Integration | End of Module 6 | Classification/text features added to pipeline, model comparison (structured vs. structured + text) |
| M4: Peer Review | Module 7 | Cross-team review of another team’s pipeline; written feedback on methodology, evaluation, and gaps |
Rubric (per milestone, 3 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Progress | Substantial, on-track work building on prior modules | Adequate progress with some gaps | Behind schedule or superficial |
| Technical Quality | Methods applied correctly, evaluation included | Functional but basic analysis | Errors or missing components |
| Team Collaboration | Clear evidence of shared work, complementary contributions | Adequate collaboration | Uneven contribution |
Final Project Deliverable (15%)
Complete the Trading Signal System and submit the final analysis package.
Deliverables:
- Complete trading signal pipeline: signal discovery → prediction → portfolio → abnormal returns
- Apply tree-based/ensemble methods and compare with earlier regression approaches
- 5-page business case document:
  - Signal rationale and data source
  - Prediction methodology and model comparison
  - Portfolio strategy and performance analysis
  - Limitations, risks, and out-of-sample considerations
- Peer evaluation of team contributions
- GitHub repo with all code + documentation
Rubric (4 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Signal Design | Creative, well-justified signal from interesting data source | Reasonable signal choice | Generic or unjustified |
| Model Pipeline | Multiple methods compared systematically, strong evaluation | Functional pipeline | Single method or weak evaluation |
| Portfolio Analysis | Rigorous abnormal return analysis, addresses look-ahead bias | Adequate portfolio construction | Flawed methodology |
| Written Analysis | Clear business case, honest about limitations and out-of-sample risks | Adequate explanation | Minimal or unclear writeup |
Oral Defense (20%)
Team oral defense: 15-min presentation + 10-min Q&A. Each team member must demonstrate individual understanding of the full pipeline — not just their assigned section.
Rubric (3 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Presentation Clarity | Confident delivery, logical structure, clear visualizations | Adequate delivery and structure | Unclear or disorganized |
| Technical Depth | Articulates trade-offs, explains model choices, addresses limitations | Demonstrates understanding of key methods | Superficial or rehearsed-only understanding |
| Q&A Handling | Handles unexpected questions well, thinks on feet | Answers most questions adequately | Struggles with questions outside prepared remarks |
Studio Participation (5%)
Weekly engagement in live studio sessions. Students work through guided exercises, discuss approaches, and share progress on project milestones.
AI Tools Integration
Modules 1-3 (Regression and Evaluation):
- Use Claude/ChatGPT to:
  - Explain finance concepts as encountered (returns, abnormal returns, factor models)
  - Debug scikit-learn and statsmodels errors
  - Interpret regression outputs and cross-validation results
  - Suggest feature engineering approaches for financial data
Modules 4-6 (Classification and Text):
- Use AI to:
  - Explain variable selection trade-offs (Lasso vs. stepwise)
  - Generate text processing code (tokenization, sentiment)
  - Debug logistic regression issues
  - Review model comparison methodology
Modules 7-8 (Ensembles and Synthesis):
- Use AI to:
  - Explain tree-based model hyperparameters
  - Draft business case structure for trading signal report
  - Review portfolio analysis methodology
  - Practice Q&A scenarios for oral defense
AI Attribution: Students document all AI tool usage in project submissions — which tools, what prompts, how outputs were modified. See design/assessment_strategy.md for attribution log template.
Assessment Summary
| Component | Weight | Notes |
|---|---|---|
| Weekly Assignments (7 assignments) | 35% | Individual, one per module |
| Project Milestones (4 milestones) | 25% | Team of 3, scaffolded checkpoints |
| Final Project Deliverable | 15% | Team of 3, complete trading signal system |
| Oral Defense | 20% | Team of 3, 15-min presentation + 10-min Q&A |
| Studio Participation | 5% | Weekly engagement |
No traditional exam. Weekly assignments build skills; the team project integrates them. Oral defense verifies understanding.
AI Usage Levels (AIAS)
| Assessment | AIAS Level | AI Permitted |
|---|---|---|
| Weekly Assignments | 2 | AI for debugging, interpreting outputs, finance concept explanation — with attribution |
| Project Milestones | 2 | AI for code assistance, data exploration — with attribution |
| Final Project Deliverable | 3 | AI as collaborator for business case drafting and model comparison — with full disclosure |
| Oral Defense | 0 | No AI |
| Studio Participation | 1 | AI for concept exploration during exercises |
Technology Stack
- ML Libraries: scikit-learn, statsmodels, XGBoost, LightGBM
- Data: pandas, numpy
- Text Processing: nltk, scikit-learn (CountVectorizer, TfidfVectorizer)
- Visualization: matplotlib, seaborn
- Financial Data: WRDS/Compustat, CRSP, yfinance, SEC EDGAR
- IDE: VS Code with GitHub Copilot; Google Colab (browser alternative)
- Notebooks: Jupyter Notebooks (via Colab or VS Code)
- Version Control: GitHub (all projects published)
Pedagogical Notes for Faculty
Design suggestions grounded in program research — not requirements. Adapt to your course and teaching style. Full references in reference/articles/.
The scenic route (cognitive friction)
FIN 550 is where the tension between AI efficiency and learning depth is sharpest. Students can use Copilot to generate a random forest in seconds — but if they haven’t first built a baseline regression by hand (Modules 1-3), the ensemble result has no prediction to surprise them. The dopamine gap research shows the brain learns through prediction errors: the gap between what you expected and what happened. A student who struggles with linear regression before seeing how gradient boosting improves on it learns more than one who skips straight to XGBoost. The AIAS progression (0→1→2→3 across modules) already scaffolds this; the key is framing the early manual work as the investment that makes the later AI-assisted work register. → Machulla (2026), Schultz et al. (1997)
The IKEA effect (completion matters)
The Trading Signal System runs across all 8 modules — this is the longest sustained project in the program. The IKEA effect research shows that labor leads to love only when it leads to completion. Each milestone (M1→M4) should feel like a working thing: a testable hypothesis, a running model, a pipeline that produces output. The oral defense in Module 8 is the ultimate completion signal. For career pivoters with no finance background, the moment they can explain their trading signal system to a panel is transformative — it’s when “I’m not a finance person” becomes “I built this.” → Norton, Mochon & Ariely (2012)
Variable uncertainty and calibrated difficulty
The dopamine system is most engaged at ~50% uncertainty — when the student genuinely doesn’t know if they’ll succeed. Assignments that are too easy (certain success) or too hard (certain failure) produce flat responses. For career pivoters, Module 5 (Factor Models) is a known difficulty spike — the finance theory is unfamiliar. Consider front-loading the finance concepts students need (the “For career pivoters” notes are already in the right spirit) so the challenge is the ML application, not the domain vocabulary. The goal: students should feel “I might be able to do this” — not “I definitely can” or “I definitely can’t.” → Machulla (2026), Fiorillo, Tobler & Schultz (2003)
Three AI iterations before milestone submission
For milestones M2-M4 (AIAS 2-3), consider requiring students to iterate with AI at least 3 times before submitting. This builds the habit of using AI as a thinking partner: first attempt → AI critique → revised attempt → AI alternative → final version with documented rationale. Produces richer AI Attribution Logs and prevents the “paste Copilot output, submit” pattern. → Means (2026, “Practice Gap”)
The Push-Back Protocol for model interpretation
When students use AI to interpret regression outputs or explain cross-validation results (Modules 2-4), there’s a risk they accept AI’s explanation without verifying it against the data. The Push-Back Protocol (demand evidence → surface assumptions → request alternatives → stress-test → synthesize) is especially valuable here: ML model interpretation is exactly the kind of task where AI sounds confident but can be wrong. Consider making one weekly assignment a structured push-back exercise. → Means (2025, “Push-Back Protocol”)
Peer review as milestone M4
The cross-team review at M4 is a strong design choice. The IKEA effect research suggests students learn as much from evaluating others’ pipelines as from building their own — exposure to different approaches calibrates their sense of quality. Consider structuring the peer review with the same rubric dimensions used for the final deliverable (signal design, model pipeline, portfolio analysis, written analysis) so students internalize the evaluation criteria before their own defense. → Norton et al. (2012), Vendrell & Johnston (2026)
Attack your assessments
Before the semester, have a confident AI user attempt each weekly assignment and the trading signal project using current AI tools. AI is already strong at generating scikit-learn pipelines and interpreting regression output. Where can AI complete the task without genuine understanding of the finance context or model assumptions? Those are the spots to add pre-AI phases or shift weight toward the oral defense. → Furze (2026)
| Course Sequence: ← BDI 513 — Data Storytelling | Next: BADM 558 — Big Data Infrastructure → |