FIN 550 - Big Data Analytics in Finance (ML I)
Program-level details: See program/curriculum.md
Status: In Development. Xing is completing Modules 3/4/8 after spring break; the overall structure is stable. Proposed MSBAi name: Predictive Analytics for Business (pending formal rename approval)
| Credits: 4 | Term: Fall 2026 (Weeks 9-16) | Instructor: Xing/Mathias |
Course Vision
Students learn core machine learning methods — regression, classification, regularization, tree-based models, and neural networks — applied to real-world prediction problems drawn primarily from financial markets and institutions. Finance provides the running context because it offers the richest, most granular, and most publicly available datasets for learning these skills. The methods are universal; the applications are financial.
This is ML I in the MSBAi sequence. FIN 550 focuses on supervised learning: building, evaluating, and selecting models for prediction. BADM 576 (ML II) extends to unsupervised learning, NLP, time series, deep learning, and deployment.
Domain perspective: Each MSBAi core course brings a distinct business lens. FIN 550 brings the finance and accounting perspective — students work with stock returns, firm financial statements, corporate bonds, mutual funds, and market text data. No prior finance knowledge is assumed; financial concepts are introduced as needed.
Prerequisites
- Python programming (from BADM 554 or equivalent)
- Statistics foundation required: Students must complete the following Coursera pre-program courses (or equivalent) before starting FIN 550:
  - Exploring and Producing Data for Business Decision Making (University of Illinois)
  - Inferential and Predictive Statistics for Business (University of Illinois)
  - These are Gies College of Business courses on Coursera covering descriptive statistics, probability, sampling, hypothesis testing, and regression
Learning Outcomes (L-C-E Framework)
Literacy (Foundational Awareness)
- L1: Understand supervised learning and explain when regression vs. classification applies
- L2: Explain how financial data (returns, firm characteristics, corporate filings) is structured and used for prediction
- L3: Recognize overfitting, describe train/test/validation splits, and explain why cross-validation matters
Competency (Applied Skills)
- C1: Build and evaluate regression models (linear, Fama-MacBeth, Lasso) using financial and accounting data
- C2: Build and evaluate classification models (logistic regression, decision trees) for business outcomes
- C3: Extract predictive features from structured data (firm characteristics) and unstructured text (corporate filings)
- C4: Apply cross-validation, regularization, and hyperparameter tuning to improve model performance
Expertise (Advanced Application)
- E1: Design and execute a multi-step empirical analysis from data to portfolio strategy to performance evaluation
- E2: Apply ensemble methods (random forest, gradient boosting) and neural networks, choosing appropriately among model types
- E3: Translate predictive model results into actionable business recommendations through an oral defense
Module-by-Module Breakdown
Each module includes asynchronous video lectures (concepts + data example), a live studio session, and either a team exercise or project milestone.
Module 1: Returns, Risk, and Event Studies (Xing)
| Component | Content |
|---|---|
| ML Skill | Hypothesis testing, statistical event analysis, abnormal return estimation |
| Finance Context | Stock returns (simple, cumulative), portfolio construction (equal-weighted, value-weighted), Sharpe ratios |
| Video — Concepts | Returns and risk fundamentals, portfolio construction, market reactions to corporate events |
| Video — Data Example | Event study on dividend increase announcements: estimating abnormal returns around event dates |
| Studio Session | Event study on stock split announcements (guided, live) |
| Team Exercise | Event study on dividend decrease announcements |
For career pivoters: No finance background assumed. Module introduces “returns” as percentage change in price and builds from there. All formulas implemented in Python — conceptual understanding matters more than memorization.
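The event-study mechanics above can be sketched in a few lines. This is a minimal market-model event study on synthetic data; the return-generating process, event date, and window lengths are illustrative assumptions, not course datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns: market factor plus firm-specific noise (illustrative).
n_days = 250
mkt = rng.normal(0.0005, 0.01, n_days)
firm = 0.0002 + 1.2 * mkt + rng.normal(0, 0.015, n_days)

event_day = 200          # hypothetical announcement date
est_win = slice(0, 150)  # estimation window for the market model

# Market model: regress firm returns on market returns in the estimation window.
beta, alpha = np.polyfit(mkt[est_win], firm[est_win], 1)

# Abnormal return = actual return minus the market-model expected return.
evt_win = slice(event_day - 2, event_day + 3)   # [-2, +2] days around the event
ar = firm[evt_win] - (alpha + beta * mkt[evt_win])
car = ar.sum()  # cumulative abnormal return over the event window
```

Students implement the same pipeline on real announcement dates; the only change is swapping the simulated series for CRSP returns.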
Module 2: Linear Regression and Firm Characteristics (Mathias)
| Component | Content |
|---|---|
| ML Skill | Linear regression, multiple regression, Fama-MacBeth cross-sectional regression, interpreting coefficients |
| Finance Context | Predicting stock returns using firm characteristics (size, market-to-book) from Compustat accounting data |
| Video — Concepts | Linear regression for prediction, Fama-MacBeth regressions, key firm characteristics and corporate variables |
| Video — Data Example | Cross-sectional stock return prediction using three firm characteristics |
| Studio Session | Hands-on regression with Compustat data |
| Team Exercise | Predict cross-sectional stock returns using multiple firm characteristics |
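The two-step Fama-MacBeth procedure can be sketched as follows on synthetic data; the characteristics (size, market-to-book) and the return-generating process are invented for illustration, not drawn from Compustat.

```python
import numpy as np

rng = np.random.default_rng(1)
n_months, n_firms = 60, 100

# Synthetic panel (illustrative only): small size and low market-to-book
# earn higher returns by construction.
size = rng.normal(0, 1, (n_months, n_firms))
mb = rng.normal(0, 1, (n_months, n_firms))
ret = -0.002 * size - 0.001 * mb + rng.normal(0, 0.05, (n_months, n_firms))

# Step 1: run a separate cross-sectional regression each month.
monthly_coefs = []
for t in range(n_months):
    X = np.column_stack([np.ones(n_firms), size[t], mb[t]])
    coefs, *_ = np.linalg.lstsq(X, ret[t], rcond=None)
    monthly_coefs.append(coefs)

# Step 2: average the monthly slopes; t-stats come from their time series.
monthly_coefs = np.array(monthly_coefs)
avg = monthly_coefs.mean(axis=0)
tstat = avg / (monthly_coefs.std(axis=0, ddof=1) / np.sqrt(n_months))
```

The time-series standard error in step 2 is what distinguishes Fama-MacBeth from a single pooled regression: it accounts for cross-sectional correlation within each month.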
Module 3: Model Evaluation and Cross-Validation (Xing)
| Component | Content |
|---|---|
| ML Skill | Train/test splits, k-fold cross-validation, leave-one-out CV, hyperparameter tuning, model selection |
| Finance Context | Evaluating prediction quality on financial data from Modules 1-2 |
| Video — Concepts | Validation-set approach, leave-one-out CV, k-fold CV, hyperparameter tuning for model selection |
| Video — Data Example 1 | Predicting housing prices using linear regression with polynomial features — comparing CV methods |
| Video — Data Example 2 | Improving stock return prediction from Module 2 using cross-validation |
| Studio Session | Cross-validation workshop: apply CV to Module 2 models, compare results |
| Team Exercise | Evaluate and improve regression models using systematic cross-validation |
Placement rationale: Cross-validation is taught in Module 3 (not Module 8) so students apply rigorous evaluation to every model they build from this point forward.
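The degree-selection exercise from the housing example might look like this minimal scikit-learn sketch; the quadratic "housing" data and the candidate polynomial degrees are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)

# Synthetic data: the target is quadratic in a single feature.
X = rng.uniform(0, 3, (200, 1))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.2, 200)

# Compare polynomial degrees by mean out-of-fold R^2 under 5-fold CV.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    d: cross_val_score(
        make_pipeline(PolynomialFeatures(d), LinearRegression()),
        X, y, cv=cv, scoring="r2",
    ).mean()
    for d in (1, 2, 5)
}
best_degree = max(scores, key=scores.get)
```

Degree 1 underfits the curvature, while high degrees gain little or lose ground out-of-fold, which is exactly the pattern CV is meant to expose.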
Module 4: Logistic Regression and Variable Selection (Xing)
| Component | Content |
|---|---|
| ML Skill | Logistic regression, stepwise selection (forward/backward), Lasso regularization, Ridge regression |
| Finance Context | Predicting corporate bond defaults, binary business outcomes |
| Video — Concepts | Variable selection methods (stepwise, Lasso), logistic regression for binary prediction, regularization |
| Video — Data Example 1 | Stock return prediction with variable selection — comparing stepwise vs. Lasso using corporate characteristics |
| Video — Data Example 2 | Corporate bond default prediction using logistic regression and Lasso with bond- and firm-level characteristics |
| Studio Session | Predicting dividend increases using logistic regression and Lasso |
| Team Exercise | Predict corporate bond defaults with an expanded set of firm- and bond-level characteristics |
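A hedged sketch of the default-prediction workflow: logistic regression with an L1 (Lasso-style) penalty on synthetic bond data. The number of features, the coefficients, and the penalty strength `C` are illustrative choices, not course specifications.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic bond data (illustrative): only the first two of ten
# characteristics actually drive the default probability.
n = 2000
X = rng.normal(0, 1, (n, 10))
logit = -2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The L1 penalty zeroes out irrelevant features (variable selection).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
n_kept = int((clf.coef_ != 0).sum())
```

Comparing `n_kept` across values of `C` previews the regularization-path discussion; stepwise selection can be run on the same data for contrast.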
Module 5: Factor Models and Portfolio Performance (Mathias)
| Component | Content |
|---|---|
| ML Skill | Multi-factor regression, performance attribution, model comparison across factor sets |
| Finance Context | CAPM, Fama-French multi-factor models, mutual fund alpha and beta, index fund comparison |
| Video — Concepts | CAPM and multi-factor models, mutual funds vs. index funds, performance attribution through alpha and beta |
| Video — Data Example | Mutual fund performance attribution — do funds earn their fees? Do returns persist? |
| Studio Session | Predicting mutual fund returns using fund characteristics (idiosyncratic volatility, fund size) |
| Team Exercise | Mutual fund alpha and beta estimation in alternative samples, including performance attribution for index funds |
For career pivoters: Factor models are introduced as “multi-variable regression where the variables are market-wide risk factors.” The finance concepts (alpha, beta, CAPM) are explained as business context, not prerequisites.
Module 6: Text Analytics for Business Data (Xing)
| Component | Content |
|---|---|
| ML Skill | Bag-of-words, n-grams, dictionary-based sentiment measurement, text-as-features for prediction |
| Finance Context | 10-K/10-Q filings, measuring company sentiment, detecting exposure to topics (tariffs, geographic risk) |
| Video — Concepts | Processing text: bag-of-words, n-grams, sentiment lexicons (Loughran-McDonald), exposure measurement |
| Video — Data Examples | Measuring company sentiment and tariff exposure from SEC filings |
| Studio Session | Hands-on text processing with financial filings |
| Team Exercise | Extract a predictive text signal from 10-K filings and use it in a regression model |
NLP boundary note: This module covers practical “text → numeric features → prediction.” Transformer models (BERT), word embeddings, and text classification as standalone NLP tasks are covered in BADM 576 (ML II, Week 3).
Module 7: Tree-Based and Ensemble Methods (Xing)
| Component | Content |
|---|---|
| ML Skill | Classification and regression trees, bagging, random forests, gradient boosting, neural networks |
| Finance Context | Housing price prediction, county-level economic data, mortgage unpaid balance prediction |
| Video — Concepts | Tree-based models, ensemble methods (bagging, random forest, boosting), intro to neural networks |
| Video — Data Example | Ames housing price prediction using tree-based models and neural networks |
| Studio Session | Predicting county-level housing prices using Census demographic and housing data |
| Team Exercise | Predict mortgage unpaid balances using Freddie Mac loan-level data with tree-based methods |
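One way to sketch the regression-versus-ensemble comparison the module builds toward; the interaction and threshold effects in the synthetic target are chosen so a linear baseline demonstrably underfits (all data and coefficients are illustrative).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)

# Synthetic housing-style data with an interaction and a threshold effect
# that a plain linear model cannot capture.
n = 1000
X = rng.uniform(0, 1, (n, 4))
y = 10 * X[:, 0] * X[:, 1] + 5 * (X[:, 2] > 0.5) + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lin = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

r2_lin = r2_score(y_te, lin.predict(X_te))
r2_rf = r2_score(y_te, rf.predict(X_te))
```

Having built the linear baseline first, students can see exactly where the ensemble's gain comes from, which echoes the pedagogical note below on cognitive friction.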
Module 8: Trading Strategies and Business Case Synthesis (Mathias + Xing)
| Component | Content |
|---|---|
| ML Skill | End-to-end empirical pipeline, model comparison, business case communication, oral defense |
| Finance Context | Trading strategies, calendar-time portfolios, momentum and value anomalies, portfolio sorting |
| Video — Concepts | Trading strategy construction, calendar-time portfolios, return anomalies |
| Studio Session | Cross-sectional return prediction with momentum and value anomalies using portfolio sorting techniques |
| Project Presentations | Team oral defense of trading signal system (see Tutorial Project below) |
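The portfolio-sorting technique from the studio session might be sketched as below; the "signal" and its predictive strength are synthetic assumptions, not a real anomaly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# Synthetic cross-section: a momentum-like signal with modest
# predictive power for next-period returns (illustrative only).
n_firms = 500
signal = rng.normal(0, 1, n_firms)
next_ret = 0.02 * signal + rng.normal(0, 0.08, n_firms)

df = pd.DataFrame({"signal": signal, "ret": next_ret})

# Sort firms into deciles on the signal, then compare average returns
# of the top and bottom deciles: the classic long-short portfolio.
df["decile"] = pd.qcut(df["signal"], 10, labels=False)
decile_ret = df.groupby("decile")["ret"].mean()
long_short = decile_ret.iloc[-1] - decile_ret.iloc[0]  # long D10, short D1
```

In the Tutorial Project, the same sort is repeated each period and the long-short series is evaluated with the factor models from Module 5.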
Tutorial Project: Trading Signal System (Team, runs alongside modules)
Throughout the course, student teams (3 members) build a trading signal system — the signature applied project for FIN 550. Teams:
- Identify a signal from a big data source (text, alternative data, financial variables)
- Predict stock returns using the signal with methods learned in the course
- Construct a dynamic portfolio strategy based on the signal
- Analyze abnormal returns for the strategy
The tutorial project integrates skills from every module. Teams present progress during studio sessions and submit the final deliverable + oral defense in Module 8.
Assessment
One major team project (the Trading Signal System) runs alongside the course, with weekly assignments building foundational skills and project milestones scaffolding toward the final deliverable. No traditional exam.
Weekly Assignments (35%)
Individual exercises tied to each module. Students apply that module’s methods to financial datasets and submit a Jupyter notebook + short write-up via GitHub.
| Assignment | Module | Description |
|---|---|---|
| A1: Event Study Analysis | 1 | Conduct an event study on dividend decrease announcements; estimate abnormal returns around event dates |
| A2: Regression Analysis | 2 | Predict cross-sectional stock returns using multiple firm characteristics from Compustat data |
| A3: Cross-Validation Workshop | 3 | Apply k-fold CV to Module 2 regression models; compare validation approaches and report metrics |
| A4: Classification Exercise | 4 | Predict corporate bond defaults using logistic regression + Lasso with firm- and bond-level characteristics |
| A5: Factor Model Analysis | 5 | Estimate mutual fund alpha and beta in alternative samples; performance attribution for index funds |
| A6: Text Feature Extraction | 6 | Extract a predictive text signal from 10-K filings and use it in a regression model |
| A7: Ensemble Methods | 7 | Predict mortgage unpaid balances using Freddie Mac data with tree-based methods; compare with regression |
Rubric (per assignment, 4 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Methodology | Correct application of module methods, justified choices | Reasonable approach with minor gaps | Flawed or missing methodology |
| Model Evaluation | Rigorous evaluation, explains metrics clearly | Evaluation applied but basic | Missing or weak evaluation |
| Code Quality | Clean, documented Jupyter notebook, reproducible | Adequate code, some comments | Messy or undocumented |
| Written Analysis | Connects methodology to findings; explains business implications | Adequate explanation | Minimal or unclear |
Project Milestones (25%)
Team milestones (team of 3) that scaffold toward the final Trading Signal System deliverable. Submitted at defined checkpoints; each milestone receives formative feedback.
| Milestone | Due | Deliverable |
|---|---|---|
| M1: Project Proposal | End of Module 2 | Signal hypothesis, data source identification, team roles, preliminary EDA |
| M2: Signal Construction + Regression Baseline | End of Module 4 | Constructed signal, baseline regression model, initial cross-validation results |
| M3: Model Expansion + Text Integration | End of Module 6 | Classification/text features added to pipeline, model comparison (structured vs. structured + text) |
| M4: Peer Review | Module 7 | Cross-team review of another team’s pipeline; written feedback on methodology, evaluation, and gaps |
Rubric (per milestone, 3 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Progress | Substantial, on-track work building on prior modules | Adequate progress with some gaps | Behind schedule or superficial |
| Technical Quality | Methods applied correctly, evaluation included | Functional but basic analysis | Errors or missing components |
| Team Collaboration | Clear evidence of shared work, complementary contributions | Adequate collaboration | Uneven contribution |
Final Project Deliverable (15%)
Complete the Trading Signal System and submit the final analysis package.
Deliverables:
- Complete trading signal pipeline: signal discovery → prediction → portfolio → abnormal returns
- Apply tree-based/ensemble methods and compare with earlier regression approaches
- 5-page business case document:
  - Signal rationale and data source
  - Prediction methodology and model comparison
  - Portfolio strategy and performance analysis
  - Limitations, risks, and out-of-sample considerations
- Peer evaluation of team contributions
- GitHub repo with all code + documentation
Rubric (4 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Signal Design | Creative, well-justified signal from interesting data source | Reasonable signal choice | Generic or unjustified |
| Model Pipeline | Multiple methods compared systematically, strong evaluation | Functional pipeline | Single method or weak evaluation |
| Portfolio Analysis | Rigorous abnormal return analysis, addresses look-ahead bias | Adequate portfolio construction | Flawed methodology |
| Written Analysis | Clear business case, honest about limitations and out-of-sample risks | Adequate explanation | Minimal or unclear writeup |
Oral Defense (20%)
Team oral defense: 15-min presentation + 10-min Q&A. Each team member must demonstrate individual understanding of the full pipeline — not just their assigned section.
Rubric (3 dimensions):
| Dimension | Excellent (A) | Proficient (B) | Developing (C) |
|---|---|---|---|
| Presentation Clarity | Confident delivery, logical structure, clear visualizations | Adequate delivery and structure | Unclear or disorganized |
| Technical Depth | Articulates trade-offs, explains model choices, addresses limitations | Demonstrates understanding of key methods | Superficial or rehearsed-only understanding |
| Q&A Handling | Handles unexpected questions well, thinks on feet | Answers most questions adequately | Struggles with questions outside prepared remarks |
Studio Participation (5%)
Weekly engagement in live studio sessions. Students work through guided exercises, discuss approaches, and share progress on project milestones.
AI Tools Integration
Modules 1-3 (Regression and Evaluation):
- Use Claude/ChatGPT to:
  - Explain finance concepts as encountered (returns, abnormal returns, factor models)
  - Debug scikit-learn and statsmodels errors
  - Interpret regression outputs and cross-validation results
  - Suggest feature engineering approaches for financial data
Modules 4-6 (Classification and Text):
- Use AI to:
  - Explain variable selection trade-offs (Lasso vs. stepwise)
  - Generate text processing code (tokenization, sentiment)
  - Debug logistic regression issues
  - Review model comparison methodology
Modules 7-8 (Ensembles and Synthesis):
- Use AI to:
  - Explain tree-based model hyperparameters
  - Draft business case structure for trading signal report
  - Review portfolio analysis methodology
  - Practice Q&A scenarios for oral defense
AI Attribution: Students document all AI tool usage in project submissions — which tools, what prompts, how outputs were modified. See design/assessment_strategy.md for attribution log template.
Assessment Summary
| Component | Weight | Notes |
|---|---|---|
| Weekly Assignments (7 assignments) | 35% | Individual, one per module |
| Project Milestones (4 milestones) | 25% | Team of 3, scaffolded checkpoints |
| Final Project Deliverable | 15% | Team of 3, complete trading signal system |
| Oral Defense | 20% | Team of 3, 15-min presentation + 10-min Q&A |
| Studio Participation | 5% | Weekly engagement |
No traditional exam. Weekly assignments build skills; the team project integrates them. Oral defense verifies understanding.
AI Usage Levels (AIAS)
| Assessment | AIAS Level | AI Permitted |
|---|---|---|
| Weekly Assignments | 2 | AI for debugging, interpreting outputs, finance concept explanation — with attribution |
| Project Milestones | 2 | AI for code assistance, data exploration — with attribution |
| Final Project Deliverable | 3 | AI as collaborator for business case drafting and model comparison — with full disclosure |
| Oral Defense | 0 | No AI |
| Studio Participation | 1 | AI for concept exploration during exercises |
Technology Stack
- ML Libraries: scikit-learn, statsmodels, XGBoost, LightGBM
- Data: pandas, numpy
- Text Processing: nltk, scikit-learn (CountVectorizer, TfidfVectorizer)
- Visualization: matplotlib, seaborn
- Financial Data: WRDS/Compustat, CRSP, yfinance, SEC EDGAR
- IDE: VS Code with GitHub Copilot; Google Colab (browser alternative)
- Notebooks: Jupyter Notebooks (via Colab or VS Code)
- Version Control: GitHub (all projects published)
Pedagogical Notes for Faculty
Design suggestions grounded in program research — not requirements. Adapt to your course and teaching style. Full references in reference/articles/.
The scenic route (cognitive friction)
FIN 550 is where the tension between AI efficiency and learning depth is sharpest. Students can use Copilot to generate a random forest in seconds — but if they haven’t first built a baseline regression by hand (Modules 1-3), the ensemble result has no prediction to surprise them. The dopamine gap research shows the brain learns through prediction errors: the gap between what you expected and what happened. A student who struggles with linear regression before seeing how gradient boosting improves on it learns more than one who skips straight to XGBoost. The AIAS progression (0→1→2→3 across modules) already scaffolds this; the key is framing the early manual work as the investment that makes the later AI-assisted work register. → Machulla (2026), Schultz et al. (1997)
The IKEA effect (completion matters)
The Trading Signal System runs across all 8 modules — this is the longest sustained project in the program. The IKEA effect research shows that labor leads to love only when it leads to completion. Each milestone (M1→M4) should feel like a working thing: a testable hypothesis, a running model, a pipeline that produces output. The oral defense in Module 8 is the ultimate completion signal. For career pivoters with no finance background, the moment they can explain their trading signal system to a panel is transformative — it’s when “I’m not a finance person” becomes “I built this.” → Norton, Mochon & Ariely (2012)
Variable uncertainty and calibrated difficulty
The dopamine system is most engaged at ~50% uncertainty — when the student genuinely doesn’t know if they’ll succeed. Assignments that are too easy (certain success) or too hard (certain failure) produce flat responses. For career pivoters, Module 5 (Factor Models) is a known difficulty spike — the finance theory is unfamiliar. Consider front-loading the finance concepts students need (the “For career pivoters” notes are already in the right spirit) so the challenge is the ML application, not the domain vocabulary. The goal: students should feel “I might be able to do this” — not “I definitely can” or “I definitely can’t.” → Machulla (2026), Fiorillo, Tobler & Schultz (2003)
Three AI iterations before milestone submission
For milestones M2-M4 (AIAS 2-3), consider requiring students to iterate with AI at least 3 times before submitting. This builds the habit of using AI as a thinking partner: first attempt → AI critique → revised attempt → AI alternative → final version with documented rationale. Produces richer AI Attribution Logs and prevents the “paste Copilot output, submit” pattern. → Means (2026, “Practice Gap”)
The Push-Back Protocol for model interpretation
When students use AI to interpret regression outputs or explain cross-validation results (Modules 2-4), there’s a risk they accept AI’s explanation without verifying it against the data. The Push-Back Protocol (demand evidence → surface assumptions → request alternatives → stress-test → synthesize) is especially valuable here: ML model interpretation is exactly the kind of task where AI sounds confident but can be wrong. Consider making one weekly assignment a structured push-back exercise. → Means (2025, “Push-Back Protocol”)
Peer review as milestone M4
The cross-team review at M4 is a strong design choice. The IKEA effect research suggests students learn as much from evaluating others’ pipelines as from building their own — exposure to different approaches calibrates their sense of quality. Consider structuring the peer review with the same rubric dimensions used for the final deliverable (signal design, model pipeline, portfolio analysis, written analysis) so students internalize the evaluation criteria before their own defense. → Norton et al. (2012), Vendrell & Johnston (2026)
Attack your assessments
Before the semester, have a confident AI user attempt each weekly assignment and the trading signal project using current AI tools. AI is already strong at generating scikit-learn pipelines and interpreting regression output. Where can AI complete the task without genuine understanding of the finance context or model assumptions? Those are the spots to add pre-AI phases or shift weight toward the oral defense. → Furze (2026)
| Course Sequence: ← BDI 513 — Data Storytelling | Next: BADM 558 — Big Data Infrastructure → |