
DATA 510: Data Science Capstone
May 18, 2026
By the end of this session, you will be able to:
This session supports your project proposal and the communication quality expected at milestones and industry panel review.
Over the term you will propose, plan, and execute a real data science project that:
Research methods are how you keep the project from becoming “a notebook that ran models.” They are how you justify why the work matters and what would count as success.
Week 2 starts a recurring thread on research methods and technical writing. Today: overview plus research questions. In later weeks, we’ll go deeper into study design, writing, and evaluation.
Inspired by teaching-oriented work on data science as a discipline (Hicks and Irizarry, 2018; Reddy, 2021; Rosier, 2022):
Your write-up should read like applied data science research: motivated problem, clear questions, appropriate methods, honest limits.
| Pillar | What Reviewers Look For In Your Framing |
|---|---|
| Data engineering | Can you obtain, document, version, and reproduce the data pipeline? |
| Visualization & communication | Can non-experts understand stakes and findings? |
| Machine learning / analytics | Are models or analyses matched to the question? |
| Statistics & design | Do claims match evidence type (predictive vs causal, etc.)? |
| Ethics | Who is affected, what could go wrong, what mitigations exist? |
Which statement best matches a capstone-grade research stance?
A. Maximize test-set accuracy; the story writes itself.
B. Pick a dataset from Kaggle first, then hunt for a question.
C. Start from a stakeholder problem, then articulate what evidence would change decisions.
D. Avoid stating limits so the poster looks confident.
Answer: C. The data source and question still need instructor approval, but problem-first framing is the habit we are building.
Use PRIDE to connect a consequential problem to questions you can defend this semester:
| Step | Prompt | Capstone Output |
|---|---|---|
| P Problem & impact | Who is affected? What decision or outcome improves? | Problem statement, stakeholders |
| R Review & gap | What is known? What is unknown and matters? | Short related-work anchor |
| I Inquiry | What are your primary and secondary research questions? | Numbered RQs in proposal |
| D Data & ethics | What data exist? What harms, bias, or consent issues appear? | Data plan + ethics notes |
| E Evidence plan | What analysis, baseline, and metrics answer each RQ? | Methods / evaluation plan |

Students often collapse these. Keep them distinct:
Bad sign: your “question” is only “we will use random forest.” That is a task. Ask what you want to learn or decide.
| Type | You Are Trying To… | Capstone Example |
|---|---|---|
| Descriptive | Summarize | What is the weekly pattern of missing sensor readings? |
| Exploratory | Map structure when theory is thin | Which customer segments co-occur with late payments? |
| Inferential | Generalize from sample to population | Does mean error differ between two models after seasonality controls? |
| Predictive | Forecast or classify new cases | Can we predict 30-day churn from 90-day usage? |
| Causal | Support a cause claim | Did the policy cause the change, or did seasonality confound it? |
Exploratory work is legitimate early; your proposal should still converge on questions sharp enough to finish.
A strong RQ is:
FINER mnemonic (adapted from clinical research): Feasible, Interesting, Novel (for your context), Ethical, Relevant.
Weak: “Analyze customer data to find insights.”
Which revision is most capstone-ready?
A. Build a dashboard of all customer columns.
B. Among B2B accounts in 2024, can we predict 60-day churn from product usage in the prior 90 days, and does a seasonal naive baseline beat our model?
C. Use deep learning because it is modern.
D. Prove our product causes retention.
Answer: B. It names the population, horizon, task, and a baseline. D is a causal claim and likely overclaims without a study design that supports it.
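The baseline comparison in option B can be made concrete with a small sketch. It assumes churn is tracked as a quarterly count, so that "seasonal naive" means repeating the value from the same quarter one year earlier. All numbers, the `model_forecast` values, and the helper names are made up for illustration and do not come from any real dataset or from the course materials.

```python
def seasonal_naive(history, season_length, horizon):
    """Forecast by repeating the values observed in the most recent full season."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative, invented numbers: churned accounts per quarter over two years.
history = [12, 18, 15, 25, 13, 19, 16, 27]
actual_next = [14, 20, 15, 28]      # the year being forecast (invented)
model_forecast = [15, 19, 17, 26]   # output of a hypothetical model (invented)

baseline_forecast = seasonal_naive(history, season_length=4, horizon=4)
baseline_mae = mean_absolute_error(actual_next, baseline_forecast)
model_mae = mean_absolute_error(actual_next, model_forecast)

print("baseline MAE:", baseline_mae)
print("model MAE:", model_mae)
print("model beats baseline:", model_mae < baseline_mae)
```

In this toy case the baseline has the lower error, which is exactly the outcome a well-posed research question lets you detect and report.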
Canvas Page: Week 2 Paper Activity (groups, papers, links).
Read on your own first:
Then as a group, discuss and write up:
| Group | Members |
|---|---|
| 1 | Rohan, Luca, Sophia, Tiffany |
| 2 | Addison, Manish, Aaron, Summer |
| 3 | Sarah, Aiyana, Jon, Emery |
| 4 | Ben, Jackson, Dylan, Amaya |
| 5 | Spencer, Mary, Seira, Emily |
| 6 | Dane, Mike, Brooke, Brandon |
| 7 | Bradley, Siera, Simon, Serenna |
| 8 | Shanti, Alex, Courtney |
Download your paper from the link on the Canvas activity page or below.
| Group | Domain | Paper |
|---|---|---|
| 1 | Healthcare Finance | Predicting High-Cost Patients (PLOS ONE) |
| 2 | Public Hospital Operations | AI For Bed Regulation, Brazil (PLOS ONE) |
| 3 | Emergency Medical Services | Pre-Hospital Transport Prediction, Qatar (PLOS ONE) |
| 4 | Social Science / NLP | Climate Discourse On Twitter (PLOS ONE) |
| 5 | Algorithmic Fairness / Policy | Fairness Dynamics Of Targeted Job-Seeker Help (Sci. Rep.) |
| 6 | Environmental Risk | Daily Wildfire Expansion Rate (MDPI Fire) |
| 7 | Computer Vision / UAV | UAV Forest-Fire Surveillance (PLOS ONE) |
| 8 | NLP Methods | PEFT Vs Full Fine-Tuning, Multilingual News (PLOS ONE) |
Submit one file per group on Canvas (PDF or Word). Include all group members’ names.
| Criterion | Excellent (3) | Adequate (2) | Needs Work (1) |
|---|---|---|---|
| Motivation Summary | Clear stakes, stakeholders, and gap tied to the paper | Mostly clear, minor gaps | Vague or off-paper |
| Research Questions | Accurate RQs/aims from intro; quoted or carefully paraphrased | Minor inaccuracies | Missing or invented |
| Evaluation | Uses PRIDE/FINER-style criteria with specific evidence | Generic praise/critique | Superficial |
| Critique | Concrete revisions for capstone scale | Some specifics | Only opinion |
| Professionalism | Concise, cited, one voice | Minor issues | Sloppy or incomplete |
One representative gives:
Four groups present today. Group numbers are drawn at random (not volunteer-only).
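One way such a draw could be made auditable is sketched below. It assumes a seeded `random.Random` is acceptable for the selection step: `secrets.SystemRandom` reads from the operating system's entropy source and cannot be seeded, so a draw made with it cannot be replayed. Here the seed comes from `os.urandom(16)` and the selection is replayed from the printed hex string. The function name `draw_groups` and its parameters are illustrative, not part of the course materials.

```python
import os
import random


def draw_groups(seed_hex, n_groups=8, n_presenting=4):
    """Return n_presenting distinct group numbers, ascending, determined by seed_hex."""
    # Seeding with the integer value of the hex string makes the draw replayable.
    rng = random.Random(int(seed_hex, 16))
    return sorted(rng.sample(range(1, n_groups + 1), n_presenting))


# The seed is taken from the OS entropy source and printed before the draw.
seed_hex = os.urandom(16).hex()
print("seed:", seed_hex)
print("presenting groups:", draw_groups(seed_hex))
```

Calling `draw_groups` again with the printed seed returns the same four numbers, so anyone can check the result after the fact.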
Select 4 distinct group numbers from 1 through 8. List them in ascending order.
Use cryptographically secure randomness (Python secrets.SystemRandom, or openssl rand).
Print a hex seed from os.urandom(16) before the draw. Draw once; repeat with an independent
seed as a check. If the two sets differ, keep the first draw. Show minimal Python 3 code that
reproduces the result from the printed seed.
Langenberger, Schulte, and Groene (2023), PLOS ONE (healthcare claims, Germany).
Synopsis: Predict which insured members will become high-cost next year using routine sickness-fund claims, comparing random forest, gradient boosting, neural nets, and logistic regression for care management and prevention.
Stated Objective / Questions:
RegulaRN Leitos Gerais platform (Brazil), PLOS ONE (47k+ regulation records).
Synopsis: Train and compare ML classifiers on state hospital bed regulation data to support regulators predicting patient outcomes (discharge vs death) and reduce subjectivity in bed assignment.
Stated Aims:
HMCAS ambulance service (Qatar), PLOS ONE (~93k emergency calls).
Synopsis: Use ML on pre-hospital EMS data to predict whether a patient must be transported vs treated on scene, supporting resource allocation in a multinational population.
Stated Objective:
Shyrokykh, Girnyk, and Dellmuth (2023), PLOS ONE (UN agency tweets).
Synopsis: Compare lexicon, traditional ML, and deep learning for classifying whether tweets are about climate change, in a social-science setting with small labeled data and class imbalance.
Stated Contributions (Implicit Questions):
Long-term fairness dynamics, Scientific Reports (simulation + synthetic labor market).
Synopsis: Model how a public employment service’s targeted help, driven by predictions that may use a protected attribute, affects long-term fairness between groups.
Stated Purposes / Questions:
Shmuel and Heifetz (2023), MDPI Fire (global daily fire observations).
Synopsis: Apply XGBoost, random forest, MLP, and logistic regression to predict daily wildfire growth from weather, topography, and fuels; also classify whether growth rate will increase or decrease the next day.
Stated Problem (Implicit Questions):
Shamta and Demir (2024), PLOS ONE (UAV + edge AI).
Synopsis: Build a UAV surveillance system with YOLOv8/v5 and CNN-RCNN models for early forest-fire detection, Jetson Nano onboard inference, and a ground-station interface for coordinates and images.
Stated Contributions (Engineering Aims):
Parameter-efficient fine-tuning study, PLOS ONE (SemEval-2023 news tasks).
Synopsis: Compare adapters, LoRA, and full fine-tuning on multilingual news classification (genre, framing, persuasion) across languages and training scenarios.
Stated Research Questions: