
DATA 510: Data Science Capstone
May 11, 2026
By the end of this session, you will be able to:
Course connection: This supports your project proposal (clear question, data, and planned analysis) and the Data-Driven Scrum habit of separating open-ended backlog items from testable backlog items.
EDA (exploratory data analysis)
The art of discovering what is in the data: distributions, joins gone wrong, outliers, drift, and surprises you did not know to ask about yet.
Research
Disciplined inquiry: you articulate a problem, connect it to theory or prior work, and commit to analyses that can support or refute a claim about the world (or about a model of the world).
EDA is an iterative process for uncovering patterns, anomalies, and structure before you fully trust downstream modeling or inference.
If your only plan is “run XGBoost and see what happens,” you still need EDA to learn whether labels, leakage, and joins make that plan meaningful.
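A quick join check is one way to test whether labels and joins hold up before any modeling. The sketch below uses two invented tables (`customers` and `labels`, keyed on a hypothetical `customer_id`); none of these names come from a course dataset.

```python
import pandas as pd

# Hypothetical tables: one row per customer, and label rows that should be one per customer.
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
labels = pd.DataFrame({"customer_id": [1, 2, 2, 5], "churned": [0, 1, 1, 0]})

# A left join keeps every customer; indicator=True records where each row matched.
joined = customers.merge(labels, on="customer_id", how="left", indicator=True)

# Customers with no label, and rows silently added by duplicate label keys.
n_unmatched = int((joined["_merge"] == "left_only").sum())
n_extra_rows = len(joined) - len(customers)
print(f"customers without a label: {n_unmatched}")
print(f"rows added by duplicate label keys: {n_extra_rows}")
```

Either count being nonzero is a reason to pause and ask how the label table was built before fitting anything.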
You open a new client extract. Nobody documented the schema. What is the most honest first move?
A. Write the final thesis paragraph so stakeholders feel progress.
B. Fit a large neural net and tune until validation loss looks good.
C. Profile tables, keys, missingness, and a few sanity plots before locking a research question.
D. Pick the strongest correlation and build the deck around it.
Answer: C. EDA is how you earn the right to ask sharp research questions. A and B skip the reality check; D is how you accidentally narrate noise.
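A first profiling pass in the spirit of option C might look like the sketch below. The toy extract, its column names (`customer_id`, `signup_date`, `plan`), and the `profile` helper are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy extract; the columns and values are invented for illustration.
extract = pd.DataFrame(
    {
        "customer_id": [101, 102, 102, 104, 105],
        "signup_date": ["2025-01-03", "2025-01-09", "2025-01-09", None, "2025-02-11"],
        "plan": ["basic", "pro", "pro", "basic", None],
    }
)


def profile(df: pd.DataFrame, key: str) -> dict:
    """Return basic facts to check before locking a research question."""
    return {
        "n_rows": len(df),
        "n_duplicate_keys": int(df[key].duplicated().sum()),
        "missing_share": df.isna().mean().round(2).to_dict(),
    }


report = profile(extract, key="customer_id")
print(report)
```

A real extract would also need type checks and a few sanity plots; this only covers row counts, key uniqueness, and missingness.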
EDA helps you:
EDA is not “fishing until p is small.” It is sense-making and risk reduction for everything that follows.
Which outcome is most aligned with healthy EDA for a capstone?
A. A single polished chart for the poster background.
B. A short list of data-quality issues with severity and next checks.
C. Twelve significance stars from pairwise t-tests on every column.
D. A trained model with no written definition of the target.
Answer: B. Artifacts that change what you measure, split, or fix are the point. C and D optimize the wrong scoreboard; A alone skips the hard lessons.

A research question is a focused inquiry that guides analysis:
If your question is “understand customers,” that is a mood, not a question. Push for outcome, population, comparison, and time window.
They:
Weak questions produce weak reviews from faculty and industry. Strong questions surface bad news about the data early, which is cheaper than bad news at the poster session.
| Type | Core move | Example (tightened) |
|---|---|---|
| Descriptive | Summarize a phenomenon | What is the distribution of capstone project hours logged per week in DATA 510? |
| Exploratory | Map structure when theory is thin | Which usage features cluster together among users who later churn? |
| Inferential | Learn about a population from a sample | Does mean model error differ between two deployed algorithms after controlling for traffic seasonality? |
| Predictive | Forecast or classify future cases | Can we predict churn in the next 30 days from usage in the prior 90 days? |
| Causal | Support a claim about causes | Did a policy change cause a shift in completion rates, or did seasonality confound it? |
Exploratory questions are still research-shaped: they narrow what you will look at next. They are not a license to run every test on the menu.
“We want to know whether our new recommender increases click-through because of the model and not because we also changed the UI the same week.”
Which type of research question is this?
A. Descriptive
B. Predictive
C. Causal
D. Purely EDA with no question
Answer: C. You are asking about a cause-and-effect claim; that demands design, not just a leaderboard metric. (Also: expect confounding and plan for it.)
Research questions often lead to EDA (you need to operationalize constructs and check feasibility).
EDA often refines or generates research questions (the data talks back).

Together they support data-informed decisions instead of data-stamped decisions (pretty charts pasted under a foregone conclusion).
If every plot becomes a hypothesis test on the same sample, you are not doing EDA plus research. You are giving yourself multiple chances to get lucky.
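To put a rough number on "getting lucky," here is a minimal calculation. It assumes independent tests, each run at a significance level of 0.05, with no real effect anywhere; real columns are usually correlated, so treat the output as a guide and not an exact rate.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 50):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>2} tests -> {rate:.0%} chance of at least one spurious 'finding'")
```

Under these assumptions, 20 tests already give roughly a 64% chance of at least one false positive.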
Capstone habit: separate exploratory notebooks from confirmatory analysis paths, pre-register the latter when possible, and document what changed after EDA (even if it hurts).
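One way to keep the two paths apart is to set aside a confirmation subset before any plotting begins. The sketch below assumes rows are independent and identified by an ID; the function name, the 30% holdout share, and the seed are illustrative choices, and time-ordered data would call for a split by date instead.

```python
import random


def split_explore_confirm(ids, confirm_share=0.3, seed=510):
    """Shuffle IDs reproducibly and reserve a share for confirmatory analysis."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_confirm = round(len(shuffled) * confirm_share)
    # Everything after the reserved block is available for exploration.
    return shuffled[n_confirm:], shuffled[:n_confirm]


explore_ids, confirm_ids = split_explore_confirm(range(100))
print(len(explore_ids), len(confirm_ids))
```

Fixing the seed and saving the confirmation IDs makes it possible to show later that the confirmatory rows were never used during exploration.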
Predictive research question: Can we predict customer churn in the next 30 days from usage patterns in the prior 90 days?
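Operationalizing this question starts with fixing the time windows. The sketch below is one possible reading: a hypothetical cutoff date, a 90-day feature window ending at the cutoff, and a 30-day label window starting at it. The half-open intervals and the `churn_windows` helper are assumptions made for illustration.

```python
from datetime import date, timedelta


def churn_windows(cutoff: date, feature_days: int = 90, label_days: int = 30):
    """Return half-open (start, end) date windows for features and for the churn label."""
    feature_window = (cutoff - timedelta(days=feature_days), cutoff)
    label_window = (cutoff, cutoff + timedelta(days=label_days))
    return feature_window, label_window


features, label = churn_windows(date(2026, 4, 1))
print("features:", features[0], "to", features[1])
print("label:   ", label[0], "to", label[1])
```

Keeping the feature window strictly before the label window is a simple guard against leakage from the outcome period into the predictors.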
With neighbors (about 5 minutes total):
You discover the churn label is missing for the entire newest month of data. What is the best next step for the research side of the project?
A. Drop the month silently so metrics look cleaner.
B. Keep modeling; missing labels are a deployment detail.
C. Document the issue, revise timelines or scope, and agree on a label policy with stakeholders.
D. Switch to causal claims because prediction failed.
Answer: C. EDA exists partly to force these conversations early. A and B are how projects lose trust; D is the wrong genre swap.
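A label-coverage check like the following is one way EDA surfaces this kind of gap. The table and the column names `month` and `churned` are hypothetical.

```python
import pandas as pd

# Hypothetical customer-month table; None marks a churn label that was never recorded.
usage = pd.DataFrame(
    {
        "month": ["2026-01", "2026-01", "2026-02", "2026-02", "2026-03", "2026-03"],
        "churned": [0, 1, 0, 0, None, None],
    }
)

# Share of rows with a missing label in each month.
missing_by_month = usage["churned"].isna().groupby(usage["month"]).mean()
print(missing_by_month)

# Months where every label is missing cannot support supervised training or evaluation.
unlabeled_months = missing_by_month[missing_by_month == 1.0].index.tolist()
print(unlabeled_months)
```

The output is a starting point for the stakeholder conversation, not a decision about what to drop.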
Questions? Pull me aside after block one or use office hours on the syllabus.