
Mining Text Data ⚒️
DATA 351: Data Management with SQL
This lecture covers string functions, POSIX regular expressions in PostgreSQL, extracting structured fields from messy narrative text, and full-text search with tsvector, tsquery, GIN indexes, ts_headline, and ts_rank. We use crime report narratives and State of the Union speeches. Based on Chapter 14 of Practical SQL, 2nd Edition.
What You’ll Learn
Learning Objectives 🎯
By the end of this session, you will be able to:
- Apply common string functions for trimming, case conversion, measuring length, and slicing substrings
- Read basic POSIX regex notation (
\d,{n,m}, groups, alternation) and use~,~*,!~, andsubstring(... from ...) - Use
regexp_match,regexp_matches,regexp_replace, andregexp_split_to_tableto transform text - Extract structured columns from semi-structured blobs using capture groups and array indexing
- Build full-text search queries with
to_tsvector,to_tsquery, the@@match operator, and GIN indexes - Present and rank search hits with
ts_headlineandts_rank(including length normalization)
. . .
Text is not a luxury column type. It is where half your “CSV from a government website” problems live. Today we give SQL something sharper than hope.
Part 1: Environment
Create the Database and Load Data 🗄️
Create the Database and Load Data · Steps 🗄️
Create a dedicated database for this chapter using whatever method you prefer or the command above in a terminal.
Download and run the setup script mining_text.sql ↓ while connected to mining_text. The file embeds all rows with INSERT statements (dollar-quoted text), so you can run it from pgAdmin or Beekeeper without configuring CSV paths.
Optional: mining_text_csvs.zip ↓ still has the raw extracts if you want to practice \COPY separately.
Tables You Will See Today 📊
| Table | Source | Role |
|---|---|---|
county_regex_demo |
bundled in mining_text.sql |
Regex filtering practice (~, ~*) |
crime_reports |
embedded in mining_text.sql |
Parse semi-structured police narratives |
president_speeches |
embedded in mining_text.sql |
Full-text search on State of the Union addresses |
Sanity Checks ✅
Sanity Checks · Output ✅
Run these after the setup script finishes to verify the counts of the tables.
Part 2: String Functions
Why Strings Still Matter ❤️🔥
JSON gets the hype. Arrays get the conference talks. Strings pay the rent: addresses, notes, log lines, pasted emails, and whatever your stakeholder calls “structured” because it has a colon somewhere.
PostgreSQL’s string toolkit is documented in the manual under String Functions. We start with the ones you will actually type.
Casing and Cleanup 🧹
Casing and Cleanup · Output 🧹
Run these in order and watch how forgiving real-world text is not.
initcap capitalizes word starts. It does not read minds. Your acronym table lives in application code, not here.
Length, Position, Trim 📏
SELECT char_length(' Pat ');
SELECT length(' Pat '); -- same for varchar; byte length differs on some types
SELECT position(', ' in 'Tan, Bella');
SELECT trim('s' from 'socks');
SELECT trim(trailing 's' from 'socks');
SELECT trim(' Pat ');
SELECT char_length(trim(' Pat '));
SELECT ltrim('socks', 's');
SELECT rtrim('socks', 's');Length, Position, Trim · Output 📏
Run these in order.
SELECT char_length(' Pat ');
-- 5
SELECT length(' Pat '); -- same for varchar; byte length differs on some types
-- 5
SELECT position(', ' in 'Tan, Bella');
-- 4
SELECT trim('s' from 'socks');
-- ock
SELECT trim(trailing 's' from 'socks');
-- sock
SELECT trim(' Pat ');
-- Pat
SELECT char_length(trim(' Pat '));
-- 3
SELECT ltrim('socks', 's');
-- ocks
SELECT rtrim('socks', 's');
-- sockIf you are counting characters for validation rules, char_length after trim is the boring, correct path.
Slicing and Replacing ✂️
Slicing and Replacing · Output ✂️
Run these (including the replace('aaa', ...) check).
Quick check: replace('aaa', 'aa', 'b') returns ba (left-to-right, non-overlapping substitution). String replacement is greedy in a way that will eventually humble everyone.
Part 3: Regular Expressions
What is a regular expression?
A regular expression (regex) is a small pattern language for describing text shape. You hand PostgreSQL a pattern string; the engine can test a match, return the matched substring, split on delimiters, or pull out captured pieces.
PostgreSQL uses POSIX regular expressions (documented under Pattern Matching). In SQL the pattern usually lives in a single-quoted string. Backslashes are easy to get wrong: the string parser and the regex engine both care about \, so you often write '\\d' (two characters in source, one backslash in the pattern) or use dollar-quoting when patterns get busy.
Notation ♯
| Piece | Meaning |
|---|---|
. |
one character (with caveats: not always newline) |
\d |
a digit 0–9 |
\w |
a word character (letters, digits, underscore in the POSIX class) |
[0-9] |
any single digit from the list |
{n} |
repeat the previous “atom” exactly n times |
{n,m} |
repeat at least n and at most m times, so {1,2} means one or two |
+ |
one or more; * zero or more; ? optional (zero or one) |
^ $ |
start and end of the string (for these lectures, think whole-field matching unless noted) |
\| |
alternation: A\|B is A or B |
() |
grouping; also capture for regexp_match |
(?: … ) |
group without capturing (still useful for precedence) |
Example 📝
\d{1,2}/\d{1,2}/\d{2} describes a date fragment like 4/16/17: one or two digits, slash, one or two digits, slash, two digits. Slashes are literal here; we sometimes escape them as \/ in patterns for clarity or habit from other tools.
Operators You Must Know 🔣
| Operator | Meaning |
|---|---|
~ |
matches, case sensitive |
~* |
matches, case insensitive |
!~ |
does not match, case sensitive |
!~* |
does not match, case insensitive |
In WHERE clauses, regex turns a pile of LIKE patterns into one expressive filter. It also turns a simple bug into a performance mystery. Use indexes thoughtfully (later: GIN for FTS; for regex alone, sometimes a functional index, sometimes a redesign).
Substring with a Pattern 🔍
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from '.+');
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from '\d{1,2} (?:a.m.|p.m.)');
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from '^\w+');
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from '\w+.$');
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from 'May|June');
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from '\d{4}');
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from 'May \d, \d{4}');Substring with a Pattern · Output 🔍
substring(string from pattern) returns the first match. Run them one at a time.
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from '.+');
-- The game starts at 7 p.m. on May 2, 2024.
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from '\d{1,2} (?:a.m.|p.m.)');
-- 7 p.m.
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from '^\w+');
-- The
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from '\w+.$');
-- 2024.
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from 'May|June');
-- May
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from '\d{4}');
-- 2024
SELECT substring('The game starts at 7 p.m. on May 2, 2024.' from 'May \d, \d{4}');
-- May 2, 2024Notice how greedy .+ is on the first pattern. Regex is a contract between you and the parser. Read the contract.
Filtering Rows 🗺️
Filtering Rows · Output 🗺️
We use county_regex_demo instead of the full Census extract:
Run both queries. Advance once per query, then per result line.
The second query is the classic “I want Ashland but not Washington” maneuver. If your data contains NULL county names, remember that WHERE col ~ 'pattern' does not match NULL. Neither does your excuse.
Replacing and Splitting 🔀
Replacing and Splitting · Output 🔀
Run these.
SELECT regexp_replace('05/12/2024', '\d{4}', '2023');
-- 05/12/2023
SELECT regexp_split_to_table('Four,score,and,seven,years,ago', ',');
-- Four
-- score
-- and
-- seven
-- years
-- ago
SELECT regexp_split_to_array('Phil Mike Tony Steve', ' ');
-- {Phil,Mike,Tony,Steve}
SELECT array_length(regexp_split_to_array('Phil Mike Tony Steve', ' '), 1);
-- 4regexp_split_to_table is a set-returning function: it produces one row per piece. That makes it handy for normalizing multi-valued cells someone swore were “atomic.”
Part 4: From Blob to Columns
The Crime Narratives 🚔
The Crime Narratives · Output 🚔
Our crime_reports table loads only original_text. Everything else (dates, address, offense, case number) is hiding inside the blob.
Run this to open one row:
Beautiful. Readable. Terrible for GROUP BY. We will fix that with patterns, not with interns.
First Date Match 📅
Run this.
First Date Match · Output 📅
regexp_match returns the first match as a text array (or NULL).
All Date Matches 📆
Row 1 has two dates in the narrative.
. . .
Run this and compare to the previous slide.
All Date Matches · Output 📆
regexp_matches with the g flag finds every non-overlapping match.
Compare the two outputs for rows that mention more than one calendar date.
Capture Groups Clean Up Noise 📎
Run these:
Capture Groups · Output 📎
Parentheses in the pattern define capture groups. regexp_match still returns the whole match as element [1] when you wrap only part of the pattern. Run both.
SELECT crime_id,
regexp_match(original_text, '-\d{1,2}\/\d{1,2}\/\d{2}')
FROM crime_reports
ORDER BY crime_id;
-- 1 | {-4/17/17}
-- 2 |
-- 3 |
-- 4 |
-- 5 |
SELECT crime_id,
regexp_match(original_text, '-(\d{1,2}\/\d{1,2}\/\d{2})')
FROM crime_reports
ORDER BY crime_id;
-- 1 | {4/17/17}
-- 2 |
-- 3 |
-- 4 |
-- 5 |The second version drops the leading hyphen by capturing only the date digits.
Pull Several Fields at Once 🧩
Run this:
SELECT
crime_id,
regexp_match(original_text, '(?:C0|SO)[0-9]+') AS case_number,
regexp_match(original_text, '\d{1,2}\/\d{1,2}\/\d{2}') AS date_1,
regexp_match(original_text, '\n(?:\w+ \w+|\w+)\n(.*):') AS crime_type,
regexp_match(original_text, '(?:Sq.|Plz.|Dr.|Ter.|Rd.)\n(\w+ \w+|\w+)\n')
AS city
FROM crime_reports
ORDER BY crime_id;Pull Several Fields · Output 🧩
This is not pretty. It is honest.
SELECT
crime_id,
regexp_match(original_text, '(?:C0|SO)[0-9]+') AS case_number,
regexp_match(original_text, '\d{1,2}\/\d{1,2}\/\d{2}') AS date_1,
regexp_match(original_text, '\n(?:\w+ \w+|\w+)\n(.*):') AS crime_type,
regexp_match(original_text, '(?:Sq.|Plz.|Dr.|Ter.|Rd.)\n(\w+ \w+|\w+)\n')
AS city
FROM crime_reports
ORDER BY crime_id;
-- 1 | {C0170006614} | {4/16/17} | {Larceny} | {Sterling}
-- 2 | {C0170006162} | {4/8/17} | {Destruction of Property} | {Sterling}
-- 3 | {C0170006079} | {4/4/17} | {Larceny} | {Sterling}
-- 4 | {SO170006250} | {04/10/17} | {Larceny} | {Middleburg}
-- 5 | {SO170006211} | {04/09/17} | {Destruction of Property} | {Sterling}Patterns like this are why we version-control SQL. The day the sheriff’s office changes a label from hrs. to hours, you will discover the true meaning of “breaking change.”
Arrays in PostgreSQL: [1] 🔢
Run this:
Arrays in PostgreSQL · Output 🔢
regexp_match returns text[]. Grab the first element explicitly.
If the match fails, regexp_match is NULL and [1] is NULL. Plan for that when you cast to timestamps.
Optional: Populate Columns 🛠️
Optional: Populate Columns · Notes 🛠️
The textbook walks through UPDATE statements that cast extracted dates to timestamptz and fill street, city, crime_type, and description. The expressions are long but repetitive: once you trust one regexp_match, the rest are the same idea with different patterns.
Question for you: Why might EXISTS (SELECT regexp_matches(...)) appear in a CASE branch?
Answer
Because sometimes you need to know whether a second date exists before you try to parse it.
Walkthrough: Build date_1 Once ⏰
⚠️ Don’t run this (unless you are using a scratch database or copy of the crime_reports table):
CREATE temp TABLE crime_reports_copy AS SELECT * FROM crime_reports;
UPDATE crime_reports
SET date_1 =
(
(regexp_match(original_text, '\d{1,2}\/\d{1,2}\/\d{2}'))[1]
|| ' ' ||
(regexp_match(original_text, '\/\d{2}\n(\d{4})'))[1]
|| ' US/Eastern'
)::timestamptz
RETURNING crime_id, date_1, original_text;Walkthrough: Build date_1 · Output ⏰
The book example concatenates the first date in MM/DD/YY form with the first time field (four digits) and a time zone. Run once (it updates every row). Use a scratch database or ROLLBACK if you are experimenting.
UPDATE crime_reports_copy
SET date_1 =
(
(regexp_match(original_text, '\d{1,2}\/\d{1,2}\/\d{2}'))[1]
|| ' ' ||
(regexp_match(original_text, '\/\d{2}\n(\d{4})'))[1]
|| ' US/Eastern'
)::timestamptz
RETURNING crime_id, date_1, original_text;
-- 1 | 2017-04-17 01:00:00+00 | ...
-- 2 | 2017-04-08 20:00:00+00 | ...
-- 3 | 2017-04-04 18:00:00+00 | ...
-- 4 | 2017-04-10 20:05:00+00 | ...
-- 5 | 2017-04-09 16:00:00+00 | ...Why RETURNING?
RETURNING shows you what you broke before you commit. This is not Tinder. Do not swipe right on untested regex.
If a row lacks one of the pieces, the corresponding regexp_match is NULL and the concatenation collapses in ways PostgreSQL will explain loudly. That is good. Silent failure is how you become a historian instead of a data engineer.
View the Cleaned Rows 📋
Run this:
View the Cleaned Rows · Output 📋
After a full UPDATE of all parsed columns (as in the book), a simple report query becomes possible. The UPDATE sets date_1 only (other parsed columns stay NULL until you run the full UPDATE block that sets all parsed columns).
Compare mentally to the blob you started with. That difference is the whole lecture in one SELECT.
Part 5: Full-Text Search
Pipeline: From Speech to Search 🔄
Raw text enters, tokens come out, queries meet documents at @@. The GIN index sits on the tsvector column so the middle of this diagram does not require a sequential read of every presidency.
pg_ts_config 🔤
Run this:
pg_ts_config · Output 🔤
Full-text search is not LIKE '%tax%'. It tokenizes, stems, drops noise words, and gives you operators that behave well at scale.
Check which language configurations exist.
to_tsvector and @@ 🔤
SELECT to_tsvector('english', 'I am walking across the sitting room to sit with you.');
SELECT to_tsquery('english', 'walking & sitting');
SELECT to_tsvector('english', 'I am walking across the sitting room')
@@ to_tsquery('english', 'walking & sitting');
SELECT to_tsvector('english', 'I am walking across the sitting room')
@@ to_tsquery('english', 'walking & running');to_tsvector and @@ · Output 🔤
Convert document and query:
Run these.
SELECT to_tsvector('english', 'I am walking across the sitting room to sit with you.');
-- 'across':4 'room':7 'sit':6,9 'walk':3
SELECT to_tsquery('english', 'walking & sitting');
-- 'walk' & 'sit'
SELECT to_tsvector('english', 'I am walking across the sitting room')
@@ to_tsquery('english', 'walking & sitting');
-- t
SELECT to_tsvector('english', 'I am walking across the sitting room')
@@ to_tsquery('english', 'walking & running');
-- f@@ answers: does this document match this query?
Operators Inside to_tsquery ⚙️
&AND|OR!NOT<->adjacent terms (phrase)<N>distance (terms within N positions)
We will lean on the State of the Union table. Our setup script already adds search_speech_text and a GIN index.
Index for Search 🗂️
Index for Search · Notes 🗂️
GIN is the usual choice for tsvector columns. Without it, you are scanning speeches the way I scroll my inbox: slowly and with regret.
Find Speeches 🔎
Run this:
Find Speeches · Output 🔎
Snippets with ts_headline 📰
Run this:
Snippets with ts_headline · Output 📰
SELECT president,
speech_date,
ts_headline(
speech_text,
to_tsquery('english', 'tax'),
'StartSel = <,
StopSel = >,
MinWords=5,
MaxWords=7,
MaxFragments=1'
)
FROM president_speeches
WHERE search_speech_text @@ to_tsquery('english', 'tax')
ORDER BY speech_date;
-- Harry S. Truman | 1946-01-21 | ... <tax> POLICY ...
-- Harry S. Truman | 1947-01-06 | excise <tax> rates which, under the present
-- Harry S. Truman | 1948-01-07 | point in our <tax> structure. ...
-- ... (many more rows mention tax)That angle-bracket markup is for HTML consumers. In slides, it still reads clearly as “here is the hit in context.”
AND / NOT and Phrase Proximity 🎛️
Run each query:
SELECT president,
speech_date,
ts_headline(
speech_text,
to_tsquery('english', 'transportation & !roads'),
'StartSel = <, StopSel = >, MinWords=5, MaxWords=7, MaxFragments=1'
)
FROM president_speeches
WHERE search_speech_text @@ to_tsquery('english', 'transportation & !roads')
ORDER BY speech_date;
SELECT president,
speech_date,
ts_headline(
speech_text,
to_tsquery('english', 'military <-> defense'),
'StartSel = <, StopSel = >, MinWords=5, MaxWords=7, MaxFragments=1'
)
FROM president_speeches
WHERE search_speech_text @@ to_tsquery('english', 'military <-> defense')
ORDER BY speech_date;AND / NOT and Phrase Proximity · Output 🎛️
SELECT president,
speech_date,
ts_headline(
speech_text,
to_tsquery('english', 'transportation & !roads'),
'StartSel = <, StopSel = >, MinWords=5, MaxWords=7, MaxFragments=1'
)
FROM president_speeches
WHERE search_speech_text @@ to_tsquery('english', 'transportation & !roads')
ORDER BY speech_date;
-- Harry S. Truman | 1947-01-06 | Mr. President, Mr. Speaker, Members
-- Harry S. Truman | 1949-01-05 | Mr. President, Mr. Speaker, Members
SELECT president,
speech_date,
ts_headline(
speech_text,
to_tsquery('english', 'military <-> defense'),
'StartSel = <, StopSel = >, MinWords=5, MaxWords=7, MaxFragments=1'
)
FROM president_speeches
WHERE search_speech_text @@ to_tsquery('english', 'military <-> defense')
ORDER BY speech_date;
-- Dwight D. Eisenhower | 1956-01-05 | To the Congress of the
-- Dwight D. Eisenhower | 1958-01-09 | Mr. President, Mr. Speaker, Members
-- Dwight D. Eisenhower | 1959-01-09 | Mr. President, Mr. Speaker, MembersTry <2> instead of <-> in a copy of the query to allow a small gap between words. Language is messy. PostgreSQL knows.
Ranking 🏅
Run both and compare ordering.
SELECT president,
speech_date,
ts_rank(
search_speech_text,
to_tsquery('english', 'war & security & threat & enemy')
) AS score
FROM president_speeches
WHERE search_speech_text @@
to_tsquery('english', 'war & security & threat & enemy')
ORDER BY score DESC
LIMIT 5;
SELECT president,
speech_date,
ts_rank(
search_speech_text,
to_tsquery('english', 'war & security & threat & enemy'),
2
)::numeric AS score
FROM president_speeches
WHERE search_speech_text @@
to_tsquery('english', 'war & security & threat & enemy')
ORDER BY score DESC
LIMIT 5;Ranking · Output 🏅
SELECT president,
speech_date,
ts_rank(
search_speech_text,
to_tsquery('english', 'war & security & threat & enemy')
) AS score
FROM president_speeches
WHERE search_speech_text @@
to_tsquery('english', 'war & security & threat & enemy')
ORDER BY score DESC
LIMIT 5;
-- William J. Clinton | 1997-02-04 | 0.35810584
-- George W. Bush | 2004-01-20 | 0.29587495
-- George W. Bush | 2003-01-28 | 0.28381455
-- Harry S. Truman | 1946-01-21 | 0.25752166
-- William J. Clinton | 2000-01-27 | 0.22214262
SELECT president,
speech_date,
ts_rank(
search_speech_text,
to_tsquery('english', 'war & security & threat & enemy'),
2
)::numeric AS score
FROM president_speeches
WHERE search_speech_text @@
to_tsquery('english', 'war & security & threat & enemy')
ORDER BY score DESC
LIMIT 5;
-- George W. Bush | 2004-01-20 | 0.000103
-- William J. Clinton | 1997-02-04 | 0.000098
-- George W. Bush | 2003-01-28 | 0.000096
-- Jimmy Carter | 1979-01-23 | 0.000090
-- Lyndon B. Johnson | 1968-01-17 | 0.000073Long speeches accumulate token hits. Normalization option 2 divides rank by document length.
Compare the ordering before and after normalization. This is the difference between “mentions the keywords a lot” and “mentions them a lot relative to length.”
The book points to a live demo at anthonydebarros.com/sotu. Worth clicking after class.
Part 6: Practice and Pitfalls
Try It: Crime Case Numbers 🧪
Write a query that returns crime_id and the case number extracted from original_text using the same (?:C0|SO)[0-9]+ pattern, but filter to rows where the case starts with SO.
Stretch: Add a column that counts how many distinct dates appear in the narrative using regexp_matches with the g flag and a subquery or lateral trick. (If you get stuck, move on. The point is to think in matches, not to win a Nobel prize in SQL golf.)
Try It: Speeches 🎤
Find speeches where the query economy & jobs matches, show ts_headline, and sort by normalized rank (ts_rank with the length divisor). Pick a president you have opinions about. Keep the opinions out of the query.
Common Footguns (Collected) ⚠️
- Forgetting that
regexp_matchreturnsNULLwhen nothing matches, then indexing[1]in a cast chain - Using
LIKEpatterns with regex operators by mistake (they are not interchangeable) - Building huge
ORchains when a single well-factored regex would do - Skipping the GIN index on
tsvectorcolumns and wondering why search “works on my laptop”
Summary
What We Covered 📚
| Topic | Core functions and ideas |
|---|---|
| Strings | upper, lower, initcap, trim, left/right, replace, char_length |
| Regex filters | ~, ~*, !~, substring, regexp_replace, splits |
| Structured extraction | regexp_match, regexp_matches, capture groups, [1] indexing |
| Full-text search | to_tsvector, to_tsquery, @@, ts_headline, ts_rank |
| Performance | GIN index on tsvector |
The One-Sentence Version 💡
Treat text like data: measure it, constrain it, extract what you need, and index what you search. Everything else is a spreadsheet someone emailed you as prose.
References 🔗
- DeBarros, A. (2022). Practical SQL (2nd ed.). No Starch Press. Chapter 14.
- PostgreSQL: Pattern Matching
- PostgreSQL: String Functions
- PostgreSQL: Regular Expressions
- PostgreSQL: Full Text Search
- PostgreSQL: ts_rank