Advanced Query Techniques

DATA 503: Fundamentals of Data Engineering

Lucas P. Cordova, Ph.D.

Willamette University

April 8, 2026

Learning objectives

What you should be able to do

Use scalar and derived-table subqueries for filtering and reshaping aggregates
Choose IN vs EXISTS / NOT EXISTS for membership tests and anti-joins
Write LATERAL joins for per-row computations and top-N patterns
Refactor multi-step logic with CTEs (WITH) and optional CASE classification

Course connection: These patterns show up in analytical SQL, incremental pipelines, and any time you need correct, readable transformations over relational data.

Restore the practice database

Create a Database

Create a database called class_dbs using the createdb command in a terminal OR using any method you usually use to create a new database.

createdb -h localhost -U postgres class_dbs

Download the files from Canvas

You will get one of these (not both required):

File	Format	Tool
`class_dbs.sql`	Plain SQL script	`psql` with `-f`
`class_dbs.dump`	Custom `pg_dump -Fc` archive	`pg_restore`

Save the file somewhere easy to find. The examples below assume it landed in your Downloads folder.

💻 Local PostgreSQL: plain SQL (`psql`)

Open a terminal, go to the folder that contains class_dbs.sql, then run psql against the class_dbs database you created above.

cd ~/Downloads
psql -h localhost -U postgres -d class_dbs -f class_dbs.sql

💻 Local PostgreSQL: custom archive (`pg_restore`)

The class_dbs database must exist before restore (same as the SQL path above).

cd ~/Downloads
pg_restore -h localhost -U postgres -d class_dbs --verbose class_dbs.dump

🌐 Railway: same files, database `railway`

Replace host and port with the values from your Railway Postgres service (Variables / Connect). The database name is often railway.

Plain SQL:

cd ~/Downloads
psql "postgresql://postgres:YOUR_PASSWORD@YOUR_HOST:YOUR_PORT/railway" -f class_dbs.sql

Custom dump:

cd ~/Downloads
pg_restore "postgresql://postgres:YOUR_PASSWORD@YOUR_HOST:YOUR_PORT/railway" --verbose class_dbs.dump

Subqueries: WHERE and FROM

Scalar subquery in `WHERE`

A scalar subquery returns one value. The database can evaluate it once, then use it like a constant in the outer query.

Run this. Counties at or above the national 90th percentile for pop_est_2019:

SELECT county_name, state_name, pop_est_2019
FROM us_counties_pop_est_2019
WHERE pop_est_2019 >= (
    SELECT percentile_cont(0.9) WITHIN GROUP (ORDER BY pop_est_2019)
    FROM us_counties_pop_est_2019
)
ORDER BY pop_est_2019 DESC;

This subquery is uncorrelated: it does not reference the outer row. It typically runs once, then the outer query filters.

Derived table in `FROM`

A subquery in FROM builds a derived table. PostgreSQL requires an alias (AS ...).

SELECT round(calcs.avg_pop, 0) AS avg_pop,
       calcs.median_pop,
       round(calcs.avg_pop - calcs.median_pop, 0) AS mean_minus_median
FROM (
    SELECT avg(pop_est_2019) AS avg_pop,
           percentile_cont(0.5)
               WITHIN GROUP (ORDER BY pop_est_2019)::numeric AS median_pop
    FROM us_counties_pop_est_2019
) AS calcs;

Use this when you need to aggregate first, then treat the result as a table you can select from or join.

Checkpoint: above-average counties

Task: Return every county whose pop_est_2019 is strictly greater than the national average.

Required columns: county_name, state_name, pop_est_2019

Requirements:

Use a scalar subquery in WHERE that computes avg(pop_est_2019) over all counties
Order by pop_est_2019 descending

Expected shape (first rows, illustrative):

county_name	state_name	pop_est_2019
Los Angeles County	California	10039107
Cook County	Illinois	5150233
Harris County	Texas	4713320

Hint: Compare pop_est_2019 to (SELECT avg(pop_est_2019) FROM us_counties_pop_est_2019).

Solution: above-average counties

SELECT county_name, state_name, pop_est_2019
FROM us_counties_pop_est_2019
WHERE pop_est_2019 > (
    SELECT avg(pop_est_2019)
    FROM us_counties_pop_est_2019
)
ORDER BY pop_est_2019 DESC;

Why fewer than half the rows?

`IN` and `EXISTS`

`IN (subquery)`

IN tests membership against the set returned by the subquery.

SELECT first_name, last_name
FROM employees
WHERE emp_id IN (
    SELECT id FROM retirees
)
ORDER BY emp_id;

Readable when the subquery is small and uncorrelated.

`EXISTS` and `NOT EXISTS`

EXISTS returns true if the subquery returns at least one row. The subquery can reference outer columns; that makes it correlated (conceptually: evaluated per outer row, though the planner may rewrite).

SELECT first_name, last_name
FROM employees AS e
WHERE EXISTS (
    SELECT 1
    FROM retirees AS r
    WHERE r.id = e.emp_id
);

NOT EXISTS is the usual pattern for anti-joins: rows in A with no match in B. Prefer NOT EXISTS over NOT IN when the subquery can produce NULLs, because NOT IN with NULLs in the list behaves badly.

Checkpoint: active employees

Task: List employees who are not in retirees.

Required columns: first_name, last_name

Requirements: Use NOT EXISTS with a correlated subquery matching retirees.id to employees.emp_id. Order by last_name, first_name

Expected shape:

first_name	last_name
Samuel	Cole
Arthur	Pappas
…	…

Hint: SELECT 1 inside EXISTS is enough; you only care whether a row appears.

Solution: active employees

SELECT first_name, last_name
FROM employees AS e
WHERE NOT EXISTS (
    SELECT 1
    FROM retirees AS r
    WHERE r.id = e.emp_id
)
ORDER BY last_name, first_name;

`LATERAL` joins

Top-N per group

LATERAL lets a subquery in FROM reference columns from earlier tables. Classic use: for each parent row, run a query and keep a few rows.

For each teacher, return their two most recent lab accesses.

SELECT t.first_name, t.last_name, a.access_time, a.lab_name
FROM teachers AS t
LEFT JOIN LATERAL (
    SELECT *
    FROM teachers_lab_access
    WHERE teacher_id = t.id
    ORDER BY access_time DESC
    LIMIT 2
) AS a ON true
ORDER BY t.id;

LEFT JOIN LATERAL keeps teachers with zero visits (NULLs in a.*). ON true satisfies join syntax; filtering is inside the lateral subquery.

Checkpoint: one visit per teacher

Task: Same as the demo, but return only each teacher’s single most recent lab access.

Required columns:

first_name
last_name
access_time
lab_name

Requirements:

LEFT JOIN LATERAL with ORDER BY access_time DESC and LIMIT 1

Expected shape (illustrative):

first_name	last_name	access_time	lab_name
Janet	Smith	2022-12-21 …	Chemistry A
Lee	Reynolds	…	…

Hint: Change one number from the demo query.

Solution: one visit per teacher

SELECT t.first_name, t.last_name, a.access_time, a.lab_name
FROM teachers AS t
LEFT JOIN LATERAL (
    SELECT *
    FROM teachers_lab_access
    WHERE teacher_id = t.id
    ORDER BY access_time DESC
    LIMIT 1
) AS a ON true
ORDER BY t.id;

Review: Common Table Expressions (`WITH`)

Why CTEs

A CTE names an intermediate result. Use them when you:

chain several logical steps
reuse the same expression instead of repeating a subquery
want readers (including future you) to follow the story top to bottom

WITH big_counties AS (
    SELECT county_name, state_name, pop_est_2019
    FROM us_counties_pop_est_2019
    WHERE pop_est_2019 >= 100000
)
SELECT state_name, count(*) AS num_big_counties
FROM big_counties
GROUP BY state_name
ORDER BY num_big_counties DESC;

Checkpoint: state totals with two CTEs

Task: Build a state-level summary (table: us_counties_pop_est_2019)

Required columns:

state_name
total_pop (sum of pop_est_2019)
num_counties (count of rows per state)
avg_pop_per_county (total_pop / num_counties, rounded to whole numbers is fine)

Requirements:

CTE 1: aggregate population by state_name
CTE 2: aggregate county counts by state_name
Join the two CTEs on state_name
Order by avg_pop_per_county descending

Expected shape (first rows):

state_name	total_pop	num_counties	avg_pop_per_county
California	…	…	…
Texas	…	…	…

Hint: Same GROUP BY state_name in both CTEs; join keys match exactly.

Solution: state totals with two CTEs

WITH state_pop AS (
    SELECT state_name, sum(pop_est_2019) AS total_pop
    FROM us_counties_pop_est_2019
    GROUP BY state_name
),
state_counties AS (
    SELECT state_name, count(*) AS num_counties
    FROM us_counties_pop_est_2019
    GROUP BY state_name
)
SELECT sp.state_name,
       sp.total_pop,
       sc.num_counties,
       round(sp.total_pop::numeric / sc.num_counties, 0) AS avg_pop_per_county
FROM state_pop AS sp
JOIN state_counties AS sc USING (state_name)
ORDER BY avg_pop_per_county DESC;

Crosstab

Enable and pivot

Reporting often needs long to wide layout. PostgreSQL provides crosstab() in the tablefunc extension.

CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
    $$
    SELECT office, flavor, count(*) AS response_count
    FROM ice_cream_survey
    GROUP BY office, flavor
    ORDER BY office
    $$,
    $$ SELECT flavor
       FROM ice_cream_survey
       GROUP BY flavor
       ORDER BY flavor
    $$
) AS ct(
    office text,
    chocolate bigint,
    strawberry bigint,
    vanilla bigint
);

Contract: the first query must expose three columns: row id, category, value. The second query lists category labels. The outer AS (...) names output columns in that order.

`CASE` and classification

Pattern

CASE evaluates first match wins. Put the most specific conditions first when ranges overlap.

SELECT max_temp,
       CASE
           WHEN max_temp >= 90 THEN 'Hot'
           WHEN max_temp >= 70 THEN 'Warm'
           WHEN max_temp >= 50 THEN 'Pleasant'
           ELSE 'Cool or missing'
       END AS temp_band
FROM temperature_readings
LIMIT 5;

Checkpoint: population tiers with a CTE

Task: Classify each county, then summarize. (table: us_counties_pop_est_2019)

Required columns (final result):

tier
num_counties
total_pop
avg_county_pop

Requirements:

Use a CTE that adds tier with CASE
Outer query: GROUP BY tier with aggregates

Tier rules on pop_est_2019:

Metro: >= 500000
Urban: 100000 to 499999
Suburban: 50000 to 99999
Rural: below 50000

Expected shape:

tier	num_counties	total_pop	avg_county_pop
Rural	…	…	…
Metro	…	…	…

Hint: CASE should mirror the order above so a county falls into exactly one tier.

Solution: population tiers

WITH county_tiers AS (
    SELECT county_name,
           state_name,
           pop_est_2019,
           CASE
               WHEN pop_est_2019 >= 500000 THEN 'Metro'
               WHEN pop_est_2019 >= 100000 THEN 'Urban'
               WHEN pop_est_2019 >= 50000 THEN 'Suburban'
               ELSE 'Rural'
           END AS tier
    FROM us_counties_pop_est_2019
)
SELECT tier,
       count(*) AS num_counties,
       sum(pop_est_2019) AS total_pop,
       round(avg(pop_est_2019), 0) AS avg_county_pop
FROM county_tiers
GROUP BY tier
ORDER BY total_pop DESC;

Typical story: many counties are rural, but a small number of metro counties hold a large share of total population.

Summary

Techniques in one place

Technique	Use it when
Scalar subquery in `WHERE`	Compare rows to one computed threshold
Derived table in `FROM`	Aggregate, then select or join the aggregate
`IN`	Simple membership against a set
`EXISTS` / `NOT EXISTS`	Correlated checks; anti-joins; nullable-safe negation
`LATERAL`	Per-row subquery, top-N per group
CTE (`WITH`)	Multiple named steps, less repetition
`crosstab()`	Pivot reporting shapes
`CASE`	Bucket values before or during aggregation

Choosing a pattern

References

DeBarros, A. (2022). Practical SQL (2nd ed.). No Starch Press. Chapter 13.
PostgreSQL: WITH Queries (CTEs)
PostgreSQL: Subquery Expressions
PostgreSQL: LATERAL
PostgreSQL: tablefunc

Advanced Query Techniques

Learning objectives

What you should be able to do

Restore the practice database

Create a Database

Download the files from Canvas

💻 Local PostgreSQL: plain SQL (psql)

💻 Local PostgreSQL: custom archive (pg_restore)

🌐 Railway: same files, database railway

Subqueries: WHERE and FROM

Scalar subquery in WHERE

Derived table in FROM

Checkpoint: above-average counties

Solution: above-average counties

IN and EXISTS

IN (subquery)

EXISTS and NOT EXISTS

Checkpoint: active employees

Solution: active employees

LATERAL joins

Top-N per group

Checkpoint: one visit per teacher

Solution: one visit per teacher

Review: Common Table Expressions (WITH)

Why CTEs

Checkpoint: state totals with two CTEs

Solution: state totals with two CTEs

Crosstab

Enable and pivot

CASE and classification

Pattern

Checkpoint: population tiers with a CTE

Solution: population tiers

Summary

Techniques in one place

Choosing a pattern

References

💻 Local PostgreSQL: plain SQL (`psql`)

💻 Local PostgreSQL: custom archive (`pg_restore`)

🌐 Railway: same files, database `railway`

Scalar subquery in `WHERE`

Derived table in `FROM`

`IN` and `EXISTS`

`IN (subquery)`

`EXISTS` and `NOT EXISTS`

`LATERAL` joins

Review: Common Table Expressions (`WITH`)

`CASE` and classification