
DATA 351: Data Management with SQL
April 8, 2026
You already know SQL is for data you control. Much of the world’s data still lives in HTML pages. For structured extraction at human scale (assignments, prototypes, one-off exports), a point-and-click scraper inside the browser keeps you close to the page and avoids writing a full crawler on day one.
This lecture uses Web Scraper (Chromium extension from the Chrome Web Store). It is not the only option, but it matches our goals: sitemap, selector tree, preview, CSV export.
By the end of class, you will be able to install Web Scraper, open it inside Developer Tools, build a sitemap with Link and Text selectors, run a scrape, and export the rows to CSV.
The steps and screenshots in this deck follow the official Web Scraper guide.
Keep that page open in a tab while you work. Figures such as the news site example (selector tree with a Link selector and a child Text selector) are shown there in full.
Install Web Scraper from the Chrome Web Store. It works in Chromium-based browsers that allow extensions.
After installation, restart the browser (or use the tool only in tabs opened after installation) so the extension loads cleanly. Requirements are listed under Installation 2.
Web Scraper lives inside Developer Tools, not only as a toolbar icon.
Shortcuts:
| Platform | Open Developer Tools |
|---|---|
| Windows / Linux | Ctrl+Shift+I or F12 |
| macOS | Cmd+Opt+I |
Then open the Web Scraper tab inside the tools panel. The Open Web Scraper page includes a figure for Chrome 3.
Web Scraper tab in Chrome Developer Tools (documentation figure).
A sitemap is your recipe for one scrape job. The first setting is the start URL: the page where crawling begins. You can add multiple start URLs (for example, several search queries) using the + control next to the URL field. After creation, start URLs also appear under Edit metadata in the sitemap menu 1.
Sitemap editor: start URL field and + to add more start URLs (documentation figure).
When page URLs contain a number, you can replace that segment with a range instead of listing every page by hand 1:
| Pattern | Example meaning |
|---|---|
| [1-100] | Pages 1 through 100 |
| [001-100] | Zero-padded, e.g. 001, 002, … |
| [0-100:10] | Step by 10: 0, 10, 20, … |
Examples from the docs:
- https://example.com/page/[1-3] yields /page/1, /page/2, /page/3
- https://example.com/page/[001-100] matches three-digit paths
- https://example.com/page/[0-100:10] yields every tenth value

Selectors are organized in a tree. The extension runs them in tree order: parent selectors run first, then children on the pages those parents open 1.
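Range patterns are easy to fence-post, so it can help to expand one and eyeball the URLs before scraping. The sketch below is my own reading of the documented [start-end:step] behavior, not part of the extension; the function name and the zero-padding rule are assumptions.

```python
import re

def expand_range_url(url):
    """Expand one [start-end] or [start-end:step] segment into concrete URLs.
    Zero-padded bounds such as [001-100] keep their width."""
    m = re.search(r"\[(\d+)-(\d+)(?::(\d+))?\]", url)
    if not m:
        return [url]  # no range segment: the URL is already concrete
    start, end, step = m.group(1), m.group(2), m.group(3)
    # Treat a leading zero in the lower bound as a request for fixed width.
    width = len(start) if start.startswith("0") and len(start) > 1 else 0
    return [
        url[: m.start()] + str(n).zfill(width) + url[m.end():]
        for n in range(int(start), int(end) + 1, int(step) if step else 1)
    ]

print(expand_range_url("https://example.com/page/[1-3]"))
# ['https://example.com/page/1', 'https://example.com/page/2', 'https://example.com/page/3']
```

If the printed list is not what you expect, fix the pattern before you hand it to the extension.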
Classic pattern from the documentation: a Link selector on the listing page opens each item page, with child Text selectors extracting fields from the opened page.
Use Element preview and Data preview as you build each selector so you know the CSS selector matches real nodes.


The docs recommend being comfortable with at least the Link selector and the Text selector 4, 5. Link selectors follow href values and pass child selectors to the destination page. If clicking a result does not change the URL (heavy AJAX), read the Pagination selector documentation instead of forcing a Link selector 4.





When the sitemap is ready, open the Scrape panel and start the job. A popup window loads pages and extracts rows. When it finishes, the popup closes and you get a completion notice 1.

Two knobs matter for fragile sites: the request interval (time between page loads) and the page load delay (how long the scraper waits before running selectors on a loaded page). Raise both for slow or rate-limited sites.
After scraping, spot-check the extracted rows in the extension's data view, then export them as CSV from the sitemap menu.
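Since this is a SQL course, the natural destination for that export is a database. Below is a minimal sketch using only the Python standard library; the file name releases.csv, the column names, and the two sample rows are placeholders standing in for your real export (the CSV's columns are named after your selector ids).

```python
import csv
import sqlite3

# Placeholder stand-in for the Web Scraper export; in class you would skip
# this step and use the CSV you downloaded from the extension.
with open("releases.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["title", "artist", "have"])
    w.writerow(["Album A", "Artist X", "120"])
    w.writerow(["Album B", "Artist Y", "450"])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE releases (title TEXT, artist TEXT, have INTEGER)")

# Load the CSV; cast the have count to int so ORDER BY sorts numerically.
with open("releases.csv", newline="", encoding="utf-8") as f:
    rows = [(r["title"], r["artist"], int(r["have"])) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO releases VALUES (?, ?, ?)", rows)

# The scrape is now just another table you can query.
top = conn.execute(
    "SELECT title FROM releases ORDER BY have DESC LIMIT 1"
).fetchone()
print(top[0])  # Album B
```

The cast on have matters: left as text, "450" would sort before "99".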
Goal: build one table where each row is a release you care about. Rows should combine listing fields from the search results (title, artist) with statistics from each release's detail page (for example the have count).
Discogs lays these out in HTML that can change; use Element preview and Data preview on a real release page to lock selectors.
Start from at least five search result pages for releases, sorted by community have count (descending):
https://www.discogs.com/search?sort=have%2Cdesc&type=release
Pagination: Discogs search uses a page query parameter. For five pages, a range start URL matches the documentation pattern 1:
https://www.discogs.com/search?sort=have%2Cdesc&type=release&page=[1-5]
If your browser copies a URL variant with a different path prefix, keep the query string identical aside from page=[1-5]. You can instead add five separate start URLs (page=1 … page=5) using the + control next to the URL field.
Discogs is a real commercial community, so be polite for class: keep the scrape to the five search pages, leave generous delays between requests, and avoid re-running the job more than you need.
Primary path for class (search, then each release page):
1. Start URLs: the search results with page=[1-5] as above (or five explicit page= URLs).
2. Link selector: select each result's /release/ URL. This opens the detail page for each album. Name it clearly (for example release_link).
3. Text selectors: on the release page, add children of the Link selector for the fields you want.

Optional warm-up (listing only): before you add the Link selector, you can add Text selectors on the search page scoped to each card to capture title and artist from the snippet. Those columns are redundant with the detail page for some fields, but they help verify selectors before you crawl deeper.
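For orientation, an exported sitemap for this plan looks roughly like the JSON below. The field names are from memory of the export format and the CSS selectors are placeholders, so treat this as a sketch to compare against a sitemap you export yourself, not something to import verbatim.

```json
{
  "_id": "discogs-releases",
  "startUrl": ["https://www.discogs.com/search?sort=have%2Cdesc&type=release&page=[1-5]"],
  "selectors": [
    {
      "id": "release_link",
      "type": "SelectorLink",
      "parentSelectors": ["_root"],
      "selector": "a[href*='/release/']",
      "multiple": true
    },
    {
      "id": "title",
      "type": "SelectorText",
      "parentSelectors": ["release_link"],
      "selector": "h1",
      "multiple": false
    }
  ]
}
```

Note how parentSelectors encodes the tree: title runs on the pages that release_link opens, not on the search page.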
Tips:
- Release links look like https://www.discogs.com/release/... . Avoid confusing them with master or artist links if multiple anchors appear in the same card; restrict the Link selector with a CSS filter or parent scope so you only enqueue releases.
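One way to express that restriction in the Link selector's selector field, assuming the result anchors carry /release/ in their href (the selector is illustrative; verify against the live markup):

```css
/* matches only anchors whose href contains /release/ */
a[href*="/release/"]
```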
Recap: start from the five search pages; add a Link selector that opens /release/ pages; then add Text children for detail statistics (and optional listing fields). Verify with Element preview / Data preview on both search and release views.

Embedded screenshots are the same figures used on Scraping a site 1, Open Web Scraper 3, Link selector 4, and Text selector 5. Copies are stored under data351/assets/images/webscraper-docs/ for this deck.
