Extract Article Content from Captures → MDX Reading Files
Implement the following plan:
Extract Article Content from Captures → MDX Reading Files
Context
Captures store raw document.documentElement.innerHTML in the database. The goal is to extract article content (stripping nav/ads/chrome), convert to markdown, and generate an MDX file matching the existing reading format — ready for manual annotation with <Sidenote> and <Footnote> components.
Dependencies
Add to apps/api/package.json (exact versions):
@mozilla/readability:0.6.0— article extraction (Firefox Reader View algorithm)linkedom:0.18.12— lightweight server-side DOM for Readabilityturndown:7.2.2— HTML to markdown@types/turndown:5.0.6(devDependency)
Changes
1. New: apps/api/src/services/html-extractor.ts
Single function extractArticle(html, url) that:
- Parses HTML with
linkedom - Runs
@mozilla/readabilityto extract article content - Converts extracted HTML to markdown with
turndown(ATX headings, fenced code blocks, strip images) - Returns
{ title, byline, siteName, content (markdown), excerpt, url }
2. New: apps/api/src/services/mdx-generator.ts
generateReadingMdx(extraction) that builds an MDX string with:
- YAML frontmatter matching
ReadingMetatype (title, author, source, sourceUrl, slug, requiresAuth) - Slug auto-generated from title
requiresAuth: trueby default- Article markdown as body
3. Modify: apps/api/src/routes/admin/captures.ts
Add POST /:id/extract endpoint:
- Auth: instructor/staff
- Loads capture from DB, runs extraction pipeline
- Returns
{ extraction: { title, byline, siteName, excerpt, slug }, mdx } - 422 if Readability can't parse the page
4. Modify: apps/web/app/capture/client.tsx
Add to each expanded capture item:
- "Extract Article" button → calls POST endpoint
- Shows extracted MDX in a
<pre>block when ready - "Copy MDX" button (clipboard) and "Download .mdx" button (file download)
Verification
- Capture a page using the Tampermonkey script
- Visit
/capture, expand the capture, click "Extract Article" - Verify MDX output has correct frontmatter and clean markdown
- Copy/download and save to
content/spring2026/readings/ - Verify it renders at
/{semester}/readings/{slug}
If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: ~/.claude/projects/-Users-challen-www-usingandunderstanding-ai/6f16702a-bcd8-49a9-9a25-b3ad7ec8b883.jsonl
If this plan can be broken down into multiple independent tasks, consider using the TeamCreate tool to create a team and parallelize the work.
I'll start by reading the relevant existing files to understand the codebase patterns, then implement the plan.
Now I have a clear picture. Let me implement all four parts of the plan.
Now let me create the two service files and update the route and client in parallel where possible. First, the two independent service files.
Now add the extract endpoint to the captures route.
Now update the client component with the extraction UI.
Now let me verify everything compiles.
TypeScript errors in the client — need to handle the possibly undefined lookups. Let me fix those.
The issue is that inside the extractions[capture.id] && check, TypeScript doesn't narrow the record access. I'll extract a local variable.
The web errors are fixed. Now the API errors — Readability's return types use undefined for some fields. Let me fix.
All clean. Let me also run lint to make sure.
Everything passes. Here's a summary of what was implemented:
New files:
apps/api/src/services/html-extractor.ts—extractArticle(html, url)usinglinkedom+@mozilla/readability+turndownto parse HTML, extract the article, and convert to markdown (strips images/figures, ATX headings, fenced code)apps/api/src/services/mdx-generator.ts—generateReadingMdx(extraction)builds MDX with YAML frontmatter matchingReadingMeta(title, author, source, sourceUrl, slug, requiresAuth: true)
Modified files:
apps/api/package.json— added@mozilla/readability,linkedom,turndown,@types/turndownapps/api/src/routes/admin/captures.ts— addedPOST /:id/extractendpoint (instructor/staff auth, returns extraction metadata + MDX string, 422 if Readability fails)apps/web/app/capture/client.tsx— added "Extract Article" button per capture, extraction result display with title/byline, "Copy MDX" and "Download .mdx" buttons, and a<pre>block showing the generated MDX
Can we test this using the capture that I just did?
Let me check if the dev server is running, then look at the captures available.
That went to the Next.js app, not the API. Let me hit the API server directly.
The API requires auth. Let me query the database directly to find the most recent capture and test the extraction pipeline.
The New Yorker article — let me write a quick script to test the extraction pipeline on it.
The extraction works well. The article content is cleanly extracted — Ted Chiang's "ChatGPT Is a Blurry JPEG of the Web" with correct title, byline, site name, and markdown content. The frontmatter matches the ReadingMeta format.
The tail shows some related-article links that Readability included. That's typical — those could be trimmed manually when editing the MDX. Let me clean up the test file.
The extraction pipeline works. From the 1MB captured HTML of the New Yorker page:
- Title: "ChatGPT Is a Blurry JPEG of the Web"
- Byline: Ted Chiang
- Site: The New Yorker
- Slug:
chatgpt-is-a-blurry-jpeg-of-the-web - Content: 25KB of clean markdown (down from 1MB of raw HTML)
- Frontmatter: correctly formatted with
requiresAuth: true
One thing to note: Readability includes some related-article links at the bottom of the extraction. Those would need to be trimmed manually when editing the MDX, which fits the workflow since these files are meant to be a starting point for manual annotation with <Sidenote> and <Footnote> components anyway.
You can test the full UI flow at /capture — expand a capture, click "Extract Article", then use "Copy MDX" or "Download .mdx" to save it.
Can we add line length limits to the MDX files so that they break at say 80 or 120 characters?
Good idea. Let me check what the existing readings look like for line length conventions.
Existing readings don't wrap either, so this would be for readability when editing. Let me add word-wrapping to the markdown content in the MDX generator. I'll wrap at 120 characters (80 feels too narrow for prose).
Let me verify it compiles and test on the capture.
Now let me quick-test the wrapping on the actual capture.
Good wrapping overall, but 13 lines still exceed 120. Those are likely lines with markdown links. Let me check what those long lines look like.
As expected — all the long lines are markdown links (URLs can't be broken). Line 51 is an in-article link; the rest are related-article links at the bottom. That's fine — you can't break URLs. The prose itself is properly wrapped.
Now let me clean up and address your second request — running extraction automatically when a capture is expanded, rather than requiring a button click.
Now let me update the client to auto-extract when expanding a capture.
I'll rewrite the client to auto-extract on expand — fire both the content load and extraction in parallel, show the MDX directly (no "Extract Article" button), and keep the raw HTML as a collapsible detail.
All clean. Here's what changed:
-
Line wrapping: Added
wrapProse()tomdx-generator.tsthat word-wraps plain prose paragraphs at 120 characters while leaving headings, lists, blockquotes, code blocks, and link lines untouched (URLs can't be broken). -
Auto-extract on expand: Removed the "Extract Article" button. Now when you click a capture to expand it, the extraction runs immediately — you see "Extracting article..." then the MDX result with Copy/Download buttons. The raw HTML preview is no longer shown (the MDX is the primary output). Errors still display inline if Readability can't parse the page.
No, sorry. I want the MDX conversion done as soon as the capture arrives.
Ah, you want the extraction to happen server-side when the capture is first submitted. Let me find the capture submission endpoint.
Right, yes. This should all be server side. It probably makes sense to store the MDX content in a database, where we can either (1) promote it to a page for students directly or (2) download and add to the local repository.
Good call. Let me look at the schema and plan this out.
Here's my plan:
- Schema: Add
extractedMdx(text, nullable) andextractedSlug(text, nullable) topageCaptures— the MDX has the full frontmatter baked in, and the slug is useful for quick reference/promotion - Capture endpoint: Run
extractArticle+generateReadingMdxright after insert, update the row with the result (non-blocking — capture still succeeds if extraction fails) - Admin list endpoint: Return slug + whether MDX exists
- Client: Show the MDX directly when expanding (no extraction step), with Copy/Download buttons
Good. Now apply the migration and update the capture endpoint.
Now update the capture submission endpoint to run extraction on ingest.
Now update the admin captures endpoints to return the MDX data, and remove the now-unnecessary POST /:id/extract endpoint.
Now update the client to use the pre-extracted MDX directly.
Now update the server page component that passes data to the client, since the shape changed.
Clean. Now let me backfill the existing New Yorker capture so it has MDX too.
Let me write a quick backfill script to extract MDX for existing captures.
The test captures have tiny content (5-27 bytes) that can't form valid HTML. The try/catch in the capture endpoint would handle this, but the backfill script needs the same protection. Let me fix the extractor to be more defensive, and also fix the backfill.
The three test captures correctly fail (tiny invalid HTML), and the New Yorker article is backfilled. Let me clean up and verify.
All clean. Here's what changed:
Schema: Added extracted_mdx and extracted_slug columns to page_captures (migration 0019).
Capture ingest (apps/api/src/routes/captures.ts): Now runs extractArticle + generateReadingMdx immediately when a capture arrives and stores the result. If extraction fails (invalid HTML, non-article page), the capture is still saved — just without MDX.
Admin endpoints (apps/api/src/routes/admin/captures.ts): List endpoint returns hasMdx and extractedSlug; detail endpoint returns extractedMdx and extractedSlug. Removed the POST /:id/extract endpoint since extraction is no longer on-demand.
Client (apps/web/app/capture/client.tsx): Shows an "MDX" badge on captures that have extracted content. Expanding shows the MDX directly with Copy/Download buttons — no extraction button needed.
Extractor (html-extractor.ts): Added a null check for document.documentElement to handle invalid HTML gracefully.
The existing New Yorker capture has been backfilled. New captures will have MDX generated automatically on arrival.
Can we clear local captures so that I can test this?
Done — all 4 captures cleared. Ready for you to test a fresh capture.
Done! Next, can we create a script or bun command that grabs captures from the server (prod by default, but accepting dev or local flags) and writes them into the appropriate directory for further processing?
Let me understand how existing scripts connect to different environments.
Now let me check what semester is current so I know the default output directory.
Good. The script needs to:
- Connect to prod/dev/local DB
- Query captures that have
extracted_mdx - Write each to
content/spring2026/readings/{slug}.mdx
For prod/dev it needs port-forwarding + password via the k8s helpers. For local it just uses the local DATABASE_URL.
Let me also check how db-query.ts handles the local case — it doesn't, it only handles dev/prod. Let me look at how local DB connection works.
Now I have everything I need. The script will:
- Default to
prod, acceptdevorlocalas first arg - For prod/dev: use k8s port-forward + kubectl secret
- For local: use the local DATABASE_URL
- Query captures with
extracted_mdxnot null - Write each to
content/spring2026/readings/{slug}.mdx, skipping if file already exists
Now add the script to the root package.json.
Let me test it against local.
The root package doesn't have drizzle-orm/postgres-js. Let me check how other scripts that query the DB do it — they use @repo/db which wraps the connection.
The existing scripts use kubectl exec to run SQL directly — they don't use drizzle from the scripts directory. Let me follow that pattern: use kubectl exec for prod/dev and docker exec for local, just running a SQL query and parsing the output. Actually, a simpler approach: use db:query for the SQL and pipe results. But the cleanest approach is to just use raw SQL via the postgres client, since postgres (the npm package) should be available from @repo/db's dependencies.
Let me check what's available at the root level.
It's in bun's cache but not hoisted to root node_modules. The simplest approach is to use kubectl exec / docker exec with raw SQL, matching the pattern of the other scripts. That avoids needing drizzle or postgres in the scripts directory.
Wait — the MDX content has newlines, so using psql with a delimiter won't work for multiline text fields. I need a different approach. Let me use the db:query script pattern with a TypeScript file that gets run with DATABASE_URL set, or for local just run directly from the api package.
Actually, the cleanest solution: make this a script inside apps/api where @repo/db is available, and have it accept a DATABASE_URL override for remote environments. The root script can handle port-forwarding and delegate.
Let me look at how db:query runs a .ts script file with the remote DATABASE_URL — that's exactly the pattern I need.
So db:query dev script.ts sets DATABASE_URL and runs the script with bun run. I'll follow this exactly: a root-level script that handles env selection and port-forwarding, then runs the actual worker script inside apps/api.
Now create the worker script in apps/api where @repo/db is available.
Let me verify the file was written correctly.
And verify skip behavior on re-run.
Let me clean up the test file and run type checks.
All clean. Here's the new command:
bun run captures:pull [prod|dev|local]
prod(default) — port-forwards to the K8s prod database, pulls captures with extracted MDXdev— same for the dev databaselocal— uses the local Docker postgres
Each capture with MDX gets written to content/spring2026/readings/{slug}.mdx. Files that already exist are skipped, so it's safe to run repeatedly.
Nice! How is slug set?
From the article title. In apps/api/src/services/mdx-generator.ts:
function slugify(title: string): string {
return title
.toLowerCase()
.replace(/[^a-z0-9]+/g, "-")
.replace(/^-+|-+$/g, "");
}
So "ChatGPT Is a Blurry JPEG of the Web" becomes chatgpt-is-a-blurry-jpeg-of-the-web.
Can we use the URL somehow instead, or maybe add a URL hash to avoid title collisions?
Most article URLs already have a good slug in the path — e.g. newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web → chatgpt-is-a-blurry-jpeg-of-the-web. We could use the last path segment of the URL, falling back to the title if it's not usable (e.g. just an ID or empty).
Good idea — title + domain + short hash gives readable, unique slugs. E.g. chatgpt-is-a-blurry-jpeg-of-the-web-newyorker-a3f2.
Let me quick-test what the slug looks like now.
chatgpt-is-a-blurry-jpeg-of-the-web-newyorker-ed2e — title + domain + 4-char hash.
Note that existing captures in the DB still have the old slug format. You'll want to re-capture or I can update the existing one. The slug in the frontmatter inside extracted_mdx will also be stale. Want me to clear the local captures again so you can re-test fresh?
Yes.
Done — cleared. Ready for a fresh capture.
Can we put the capture script at the top of the Captures Page?
Let me find the Tampermonkey script first.
I see — the profile page already has the capture token/script section. Let me check what script is served at the .user.js endpoint.
The script is dynamically generated with the user's token baked in. It's served via an authenticated endpoint, so I can't just inline it statically. But I can show the "Install Capture Script" button and setup instructions on the captures page too, pulling from the same token data.
Let me check how the captures page server component works and what data it already fetches.
I need to also fetch the capture token data and pass it along. Let me check what the profile page does to get apiUrl.
Now I have the full picture. I'll fetch the capture token in the captures page and pass apiUrl + token to the client, which will show the install link at the top.
Now update the client to accept and display the capture script install info.
Now add the script install section at the top of the page, before the captures list. I'll add it in both the empty and non-empty states.
All clean. The captures page now shows an "Install Capture Script" link at the top (with a Tampermonkey link) when the user has a capture token. It appears on both the empty and populated states of the page. The profile page keeps its existing capture section too.
How do I grab the article from the local DB?
bun run captures:pull local
Nice! Next, can you clean up the new MDX file? Move the date to the frontmatter, use a subtitle, remove the parts past the end.
I see the structure. Let me also check the ReadingMeta type to see if date and subtitle are supported fields.
No date or subtitle fields yet. Let me add them to the type, then clean up the file.
Now clean up the MDX file: move date to frontmatter, add subtitle, trim everything after the article ends (line 225 with the ♦).
Now trim everything after the article end (the ♦ marker and all the related article links).
Now update the MDX generator to also extract subtitle and date into the frontmatter automatically for future captures.
The excerpt from Readability is the subtitle (the deck/subhead). Now let me update the MDX generator to emit subtitle and try to extract a date from the beginning of the content.
All clean. Here's what changed:
ReadingMetatype: Added optionalsubtitleanddatefields- MDX file cleaned up: Subtitle moved to frontmatter, date converted to
2023-02-09in frontmatter, related article links at the end removed - MDX generator updated for future captures:
- Detects the excerpt at the start of content and moves it to
subtitlein frontmatter - Detects date lines like "February 9, 2023" in the first few paragraphs and converts to
YYYY-MM-DDin frontmatter, removing the line from the body
- Detects the excerpt at the start of content and moves it to
The related-article junk at the end will still need manual trimming (it varies too much by site to reliably auto-detect), but the subtitle and date are now handled automatically.
No, sorry. I don't think that we should make these changes to the generator. We'll need to do this manually latre.
Got it — I'll revert the generator changes and keep the manual edits to the MDX file.
Reverted. The generator stays simple (title, author, source, sourceUrl, slug, requiresAuth). The subtitle and date fields remain in the ReadingMeta type for manual use, and the hand-edits to the Ted Chiang file are kept.
Your task
First, run git status, git diff, and git log --oneline -10 to understand the current state.
Then add and commit all modified files with a descriptive message. Break the changes into multiple commits if appropriate. If there are untracked files, ask if they should be added or added to the appropriate .gitignore file. The user may also want to take over at this point, so provide an abort option as well. Always check changes for unnecessary debugging statements or logging that might have been accidentally included. Prompt the user if you find any, and offer to remove each one by one. Do not use interactive git staging (git add -p) as it does not work.
Good. No debug statements anywhere. The changes group naturally into three commits:
- Capture pipeline — extraction service, MDX generator, schema, ingest endpoint, admin endpoints
- Capture UI — web client + page updates
- Captures pull script + reading file
3 commits:
- Extract article content from captures into MDX on ingest — services, schema migration, API changes, userscript improvements
- Update captures UI — pre-extracted MDX display, install script link
- Add captures:pull script and Ted Chiang reading — the pull command and the cleaned-up reading file
OK, next New Yorker article; process similarly.
Let me clear the local captures so you can do a fresh capture.
Ready — capture the next article and let me know when it's done.
OK, next New Yorker article; process similarly.
No captures yet. Go ahead and capture the article — I'll pull it once it's in.
It's in content. I already pulled it into the filesystem.
Now I can see the structure. Same cleanup as the Ted Chiang piece: move the subtitle to frontmatter, add the date, and trim the ♦.
Let me verify the result looks clean.
Clean. James Somers, "The Case That A.I. Is Thinking" — subtitle added, date set to 2025-11-10, trailing ♦ removed, no related-article junk at the end (Readability did a cleaner job on this one).
Your task
First, run git status, git diff, and git log --oneline -10 to understand the current state.
Then add and commit all modified files with a descriptive message. Break the changes into multiple commits if appropriate. If there are untracked files, ask if they should be added or added to the appropriate .gitignore file. The user may also want to take over at this point, so provide an abort option as well. Always check changes for unnecessary debugging statements or logging that might have been accidentally included. Prompt the user if you find any, and offer to remove each one by one. Do not use interactive git staging (git add -p) as it does not work.
Just the one new file, no debug statements to worry about.
Done.
OK! Now, the two new articles are both more accessible than McLuhan, and so probably don't need the same level of footnoting or sidenoting. But perhaps a bit would be useful? Or adding links out to relevant topics or people mentioned?
Let me explore the existing annotation patterns in the readings to understand what's available.
Now let me read both articles to understand what kinds of annotations would be most useful.
Good context. Now let me read the existing annotated reading to calibrate the right density level.
This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.
Analysis: Let me go through the conversation chronologically to capture all important details.
-
Initial Plan Implementation: User asked to implement a plan for extracting article content from captures into MDX reading files. The plan had 4 parts:
- New service: html-extractor.ts (linkedom + readability + turndown)
- New service: mdx-generator.ts (generates MDX with frontmatter)
- New endpoint: POST /:id/extract on admin captures route
- Updated UI: capture client with extract button, MDX preview, copy/download
-
Dependencies added: @mozilla/readability 0.6.0, linkedom 0.18.12, turndown 7.2.2, @types/turndown 5.0.6
-
Testing the extraction: Successfully tested on a captured New Yorker article (Ted Chiang's "ChatGPT Is a Blurry JPEG of the Web") - 1MB HTML → 25KB clean markdown
-
Line wrapping: User asked for line length limits. Chose 120 chars after asking. Added wrapProse() and wordWrap() to mdx-generator.ts. Lines with URLs can still exceed 120 (can't break URLs).
-
Auto-extract on expand: User first asked to run MDX conversion immediately (not require button click). I updated client to auto-extract on expand. Then user clarified they wanted extraction done server-side on capture arrival.
-
Server-side extraction on ingest:
- Added extracted_mdx and extracted_slug columns to page_captures schema
- Generated migration 0019
- Modified capture POST endpoint to run extraction immediately on arrival
- Removed POST /:id/extract endpoint
- Updated admin endpoints to return MDX data
- Updated client to show pre-extracted MDX
-
Slug generation: Originally just slugified title. User asked to use URL somehow or add hash. After discussion, settled on: title + domain + 4-char URL hash. Example:
chatgpt-is-a-blurry-jpeg-of-the-web-newyorker-ed2e -
Capture script on captures page: Added the Tampermonkey install link to the top of the /capture page, fetching capture token data.
-
captures:pull script: Created
bun run captures:pull [prod|dev|local]command. Root script handles port-forwarding (scripts/captures-pull.ts), worker script in apps/api/src/scripts/pull-captures.ts queries DB and writes MDX files to content/spring2026/readings/. -
MDX cleanup: Manually cleaned up the Ted Chiang article - moved date to frontmatter, added subtitle, removed related article junk at end. Added subtitle and date fields to ReadingMeta type.
-
User rejected auto-cleanup in generator: User explicitly said NOT to add subtitle/date extraction to the generator. I reverted those generator changes. The generator stays simple.
-
Three commits made: Pipeline, UI, pull script+reading file.
-
Second article: Processed James Somers "The Case That A.I. Is Thinking" - same cleanup pattern (subtitle, date, remove ♦). Committed.
-
Current work (plan mode): User asked about adding light annotations (sidenotes, footnotes, links) to the two new articles. They noted these are more accessible than McLuhan so don't need as heavy annotation. I explored the existing annotation patterns (Sidenote with id + sn: link syntax, Footnote with ^[] syntax) and was reading both articles to plan annotations when the summary was requested.
Key errors encountered:
- TS errors: Object possibly undefined when accessing Record by key - fixed with IIFE pattern
- Readability return types use undefined not null - fixed with ?? null
- linkedom crashes on tiny/invalid HTML - fixed with documentElement null check
- Can't import @repo/db from root scripts/ directory - solved by splitting into root script (port-forward) + worker script in apps/api/
- User rejected auto-extraction of subtitle/date in generator - reverted
Important files created/modified:
- apps/api/src/services/html-extractor.ts (new)
- apps/api/src/services/mdx-generator.ts (new)
- apps/api/src/routes/captures.ts (modified - extraction on ingest)
- apps/api/src/routes/admin/captures.ts (modified - return MDX data, removed extract endpoint)
- apps/web/app/capture/client.tsx (modified - show pre-extracted MDX, install link)
- apps/web/app/capture/page.tsx (modified - fetch capture token)
- packages/db/src/schema/page-captures.ts (modified - new columns)
- packages/types/src/content.ts (modified - subtitle, date on ReadingMeta)
- scripts/captures-pull.ts (new)
- apps/api/src/scripts/pull-captures.ts (new)
- content/spring2026/readings/chatgpt-is-a-blurry-jpeg-of-the-web-newyorker-ed2e.mdx (new)
- content/spring2026/readings/the-case-that-a-i-is-thinking-newyorker-d419.mdx (new)
Summary:
-
Primary Request and Intent: The user asked to implement a capture-to-MDX pipeline for their course website. The full workflow: capture web pages via Tampermonkey → extract article content server-side on ingest → store extracted MDX in database → pull MDX files to local filesystem → manually clean up frontmatter (subtitle, date, trim junk) → optionally add sidenotes/footnotes/links. The user iteratively refined requirements: extraction should happen on capture arrival (not on-demand), slugs should include title+domain+hash, the generator should stay simple (no auto subtitle/date extraction), and a CLI script should pull captures from prod/dev/local databases.
-
Key Technical Concepts:
@mozilla/readability(Firefox Reader View algorithm) for article extractionlinkedomfor lightweight server-side DOM parsingturndownfor HTML→markdown conversion- Drizzle ORM schema migrations (never hand-write, use
bun run db:generate) - MDX frontmatter matching
ReadingMetatype (title, subtitle, date, author, source, sourceUrl, slug, requiresAuth) - Sidenote component:
<Sidenote id="x">content</Sidenote>+[linked text](sn:x)syntax - Footnote component:
^[footnote content]syntax - K8s port-forwarding pattern for remote DB access (scripts/lib/k8s.ts)
- Word-wrapping prose at 120 chars while preserving headings, lists, code blocks, and URLs
-
Files and Code Sections:
-
apps/api/src/services/html-extractor.ts (new)
- Core extraction: linkedom parse → Readability → turndown (strips images/figures)
- Returns
{ title, byline, siteName, content (markdown), excerpt, url } - Has null check for
document.documentElementto handle invalid HTML
export function extractArticle(html: string, url: string): ArticleExtraction | null { const { document } = parseHTML(html); if (!document.documentElement) return null; // ... readability + turndown pipeline } -
apps/api/src/services/mdx-generator.ts (new)
- Generates MDX with YAML frontmatter + word-wrapped prose body
- Slug = slugify(title) + extractDomain(url) + shortHash(url, 4 chars)
- wrapProse() splits on double-newlines, skips headings/lists/code/links, wraps prose at 120 chars
export function generateReadingMdx(extraction: ArticleExtraction): { slug: string; mdx: string }- User explicitly rejected adding auto subtitle/date extraction here — generator stays simple
-
apps/api/src/routes/captures.ts (modified)
- POST handler now runs extractArticle + generateReadingMdx immediately after parsing body
- Stores extractedMdx and extractedSlug in DB; capture still saved if extraction fails (try/catch)
let extractedMdx: string | null = null; let extractedSlug: string | null = null; try { const extraction = extractArticle(body.content, body.url); if (extraction) { const result = generateReadingMdx(extraction); extractedMdx = result.mdx; extractedSlug = result.slug; } } catch { /* still save raw capture */ } -
apps/api/src/routes/admin/captures.ts (modified)
- List endpoint returns
hasMdx(boolean) andextractedSlug - Detail endpoint returns
extractedMdxandextractedSlug - Removed the
POST /:id/extracton-demand endpoint (no longer needed) - Removed imports of html-extractor and mdx-generator
- List endpoint returns
-
apps/web/app/capture/client.tsx (modified)
- Shows "MDX" badge on list items with extracted content
- Expanded view shows extractedMdx directly with "Copy MDX" and "Download .mdx" buttons
- Shows "Install Capture Script" link at top (requires capture token)
- No more extraction button or raw HTML preview
-
apps/web/app/capture/page.tsx (modified)
- Fetches capture token data via serverApiFetch
- Passes captureToken and apiUrl to CaptureClient
-
packages/db/src/schema/page-captures.ts (modified)
- Added
extractedMdx: text("extracted_mdx")andextractedSlug: text("extracted_slug")columns - Migration 0019_productive_lizard.sql generated and applied
- Added
-
packages/types/src/content.ts (modified)
- Added
subtitle?: stringanddate?: stringto ReadingMeta interface
- Added
-
scripts/captures-pull.ts (new)
- Root-level script handling env selection and port-forwarding
- Delegates to worker script in apps/api via
bun run - Usage:
bun run captures:pull [prod|dev|local](default: prod)
-
apps/api/src/scripts/pull-captures.ts (new)
- Worker script that queries DB for captures with extractedMdx
- Writes each to content/spring2026/readings/{slug}.mdx, skips existing files
-
content/spring2026/readings/chatgpt-is-a-blurry-jpeg-of-the-web-newyorker-ed2e.mdx (new, manually cleaned)
- Ted Chiang, New Yorker, 2023-02-09
- Subtitle added, date moved to frontmatter, ♦ and related article links trimmed
-
content/spring2026/readings/the-case-that-a-i-is-thinking-newyorker-d419.mdx (new, manually cleaned)
- James Somers, New Yorker, 2025-11-10
- Subtitle added, date added, ♦ trimmed (no related article junk on this one)
-
apps/api/package.json (modified)
- Added:
@mozilla/readability: 0.6.0,linkedom: 0.18.12,turndown: 7.2.2 - devDeps:
@types/turndown: 5.0.6
- Added:
-
package.json (modified)
- Added script:
"captures:pull": "bun run scripts/captures-pull.ts"
- Added script:
-
-
Errors and fixes:
- TS2532 Object possibly undefined in client.tsx when accessing
extractions[capture.id].extraction.titleinside a truthiness check: Fixed with IIFE pattern(() => { const ext = extractions[capture.id]; if (!ext) return null; ... })() - TS2322 Type undefined not assignable in html-extractor.ts for Readability return fields (title, byline, siteName, excerpt): Fixed with
?? null/?? "Untitled"coalescing - linkedom crash on invalid HTML (TypeError: Cannot destructure 'firstElementChild' from null): Fixed by adding
if (!document.documentElement) return null;guard - Cannot find module '@repo/db' when running scripts from root: Fixed by splitting into root script (port-forward + env) that delegates to worker script inside apps/api/ where @repo/db is available
- User rejected auto subtitle/date extraction in generator: User said "I don't think that we should make these changes to the generator. We'll need to do this manually later." Reverted all generator changes for subtitle/date/excerpt detection.
- User rejected on-demand extraction: User said "No, sorry. I want the MDX conversion done as soon as the capture arrives." Moved extraction from client-triggered to server-side on ingest.
- TS2532 Object possibly undefined in client.tsx when accessing
-
Problem Solving:
- Solved the full pipeline from HTML capture → clean MDX reading file
- Handled slug collisions with title+domain+hash pattern
- Made extraction best-effort (doesn't block capture save on failure)
- Created CLI tool following existing k8s script patterns for remote DB access
- Backfilled existing captures via temporary script (then deleted it)
-
All user messages:
- "Implement the following plan: [detailed plan for capture→MDX extraction]"
- "Can we test this using the capture that I just did?"
- "Can we add line length limits to the MDX files so that they break at say 80 or 120 characters?" (chose 120)
- "Also, let's run the MDX conversion immediately on capture; don't require a button click."
- "No, sorry. I want the MDX conversion done as soon as the capture arrives."
- "Right, yes. This should all be server side. It probably makes sense to store the MDX content in a database, where we can either (1) promote it to a page for students directly or (2) download and add to the local repository."
- "Can we clear local captures so that I can test this?"
- "Done! Next, can we create a script or bun command that grabs captures from the server (prod by default, but accepting dev or local flags) and writes them into the appropriate directory for further processing?"
- "Can we put the capture script at the top of the Captures Page?"
- "It's nice to have it on the profile page as well." (profile page already had it)
- "How do I grab the article from the local DB?" (answered:
bun run captures:pull local) - "Nice! Next, can you clean up the new MDX file? Move the date to the frontmatter, use a subtitle, remove the parts past the end."
- "No, sorry. I don't think that we should make these changes to the generator. We'll need to do this manually later." (reverted generator changes)
- /commitall (first commit - 3 commits made)
- "Can we use the URL somehow instead, or maybe add a URL hash to avoid title collisions?" (chose title+domain+hash after discussion)
- "Yes." (clear captures for re-test)
- "It's in content. I already pulled it into the filesystem." (re: second article)
- /commitall (second commit)
- "OK, next New Yorker article; process similarly." → "OK, next New Yorker article; process similarly." (second time, article was already in filesystem)
- /commitall (third commit)
- "OK! Now, the two new articles are both more accessible than McLuhan, and so probably don't need the same level of footnoting or sidenoting. But perhaps a bit would be useful? Or adding links out to relevant topics or people mentioned?"
-
Pending Tasks:
- Plan and implement light annotations (sidenotes/footnotes/links) for the two new reading MDX files
- Currently in plan mode — need to design the annotation approach and get user approval
-
Current Work: In plan mode, exploring how to add light annotations to the two new reading files:
content/spring2026/readings/chatgpt-is-a-blurry-jpeg-of-the-web-newyorker-ed2e.mdx(Ted Chiang)content/spring2026/readings/the-case-that-a-i-is-thinking-newyorker-d419.mdx(James Somers)
Have completed Phase 1 exploration: read both articles fully, explored existing annotation patterns in alphago.mdx and understanding-media-ch1.mdx, understood Sidenote (
<Sidenote id="x">+[text](sn:x)) and Footnote (^[content]) component syntax. Was about to begin Phase 2 (design) — launching a Plan agent to design the annotation approach.The user said: "the two new articles are both more accessible than McLuhan, and so probably don't need the same level of footnoting or sidenoting. But perhaps a bit would be useful? Or adding links out to relevant topics or people mentioned?"
-
Optional Next Step: Continue plan mode Phase 2: Design the annotation plan for both articles. The user wants lighter annotations than McLuhan's understanding-media-ch1.mdx (which has ~18 dense sidenotes). Focus on:
- Brief footnotes for key people mentioned (linking to Wikipedia/relevant pages)
- A few sidenotes for concepts that benefit from context (e.g., the Hutter Prize, Chiang's predictions that can be evaluated with hindsight, cross-reference between the two articles since Somers explicitly responds to Chiang)
- External links to referenced works (books, papers, other articles)
User's exact words: "the two new articles are both more accessible than McLuhan, and so probably don't need the same level of footnoting or sidenoting. But perhaps a bit would be useful? Or adding links out to relevant topics or people mentioned?"
If you need specific details from before compaction (like exact code snippets, error messages, or content you generated), read the full transcript at: ~/.claude/projects/-Users-challen-www-usingandunderstanding-ai/81c10e04-df4b-44a0-a195-49e486217b62.jsonl Please continue the conversation from where we left off without asking the user any further questions. Continue with the last task that you were asked to work on.
I was in plan mode designing light annotations for the two new reading articles. Let me continue by reading the current plan file and both articles to finalize the annotation design.
Let me also check the existing annotated readings for the annotation style.
Now I have a thorough understanding of both articles and the annotation patterns. Let me design the annotation plan.
Now let me write the new plan for annotating the two articles.
The plan proposes light annotations for both articles:
Chiang ("Blurry JPEG"): ~5 footnotes, ~3 sidenotes, ~3 external links. Key highlights include a cross-reference to Somers' article (which directly responds to Chiang), a note on Chiang's GPT-4 prediction proving correct, and connecting his writing-process argument to course themes.
Somers ("The Case That A.I. Is Thinking"): ~8 footnotes (many key people mentioned), ~4 sidenotes, ~4 external links. Key highlights include the dialogue between the two articles, the Human Genome Project analogy, Hofstadter's conversion from AI skeptic, and the Kanerva/Transformer connection.
Total: ~13 footnotes, ~7 sidenotes — much lighter than McLuhan's ~18 dense sidenotes.