Defuddle: The Next Generation of Web Content Extraction
Defuddle is an emerging TypeScript alternative to Mozilla Readability. Written for the Obsidian Web Clipper, it extracts clean HTML for markdown conversion, with custom site extractors and math/code block standardization.
Part 1: Foundations — The Mental Model
You have probably used browser “Reader View” extensions, read-it-later apps like Pocket, or web clippers. Under the hood, many of them rely on one ancient, legendary library: Mozilla Readability.js.
While Readability is great, the modern web has evolved. Pages now have complex math formulas (MathJax/KaTeX), elaborate code blocks with syntax highlighting, nested footnotes, and JavaScript-rendered content (like X/Twitter or ChatGPT chats). Readability often strips these or outputs messy HTML that is terrible for converting into Markdown.
Enter Defuddle by kepano (the creator of the Obsidian Web Clipper).
Mental Model: Think of Defuddle as Readability 2.0 specialized for Markdown enthusiasts. It is a content extractor that looks at a cluttered web page and surgically isolates the main article, heavily standardizing complex elements (like math, code, and footnotes) into clean, semantic HTML so that subsequent HTML-to-Markdown converters (like Turndown) produce perfect results.
Part 2: The Investigation — Architecture Deep Dive
Defuddle is written purely in TypeScript and designed to run uniformly in the browser, in Node.js, and via a CLI.
The Pipeline Architecture
When you pass a document to Defuddle, it runs the HTML through a multi-stage pipeline:
- Extraction (Site-Specific or Heuristic)
- Standardization (Elements Pipeline)
- Scoring & Cleanup
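As a rough mental picture, the three stages can be sketched as a chain of functions applied in order. This is a minimal sketch and not Defuddle's actual code: `PipelineState`, `stage`, `runPipeline`, and the placeholder transforms are hypothetical names invented for illustration.

```typescript
// Working state handed from one stage to the next.
interface PipelineState {
  html: string;
  log: string[]; // names of the stages that have run, in order
}

type Stage = (state: PipelineState) => PipelineState;

// Hypothetical helper: wraps a plain string transform as a named stage.
function stage(name: string, transform: (html: string) => string): Stage {
  return (state) => ({
    html: transform(state.html),
    log: [...state.log, name],
  });
}

// Placeholder transforms standing in for the real work of each stage.
const stages: Stage[] = [
  stage("extraction", (html) => html.replace(/<nav>[\s\S]*?<\/nav>/g, "")),
  stage("standardization", (html) =>
    html.replace(/<h1>([\s\S]*?)<\/h1>/g, "<h2>$1</h2>")
  ),
  stage("scoring-and-cleanup", (html) => html.trim()),
];

function runPipeline(html: string): PipelineState {
  const initial: PipelineState = { html, log: [] };
  return stages.reduce<PipelineState>((state, run) => run(state), initial);
}

const demo = runPipeline(" <nav>Home</nav><h1>Title</h1><p>Body</p> ");
console.log(demo.log.join(" -> "));
console.log(demo.html);
```

Keeping each stage a pure function of the previous state makes it easy to test stages in isolation or reorder them.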
1. The Extractor Registry
Mozilla Readability applies one giant set of heuristic rules to every site. Defuddle, however, maintains an Extractor Registry (src/extractors/).
If you are parsing a generic blog, it uses heuristics. But if you are parsing a known site, it uses a dedicated extractor. Defuddle ships with built-in extractors for:
- AI Chats: chatgpt.ts, claude.ts, gemini.ts, grok.ts
- Social & Forums: reddit.ts, hackernews.ts, x-article.ts, twitter.ts
- Media & Code: github.ts, youtube.ts
These extractors know exactly where the payload is on those specific DOM structures, avoiding the need to guess. There is even a useAsync option that falls back to third-party APIs (like FxTwitter) if the local HTML is a blank SPA frame.
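The registry idea can be sketched as a list of matchers tried in order, with the heuristic path as the fallback. This is an illustration under assumptions rather than Defuddle's real interface: `SiteExtractor`, `registry`, `heuristicExtract`, and the hostname patterns are hypothetical, and the extractor bodies are stubs.

```typescript
interface ExtractedContent {
  source: string; // which extractor produced the result
  html: string;
}

interface SiteExtractor {
  name: string;
  matches: (url: URL) => boolean;
  extract: (html: string) => string;
}

// Hypothetical entries; a real extractor would walk the site's DOM structure.
const registry: SiteExtractor[] = [
  {
    name: "reddit",
    matches: (u) => /(^|\.)reddit\.com$/.test(u.hostname),
    extract: (html) => html,
  },
  {
    name: "hackernews",
    matches: (u) => u.hostname === "news.ycombinator.com",
    extract: (html) => html,
  },
];

// Stub for the generic scoring path used when no site extractor matches.
function heuristicExtract(html: string): string {
  return html;
}

function extract(rawUrl: string, html: string): ExtractedContent {
  const url = new URL(rawUrl);
  const hit = registry.find((candidate) => candidate.matches(url));
  return hit
    ? { source: hit.name, html: hit.extract(html) }
    : { source: "heuristic", html: heuristicExtract(html) };
}

console.log(extract("https://old.reddit.com/r/ObsidianMD/", "<p>post</p>").source);
console.log(extract("https://example.com/blog/post", "<p>post</p>").source);
```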
2. The Standardization Pipeline
The magic of Defuddle happens in src/elements/. Once the raw content is isolated, Defuddle standardizes it so Markdown converters won’t choke:
- Code Blocks (code.ts): Checks for pre > code. It strips out line numbers and arbitrary syntax highlighting spans, leaving only semantic <code data-lang="js" class="language-js"> tokens.
- Math (math.ts): Detects MathJax, KaTeX, and MathML. It converts them into standardized <math data-latex="..."> elements (using libraries like mathml-to-latex and temml in the “full” bundle).
- Footnotes (footnotes.ts): Detects various footnote reference patterns (superscripts, brackets) and rewrites them into a standard ordered list at the bottom of the DOM, ensuring Markdown converters create strict [^1] syntax.
- Headings (headings.ts): Demotes H1s to H2s, removes anchor links inside headings, and drops the first heading if it perfectly matches the <title>.
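To make one of these rules concrete, here is a minimal sketch of the heading rules applied to a simplified model (a flat list of headings instead of a DOM). The `Heading` type and `normalizeHeadings` function are hypothetical, and anchor-link removal is left out because it needs a real DOM.

```typescript
interface Heading {
  level: number; // 1 for <h1>, 2 for <h2>, and so on
  text: string;
}

function normalizeHeadings(headings: Heading[], pageTitle: string): Heading[] {
  // Drop the first heading when it exactly repeats the page title.
  const first = headings[0];
  const repeatsTitle =
    first !== undefined && first.text.trim() === pageTitle.trim();
  const kept = repeatsTitle ? headings.slice(1) : headings;

  // Demote every remaining H1 to H2; other levels are left untouched.
  return kept.map((h) => (h.level === 1 ? { ...h, level: 2 } : h));
}

const normalized = normalizeHeadings(
  [
    { level: 1, text: "Quantum Computing 101" },
    { level: 1, text: "Introduction" },
    { level: 3, text: "Qubits" },
  ],
  "Quantum Computing 101"
);
console.log(JSON.stringify(normalized));
```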
3. Tree Shaking Bundles
Defuddle compiles to three targets:
- defuddle/core: Tiny browser bundle. No external dependencies.
- defuddle/full: Browser bundle that includes heavy MathML/LaTeX parsing libraries.
- defuddle/node: Optimized for backend scraping using JSDOM.
Part 3: The Diagnosis — What It Does for Developers
For developers building scrapers, AI ingestion pipelines, or productivity tools, Defuddle solves several long-standing headaches.
Problem 1: Ingesting AI Chat Logs
If you try to scrape a ChatGPT or Claude URL using standard Readability, the heavy DOM nesting confuses the heuristic scorer. Defuddle’s specific chatgpt.ts extractor identifies the user/assistant message bubbles and formats them cleanly, making it trivial to dump your chat history into an Obsidian vault or another LLM context window.
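The gist of a chat extractor can be sketched as: collect the message bubbles with their roles, then emit one clearly labelled block per message. The `ChatMessage` type, the `renderTranscript` function, and the output markup below are assumptions made for illustration, not the format chatgpt.ts actually produces.

```typescript
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  html: string; // inner HTML of one message bubble
}

// Hypothetical output shape: one labelled section per message.
function renderTranscript(messages: ChatMessage[]): string {
  const label: Record<Role, string> = { user: "You", assistant: "Assistant" };
  return messages
    .map(
      (m) =>
        `<section data-role="${m.role}"><h3>${label[m.role]}</h3>${m.html}</section>`
    )
    .join("\n");
}

const transcript = renderTranscript([
  { role: "user", html: "<p>What is a qubit?</p>" },
  { role: "assistant", html: "<p>A two-level quantum system.</p>" },
]);
console.log(transcript);
```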
Problem 2: Preserving Code and Math
If you clip a technical blog post containing Python code and LaTeX math, standard extractors often destroy the backticks and render math as garbled text. By enforcing the Standardize pipeline, Defuddle ensures that when you pipe its output to Turndown, you get ```python and $$a \neq 0$$.
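Once code and math are in a predictable shape, the Markdown step becomes almost mechanical. The sketch below assumes a simplified block model in place of real HTML; the `Block` type and `toMarkdown` function are hypothetical and are not how Turndown works internally.

```typescript
// Simplified stand-ins for standardized elements.
type Block =
  | { kind: "paragraph"; text: string }
  | { kind: "code"; lang: string; text: string } // from <code data-lang="...">
  | { kind: "math"; latex: string }; // from <math data-latex="...">

const FENCE = "`".repeat(3);

function toMarkdown(blocks: Block[]): string {
  return blocks
    .map((block) => {
      switch (block.kind) {
        case "code":
          // The language tag survives because it was normalized upstream.
          return FENCE + block.lang + "\n" + block.text + "\n" + FENCE;
        case "math":
          return "$$" + block.latex + "$$";
        case "paragraph":
          return block.text;
      }
    })
    .join("\n\n");
}

const markdown = toMarkdown([
  { kind: "paragraph", text: "Solve for x:" },
  { kind: "math", latex: "a \\neq 0" },
  { kind: "code", lang: "python", text: "print(1)" },
]);
console.log(markdown);
```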
Example Output
When you run Defuddle, you don’t just get HTML; you get a rich metadata object:
{
  "author": "John Doe",
  "title": "Quantum Computing 101",
  "content": "<article><h2>Introduction</h2>...</article>",
  "description": "A beginner's guide to qubits.",
  "image": "https://example.com/hero.jpg",
  "schemaOrgData": { ... },
  "wordCount": 1250,
  "parseTime": 42
}
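If you consume this object from TypeScript, it can help to pin down its shape. The interface below simply mirrors the fields in the sample above; the name `ExtractionResult`, the `summarize` helper, and the assumption that `parseTime` is in milliseconds are mine and do not come from the library's published types.

```typescript
// Mirrors the sample payload; not the library's official type.
interface ExtractionResult {
  author: string;
  title: string;
  content: string;
  description: string;
  image: string;
  schemaOrgData: unknown; // structure depends on the page
  wordCount: number;
  parseTime: number; // assumed to be milliseconds
}

// Hypothetical helper: one-line summary with a rough reading-time estimate.
function summarize(result: ExtractionResult, wordsPerMinute = 200): string {
  const minutes = Math.max(1, Math.ceil(result.wordCount / wordsPerMinute));
  return `${result.title} by ${result.author} (${result.wordCount} words, ~${minutes} min read)`;
}

const sample: ExtractionResult = {
  author: "John Doe",
  title: "Quantum Computing 101",
  content: "<article><h2>Introduction</h2></article>",
  description: "A beginner's guide to qubits.",
  image: "https://example.com/hero.jpg",
  schemaOrgData: {},
  wordCount: 1250,
  parseTime: 42,
};
console.log(summarize(sample));
```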
Part 4: The Resolution — How to Use Defuddle
You can integrate Defuddle into your stack in minutes.
For Python / CLI Developers (Quick Web Scraping)
If you just want to extract a page via bash quickly:
npm install -g defuddle
# Return a clean JSON payload mapping the site
defuddle parse https://example.com/article --json
# Return pure markdown
defuddle parse https://example.com/article --markdown
For Node.js (Backend Scrapers & AI Pipelines)
If you are writing a data ingestion pipeline or vector-database loader in Node.js (make sure your package.json has "type": "module"):
import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';
async function extractArticle(url) {
  // 1. Fetch the page and build a DOM with JSDOM
  const dom = await JSDOM.fromURL(url);
  // 2. Parse with Defuddle
  const result = await Defuddle(dom, url, {
    markdown: true,
    removeImages: false
  });
  console.log(`Title: ${result.title}`);
  // With markdown: true, content holds the Markdown
  // (separateMarkdown: true keeps HTML in content and adds contentMarkdown)
  return result.content;
}
For Browser Extensions (Frontend)
If you are building a React app, Chrome extension, or Web Clipper:
import Defuddle from 'defuddle';
// Pass the live window.document
const extractor = new Defuddle(document, { debug: false });
const payload = extractor.parse();
console.log(payload.content); // Lean, standardized HTML
Final Mental Model
┌────────────────────────────────────────────────────────────┐
│                          Defuddle                          │
│                                                            │
│  "Readability 2.0 engineered for Markdown workflows"       │
│                                                            │
│  How it Extracts:                                          │
│  → Site-specific Extractors (Reddit, ChatGPT, GitHub)      │
│  → Fallback Heuristic Scorer for generic blogs             │
│                                                            │
│  How it Standardizes (The true value):                     │
│  → Code blocks: Strips line numbers, keeps language tags   │
│  → Math: Converts MathJax/KaTeX to semantic MathML/LaTeX   │
│  → Footnotes: Flattens to standardized bottom lists        │
│                                                            │
│  Where it runs:                                            │
│  → Browser (no dependencies)                               │
│  → Node.js (via JSDOM)                                     │
│  → CLI (with built-in --markdown flags)                    │
└────────────────────────────────────────────────────────────┘
GitHub: kepano/defuddle
Playground: Defuddle Playground