I Built a Tool to Migrate 500+ Images to WebP in One Hour
My Lighthouse score was crying because of heavy images. So I wrote a Python ETL to bulk-convert everything to WebP and update all the URLs automatically. Here's how.
I was staring at my Lighthouse report like it owed me money.
Performance: 62.
The culprit? Images. Hundreds of them scattered across markdown files, hosted on Flickr, Imgur, GitHub… all in glorious, unoptimized JPEG and PNG formats.
The manual fix would be:
- Download each image
- Convert to WebP
- Upload to my CDN
- Find and replace every URL in every markdown file
For 500 unique images? That’s not a weekend project. That’s a prison sentence.
So I did what any lazy engineer would do: I automated it.
The Problem
My blog uses Hugo with markdown files. Images are referenced everywhere:
# In frontmatter
image: "https://live.staticflickr.com/65535/54519397357_403fc67f4a_k.jpg"
# In gallery shortcodes
< gallery id="example_id">
- https://live.staticflickr.com/65535/54525108024_adbff3cc9b_k.jpg
- https://live.staticflickr.com/65535/54520449879_784f0f24ca_k.jpg
< /gallery >
# Standard markdown

Each image had to be:
- Downloaded from the original source
- Converted to WebP (smaller, faster)
- Uploaded to my new CDN
- URL replaced in the markdown file
Multiply by 500. No thanks.
The Solution: An ETL Pipeline
I built bulk-webp-url-replacer—a Python tool that does exactly what it says:
python -m bulk_webp_url_replacer \
  --scan-dir ./content \
  --download-dir ./downloads \
  --output-dir ./webp_images \
  --new-url-prefix "https://cdn.example.com/images" \
  --threads 8
What it does:
- Extract: Scans all .md files for image URLs (frontmatter, galleries, inline)
- Transform: Downloads each image and converts to WebP
- Load: Replaces all old URLs with new CDN paths
One command. 500 images. Done.
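To make the Load step concrete, here's a rough sketch of the URL rewrite. It assumes the new name is simply the original filename with a .webp extension, which matches the Example Output section at the end of this post. The helper names new_url_for and rewrite_text are made up for illustration and are not the tool's actual API.

```python
import posixpath
from urllib.parse import urlparse


def new_url_for(old_url: str, prefix: str) -> str:
    """Map an old image URL to its CDN URL, keeping the filename stem."""
    filename = posixpath.basename(urlparse(old_url).path)
    stem, _ext = posixpath.splitext(filename)
    return f"{prefix.rstrip('/')}/{stem}.webp"


def rewrite_text(text: str, old_urls: list[str], prefix: str) -> str:
    """Replace every known old URL in a markdown document with its new URL."""
    # Longest first, so a URL that is a prefix of another is not replaced too early.
    for old in sorted(old_urls, key=len, reverse=True):
        text = text.replace(old, new_url_for(old, prefix))
    return text


doc = 'image: "https://live.staticflickr.com/65535/54519397357_403fc67f4a_k.jpg"\n'
urls = ["https://live.staticflickr.com/65535/54519397357_403fc67f4a_k.jpg"]
print(rewrite_text(doc, urls, "https://cdn.example.com/images"))
```

Plain string replacement is enough here because the extractor already knows the exact URLs, so there's no need to re-run the regexes during the write step.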
The Technical Bits
Regex Patterns for URL Extraction
Markdown has multiple ways to embed images. My extractor handles them all:
import re

PATTERNS = [
    # YAML frontmatter: image: "https://..."
    re.compile(r'^image:\s*["\']?(https?://[^"\'>\s]+)["\']?\s*$'),
    # TOML frontmatter: image = "https://..."
    re.compile(r'^image\s*=\s*["\']?(https?://[^"\'>\s]+)["\']?\s*$'),
    # Gallery shortcodes: - https://...
    re.compile(r'^\s*-\s+(https?://[^\s]+\.(jpg|jpeg|png|gif|webp))\s*$'),
    # Standard markdown: ![alt](https://...)
    re.compile(r'!\[[^\]]*\]\((https?://[^)]+)\)'),
]
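Here's one way these patterns might be applied. Treat it as a sketch, not the tool's actual code. It scans the file line by line, so the ^ and $ anchors work without re.MULTILINE, and collects the first capture group of every match. The extract_urls helper and the example.com URLs are hypothetical.

```python
import re

PATTERNS = [
    re.compile(r'^image:\s*["\']?(https?://[^"\'>\s]+)["\']?\s*$'),
    re.compile(r'^image\s*=\s*["\']?(https?://[^"\'>\s]+)["\']?\s*$'),
    re.compile(r'^\s*-\s+(https?://[^\s]+\.(jpg|jpeg|png|gif|webp))\s*$'),
    re.compile(r'!\[[^\]]*\]\((https?://[^)]+)\)'),
]


def extract_urls(text: str) -> list[str]:
    """Return unique image URLs in order of first appearance."""
    seen: dict[str, None] = {}  # dict keeps insertion order and drops duplicates
    for line in text.splitlines():
        for pattern in PATTERNS:
            for match in pattern.finditer(line):
                seen.setdefault(match.group(1), None)
    return list(seen)


sample = """image: "https://example.com/cover.jpg"
- https://example.com/gallery/one.png
Some text ![alt](https://example.com/inline.gif) and more.
- https://example.com/gallery/one.png
"""
print(extract_urls(sample))
```

Deduplicating at this stage is what turns 612 references into a smaller set of unique downloads.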
Parallel Downloads
Downloading 500 images sequentially? Slow. With ThreadPoolExecutor:
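from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {executor.submit(process_url, url): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        result = future.result()  # re-raises any exception from process_url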
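8 threads = up to 8x faster, since downloads spend most of their time waiting on the network. Real-world gains depend on your bandwidth and how hard the hosts throttle you.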
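For a version you can run without touching the network, here's a small sketch of the same pattern. The process_url below is a stand-in that sleeps instead of downloading, and process_all is a hypothetical wrapper, not a function from the tool. The main point is the try/except around future.result(), so one bad URL doesn't abort the whole batch.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_url(url: str) -> str:
    """Stand-in for download + convert; sleeps instead of hitting the network."""
    if "broken" in url:
        raise ValueError(f"cannot fetch {url}")
    time.sleep(0.05)
    return url.rsplit("/", 1)[-1]


def process_all(urls: list[str], threads: int = 8) -> tuple[dict[str, str], dict[str, str]]:
    """Run process_url over urls in a thread pool; return (results, errors)."""
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = {executor.submit(process_url, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad URL should not abort the batch
                errors[url] = str(exc)
    return results, errors


urls = [f"https://example.com/img{i}.jpg" for i in range(16)]
urls.append("https://example.com/broken.jpg")
start = time.perf_counter()
results, errors = process_all(urls)
print(f"{len(results)} ok, {len(errors)} failed in {time.perf_counter() - start:.2f}s")
```

Threads work well here because the job is I/O-bound. The WebP conversion itself is CPU work, so the speedup on that part may be smaller.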
Rate Limiting & Retries
Imgur wasn’t happy with my enthusiasm. HTTP 429 errors everywhere.
The fix: exponential backoff with browser-like headers.
import time
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
}

for attempt in range(max_retries):
    response = requests.get(url, headers=HEADERS, timeout=30)
    if response.status_code == 429:
        time.sleep(2 ** attempt)  # 1s, 2s, 4s...
        continue
    response.raise_for_status()  # surface any other HTTP error
    break  # success, stop retrying
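The loop above can be wrapped into a reusable helper. Here's a minimal sketch, assuming the HTTP call is passed in as a function so the backoff logic can be exercised without a real server. fetch_with_backoff and fake_fetch are illustrative names, and the fetch callable stands in for something like requests.get.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[str], tuple[int, bytes]],
    url: str,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch(url) until it stops returning 429, doubling the wait each time.

    Returns the body on HTTP 200, or None if every attempt was throttled
    or the server answered with some other status.
    """
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status == 429:
            sleep(2 ** attempt)  # 1s, 2s, 4s, ...
            continue
        return body if status == 200 else None
    return None


# Fake server: throttles the first two requests, then succeeds.
calls = {"n": 0}


def fake_fetch(url: str) -> tuple[int, bytes]:
    calls["n"] += 1
    return (429, b"") if calls["n"] <= 2 else (200, b"image-bytes")


waits: list[float] = []  # record the waits instead of actually sleeping
body = fetch_with_backoff(fake_fetch, "https://example.com/a.jpg", sleep=waits.append)
print(body, waits)
```

A more polite version could also honor the Retry-After header when the server sends one, and add a little random jitter so 8 threads don't all retry at the same instant.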
Smart Skipping
The tool saves a mapping.json after each run:
{
  "https://old-url.com/image.jpg": "new-filename.webp"
}
Next run? It skips already-processed images. Incremental migrations FTW.
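The skip logic can be as small as a dictionary lookup. This is a sketch under the assumption that mapping.json is a flat old-URL-to-filename object like the one above. The function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_mapping(path: Path) -> dict[str, str]:
    """Read mapping.json if it exists, otherwise start with an empty mapping."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def pending_urls(urls: list[str], mapping: dict[str, str]) -> list[str]:
    """Keep only the URLs that no earlier run has processed."""
    return [url for url in urls if url not in mapping]


def save_mapping(path: Path, mapping: dict[str, str]) -> None:
    """Write the mapping back to disk in a stable, diff-friendly order."""
    path.write_text(json.dumps(mapping, indent=2, sort_keys=True), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    mapping_file = Path(tmp) / "mapping.json"
    save_mapping(mapping_file, {"https://old-url.com/image.jpg": "new-filename.webp"})
    mapping = load_mapping(mapping_file)
    todo = pending_urls(
        ["https://old-url.com/image.jpg", "https://old-url.com/other.png"],
        mapping,
    )
    print(todo)
```

Only other.png is left to process, because image.jpg is already in the mapping from the previous run.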
The Results
Before:
- 612 image references across 72 markdown files
- Images scattered across Flickr, Imgur, GitHub
- Lighthouse begging for mercy
After:
- All images converted to WebP
- Hosted on a single CDN
- URLs automatically updated
- One hour of work (mostly watching the progress bar)
Performance improvement:
- Average image size: 60-80% smaller
- Lighthouse Performance: 62 → 89
Lessons Learned
- Automation scales. What would take days manually took an hour to build and minutes to run.
- Rate limiting is real. Always add retries and backoff. Sites like Imgur will throttle you.
- Dry-run first. The --dry-run flag saved me from accidentally breaking 72 files.
- WebP is worth it. Same quality, fraction of the size. There’s no reason to serve JPEGs in 2026.
Try It Yourself
The tool is open source on GitHub.
# Preview what would change
bulk-webp-url-replacer \
  --scan-dir ./content \
  --download-dir ./downloads \
  --output-dir ./webp \
  --dry-run
# Run for real
bulk-webp-url-replacer \
  --scan-dir ./content \
  --download-dir ./downloads \
  --output-dir ./webp \
  --new-url-prefix "https://your-cdn.com/images" \
  --threads 8
Your Lighthouse score will thank you. 🚀
Example Output
After running the migration tool, the URLs are automatically updated to point to the optimized WebP versions:
# In frontmatter
image: "https://raw.githubusercontent.com/HoangGeek/store/refs/heads/main/webp/54519397357_403fc67f4a_k.webp"
# In gallery shortcodes
< gallery id="example_id">
- https://raw.githubusercontent.com/HoangGeek/store/refs/heads/main/webp/54525108024_adbff3cc9b_k.webp
- https://raw.githubusercontent.com/HoangGeek/store/refs/heads/main/webp/54520449879_784f0f24ca_k.webp
< /gallery >
# Standard markdown
![My photo](https://raw.githubusercontent.com/HoangGeek/store/refs/heads/main/webp/sXyG3GX.webp)