Stop Bloating Your Database: The S3 Pointer Pattern

The day I almost bricked production.

It started with a feature request that sounded innocent enough:

“We need to show the raw transaction logs in the admin panel. Just dump the JSON into the DB so we can list them.”

I was young. I was naive. I thought, “Sure, MySQL has a JSON column type. What could go wrong?”

The logs were 10MB.
The table grew.
And then, one random Tuesday, the database choked.

CPU spiked to 99%. Queries that usually took 10ms were taking 10 seconds. My phone started vibrating with PagerDuty alerts, and I was sweating through my shirt.

Here’s what I learned the hard way using massive JSON blobs in a relational database is a surefire way to kill your performance.

The Trap: “It’s Just Data, Put It In The DB”

It feels logical. You have a User. Users have Logs. Put the Logs in the database.

But when those logs are massive blobs (10MB+), you aren’t just “storing data.” You are fundamentally breaking how a database works.

Why Your DB Hates You Now

Databases like MySQL live and die by their Buffer Pool (RAM). You want your RAM filled with useful stuff:

Index pages (so lookups are instant).
Hot rows (data people actually access).

When you stuff a 10MB blob into a row, that row becomes a Buffer Pool Bully.

It walks into the club, kicks out all the lean, efficient index pages, and sits there taking up space. Even if you just want to read one field, the DB often has to drag that entire fat payload from disk into RAM.

Your cache hit ratio tanks. Your disk IOPS scream. And your users leave.

I built a simulation so you can see exactly what happens inside your RAM:

Simulation: Buffer Pool Eviction

10MB
BLOB

Index Page

The Bully

See what happened? That red bully just nuked your performance cache.

The “Solutions” That Don’t Work

I tried these. Do not try these.

1. “Let’s Use Athena / Data Lake”

The logic: “S3 is cheap, let’s dump it there and query it!” The reality: Athena is designed for analytics, not for your user clicking “View Details”. It takes seconds. Users don’t wait seconds. They rage-click.

2. “The Hybrid Approach”

The logic: “Store small JSON in DB, move big ones to S3!” The reality: Now you have two code paths. Two sources of truth. And an onboarding doc that takes 30 minutes to explain “where the data is.” Don’t be cute. Be boring.

The Smart Fix: The S3 Pointer Pattern

Here is the “Secret Sauce” (it’s actually really standard, but feels like magic).

The DB stores: A pointer (s3_key).
The S3 bucket stores: The fat blob.

That’s it. Your database row is now tiny. It fits in the index. It stays in RAM.

The Flow

1. User asks Backend for data.
2. Backend checks DB: "Oh, file is at logs/123/cat.json"
3. Backend tells S3: "Give me a temporary URL for this file."
4. Backend gives URL to User.
5. User downloads from AWS directly.

Your backend server? Chill. Your database? Chill. Amazon’s massive servers? They carry the heavy load.

Simulation: Pointer vs Payload Path

🗄️

MySQL

Metadata Only

💻

Client

Browser/App

☁️

Amazon S3

Heavy Storage

Ready to request…

“Wait, isn’t direct S3 access dangerous?”

I can hear you screaming: “You want me to open my S3 bucket to the internet? Are you crazy?!”

Calm down. I’m not saying “Public Read”.

I’m saying Presigned URLs.

These are magic tokens your backend generates. They look like this: https://s3.aws.../cat.json?signature=123xyz&expires=300.

They are valid for 5 minutes (or whatever you set).
They only unlock one specific file.
They are cryptographically signed by your backend.

Why you shouldn’t “proxy” the file

You might think: “I’ll just download the file to my backend and pass it to the user. That feels safer.”

This is the “Proxy Trap”.

Unless you are watermarking a PDF on the fly or working around a strict corporate firewall, don’t be a proxy. Let S3 do its job.

”Show me the Code”

Okay, here is how stupidly simple this is.

1. The Schema

Your DB row loses weight immediately.

CREATE TABLE logs (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  s3_key VARCHAR(255) NOT NULL, -- The only thing that matters
  details JSON, -- Store SMALL metadata here if you need to query it
  created_at TIMESTAMP,
  PRIMARY KEY (id)
);

2. The Read Path (Python/Boto3)

def get_log_download_url(log_id, user):
    # 1. Check if user owns this log (CRITICAL STEP)
    log = db.fetch_log(log_id)
    if log.owner_id != user.id:
        raise Forbidden("Nice try, hacker.")

    # 2. Ask AWS for a temporary pass
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'my-bucket', 'Key': log.s3_key},
        ExpiresIn=300 # 5 minutes
    )
    
    # 3. Give it to the frontend
    return {"url": url}

Check the Receipt (Cost)

If you are still unconvinced, let’s talk money.

Storing 1TB of junk in RDS (SSD storage) costs about $115/month. Storing 1TB of junk in S3 Standard IA costs about $12.50/month.

You are paying 10x more to make your database slower. Make it make sense.

TL;DR

Don’t put 10MB blobs in MySQL. It bullies your Buffer Pool.
Do store files in S3 and keep a pointer (s3_key) in the DB.
Do use Presigned URLs so your backend doesn’t have to carry the payload.
Don’t overcomplicate it.

Simple scales. Clever fails.