🚀 Executive Summary
TL;DR: GPTBot’s aggressive crawling can significantly inflate Vercel bills for open-source projects by triggering serverless functions. Developers can mitigate this by implementing a `robots.txt` disallow rule, a more robust `vercel.json` edge configuration to block requests by User-Agent, or, as a last resort, IP blocking.
🎯 Key Takeaways
- The `robots.txt` file offers a simple, honor-system way to ask GPTBot and ChatGPT-User not to crawl a site, but it is not a guaranteed block against all crawlers.
- Vercel’s `vercel.json` configuration provides a highly effective edge-level solution to block GPTBot by its User-Agent, returning a `403 Forbidden` status and preventing serverless function invocation costs.
- IP blocking, using OpenAI’s published ranges (retrieved via `dig TXT oai.openai.com`), is a ‘nuclear option’ for persistent, non-compliant scrapers, but requires high maintenance and carries risks of inadvertently blocking legitimate services.
OpenAI’s GPTBot hammering your open-source project and inflating your Vercel bill? I’ve seen it happen to the best of us. Here are three battle-tested ways to block it, from the polite request to the firewall lockdown.
So, GPTBot Found Your Project. Here’s How to Stop Paying for OpenAI’s Scraper.
It was 3 AM, and of course, PagerDuty was screaming about latency on one of our core APIs. I dove in, expecting a bad deploy or a failing `prod-db-01` replica. Instead, I found our load balancers were getting absolutely hammered by a single, relentless user agent I’d never seen before. A new “AI” company had decided our internal pricing API was prime training data. The feeling of watching your carefully architected system get treated like a free-for-all buffet by a mindless scraper is infuriating. So when I saw that Reddit thread about GPTBot running up a developer’s Vercel bill, I felt that all too familiar frustration.
First, What’s Going On Here?
Let’s get one thing straight: GPTBot isn’t malicious, it’s just… voracious. It’s OpenAI’s web crawler, and its job is to scrape the public internet to train future models like GPT-5. The problem is, it’s incredibly aggressive. For a static site, this might just be annoying. But on a platform like Vercel, especially if you have serverless functions firing for each request, every one of those 164,000 daily hits can translate directly into a line item on your invoice. You’re essentially paying for OpenAI to train their next billion-dollar product on your work. We can do better.
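To put a rough number on that, here is a minimal Python sketch that scales a daily hit count to a billing month. It assumes every hit triggers exactly one function invocation, and both the function names and the per-million price are placeholders I made up for illustration, not Vercel’s actual rates. Plug in the figure from your own plan.

```python
def monthly_invocations(daily_hits: int, days: int = 30) -> int:
    """Scale a daily request count to a billing month of `days` days.

    Assumes one function invocation per hit.
    """
    return daily_hits * days


def estimated_cost(invocations: int, price_per_million: float) -> float:
    """Estimate spend from a HYPOTHETICAL price per million invocations."""
    return invocations / 1_000_000 * price_per_million


if __name__ == "__main__":
    hits = monthly_invocations(164_000)  # the daily figure quoted above
    print(f"{hits:,} invocations per 30-day month")
    # 0.60 is a placeholder price, not a real Vercel rate.
    cost = estimated_cost(hits, price_per_million=0.60)
    print(f"~${cost:.2f} at the assumed rate")
```

Even at a small unit price, the point is that the count scales linearly with crawler traffic you never asked for.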
Here are three ways to deal with this, from the polite suggestion to the bouncer at the door.
Solution 1: The “Polite Request” using robots.txt
This is the classic, standard way to manage bots. You create a file named robots.txt in the root of your public directory and tell “good” bots what they are and aren’t allowed to do. It’s the honor system of the internet.
The Fix:
In your project’s public folder (the one that gets deployed to the root of your site), create a file called robots.txt with the following content:
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
This tells two of OpenAI’s user agents to stay out of every path on your site: GPTBot, the training crawler, and ChatGPT-User, which fetches pages on behalf of ChatGPT users. Most reputable crawlers, including GPTBot, will respect this. It’s the easiest and fastest thing to implement.
My Take: This is step one. Always do this. But don’t rely on it as your only line of defense. A polite note only works on polite visitors, and not all bots are polite.
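If you want to sanity-check the rules before deploying, Python’s standard-library `urllib.robotparser` can evaluate them locally. This is only a sketch: the helper name `is_allowed` and the example.com URL are mine, and it shows how one compliant parser reads the file, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the robots.txt above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Pass the crawler's product token, not its full User-Agent string.
    for agent in ("GPTBot", "ChatGPT-User", "Googlebot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/docs")
        print(f"{agent}: {'allowed' if verdict else 'disallowed'}")
```

Agents without a matching group, such as Googlebot here, stay allowed, so the file only turns away the bots you name.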
Solution 2: The “Firm No” at the Edge (Vercel Config)
If the polite request doesn’t work, or you just want to be certain, you need to block the bot before it ever hits your functions. On Vercel, the best way to do this is with the vercel.json configuration file. This is a platform-level rule that intercepts requests at the edge based on their User-Agent header.
The Fix:
In the root of your project, create or edit your vercel.json file and add a routing rule with a `has` condition. The rule checks the `user-agent` header for the string “GPTBot” and, if it matches, returns a 403 Forbidden status immediately. No middleware file is involved, and the rule needs no `dest` because it only sets a status. One caveat: the `routes` property cannot be combined with `rewrites`, `redirects`, or `headers` in the same vercel.json, so adapt the rule if your project already uses those.
{
  "routes": [
    {
      "src": "/(.*)",
      "has": [
        {
          "type": "header",
          "key": "user-agent",
          "value": ".*GPTBot.*"
        }
      ],
      "status": 403
    }
  ]
}
This is far more robust. The bot gets an access denied error without ever executing your code, saving you the invocation cost. This is my recommended approach for anyone on Vercel.
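Before shipping the rule, it’s worth checking which header values the `.*GPTBot.*` pattern actually catches. Here’s a small Python sketch of that check. It assumes the platform matches the pattern against the whole header value and is case-sensitive, which is how Python’s `re.fullmatch` behaves but is my assumption about Vercel. The sample strings are illustrative, and the GPTBot version number in particular is a guess.

```python
import re

# Same pattern as the "value" field in the routing rule.
BLOCK_PATTERN = re.compile(r".*GPTBot.*")

# Illustrative header values; the GPTBot version number is an assumption.
SAMPLE_AGENTS = [
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "gptbot/1.0",  # lowercase variant
]


def would_block(user_agent: str) -> bool:
    """True when the whole header value matches the block pattern."""
    return BLOCK_PATTERN.fullmatch(user_agent) is not None


if __name__ == "__main__":
    for agent in SAMPLE_AGENTS:
        print(f"{'BLOCK' if would_block(agent) else 'pass '}  {agent}")
```

Note that the lowercase variant slips through under these assumptions, so widen the pattern if you see differently cased agents in your logs.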
Solution 3: The “Nuclear Option” (IP Blocking)
Sometimes traffic from a crawler’s network arrives with a different or missing User-Agent, and neither of the previous methods will catch it. In that case, the last resort is blocking the known IP address ranges of the service you want to restrict. OpenAI publishes its IP ranges for this very reason.
The Fix:
This isn’t something you’d typically do in vercel.json. You’d implement this at a higher level, like a Cloudflare WAF (Web Application Firewall) rule, or with AWS WAF or network ACLs if you were self-hosting (security groups only support allow rules, so they can’t express a deny list). You would create a rule that denies all traffic from OpenAI’s published IP blocks.
You can find the official list by running a DNS TXT query:
$ dig +short TXT oai.openai.com
Warning: I call this the ‘nuclear’ option for a reason. IP ranges can change, requiring you to maintain your block list. More importantly, you could inadvertently block legitimate OpenAI services or APIs that your application might actually need to use in the future. Use this with extreme caution and only if you’re facing a truly persistent, non-compliant scraper.
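Whatever firewall you use, the underlying check is simple CIDR membership. Here’s a minimal Python sketch of it using the standard-library `ipaddress` module. The ranges below are placeholders taken from the reserved documentation blocks, not OpenAI’s real ranges, and `is_blocked` is a name I made up. Swap in the published list and refresh it on a schedule.

```python
from ipaddress import ip_address, ip_network

# PLACEHOLDER ranges from reserved documentation blocks, NOT OpenAI's real ones.
BLOCKED_RANGES = ["192.0.2.0/24", "198.51.100.0/28"]


def is_blocked(client_ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if `client_ip` falls inside any of the CIDR ranges."""
    addr = ip_address(client_ip)
    return any(addr in ip_network(cidr) for cidr in cidr_ranges)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "198.51.100.16", "203.0.113.5"):
        verdict = "deny" if is_blocked(ip, BLOCKED_RANGES) else "allow"
        print(f"{ip}: {verdict}")
```

Running a few known addresses through a check like this before you push a rule live is a cheap way to avoid locking out traffic you actually want.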
Quick Comparison
Here’s a quick cheat sheet for choosing your strategy.
| Solution | Ease of Implementation | Effectiveness | Maintenance |
|---|---|---|---|
| robots.txt | Very Easy | Low (Relies on trust) | None |
| Vercel Config Block | Easy | High | Low (Update if agent changes) |
| IP Blocking | Moderate / Complex | Very High | High (IPs can change) |
Your passion project shouldn’t be a free lunch for a trillion-dollar company. Start with robots.txt, but have the vercel.json block ready to go. It’s the most effective and pragmatic solution for protecting your project and your wallet. Now go ship something cool.
🤖 Frequently Asked Questions
❓ What is GPTBot and why is it problematic for Vercel users?
GPTBot is OpenAI’s web crawler designed to scrape public internet data for training future models. It becomes problematic on Vercel, especially for projects utilizing serverless functions, because its high volume of requests directly translates into invocation costs, forcing developers to pay for OpenAI’s data collection.
❓ How do `robots.txt`, Vercel config, and IP blocking compare for stopping GPTBot?
`robots.txt` is very easy but relies on the bot’s compliance. Vercel config (using `vercel.json`) is easy, highly effective for blocking by User-Agent at the edge, and prevents function costs. IP blocking is moderate/complex, very high in effectiveness against spoofing, but high maintenance due to changing IP ranges and risks blocking legitimate services.
❓ What is a common implementation pitfall when trying to block GPTBot and how can it be avoided?
A common pitfall is relying solely on `robots.txt`, as it’s an honor system and not all bots comply, leading to continued billing. This can be avoided by implementing a more robust edge-level block using `vercel.json` to check the `User-Agent` header and return a `403 Forbidden` status immediately, ensuring the bot never triggers serverless functions.