Protecting Against AI Load and AI Crawlers
AI crawlers and bots consume significant bandwidth and server resources by scraping content to train models. A growing number of them also ignore robots.txt and crawl aggressively. The attitude seems to be: "Why store information from Wikipedia on our servers when we can just crawl Wikipedia thousands of times per second?"
Examples of AI crawlers:
- GPTBot (OpenAI)
- Claude-Web (Anthropic Claude)
- Google Bard (various Google bots)
- Various other model-training bots
These crawlers will hit your site aggressively and consume bandwidth and resources. YES! Even if you have a robots.txt file, many AI crawlers will ignore it and crawl your site anyway. 💩
Reports on the problem:
- https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
- https://www.theverge.com/2025/3/20/23645081/ai-crawlers-traffic-blocking-websites
- https://www.wired.com/story/made-in-china-niche-websites-are-seeing-a-surge-of-mysterious-traffic-from-china/
- https://digiday.com/media/inside-the-atlantics-ai-bot-blocking-strategy/
- https://pressgazette.co.uk/platforms/eight-in-ten-of-worlds-biggest-news-websites-now-block-ai-training-bots/
- https://www.cloudflare.com/the-net/building-cyber-resilience/regain-control-ai-crawlers/
- https://www.cnet.com/tech/services-and-software/why-wikipedia-is-losing-traffic-to-ai-overviews-on-google/
Apache Methods for Blocking AI Crawlers
1. Block Specific User Agents (.htaccess)
Honestly, on paper this is the most straightforward method, but it is also the least effective: a crawler can bypass it simply by changing its User-Agent string.
Create or edit the .htaccess file in your document root:
# Block AI crawlers and bad bots
<IfModule mod_rewrite.c>
RewriteEngine On
# Block GPTBot (OpenAI)
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule ^.* - [F,L]
# Block CCBot (Common Crawl)
RewriteCond %{HTTP_USER_AGENT} CCBot [NC]
RewriteRule ^.* - [F,L]
# Block Claude Web crawlers
RewriteCond %{HTTP_USER_AGENT} Claude-Web [NC]
RewriteRule ^.* - [F,L]
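# Google: Google-Extended (the AI-training control) is a robots.txt token only and is
# never sent as an HTTP User-Agent, so it has to be handled in robots.txt.
# This rule only affects requests whose User-Agent contains "Google-Inspections".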
RewriteCond %{HTTP_USER_AGENT} Google-Inspections [NC]
RewriteRule ^.* - [F,L]
# Block Cohere
RewriteCond %{HTTP_USER_AGENT} cohere-ai [NC]
RewriteRule ^.* - [F,L]
# Block SEO crawlers (these are not AI bots; drop this rule if you rely on those tools)
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|SemrushBot|MJ12bot|DotBot) [NC]
RewriteRule ^.* - [F,L]
</IfModule>
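To see why user-agent rules are easy to bypass, the matching can be reproduced offline. The following is a minimal Python sketch, not part of the Apache setup: the sample User-Agent strings are made up for illustration, and `re.IGNORECASE` stands in for the `[NC]` flag.

```python
import re

# Same names the rewrite rules look for; IGNORECASE mirrors Apache's [NC] flag.
BLOCKED = re.compile(r"GPTBot|CCBot|Claude-Web|cohere-ai", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if this User-Agent header would trigger the 403 rule."""
    return BLOCKED.search(user_agent) is not None


# Hypothetical header values, for illustration only.
samples = {
    "honest crawler": "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0)",
    "lowercase variant": "ccbot/2.0",
    "spoofed as a browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
}

for label, ua in samples.items():
    print(f"{label}: {'403' if is_blocked(ua) else '200'}")
```

The third sample passes because the rule only sees the header text: a crawler that sends a browser-like string is indistinguishable from a browser at this layer.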
2. Block via Apache Configuration (Virtual Host)
Similar to the .htaccess method, but implemented at the server configuration level. It is more efficient than .htaccess because Apache reads the configuration once at startup instead of checking .htaccess files on every request. It can still be bypassed by changing the user agent.
Add to your VirtualHost configuration:
<VirtualHost *:80>
ServerName example.com
# Block AI crawlers
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Claude-Web|cohere-ai) [NC]
RewriteRule ^.* - [F,L]
</IfModule>
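# Alternative without mod_rewrite: allow only requests whose user agent does not match.
# A bare "Require not" cannot authorize a request on its own, so use a negated match instead.
<LocationMatch "^/">
Require expr "%{HTTP_USER_AGENT} !~ /GPTBot|CCBot/i"
</LocationMatch>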
DocumentRoot /var/www/html
</VirtualHost>
3. Rate Limiting with mod_ratelimit
Keep in mind that this throttles every client, so it also slows down real people.
Limit bandwidth per connection (mod_ratelimit throttles each connection, not each IP address):
<VirtualHost *:443>
ServerName my-server.com
<Location "/downloads-to-customer">
SetOutputFilter RATE_LIMIT
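# Apache does not allow trailing comments on directive lines, so notes go on their own lines.
# Values are in KiB: 20480 KiB = 20 MiB initial burst (rate-initial-burst needs httpd 2.4.24+)
SetEnv rate-initial-burst 20480
# 10240 KiB/s = 10 MiB/s once the initial burst is used up
SetEnv rate-limit 10240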
</Location>
</VirtualHost>
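A quick way to sanity-check values like these is to estimate how long a download takes under them. This is a rough Python sketch under stated assumptions: the helper name `download_seconds` is made up, the burst is treated as instantaneous, and the client's own network speed is ignored.

```python
def download_seconds(size_kib: float, rate_kib_s: float, burst_kib: float = 0.0) -> float:
    """Approximate transfer time: the burst goes out unthrottled, the rest at rate_kib_s."""
    throttled_kib = max(0.0, size_kib - burst_kib)
    return throttled_kib / rate_kib_s


# A 100 MiB file with rate-limit 10240 and rate-initial-burst 20480 (both in KiB)
print(download_seconds(100 * 1024, rate_kib_s=10240, burst_kib=20480))  # 8.0
```

With these numbers a single connection still gets 10 MiB/s, which is unlikely to slow a crawler much; far lower values are needed if the goal is to make bulk scraping expensive.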
Nginx Methods for Blocking AI Crawlers
1. Block Specific User Agents
As with Apache, this is the most straightforward method on paper, but also the least effective: a crawler can bypass it simply by changing its User-Agent string.
Add to your Nginx configuration (/etc/nginx/sites-available/default):
server {
listen 80;
server_name example.com;
if ($http_user_agent ~* (GPTBot|CCBot|Claude-Web|cohere-ai|anthropic-ai|AhrefsBot|SemrushBot|MJ12bot|DotBot)) {
return 403;
}
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
}
2. Using a Map Block
Define the map in the http context. A map block is not allowed inside server, so place it directly in nginx.conf as shown here, or in a file that http includes:
http {
# AI crawler blocking map (must be here, not inside server)
map $http_user_agent $block_ai_crawler {
default 0;
# AI / LLM bots
~*GPTBot 1;
~*CCBot 1;
~*Claude-Web 1;
~*cohere-ai 1;
~*anthropic-ai 1;
~*aiindexer 1;
~*petalbot 1;
# SEO bots
~*AhrefsBot 1;
~*SemrushBot 1;
~*MJ12bot 1;
~*DotBot 1;
# Search engine bots: blocking these can remove the site from Yandex and Bing results
~*YandexBot 1;
~*bingbot 1;
}
server {
listen 80;
server_name example.com;
if ($block_ai_crawler) {
return 403;
}
location / {
proxy_pass http://backend;
}
}
}
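Alternatively, move just the map block (without the http and server wrappers) into /etc/nginx/snippets/ai-crawler-block.conf and include that file from http, not from server:
http {
include /etc/nginx/snippets/ai-crawler-block.conf;
server {
listen 80;
server_name example.com;
# $block_ai_crawler is set by the included map
if ($block_ai_crawler) {
return 403;
}
location / {
proxy_pass http://backend;
}
}
}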
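Because the bot list changes often, one option is to keep the names in a plain list and generate the map from it. The following is a minimal Python sketch of that idea; the function name `render_map` and the name check are assumptions made for this example, not part of Nginx.

```python
import re

# Accept only plain bot names so nothing unexpected ends up in the config file.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def render_map(bots, variable="$block_ai_crawler"):
    """Render an nginx map block that sets `variable` to 1 for listed user agents."""
    lines = [f"map $http_user_agent {variable} {{", "    default 0;"]
    for name in bots:
        if not SAFE_NAME.fullmatch(name):
            raise ValueError(f"unsupported bot name: {name!r}")
        # ~* makes nginx treat the entry as a case-insensitive regex
        lines.append(f"    ~*{name} 1;")
    lines.append("}")
    return "\n".join(lines)


print(render_map(["GPTBot", "CCBot", "Claude-Web"]))
```

The output can be written to a file that is included from the http context, then checked with the usual configuration test before reloading.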
3. Rate Limiting (You should know this even if you are not going to block AI.)
Add rate limiting to prevent crawler overload. For background, see https://blog.nginx.org/blog/rate-limiting-nginx
http {
# 1. Define the zones
limit_req_zone $binary_remote_addr zone=general:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=crawler:10m rate=2r/s;
server {
listen 80;
server_name example.com;
# 2. Apply general limit to the whole site
location / {
limit_req zone=general burst=20; # Queue up to 20 excess requests, reject the rest with 503
try_files $uri $uri/ /index.php?$args;
}
# 3. Apply a stricter limit to the expensive /search/ endpoint
location /search/ {
limit_req zone=crawler burst=5 nodelay;
proxy_pass http://backend;
}
}
}
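The interaction between rate and burst is easier to see in a small simulation. This Python sketch is a simplified model of the leaky-bucket accounting behind limit_req, written for illustration: it only decides accept or reject, and it ignores the queuing delay that applies to accepted requests when nodelay is not set.

```python
def simulate_limit_req(arrivals_ms, rate_per_s, burst):
    """Return one boolean per request: True if accepted, False if rejected.

    'excess' drains at rate_per_s and grows by one for each accepted request.
    A request is rejected when the excess would go above burst.
    """
    excess = 0.0
    last = None
    results = []
    for t in arrivals_ms:
        if last is None:
            new_excess = 0.0  # first request from this client
        else:
            drained = rate_per_s * (t - last) / 1000.0
            new_excess = max(0.0, excess - drained + 1.0)
        if new_excess > burst:
            results.append(False)  # rejected, state is left unchanged
        else:
            results.append(True)
            excess, last = new_excess, t
    return results


# 30 requests arriving at the same instant against rate=10r/s burst=20
accepted = simulate_limit_req([0] * 30, rate_per_s=10, burst=20)
print(accepted.count(True), accepted.count(False))  # 21 9
```

In this model a client that spaces requests at least 100 ms apart is never rejected at 10r/s, while a sudden flood gets one request through plus the burst allowance.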
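4. Geo-IP Based Blocking
Block requests from specific IP ranges. The geo module shown here matches client addresses against CIDR ranges that you list yourself; blocking by country additionally needs a GeoIP database module such as ngx_http_geoip_module. The two ranges below are reserved documentation addresses and must be replaced with real ones. The geo block belongs in the http context (for example in /etc/nginx/geo/ai_crawlers.geo, included from http); the server block then checks the variable: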
geo $block_geo {
default 0;
192.0.2.0/24 1;
198.51.100.0/24 1;
}
server {
listen 80;
server_name example.com;
if ($block_geo) {
return 403;
}
location / {
proxy_pass http://backend;
}
}
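Before adding a range to the geo block, it can help to confirm which addresses it actually covers. This is a small Python sketch using the standard ipaddress module; the ranges are the same placeholder networks as in the config, and `is_blocked_ip` is a name chosen for this example.

```python
import ipaddress

# Same placeholder ranges as the geo block; replace them with real ones.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked_ip(addr: str) -> bool:
    """Return True if addr falls inside any blocked CIDR range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_NETWORKS)


print(is_blocked_ip("192.0.2.77"))   # True
print(is_blocked_ip("203.0.113.5"))  # False
```

Running suspicious addresses from the access log through a check like this shows whether a planned range is too narrow or too broad before real visitors are affected.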
5. Complete Nginx Protection Block
Create /etc/nginx/conf.d/ai-crawler-protection.conf:
map $http_user_agent $block_crawlers {
default 0;
~*gptbot 1;
~*ccbot 1;
~*claude-web 1;
~*cohere-ai 1;
~*anthropic-ai 1;
~*aiindexer 1;
~*petalbot 1;
~*bytespider 1;
~*ahrefsbot 1;
~*semrushbot 1;
~*mj12bot 1;
~*dotbot 1;
# Search engine crawlers: blocking these can remove the site from DuckDuckGo and Yandex results
~*duckduckbot 1;
~*yandexbot 1;
}
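# Rate limiting zones: 10m is 10 MB of shared memory for per-IP state, not a request count.
# 10r/m and 1r/m are very strict and will also throttle human visitors; tune them for your traffic.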
limit_req_zone $binary_remote_addr zone=general:10m rate=10r/m;
limit_req_zone $binary_remote_addr zone=strict:10m rate=1r/m;
server {
listen 80;
server_name example.com;
# Block crawlers and apply rate limits on every path ("location = /" would cover only the home page)
location / {
# Check for crawler/bot user agents
if ($block_crawlers) {
return 403;
}
# Rate limiting configuration
limit_req zone=general burst=5 nodelay;
limit_req zone=strict burst=2 nodelay;
# Pass to backend
proxy_pass http://backend;
proxy_set_header Host $host;
}
}
Test the Nginx configuration before reloading:
sudo nginx -t
CAPTCHA
Traditional CAPTCHAs can be bypassed by:
- AI-powered CAPTCHA solving services
- Farms of low-paid humans
- Hybrid human-bot networks
Counter: behavior-based challenges (such as Cloudflare Turnstile) that humans pass invisibly.