Protecting Against AI Load and AI Crawlers¶

AI crawlers and bots consume significant bandwidth and server resources by scraping content to train models. A growing number of them also ignore robots.txt and crawl aggressively. Their attitude can be summed up as: "Why store information from Wikipedia on our servers when we can just crawl Wikipedia thousands of times per second?"

AI crawlers such as:

- GPTBot (OpenAI)
- Anthropic Claude (claude-web)
- Google Bard (various Google bots)
- Various other model training bots

will crawl your site aggressively and consume bandwidth and resources. YES! Even if you have a robots.txt file, many AI crawlers will ignore it and crawl your site anyway. 💩
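To see why robots.txt offers no real protection, it helps to look at where the check runs. The sketch below uses Python's standard urllib.robotparser to do what a well-behaved crawler does before fetching a page. The robots.txt content, the is_allowed helper, and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks GPTBot to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "Mozilla/5.0"):
    verdict = "allowed" if is_allowed(agent, "https://example.com/articles/1") else "disallowed"
    print(f"{agent}: {verdict}")
```

The check lives entirely in the crawler's own code. A crawler that never runs it is not stopped by anything on your server, which is why the server-side methods in the rest of this page exist.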

Apache Methods for Blocking AI Crawlers¶

1. Block Specific User Agents (.htaccess)¶

On paper this is the most straightforward method, but it is also the least effective: a crawler can bypass it simply by changing its user agent string, and many do.

Create or edit the .htaccess file in your document root:

Apache Configuration File

# Block AI crawlers and bad bots
<IfModule mod_rewrite.c>
    RewriteEngine On

    # Block GPTBot (OpenAI)
    RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
    RewriteRule ^.* - [F,L]

    # Block CCBot (Common Crawl)
    RewriteCond %{HTTP_USER_AGENT} CCBot [NC]
    RewriteRule ^.* - [F,L]

    # Block Claude Web crawlers
    RewriteCond %{HTTP_USER_AGENT} Claude-Web [NC]
    RewriteRule ^.* - [F,L]

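    # Block Google Bard
    # (Google-Extended, Google's AI opt-out, is a robots.txt token only and
    # never appears in the User-Agent header, so it cannot be matched here.)
    RewriteCond %{HTTP_USER_AGENT} Google-Inspections [NC]
    RewriteRule ^.* - [F,L]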

    # Block Cohere
    RewriteCond %{HTTP_USER_AGENT} cohere-ai [NC]
    RewriteRule ^.* - [F,L]

    # Block SEO crawlers (these are not AI/LLM bots)
    RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|SemrushBot|MJ12bot|DotBot) [NC]
    RewriteRule ^.* - [F,L]
</IfModule>
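The rules above repeat the same RewriteCond/RewriteRule pair for every crawler, which gets tedious as the list grows. As a minimal sketch, the Python helper below generates one combined rule from a list of tokens. The function name build_htaccess_block and the token validation are assumptions for illustration, not part of Apache.

```python
import re

# Tokens taken from the rules above; extend the list as new crawlers appear.
AI_CRAWLER_TOKENS = ["GPTBot", "CCBot", "Claude-Web", "cohere-ai"]

# Only allow characters that are safe to drop into a regex alternation unescaped.
_SAFE_TOKEN = re.compile(r"[A-Za-z0-9_-]+")


def build_htaccess_block(tokens):
    """Return an .htaccess snippet that answers 403 to any listed user agent."""
    for token in tokens:
        if not _SAFE_TOKEN.fullmatch(token):
            raise ValueError(f"unsafe user agent token: {token!r}")
    pattern = "|".join(tokens)
    return "\n".join([
        "<IfModule mod_rewrite.c>",
        "    RewriteEngine On",
        f"    RewriteCond %{{HTTP_USER_AGENT}} ({pattern}) [NC]",
        "    RewriteRule ^.* - [F,L]",
        "</IfModule>",
    ])


print(build_htaccess_block(AI_CRAWLER_TOKENS))
```

Keeping the tokens in one list means a new crawler costs one entry instead of three more lines of configuration.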

2. Block via Apache Configuration (Virtual Host)¶

Similar to the .htaccess method, but implemented at the server configuration level. This is more efficient because Apache parses the rules once at startup instead of re-reading .htaccess on every request, but it can still be bypassed by changing the user agent.

Add to your VirtualHost configuration:

Apache Configuration File

<VirtualHost *:80>
    ServerName example.com

    # Block AI crawlers
    <IfModule mod_rewrite.c>
        RewriteEngine On
        RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Claude-Web|cohere-ai) [NC]
        RewriteRule ^.* - [F,L]
    </IfModule>

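    # Alternative: use a Require expression instead of mod_rewrite.
    # A negated "Require not ..." cannot authorize on its own, so the
    # condition is written as a positive "does not match" test.
    <LocationMatch "^/">
        Require expr "%{HTTP_USER_AGENT} !~ /GPTBot|CCBot/i"
    </LocationMatch>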

    DocumentRoot /var/www/html
</VirtualHost>

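3. Rate Limiting with mod_ratelimit¶

Note that this also slows down real people. mod_ratelimit throttles the bandwidth of each HTTP response; the limit is not aggregated per IP address, so a crawler that opens many connections gets the full rate on each one.

Enable the module:

Bash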

a2enmod ratelimit
systemctl restart apache2

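Apache Configuration File

<VirtualHost *:443>
    ServerName my-server.com
    <Location "/downloads-to-customer">
        SetOutputFilter RATE_LIMIT
        # Apache has no inline comments, so each note sits on its own line.
        # rate-initial-burst is in KiB: 20480 = 20 MiB sent before throttling
        # starts (available in httpd 2.4.24 and later).
        SetEnv rate-initial-burst 20480
        # rate-limit is in KiB/s: 10240 = 10 MiB/s once the burst is used up.
        SetEnv rate-limit 10240
    </Location>
</VirtualHost>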
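To get a feel for what these two numbers mean for a download, the arithmetic can be sketched in a few lines of Python. The function name throttled_seconds is hypothetical, and the sketch assumes the initial burst goes out at full line speed and takes negligible time.

```python
def throttled_seconds(size_kib: float, rate_kib_s: float, burst_kib: float = 0.0) -> float:
    """Seconds one response spends in the throttled phase.

    Assumption: the first burst_kib are sent unthrottled, so only the
    remainder is paced at rate_kib_s.
    """
    if rate_kib_s <= 0:
        raise ValueError("rate must be positive")
    remaining = max(0.0, size_kib - burst_kib)
    return remaining / rate_kib_s


one_gib = 1024 * 1024  # 1 GiB expressed in KiB
print(throttled_seconds(one_gib, rate_kib_s=10240, burst_kib=20480))
```

With the values from the configuration above, a 1 GiB download is paced for roughly 100 seconds, while anything smaller than the 20 MiB burst is not throttled at all.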

Nginx Methods for Blocking AI Crawlers¶

1. Block Specific User Agents¶

As with the Apache equivalent, this is the simplest method and the easiest to bypass: a crawler only has to change its user agent string.

Add to your Nginx configuration (/etc/nginx/sites-available/default):

Nginx Configuration File

server {
    listen 80;
    server_name example.com;

    if ($http_user_agent ~* (GPTBot|CCBot|Claude-Web|cohere-ai|anthropic-ai|AhrefsBot|SemrushBot|MJ12bot|DotBot)) {
        return 403;
    }

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
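How easily this rule is bypassed can be shown offline. The sketch below applies the same alternation in Python, where re.IGNORECASE plays the role of Nginx's ~* operator. The helper name would_be_blocked and both sample user agent strings are made up for illustration.

```python
import re

# Same alternation as the if block above.
BLOCKED = re.compile(
    r"(GPTBot|CCBot|Claude-Web|cohere-ai|anthropic-ai|AhrefsBot|SemrushBot|MJ12bot|DotBot)",
    re.IGNORECASE,
)


def would_be_blocked(user_agent: str) -> bool:
    """Return True if the rule above would answer 403 for this user agent."""
    return BLOCKED.search(user_agent) is not None


# Hypothetical user agent strings: one honest, one pretending to be a browser.
samples = {
    "honest crawler": "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "same crawler, spoofed": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
}
for label, ua in samples.items():
    print(f"{label}: {'403' if would_be_blocked(ua) else 'passes'}")
```

The spoofed request is indistinguishable from a browser at this layer, which is why user agent rules are best combined with the rate limiting described further down.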

2. Using a Map Block¶

Define the map in the http context, for example in /etc/nginx/nginx.conf. Nginx does not allow map inside a server block:

Nginx Configuration File

http {

    # AI crawler blocking map (must be here, not inside server)
    map $http_user_agent $block_ai_crawler {
        default 0;

        # AI / LLM bots
        ~*GPTBot 1;
        ~*CCBot 1;
        ~*Claude-Web 1;
        ~*cohere-ai 1;
        ~*anthropic-ai 1;
        ~*aiindexer 1;
        ~*petalbot 1;

        # SEO bots
        ~*AhrefsBot 1;
        ~*SemrushBot 1;
        ~*MJ12bot 1;
        ~*DotBot 1;

        # Search engine bots (blocking these stops Yandex and Bing from indexing the site)
        ~*YandexBot 1;
        ~*bingbot 1;
    }

    server {
        listen 80;
        server_name example.com;

        if ($block_ai_crawler) {
            return 403;
        }

        location / {
            proxy_pass http://backend;
        }
    }
}

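To reuse the check across several server blocks, put only the if ($block_ai_crawler) { return 403; } block in /etc/nginx/snippets/ai-crawler-block.conf, keep the map in the http context, and include the snippet:

Nginx Configuration File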

server {
    listen 80;
    server_name example.com;

    include /etc/nginx/snippets/ai-crawler-block.conf;

    location / {
        proxy_pass http://backend;
    }
}

3. Rate Limiting (worth knowing even if you do not plan to block AI)¶

Add rate limiting to prevent crawler overload. See https://blog.nginx.org/blog/rate-limiting-nginx for background.

Nginx Configuration File

http {
    # 1. Define the zones
    limit_req_zone $binary_remote_addr zone=general:10m rate=10r/s;
    limit_req_zone $binary_remote_addr zone=crawler:10m rate=2r/s;

    server {
        listen 80;
        server_name example.com;

        # 2. Apply general limit to the whole site
        location / {
            limit_req zone=general burst=20; # Queue up to 20 excess requests; anything beyond that gets a 503
            try_files $uri $uri/ /index.php?$args;
        }

        # 3. Apply a stricter limit to the /search/ endpoint
        location /search/ {
            limit_req zone=crawler burst=5 nodelay;
            proxy_pass http://backend;
        }
    }
}
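The interaction between rate and burst is easier to reason about with a small simulation. The Python sketch below is a simplified model of the leaky-bucket accounting for a single client under nodelay; it is not Nginx's actual implementation, and the class and function names are hypothetical. Without nodelay, as in the location / block, excess requests are queued and released at the configured rate instead of being passed through immediately.

```python
class RequestLimiter:
    """Simplified model of one limit_req key (one client IP) with nodelay."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.burst = burst
        self.excess = 0.0      # requests currently above the steady rate
        self.last_seen = None  # time of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last_seen is None:  # first request from this client
            self.last_seen = now
            return True
        # The bucket drains at the configured rate; each request adds one unit.
        leaked = self.rate * (now - self.last_seen)
        excess = max(0.0, self.excess - leaked + 1.0)
        if excess > self.burst:
            return False            # Nginx would answer 503 here
        self.excess, self.last_seen = excess, now
        return True


def count_accepted(limiter, timestamps):
    """Feed request arrival times (in seconds) to the limiter and count passes."""
    return sum(limiter.allow(t) for t in timestamps)


# Mirrors zone=crawler (2 r/s) with burst=5: ten requests arriving at once.
crawler_zone = RequestLimiter(rate_per_second=2, burst=5)
print(count_accepted(crawler_zone, [0.0] * 10), "of 10 simultaneous requests accepted")
```

In this model a client can spend the whole burst at once, after which it is held to the steady rate until the bucket drains. A larger burst therefore forgives short spikes without raising the sustained throughput a crawler can reach.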

4. Geo-IP Based Blocking¶

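Despite the name, the geo module used below does not look up countries. It sets a variable from the client IP address using the CIDR ranges you list, so you can block known crawler networks. The ranges shown are documentation placeholders; replace them with the networks you actually want to block. Put the geo block in the http context, for example in /etc/nginx/geo/ai_crawlers.geo included from nginx.conf: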

Nginx Configuration File

geo $block_geo {
    default 0;
    192.0.2.0/24 1;
    198.51.100.0/24 1;
}

server {
    listen 80;
    server_name example.com;

    if ($block_geo) {
        return 403;
    }

    location / {
        proxy_pass http://backend;
    }
}
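Before deploying a list of ranges, it is worth checking that a given address falls where you expect. The sketch below does the same CIDR membership test with Python's standard ipaddress module. The is_blocked helper is hypothetical, and the ranges are the placeholder networks from the geo block above.

```python
import ipaddress

# Same ranges as the geo block above (RFC 5737 documentation ranges, i.e. placeholders).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


for ip in ("192.0.2.17", "203.0.113.5"):
    print(ip, "blocked" if is_blocked(ip) else "allowed")
```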

5. Complete Nginx Protection Block¶

Create /etc/nginx/conf.d/ai-crawler-protection.conf:

Nginx Configuration File

map $http_user_agent $block_crawlers {
    default         0;
    ~*gptbot        1;
    ~*ccbot         1;
    ~*claude-web    1;
    ~*cohere-ai     1;
    ~*anthropic-ai  1;
    ~*aiindexer     1;
    ~*petalbot      1;
    ~*bytespider    1;
    ~*ahrefsbot     1;
    ~*semrushbot    1;
    ~*mj12bot       1;
    ~*dotbot        1;
    ~*duckduckbot   1;
    ~*yandexbot     1;
}

# Rate limiting zones (10m = 10 MB of shared memory for per-IP state; rates are in requests per minute)
limit_req_zone $binary_remote_addr zone=general:10m rate=10r/m;
limit_req_zone $binary_remote_addr zone=strict:10m rate=1r/m;

server {
    listen 80;
    server_name example.com;

    # Block crawlers before processing other requests
    if ($block_crawlers) {
        return 403;
    }

    # "location /" covers every path; "location = /" would protect only the homepage
    location / {
        # Both limits apply to every request, so the stricter zone (1 r/m)
        # effectively governs. Loosen it before exposing real visitors to it.
        limit_req zone=general burst=5 nodelay;
        limit_req zone=strict burst=2 nodelay;

        # Pass to backend
        proxy_pass http://backend;
        proxy_set_header Host $host;
    }
}

Test the Nginx configuration, then reload:

Bash

sudo nginx -t
sudo systemctl reload nginx
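Once the rules are live, the access log shows whether they catch anything. As a rough sketch, the Python snippet below counts requests per user agent in lines of the standard combined log format, optionally filtered by status code. The function name and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Matches the tail of a combined-format line: "request" status bytes "referer" "user agent"
LOG_TAIL = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')


def count_by_user_agent(lines, status=None):
    """Count log lines per user agent, optionally keeping only one status code."""
    counts = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if match and (status is None or match["status"] == str(status)):
            counts[match["ua"]] += 1
    return counts


# Made-up sample lines; in practice, pass an open access log file instead.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 403 153 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [10/Oct/2024:13:55:37 +0000] "GET /a HTTP/1.1" 403 153 "-" "GPTBot/1.0"',
    '198.51.100.7 - - [10/Oct/2024:13:55:38 +0000] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0"',
]
for ua, hits in count_by_user_agent(SAMPLE, status=403).most_common(10):
    print(f"{hits:6d}  {ua}")
```

Running it without the status filter is also useful: user agents with very high request counts that are not on your block list yet are candidates for the map.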

CAPTCHA¶

Traditional CAPTCHAs no longer stop determined crawlers, which get past them using:

- AI-powered CAPTCHA solving services
- Farms of low-paid humans
- Hybrid human-bot networks

Counter: behavior-based challenges (such as Cloudflare Turnstile) that humans pass invisibly.