I recently faced dealing with some badbots and scrapers,  it was in a LAMP stack with varnish at the edge.  I decided to deal with it in varnish, as I always try handle as many tasks at the edge as I can, and leave apache to serve php.

So, I thought about the problem a bit, and decided to use a token bucket, nothing unusual about that. (I had to modify the source to allow passing values instead of defaulting to 1 token).  However I went a bit further and decided that different pages are 'worth' more than others, i.e. they are more sensitive.  For example, accessing the homepage vs accessing account pages.  This required a patch of the throttle mod to allow you to pass the 'cost' of a page, so more than 1 token is removed from the bucket.  For now it just logs, but I intend to send a user that is exceeding the request rate to a different backend server that will give them fake data to devalue their scraping.

you could detect user agent strings or other patterns and use as a multiplier , so bad user agent will multiply the tokens to be removed by say 5.  you could do the same with cookies too.

sample config below

vcl 4.0;
import var;
import vsthrottle;
import std;


# Default backend definition. Set this to point to your content server.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {

        # set weights on pages using regex patterns

var.set_int("sensitivity", 1);
       
        if (req.url ~ "^/browse/?") {
                var.set_int("sensitivity", 10);
        } elsif (req.url ~ "^/stats/?") {
                var.set_int("sensitivity", 20);
        } elsif (req.url ~ "^/account/?") {
                var.set_int("sensitivity", 30);
        }

        # now, lets see if they have enough credit in their token bucket to ask for this page
        # token bucket is set to 150 tokens, and is measured for 10 seconds

        if (vsthrottle.is_denied(client.identity, var.get_int("sensitivity") , 150, 10s)) {

          # Client has exceeded credit limit, lets do things like;
# set req.backend = fakedataserver;
# maybe set a http header into the get request to add to apache logs ?
std.syslog(180, "RECV: " + req.http.host + req.url+ client.identity);
return (synth(429, "Too Many Requests"));
        }

}

sub vcl_backend_response {
}

sub vcl_deliver {
}

Comments

Popular posts from this blog

Baileys liquor Chocolate Chip and Cream desert

nginx decode base64 url for use with imgproxy

Resol VBUS success