I recently had to deal with some bad bots and scrapers on a LAMP stack with Varnish at the edge. I decided to handle it in Varnish, since I always try to handle as many tasks at the edge as I can and leave Apache to serve PHP.
After thinking about the problem a bit, I decided to use a token bucket; nothing unusual about that. However, I went a bit further and decided that some pages are 'worth' more than others, i.e. they are more sensitive: the homepage versus account pages, for example. This required patching the throttle vmod (vsthrottle) so that you can pass the 'cost' of a page instead of it defaulting to 1 token, so more than 1 token can be removed from the bucket per request. For now it just logs the request and returns a 429, but I intend to send a user who exceeds the request rate to a different backend server that will give them fake data to devalue their scraping.
You could also detect user agent strings or other patterns and use them as a multiplier, so a bad user agent multiplies the tokens to be removed by, say, 5. You could do the same with cookies too.
Sample config below:
vcl 4.0;

import var;
import vsthrottle;
import std;

# Default backend definition. Set this to point to your content server.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    # Set the weight (token cost) of each page using regex patterns.
    # Anything not matched below costs 1 token.
    var.set_int("sensitivity", 1);
    if (req.url ~ "^/browse/?") {
        var.set_int("sensitivity", 10);
    } elsif (req.url ~ "^/stats/?") {
        var.set_int("sensitivity", 20);
    } elsif (req.url ~ "^/account/?") {
        var.set_int("sensitivity", 30);
    }

    # Now let's see if the client has enough credit in its token bucket to ask for this page.
    # The bucket holds 150 tokens per 10-second period.
    # Note: the second argument (the cost) only exists in the patched vsthrottle described above.
    if (vsthrottle.is_denied(client.identity, var.get_int("sensitivity"), 150, 10s)) {
        # Client has exceeded its credit limit; let's do things like:
        # set req.backend_hint = fakedataserver;
        # Maybe set an HTTP header on the GET request to add to the Apache logs?
        # Priority 180 = facility local6 (176) + severity warning (4).
        std.syslog(180, "RECV: " + req.http.host + req.url + " " + client.identity);
        return (synth(429, "Too Many Requests"));
    }
}

sub vcl_backend_response {
}

sub vcl_deliver {
}
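The user agent multiplier idea mentioned above could look something like the fragment below. This is only a rough sketch, not something I have running: the list of user agent patterns is made up for illustration, the factor of 5 is just the example figure from above, and it assumes your Varnish version accepts integer multiplication in VCL. It is meant to sit inside vcl_recv, after the per-URL weights are set and before the is_denied check.

```vcl
# Fragment for vcl_recv: place after the per-URL weights, before vsthrottle.is_denied().
# The patterns below are hypothetical examples, not a vetted bad-bot list.
if (!req.http.User-Agent ||
    req.http.User-Agent ~ "(?i)(curl|wget|python-requests|scrapy)") {
    # A missing or suspicious user agent makes every page cost 5 times as many tokens.
    var.set_int("sensitivity", var.get_int("sensitivity") * 5);
}
```

With this in place, a matching client asking for an account page would be charged 150 tokens instead of 30, so a single request would use up its whole bucket for the period. The same pattern would work for cookies by testing req.http.Cookie instead.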