Posts

Showing posts from August, 2016
I recently faced dealing with some badbots and scrapers,  it was in a LAMP stack with varnish at the edge.  I decided to deal with it in varnish, as I always try handle as many tasks at the edge as I can, and leave apache to serve php. So, I thought about the problem a bit, and decided to use a token bucket, nothing unusual about that. (I had to modify the source to allow passing values instead of defaulting to 1 token).  However I went a bit further and decided that different pages are 'worth' more than others, i.e. they are more sensitive.  For example, accessing the homepage vs accessing account pages.  This required a patch of the throttle mod to allow you to pass the 'cost' of a page, so more than 1 token is removed from the bucket.  For now it just logs, but I intend to send a user that is exceeding the request rate to a different backend server that will give them fake data to devalue their scraping. you could detect user agent strings or other patterns and