I recently had to deal with some bad bots and scrapers on a LAMP stack with Varnish at the edge. I decided to handle it in Varnish, as I always try to handle as many tasks at the edge as I can and leave Apache to serve PHP. After thinking about the problem a bit, I settled on a token bucket, nothing unusual about that. However, I went a bit further and decided that some pages are 'worth' more than others, i.e. they are more sensitive: compare accessing the homepage with accessing account pages. This required patching the throttle mod so that you can pass the 'cost' of a page instead of defaulting to 1 token, so that more than 1 token can be removed from the bucket per request. For now it just logs, but I intend to send a user who exceeds the request rate to a different backend server that will give them fake data to devalue their scraping. You could detect user agent strings or other ...
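To make the weighted token bucket idea concrete, here is a rough Python sketch. It is not the patched throttle mod (that lives in C inside Varnish); the `TokenBucket` class, the `PAGE_COSTS` table, the paths, and the capacity and refill numbers are all made up for illustration.

```python
import time


class TokenBucket:
    """Token bucket where one request may remove more than one token."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = float(capacity)
        self.refill_rate = float(refill_rate)  # tokens added per second
        self.tokens = float(capacity)          # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False  # over the rate: log it, or route to another backend


# Hypothetical per-path costs: sensitive pages drain the bucket faster.
PAGE_COSTS = {"/": 1, "/account": 5}


def cost_for(path):
    """Return the cost of the longest matching path prefix (default 1)."""
    best = ""
    for prefix in PAGE_COSTS:
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return PAGE_COSTS.get(best, 1)


if __name__ == "__main__":
    bucket = TokenBucket(capacity=10, refill_rate=1)  # one bucket per client
    for path in ["/", "/account/settings", "/account/orders", "/"]:
        print(path, bucket.allow(cost_for(path)))
```

In a real setup there would be one bucket per client (keyed by IP address, for example), and a `False` result is where the logging, or later the switch to the fake-data backend, would happen.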
Showing posts from 2016