From ASP.NET 3.5, the in-memory application cache could be used anywhere and anytime in your web process, but it could not be used in WPF or desktop application development. Since .NET 4.0 this has changed with the inclusion of System.Runtime.Caching.
Make sure to add a reference to the System.Runtime.Caching assembly if you're using VS2010 (as I do here).
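To illustrate the caching side, here is a minimal sketch of the MemoryCache API from System.Runtime.Caching (the class and member names are from the framework; the "BotList" key, the sample keywords, and the one-hour expiration are illustrative assumptions):

```csharp
using System;
using System.Runtime.Caching;

class CacheDemo
{
    static void Main()
    {
        // MemoryCache.Default is the process-wide cache instance
        ObjectCache cache = MemoryCache.Default;

        string[] bots = cache["BotList"] as string[];
        if (bots == null)
        {
            // Not cached yet: load the list and keep it around (sliding expiration)
            bots = new[] { "crawler", "spider", "bot" };
            var policy = new CacheItemPolicy
            {
                SlidingExpiration = TimeSpan.FromHours(1)
            };
            cache.Set("BotList", bots, policy);
        }

        Console.WriteLine(bots.Length); // 3 on first run
    }
}
```

Unlike the old System.Web.Caching.Cache, this works identically in web, WPF, and console processes, which is what makes the shared class-library approach below possible.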
How it works
BotChecker is set up as a class library (not complete in this example). It can be implemented in any application: websites (Web Forms and WCF), WPF, as well as desktop applications. The purpose here is to check whether we already have a cached list of bots to compare against; if not, we load it, and otherwise we use the cached string array as the comparator for our inbound user agent identification strings.
Using an editable XML file as the source, LINQ is used to extract the keywords that are checked against the incoming UA strings of website requests.
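The extraction step can be sketched with LINQ to XML. The botlist.xml layout below (a root element containing one entry per keyword) is an assumption, since the actual schema isn't shown in this excerpt:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class BotListLoader
{
    // Assumed layout: <bots><bot>googlebot</bot><bot>crawler</bot></bots>
    public static string[] LoadKeywords(string path)
    {
        XDocument doc = XDocument.Load(path);
        return doc.Descendants("bot")
                  .Select(e => e.Value.Trim().ToLowerInvariant())
                  .Where(s => s.Length > 0)
                  .ToArray();
    }

    static void Main()
    {
        // Build a sample file so the snippet is self-contained
        XDocument sample = new XDocument(
            new XElement("bots",
                new XElement("bot", "googlebot"),
                new XElement("bot", "crawler")));
        sample.Save("botlist.xml");

        string[] keywords = LoadKeywords("botlist.xml");
        Console.WriteLine(string.Join(",", keywords)); // googlebot,crawler
    }
}
```

The resulting string array is what would be stored in the cache, so the XML file is only touched when the cache entry is missing or expired.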
The main method is DoesBotExist, which takes the user agent string from the incoming HTTP request as its argument.
False = No part of the user agent string matches any entry in the cached array of trigger strings.
True = An entry from botlist.xml was found in the incoming UA string of the request. Depending on how diverse and distributed the list is, you could consider a range of responses.
To process and dump these requests fast, just return an HTTP status code and close the connection.
- 401 - Unauthorized (but this can be confusing, suggesting an alternative entry point exists)
- 402 - Payment Required. My personal favorite: if you want to (ab)use my information for your own profit, then payment is required.
- 410 - Gone. Maybe they will believe it and not return, though in my experience that is rarely the case.
- 418 - "I'm a Teapot". I doubt this will have much effect, as most low-to-no-cost hosters, wannabe cloud operators, and their websites are just that, and bound to appear on your radar as zombies in the near future with far worse intent in mind.
- Alternative - Redirect (301) to an 'unwelcome' page indicating to the client that you do not appreciate them connecting and extracting information from your resource.
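Dumping a request this way comes down to setting the status code and ending the response early. A sketch for Global.asax follows (the 402 choice reflects the list above; `BC` stands for a BotChecker instance created elsewhere, and this is an ASP.NET pipeline fragment rather than a standalone program):

```csharp
// In Global.asax; assumes a BotChecker field named BC is available
protected void Application_BeginRequest(object sender, EventArgs e)
{
    if (BC.DoesBotExist(Request.UserAgent))
    {
        Response.StatusCode = 402;        // Payment Required
        Response.SuppressContent = true;  // send headers only, no body
        // Skip the rest of the pipeline and flush the response
        HttpContext.Current.ApplicationInstance.CompleteRequest();
    }
}
```

CompleteRequest is used instead of Response.End here to avoid the ThreadAbortException that Response.End raises internally.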
Make a class in your project or website and insert the BotChecker code. Then, in your operational environment (Global.asax for ASP.NET websites), insert the following to activate the fast cached bot lookup method.
// At any place, like Application_BeginRequest in the Global.asax file for web sites, or InitializeComponent for applications
BotChecker BC = new BotChecker();
To test the user agent string against your list (for websites):
// Redirect or set header and close connection
The above snippet in Global.asax allows you to intercept and deal with requests before they hit your site. This is an inline process.
If all you want to do is log their activity, a better solution is to use a separate thread so the request is not delayed while you do the check.
Implement the routine as a 'fire and forget' bot logger.
Don't waste time waiting for the logger to check and potentially log bot access; just call the routine with the complete user agent string and pass control back to the web application. It will run in the background and do its work.
// Include this namespace
// In Global.asax 'Application_BeginRequest'
Thread MyThread = default(Thread);
BotChecker BC = new BotChecker();
MyThread = new Thread(new ParameterizedThreadStart(BC.DoesBotExist));
object MyParameters = Request.UserAgent;
// Start the background thread and return control to the request immediately
MyThread.Start(MyParameters);
// In the DoesBotExist routine, write the result (if any) to a log file for later analysis.
// You can pass in more parameters, like IP address for registration purposes.
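A sketch of what a ParameterizedThreadStart-compatible logging routine could look like (the keyword list, log file name, and log format are assumptions for illustration; the real BotChecker loads its keywords from botlist.xml via the cache):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading;

public class BotChecker
{
    // Illustrative keyword list; the real class caches these from botlist.xml
    private readonly string[] _keywords = { "crawler", "spider", "bot" };

    // Signature must be void(object) to match ParameterizedThreadStart
    public void DoesBotExist(object userAgent)
    {
        string ua = (userAgent as string) ?? string.Empty;
        string hit = _keywords.FirstOrDefault(
            k => ua.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0);
        if (hit != null)
        {
            // Fire-and-forget: append to a log file for later analysis
            File.AppendAllText("botlog.txt",
                DateTime.UtcNow + "\t" + hit + "\t" + ua + Environment.NewLine);
        }
    }
}

class Demo
{
    static void Main()
    {
        var bc = new BotChecker();
        var t = new Thread(new ParameterizedThreadStart(bc.DoesBotExist));
        t.Start("Mozilla/5.0 (compatible; ExampleCrawler/1.0)");
        t.Join(); // joined here only so the demo can inspect the log afterwards
        Console.WriteLine(File.ReadAllText("botlog.txt").Contains("crawler")); // True
    }
}
```

In Global.asax you would of course not Join the thread; the whole point is to let it finish on its own while the request continues.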
Fact: Bots have become smart, bots have embraced cloud computing long before mortals knew it even existed. Bots hide in plain sight trying to appear like normal human visitors.
With this additional clutter of noise, and bots hiding behind 'human browser' user agents, black and white becomes gray. Blocking bad bots by their user agent string, although still rather effective at present, will become less so over time as they all try to act and look like anonymous web users.
A better way to track and trace this activity (from a receiver's perspective) is to use behavior-pattern technology. Bots can only go so far in hiding what they are and what they are after, so it is rather easy to detect these deviations from normal real user activity on port 80. In future articles I will go into how to detect these 'wolf bots' in 'anonymous user sheep's clothing'. They are very resourceful in their methods, but all fall short of many things that typically depict us as 'human users online'.
Fully functional VB.net 'SpiderWasp' implementation as an HTTP 402 (Payment Required) feeder for detecting and stopping known unwanted spiders from crawling your web. For PHP and C# versions, contact me.