robots.txt is a simple text file that any website can have. It was created to tell search engines and other automated information-gathering systems which parts of the website they may and may not fetch and follow.
In theory this is a noble idea: a way to give only certain robots access to the website, or to parts of it. In practice the real world operates quite differently!
In reality the robots.txt file is a value-added accessory for the 'hitchhiker's guide to hacking' and little more. So why is robots.txt bad?
- Compliance is voluntary: many robots (and other automated trash) never read robots.txt or simply do not follow the instructions in this file.
- If you do not know the exact name of a robot, you cannot single it out and ask it not to examine and scrape your entire website.
- Recent 'new generation' robots hide behind the User-Agent strings that the browsers of normal Internet users send, or copy the agent string of 'acceptable' search engines like Google (so check those IP numbers).
- You cannot possibly list 'all' the robots and leave instructions for each of them. An allowlist such as 'only allow Google, Bing and Yahoo' can be written (disallow everything under 'User-agent: *' and add a permissive group for each named robot), but it binds only the robots that choose to obey it. A blocklist of names, on the other hand, makes the file huge to download, and you will never really know all the robots until they hit your site. Many even pretend to be 'normal browser users' (they have no robot name at all).
- It supplies additional information to hackers in two ways. First, it can give away the technology the website runs on, which narrows down the attacks worth trying: think of 'aspx' or 'php' file extensions, or folder names specific to a particular (open source) product or platform. Second, it points them to the files and folders where the 'honey' could be found, and this fact is exactly what we will leverage to warn us of potential imminent danger to your website.
- Free, popular open source software exposed to the Internet without adequate knowledge of how to implement it can be your biggest downfall.
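The advice to 'check those IP numbers' can be sketched in Python. A common approach, and the one Google documents for its own crawlers, is a reverse DNS lookup on the visiting IP followed by a forward lookup on the host name that comes back. The function name, the injectable lookup parameters and the suffix list below are assumptions made for illustration, and the lookups shown handle IPv4 only; treat it as a minimal sketch rather than a complete verifier.

```python
import socket

# Host name suffixes assumed to identify Google's crawlers in reverse DNS.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_genuine_crawler(ip, suffixes=GOOGLE_SUFFIXES,
                       reverse_lookup=socket.gethostbyaddr,
                       forward_lookup=socket.gethostbyname_ex):
    """Return True only if `ip` reverse-resolves to a host name with an
    allowed suffix AND that host name resolves forward to the same IP."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS record at all
    if not hostname.lower().endswith(suffixes):
        return False  # claims to be a search engine, but the name disagrees
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup guards against a forged reverse DNS record.
    return ip in addresses
```

Calling is_genuine_crawler(ip) with the default arguments performs real DNS queries, so on a busy site the result is worth caching per IP.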
Catching probing robots with ill intent using a 'red herring'.
1. If you don't already have a simple text file named 'robots.txt' in the root of your website folder, create one using a plain text editor like Notepad, TextPad or whatever. Otherwise start at step 2.
2. Place an instruction in it to ignore and stay away from a specific web file (see example below).
3. Create the forbidden HTML file (in the example it is 'ClientInformation.html') with a server-side routine that gathers all available information on the remote process that just accessed this page, logs it in a text file and/or emails it to you.
User-agent: * # applicable for all robots (or should be).
Disallow: /ClientInformation.html # Do NOT try to load, look at, or in any way access this file.
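The routine from step 3 could look roughly like the following Python sketch, which uses only the standard library. It is a stand-alone toy server, not a drop-in for a production site: the log file name, the one-JSON-object-per-line format and the names build_record, log_record, TrapHandler and run are all assumptions. On a real site the same logging would live in whatever server-side code answers for that path.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PATH = "/ClientInformation.html"  # the file disallowed in robots.txt
LOG_FILE = "trap_hits.log"             # hypothetical log file name


def build_record(client_ip, method, path, headers):
    """Collect what the request itself reveals about the remote client."""
    return {
        "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "ip": client_ip,
        "method": method,
        "path": path,
        "headers": dict(headers),  # User-Agent, Referer, Accept-Language, ...
    }


def log_record(record, log_file=LOG_FILE):
    """Append one JSON object per line so the log is easy to parse later."""
    with open(log_file, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore any query string when matching the trap file.
        if self.path.split("?", 1)[0] != TRAP_PATH:
            self.send_error(404)
            return
        log_record(build_record(self.client_address[0], self.command,
                                self.path, self.headers.items()))
        # Return a bland, empty page so the visitor learns nothing.
        body = b"<html><body></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port=8080):
    """Start the trap server; blocks until interrupted."""
    HTTPServer(("", port), TrapHandler).serve_forever()
```

Calling run() starts the server on the assumed port 8080. Emailing the record, as the step also suggests, could be added inside log_record.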
TIPS: Do not create or use a hyperlink to 'ClientInformation.html' anywhere in your website, and do not mention it in any online documentation, emails, blogs, websites or wikis.
If you also want to avoid Google finding out about it and coming to have a sneak peek, do not test it under the same filename with any browser that has Internet access, else 'big daddy' will come and have a look sooner or later to verify the state of things.
Do this and you have created an early warning system that tells you when your website is being examined by entities that should be blocked and avoided.
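To turn the log into a warning, the hits can be tallied per IP address. The sketch below assumes, purely for illustration, that the trap page writes one JSON object per line with an 'ip' field; the function name summarize_hits is hypothetical as well.

```python
import json
from collections import Counter


def summarize_hits(lines):
    """Count trap hits per IP address, most frequent visitor first."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip damaged lines rather than abort the whole run
        if isinstance(entry, dict) and entry.get("ip"):
            counts[entry["ip"]] += 1
    return counts.most_common()
```

The function accepts any iterable of lines, including an open log file. The addresses at the top of the result are the first candidates for blocking, once you have checked that they are not a genuine search engine crawler.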
Ask yourself this: if you are a local baker, why is your site so popular in China, or South America, or Russia? It's not about your bread. But you may just be listed as an Internet place to avoid on a global scale once your site has been hacked and abused at your expense.