How Can I Check / Validate My robots.txt?
FAQ - Search Engine Optimization (SEO)
Your robots.txt file is a very simple document. It consists of nothing more than a list of URL paths, partial paths and wildcard patterns (using asterisks, *), grouped under a few header lines that specify which robot, or crawler, each group of rules should target.
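To make that structure concrete, here is a minimal sketch in Python. The sample file and its paths (/calendar/, /admin/, /private/) are made up for illustration, and group_rules is a hypothetical helper, not a full parser. It only shows how the User-agent header lines and the Disallow lines beneath them fit together.

```python
# Hypothetical robots.txt content; the paths are illustrative only.
SAMPLE_ROBOTS_TXT = """\
# Rules for every crawler
User-agent: *
Disallow: /calendar/
Disallow: /admin/

# Extra rules for one specific crawler
User-agent: Googlebot
Disallow: /private/*.pdf
"""


def group_rules(text):
    """Group Allow/Disallow lines under the User-agent lines above them.

    Simplified reader for illustration: it ignores every other field
    and does not try to interpret wildcard patterns.
    """
    groups = {}
    current_agents = []
    reading_rules = False
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:
                # A User-agent line after some rules starts a new group.
                current_agents = []
                reading_rules = False
            current_agents.append(value)
            groups.setdefault(value, [])
        elif field in ("allow", "disallow"):
            reading_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


for agent, rules in group_rules(SAMPLE_ROBOTS_TXT).items():
    print(agent, rules)
```

Each key in the result is a crawler name from a User-agent line, and each value is the list of path rules that apply to it.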
Your robots.txt file should exist in the root directory of your website. For example, if your domain is www.yourdomain.com, your robots.txt file needs to be located at www.yourdomain.com/robots.txt for search engines to find it. To see if your robots.txt file is online, simply enter the above URL into your favorite browser - the text file should display in the browser.
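If you would rather script this check than use a browser, something like the following Python sketch could work. The function names robots_url and is_online are our own, the http:// default for bare domains is an assumption, and treating only an HTTP 200 response as "online" is a simplification.

```python
from urllib.error import URLError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(site):
    """Return the root-level robots.txt URL for a domain or page address."""
    if "://" not in site:
        site = "http://" + site  # assumption: default to plain http
    parts = urlsplit(site)
    # Whatever page was given, robots.txt always lives at the site root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_online(site, timeout=10.0):
    """Return True if the site's robots.txt answers with HTTP 200."""
    try:
        with urlopen(robots_url(site), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Covers HTTP errors (such as 404) and network failures alike.
        return False


print(robots_url("www.yourdomain.com"))
```

Calling is_online("www.yourdomain.com") performs a real network request, so run it against your own domain.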
Validating your robots.txt is a little less simple, but we all need to do it. Essentially, you need to ensure that it actually keeps crawlers away from the content on your site that you don't want them to index. One of the best ways to do this is to run your own crawling program that will crawl your site the same way a search engine would. One of our favorite such programs is GSiteCrawler.
Most, if not all, crawling programs like this will respect your robots.txt file, which means you can see exactly which URLs the search engines will attempt to index on your site. Running a crawling program like GSiteCrawler lets you confirm that Googlebot or another search engine crawler will avoid the content you specify.
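For a quick spot check without a full crawl, Python's standard library includes urllib.robotparser, which can answer whether a given crawler may fetch a given URL. The sketch below is not how GSiteCrawler works. The rules, the URLs and the blocked_urls helper are hypothetical. Note that this parser matches paths by simple prefix and, as far as we know, does not interpret asterisk wildcards inside a path, so the result may differ from what a search engine does with wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules and URLs, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /admin/
"""

URLS_TO_CHECK = [
    "http://www.yourdomain.com/",
    "http://www.yourdomain.com/products/widget.html",
    "http://www.yourdomain.com/calendar/2031/05/",
    "http://www.yourdomain.com/admin/login",
]


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the rules tell `user_agent` not to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


for url in blocked_urls(ROBOTS_TXT, URLS_TO_CHECK):
    print("blocked:", url)
```

If a URL you expected to be blocked is missing from the output, or a page you want indexed shows up in it, the rules need another look.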
You might be wondering: "Well, I want them to index ALL of my content, right?" Perhaps not. Consider this: sites that run on Content Management Systems, or that rely heavily on JavaScript or other scripting languages, often include dynamic content that populates your web pages on the fly.
The issue here is that crawlers will generally follow every link on your pages unless you tell them not to. There are many cases in which you won't want them to do this. Consider, for example, a calendar script that displays a schedule of events for your site. Most calendars work by calculating where the dates and days of the week will fall, so users can, conceivably, click infinitely into the past or future. Now imagine this calendar in the hands of a search engine crawler. The crawler doesn't exercise judgment the way a user does. A crawler can end up following your calendar into the infinite future.
Of course, the crawler will stop at some point - it will determine that it has fallen into an infinite loop and cease crawling your site. So what's the problem? An infinite loop can cause a crawler to leave your site before it has reached your important content. If it falls into the loop before it indexes your main content, guess what - your content doesn't get indexed.
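The calendar trap is easy to reproduce in a toy simulation. Everything in the Python sketch below is hypothetical: the site layout in links_on, the page budget of 20, and the depth-first crawl order. Real crawlers schedule their visits in more sophisticated ways, so treat this purely as an illustration of how a crawl can be used up before the real content is reached.

```python
def links_on(path):
    """Hypothetical site: return the links found on the page at `path`."""
    if path == "/":
        return ["/calendar/2024/1", "/about", "/products"]
    if path.startswith("/calendar/"):
        _, _, year, month = path.split("/")
        year, month = int(year), int(month)
        # Every calendar page links to the next month, without end.
        if month < 12:
            return [f"/calendar/{year}/{month + 1}"]
        return [f"/calendar/{year + 1}/1"]
    if path == "/products":
        return ["/products/widget", "/products/gadget"]
    return []


def crawl(start, budget, disallowed_prefixes=()):
    """Toy depth-first crawl that gives up after `budget` pages."""
    stack, seen, visited = [start], {start}, []
    while stack and len(visited) < budget:
        path = stack.pop()
        visited.append(path)
        # Push links in reverse so the first link on a page is followed first.
        for link in reversed(links_on(path)):
            blocked = any(link.startswith(p) for p in disallowed_prefixes)
            if link not in seen and not blocked:
                seen.add(link)
                stack.append(link)
    return visited


trapped = crawl("/", budget=20)
safe = crawl("/", budget=20, disallowed_prefixes=("/calendar/",))
print("without a rule:", trapped[:4], "...")
print("with /calendar/ disallowed:", safe)
```

Without the rule, the whole budget goes to the home page and calendar pages, and /about and /products are never visited. With /calendar/ disallowed, the same budget covers every real page on the toy site.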
The proper use of a robots.txt file is crucial for your site's Search Engine Optimization.