robots.txt is a text file that instructs web robots (typically search engine robots) how to crawl pages on a website.
The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web and how they access and index content.
List of User Agents
- Google – Googlebot
- Bing – Bingbot
- Yahoo – Slurp
- MSN – Msnbot
Basic format
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
Here are a few examples of robots.txt in action for a www.example.com site:
a. Blocking all web crawlers from all content
b. Allowing all web crawlers access to all content
c. Blocking a specific web crawler from a specific folder
d. Blocking a specific web crawler from a specific web page
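In standard robots.txt syntax, those four cases look like this. Each snippet represents the entire contents of a separate robots.txt file; the crawler names are real user agents from the list above, while the folder and page names are placeholders for illustration:

```
# a. Blocking all web crawlers from all content
User-agent: *
Disallow: /

# b. Allowing all web crawlers access to all content
User-agent: *
Disallow:

# c. Blocking a specific web crawler (here, Googlebot) from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/

# d. Blocking a specific web crawler (here, Bingbot) from a specific web page
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
```

Note that an empty Disallow line (case b) blocks nothing, so the wildcard user agent `*` is granted access to everything.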
Other quick robots.txt must-knows
- In order to be found, a robots.txt file must be placed in a website’s top-level directory.
- Robots.txt is case sensitive: the file must be named “robots.txt” (not Robots.txt, robots.TXT, or otherwise).
- The /robots.txt file is publicly available: just add /robots.txt to the end of any root domain to see that website’s directives.
- Each subdomain on a root domain uses its own robots.txt file. This means that blog.example.com and example.com should each have their own robots.txt file.
- It’s generally a best practice to indicate the location of any sitemaps associated with the domain at the bottom of the robots.txt file.
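Because robots.txt is plain text with a simple grammar, its directives can also be checked programmatically. As a small sketch, Python's standard-library urllib.robotparser can parse a set of rules and answer whether a given user agent may fetch a URL (the rules and URLs below are made up for illustration; set_url() and read() would fetch a live /robots.txt instead):

```python
from urllib import robotparser

# Parse an in-memory robots.txt; in practice, rp.set_url(...) followed by
# rp.read() would download a site's live /robots.txt file.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Ask whether the wildcard user agent may crawl each URL.
print(rp.can_fetch("*", "https://www.example.com/private/page.html"))  # False
print(rp.can_fetch("*", "https://www.example.com/public/page.html"))   # True
```

This is how well-behaved crawlers decide, before each request, whether a URL is off-limits.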