Controlling Search Engines and Web Crawlers Using robots.txt
The robots.txt file allows you to control which parts of your site search engines and other web crawlers can access. Place the file in your document root directory (public_html) and use its directives to specify how crawlers should behave.
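After uploading the file, you can confirm that it is being served from the site root. The following sketch uses Python's standard urllib.request module to fetch it; example.com is a placeholder for your own domain.

import urllib.request

# Request the robots.txt file from the site root (replace example.com with your domain).
with urllib.request.urlopen("https://example.com/robots.txt") as response:
    print(response.status)                  # 200 means the file is reachable
    print(response.read().decode("utf-8"))  # the directives you uploaded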
Important Note
Directives in a robots.txt file are requests, not enforceable rules. Most search engines respect them, but some crawlers may ignore them. Do not rely on robots.txt to hide sensitive content.
Common Directives
1. Allow All Crawlers to Access All Files
User-agent: *
Disallow:
- User-agent: * applies to all crawlers.
- Disallow: with no value allows access to all files.
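If you want to check how a crawler that honors robots.txt would interpret this rule, Python's standard urllib.robotparser module can evaluate it locally. This sketch feeds the two lines above to the parser in memory; the user agent names and paths are only examples.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
])

# An empty Disallow value permits every URL for every user agent.
print(rp.can_fetch("Googlebot", "/index.html"))     # True
print(rp.can_fetch("Bingbot", "/scripts/run.cgi"))  # True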
2. Block All Crawlers from Accessing the Site
User-agent: *
Disallow: /
- The / path blocks crawlers from all files on the site.
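The same kind of local check with urllib.robotparser shows that this rule denies every path; the URLs below are illustrative.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# Disallow: / matches every path, so nothing may be crawled.
print(rp.can_fetch("Googlebot", "/"))            # False
print(rp.can_fetch("Googlebot", "/about.html"))  # False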
3. Block Crawlers from a Specific Directory
User-agent: *
Disallow: /scripts/
- Crawlers are prevented from accessing the /scripts/ directory.
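To see the scope of a directory rule, the same urllib.robotparser check can be applied; /scripts/run.cgi and /about.html are example paths.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /scripts/"])

# Only paths beginning with /scripts/ are blocked; everything else stays crawlable.
print(rp.can_fetch("Googlebot", "/scripts/run.cgi"))  # False
print(rp.can_fetch("Googlebot", "/about.html"))       # True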
4. Block Crawlers from a Specific File
User-agent: *
Disallow: /documents/index.html
- Crawlers cannot access the file /documents/index.html.
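A rule that names a single file can be verified the same way; /documents/report.pdf is an example of a path that remains crawlable.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /documents/index.html"])

# The rule blocks that path (and any path that starts with it);
# other files in the same directory remain crawlable.
print(rp.can_fetch("Googlebot", "/documents/index.html"))  # False
print(rp.can_fetch("Googlebot", "/documents/report.pdf"))  # True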
5. Control Crawl Interval
User-agent: *
Crawl-delay: 30
- Crawlers that honor the Crawl-delay directive are instructed to wait at least 30 seconds between requests.
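Python's urllib.robotparser also exposes this value (crawl_delay() requires Python 3.6 or later), so a polite crawler can use it to pace its requests. A brief sketch, again parsing the rules in memory:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Crawl-delay: 30"])

# crawl_delay() returns the delay in seconds for the given user agent,
# or None when no Crawl-delay directive applies to it.
print(rp.crawl_delay("Googlebot"))  # 30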
For more information about the robots.txt file, visit http://www.robotstxt.org.