The robots.txt file (not robot.txt, as it is
commonly misspelled) and the robots meta tag are two related methods
for keeping search engine robots from indexing all or
part of your website.
Perhaps you don't want a curious little
robot poking around your cgi folder. Perhaps you have set
up a temporary directory or a private area. Or perhaps you
have an example page that demonstrates spamming a
page with keywords or hidden text links, and you don't want
that page penalized by the search engines.
These are all reasons to block a search
engine robot from indexing some pages, images, scripts and
other elements of your website. The robots.txt file needs
to go in the root folder of your site (the same folder as the
index.html file of your home page), so that robots can find it
at /robots.txt.
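As a minimal sketch of where a robot looks for the file, the snippet
below derives the robots.txt address from any page address using
Python's standard urllib.parse module. The function name
robots_txt_url and the www.example.com address are illustrative
assumptions, not something a search engine publishes.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url):
    """Return the address of robots.txt for the site that serves page_url."""
    # An absolute path ("/robots.txt") replaces the page's own path and query,
    # so the result always points at the root folder of the same host.
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("http://www.example.com/stuff/wacky.html"))
# prints http://www.example.com/robots.txt
```

A robots.txt file placed in a subfolder, such as /stuff/robots.txt,
would not be found this way.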
Robots.txt File
In order to exclude the search engine robots from all or
parts of your website, many webmasters use a robots.txt
file. This file is plain text (no html) and is built from two
kinds of lines: User-agent lines and Disallow lines.
User-Agent
The user-agent area names the robot or robots that the rules
apply to, or uses a wildcard to address all robots.
Example 1:
User-agent: *
In this example, the wildcard * means all robots.
Example 2:
User-agent: googlebot
In this example, the rules that follow apply only to Google's
robot.
Disallow
The next area of the robots.txt file is the Disallow area.
In this area you can exclude a robot or robots from indexing
your folders, images, html pages, scripts or other files.
Each value is compared against the beginning of the URL path.
Example 1:
Disallow: /cgi-bin
In this example, every path that begins with /cgi-bin is
excluded, which covers the cgi-bin folder and everything in it.
Example 2:
Disallow: /query.html
In this example, only the query.html file in the root folder
is excluded.
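To see how a Disallow value is matched, here is a small sketch
using Python's standard urllib.robotparser module. It relies on the
prefix matching that this module implements; individual search
engines may interpret paths somewhat differently. The helper name
is_blocked, the robot name examplebot and the tested paths are made
up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two Disallow examples from above, addressed to all robots.
RULES = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /query.html
"""


def is_blocked(path, robot="examplebot"):
    """Return True if RULES tell the given robot to stay away from path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return not parser.can_fetch(robot, path)


for path in ("/cgi-bin/search.pl", "/query.html", "/index.html"):
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Under these assumptions the first two paths are reported as blocked
and /index.html as allowed.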
If you would like your entire site not to be indexed by any
of the search engines you would put this in your robots.txt
file:
User-agent: *
Disallow: /
If you want to exclude all of the robots from a certain directory
on your website, your robots.txt file would look like this:
User-agent: *
Disallow: /images/
If you want to exclude all robots from indexing a certain
file in a certain directory, the robots.txt file would look
like this:
User-agent: *
Disallow: /stuff/wacky.html
If you would like to keep a specific search engine robot
from indexing a specific file, the robots.txt file would look
like this:
User-agent: googlebot
Disallow: /stuff/wacky.html
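The four files above can be checked before you upload them. The
sketch below feeds each one to Python's standard urllib.robotparser
module and asks whether a given robot may visit a given path. The
helper name allowed, the robot name examplebot and the paths
/index.html, /images/logo.gif and /stuff/other.html are assumptions
chosen for the demonstration.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules, robot, path):
    """Return True if the robots.txt text in rules lets robot visit path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(robot, path)


whole_site = "User-agent: *\nDisallow: /\n"
one_folder = "User-agent: *\nDisallow: /images/\n"
one_file = "User-agent: *\nDisallow: /stuff/wacky.html\n"
one_robot = "User-agent: googlebot\nDisallow: /stuff/wacky.html\n"

print(allowed(whole_site, "googlebot", "/index.html"))        # False
print(allowed(one_folder, "googlebot", "/images/logo.gif"))   # False
print(allowed(one_folder, "googlebot", "/index.html"))        # True
print(allowed(one_file, "googlebot", "/stuff/other.html"))    # True
print(allowed(one_robot, "googlebot", "/stuff/wacky.html"))   # False
print(allowed(one_robot, "examplebot", "/stuff/wacky.html"))  # True
```

The last two lines show the effect of naming a robot: the rule
binds googlebot, while a robot that is not named and finds no *
group is free to visit the file.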
To see what the world's top search engine, Google, is doing,
let's take a look at the Google robots.txt
file:
User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalog_list
Disallow: /news
Disallow: /pagead/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /wml
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local
Disallow: /froogle?
Disallow: /froogle_
To see what our own White House is doing,
let's take a look at a small part of the whitehouse.gov robots.txt
file (the whole file is too long to display here):
User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /help
Disallow: /360pics/iraq
Disallow: /360pics/text
Disallow: /911/911day/iraq
Disallow: /911/911day/text
Disallow: /911/heroes/iraq
Disallow: /911/heroes/text
Disallow: /911/iraq
Disallow: /911/patriotism/iraq
Disallow: /911/patriotism/text
Disallow: /911/patriotism2/iraq
Disallow: /911/patriotism2/text
Disallow: /911/progress/iraq
Disallow: /911/progress/text
Disallow: /911/remembrance/iraq
Disallow: /911/remembrance/text
Disallow: /911/response/iraq
Disallow: /911/response/text
Disallow: /911/sept112002/iraq
Disallow: /911/sept112002/text
Disallow: /911/text
Disallow: /afac/index.htm/text
Disallow: /afac/iraq
Disallow: /afac/text
Disallow: /agencycontact/iraq
Disallow: /agencycontact/text
Disallow: /appointments/iraq
Disallow: /appointments/text
The # character marks a comment in the robots.txt file.
The rule of thumb is to put each comment on a line of its own,
starting with the #.
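As a rough illustration, the sketch below parses a commented
robots.txt file with Python's standard urllib.robotparser module
and shows that the comment lines do not disturb the rules around
them. The comment texts, the /temp/ folder and the robot name
examplebot are invented for this example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
# Keep all robots out of the scripts folder
User-agent: *
Disallow: /cgi-bin
# Temporary pages, remove this rule after launch
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Both Disallow lines still belong to the same User-agent group,
# even though a comment line sits between them.
print(parser.can_fetch("examplebot", "/cgi-bin/form.pl"))  # False
print(parser.can_fetch("examplebot", "/temp/draft.html"))  # False
print(parser.can_fetch("examplebot", "/index.html"))       # True
```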
Robots Meta Tag
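Another way to keep robots from indexing the html pages
of a website, or from following the links on them, is to use a robots
meta tag. There are four main variations of the tag that one can use
to instruct a robot. A page carries only one of them, placed in
its HEAD section.
Example (the four variations, of which you would pick one):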
<HEAD>
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
</HEAD>
Index - instructs the robot to index the page.
Noindex - instructs the robot not to index the page.
Follow - instructs the robot to follow the links from the
page and index the pages they lead to.
Nofollow - instructs the robot not to follow the links from
the page, so the pages they lead to are not indexed through them.
If there is no robots meta tag on a page, then the default
is "index,follow", which means all robots will index
the page and follow the links to other pages for indexing.
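The following sketch shows one way a robot might read this tag,
using Python's standard html.parser module. It is not how any
particular search engine works; the class name RobotsMetaParser and
the function name robots_directives are hypothetical. A page without
a robots meta tag falls back to the "index,follow" default described
above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from a page's robots meta tag, if it has one."""

    def __init__(self):
        super().__init__()
        self.directives = None  # stays None when no robots meta tag is found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives = [
                word.strip().lower() for word in content.split(",") if word.strip()
            ]


def robots_directives(html):
    """Return (index, follow) as booleans, defaulting to index,follow."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives or []
    return ("noindex" not in directives, "nofollow" not in directives)


page = '<HEAD><meta name="robots" content="noindex,follow"></HEAD>'
print(robots_directives(page))               # (False, True)
print(robots_directives("<HEAD></HEAD>"))    # (True, True)
```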