SEO

Q. How can I control which pages are indexed by search engines?

A. Add a robots.txt file to the root directory of your website. It tells robots which parts of the site they should stay out of, and it also works for robots that ignore the <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> convention.

Control which of your pages are NOT indexed with a robots.txt file

Add a robots.txt file to the root directory of each of your websites to help control indexing by robots that ignore the <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> convention. In this file you list the folders and pages that you DO NOT want walked and indexed (such as password-protected folders and folders that contain only images). The robots.txt file is simple yet powerful, and every website should have one in its root directory.

The Terminology

Create a new file with a plain-text editor such as Notepad and call it robots.txt.

The two statements used in a robots.txt file are User-agent: and Disallow:

User-agent: * By using the * wild card you address ALL robots. To address an individual robot, give it its own User-agent: statement, followed by the Disallow: statements listing the folders and files you DO NOT want that robot to index. Robots are matched by the name they announce for themselves; robots.txt cannot address a robot by IP address.

Tip: Use the * wild card to address all robots. It is the safest way.
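To see how per-robot records behave, here is a minimal sketch using Python's standard urllib.robotparser module. The robot names ExampleBot and OtherBot and the /private/ folder are made up for illustration; real robots publish the names they respond to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: "ExampleBot" and "/private/" are invented for this sketch.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /tutorials/images/
"""


def can_fetch(agent, path):
    """Return True if the named robot may fetch the path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for path in ("/private/page.html", "/tutorials/images/a.gif"):
            print(agent, path, can_fetch(agent, path))
```

In this sketch a robot that has its own User-agent: record follows only that record, so ExampleBot is kept out of /private/ but is not bound by the wild card rules, while OtherBot falls back to the * record.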

Disallow: /folder/ List each folder that you do not want indexed by robots, one Disallow: statement per folder.

Warning: Disallow: / used without any folder name tells robots not to index ANY page of the website.

ALL files and folders in the directory named in a Disallow: statement, as well as all of those under it, will NOT be indexed by robots.
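The difference between Disallow: / and a Disallow: statement with nothing after it is easy to get wrong, so here is a small sketch that checks both with Python's standard urllib.robotparser module. The robot name ExampleBot and the helper name allowed are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /" shuts robots out of the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# "Disallow:" with no path shuts them out of nothing.
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if the (hypothetical) robot may fetch path under the rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    page = "/tutorials/meta_tags.html"
    print("Disallow: /  ->", allowed(BLOCK_ALL, page))
    print("Disallow:    ->", allowed(ALLOW_ALL, page))
```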

Sample of folders in this website that we would not like spiders to index for the search engines:

Disallow: /tutorials/meta/
Disallow: /tutorials/images/
Disallow: /tutorials/assets/
Disallow: /tutorials/404redirect/

Example: Disallow: /tutorials/
Results: All files and subfolders within the tutorials folder, including the folders listed in the sample above and any other subfolders of the tutorials directory, will not be indexed by robots if you use this statement.

This means that /meta, /images, /assets, /404redirect, AND any other folders, as well as all of the files in those folders, will not be seen by indexing robots.
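A Disallow: statement works as a path prefix, which a short sketch can make concrete. It uses Python's standard urllib.robotparser module; the robot name ExampleBot, the helper name is_blocked, and the /tutorials-archive/ path are assumptions added for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /tutorials/
"""


def is_blocked(path, agent="ExampleBot"):
    """Return True if the (hypothetical) robot is kept away from path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return not parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in (
        "/tutorials/meta/index.html",       # inside a subfolder of /tutorials/
        "/tutorials/robots.html",           # a file directly in /tutorials/
        "/tutorials-archive/index.html",    # different folder, same first letters
        "/index.html",                      # outside /tutorials/
    ):
        print(path, "blocked" if is_blocked(path) else "allowed")
```

The trailing slash matters here: because the rule is /tutorials/ rather than /tutorials, the made-up /tutorials-archive/ folder is not covered by it.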

You may also list specific files that you do not want indexed in a robots.txt file.

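Sample of specific files in this website that we would not like spiders to index for the search engines:

Disallow: /tutorials/meta_tags.html
Disallow: /tutorials/custom_error_page.html

Comments can be placed in a robots.txt file by starting the line with #. Blank lines separate one User-agent: record from the next, so do not leave a blank line between a User-agent: statement and the Disallow: statements that belong to it.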

The Examples



###############################
#
# sample robots.txt file
#
# addresses all robots by using wild card *
#
User-agent: *
# list folders robots are not allowed to index
#
Disallow: /tutorials/meta/
Disallow: /tutorials/images/
Disallow: /tutorials/assets/
Disallow: /tutorials/404redirect/
#
# list specific files robots are not allowed to index
#
Disallow: /tutorials/meta_tags.html
Disallow: /tutorials/custom_error_page.html
#
# End of robots.txt file
#
###############################
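Before uploading a file like the sample above, it can help to test it against a few URLs. The following is one possible check, sketched with Python's standard urllib.robotparser module; the robot name ExampleBot, the helper name blocked_paths, and the candidate paths are assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
# sample robots.txt file
User-agent: *
# list folders robots are not allowed to index
Disallow: /tutorials/meta/
Disallow: /tutorials/images/
Disallow: /tutorials/assets/
Disallow: /tutorials/404redirect/
# list specific files robots are not allowed to index
Disallow: /tutorials/meta_tags.html
Disallow: /tutorials/custom_error_page.html
"""

# Hypothetical paths to test against the rules.
CANDIDATES = [
    "/index.html",
    "/tutorials/robots.html",
    "/tutorials/meta/keywords.html",
    "/tutorials/meta_tags.html",
]


def blocked_paths(robots_txt, paths, agent="ExampleBot"):
    """Return the paths that the rules tell the robot not to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


if __name__ == "__main__":
    for path in blocked_paths(SAMPLE_ROBOTS_TXT, CANDIDATES):
        print("blocked:", path)
```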

 

Related Tutorials

Introduction to Meta Tags

Related Reference and Resources

You can read more about spiders (a.k.a. robots), META tags and what they do, as well as search engine optimization at the following URLs:

Search Engine Guide
URL: http://www.searchengineguide.com/1stsearchranking/2001/robots.html

Search Tools
URL: http://www.searchtools.com/robots/robots-txt.html

ZDNet
URL: http://www.zdnet.com/devhead/stories/articles/0,4413,1600632,00.html

 

Google has blacklisted BMW.de after the carmaker violated the search giant's guidelines by using a technique that could artificially boost its search engine rating, according to a Google engineer.

In a blog, Google software engineer Matt Cutts said that Google had removed BMW's German site from its Web index after the site included "doorway pages" that would automatically redirect visitors to a different URL.

Cutts explained that when Google's crawlers visited a BMW page, they saw blocks of text with repeated key search words such as "neuwagen," which means "new car" in German. However, a user who visited the listed page would be automatically redirected to another page with less text and more pictures, which was more attractive than the page the crawler saw but would have scored lower in Google's PageRank system.

"This is a violation of our Webmaster quality guidelines, specifically the principle of 'Don't deceive your users or present different content to search engines than you display to users,'" Cutts' blog said.

To regain its Google listing, Cutts expects that BMW.de will have to remove the JavaScript that redirects users around the site in this fashion and then send a reinclusion request to Google's Webspam team, which Cutts leads. BMW.de has already removed some of the redirect pages.

BMW may also have to disclose details of who created the doorway pages--and assure Google "that such pages won't reappear on the sites"--before the domains can be reincluded, Cutts said.

The German site of technology product vendor Ricoh is also due to be removed from Google "for similar reasons," Cutts said.

BMW and Ricoh were unavailable for immediate comment.

Tom Espiner of ZDNet UK reported from London.