
Is Your robots.txt File Killing Your Search Engine Rankings?

January 8th, 2008 · 18 Comments

Way back in April 2007 I wrote a post about robots.txt, a small text file that instructs search engines about what can and can’t be indexed on your site.

Since moving to WordPress a few months ago, it’s become clear that WordPress has a few issues with duplicate content. In essence, it generates multiple URLs that point to the same content. This is generally accepted to be a bad thing, and I’ve taken steps to fix it by adding entries to my robots.txt file. This post explains what I did, and why.

Before I go on, I want to acknowledge the two sources of information I used to understand why I was screwing up:

Both are highly recommended reading, and go way beyond the scope of this robots.txt post.

Here’s what my robots.txt file now looks like:

Sitemap: http://stratify.com.au/sitemap.xml
Sitemap: http://stratify.com.au/sitemap.xml.gz

User-agent: *
Disallow: /*20
Disallow: /*category
Disallow: /*feed
Disallow: /*p=
Disallow: /*s=
Disallow: /*trackback
Disallow: /*wp-

Sitemap:

The first two lines should be self-explanatory; they point to my sitemap files. Note that I haven’t included “www” in the URL. This is because I have WordPress set up to show http://stratify.com.au as the root. Not sure that it matters, but at least there’s consistency there.

User-agent: *

This line simply says that the following instructions apply to all search engine spiders.

Disallow:

The Disallow directive tells a search engine’s spider what it should not index. All the major search engines respect robots.txt commands. As a side note, the logfile analyzer AWStats identifies search engine spiders firstly by their HTTP_USER_AGENT value (it recognises over 400 of them), and secondly by access to the robots.txt file.

/*20 - This stops pages being indexed from the date archive pages. Typically these URLs start with the year, so this entry eliminates everything from the 21st century. Should be good for a while. To check this for your blog, go to the archive page, click a month, and look at the URL of that monthly archive. If it starts with the year, this entry will catch it. One caveat: the pattern matches any URL containing “20”, so if your post permalinks include the date (for example /2008/01/01/post-title/), this entry blocks the posts themselves and you should leave it out.

/*category - Similarly, this one stops pages being indexed from the category archive. Check in the same way as above.

/*feed - This stops your RSS feeds from being indexed.

/*p= - If you use the default permalink setting, post URLs will look something like “http://stratify.com.au/?p=73”. However, your WordPress install has permalinks set correctly, right? This entry simply blocks any reference to a post that doesn’t use your preferred permalink structure.

/*s= - This stops search results pages (WordPress-supplied search, that is) from being indexed.

/*trackback - This stops any trackback URLs from being indexed.

/*wp- - Finally, this stops all of WordPress’s internal files, including the login pages, from being indexed.
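To get a feel for what these entries actually catch, here’s a rough Python sketch that tests a few URL paths against the Disallow patterns from my file. It assumes the wildcard behaviour the major search engines document: a pattern is matched from the start of the path, and “*” stands for any run of characters. The function names (pattern_to_regex, blocking_rule) and the test paths are just illustrations I made up, and the sketch ignores things a real spider also handles, such as Allow rules and per-crawler sections.

```python
import re

# The Disallow patterns from the robots.txt file shown in this post.
DISALLOW = ["/*20", "/*category", "/*feed", "/*p=", "/*s=", "/*trackback", "/*wp-"]


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters, and a
    trailing '$' anchors the pattern to the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


RULES = [(pattern, pattern_to_regex(pattern)) for pattern in DISALLOW]


def blocking_rule(path):
    """Return the first Disallow pattern that matches path, or None."""
    for pattern, regex in RULES:
        if regex.match(path):
            return pattern
    return None


if __name__ == "__main__":
    # Hypothetical paths (path plus query string) to try against the rules.
    samples = [
        "/seo/robots-file-search-engine-rankings",
        "/?p=73",
        "/?s=robots",
        "/category/seo/",
        "/feed/",
        "/wp-login.php",
        "/2008/01/01/post-title/",
    ]
    for path in samples:
        print(path, "->", blocking_rule(path))
```

The keyword permalink passes through untouched, while /?p=73 is caught by /*p= and a date-based permalink such as /2008/01/01/post-title/ is caught by /*20. Because each pattern is effectively a “contains” test, it’s worth running your own URLs through something like this before uploading the file.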

Tags: SEO

18 responses so far ↓

  • 1 Raymond Chua // Jan 8, 2008 at 5:31 am

    Great info. I have bookmarked this page. Very useful. :)

  • 2 Sol Lederman // Jan 8, 2008 at 5:41 am

    Hi, can you tell me where to look for robots.txt? I’ve got several domains hosted at different hosting providers. In one case I’ve got several domains all hosted by one provider, hostgator.

    Is robots.txt a file that exists for each domain I own?

    Thanks,
    Sol

  • 3 Mark // Jan 8, 2008 at 5:48 am

    @Raymond: Thanks for stopping by!

  • 4 Joost de Valk // Jan 8, 2008 at 6:14 am

    Thanks for the link, I’m glad you liked my article!

  • 5 Blaine Moore // Jan 8, 2008 at 2:34 pm

    Good tips, except for one problem: People that use dates in their permalinks should not disallow the /*20 parameter! If your links look like: www.domain.com/2008/01/01/post-title/ then that will effectively block all of your pages that you want indexed.

    If you don’t use dates in your permalinks, then go ahead and use that line.

  • 6 MichelleVan // Jan 8, 2008 at 7:04 pm

    I did all this and more on a site I managed last year and it never made a difference….. When I do seo on some top sites, I see they all have duplicate content… Please let us know if this works!
    Thanks

  • 7 Jens P. Berget // Jan 8, 2008 at 9:39 pm

    I have not included the sitemap in my robots.txt, that’s because I didn’t know that I was supposed to :-)

    Thanks for the information.

    I might be doing some changes to mine…

  • 8 Chris Bloczynski // Jan 9, 2008 at 1:15 am

    Good tip! Thanks, I’m rushing back to my site to check it out.

  • 9 Mark // Jan 9, 2008 at 1:37 am

    @Sol: robots.txt should be located at the root directory of every site you own. It’s not there by default, so what I do is write it in Notepad and just ftp it over when I’m happy with it.

    @Joost: You’re very welcome. And I love your blog metrics plugin, BTW. I use it on this blog (it’s at http://www.joostdevalk.nl/wordpress/blog-metrics/).

    @Blaine: Thanks for pointing this out, you’re 100% correct. I guess I’m guilty of setting up my permalinks one way (%category%/%postname%) and forgetting that others use different schemes.

    @Michelle: It’s really hard to figure out what the search engines deem as duplicate content. What I try to avoid with my robots.txt is a situation where a single post is referenced by two (or more) different URLs. It could be the search engines are smart enough to figure this out and not apply a penalty.

    @Jens: The search engines assume your sitemap files are located in your root directory, so you’re likely to be safe about not declaring the location explicitly in robots.txt.

  • 10 Mark // Jan 9, 2008 at 1:40 am

    @Chris: Thanks for visiting and commenting!

    BTW… later today I will be removing the default ‘nofollow’ parameter for all blog comments, so everyone who’s commented here will get a free one-way link to their site :-). That’s my way of saying ‘thanks’ to all the stumblers and Blog Mastermind people who’ve stopped by.

  • 11 Stratify removes nofollow tags from comments | Stratify Pty Ltd // Jan 9, 2008 at 2:39 am

    […] I wrote about tweaking your robots.txt file so that visiting search engine spiders would behave. One of the commenters, Joost de Valk, has […]

  • 12 Why You Should Use The Wordpress More Tag | Stratify Pty Ltd // Jan 10, 2008 at 10:13 am

    […] few days ago I wrote a post on using robots.txt to control search engine indexing behaviour, and since then I’ve learned of a plugin that […]

  • 13 Rasheed // Feb 25, 2008 at 6:24 pm

    Thanks for your post.

    But I found:

    Disallow: /*p=7

    Will block also

    /?p=7
    /?p=70
    /?p=71
    /?p=72
    /?p=73
    /?p=700

    etc.

  • 14 Mark // Feb 25, 2008 at 7:54 pm

    @Rasheed: Thanks for visiting! You are right - but this is the action that’s intended!

    This post, for example, can be accessed by ../?p=80. It can also be accessed by ../seo/robots-file-search-engine-rankings. As far as the search engine is concerned, these are two different URLs, yet with exactly the same content. My goal is to make sure the search engines have one, and only one, URL per page.

    Mark

  • 15 Rasheed // Feb 25, 2008 at 7:57 pm

    Thanks Mark

  • 16 Mark // Feb 25, 2008 at 7:58 pm

    @Rasheed: Sorry I hit the submit button too fast. I want to avoid the use of /?p= completely, & that’s why the disallow is there. If Google is going to index my page, I want it to use the URL that has my keywords in it!

    The URL does have some SEO value, BTW. There’s a list over at SEOMoz that ranks all the SEO factors that get talked about in forums - well worth heading over there & checking it out.

  • 17 Rasheed // Feb 25, 2008 at 8:04 pm

    BTW I am using the default (ugly) links in my blog (rasheed-b.com). Do you think it would be better to switch to another permalink structure?

    Concerning the list, can you give me a direct link please? I googled it but did not find any lists.

    Thanks.

  • 18 Mark // Feb 25, 2008 at 8:36 pm

    Yes, absolutely I do. Here’s the string I use:

    /%category%/%postname%/

    Here’s the SEOMoz article. It’s worth printing out - a great piece of research:

    http://www.seomoz.org/article/search-ranking-factors

    regards
    Mark
