Ah, the sweet romance of a developer and an overexcited search engine. How I long for my bandwidth back!
If you stopped by in the last three days, you would have received a ‘bandwidth usage limit exceeded’ message from my ISP. If you were me, you’d think that odd, seeing as there are so few visitors at the moment.
To be fair, it was my fault. I’d allowed the bots to crawl my entire website, including every day of the calendar. Google clocked over 44,000 hits!
Read on to see what I did wrong.
In setting up Geeklog, I noticed that the theme included a robots meta tag which requests search engines not to index the content. As I’m actually trying to write useful information for others, I changed this to a tag that allows indexing.
I also made sure that no search engines picked up the documentation and admin things from the site by putting entries in my /robots.txt file.
Google listened, and followed my instructions exactly. Where I went wrong was in allowing Google to read my calendar. Though I haven’t put many entries in there, Google attempted to index every hour of every single day the calendar listed. In just 5 days, Google clocked over 44,000 hits. Ouch!
I’ve since modified my robots.txt file to include:
Disallow: /search.php
Disallow: /users.php
Disallow: /stats.php
Disallow: /profiles.php
Disallow: /calendar.php
Disallow: /calendar_event.php
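If you want to check rules like these before a crawler does, Python's standard `urllib.robotparser` module can evaluate them locally. This is only a rough sketch: the `User-agent: *` line is my assumption (the parser needs a group to attach the rules to), and the sample URLs and their query strings are made up for illustration, not taken from Geeklog.

```python
from urllib.robotparser import RobotFileParser

# Disallow rules from the post. The "User-agent: *" line is an assumption,
# added so the parser has a group to attach the rules to.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search.php
Disallow: /users.php
Disallow: /stats.php
Disallow: /profiles.php
Disallow: /calendar.php
Disallow: /calendar_event.php
"""


def blocked_paths(robots_txt, paths, agent="Googlebot"):
    """Return the paths that the given robots.txt text forbids for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


# Hypothetical URLs; the query strings are illustrative only.
sample = [
    "/index.php",
    "/calendar.php?mode=day&day=1",
    "/calendar_event.php?eid=1",
    "/article.php?story=example",
]

print(blocked_paths(ROBOTS_TXT, sample))
```

Rules are matched by path prefix, so a single `Disallow: /calendar.php` line covers every query-string variation of that page. Note that `/calendar_event.php` does not start with `/calendar.php`, which is why it needs its own entry.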
I’ve also gone and hacked the header.thtml file for the theme. The previous meta tags have now been replaced with:
<?php
// Don't allow some urls to be picked up by bots
if( !empty( $_SERVER['REQUEST_URI'] )) {
    if( !stristr( $_SERVER['REQUEST_URI'], 'submit.php' )
        && !stristr( $_SERVER['REQUEST_URI'], 'calendar.php' )
        && !stristr( $_SERVER['REQUEST_URI'], 'calendar_event.php' ) ) {
        print '<meta name="ROBOTS" content="INDEX,FOLLOW">';
        print '<META NAME="resource-type" CONTENT="document">';
        print '<META NAME="revisit-after" CONTENT="14 days">';
    } else {
        print '<meta name="ROBOTS" content="NOINDEX,NOFOLLOW,NOARCHIVE,NOCACHE">';
    }
}
// print '<!-- ' . $_SERVER['REQUEST_URI'] . ' -->';
?>
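To see which request URIs end up with which robots tag without deploying the theme, the same decision can be mirrored in a few lines of Python. This is a stand-alone illustration, not part of Geeklog: the function name `robots_meta` and the sample URIs are hypothetical, and the case-insensitive substring test is my reading of what `stristr` does in the snippet above.

```python
# Script names that should not be indexed, as listed in the header check.
BLOCKED_SCRIPTS = ("submit.php", "calendar.php", "calendar_event.php")


def robots_meta(request_uri):
    """Return the robots meta content for a request URI.

    Returns None for an empty URI, where the PHP snippet prints nothing.
    """
    if not request_uri:
        return None
    uri = request_uri.lower()  # stristr() compares case-insensitively
    if any(name in uri for name in BLOCKED_SCRIPTS):
        return "NOINDEX,NOFOLLOW,NOARCHIVE,NOCACHE"
    return "INDEX,FOLLOW"


# Hypothetical request URIs for illustration.
for uri in ("/index.php", "/calendar.php?mode=day", "/Calendar_Event.php?eid=1"):
    print(uri, "->", robots_meta(uri))
```

Because the test is a plain substring match, any URI that merely contains one of those script names anywhere is treated as blocked, which is fine for a small site but worth keeping in mind.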
I expect that Google will want to verify all of the calendar entries next time it visits, but hopefully, between the headers and the robots.txt file, it will drop them from its cache and never come back. After a few months, I’ll remove the calendar_event.php page from the listings, so actual calendar entries will be picked up.