Using mod_rewrite to create SEO-friendly URLs is commonplace now. However, if an application is not coded correctly, these URLs can have negative effects.
Take the following URL. This is the URL the developer intended, and the news story is fetched using the ID in the URL:
http://www.site.com/news/23/some-breaking-story.html
However, if the application is coded badly, the following URLs may also reach the same content:
http://www.site.com/news/23/ANYTEXTHERE-some-breaking-story.html
http://www.site.com/news/23/some-breaking-story.php
http://www.site.com/news/23
These extra URLs can cause problems for your SEO rankings. If search engines pick up on the fact that multiple URLs all go to the same duplicate content, your site sends out bad signals, and this should be avoided. There has been recent talk that duplicate content is simply ignored by search engines and is now a non-issue. However, even ignoring direct penalties imposed on your site, having the same content divided over multiple URLs can reduce your rankings simply through dilution.
The answer is URL correction and correct use of headers. A single URL for a single page is the goal. This is fairly easy to achieve by following the methodology below in your scripts:
- Make your mod_rewrite rules specific (e.g. don’t use lots of ([^/]*) groups, as they will match anything!)
- Create a standard base URL for the content
- When the page loads, check that the base URL (or expected URL) matches the page requested by the visitor
- If there is a mismatch, set the correct HTTP headers and redirect to the correct, expected URL
Firstly, adjust your rewrite rule to be more specific:
RewriteRule ^news/([0-9]+)/([-a-zA-Z]+)\.html$ news.php?TITLE=$2&NEWSID=$1 [L]
This now looks for the following pattern: a numeric ID, a slug made of letters and hyphens, and then a .html extension.
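To see what the tighter rule does and does not catch, here is a minimal sketch that runs the same pattern through preg_match. It assumes the character class in the rule is meant to read [-a-zA-Z]+, and the function name matchesNewsRule is hypothetical, used only for illustration.

```php
<?php
// Hypothetical helper: mirrors the RewriteRule pattern so it can be tried outside Apache.
function matchesNewsRule(string $path): bool
{
    // Numeric ID, then a slug of letters and hyphens, then a literal ".html".
    // In a .htaccess context the path is matched without its leading slash.
    return preg_match('#^news/([0-9]+)/([-a-zA-Z]+)\.html$#', $path) === 1;
}

var_dump(matchesNewsRule('news/23/some-breaking-story.html'));             // intended URL
var_dump(matchesNewsRule('news/23/some-breaking-story.php'));              // wrong extension
var_dump(matchesNewsRule('news/23'));                                      // slug missing
var_dump(matchesNewsRule('news/23/ANYTEXTHERE-some-breaking-story.html')); // wrong slug
```

The first and last calls return true and the middle two return false. The rule alone cannot know which slug belongs to story 23, which is why the application still needs to check the requested URL itself.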
You can also adjust your application logic to add in the following:
/*
 * At this point, query your database to retrieve the news item.
 * The variables $title and $id would come from the database.
 */
$actual_url = myClass::getCurrentRequestedURL();
$expected_url = LinkFactory::NewsItem($title, $id);

ob_start();
if ($expected_url != $actual_url) {
    ob_clean(); // discard anything already buffered
    header('HTTP/1.1 301 Moved Permanently'); // send 301 status code
    header('Location: ' . $expected_url); // redirect to expected URL
    ob_flush();
    exit; // stop here so the page body is not sent after the redirect
}
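The snippet above relies on two helpers that are not shown. As a rough sketch of what they might do, the functions below build the expected URL from the title and ID and compare it with the requested path. The names slugify, newsItemUrl and canonicalRedirectTarget are hypothetical stand-ins, not the actual LinkFactory or myClass code, and the letters-only slug is an assumption chosen to suit the rewrite rule.

```php
<?php
// Hypothetical helpers; the real LinkFactory and myClass are not shown in the post.

/** Turn a title into a lowercase, hyphen-separated slug (letters only). */
function slugify(string $title): string
{
    $slug = strtolower($title);
    $slug = preg_replace('/[^a-z]+/', '-', $slug); // any run of other characters becomes one hyphen
    return trim($slug, '-');
}

/** Build the single expected path for a news item. */
function newsItemUrl(string $title, int $id): string
{
    return '/news/' . $id . '/' . slugify($title) . '.html';
}

/** Return the path to redirect to, or null when the request already matches. */
function canonicalRedirectTarget(string $requestUri, string $expectedPath): ?string
{
    // Compare only the path, so a query string alone does not count as a mismatch.
    $actualPath = parse_url($requestUri, PHP_URL_PATH);
    return $actualPath === $expectedPath ? null : $expectedPath;
}

$expected = newsItemUrl('Some Breaking Story', 23);
echo $expected, "\n";
var_dump(canonicalRedirectTarget('/news/23/ANYTEXTHERE-some-breaking-story.html', $expected));
var_dump(canonicalRedirectTarget('/news/23/some-breaking-story.html?ref=rss', $expected));
```

In a live script the first argument would typically come from $_SERVER['REQUEST_URI'], and a non-null result would be passed to the Location header.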
Depending on your url rewriting structure, you’ll notice that WordPress does a very similar thing.
Is there much cause or reason for keeping the file extension on urls?
I need to use mod_rewrite more, got to climb that regex curve.
I’ll try to dig out the link, but I’ve read that search engines favour .html files, as by using a .html extension you are explicitly saying ‘this is a static resource’, which of course we all know Google favours. However, this is just a theory, and personally I think it looks a lot neater with the .html left on. I have never really had any issues with indexing when leaving it on.
However, the other theory is that adding .html adds extra characters to your URL which search engines will ignore.
You’ve got to weigh up the ranking factors here though, as keywords within the URL are a very minor ranking factor. Some big sites use no extension, whereas others keep the extension. For example, Currys use a .html extension on their product pages: currys.co.uk/gbuk/sharp-lc32d12e-32-hd-ready-lcd-tv-08082937-pdt.html, whereas Comet doesn’t: comet.co.uk/p/LCD-TVs/buy-SONY-KDL55EX503U-LCD-TV/640468
There is clearly no ‘right’ way to do things here, whatever people may tell you, as both sites above (both huge sites, thousands of indexed pages) have no issues. :)
Hello Robert (or Rallport), I am interested in your freelance services and I can’t find your email to contact you. Could you please send it?
Hello,
You can find me on twitter @roballport
Thanks
Awesome blog. I enjoyed reading your articles. This is truly a great read for me. I have bookmarked it and I am looking forward to reading new articles. Keep up the good work.
Nice article, straight to the point.
And talking about file extensions, Google and other search engines don’t care about it as long as it is rewritten with mod_rewrite. It is only useful for us, human beings.
I tend to trust websites more when their pages end with common extensions (or none, like on this blog) than with .mom or .xxx :)
Thanks.
There was a theory going around a few months ago that static pages, e.g. those ending with a known file extension such as .html, were given extra brownie points by Google. I’ve been testing this myself for a few months, varying the URLs my CMS produces for blog entries. I’ve come to the conclusion that there is no major difference. I personally prefer no extension :)