• Several months ago, we moved our blog from WordPress.com to a self-hosted WordPress install on our own web server. We made the necessary virtual host entries and 301 redirects and, for the most part, everything works fine. The one exception: when I look at our crawl errors in Google Webmaster Tools, I see a number of 404 ‘not found’ errors. On closer examination of some of the URLs, I’m noticing that Google is not assembling them the same way we have specified in our WordPress permalinks setup. As an example, if our WP permalink format is https://www.example.com/blog/2016/09/13/my-post, a Google crawl error may exist because Google has formed the URL as https://www.example.com/blog/my-post/2016/09/13. This seems to be happening primarily with older posts, though I’m not sure there aren’t exceptions.

    Any ideas on what might be causing this and how to resolve it would be greatly appreciated.

Viewing 5 replies - 1 through 5 (of 5 total)
  • Moderator bcworkz

    (@bcworkz)

    Google’s bot doesn’t reassemble URLs; it picked the URL up like that somewhere. It may be from an old page that’s not even live anymore; it might not even be from your site. Spammy link farms are known to mangle URLs. Drill into the 404 error item in Google Search Console (the new name for Webmaster Tools) until you get to the referrer link showing where the URL was picked up.

    If it’s content you have control over, fix it (it may already have been fixed; links queued for crawling seem to be cached for quite some time). If the page is no longer accessible or not under your control, there’s nothing to do; Google bot will eventually stop trying to reach it.

    Thread Starter srunck

    (@srunck)

    Thanks for your response. There is no referrer link listed for these errors. Also, the person who has been in charge of the blog says we’ve always used URLs that had the date directories ahead of the title directories (“/2016/09/14/my-post”, not “/my-post/2016/09/14”).

    Here’s an example of the redirect we’re doing. This would be the .htaccess file in the root of the old domain. I’m still learning about building proper 301 redirects. Maybe you can spot an issue here:

    <IfModule mod_rewrite.c>

    RewriteEngine On
    RewriteBase /
    Redirectmatch 301 “^/(.*)” “https://www.example.com/blog/$1”

    </IfModule>

    Moderator bcworkz

    (@bcworkz)

    There’s no way your redirect rule would cause Google bot to pick up a mangled URL. Without a referrer to use to trace back from, there’s nothing you can do except mark it as resolved and hope it doesn’t recur. If it does recur, maybe there’ll be a referrer next time.

    RedirectMatch is actually part of mod_alias, not mod_rewrite. You could change mod_rewrite.c to mod_alias.c, but unlike mod_rewrite, mod_alias is part of the base Apache install, so there is no need to check whether the module exists the way you would for an extension like mod_rewrite. You can get rid of the <IfModule> tags altogether.

    Likewise, you don’t need the RewriteEngine and RewriteBase lines; they have nothing to do with RedirectMatch. RedirectMatch works on its own, without any establishing directives the way mod_rewrite rules require them.
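    Putting that together, the whole .htaccess could be reduced to a single directive (a sketch using the same example domain as above):

    ```apache
    # Minimal .htaccess for the old domain's document root.
    # RedirectMatch is a mod_alias directive, so no <IfModule> wrapper
    # and no mod_rewrite setup lines (RewriteEngine/RewriteBase) are needed.
    RedirectMatch 301 "^/(.*)" "https://www.example.com/blog/$1"
    ```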

    Your RedirectMatch line looks OK AFAICT, assuming you’re not using curly quotes in your actual file. I think the forum did the curly quotes. In the future you can avoid having the forum mess up your posted code by using the “code” button or using `backticks`.

    Thread Starter srunck

    (@srunck)

    Thanks. One of our programmers used the double (curly?) quotes on the Redirectmatch 301 line. I have since removed them. Last month, one of our support people went into Google Webmaster Tools and marked all of the crawl errors (well over 100 of them) as fixed because a programmer had set up this redirect. Since then, however, we have seen these same errors creep back onto the crawl error list (again as 404s) at the pace of a couple per day. I have to assume that if I don’t find and resolve the real issue, we will eventually see the 100+ ‘404 not found’ errors that we had before.

    Moderator bcworkz

    (@bcworkz)

    There are all sorts of quote glyphs available; coders should only use the "straight" ones. “Curly” ones (which tend to slant left or right) can creep in when code is copied from forums or word-processor documents, where straight quotes get replaced with more typographically appropriate versions. Code files should only contain the ASCII versions: decimal 34 (") and 39 (').
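    One quick way to spot stray curly quotes is to grep for them directly (a sketch assuming GNU grep and a UTF-8 locale; the temp file just stands in for your real .htaccess):

    ```shell
    # Write a sample config line containing curly quotes, then locate them.
    # Straight quotes are ASCII 34 (") and 39 ('); curly ones are multibyte
    # characters and will be flagged here with their line numbers.
    f=$(mktemp)
    printf '%s\n' 'RedirectMatch 301 “^/(.*)” “https://www.example.com/blog/$1”' > "$f"
    grep -n '“\|”' "$f"
    ```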

    The 404s will not necessarily grow to the original level. It depends on the source. From what I’ve seen, Google bot will cache URLs to crawl for quite some time, so even if the error is fixed, the erroneous URLs will still be requested, possibly weeks later. As long as the source is fixed or removed, Google bot will eventually give up.

    Of course, it’s hard to fix the source if there is no referrer to tell you where the link came from. We can only hope the lack of a referrer means the source no longer exists, so Google bot will eventually give up on that URL. Another way to determine the referrer is to look at your access logs for the date Google detected the error. If you search the logs for the erroneous URL, there shouldn’t be very many hits. If there’s no referrer there either, then someone typed in an erroneous URL, which Google should know nothing about, though I wouldn’t be surprised if they did.
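    If you have shell access, a grep along these lines would pull the matching requests; a sketch, assuming the Combined Log Format (where the referrer is the first quoted field after the status and byte count) and a made-up log entry in place of your real log file:

    ```shell
    # Build a sample access log entry (hypothetical; a live server would use
    # something like /var/log/apache2/access.log) and search it for the bad URL.
    log=$(mktemp)
    cat > "$log" <<'EOF'
    66.249.66.1 - - [14/Sep/2016:10:00:00 +0000] "GET /blog/my-post/2016/09/13 HTTP/1.1" 404 512 "http://spammy.example/links.html" "Mozilla/5.0 (compatible; Googlebot/2.1)"
    EOF
    # Each matching line shows the referrer ("http://spammy.example/links.html" here)
    # right after the status code (404) and response size (512).
    grep 'my-post/2016/09/13' "$log"
    ```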

    I find it hard to believe all of your 404s have no referrers. While I’ve seen this too, it’s a very rare occurrence in my experience. Let’s make sure we’re talking about the same thing. On the crawl error list, click on a 404 URL. The resulting modal shows error details. There’s also a “Linked from” tab that appears to be grayed out like it’s inactive, but it’s not. Click on it and you should get the list of referrers. Clicking on a referrer link should open the page with the bad URL. If you get a page not found when following this link, the source is gone, nothing to be done about that, move on to the next link.

    Note that Google bot will “see” all links on a source page, even if hidden, or even within HTML comments.

  • The topic ‘Google 404 crawl errors on blog only’ is closed to new replies.