crawlers crawling many thousands of duplicate pages
On a few sites, I’ve seen Googlebot and other crawlers presumably trying to index thousands of copies of a page where I have a calendar widget. The queries look something like this:
```
/calendar-page/?id=1113957581&ajaxCalendar=1&mo=6&yr=2024
```
The issue is not that it's crawling through all the possible months and years (although that could be a problem too, if there's no cap on how far it can get into the past or future); rather, the problem is the `id=` value, which appears to be random. The crawlers end up with many copies of the same page, indexed under different values of `id`.

I noticed this simply because it was putting an unusual amount of load on my servers, but it's also not ideal behavior for the widget. Why is an `id` value specified at all? Getting rid of it, if possible, would help a lot. Maybe these links should have `rel=nofollow` too, so the crawlers won't bother with them in the first place (a rough sketch of what I mean is below). Of course, one wouldn't want to block a crawler's ability to find a legitimate calendar entry, but surely there's a better way than having it scan through every month/year view. In my case, Googlebot can find all my calendar entries via my sitemap.xml, which is much more efficient.
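In case it helps, here's a minimal sketch of the nofollow idea, assuming the widget renders its month-navigation controls as plain anchor tags. The `ajaxCalendar`/`mo`/`yr` parameters match the crawled URLs above, but the anchor text and surrounding markup are just placeholders:

```html
<!-- Hypothetical month-navigation links for the calendar widget.
     rel="nofollow" hints to crawlers not to follow these links. -->
<a href="/calendar-page/?ajaxCalendar=1&amp;mo=5&amp;yr=2024" rel="nofollow">&laquo; May</a>
<a href="/calendar-page/?ajaxCalendar=1&amp;mo=7&amp;yr=2024" rel="nofollow">July &raquo;</a>
```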
In the meantime, I added something like

```
Disallow: /calendar-page/?*ajaxCalendar*
```

to my robots.txt, which should help. But I'm curious to see whether anyone else has a better solution.
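For anyone trying the same fix, the relevant part of my robots.txt now looks roughly like this. Note that wildcard handling varies between crawlers, though Googlebot and the other major ones honor `*`:

```
# Block the calendar widget's month/year views, which carry a
# random id= parameter and generate endless duplicate pages.
User-agent: *
Disallow: /calendar-page/?*ajaxCalendar*
```

The actual event pages stay crawlable, since only URLs with `ajaxCalendar` in the query string are disallowed.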