• Resolved kw11

    (@kw11)


    phast.php may be generating JSON with incorrect MIME types. The Google crawler seems to be doing excessive crawling, thinking these are all HTML documents.

    Can you please look into this?

    Thanks!

  • Plugin Author Albert Peschar

    (@kiboit)

    Hi @kw11,

    Unless you share a URL of an indexed resource, I cannot check.

    However, PhastPress sends JSON as text/plain responses because not all server configurations automatically compress application/json responses. It could be that Google automatically indexes these.

    In the latest release, I’ve added an X-Robots-Tag: none header to these JSON responses to prevent any indexing or link following by Google. This should fix the issue.
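
    To illustrate, the response now looks roughly like this (a simplified sketch, not the actual PhastPress source; the payload shape is invented for illustration):

        <?php
        // Simplified sketch, not the actual PhastPress source.
        // Serve the bundled resources as text/plain so that servers which
        // only compress text/* types still compress the response...
        header('Content-Type: text/plain; charset=utf-8');
        // ...and tell crawlers not to index it or follow links inside it.
        header('X-Robots-Tag: none');

        echo json_encode([
            ['status' => 200, 'content' => '/* bundled script or style */'],
        ]);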

    –Albert

    Thread Starter kw11

    (@kw11)

    Thanks for doing this.

    However, on closer inspection, it seems that the phast.php bundler files are being crawled as HTML, which ends up significantly lowering Googlebot/2.1 Smartphone’s crawl rate for HTML pages. The desktop crawl rate is OK, which is weird.

    It seems these text/plain JSON files, which are being interpreted as text/html, are what’s leading to the drop in crawl rate.

    When phast.php was inadvertently blocked from Googlebot, the overall number of pages/files crawled was significantly lower, but the crawl rate of Googlebot Smartphone was orders of magnitude higher.

    Whether or not this is a “bug” in the Google crawler or the Phast bundler, the resulting crawl rates are unacceptable. With the Phast bundler blocked, Googlebot can crawl the whole site in under a few days. With phast.php not blocked from robots, it might take weeks or longer.

    Thread Starter kw11

    (@kw11)

    That being said, the overall crawling, in terms of items crawled, is about 2-3x higher when phast.php is not blocked from robots. So, for example, instead of 2,000 documents being crawled a day, 6,000 documents are being crawled a day. Googlebot crawls the Phast bundler JSON files in very high volume, but skips crawling the actual page HTML with its smartphone crawler. It also seems to me that the overall crawl budget for real HTML pages is much smaller for the Googlebot smartphone crawler when phast.php is not blocked from robots.

    Thread Starter kw11

    (@kw11)

    FWIW, admin-ajax.php uses Content-Type: application/json; charset=UTF-8 in the response headers. That won’t work in some browsers?

    It also uses x-robots-tag: noindex
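
    For reference, that behavior boils down to something like this (a paraphrased sketch of WordPress core, not the verbatim source):

        <?php
        // Paraphrased sketch of WordPress core behavior, not verbatim source.
        // admin-ajax.php marks its output as non-indexable up front...
        @header( 'X-Robots-Tag: noindex' );

        // ...and wp_send_json() then sets the JSON Content-Type for AJAX handlers.
        @header( 'Content-Type: application/json; charset=' . get_option( 'blog_charset' ) );
        echo wp_json_encode( array( 'success' => true ) );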

    Thread Starter kw11

    (@kw11)

    I just reread what you wrote and you said it’s because some servers don’t automatically compress application/json.

    Is there a way you can instruct the server to do this for the application/json MIME type? If not, can you detect whether a server is able to compress application/json and, if it can, serve the responses as application/json? The latter is not a perfect solution from an SEO perspective, but it would get rid of this Googlebot issue for servers that can compress application/json.
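
    On Apache, for example, I believe something like this in .htaccess does it (a sketch assuming mod_deflate is enabled; nginx has a similar gzip_types directive):

        # .htaccess sketch for Apache servers with mod_deflate enabled.
        <IfModule mod_deflate.c>
            # Compress application/json responses in addition to the defaults.
            AddOutputFilterByType DEFLATE application/json
        </IfModule>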

    Thread Starter kw11

    (@kw11)

    I edited the plugin to set the Content-Type to application/json; charset=utf-8. I’ll let you know what Googlebot thinks in a few days. On our server, it’s automatically Brotli compressed.
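
    (I verified the compression with roughly this; the URL is a placeholder for our real bundler endpoint:)

        # Placeholder URL; substitute your real phast.php endpoint.
        curl -s -o /dev/null -D - -H 'Accept-Encoding: br' \
            'https://example.com/wp-content/plugins/phastpress/phast.php' \
            | grep -i content-encoding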

    Thread Starter kw11

    (@kw11)

    One more question. Will blocking phast.php* in robots.txt have any negative consequences when it comes to crawling? Or will it just prevent bundled resources from being cached?

    It seems like Googlebot and most robots don’t cache anything anyway?

    Plugin Author Albert Peschar

    (@kiboit)

    Hi @kw11,

    We cannot prevent the initial request to phast.php from happening by changing the headers on that response, because the headers are retrieved only after the request is actually made.

    So I suggest adding phast.php to your robots.txt. There’s nothing there that should be crawled by Google.
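
    For example (a sketch; the path below assumes a default plugin location, so adjust it to wherever phast.php is served on your site):

        # robots.txt sketch; the path assumes a default plugin install.
        User-agent: *
        Disallow: /wp-content/plugins/phastpress/phast.php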

    –Albert

    Thread Starter kw11

    (@kw11)

    So to be clear, you think it’s fine to block phast.php in robots.txt?

    Plugin Author Albert Peschar

    (@kiboit)

    Hi @kw11,

    I’m actually not sure what will happen.

    On the one hand, it doesn’t stop Google from indexing the content of your page.

    On the other hand, it prevents Google from using the Phast bundler, and I’m not sure what that will do to Google’s impression of your site’s performance.

    I edited the plugin to set the Content-Type to application/json; charset=utf-8. I’ll let you know what Googlebot thinks in a few days. On our server, it’s automatically Brotli compressed.

    Thanks for this. I’m looking forward to hearing the results of this experiment. If it makes a big difference I may add a setting to the plugin.

    –Albert

    Thread Starter kw11

    (@kw11)

    Mobile crawl requests are back to normal (actually above the previous normal), and desktop crawl requests are also spiking.

    Overall crawl requests are much higher now.

    phast.php* is blocked with robots.txt. I didn’t have a chance to test for a significant period of time with just the application/json MIME type, but I still think the text/plain responses were probably confusing the mobile crawler, which was interpreting them as text/html.

    The crawler was confused, and I would suggest blocking phast.php in robots.txt. Unfortunately, this plugin has the potential to seriously impact indexability and rankings without these steps.

    Plugin Author Albert Peschar

    (@kiboit)

    Hi @kw11,

    Thanks for your testing.

    I have released version 2.10, which reverts to using application/json as the MIME type on bundler responses and removes the X-Robots-Tag header.
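
    The net effect on bundler responses is roughly this (a sketch, not the verbatim plugin code):

        <?php
        // Sketch of the net effect in 2.10, not the verbatim plugin code:
        // the real JSON MIME type is back, and the X-Robots-Tag header
        // from earlier in this thread is no longer sent.
        header('Content-Type: application/json; charset=utf-8');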

    Hopefully this will prevent the issue in future.

    –Albert

  • The topic ‘Google crawler and web browser seeing phast.php JSON files as HTML document’ is closed to new replies.