• Resolved yclin925

    (@yclin925)


    include('/simple_html_dom.php');
    
    $url = "https://www.2311666.com.tw/";
    $opts = array(
       'http'=> array('header'=>"User-Agent:Chrome/94.0.4606.81\r\n"));
    $context = stream_context_create($opts);
    $html = new simple_html_dom();
    $get = file_get_contents($url,false,$context);
    $html->load($get);
    
    echo $html;

    P.S. The above code is deployed on the server.
    I’ve no problem scraping most of websites but some specific sites with Breeze internal caching return gibberish.

    To be more precise, the first crawl is normal but second crawl show gibberish.

    I’ve tried using cURL or dynamic User-Agent but get the same result. Is there a correct way?

Viewing 6 replies - 1 through 6 (of 6 total)
  • Plugin Author adeelkhan

    (@adeelkhan)

    Kindly share the response of the second crawl.

    Thread Starter yclin925

    (@yclin925)

    Response of second crawl gets garbled code as shown below. Thanks for helping

    ? 簀iwW??硅?﹜?亃云?N'?矻?逴?dyJv?R????隿?'@願?1f礉柮y菪嘄僻9克?堬-N?Jg

    • This reply was modified 3 years, 1 month ago by Yui.
    • This reply was modified 3 years, 1 month ago by Yui. Reason: cut some junk
    Moderator Yui

    (@fierevere)

    永子

    Try to disable GZIP compressing in cache plugin, if such setting exists.

    Usually webserver handles that, no need to compress the output twice

    Thread Starter yclin925

    (@yclin925)

    Thanks for the suggestion.

    My problem is the first crawl normal but second crawl show gibberish on the remote server.

    Thread Starter yclin925

    (@yclin925)

    Response header of first crawl gets garbled code as shown below.

    HTTP/2 200 server: nginx date: Thu, 28 Oct 2021 06:48:18 GMT content-type: text/html; charset=UTF-8 vary: Accept-Encoding x-ua-compatible: IE=edge link: <https:>; rel=”https://api.w.org/&#8221;, <https:>; rel=”alternate”; type=”application/json”, <https:>; rel=shortlink cache-provider: CLOUDWAYS-CACHE-DC last-modified: Thu, 28 Oct 2021 06:48:18 GMT vary: Accept-Encoding cache-control: max-age=0 expires: Thu, 28 Oct 2021 06:48:17 GMT

    Response header of second crawl gets garbled code as shown below.

    HTTP/2 200 server: nginx date: Thu, 28 Oct 2021 06:50:43 GMT content-type: text/html; charset=utf-8 content-length: 29167 cache-provider: CLOUDWAYS-CACHE-DE content-encoding: gzip vary: Accept-Encoding last-modified: Thu, 28 Oct 2021 06:48:18 GMT cache-control: max-age=0 expires: Thu, 28 Oct 2021 06:50:43 GMT

    Thread Starter yclin925

    (@yclin925)

    I add below code into CURL and it works.
    curl_setopt($curl, CURLOPT_ACCEPT_ENCODING, “gzip”);

Viewing 6 replies - 1 through 6 (of 6 total)
  • The topic ‘How to crawl website with breeze caching normaly’ is closed to new replies.