html - PHP DOMXPath encoding issue after XPath


If I use echo $doc->saveHTML(); the characters are shown correctly, but once the document is passed to XPath to extract the element, the issue appears again.

I can't seem to display the characters properly. How do I convert them properly? I'm getting:

婢跺繐顒滈拺鍙ョ瀵偓鐞涱偊鈧繑妲戦挅鍕綍婢舵牕顨� 闂€鍌溾敄缂侊綀濮虫稉濠呫€� 娑擃叀顣荤純鎴犵綍閺冭泛鐨绘總鍏呯瑐鐞涳綀鏉藉▎ 

instead of proper Chinese. The page's head is:

<head><meta http-equiv="x-ua-compatible" content="ie=edge"><meta charset="gbk"/></head> 

My PHP code:

$html = file_get_contents('http://item.taobao.com/item.htm?spm=a2106.m874.1000384.41.ag3kbi&id=20811635147&_u=o1ffj7oi9ad3&scm=1029.newlist-0.1.16&ppath=&sku=');
$doc = new DOMDocument();

// Based on the article http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258#11310258
$searchpage = mb_convert_encoding($html, "HTML-ENTITIES", "GBK");
$doc->loadHTML($searchpage);
// echo $doc->saveHTML();

$xpath = new DOMXPath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3");

foreach ($elements as $e) {
    // echo $e->nodeValue;
    echo mb_convert_encoding($e->nodeValue, "utf-8", "gbk");
}

You have the to_encoding and from_encoding parameters the wrong way around in your last call to mb_convert_encoding. The content returned by the XPath query is encoded as UTF-8, but presumably you want your output encoded as GBK (given that you've set the meta charset to "gbk").

So your final loop should be:

foreach ($elements as $e) {
    echo mb_convert_encoding($e->nodeValue, "gbk", "utf-8");
}

The to_encoding is "gbk", and the from_encoding is "utf-8".

That said, the answer given by agreeornot should work too, if you are happy with your page being encoded as UTF-8.
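
Because the argument order is easy to get backwards, here is a minimal sketch of the direction of the conversion. It assumes the mbstring extension is available; the sample string and variable names are illustrative and do not come from the question.

```php
<?php
// Signature: mb_convert_encoding($string, $to_encoding, $from_encoding)

// "\u{...}" escapes keep this file ASCII-only; the result is the UTF-8
// byte sequence for two Chinese characters.
$utf8 = "\u{4e2d}\u{6587}";

// UTF-8 in, GBK out: the target encoding comes first.
$gbk = mb_convert_encoding($utf8, "GBK", "UTF-8");

// Converting back (GBK in, UTF-8 out) should restore the original bytes.
$roundTrip = mb_convert_encoding($gbk, "UTF-8", "GBK");

echo strlen($utf8), " bytes as UTF-8, ", strlen($gbk), " bytes as GBK\n";
echo $roundTrip === $utf8 ? "round trip ok\n" : "round trip failed\n";
```

Swapping the last two arguments in the first call would treat the UTF-8 bytes as if they were GBK, which is the kind of mix-up that produces the garbled output shown in the question.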


As for how the encoding process works: internally DOMDocument uses UTF-8, which is why the results from your XPath queries are UTF-8, and why you need to convert them to GBK with mb_convert_encoding if that is the character set you need.

When you call loadHTML, it attempts to detect the source encoding and then convert the input from that encoding to UTF-8. Unfortunately, the detection algorithm doesn't work very well.

For example, although your example page has set the charset meta tag, that meta tag is not recognised by loadHTML, so it defaults to assuming the source encoding is Latin1. It would have worked if you had used an http-equiv meta tag specifying the content-type.
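
To see this in isolation, the following sketch loads a tiny ASCII-only document (the characters are written as numeric entities) and inspects what the XPath query returns. The markup is made up for illustration, and the dom and mbstring extensions are assumed to be loaded.

```php
<?php
// ASCII-only input: the two Chinese characters are numeric entities,
// so loadHTML's encoding detection plays no part here.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><h3>&#20013;&#25991;</h3></body></html>');

$xpath = new DOMXPath($doc);
$value = $xpath->query('//h3')->item(0)->nodeValue;

// The node value comes back as UTF-8, whatever encoding the output page uses.
$isUtf8 = ($value === "\u{4e2d}\u{6587}");
echo $isUtf8 ? "nodeValue is UTF-8\n" : "unexpected bytes\n";

// Convert only at the output boundary if the page is served as GBK.
$forGbkPage = mb_convert_encoding($value, 'GBK', 'UTF-8');
echo bin2hex($forGbkPage), "\n";
```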

<meta http-equiv="content-type" content="text/html; charset=gbk" /> 
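
If you cannot edit the page itself, one possible workaround is to add such a declaration to the markup before parsing it. The sketch below is only an illustration: withContentTypeMeta is a hypothetical helper, the sample page is made up, and whether libxml can actually decode GBK depends on how it was built, so the last step reports either outcome.

```php
<?php
// Hypothetical helper: insert an http-equiv content-type declaration
// right after the first opening <head> tag (matched case-insensitively).
function withContentTypeMeta(string $html, string $charset): string
{
    $meta = '<meta http-equiv="content-type" content="text/html; charset=' . $charset . '">';
    $result = preg_replace_callback(
        '/<head(\s[^>]*)?>/i',
        function (array $m) use ($meta): string {
            return $m[0] . $meta;
        },
        $html,
        1
    );
    return $result ?? $html; // fall back to the input if the regex engine fails
}

// Made-up GBK page that only carries the HTML5-style charset tag.
$page = mb_convert_encoding(
    "<html><head><meta charset=\"gbk\"></head><body><h3>\u{4e2d}\u{6587}</h3></body></html>",
    'GBK',
    'UTF-8'
);

$patched = withContentTypeMeta($page, 'gbk');

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($patched);

$h3 = $doc->getElementsByTagName('h3')->item(0);
$decoded = ($h3 !== null && $h3->nodeValue === "\u{4e2d}\u{6587}");
echo $decoded
    ? "decoded to UTF-8\n"
    : "not decoded; this libxml build may lack GBK support\n";
```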

The alternative is to avoid the problem altogether by converting all non-ASCII characters to HTML entities (as you have done). That way it doesn't matter whether loadHTML detects the character encoding correctly, because there won't be any characters that need converting.
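
As a sketch of the same idea on newer PHP versions: the "HTML-ENTITIES" target of mb_convert_encoding is deprecated as of PHP 8.2, and mb_encode_numericentity can do the same job. The sample markup and variable names below are illustrative, and the source encoding is assumed to be known in advance to be GBK.

```php
<?php
// Made-up GBK input standing in for the downloaded page.
$gbkHtml = mb_convert_encoding(
    "<html><body><h3>\u{4e2d}\u{6587}</h3></body></html>",
    'GBK',
    'UTF-8'
);

// Step 1: decode to UTF-8 using the encoding the page is known to use.
$utf8Html = mb_convert_encoding($gbkHtml, 'UTF-8', 'GBK');

// Step 2: replace every code point from U+0080 upwards with a numeric
// entity, leaving an ASCII-only string for the parser.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$asciiHtml = mb_encode_numericentity($utf8Html, $convmap, 'UTF-8');

// Step 3: parse. No encoding detection is needed for ASCII input.
$doc = new DOMDocument();
$doc->loadHTML($asciiHtml);

// Node values are UTF-8; convert to GBK only when writing the output.
$text = $doc->getElementsByTagName('h3')->item(0)->nodeValue;

echo $asciiHtml, "\n";
echo $text === "\u{4e2d}\u{6587}" ? "parsed as UTF-8\n" : "unexpected bytes\n";
```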

