html - PHP DOMXPath encoding issue after XPath query
If I use echo $doc->saveHTML(); the characters are shown correctly, but once the document reaches the XPath query that extracts the element, the issue appears again.
I can't seem to display the characters properly. How do I convert them properly? I'm getting:
婢跺繐顒滈拺鍙ョ瀵偓鐞涱偊鈧繑妲戦挅鍕綍婢舵牕顨� 闂€鍌溾敄缂侊綀濮虫稉濠呫€� 娑擃叀顣荤純鎴犵綍閺冭泛鐨绘總鍏呯瑐鐞涳綀鏉藉▎
instead of the proper Chinese. The page's head declares its charset like this:
<head><meta http-equiv="x-ua-compatible" content="ie=edge"><meta charset="gbk"/></head>
My PHP code:
$html = file_get_contents('http://item.taobao.com/item.htm?spm=a2106.m874.1000384.41.ag3kbi&id=20811635147&_u=o1ffj7oi9ad3&scm=1029.newlist-0.1.16&ppath=&sku=');
$doc = new DOMDocument();
// based on article http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258#11310258
$searchpage = mb_convert_encoding($html, "html-entities", "gbk");
$doc->loadHTML($searchpage);
// echo $doc->saveHTML();
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3");
foreach ($elements as $e) {
    //echo $e->nodeValue;
    echo mb_convert_encoding($e->nodeValue, "utf-8", "gbk");
}
You have the to_encoding and from_encoding parameters the wrong way around in your last call to mb_convert_encoding. The content returned from the XPath query is encoded as UTF-8, but presumably you want your output encoded as GBK (given you've set the meta charset to "gbk").
So the final loop should be:
foreach ($elements as $e) {
    echo mb_convert_encoding($e->nodeValue, "gbk", "utf-8");
}
The to_encoding is "gbk", and the from_encoding is "utf-8".
That said, the answer given by agreeornot should work too, if you are happy with the page being encoded as UTF-8.
As for how the encoding process works: internally DOMDocument uses UTF-8, which is why the results you get back from your XPath queries are UTF-8, and why you need to convert them to GBK with mb_convert_encoding if that is the character set you need.
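A minimal sketch of that round trip, assuming the dom and mbstring extensions are loaded. The input string is hypothetical: it spells the two characters 中文 as numeric entities so the source is pure ASCII, and it only mimics the element path used in the question.

```php
<?php
// Hypothetical input: "中文" written as numeric entities (U+4E2D, U+6587).
$html = '<html><body><div id="detail"><div><h3>&#20013;&#25991;</h3></div></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3");

foreach ($elements as $e) {
    // DOMDocument hands the text back as UTF-8, whatever the input encoding was.
    $utf8 = $e->nodeValue;
    // to_encoding comes first, from_encoding second.
    $gbk = mb_convert_encoding($utf8, 'GBK', 'UTF-8');
    // Dump the raw bytes so the difference is visible in any terminal.
    echo bin2hex($utf8), ' -> ', bin2hex($gbk), "\n";
}
```

The hex dump is only there to make the byte-level change visible; in a real page you would echo $gbk directly into GBK-encoded output.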
When you call loadHTML, it attempts to detect the source encoding, and then converts the input from that encoding to UTF-8. Unfortunately the detection algorithm doesn't work very well.
For example, although your example page has set the charset metatag, that metatag is not recognised by loadHTML, so it defaults to assuming the source encoding is Latin1. It would have worked if you had used an http-equiv metatag specifying the content-type.
<meta http-equiv="content-type" content="text/html; charset=gbk" />
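As a rough illustration, the sketch below feeds loadHTML a small GBK-encoded document that carries this http-equiv metatag. The markup, the id "title" and the sample bytes are made up for the example, and it assumes the underlying libxml build can convert from GBK (typically through iconv); if it cannot, the output will differ.

```php
<?php
// "中文" as raw GBK bytes, written with hex escapes so this file's own encoding doesn't matter.
$gbkText = "\xD6\xD0\xCE\xC4";

$html = '<html><head>'
      . '<meta http-equiv="content-type" content="text/html; charset=gbk" />'
      . '</head><body><h3 id="title">' . $gbkText . '</h3></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node = $xpath->query("//h3[@id='title']")->item(0);

// If the declared charset was honoured, these are the UTF-8 bytes of the same two characters.
$utf8 = $node->nodeValue;
echo bin2hex($utf8), "\n";
```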
The alternative is to avoid the problem altogether, by converting all non-ASCII characters to HTML entities (as you have done). That way it doesn't matter whether loadHTML detects the character encoding correctly, because there won't be any characters that need converting.
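One way to sketch the entity approach is shown below. Note the substitution: instead of the "html-entities" target used in the question (deprecated as an mbstring encoding since PHP 8.2), it converts to UTF-8 first and then calls mb_encode_numericentity. The sample markup and the variable names are assumptions made for the example.

```php
<?php
// Hypothetical GBK-encoded input ("中文" inside an h3), with no usable charset declaration.
$gbkHtml = '<html><body><h3>' . "\xD6\xD0\xCE\xC4" . '</h3></body></html>';

// Step 1: GBK -> UTF-8 (to_encoding first, from_encoding second).
$utf8Html = mb_convert_encoding($gbkHtml, 'UTF-8', 'GBK');

// Step 2: replace every code point from U+0080 upwards with a numeric entity.
// Map format: [first code point, last code point, offset, mask].
$asciiHtml = mb_encode_numericentity($utf8Html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

// The markup is now pure ASCII, so encoding detection has nothing to get wrong.
$doc = new DOMDocument();
$doc->loadHTML($asciiHtml);

$xpath = new DOMXPath($doc);
$text = $xpath->query('//h3')->item(0)->nodeValue;   // UTF-8 again

echo $asciiHtml, "\n";
echo bin2hex($text), "\n";
```

Markup characters such as < and > sit below U+0080, so the tags themselves pass through untouched.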