Crawling UTF-8 pages with Symfony2 DomCrawler

August 29, 2014

In a recent project we have been using the excellent Symfony2 DomCrawler to parse HTML pages. On several occasions, however, I noticed that some of the generated data contained broken characters caused by a wrong character encoding, even though I had made sure to use UTF-8 everywhere.

While researching the problem I came across a blog post by Dean Clatworthy, who had run into a similar issue. In his post he explains that DomCrawler falls back to ISO-8859-1 if it cannot detect the document’s encoding, which got me thinking. In the end I found out that the page we were crawling was sent with a “text/html;charset=UTF-8” Content-Type HTTP header but did not contain a charset meta tag, which is what DomCrawler uses to detect the encoding. Luckily, DomCrawler comes with the addHtmlContent method, which allowed us to set the correct character encoding manually.

So instead of using

$crawler = new Crawler($html, $url);

we now use

// Create an empty crawler, then add the HTML with an explicit charset.
$crawler = new Crawler('', $url);
$crawler->addHtmlContent($html, 'UTF-8');

which solves our problem. (The second code snippet is a version of Dean’s solution, adapted for use with DomCrawler >= 2.5.)
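
Hard-coding 'UTF-8' works for us because we know the encoding of the site we crawl. When crawling pages with differing encodings, one option might be to read the charset from the Content-Type header instead. The following is only a minimal sketch of that idea: charsetFromContentType is a hypothetical helper and not part of DomCrawler, and it assumes the header value is already available as a string from whatever HTTP client is in use.

```php
<?php

/**
 * Hypothetical helper (not part of DomCrawler): returns the charset
 * parameter of a Content-Type header value, or $default if the header
 * does not declare one.
 */
function charsetFromContentType($contentType, $default = 'UTF-8')
{
    // Matches e.g. ";charset=UTF-8" or '; charset="iso-8859-1"',
    // case-insensitively and with optional whitespace.
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $matches)) {
        return $matches[1];
    }

    // No charset parameter found: fall back to the assumed default.
    return $default;
}
```

The returned value could then be passed as the second argument to addHtmlContent in place of the literal 'UTF-8'. The UTF-8 default is an assumption that suits our project; for other sites a different fallback may be more appropriate.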