In a recent project we used the excellent Symfony2 DomCrawler to parse HTML pages. On several occasions I noticed that some of the generated data contained broken characters caused by a wrong character encoding, even though I had made sure to use UTF-8 everywhere.
While researching the problem I came across a blog post by Dean Clatworthy, who had run into a similar issue. In his post he describes that DomCrawler falls back to ISO-8859-1 if it is not able to detect the document’s encoding, which got me thinking. In the end I found out that the webpage we were crawling was transmitted with a “text/html;charset=UTF-8” Content-Type HTTP header but did not contain a charset meta tag, which is what DomCrawler uses to detect the encoding. Luckily DomCrawler comes with the addHtmlContent method, which allowed us to set the right character encoding manually.
So instead of letting DomCrawler work out the encoding from the markup, we now pass the charset explicitly through addHtmlContent, which solves our problems (our code is a version of Dean’s solution, adapted for use with DomCrawler >=2.5).
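The change can be sketched roughly as follows. This is a minimal illustration rather than our production code: the sample markup and the variable names are made up, and it assumes symfony/dom-crawler is installed through Composer.

```php
<?php

// Assumption: symfony/dom-crawler was installed via Composer.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample page: the body holds UTF-8 bytes ("Grüße"),
// but the markup has no charset meta tag the crawler could read.
$html = "<html><head><title>Test</title></head>"
      . "<body><p>Gr\xC3\xBC\xC3\x9Fe</p></body></html>";

// Before: the crawler receives only the markup and has to guess the charset.
$guessing = new Crawler($html);

// After: start with an empty crawler and state the charset explicitly.
$crawler = new Crawler();
$crawler->addHtmlContent($html, 'UTF-8');

// filterXPath() needs no extra component (filter() requires symfony/css-selector).
echo $crawler->filterXPath('//p')->text(), PHP_EOL;
```

If the Content-Type header of the response is at hand, passing it as the second argument of addContent (for example `'text/html; charset=UTF-8'`) should have the same effect, since the charset is read from that string before the markup is inspected.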