HTML parsing in PHP is done with the DOM extension.

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $images = $dom->getElementsByTagName('img');
    foreach ($images as $image) {
        // Prefix each image's src with the base URL
        $image->setAttribute('src', 'http://example.com/' . $image->getAttribute('src'));
    }
    $html = $dom->saveHTML();
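The loop above prefixes every src value, including ones that are already absolute. A minimal sketch of one way to skip those, assuming sample markup and a base URL invented here for illustration ($html and $base are hypothetical values, not from the original):

```php
<?php
// Hypothetical sample markup and base URL, used only for illustration.
$html = '<p><img src="a.png"><img src="http://cdn.example.org/b.png"></p>';
$base = 'http://example.com/';

$dom = new DOMDocument();
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('img') as $image) {
    $src = $image->getAttribute('src');
    // Skip empty values and URLs that already carry a scheme or start with "//".
    if ($src === '' || preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $src)) {
        continue;
    }
    // Drop a leading slash so the joined URL has exactly one separator.
    $image->setAttribute('src', $base . ltrim($src, '/'));
}

$result = $dom->saveHTML();
echo $result;
```

This is plain string concatenation rather than full relative-URL resolution, so paths such as ../a.png are not normalized.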

The following lists all <a> tags whose rel attribute is nofollow:

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html); // loads your HTML
    $xpath = new DOMXPath($doc);
    // returns a list of all links with rel=nofollow
    $nlist = $xpath->query("//a[@rel='nofollow']");
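The query returns a DOMNodeList, which can be iterated with foreach. The sketch below uses hypothetical sample markup and also shows a token-based variant, on the assumption that links such as rel="nofollow noopener" should match too; the exact-match expression from the listing above would skip them.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<a href="/a" rel="nofollow">A</a><a href="/b">B</a>'
      . '<a href="/c" rel="nofollow noopener">C</a>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Exact match: only links whose rel is exactly "nofollow".
$exact = $xpath->query("//a[@rel='nofollow']");

// Token match: rel contains "nofollow" among its space-separated values.
$token = $xpath->query(
    "//a[contains(concat(' ', normalize-space(@rel), ' '), ' nofollow ')]"
);

// Collect the href of every link found by the token match.
$hrefs = [];
foreach ($token as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```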

A DOM script that extracts the links from a Google page:

    <?php
    # Use the cURL extension to query Google and get back a page of results
    $url = "http://www.google.com";
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $html = curl_exec($ch);
    curl_close($ch);
    # curl_exec() returns false on failure; stop before parsing
    if ($html === false) {
        exit("Request failed\n");
    }
    # Create a DOM parser object
    $dom = new DOMDocument();
    # Parse the HTML from Google.
    # The @ before the method call suppresses any warnings that
    # loadHTML might throw because of invalid HTML in the page.
    @$dom->loadHTML($html);
    # Iterate over all the <a> tags
    foreach ($dom->getElementsByTagName('a') as $link) {
        # Show the <a href>
        echo $link->getAttribute('href');
        echo "<br />";
    }
    ?>

FAQ

DOMDocument::loadHTML(): Unexpected end tag : p in Entity

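This warning is raised when the HTML being parsed has incorrectly nested elements. <ol> is a block-level element, and a <p> cannot contain block-level elements, so the parser closes the <p> implicitly when it reaches <ol>. The later </p> then has no open element to match.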

    $html = '<p>
    <ol>
    <li>The OS interface layer has been completely reworked</li>
    </ol>
    </p>';
    $Dom = new DOMDocument();
    $Dom->loadHTML($html);
    $Xpath = new DOMXPath($Dom);
    $Xpath->query('//p');

Validating this $html at https://validator.w3.org/ reports the following error:

No p element in scope but a p end tag seen.
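Rather than silencing the warning with @, the parser messages can be collected and inspected. The following is a minimal sketch using libxml_use_internal_errors() and libxml_get_errors(); the $messages array and the output format are choices made here for illustration, and the exact messages reported may vary with the installed libxml2 version.

```php
<?php
$html = '<p>
<ol>
<li>The OS interface layer has been completely reworked</li>
</ol>
</p>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Turn each LibXMLError into a short "line N: message" string.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

print_r($messages);

// The parser closed <p> before <ol>, so the list is not a child of <p>.
$xpath = new DOMXPath($dom);
$nested = $xpath->query('//p/ol')->length;
$lists  = $xpath->query('//ol')->length;
echo "ol inside p: $nested, ol total: $lists\n";
```

The cleaner fix is to correct the markup itself, for example by replacing the outer <p> with a <div>, which may contain block-level elements.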
