• Compared to plain text, a Web page is a two-dimensional presentation
• Rich visual effects are created by different font types, formats, separators, blank areas, colours, pictures, etc.
• Different parts of a page are not equally important
Mining therefore draws on content, hyperlinks, and layout, where layout covers both the two-dimensional visual layout and the DOM tree structure.
The DOM is geared towards content display and may not reflect the semantic structure of the page.
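As a rough illustration of the DOM point, the sketch below records the DOM depth at which each piece of text appears. The sample page, the `DomOutline` class and the `text_depths` helper are hypothetical names made up for this example; only Python's standard `html.parser` module is used. With a table-based layout, a headline and the article it belongs to can end up at quite different depths in different subtrees, even though a reader sees them as one unit.

```python
from html.parser import HTMLParser


class DomOutline(HTMLParser):
    """Track the nesting depth of open tags and note the depth of each text node."""

    # Tags that never have a closing tag, so they must not change the depth.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.text_depths = {}  # text content -> DOM depth where it was found

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace between tags
            self.text_depths[text] = self.depth


def text_depths(html):
    """Return a dict mapping each non-empty text node to its DOM depth."""
    parser = DomOutline()
    parser.feed(html)
    parser.close()
    return parser.text_depths


# Hypothetical page: nested tables used purely for visual positioning.
PAGE = """
<table><tr>
  <td><table><tr><td><b>Headline</b></td></tr></table></td>
  <td><p>Body of the article</p></td>
</tr></table>
"""

depths = text_depths(PAGE)
for text, depth in depths.items():
    print(depth, text)
```

Here the nesting reflects how the page was laid out for display, which is one reason a purely DOM-based segmentation may group content differently from how a reader perceives it.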
Example: VIPS
Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma, VIPS: a Vision-based Page Segmentation Algorithm, Microsoft Technical Report, November 2003.
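To make the idea of vision-based segmentation a little more concrete, here is a toy, one-dimensional sketch that groups rendered blocks by the blank areas between them. It is not the VIPS algorithm from the report cited above; it only illustrates the general notion of using visual separators instead of the DOM. The block names, pixel coordinates, the `split_by_gaps` function and the `min_gap` threshold are all assumptions made for this example.

```python
def split_by_gaps(blocks, min_gap):
    """Group blocks into segments separated by vertical blank gaps.

    blocks  -- list of (name, top, bottom) tuples, in pixels (hypothetical input)
    min_gap -- minimum blank height, in pixels, that counts as a separator
    """
    ordered = sorted(blocks, key=lambda b: b[1])  # top-to-bottom order
    segments = []
    current = []
    current_bottom = None  # lowest bottom edge seen in the current segment
    for name, top, bottom in ordered:
        # A large enough blank area above this block closes the current segment.
        if current and top - current_bottom >= min_gap:
            segments.append(current)
            current = []
            current_bottom = None
        current.append(name)
        current_bottom = bottom if current_bottom is None else max(current_bottom, bottom)
    if current:
        segments.append(current)
    return segments


# Hypothetical rendered blocks: (name, top, bottom).
BLOCKS = [
    ("logo", 0, 60),
    ("nav", 10, 50),
    ("headline", 100, 140),
    ("article", 145, 600),
    ("footer", 660, 700),
]

segments = split_by_gaps(BLOCKS, min_gap=30)
print(segments)
```

With these made-up coordinates the headline and the article fall into the same segment because only a small gap separates them, regardless of where they sit in the DOM tree.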