Information Extraction (IE) and Information Retrieval (IR) over the Web is a challenging problem, complicated further by the large volume of data and the multitude of content classes. Web pages are structured to include not only main content sections, such as product information in a shopping domain, but also sections such as navigation panels and copyright notices. Each section carries a different importance, and hence there is a need to segment a web page into a set of sections and determine their importance.
The majority of existing approaches consist of an optional page-level, rule-based web page segmentation step and a section (or DOM node) importance detection step that leverages site-specific information and/or page-level spatial and content features. Our approach falls in the same category. However, unlike other approaches, our web page segmentation leverages structural information across pages via a template, along with page-level, rule-based information, which leads to robust segmentation quality. It also detects, with high confidence, less important sections local to a cluster of pages (i.e., part of a site), by leveraging template learning.
2 Our Approach
The proposed approach leverages site-level information, based on the observation that, within a particular site, the informative content of web pages is often diverse in its actual content and/or presentation (structure), whereas noisy content shares common content, links, and presentation styles. Here, the text, links, and images embedded in tags in a web page are considered `content'.
Given a website, the approach takes k web page samples from the site and learns a template over the DOM structure of those samples. It then learns site-specific node and content importance using structural and content features that repeat across pages. The approach matches each test page with the learnt template, segments the page into a set of sections, and assigns an importance score to each section, using the template learning together with page-level spatial and content features.
A template, similar to [1], is a tree-based regular expression learnt over the set of structures of pages within a site. The initial template is constructed from the structure of one page and is then generalized over a set of pages by adding operators where the pages are structurally dissimilar. In addition to HTML tags, the template generalization deals with three operators, '*', '?', and '|', denoting multiplicity (repetition of similar structure), optionality (part of the structure is optional), and disjunction (presence of one of the structures) in the structural data, respectively. In brief, a template is a generalized tree-based regular expression over the structure of the pages seen so far. To illustrate, consider the template (A)*B(C)?D(E|F), where A, B, C, D, E, and F represent sets of DOM nodes and/or sub-trees in the structure. This template matches all pages having HTML structure ABCDE, AABCDE, ABDE, ABDF, ABCDF, etc. Further details on templates can be found in [2].
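To make the operator semantics concrete, the flat example above can be checked with an ordinary regular expression over node labels. This is a minimal sketch in Python; actual templates in [1, 2] are tree-based, so it illustrates only the operators, not the matching algorithm.

    import re

    # The example template (A)*B(C)?D(E|F), written as a flat regular
    # expression over single-character node labels.
    TEMPLATE = re.compile(r"(A)*B(C)?D(E|F)")

    for page in ["ABCDE", "AABCDE", "ABDE", "ABDF", "ABCDF", "ABCD"]:
        verdict = "matches" if TEMPLATE.fullmatch(page) else "does not match"
        print(page, verdict)  # ABCD fails: (E|F) requires a trailing E or F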
The template helps to capture structural and content repetition across pages, which is used to determine section importance. The template also captures sets of structurally similar items under a STAR ('*') node, which helps the segmentation process.
The approach is split into two phases, as explained below.
Learning phase: This phase involves learning the structural and content repetition across web pages, as described below:
1. For each site, create and generalize the template over k random sample web pages.
2. During template generalization, compute or update the value of each feature, if present, for each leaf template node, based on the corresponding structure nodes. The features used are the page support (PS) of each template node, and the PS of each image-source, link, and text feature mapping to a template node. Here, the PS of a feature/node is defined as the number of pages containing that feature/node.
3. After generalizing the template over the k samples, compute the node support (the ratio of the PS of a node to the sample size, k) and each feature's noise confidence (the ratio of the PS of a feature to the PS of the node containing it) at each leaf template node. This step helps capture noise local to a cluster of pages.
4. Store the noise confidence of content features at nodes whose node support is greater than some threshold (say, 20%). A sketch of steps 2-4 follows this list.
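A minimal sketch of steps 2-4, assuming each sample page has already been matched to the template and reduced to a mapping from leaf template node ids to the content features observed there. The data layout and names below are illustrative assumptions, not the paper's actual structures.

    from collections import defaultdict

    def learn_noise_confidence(pages, k, support_threshold=0.2):
        # pages: one dict per sample page, {leaf_node_id: set of features}.
        node_ps = defaultdict(int)                          # PS per leaf node
        feature_ps = defaultdict(lambda: defaultdict(int))  # PS per feature, per node

        for page in pages:
            for node_id, features in page.items():
                node_ps[node_id] += 1
                for f in features:
                    feature_ps[node_id][f] += 1

        noise_confidence = {}
        for node_id, ps in node_ps.items():
            if ps / k <= support_threshold:   # node support too low; skip (step 4)
                continue
            # Noise confidence = PS of a feature / PS of the node containing it.
            noise_confidence[node_id] = {
                f: fps / ps for f, fps in feature_ps[node_id].items()
            }
        return noise_confidence

    # Repeated footer text gets confidence 1.0 (noisy); diverse titles get 0.5.
    samples = [
        {"footer": {"(c) ExampleShop"}, "title": {"Red Shoes"}},
        {"footer": {"(c) ExampleShop"}, "title": {"Blue Hat"}},
    ]
    print(learn_noise_confidence(samples, k=2))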
Testing phase: This phase involves segmenting the web page into a set of sections and determining their importance, leveraging the template and visual information, as described below.
First, match each test page with the learnt template and obtain the mapping of each template node to the corresponding set of structural nodes in the page. Transfer the noise confidence scores to the leaf structure nodes based on the presence of content features.
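A small sketch of this transfer step, continuing the illustrative structures from the learning-phase sketch; the mapping layout and node attributes are assumptions, not the paper's interface.

    def transfer_noise(mapping, noise_confidence):
        # mapping: {leaf_template_node_id: list of matched structure nodes},
        # where each structure node exposes .feature (its text/link/image
        # value) and a writable .noise attribute (both assumed here).
        for node_id, struct_nodes in mapping.items():
            learnt = noise_confidence.get(node_id, {})
            for s in struct_nodes:
                # Unseen features default to 0.0, i.e. treated as informative.
                s.noise = learnt.get(s.feature, 0.0)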
The segmentation process is described as follows:
1. A web page often contains a list of similar items, such as a list of products or a list of navigational links, where each item is represented by a set of HTML nodes. We treat such a list as a section, as all items belonging to the list have the same importance. A STAR ('*') node in a template represents such a list. Hence, all HTML nodes mapping to a STAR template node are treated as part of one `section'. Note that the approach considers the uppermost STAR node if nesting of STAR nodes is found.
2. The above step obtains a set of sections by looking at STAR nodes. We assume a DOM tree with visual information (the height and width of each DOM node) is available for the page; for the remainder of the page, the following conditions are checked in a top-down fashion to obtain the set of sections (a sketch follows this list):
- Cond1: The ratio of the node's area to the web page area is greater than some threshold (say, 15%). The area of a node is computed as its height multiplied by its width.
- Cond2: One of its children has a sectioning tag, such as TABLE or DIV, and satisfies Cond1.
- Cond3: One of its children has a section-separating tag, such as HR or FRAMESET.
3. If a node satisfies (Cond1 AND Cond2), its children are processed recursively.
4. If a node satisfies Cond3, the child DOM nodes between two section separators, between the first node and the first section separator, or between the last section separator and the last node, are treated as separate sections.
5. If none of the conditions is satisfied, the DOM node is marked as a section.
6. Note that all contiguous, inline, sibling rich-text formatting nodes, such as B, EM, and I, are considered a single section.
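A minimal sketch of the decision logic in steps 1-5, assuming a simplified DOM node type carrying its tag, rendered size, and a flag marking nodes mapped to a STAR template node. The Node class is hypothetical, and step 6 (merging contiguous inline formatting nodes) is omitted for brevity.

    from dataclasses import dataclass, field
    from typing import List

    SECTIONING_TAGS = {"TABLE", "DIV"}
    SEPARATOR_TAGS = {"HR", "FRAMESET"}

    @dataclass
    class Node:
        tag: str
        width: float
        height: float
        children: List["Node"] = field(default_factory=list)
        star: bool = False   # True if this node maps to a STAR template node

    def segment(node, page_area, sections, area_threshold=0.15):
        # Each section is collected as a list of DOM nodes.
        if node.star:                        # step 1: a STAR list is one section
            sections.append([node])
            return
        cond1 = node.width * node.height / page_area > area_threshold
        cond2 = any(c.tag in SECTIONING_TAGS and
                    c.width * c.height / page_area > area_threshold
                    for c in node.children)
        cond3 = any(c.tag in SEPARATOR_TAGS for c in node.children)

        if cond1 and cond2:                  # step 3: recurse into children
            for c in node.children:
                segment(c, page_area, sections, area_threshold)
        elif cond3:                          # step 4: split at separator tags
            group = []
            for c in node.children:
                if c.tag in SEPARATOR_TAGS:
                    if group:
                        sections.append(group)
                    group = []
                else:
                    group.append(c)
            if group:
                sections.append(group)
        else:                                # step 5: the node itself is a section
            sections.append([node])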
Once the segmentation process is over, each section is assigned an importance score as follows: the noise confidence of each leaf structure node, obtained in the template-matching step above, is aggregated at the section level to determine the noise confidence of the section. The aggregation is a weighted average of the noise confidences of the leaf structure nodes, weighted by their size. The section importance score is computed as (1 - section noise confidence), so the importance value ranges between 0 and 1.
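A minimal sketch of this scoring rule, assuming each leaf structure node contributes a (noise confidence, size) pair; the paper does not fix the size unit, so rendered area or text length are plausible weights.

    def section_importance(leaves):
        # leaves: list of (noise_confidence, size) pairs for the section.
        total = sum(size for _, size in leaves)
        if total == 0:
            return 1.0   # assumption: an empty section is treated as informative
        noise = sum(nc * size for nc, size in leaves) / total
        return 1.0 - noise   # importance in [0, 1]

    # Two small noisy leaves plus one large informative leaf: ~0.82.
    print(section_importance([(0.9, 10), (0.8, 15), (0.1, 200)]))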
[Figure 1: Average fraction of noise per domain]
3 Experiments
The approach was evaluated on 18 domains, randomly selecting 15 pages per domain for learning and 65 pages for testing. Based on its importance score, each section is classified into one of two categories: informative or noisy. If a section's importance is less than some threshold (say, 25%), it is classified as noisy; otherwise it is informative. The evaluation of the classified sections was done manually: three annotators were presented with the set of sections and their categories and were asked to verify the sectioning quality and the correctness of the categorization. According to this evaluation, the approach detects noisy sections with an average of 91% precision and 82% recall. It was also found that the approach could form a section out of similar items (even with slight structural or visual differences), owing to its template learning over a set of pages.
The graph in Figure 1 depicts the domain-wise average of the ratio of the amount of noise detected in a page to the actual web page content. The technique detected an average of 20% of web page content as noise. It was also observed that, given a sufficient and structurally slightly varying training dataset, the approach was successful in detecting noise local to a cluster of pages.
4 Conclusions
In this paper, we have addressed the web page sectioning problem by leveraging site-specific information via a template. A novel approach to template-based web page segmentation and section importance detection is introduced. Preliminary experiments demonstrate the promise of the approach. Future work involves further experimentation and the incorporation of other features.
REFERENCES
[1] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large web sites. In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, 2001.
[2] V. G. V. Vydiswaran, R. R. Mehta, A. Madaan, and C. Tiwari. Tree-based template learning for high precision extraction. 2008.