J4 ›› 2012, Vol. 47 ›› Issue (3): 38-42.

• Articles •

Content extraction from web page based on the DOM tree and line-text statistical noise-elimination

LI Xia, JIANG Sheng-yi   

  1. Cisco School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510006, Guangdong, China
  • Received: 2011-11-30 Online: 2012-03-20 Published: 2012-04-01

Abstract:

Because web pages use different character encodings, each HTML page first needs to be converted to a uniform encoding, UTF-8, and then translated into an XML document, which is parsed into a DOM tree. After some noise nodes are removed from the DOM tree according to the features of the XML language and rules describing noise characteristics, text content is extracted from the DOM tree using punctuation statistics, and the remaining noise is then eliminated from the extracted content using line-text statistics. Experiments on 2000 web pages obtained from different web sites show that the method achieves high accuracy, generalizes well, and is simple, and that it can automatically extract the correct content from different web sites.
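
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the noise rules or the statistical thresholds, so the tag sets `NOISE_TAGS` and `BLOCK_TAGS`, the parameters `min_punct` and `min_ratio`, and the helper names are all assumptions chosen for illustration. The sketch also uses the standard-library `html.parser` event stream rather than an XML conversion and a full DOM tree, and it assumes the caller knows the page's source encoding.

```python
from html.parser import HTMLParser

# Assumed rule sets; the abstract does not list the actual noise rules.
NOISE_TAGS = {"script", "style", "nav", "footer", "form", "iframe"}
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "br"}
PUNCTUATION = set(",.;:!?，。；：！？、")


class BlockCollector(HTMLParser):
    """Collects one text string per block element, skipping noise subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buf = []
        self._noise_depth = 0

    def _flush(self):
        # Collapse whitespace and store the buffered text as one block.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._flush()
            self._noise_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Text inside a noise subtree is discarded.
        if self._noise_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def punctuation_count(text):
    """Number of punctuation marks (Western or Chinese) in a text block."""
    return sum(1 for ch in text if ch in PUNCTUATION)


def extract_content(raw, encoding="utf-8", min_punct=2, min_ratio=0.5):
    """Return the text blocks that survive both statistical filters."""
    # Step 1: normalise the page to Unicode text (the UTF-8 step in the abstract).
    html = raw.decode(encoding, errors="replace")
    # Step 2: parse the page and drop noise subtrees.
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    # Step 3: punctuation statistics, keeping blocks that look like prose.
    candidates = [b for b in collector.blocks if punctuation_count(b) >= min_punct]
    if not candidates:
        return []
    # Step 4: line-text statistics, dropping lines far shorter than the mean.
    mean_len = sum(len(b) for b in candidates) / len(candidates)
    return [b for b in candidates if len(b) >= min_ratio * mean_len]
```

Calling `extract_content(page_bytes, encoding="gbk")` on a downloaded page would return a list of candidate content lines. The thresholds here are placeholders and would need tuning against labelled pages before any accuracy comparable to the reported experiments could be expected.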

Key words: content extraction from web pages; DOM tree; line-text statistics; punctuation statistics
