The Web is a continuously growing information repository with a rich semantic structure that spans many application areas. However, it was designed primarily for human consumption rather than automated processing, which is a major obstacle to automating tasks such as information searching, filtering, and extraction. In this context, this paper presents a technique for learning rules that extract product information from HTML sources representing product information sheets. The technique exploits the fact that the Web pages presenting a given producer's products are generated on the fly from the producer's database and therefore exhibit a uniform structure. Consequently, once a human user has manually performed the extraction task for a few information items, a general-purpose inductive learner can learn extraction rules that are then applied to the current and other product information sheets to extract further items automatically. The input to the learning algorithm is a relational description of the HTML document tree that defines the types of the tree nodes and the relationships between them. The approach is demonstrated with examples, experimental results, and software tools.
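
To make the idea of a relational description of the HTML document tree more concrete, the following is a minimal sketch in Python using only the standard library. It is not the paper's implementation: the relation names `type`, `child`, and `next`, the integer node identifiers, and the helper `html_to_facts` are all assumptions chosen for illustration. Each element and each non-empty text node receives an identifier, and the tree is flattened into facts that an inductive learner could in principle take as background knowledge.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class RelationalHTML(HTMLParser):
    """Flattens an HTML tree into relational facts (hypothetical vocabulary).

    ("type", n, t)   node n has type t (a tag name, or "text")
    ("child", p, n)  node n is a child of node p
    ("next", a, b)   node b is the sibling immediately after node a
    """

    def __init__(self):
        super().__init__()
        self.facts = []
        self._stack = []        # identifiers of the currently open elements
        self._last_child = {}   # parent id -> id of its most recent child
        self._next_id = 0

    def _add_node(self, node_type):
        node_id = self._next_id
        self._next_id += 1
        self.facts.append(("type", node_id, node_type))
        parent = self._stack[-1] if self._stack else None
        if parent is not None:
            self.facts.append(("child", parent, node_id))
        previous = self._last_child.get(parent)
        if previous is not None:
            self.facts.append(("next", previous, node_id))
        self._last_child[parent] = node_id
        return node_id

    def handle_starttag(self, tag, attrs):
        node_id = self._add_node(tag)
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Assumes well-nested input; real pages may need a tolerant parser.
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if data.strip():
            self._add_node("text")


def html_to_facts(html):
    """Return the list of relational facts for an HTML fragment."""
    parser = RelationalHTML()
    parser.feed(html)
    parser.close()
    return parser.facts


if __name__ == "__main__":
    sample = "<table><tr><td>Price</td><td>100</td></tr></table>"
    for fact in html_to_facts(sample):
        print(fact)
```

In this sketch, a user marking the text node under the second `td` as a price would give the learner a positive example, and a learned rule could then refer to the `type`, `child`, and `next` facts around that node. Because pages generated from the same database share a structure, the same rule would be expected to match the corresponding node on other product sheets.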