Wrapper Generation for Unstructured Data
Abstract— Data on the web is highly unstructured, and sometimes it is presented without any HTML tags, which makes it difficult to query web sites and extract data from them. It is also difficult to merge data collected from different web sites, because it arrives in different formats and data types. A machine cannot understand unstructured data on its own; it needs both structure and content in order to extract data from the web. We need an algorithm that can generate structure from this unstructured data automatically, without any manual intervention.
INTRODUCTION
Data on the web is highly unstructured, and traditional query engines cannot query it efficiently and accurately. We need an autonomous engine to which a user can submit a query and receive a result, with all the complexity hidden inside: the user is not concerned with which web sites the engine visits, or how it obtains results, maps them, and returns a combined and summarized answer. In such a system no manual intervention is needed, and wrappers and other intermediate tools are all generated at run time. Because everything is done at run time, such a dynamic system is not affected by changes to web pages. The idea we present in this paper concerns wrapper generation, since one of the crucial problems of information extraction from the internet is generating wrappers, which are information-extraction patterns or rules for a web page. We describe how to generate a wrapper for a web page that has few or no HTML tags. The main focus is to obtain structure for data that is presented without tags, because a machine needs both the structure and the content of a web page in order to query it efficiently and accurately.
OVERVIEW
Data on web sites is highly unstructured, and on many web pages it is presented without HTML tags. It is difficult for a machine to derive structure from such data, although a human reader can see that visual structures are always present. This data is very important, yet it cannot be queried, and it is not possible to extract only specific information from it. To extract and query the data, the machine must know the schema of the unstructured data.
The web pages on these data-intensive web sites share a similar schema: they are generated automatically from back-end databases and presented as HTML. Since a program generates them, there must be some schema common to those similar pages. To recover the semantics of such pages, we therefore need to analyze and compare a group of them. The approach rests on two assumptions: the pages are generated by a program, so some structure must exist, and some visual structures are present in them.
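The idea of comparing program-generated pages to expose their common schema can be illustrated with a minimal sketch. This is not the algorithm proposed in this paper; it is an assumption-laden toy that aligns the whitespace-separated tokens of two plain-text pages with Python's standard `difflib.SequenceMatcher`, keeps the tokens both pages share as template text, and marks every differing run as a data field. The function name `infer_template`, the `<FIELD>` placeholder, and the two sample pages are hypothetical.

```python
from difflib import SequenceMatcher


def infer_template(page_a: str, page_b: str) -> list[str]:
    """Align two pages token by token; shared tokens are treated as
    template text and every differing run is collapsed into '<FIELD>'."""
    tokens_a = page_a.split()
    tokens_b = page_b.split()
    # autojunk=False disables difflib's popularity heuristic, which could
    # otherwise ignore frequently repeated template tokens on long pages.
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    template = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            template.extend(tokens_a[i1:i2])  # common to both pages
        else:
            template.append("<FIELD>")        # page-specific data
    return template


# Hypothetical pages assumed to come from the same generating program.
page_1 = "Title: Dune Author: Frank Herbert Price: 9.99"
page_2 = "Title: Emma Author: Jane Austen Price: 7.50"
print(" ".join(infer_template(page_1, page_2)))
```

For these two sample pages the labels `Title:`, `Author:` and `Price:` survive as template text, and the values between them become fields. A real system would have to compare more than two pages and handle optional or repeated fields, which this sketch ignores.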
RELATED WORK
A large amount of work has already been done in this field. Many systems available today generate wrappers for web pages, but most of them are not suitable here: they have many limitations, rely on weak assumptions, or need manual intervention. Some also need a large number of training examples in order to work well in a specific domain. A wrapper is a specified procedure designed for extracting specific data from web sites of interest. The result of a wrapper should be a formal structure suitable for processing.[4]
We now discuss the various techniques for wrapper generation and compare them with our approach. Autowrapper[5] extracts tables from web pages using the Smith-Waterman algorithm, but it fails when there are no HTML tags, nested tables, or single-row tables. Our approach works well for web data that has neither HTML tags nor any table structure. PickUp[3] is able to extract complex table structures, but it fails when the page contains both data without HTML tags and repeated patterns. It also focuses on a single page; to overcome this limitation, we analyze many pages before generating a wrapper for unstructured data without HTML tags. RoadRunner[1] requires HTML tags and two training documents. It tries to map the structure of one document onto the other, handling some irregularities, but it fails without HTML tags. Our approach does not need a set of training documents, and it works for web data without HTML tags.
ROADRUNNER
In this paper we extend the work done on RoadRunner, which automatically generates wrappers for HTML web pages. We give a brief overview of RoadRunner here. It essentially works by matching HTML tags. It needs two pages to parse in order to generate a generalized wrapper. The first page is the reference page. While