Crawler for Competitive Intelligence
ABSTRACT
Competitive intelligence can be understood as the action of gathering, analyzing, and distributing information about products, customers, competitors, and any aspect of the environment needed to support executives and managers in making strategic decisions for an organization. The goal of Competitive Intelligence (CI), a sub-area of Knowledge Management, is to monitor a firm's external environment to obtain information relevant to its decision-making process [1]. In this paper we present a focused web crawler that collects pages from the Web using different graph search algorithms and serves as the main component of a CI tool. A focused crawler, or topical crawler, is a web crawler that attempts to download only web pages that are relevant to a pre-defined topic or set of topics. The performance of a focused crawler depends mostly on the richness of links in the specific topic being searched, and focused crawling usually relies on a general web search engine to provide starting points [2]. The CI Crawler collects web pages from sites specified by the user and applies indexing and categorization analysis to the collected documents, thus providing the user with an up-to-date view of those pages.
FEATURES:-
Crawls web pages based on any one of three algorithms:- FCFS, FICA, or Dynamic PageRank, each of which is explained later in the paper.
Crawls those web pages which get redirected:- Many URLs redirect to another web page with a new address, which makes it difficult for major search engines to locate the exact web page. We solved this problem by analyzing the “meta” tag and its attributes (“http-equiv”).
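The redirect check described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the crawler's actual code: it assumes the redirect is declared as a “meta” tag with http-equiv set to “refresh” and a content value of the form “delay; url=target”. The names MetaRefreshParser and find_meta_redirect are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaRefreshParser(HTMLParser):
    """Records the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.target: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first refresh directive is of interest.
        if tag != "meta" or self.target is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # Assumed content format: "5; url=http://example.com/new"
        _, _, rest = attr.get("content", "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value.strip():
            self.target = value.strip().strip("'\"")


def find_meta_redirect(html: str) -> Optional[str]:
    """Return the redirect target declared in the page, or None."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.target
```

A crawler could call find_meta_redirect on each downloaded page and, when a target is returned, add that target to its queue in place of the original address.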
Resolves relative URLs to absolute URLs:- Some web pages contain links that are not absolute. We solved this problem by converting each relative URL to its corresponding absolute URL.
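As an illustration, relative links can be resolved against the address of the page on which they appear. The sketch below uses the Python standard library; the function name to_absolute is hypothetical, and dropping the “#fragment” part is our assumption rather than something stated above.

```python
from urllib.parse import urldefrag, urljoin


def to_absolute(base_url: str, href: str) -> str:
    """Resolve a link found on the page at base_url to an absolute URL."""
    # urljoin leaves already-absolute links unchanged.
    absolute = urljoin(base_url, href)
    # Assumption: fragments point inside the same page, so they are dropped.
    absolute, _fragment = urldefrag(absolute)
    return absolute
```

For example, a link “../c.html” found on “http://example.com/a/b.html” resolves to “http://example.com/c.html”.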
Extracts the useful text content from the web pages:- As our crawler crawls the web pages, it saves them in “.txt” format in a local repository. We remove the HTML tags, the JavaScript code, and other irrelevant content from each page to extract the relevant textual information.
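One way such text extraction might be done is sketched below with Python's standard HTML parser. It is a simplified illustration: it assumes that only the contents of “script” and “style” elements need to be discarded, and the names TextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        # A self-closing tag carries no text, so there is nothing to do.
        pass

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the textual content of an HTML page, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The returned string can then be written to a “.txt” file in the local repository.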
Avoids “ad servers” while crawling:- A web page often contains URLs that point to advertisements, which are useless to a focused crawler. We use a list of ad servers (available on the Internet), which can be updated dynamically in our program, to avoid visiting such pages.
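A minimal sketch of such a filter is given below. It assumes the ad-server list is a plain-text file with one host name per line and “#” comments, and that sub-domains of a listed host should also be blocked; both are our assumptions, and the names load_blocklist and is_ad_url are hypothetical.

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def load_blocklist(lines: Iterable[str]) -> Set[str]:
    """Build a set of blocked host names from the lines of a list file."""
    hosts = set()
    for line in lines:
        # Drop comments and surrounding whitespace; host names are case-insensitive.
        entry = line.split("#", 1)[0].strip().lower()
        if entry:
            hosts.add(entry)
    return hosts


def is_ad_url(url: str, blocked_hosts: Set[str]) -> bool:
    """Return True if the URL's host, or any parent domain of it, is blocked."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check "img.ads.example.com", then "ads.example.com", "example.com", "com".
    return any(".".join(parts[i:]) in blocked_hosts for i in range(len(parts)))
```

Reloading the list with load_blocklist whenever the file changes lets the crawler pick up new ad servers without changing the program itself.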
Crawls in a focused way, visiting pages pertaining to a specific topic:- A CI crawler should crawl pages that are relevant to a particular topic entered by the user. This implies that the crawler should avoid crawling web pages that do not fall under that topic. We are currently working on this aspect and plan to