
Scout Archives


Web Robots and Web Mining

Manually indexing the World Wide Web is obviously an impossible task, and it remains a daunting challenge even for automated techniques. Web content mining is the general term for these techniques, which are intended for categorizing and filtering information. Web robots serve a variety of purposes, including indexing, and they can be useful or, in some cases, harmful. Web usage mining, on the other hand, examines how a Web site's structure and organization affect the way users navigate the site.

The Web Robots Pages (1) is an excellent starting place to learn about these automated programs. Several hundred robots are documented in a database, and a selection of papers addresses ethics and guidelines for using robots, among other topics. DM Review (2) carries an article on Web mining and its subclasses; it describes the basics of Web analysis and outlines many of the benefits Web mining can offer. A course homepage on Web data mining from DePaul University (3) offers a broad selection of reading material on the subject; consisting mostly of research papers and journal articles, the documents range from general applications to specific theories and case studies. Two computer scientists from Polytechnic University propose a robust, distributed Web crawler (another term for Web robot) intended for large-scale network interaction (4). Their twelve-page paper begins with the motivation for the project and continues with a full description of the system architecture and implementation.

The November 2002 issue of Computer magazine featured an article on Data Mining for Web Intelligence (5). It points out that today's Internet falls short in many key respects and that Web mining will play an important role in developing improved search engines and automatic document classification. A short poster presentation from the 2002 International World Wide Web Conference (6) introduces GeniMiner, a Web search strategy based on a genetic algorithm; GeniMiner aims for a nearly optimal solution in order to minimize manual analysis of the search results. KDnuggets (7) is a free, biweekly newsletter on data and Web mining. Recent issues have given special attention to the Total Information Awareness project, which is investigating ways of mining the Web and email for possible evidence of terrorist activity.

Web robots are occasionally put to malicious use, notably to register automatically for free email accounts or to cast votes in online polls. One technology developed to counter such robots requires the visitor to read a blurred or distorted word before gaining access, a task that is easy for a human but was thought to be impossible for a robot. A press release from the University of California at Berkeley (8) reports that researchers have found a way for Web robots to crack this security scheme. The article describes how it was accomplished and makes the case for more advanced security measures.
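Much of the guidance collected at The Web Robots Pages centers on the Robots Exclusion Protocol, the robots.txt convention that well-behaved robots consult before fetching pages. The Python sketch below is only a rough illustration of that idea, not the distributed crawler proposed in (4); the seed URL, user-agent name, and page budget are assumed placeholders.

```python
# A minimal "polite" Web robot sketch: it consults each site's robots.txt
# (the Robots Exclusion Protocol) before fetching, crawls breadth-first,
# and stops after a small page budget. The seed URL, user-agent string,
# and budget are illustrative placeholders only.
import time
import urllib.parse
import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser

USER_AGENT = "ExampleScoutBot/0.1"   # hypothetical robot name
SEED_URL = "https://example.com/"    # hypothetical starting page
PAGE_BUDGET = 20                     # maximum number of pages to fetch

ROBOTS = {}  # per-host cache of parsed robots.txt files


class LinkExtractor(HTMLParser):
    """Collects href attributes from anchor tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def allowed(url):
    """Return True if robots.txt permits USER_AGENT to fetch this URL."""
    parts = urllib.parse.urlsplit(url)
    if parts.netloc not in ROBOTS:
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            parser.read()
        except OSError:
            pass  # if robots.txt is unreachable, can_fetch() stays conservative
        ROBOTS[parts.netloc] = parser
    return ROBOTS[parts.netloc].can_fetch(USER_AGENT, url)


def crawl(seed):
    seen, queue, fetched = {seed}, deque([seed]), 0
    while queue and fetched < PAGE_BUDGET:
        url = queue.popleft()
        if not allowed(url):
            continue  # honor the exclusion protocol and skip this page
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable or misbehaving pages
        fetched += 1
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urllib.parse.urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        print(f"fetched {url} ({len(extractor.links)} links found)")
        time.sleep(1.0)  # crude politeness delay between requests


if __name__ == "__main__":
    crawl(SEED_URL)
```

A production crawler would go well beyond this sketch, adding per-host politeness policies, canonical URL deduplication, and a distributed work queue; that scaling problem is exactly what the Polytechnic University paper (4) takes up.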
Date Issued: 2003
Date of Scout Publication: March 28th, 2003
Date of Record Creation: April 17th, 2003 at 12:24pm
Date of Record Release: April 17th, 2003 at 12:24pm