My Life in Bits


Configuring the Google Mini Crawl

Background

The Google Mini indexes files through HTTP crawls. Although it can index many file types, every file must first be discovered by following links. The simplest way to provide those links is to enable directory browsing on the server, so that the Google Mini discovers files from an initial directory listing (e.g. the Google Mini starts crawling at http://fileserver.acmecorp.com/). Which files get indexed can be controlled on the Google Mini with regular expressions that limit the crawl by file extension.
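To make the idea of filtering by extension concrete, here is a minimal Python sketch. It is not Google Mini configuration: the appliance has its own pattern syntax, which may differ from Python's `re` module. The extension list and the function name `should_index` are assumptions chosen for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of extensions worth indexing; adjust as needed.
INDEXED_EXTENSIONS = ("pdf", "doc", "xls", "ppt", "txt", "htm", "html")

# Matches URL paths that end in one of the extensions above, ignoring case.
EXTENSION_PATTERN = re.compile(
    r"\.(?:" + "|".join(INDEXED_EXTENSIONS) + r")$", re.IGNORECASE
)


def should_index(url: str) -> bool:
    """Return True if the path part of the URL ends in an indexed extension."""
    path = urlparse(url).path
    return EXTENSION_PATTERN.search(path) is not None
```

Only the path is tested, so the ".com" in the host name is never mistaken for a file extension.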

The Problem

The primary problem I encountered was that the Google Mini also indexed the directory listings, which greatly decreased search precision and recall. These directory listings are HTML like any other web page, but their URLs do not end in .htm*. Using a regular expression to block the indexing of URLs that do not end with an extension would address this problem, but I believe this would also block the crawling of the file server.
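The concern can be illustrated with a toy simulation. This is only a sketch under one assumption: that a URL rejected by the pattern is neither fetched nor followed. Whether the Google Mini actually behaves this way was not verified. The link graph, the pattern, and the function names are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical link graph: each URL maps to the links found on that page.
LINKS = {
    "http://fileserver.acmecorp.com/": [
        "http://fileserver.acmecorp.com/docs/",
        "http://fileserver.acmecorp.com/readme.txt",
    ],
    "http://fileserver.acmecorp.com/docs/": [
        "http://fileserver.acmecorp.com/docs/plan.pdf",
    ],
}

# A path "has an extension" if it ends in a dot followed by letters or digits.
HAS_EXTENSION = re.compile(r"\.[A-Za-z0-9]+$")


def has_extension(url):
    return HAS_EXTENSION.search(urlparse(url).path) is not None


def crawl(start, allowed):
    """Breadth-first crawl that only fetches URLs accepted by `allowed`."""
    seen, queue, fetched = {start}, [start], []
    while queue:
        url = queue.pop(0)
        if not allowed(url):
            continue  # Assumption: a blocked URL is not fetched or followed.
        fetched.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched
```

With no filter, `crawl` reaches both directory listings and both files. With `has_extension` as the filter, the start URL itself is rejected, so nothing is fetched at all.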

Note: It may be possible to work around this problem by configuring just the Google Mini. I investigated some possible solutions, but did not chase down every approach.

The Solution

I wrote a file system crawler in Python. This crawler (CrawlFS.py) crawls servers and generates an HTML file that the Google Mini can use as the starting point for its crawl. This HTML file is typically served as the default document in the root of the initial crawl directory.
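The general approach can be sketched in a few lines of Python. This is not the code from CrawlFS.py, only a minimal illustration of walking a directory tree and emitting one link per file. The function name `build_index` and its parameters are assumptions.

```python
import html
import os
from urllib.parse import quote


def build_index(root_dir: str, base_url: str) -> str:
    """Walk root_dir and return an HTML page linking to every file found."""
    items = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # Visit subdirectories in a stable order.
        for name in sorted(filenames):
            # Path relative to the crawl root, with forward slashes for URLs.
            rel = os.path.relpath(os.path.join(dirpath, name), root_dir)
            rel_url = quote(rel.replace(os.sep, "/"))
            href = base_url.rstrip("/") + "/" + rel_url
            items.append(
                f'<li><a href="{html.escape(href)}">{html.escape(rel)}</a></li>'
            )
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"
```

The returned string could then be written to whatever file the web server treats as its default document (for example index.html, though that name depends on the server configuration), so that the crawl starts from a flat list of file links instead of the directory listings.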

Other Niceties:

Since I was already going to the trouble of writing a file system crawler, adding a few features also made sense. Most of these are minor and are easy to follow from the comments in CrawlFS.py.

The one feature worth mentioning is the duplicate file checker. It reads the first x bytes of each file (2048 seems to work well), computes an MD5 hash, and stores it in memory. When the file system crawl is finished, the duplicate files are identified. All of the files discovered are listed in the output HTML file, but only unique files are hyperlinked. This should help to improve search precision.
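A duplicate check along these lines might look like the following sketch. It is not taken from CrawlFS.py. The function names are hypothetical, and treating the first file seen in each group as the "unique" copy is an assumption about how ties are resolved.

```python
import hashlib
from collections import defaultdict

PREFIX_BYTES = 2048  # Number of leading bytes to hash, as suggested above.


def prefix_digest(path, size=PREFIX_BYTES):
    """Return the MD5 hex digest of the first `size` bytes of a file."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read(size)).hexdigest()


def split_unique_and_duplicates(paths):
    """Group files by prefix digest and separate first copies from repeats."""
    groups = defaultdict(list)
    for path in paths:
        groups[prefix_digest(path)].append(path)

    unique, duplicates = [], []
    for group in groups.values():
        unique.append(group[0])       # Assumption: first file seen gets the link.
        duplicates.extend(group[1:])  # Remaining copies are listed without links.
    return unique, duplicates
```

Hashing only a prefix keeps the crawl fast, at the cost that two files which share their first 2048 bytes but differ later are treated as duplicates.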

Implementation Details

At work, the Google Mini was implemented as shown below. This approach of placing the Google Mini behind a firewall router allows it to be used with directory level security and user authentication. This layout also became the default foundation for configuring the crawl. 

Securing the Google Mini

From right to left:

 
   
