Rusty Japikse
My Life in Bits
Configuring the Google Mini Crawl
Background
The Google Mini indexes files through HTTP crawls. While it can index many file types, every file must first be located through link crawling. At a simple level, enabling directory browsing on the server provides links to files. The Google Mini then discovers files by starting from an initial directory listing (e.g. the Google Mini starts crawling at http://fileserver.acmecorp.com/). Which files get indexed can be controlled on the Google Mini with regular expressions that limit the crawl by file extension.
The Problem
The primary problem that I encountered was that the Google Mini would also index the directory listings. Indexing these directory listings greatly decreased search precision and recall. The listings are HTML like any other web page, but their URLs do not end in .htm*. Using a regular expression to block the indexing of URLs that do not end with an extension would address this problem, but I believe it would also block the crawling of the file server, since the directory listings are the only source of links to the files.
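The conflict can be illustrated with a small Python sketch. It uses Python's `re` module, not the Google Mini's own pattern syntax, and the extension list and the `classify` helper are assumptions made up for the illustration.

```python
import re

# Hypothetical allow-list of extensions; adjust to the content being indexed.
INDEXABLE = re.compile(r"\.(pdf|doc|xls|ppt|txt|html?)$", re.IGNORECASE)

def classify(url):
    """Return 'file' if the URL path ends in an allowed extension, else 'listing'."""
    path = url.split("?", 1)[0]  # ignore any query string
    return "file" if INDEXABLE.search(path) else "listing"

urls = [
    "http://fileserver.acmecorp.com/",
    "http://fileserver.acmecorp.com/reports/",
    "http://fileserver.acmecorp.com/reports/q3.pdf",
]
for url in urls:
    print(classify(url), url)
```

Every directory URL falls into the "listing" group, so a rule that rejects that group would also reject the only pages that link to the files.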
Note: It may be possible to work around this problem using only the Google Mini configuration. I investigated some possible solutions, but did not chase down every approach.
The Solution
A file system crawler was written in Python. This crawler (CrawlFS.py) crawls servers and generates an HTML file that can be used as the starting point for the Google Mini crawl. This HTML file is typically served as the default document in the root of the initial crawl directory.
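The general idea can be sketched as follows. This is not the actual CrawlFS.py, only a minimal Python 3 illustration; the function name `build_index`, its parameters, and the `index.html` output name are all assumptions.

```python
import html
import os
from urllib.parse import quote

def build_index(root, base_url, index_name="index.html"):
    """Walk `root` and return an HTML page that links to every file found.

    `base_url` is the URL under which `root` is served (assumed to end in "/").
    `index_name` is skipped so the generated page does not link to itself.
    """
    items = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if name == index_name:
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel_url = quote(rel.replace(os.sep, "/"))  # percent-encode the path
            items.append('<li><a href="%s%s">%s</a></li>'
                         % (html.escape(base_url), rel_url, html.escape(rel)))
    return "<html><body><ul>\n%s\n</ul></body></html>\n" % "\n".join(items)

# Example use (paths and URL are placeholders):
# page = build_index("/srv/files", "http://fileserver.acmecorp.com/")
# with open("/srv/files/index.html", "w", encoding="utf-8") as fh:
#     fh.write(page)
```

Because the page links directly to each file, the Google Mini never needs to follow a server-generated directory listing.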
Other Niceties:
Since I was already going to the trouble of writing a file system crawler, adding a few features also made sense. Most of these are very minor and are easily seen by reading the programming comments in CrawlFS.py.
The one neat feature worth mentioning is the duplicate file checker. The duplicate file checker reads the first x bytes of a file (2048 seems to work well), computes an MD5 hash and stores it in memory. When the file system crawl is finished, the duplicate files are identified. All of the files discovered are listed in the output HTML file, with hyperlinks only given to unique files. This should help to improve search precision.
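A minimal Python 3 sketch of this kind of check is shown below. It is not the code from CrawlFS.py. The function names are hypothetical, and it assumes that the first file seen with a given hash counts as the unique one while later matches are treated as duplicates.

```python
import hashlib

PREFIX_BYTES = 2048  # the prefix length suggested in the text

def prefix_digest(path, nbytes=PREFIX_BYTES):
    """Return the MD5 hex digest of the first `nbytes` bytes of a file."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read(nbytes)).hexdigest()

def split_unique_and_duplicates(paths):
    """Return (unique, duplicates).

    `unique` holds the first path seen for each digest; `duplicates` holds
    (path, original) pairs for every later path with the same digest.
    """
    seen = {}  # digest -> first path with that digest
    unique, duplicates = [], []
    for path in paths:
        digest = prefix_digest(path)
        if digest in seen:
            duplicates.append((path, seen[digest]))
        else:
            seen[digest] = path
            unique.append(path)
    return unique, duplicates
```

Hashing only a prefix keeps the crawl fast, but two files that share their first 2048 bytes and differ later would be reported as duplicates, so the prefix length is a trade-off between speed and accuracy.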
Implementation Details
At work, the Google Mini was implemented as shown below. This approach of placing the Google Mini behind a firewall router allows it to be used with directory-level security and user authentication. This layout also became the default foundation for configuring the crawl.

From right to left:
- Google Mini: starting with the Google Mini, configure it to crawl each non-cross-linked file server. If you do not configure it to start crawling on a particular server (say server b), but instead rely upon that server being discovered through hyperlinks to it (from server a), be sure not to block crawling of server b in the Google Mini configuration interface. I know that sounds very self-evident, but...
  Also, for each server that the Google Mini starts crawling on, place a directory listing file (as generated by CrawlFS.py) in the directory root and configure your server to serve it as the default document.
- Firewall: for the purposes of the crawl, the firewall is more or less irrelevant.
- Secure Proxy: the secure proxy is optional. In our setup, we were unable to configure the Google Mini to correctly authenticate with each file server that it was to crawl (the file servers were Windows 2000 and Windows NAS machines with integrated authentication turned on). So an alternate solution was found: the NTLM Authorization Proxy Server. This proxy server was installed on the same machine that provided the secure proxy to the Google Mini. The NTLM proxy server provides authorized, proxied connections to the file servers. For security, this proxy was configured to allow connections only from the Linksys box.
- Not Shown (but essential): the Python script CrawlFS.py can be installed on almost any computer connected to the network. It would make the most sense to place it on each of the servers so that network bandwidth is not used to transfer file segments (e.g. the first 2048 bytes for duplicate file checking). Unfortunately, that was not possible in our situation, so CrawlFS.py was placed on the authenticating server. The script was then run on a schedule, several hours before the Google Mini starts its twice-weekly crawl.
- Download CrawlFS.py (CrawlFS.py).
- Download HyperText (a required Python module).
- Download a Python-based secure proxy to the Google Mini (CGI based) (gproxy.py).
- You may also find it useful to read about gproxy.py, a Python script for creating a secure proxy to a Google Mini.
(http://japikse.com/resources/scripting/crawlfs.html)