Google Search Appliance

Office of Information Technology

Diagnosing Index Problems

Periodically check your Crawl Diagnostics (in the Status and Reports section) for "runaway" crawling on your servers. The Search Appliance will crawl a Web site indefinitely if the site allows deeply recursive linking. A problematic URL has the same directories repeated in sequence. The Search Appliance treats each repetition as a new subdirectory containing new pages to crawl, when in fact the same pages are being served at each level.
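
As a rough illustration of what "the same directories repeated in sequence" looks like, the sketch below flags URLs whose path contains a directory sequence repeated back to back. It is not part of the Search Appliance; the function names, the example host, and the default threshold of three repetitions are assumptions chosen for the example.

```python
from urllib.parse import urlsplit


def longest_directory_repeat(url):
    """Return the largest number of back-to-back repetitions of any
    sequence of path segments in the URL (1 means no repetition)."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    n = len(segments)
    best = 1 if segments else 0
    # Try every block size and start position, then count how many
    # times that block repeats immediately after itself.
    for size in range(1, n // 2 + 1):
        for start in range(0, n - 2 * size + 1):
            block = segments[start:start + size]
            count = 1
            pos = start + size
            while segments[pos:pos + size] == block:
                count += 1
                pos += size
            best = max(best, count)
    return best


def looks_like_runaway(url, min_repeats=3):
    """Hypothetical heuristic: treat min_repeats or more consecutive
    repetitions of a directory sequence as a sign of runaway crawling."""
    return longest_directory_repeat(url) >= min_repeats
```

For example, `looks_like_runaway("http://www.example.edu/a/b/a/b/a/b/page.html")` is true because the sequence `a/b` appears three times in a row, while an ordinary URL such as `http://www.example.edu/docs/guide/index.html` is not flagged.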

  1. In the Crawl Diagnostics for your collection, click the Crawled URLs column header to put the servers with the highest number of crawled documents at the top of the list.
  2. Look for an unusually high number of crawled URLs, and then follow the link for the corresponding host name.
  3. Proceed in this manner down the directory hierarchy: at each level, find the directory on the host (server) with the highest successful crawl count and follow its link. Stop at the last page or directory that still has an unusually large crawl count (usually on the order of 100,000 or 1,000,000).
  4. Follow the link in the File column for a page in this directory to view its crawl diagnostics.
  5. Follow the link to view a list of crawled pages linking to that page. This list usually indicates which pages are linking improperly to neighboring pages.
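
The drill-down in steps 1 through 3 can also be approximated offline. The sketch below assumes you have a plain list of crawled URLs available (how you obtain such a list is not covered here); the function names and the threshold parameter are hypothetical and are not part of the Search Appliance.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_host(urls):
    """Count crawled URLs per host, like sorting by Crawled URLs."""
    return Counter(urlsplit(u).netloc for u in urls)


def count_child_directories(urls, host, parent="/"):
    """Count crawled URLs under each immediate subdirectory of parent."""
    if not parent.endswith("/"):
        parent += "/"
    counts = Counter()
    for u in urls:
        parts = urlsplit(u)
        if parts.netloc != host or not parts.path.startswith(parent):
            continue
        rest = parts.path[len(parent):]
        if "/" in rest:  # skip files that sit directly in parent
            counts[parent + rest.split("/", 1)[0] + "/"] += 1
    return counts


def deepest_busy_directory(urls, host, threshold):
    """Follow the busiest subdirectory at each level for as long as its
    count stays at or above threshold; return the last such directory."""
    current = "/"
    while True:
        children = count_child_directories(urls, host, current)
        if not children:
            return current
        directory, count = children.most_common(1)[0]
        if count < threshold:
            return current
        current = directory
```

Calling `count_by_host(urls).most_common()` lists hosts from most to fewest crawled URLs, and `deepest_busy_directory(urls, host, threshold)` returns the directory where the large counts stop, which is the place to start looking at individual pages as in step 4.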

If you find a runaway crawl on one of your Web sites, please inform the Support Team so we can either add a URL pattern to exclude the problematic pages or contact the page authors to correct the problem.

© Regents of the University of Minnesota. All rights reserved.
The University of Minnesota is an equal opportunity educator and employer.

Last modified 2005-12-08 13:27:58 CST · Retrieved 2009-11-25 17:08:19 CST · URL http://www.umn.edu/google/managers/indexproblems.html