Google Search Appliance

Office of Information Technology

Tips for Web Server Administrators


Reducing the crawler's load on your Web server

We have two ways to reduce the crawl load on your server, depending on whether you are having problems with instantaneous load (crawl intensity) or too much crawling overall (crawl frequency).

Please note that if we crawl your Web site less frequently, it could result in slightly more outdated search results for your Web pages.

Before you request crawl reduction, check your Web server logs for infinite URL recursion, because the problem may be on your server.

Excluding parts of your Web site from the search index

The Search Appliance obeys the exclusion parameters you place in the document "/robots.txt" on your Web server, at the document root level. This file, if present, must be served with an HTTP 200 (OK) status code in order for the Search Appliance to use it. An HTTP 404 (Not Found) status code simply indicates that your site is to be crawled as usual because robots.txt does not exist. If a request for "/robots.txt" returns an error code other than 404, then your site will not be crawled, and its documents will not appear in search results.
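
The status-code rules above can be summarized as a small decision function. This is only a sketch of the behavior described in this section, not the Search Appliance's actual logic; the function name and the returned labels are hypothetical, and status codes that the text does not mention (for example redirects) are reported as not covered rather than guessed.

```python
def robots_decision(status_code: int) -> str:
    """Map the HTTP status of a GET /robots.txt request to the crawl
    behavior described in the text. Labels are illustrative only."""
    if status_code == 200:
        return "obey-rules"       # robots.txt exists; its exclusions are applied
    if status_code == 404:
        return "crawl-as-usual"   # no robots.txt; the site is crawled normally
    if 400 <= status_code <= 599:
        return "do-not-crawl"     # any other error code; the site is skipped
    return "not-covered"          # e.g. 3xx; the text does not say what happens


for code in (200, 404, 403, 500):
    print(code, robots_decision(code))
```

In practice this means a misconfigured server that answers "/robots.txt" with, say, 403 or 500 would keep the whole site out of the search results, so it is worth requesting that URL yourself and checking the status code.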


Checking your Web server logs for URL recursion

URL recursion can happen when your Web server allows multiple URLs to fetch the same document, where one URL makes the document appear deeper in the directory structure. When relative links are used to link to neighboring documents on the server, the Search Appliance interprets this as finding more documents in successively deeper directories, often leading to exponential crawling and growth of our search index. This incurs unnecessary load on both the Search Appliance and your Web server.

You can spot URL recursion in your server logs fairly easily. The most obvious sign is a set of requests with repetitive sequences of directories becoming progressively longer. Here is a simplified example:

	gsa-crawler	GET	/dept/index.html
	gsa-crawler	GET	/dept/support/index.html
	gsa-crawler	GET	/dept/support/dept/index.html
	gsa-crawler	GET	/dept/support/dept/support/index.html
	...
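
If your logs are large, a short script can look for this pattern for you. The following is a rough heuristic, not a tool we provide: it flags any request path in which a run of one or more directories is immediately followed by the same run again. The function name and the sample paths are assumptions made for illustration, and you would need to extract the request paths from your own log format first.

```python
def has_repeated_directories(path: str) -> bool:
    """Return True if some sequence of directories in `path` is
    immediately repeated, e.g. /dept/support/dept/support/."""
    # Drop the final component (file name, or empty after a trailing slash)
    # and ignore the empty string before the leading slash.
    dirs = [part for part in path.split("/")[:-1] if part]
    n = len(dirs)
    for size in range(1, n // 2 + 1):          # length of the repeated run
        for start in range(n - 2 * size + 1):  # where the first copy begins
            first = dirs[start:start + size]
            second = dirs[start + size:start + 2 * size]
            if first == second:
                return True
    return False


# Hypothetical request paths, modeled on the simplified log above.
paths = [
    "/dept/index.html",
    "/dept/support/index.html",
    "/dept/support/dept/index.html",
    "/dept/support/dept/support/index.html",
]
for p in paths:
    print(has_repeated_directories(p), p)
```

With these sample paths only the last one is flagged, because it is the first in which the whole "dept/support" sequence appears twice in a row. Legitimate sites can contain repeated directory names, so treat a match as a hint to inspect the log by hand rather than as proof of recursion.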

To fix this, check for recursive symbolic links in the filesystem of your Web server. If you must use such filesystem links, then fix the links within your Web pages so that they use site-absolute references instead of relative references. For example, use "/dept/support/index.html" instead of "support/index.html".
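
To see why site-absolute references stop the recursion, it helps to resolve both kinds of link the way a crawler would. The sketch below uses Python's standard `urllib.parse.urljoin`; the host name `www.example.edu` is a placeholder, and the page URL is assumed to be one that was reached through a recursive filesystem link.

```python
from urllib.parse import urljoin

# A page reached through a recursive link (placeholder host name).
page = "http://www.example.edu/dept/support/dept/index.html"

# A relative link is resolved against the page's own directory,
# so it points one level deeper into the recursive structure.
relative = urljoin(page, "support/index.html")

# A site-absolute link is resolved against the document root,
# so it always points to the same URL regardless of the page's depth.
absolute = urljoin(page, "/dept/support/index.html")

print(relative)  # http://www.example.edu/dept/support/dept/support/index.html
print(absolute)  # http://www.example.edu/dept/support/index.html
```

Each relative link found on a recursively reached page yields a new, deeper URL to crawl, while the site-absolute link leads back to a URL the crawler has already seen.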

Your server should also return HTTP 404 status codes for pages that do not exist, instead of returning a success code and a page of links (the Search Appliance would interpret the latter as a new page to crawl).

Reporting duplicate DNS names for the same Web site

Many Web sites can be reached using both a www and non-www version of the host name in the URL. This can lead to unnecessary document duplication in the Search Appliance index because each unique URL is considered a unique document. It also incurs unnecessary load on your server. We can avoid this by notifying the Search Appliance of duplicate DNS host names.

Ideally, a server will redirect ("URL rewrite") the nonstandard versions to the preferred, canonical version. For example, this server (central Web Hotel) will respond to the URL "http://umn.edu/google" by returning an HTTP 302 (Found) status code and a redirect header "Location: http://www1.umn.edu/google/". In this case, we would not need to notify the Search Appliance of duplicate hosts, because it does not index URLs that return a redirect or error status code instead of a document.
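
The redirect behavior described above can be sketched as a small function that decides whether a requested URL should be redirected to the canonical host. This is an illustration only, not the configuration of any particular Web server: the function name, the host names `example.edu` and `www.example.edu`, and the choice of status code 302 (taken from the example above) are all assumptions, and a real server would normally do this with its own rewrite or redirect directives.

```python
from typing import Optional, Set, Tuple
from urllib.parse import urlsplit, urlunsplit


def canonical_redirect(url: str, canonical_host: str,
                       aliases: Set[str]) -> Optional[Tuple[int, str]]:
    """Return (status, Location) if `url` uses a known alias host,
    or None if no redirect is needed. Ports are ignored in this sketch."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in aliases and host != canonical_host:
        # Keep scheme, path, and query; only the host name changes.
        target = urlunsplit(
            (parts.scheme, canonical_host, parts.path or "/", parts.query, "")
        )
        return 302, target
    return None


aliases = {"example.edu"}  # hypothetical non-canonical host names
print(canonical_redirect("http://example.edu/google/", "www.example.edu", aliases))
print(canonical_redirect("http://www.example.edu/google/", "www.example.edu", aliases))
```

With a rule like this in place, every alias URL answers with a redirect instead of a document, so only the canonical URLs end up in the index.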

© Regents of the University of Minnesota. All rights reserved.
The University of Minnesota is an equal opportunity educator and employer.

Last modified 2008-05-26 22:19:38 CDT · Retrieved 2008-10-07 09:56:29 CDT · URL http://www.umn.edu/google/serveradmins.html