Tips for Web Server Administrators
How we can help you:
- Your crawler is hitting my server too hard! How do I request a lower crawl frequency?
- How do I keep [parts of] my website out of the search index?
How you can help us:
- Check your Web logs for URL recursion
- Notify us if you have multiple DNS names pointing to the same website
Reducing the crawler's load on your Web server
We have two ways to reduce the crawl load on your server, depending on whether you are having problems with instantaneous load (crawl intensity) or too much crawling overall (crawl frequency).
Please note that if we crawl your Web site less frequently, it could result in slightly more outdated search results for your Web pages.
Before you request crawl reduction, check your Web server logs for infinite URL recursion — the problem may be on your server.
Excluding parts of your Web site from the search index
The Search Appliance obeys the exclusion parameters
you place in the document "/robots.txt" on your Web server,
at the document root level.
This file, if present, must be served with an HTTP 200 (OK) status code
in order for the Search Appliance to use it.
An HTTP 404 (Not Found) status code simply indicates
that your site is to be crawled as usual because robots.txt does not exist.
If a request for "/robots.txt" returns an error code other than 404,
then your site will not be crawled, and its documents will not appear in search results.
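The status-code rules above can be summarized in a few lines of Python. This is only a sketch of the behavior described here, not the Search Appliance's actual code; the function name and the returned labels are made up for illustration, and codes the text does not mention (such as redirects) are reported as unspecified.

```python
def robots_txt_policy(status_code: int) -> str:
    """Map the HTTP status of a "/robots.txt" request to a crawl decision.

    The labels returned here are hypothetical names for this sketch.
    """
    if status_code == 200:
        return "obey-rules"   # file is served; its exclusions are used
    if status_code == 404:
        return "crawl-all"    # no robots.txt exists; crawl as usual
    if status_code >= 400:
        return "skip-site"    # any other error code: site is not crawled
    return "unspecified"      # e.g. redirects; not covered by the text above
```

For example, a server that answers "/robots.txt" with a 403 or 500 falls into the "skip-site" case, so it is worth confirming that a missing file really produces a 404.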
Checking your Web server logs for URL recursion
URL recursion can happen when your Web server allows multiple URLs to fetch the same document, where one URL makes the document appear deeper in the directory structure. When relative links are used to link to neighboring documents on the server, the Search Appliance interprets this as finding more documents in successively deeper directories, often leading to exponential crawling and growth of our search index. This incurs unnecessary load on both the Search Appliance and your Web server.
You can spot URL recursion in your server logs fairly easily. The most obvious sign is a set of requests with repetitive sequences of directories becoming progressively longer. Here is a simplified example:
gsa-crawler GET /dept/index.html
gsa-crawler GET /dept/support/index.html
gsa-crawler GET /dept/support/dept/index.html
gsa-crawler GET /dept/support/dept/support/index.html
...
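If your logs are large, a short script can do a first pass for you. The Python sketch below assumes the simplified three-field format shown above (agent, method, path); real log formats differ, so the field positions would need adjusting. It uses a rough heuristic, flagging any request whose path contains the same directory name more than once, which can also flag legitimate paths. All function names are hypothetical.

```python
from collections import Counter


def directory_segments(path):
    """Return the directory names in a URL path, excluding the document name."""
    parts = path.split("/")
    # The last element is the document name (or "" when the path ends in "/").
    return [p for p in parts[:-1] if p]


def looks_recursive(path):
    """Heuristic: True if any directory name occurs more than once in the path."""
    counts = Counter(directory_segments(path))
    return any(n > 1 for n in counts.values())


def suspicious_requests(log_lines):
    """Return the paths of log lines whose request path looks recursive."""
    flagged = []
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # not in the assumed "agent method path" format
        path = fields[2]
        if looks_recursive(path):
            flagged.append(path)
    return flagged


SAMPLE_LOG = [
    "gsa-crawler GET /dept/index.html",
    "gsa-crawler GET /dept/support/index.html",
    "gsa-crawler GET /dept/support/dept/index.html",
    "gsa-crawler GET /dept/support/dept/support/index.html",
]
```

Calling suspicious_requests(SAMPLE_LOG) flags the last two requests, where "dept" appears twice in the path.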
To fix this, check for recursive symbolic links in the filesystem
of your Web server. If you must use such filesystem links,
then fix the links within your Web pages so that they use site-absolute
references instead of relative references.
For example, use "/dept/support/index.html"
instead of "support/index.html".
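To illustrate the difference, the following Python sketch resolves a relative reference against the path of the page that contains it and returns the site-absolute form. It assumes the link is a plain path with no scheme, host, query string, or fragment; the function name is hypothetical.

```python
import posixpath


def to_site_absolute(page_path: str, href: str) -> str:
    """Resolve a relative link against the page that contains it.

    page_path is the site-absolute path of the page, e.g. "/dept/index.html".
    href is the link as written in the page, e.g. "support/index.html".
    """
    if href.startswith("/"):
        return href  # already site-absolute
    base_dir = posixpath.dirname(page_path)
    # Join with the page's directory, then collapse any "." or ".." segments.
    return posixpath.normpath(posixpath.join(base_dir, href))
```

With this sketch, to_site_absolute("/dept/index.html", "support/index.html") gives "/dept/support/index.html", which is the form to write into the page so that the link resolves to the same place no matter which URL was used to fetch the page.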
Your server should also return HTTP 404 status codes for pages that do not exist, instead of returning a success code and a page of links (the Search Appliance would interpret the latter as a new page to crawl).
Reporting duplicate DNS names for the same Web site
Many Web sites can be reached using both a www and non-www version
of the host name in the URL.
This can lead to unnecessary document duplication in the Search Appliance index
because each unique URL is considered a unique document.
It also incurs unnecessary load on your server.
We can avoid this by notifying the Search Appliance of duplicate DNS host names.
Ideally, a server will redirect ("URL rewrite") the nonstandard versions
to the preferred, canonical version.
For example, this server (central Web Hotel) will respond to the URL
"http://umn.edu/google"
by returning an HTTP 302 (Found) redirect status code
and a redirect header "Location: http://www1.umn.edu/google/".
In this case, we would not need to notify the Search Appliance
of duplicate hosts; it does not index URLs that return a redirect or an error status code instead of a document.
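The redirect logic itself is small. The Python sketch below shows the idea under stated assumptions: the host names come from the example above, the set of alias hosts is illustrative, and the function name is hypothetical. It leaves the path unchanged, so it does not reproduce the trailing slash added in the example, and in practice this is usually done in the Web server's own rewrite configuration rather than in application code.

```python
CANONICAL_HOST = "www1.umn.edu"  # preferred host name, from the example above
ALIAS_HOSTS = {"umn.edu"}        # illustrative set of nonstandard host names


def canonical_redirect(host, path, status=302):
    """Return (status, headers) for a redirect, or None if none is needed."""
    if host.lower() in ALIAS_HOSTS:
        location = f"http://{CANONICAL_HOST}{path}"
        return status, {"Location": location}
    return None  # request already uses the canonical host
```

A request for "/google/" on host "umn.edu" yields a 302 with "Location: http://www1.umn.edu/google/", while the same request on "www1.umn.edu" is served normally.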