WWW Search Engine Software

ht://Dig Copyright © 1995-1999 The ht://Dig Group
Please see the file COPYING for license information.

Introduction

The ht://Dig system is a complete World Wide Web indexing and searching system for a small domain or intranet. It is not meant to replace powerful internet-wide search systems like Lycos, Infoseek, Webcrawler and AltaVista. Instead, it is meant to cover the search needs of a single company, campus, or even a particular subsection of a web site.
Unlike some WAIS-based or web-server-based search engines, ht://Dig can span several web servers at a site. The type of these web servers doesn't matter as long as they understand the HTTP/1.0 protocol.

ht://Dig was developed at San Diego State University as a way to search the various web servers on the campus network.

Many different types of searches can be set up using only a single search database. For example, on the SDSU network the online documentation search uses the same database as the campus main search. The difference between the searches is that the documentation search only shows results related to the online documentation.

Features

Here are some of the major features of ht://Dig. They are in no particular order.

Intranet searching

ht://Dig has the ability to search through many servers on a network by acting as a WWW browser.

It is free

The whole system is released under the GNU General Public License.

Robot exclusion is supported

The Standard for Robot Exclusion is supported by ht://Dig.
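As an illustration of what honoring the Standard for Robot Exclusion involves, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It is not ht://Dig's own code; the robots.txt content, the host name and the user agent name `example-indexer` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real indexer would download it from
# http://<host>/robots.txt before fetching anything else from that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/private/notes.html"):
        print(url, allowed(rules, "example-indexer", url))
```

An indexer that follows the standard simply skips every URL for which such a check fails.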

Boolean expression searching

Searches can be arbitrarily complex using boolean expressions.
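To show how a boolean expression can be evaluated against a word index, here is a small Python sketch. The grammar (`and`, `or`, a binary `not` meaning "but not", and parentheses) and the in-memory index are assumptions made for the example; ht://Dig's actual query syntax and database format may differ.

```python
import re


def tokenize(query: str) -> list[str]:
    """Split a query into lowercase words and parentheses."""
    return re.findall(r"\(|\)|[^\s()]+", query.lower())


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Evaluate a boolean query against a word -> documents index."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of query")
        pos += 1
        return token

    def factor():
        # factor := WORD | "(" expr ")"
        token = take()
        if token == "(":
            result = expr()
            if take() != ")":
                raise ValueError("expected ')'")
            return result
        if token in (")", "and", "or", "not"):
            raise ValueError(f"unexpected token {token!r}")
        return set(index.get(token, set()))

    def term():
        # term := factor (("and" | "not") factor)*
        result = factor()
        while peek() in ("and", "not"):
            op = take()
            right = factor()
            result = result & right if op == "and" else result - right
        return result

    def expr():
        # expr := term ("or" term)*
        result = term()
        while peek() == "or":
            take()
            result |= term()
        return result

    result = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


# Hypothetical index: each word maps to the documents that contain it.
INDEX = {
    "search": {"a.html", "b.html"},
    "engine": {"a.html", "c.html"},
    "robot": {"c.html"},
}

if __name__ == "__main__":
    print(sorted(search(INDEX, "search and engine")))
    print(sorted(search(INDEX, "(search or engine) not robot")))
```

Because each operator maps onto a set operation, expressions can be nested to any depth without special cases.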

Configurable search results

The output of a search can easily be tailored to your needs by means of providing HTML templates.

Fuzzy searching

Searches can be performed using various configurable algorithms. Currently the following algorithms are supported (in any combination):

exact

soundex

metaphone

common word endings

synonyms
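As an example of how one of these algorithms works, the following Python sketch implements the classic American Soundex code and uses it to find similar-sounding words. It illustrates the general technique only; ht://Dig's own soundex implementation may differ in its details, and the vocabulary is invented.

```python
# Letter groups of the classic American Soundex algorithm.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the four-character Soundex code of a word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit:
            # Adjacent letters with the same digit are coded only once.
            if digit != previous:
                code += digit
            previous = digit
        elif ch not in "hw":
            # Vowels separate repeated digits; 'h' and 'w' do not.
            previous = ""
        if len(code) == 4:
            break
    return code.ljust(4, "0")


def fuzzy_matches(query: str, vocabulary: list[str]) -> list[str]:
    """Return the vocabulary words that sound like the query word."""
    target = soundex(query)
    return [word for word in vocabulary if soundex(word) == target]


if __name__ == "__main__":
    print(soundex("Robert"), soundex("Rupert"))  # R163 R163
    print(fuzzy_matches("Rupert", ["Robert", "Rubin", "Richard"]))
```

In a search engine the codes would typically be computed once at indexing time, so a query only needs a lookup by code.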

Searching of HTML and text files

Both HTML documents and plain text files can be searched. Searching of other file types will be supported in future versions.

Keywords can be added to HTML documents

Any number of keywords can be added to HTML documents; they will not show up when the document is viewed. This is used to make a document more likely to be found and also to make it appear higher in the list of matches.
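The Python sketch below shows one way such hidden keywords could be read from a document, assuming they are stored in a META tag in the document head. The tag name `keywords` is an assumption for the example; check the ht://Dig documentation for the names it actually recognizes.

```python
from html.parser import HTMLParser


class KeywordExtractor(HTMLParser):
    """Collect the words listed in selected META tags of an HTML document."""

    def __init__(self, tag_names=("keywords",)):
        super().__init__()
        # Assumed META names; not taken from the ht://Dig sources.
        self.tag_names = {name.lower() for name in tag_names}
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in self.tag_names:
            content = attributes.get("content") or ""
            # Accept both comma-separated and space-separated keywords.
            self.keywords.extend(content.replace(",", " ").split())


def extract_keywords(html_text: str) -> list[str]:
    """Return the keywords declared in the document's META tags."""
    parser = KeywordExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.keywords


if __name__ == "__main__":
    page = ('<html><head><META NAME="keywords" CONTENT="search, indexing robot">'
            '</head><body>Visible text only.</body></html>')
    print(extract_keywords(page))
```

An indexer could then give these words extra weight so the document ranks higher for them.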

Email notification of expired documents

Special meta information can be added to HTML documents which can be used to notify the maintainer of those documents at a certain time. For example, it is handy to be reminded when to remove the "New" images from a page.
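The idea can be sketched in Python as follows. The field names `notify-email` and `notify-date`, the ISO date format and the sample documents are all hypothetical; they stand in for whatever meta information the indexer has collected and are not ht://Dig's actual tag names.

```python
from datetime import date


def due_notifications(documents: dict[str, dict[str, str]],
                      today: date) -> list[tuple[str, str]]:
    """Return (email, url) pairs for documents whose notification date has passed."""
    due = []
    for url, meta in documents.items():
        email = meta.get("notify-email")   # hypothetical field name
        when = meta.get("notify-date")     # hypothetical field, YYYY-MM-DD
        if not email or not when:
            continue
        if date.fromisoformat(when) <= today:
            due.append((email, url))
    return sorted(due)


# Hypothetical meta information gathered while indexing.
DOCUMENTS = {
    "http://www.example.com/news.html": {
        "notify-email": "webmaster@example.com",
        "notify-date": "1999-03-01",
    },
    "http://www.example.com/about.html": {},
}

if __name__ == "__main__":
    for email, url in due_notifications(DOCUMENTS, date(1999, 4, 1)):
        print(f"remind {email} about {url}")
```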

A protected server can be indexed

ht://Dig can be told to use a specific username and password when it retrieves documents. This can be used to index a server or parts of a server that are protected by a username and password.
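For illustration, here is how a client can present a username and password when retrieving a document, assuming the server uses HTTP Basic authentication (the text above does not name the scheme). This is a generic Python sketch, not ht://Dig code; the URL and credentials are placeholders.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic 'Authorization' header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a request carrying the credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


if __name__ == "__main__":
    # Placeholder URL and credentials.
    request = build_request("http://www.example.com/protected/", "Aladdin", "open sesame")
    print(request.get_header("Authorization"))
```

Basic authentication only encodes the credentials and does not encrypt them, so they should be treated as readable by anyone who can observe the traffic.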

Searches on subsections of the database

It is easy to set up a search which only returns documents whose URL matches a certain pattern. This becomes very useful for people who want to make their own data searchable without having to use a separate search engine or database.
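A minimal Python sketch of this kind of restriction is shown below. It assumes that a "pattern" is a plain substring of the URL, which is a simplification chosen for the example; the URLs are invented.

```python
def restrict(results: list[str], patterns: list[str]) -> list[str]:
    """Keep only the result URLs that contain at least one of the patterns."""
    if not patterns:
        return list(results)  # no restriction requested
    return [url for url in results if any(p in url for p in patterns)]


# Hypothetical matches from one shared search database.
RESULTS = [
    "http://www.example.edu/index.html",
    "http://www.example.edu/docs/install.html",
    "http://www.example.edu/~jane/papers.html",
]

if __name__ == "__main__":
    # A "documentation search" and a "personal pages search" over the same data.
    print(restrict(RESULTS, ["/docs/"]))
    print(restrict(RESULTS, ["/~jane/"]))
```

Because the filter is applied to results from the shared database, each subsection search needs only its own search form, not its own index.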

Full source code included

The search engine comes with full source code. The whole system is released under the terms and conditions of the GNU General Public License version 2.

The depth of the search can be limited

Instead of limiting the search to a set of machines, it can also be restricted to documents that are a certain number of "mouse-clicks" away from the start document.
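The effect of such a limit can be sketched with a breadth-first traversal that records how many links separate each page from the start document. The link graph here is an in-memory stand-in for pages that a real indexer would fetch and parse; the function and variable names are illustrative only.

```python
from collections import deque


def crawl(links: dict[str, list[str]], start: str, max_hops: int) -> dict[str, int]:
    """Return every page within max_hops links of start, with its hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if hops[page] == max_hops:
            continue  # at the limit: index this page but do not follow its links
        for target in links.get(page, []):
            if target not in hops:
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops


# Hypothetical site structure: page -> pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/c": ["/d"],
}

if __name__ == "__main__":
    print(crawl(LINKS, "/", 2))  # "/d" is three clicks away and is left out
```

Breadth-first order matters here: it guarantees that each page is recorded with its shortest distance from the start document.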

Full support for the ISO-Latin-1 character set

Both SGML entities like '&agrave;' and ISO-Latin-1 characters can be indexed and searched.
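To make both spellings of a character searchable as the same word, an indexer can decode entities before storing terms. The Python sketch below uses the standard library's `html.unescape` for this; it shows the idea only, and the function name is made up for the example.

```python
import html


def to_index_term(text: str) -> str:
    """Decode character entities and lowercase the text for indexing."""
    return html.unescape(text).lower()


if __name__ == "__main__":
    from_entity = to_index_term("voil&agrave;")
    from_character = to_index_term("voil\u00e0")  # the same word typed with the ISO-Latin-1 character
    print(from_entity == from_character)
    # The decoded term fits in a single-byte ISO-Latin-1 encoding.
    print(from_entity.encode("latin-1"))
```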

Andrew Scherpbier <andrew@contigo.com>