Roumazeilles.net

Comparison of web site search engines

A selection of open source site search services

 

Why?

When your web site becomes so large that you can no longer easily find information you know is there, your visitors need a tool to help them search for useful data. Of course, it is often possible to leave that task to Google, but you risk seeing your visitor leave immediately for some other site that ranks better than yours. Consequently, you want to add a search feature to your web site (like the one in the top right corner of roumazeilles.net).

Initially (for the last four years), I used the Atomz search engine, but I was prepared to switch to a new solution.

However, not all products are born equal. I decided to find the right open source solution among the wealth of available products. This led to the following comparison table.

Comparison table

Despite their qualities, not all products are equal. I found a lot of them and needed to tell them apart. The important features are:

| Search Engine | Coded in... | Administration | Protocols | Searchable contents | Additional notes |
| --- | --- | --- | --- | --- | --- |
| ASPseek | C++ and STL | Command line | http/ proxy http/ ftp/ https | HTML, text | Weights inbound links. |
| BBDBot | Java | | | | |
| DataparkSearch | | CGI front-end | http/ https/ ftp/ nntp-news | HTML, text, XML, MP3, GIF | Boolean queries. Relevancy, popularity rank. Fuzzy searching. |
| Eureka | CGI in C | | http | | No Boolean query. Documentation is minimal. |
| ht://dig | C++ | Template-based HTML front end | http | HTML, PDF, MS Word, PowerPoint | Good user support. Non-exact matches. Boolean queries. |
| ISearch | | | http | | Weak documentation and support, easy to install. Dated interface. Z39.50 support for distributed search clients. |
| JXTA search | | | http | | JXTA -> distributed. No central index. Designed to be scalable to billions of queries per day. Not really for simple web sites. |
| Lucene | Java | | http | XML, PDF, RTF | Very fast indexing, very small index. Regex, fuzzy and proximity search. Now part of Apache Jakarta. |
| mnoGoSearch (formerly UdmSearch) | C | Front-end systems in PHP, CGI and Perl. User-editable HTML templates for search results. | http/ http proxy/ ftp/ nntp-news | External file format converters for PDF, PostScript and Microsoft Word (.doc). Uncompresses gzip, compress and deflate formats. | Uses an SQL database instead of an index. Queries can include Boolean query operators, options for stemming, synonyms and substrings. |
| Namazu | C and Perl | Command-line driven, you have to do a "make install" | Indexes local files only (no robot), requires wget or another package to index remote files | Text files, email messages, Usenet netnews postings, MHonArc archives, compressed files such as gzip, Microsoft Word, PDF, RFC and TeX formats | Regex support, Boolean queries. |
| Nutch | Java | | http/ ftp | | Based on Lucene, targets full global www indexing. 15 relevance quality adjustment options. Not really for simple web sites. |
| OpenFTS | Perl | | http | Easy to add parsers. | PostgreSQL database. Very fast index updates. |
| PLWeb Turbo (PLS / AOL) | | | http | | No longer really supported. |
| Swish-e | Perl/C | CGIs in Perl and PHP still beta | http | Uses external converters to index binary files including PDF, Microsoft Word, compressed files. | Very rich search functionality. Follows Robots Exclusion Rules. Fuzzy matching, including truncation, stemming, soundex, metaphone, and double-metaphone indexing. Simple to install engine. |
| SWISH++ | C++, STL and GNU make | | Indexes local files, and remote web sites using a robot spider based on wget | Indexes a variety of file types such as mail, news, Unix manual pages, PDFs, PostScript, LaTeX and RTF documents, ID3 tags for MP3 files, and Microsoft Office docs. | Heavily geared for English. Boolean queries. |
| WAIS and freeWAIS | | | http | | Obsolete. Boolean queries. |
| Glimpse & WebGlimpse | | | http | | Free only for non-commercial use. |

 

As you can quickly see from the table, there is no single obvious choice. Depending on your set of constraints, you may find several useful products (and some are definitely useless or outdated).
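Several notes in the table mention Boolean queries and fuzzy searching. To give a rough idea of what these two features mean in practice, here is a minimal Python sketch of a tiny in-memory index. It is only an illustration under my own assumptions, not how any of the engines listed above is implemented: the function names (build_index, lookup, search), the page paths and the 0.8 similarity cutoff are all hypothetical, and the fuzzy part simply relies on the standard difflib module rather than on soundex or metaphone.

```python
import re
from difflib import get_close_matches


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> set of page paths containing it."""
    index = {}
    for path, text in pages.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(path)
    return index


def lookup(index, word, fuzzy=False):
    """Return the pages containing a word, optionally tolerating typos."""
    word = word.lower()
    if word in index or not fuzzy:
        return set(index.get(word, set()))
    pages = set()
    # Assumed cutoff: accept indexed words at least 80% similar to the query word.
    for candidate in get_close_matches(word, list(index), n=3, cutoff=0.8):
        pages |= index[candidate]
    return pages


def search(index, query, fuzzy=False):
    """Evaluate a query such as 'a AND b OR c' as an OR of AND-groups.

    Words separated only by spaces are treated as an implicit AND.
    """
    results = set()
    for group in re.split(r"\s+OR\s+", query.strip()):
        terms = [t for t in re.split(r"\s+AND\s+|\s+", group) if t]
        if not terms:
            continue
        matches = lookup(index, terms[0], fuzzy)
        for term in terms[1:]:
            matches &= lookup(index, term, fuzzy)
        results |= matches
    return results


# Hypothetical pages, invented for this example only.
pages = {
    "/photo/lenses.html": "Reviews of camera lenses and filters",
    "/photo/cameras.html": "Digital camera reviews and tests",
    "/travel/iceland.html": "Travel notes and photographs from Iceland",
}
index = build_index(pages)

print(search(index, "camera AND lenses"))
print(search(index, "lenses OR iceland"))
print(search(index, "icelnad"))              # exact matching finds nothing
print(search(index, "icelnad", fuzzy=True))  # fuzzy matching recovers the typo
```

A real engine adds much more on top of this (a crawler, relevance ranking, an index stored on disk), but when the table says "No Boolean query" or "Fuzzy searching", this is the kind of behaviour that is missing or present.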

Independent and commercial solutions

As I said before, I was previously using the Atomz search engine (recently renamed WebSideStory Search and Content Solutions). So I also checked a few good external solutions, as they may well be interesting.

| Search Engine | Administration | Protocols | Searchable contents | Additional notes |
| --- | --- | --- | --- | --- |
| Atomz / WebSideStory | Extensive web-based interface | http/ https | HTML, PDF, Macromedia Flash, MP3 | The free product inserts advertisements. They can be removed by paying for a license. |
| FreeFind | Web-based interface | http | PDF files for paid versions only | Free, with advertisements but no fixed page limit. |
| picoSearch | Extensive web-based interface | http/ https | HTML, text, XML, MP3, MIDI, Shockwave, Word, Excel, PowerPoint, RTF, PostScript, PDF | Free accounts are limited to 250 pages and include sponsored links. |
| Google AdSense for Search | Extensive web-based interface | http/ https | HTML, PDF | Inserts search-sensitive advertisements, but you get a share of the ad revenue. |

 

Of course, the free versions have some limitations, but it is up to you to choose. Importantly, the coding language does not matter here, since these solutions are hosted elsewhere and can be written in whatever language the provider sees fit.

Other places to find more information.

Conclusions - My choice

Finally, as you have certainly noticed, I chose Google AdSense for Search. My choice was heavily influenced by Google's paid ads: they bring some financial independence to a web site with good content, even one with as few as a few hundred visits per day.

Even more interesting, it is extremely easy to set up, and it has excellent log reports (which help improve the operation of the whole engine and show what your visitors are looking for).

 


http://www.roumazeilles.net/

Copyright (c) 1999-2008 - Yves Roumazeilles (all rights reserved)

Latest update: 30-oct-08
