Comparison of web site search engines

A selection of open source site search services



When your web site grows so large that even you cannot easily find information you know is there, your visitors need a tool to help them search for the data they want. Of course, you can often leave that task to Google, but you risk seeing your visitors leave immediately for some other site that is ranked better than yours. Consequently, you want to add a search feature to your web site (like the one appearing on the top right corner of

For the last four years I had used the Atomz search engine, but I was prepared to switch to a new solution.

However, not all products are born equal. I decided to find the right open source solution among the wealth of available products. This led to the following comparison table.

Comparison table

Whatever their qualities, these products are not equivalent. I found many of them and needed a way to tell them apart. The important features are:

| Search Engine | Coded in... | Interface | Protocols / indexing | Searchable contents | Additional notes |
|---|---|---|---|---|---|
| | C++ and STL | Command line | http, http proxy, ftp, https | HTML, text | Weights inbound links. |
| | | CGI front-end | http, https, ftp, nntp-news | HTML, text, XML, MP3, GIF | Boolean queries. Relevancy, popularity rank. Fuzzy searching. |
| | CGI in C | | | | No Boolean query. Documentation is minimal. |
| | | Template-based HTML front end | | HTML, PDF, MS Word, PowerPoint | Good user support. Non-exact matches. Boolean queries. |
| | | | | | Weak documentation and support, easy to install. Dated interface. Z39.50 support for distributed search clients. |
| JXTA search | | | | | JXTA -> distributed. No central index. Designed to be scalable to billions of queries per day. Not really for simple web sites. |
| | | | | | Very fast indexing, very small index. Regex, fuzzy and proximity search. Now part of Apache Jakarta. |
| mnoGoSearch (formerly UdmSearch) | | Front-end systems in PHP, CGI and Perl. User-editable HTML templates for search results. | http, http proxy, ftp, nntp-news | External file format converters for PDF, PostScript, Microsoft Word .doc. Uncompresses gzip, compress and deflate formats. | Uses an SQL database instead of an index. Queries can include Boolean query operators, options for stemming, synonyms and substrings. |
| | C and Perl | Command-line driven; you have to do a "make install". | Indexes local files only (no robot); requires wget or another package to index remote files. | Text files, email messages, usenet netnews postings, mhonarc archives, compressed files such as gzip, Microsoft Word, PDF, RFC and TeX formats | Regex support, Boolean queries. |
| | | | http, ftp | | Based on Lucene, targets full global www indexing. 15 relevance quality adjustment options. Not really for simple web sites. |
| | | | | | Easy to add parsers. PostgreSQL database. Very fast index updates. |
| PLWeb Turbo (PLS / AOL) | | | | | No longer really supported. |
| | | CGIs in Perl and PHP still beta | | Uses external converters to index binary files including PDF, Microsoft Word, compressed files. | Very rich search functionality. Follows Robots Exclusion Rules. Fuzzy matching, including truncation, stemming, soundex, metaphone and double-metaphone indexing. Simple to install engine. |
| | C++, STL and GNU make | | Indexes local files, and remote web sites using a robot spider based on wget. | Indexes a variety of file types such as mail, news, Unix manual pages, PDFs, PostScript, LaTeX and RTF documents, ID3 tags for MP3 files, and Microsoft Office docs. | Heavily geared for English. Boolean queries. |
| WAIS and freeWAIS | | | | | Boolean queries. |
| Glimpse & WebGlimpse | | | | | Free only for non-commercial use. |


As you can quickly see from the table, there is no single best choice. Depending on your set of constraints, you may find some useful products (and some are definitely useless or outdated).
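Since "Boolean queries" is one of the features that most often separates these products, here is a rough sketch of what the feature involves. It is not taken from any of the engines listed; the three-page site, the function names and the `must` / `any_of` / `must_not` parameters are all invented for illustration. The idea is simply to build a small inverted index (word to pages) and combine page sets with AND, OR and NOT.

```python
import re
from collections import defaultdict

# Hypothetical three-page site, invented for this illustration.
PAGES = {
    "index.html": "Welcome to the photo site with camera reviews",
    "lenses.html": "Lens reviews and camera lens comparison",
    "contact.html": "Contact the author of the site",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, all_pages, must=(), any_of=(), must_not=()):
    """Return pages matching every `must` word, at least one `any_of`
    word (when given), and none of the `must_not` words."""
    result = set(all_pages)
    for word in must:                      # AND
        result &= index.get(word, set())
    if any_of:                             # OR
        hits = set()
        for word in any_of:
            hits |= index.get(word, set())
        result &= hits
    for word in must_not:                  # NOT
        result -= index.get(word, set())
    return sorted(result)


INDEX = build_index(PAGES)
print(search(INDEX, PAGES, must=["camera", "reviews"]))
print(search(INDEX, PAGES, any_of=["lens", "contact"], must_not=["camera"]))
```

An engine without Boolean queries can typically only return pages matching the words as given, so a visitor cannot exclude a term or ask for alternatives. Real engines add ranking, stemming and fuzzy matching on top of this, which the sketch leaves out.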

Independent and commercial solutions

As I said before, I was previously using the Atomz search engine (recently renamed WebSideStory Search and Content Solutions). So I also checked a few good external solutions, as they may well be interesting.

| Search Engine | Interface | Protocols | Searchable contents | Additional notes |
|---|---|---|---|---|
| Atomz / WebSideStory | Extensive web-based interface | http, https | HTML, PDF, Macromedia Flash, MP3 | The free product inserts advertisements, which can be removed by paying for a license. |
| | Web-based interface | | PDF files for paid versions only | Free, with advertisements but no fixed page limit. |
| | Extensive web-based interface | http, https | HTML, text, XML, MP3, MIDI, Shockwave, Word, Excel, PowerPoint, RTF, PostScript, PDF | Free accounts are limited to 250 pages and include sponsored links. |
| Google AdSense for Search | Extensive web-based interface | http, https | | Inserts search-sensitive advertisements, but you get a share of the ad revenue. |


Of course, the free versions have some limitations, but the choice is up to you. Note that the coding language does not matter here, since these solutions are hosted elsewhere and can be written in whatever language the provider sees fit.

Other places to find more information.

Conclusions - My choice

Finally, as you have certainly noticed, I decided to choose Google AdSense for Search. My choice was heavily influenced by Google's paid ads, which bring some financial independence to a web site with good content, even one that gets as few as a few hundred visits per day.

Even more interesting, it is extremely easy to set up, and it has excellent log reports (which help you improve the operation of the whole engine and learn what your visitors are looking for).

Copyright (c) 1999-2008 - Yves Roumazeilles (all rights reserved)

Latest update: 30-oct-08