When your web site becomes so large that even you cannot easily find information that you know is there, your visitors need a tool to help them search for useful data. Of course, you can often leave that task to Google, but you run the risk of seeing your visitors leave immediately for some other site that is ranked better than yours. Consequently, you want to add a search feature to your web site (like the one in the top right corner of roumazeilles.net).
Initially (for the last four years), I used the Atomz search engine, but I was prepared to switch to a new solution.
However, not all products are born equal. I decided to find the right open source solution among the wealth of available products. I found a lot of them and needed a way to tell them apart, which led to the following comparison table. The important features are:
| Search Engine | Coded in... | Administration | Protocols | Searchable contents | Additional notes |
|---|---|---|---|---|---|
| ASPseek | C++ and STL | Command line | http, http proxy, ftp, https | HTML, text | Weights inbound links. |
| BBDBot | Java | | | | |
| DataparkSearch | | CGI front-end | http, https, ftp, nntp-news | HTML, text, XML, MP3, GIF | Boolean queries. Relevancy, popularity rank. Fuzzy searching. |
| Eureka | CGI in C | | http | | No Boolean query. Documentation is minimal. |
| ht://dig | C++ | Template-based HTML front end | http | HTML, PDF, MS Word, PowerPoint | Good user support. Non-exact matches. Boolean queries. |
| ISearch | | | http | | Weak documentation and support, easy to install. Dated interface. Z39.50 support for distributed search clients. |
| JXTA search | | | http | | JXTA -> distributed. No central index. Designed to be scalable to billions of queries per day. Not really for simple web sites. |
| Lucene | Java | | http | XML, PDF, RTF | Very fast indexing, very small index. Regex, fuzzy and proximity search. Now part of Apache Jakarta. |
| mnoGoSearch (formerly UdmSearch) | C | Front-end systems in PHP, CGI and Perl. User-editable HTML templates for search results. | http, http proxy, ftp, nntp-news | External file format converters for PDF, PostScript, Microsoft Word .doc. Uncompresses gzip, compress and deflate formats. | Uses an SQL database instead of an index. Queries can include Boolean query operators, options for stemming, synonyms and substrings. |
| Namazu | C and Perl | Command-line driven; you have to do a "make install" | Indexes local files only (no robot); requires wget or another package to index remote files | Text files, email messages, Usenet netnews postings, MHonArc archives, compressed files such as gzip, Microsoft Word, PDF, RFC and TeX formats | Regex support, Boolean queries. |
| Nutch | Java | | http, ftp | | Based on Lucene, targets full global www indexing. 15 relevance quality adjustment options. Not really for simple web sites. |
| OpenFTS | Perl | | http | Easy to add parsers. | PostgreSQL database. Very fast index updates. |
| PLWeb Turbo (PLS / AOL) | | | http | | No longer really supported. |
| Swish-e | Perl/C. CGIs in Perl and PHP still beta. | | http | Uses external converters to index binary files including PDF, Microsoft Word, compressed files. | Very rich search functionality, including: follows Robots Exclusion Rules; fuzzy matching, including truncation, stemming, soundex, metaphone, and double-metaphone indexing. Simple to install engine. |
| SWISH++ | C++, STL and GNU make | | Indexes local files, and remote web sites using a robot spider based on wget | Indexes a variety of file types such as mail, news, Unix manual pages, PDFs, PostScript, LaTeX and RTF documents, ID3 tags for MP3 files, and Microsoft Office docs. | Heavily geared for English. Boolean queries. |
| WAIS and freeWAIS | | | http | | Obsolete. Boolean queries. |
| Glimpse & WebGlimpse | | | http | | Free only for non-commercial use. |
As you can quickly see from the table, there is no single best solution. Depending on your set of constraints, you may find some useful products (and some are definitely useless or outdated).
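Starting from your own constraints can be sketched in a few lines of Python. This is only an illustration, not a tool I actually used: the `Engine` class and the `shortlist` function are hypothetical names, and the data is a hand-copied subset of the table above, with each cell reduced to lower-case keywords.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    """One row of the comparison table, reduced to keyword sets."""
    name: str
    protocols: frozenset  # "Protocols" column, lower-cased
    contents: frozenset   # "Searchable contents" column, lower-cased


# Hand-transcribed subset of the table; an empty set means the cell is blank.
ENGINES = [
    Engine("ASPseek", frozenset({"http", "ftp", "https"}),
           frozenset({"html", "text"})),
    Engine("ht://dig", frozenset({"http"}),
           frozenset({"html", "pdf", "ms word", "powerpoint"})),
    Engine("Lucene", frozenset({"http"}),
           frozenset({"xml", "pdf", "rtf"})),
    Engine("mnoGoSearch", frozenset({"http", "ftp", "nntp"}),
           frozenset({"pdf", "postscript", "ms word"})),
    Engine("Eureka", frozenset({"http"}), frozenset()),
]


def shortlist(engines, need_protocols=(), need_contents=()):
    """Return the names of engines covering every required protocol and content type."""
    protocols = frozenset(p.lower() for p in need_protocols)
    contents = frozenset(c.lower() for c in need_contents)
    return [
        e.name
        for e in engines
        if protocols <= e.protocols and contents <= e.contents
    ]


if __name__ == "__main__":
    # Which of the transcribed engines crawl over http and index PDF files?
    print(shortlist(ENGINES, need_protocols=["http"], need_contents=["PDF"]))
```

A blank cell in the table only means that I have no information, so an engine left out of such a shortlist is not necessarily unable to do the job.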
As I said before, I was previously using the Atomz search engine (recently renamed WebSideStory Search and Content Solutions). So I also checked a few good externally hosted solutions, as they may well be interesting.
| Search Engine | Administration | Protocols | Searchable contents | Additional notes |
|---|---|---|---|---|
| Atomz / WebSideStory | Extensive web-based interface | http, https | HTML, PDF, Macromedia Flash, MP3 | The free product inserts advertisements, which can be removed by paying for a license. |
| FreeFind | Web-based interface | http | PDF files for paid versions only | Free, with advertisements but no fixed page limit. |
| picoSearch | Extensive web-based interface | http, https | HTML, text, XML, MP3, MIDI, Shockwave, Word, Excel, PowerPoint, RTF, PostScript, PDF | Free accounts are limited to 250 pages and include sponsored links. |
| Google AdSense for Search | Extensive web-based interface | http, https | HTML, PDF | Inserts search-sensitive advertisements, but you get a share of the ad revenue. |
Of course, the free versions have some limitations, but it is up to you to choose. Importantly, the coding language does not matter here, since these solutions are hosted elsewhere and can be written in whatever language the provider deems suitable.
Finally, as you have certainly noticed, I decided to choose Google AdSense for Search. My choice was heavily influenced by Google's paid ads, which bring some financial independence to a web site with good content and as few as a few hundred visits per day.
Even more interesting, it is extremely easy to set up, and it has excellent log reports (which help improve the operation of the whole engine and show what your visitors are looking for).
Copyright (C) 1999-2008 - Yves Roumazeilles (all rights reserved)
Latest update: 23-aug-08