Plucene Search Engine Add-On
TWiki's original search engine is a simple yet powerful tool. However, it cannot
search within attached documents. That has been discussed in many topics in the
Codev web.
Some time ago I found Plucene, which is a Perl port of the Java library Lucene.
This add-on is intended to be a topic and attachment search engine, with Plucene
as its backend.
I would like to thank
TWiki:Main.SopanShewale for his many suggestions and
contributions.
Usage
Indexing with plucindex
The plucindex script indexes all the public webs. It uses some TWiki::Func code
to retrieve the list of available webs and their topic lists. For each topic, the
meta data is inspected and indexed, as is the text body. If the topic has
attachments, those are indexed as well (see below for more details).
For now, you should run this script manually after installation to create the
index files used by plucsearch. If you want, you can also schedule a weekly or
monthly crontab job to rebuild the index files, or run the script manually when
you take down your server for maintenance. To prevent browser access, the script
has been placed outside the public bin folder.
Updating with plucupdate
The plucupdate script examines the modification time of all topics and reindexes
the text and attachments of those that have changed since the last time the
script checked. Changes may occur during a normal TWiki save, rename or attach
operation, or through a plain file operation, e.g. using an external editor.
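The idea of detecting changes by modification time can be pictured with a small shell sketch. This is not how plucupdate itself is written (it is a Perl script); it only illustrates comparing file modification times against the time of a previous run. The function name changed_since, the stamp file and the paths in the comments are assumptions made for this example.

```shell
#!/bin/sh
# Illustrative sketch only: list files under a directory that were modified
# after a timestamp file was last touched. Names and paths are hypothetical.

# changed_since DIR STAMP
#   Prints every regular file below DIR whose modification time is newer
#   than that of STAMP. If STAMP does not exist yet, every file is printed,
#   which corresponds to a first run.
changed_since() {
    dir=$1
    stamp=$2
    if [ -e "$stamp" ]; then
        find "$dir" -type f -newer "$stamp" -print
    else
        find "$dir" -type f -print
    fi
}

# Example use (hypothetical paths):
#   changed_since /path/to/twiki/data /tmp/last_update.stamp
#   touch /tmp/last_update.stamp   # remember when this check ran
```

Because only file modification times are compared, a check of this kind notices edits made with an external editor just as it notices edits saved through the web interface.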
This script should be run from an hourly crontab job. Note that the plucupdate
and plucindex cron jobs should not overlap.
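One possible way to keep the two cron jobs from overlapping is to run both through a small wrapper that takes a shared lock. The sketch below is only a suggestion and is not part of the add-on: the function name run_exclusive, the lock path, the schedule and the installation paths in the comments are all assumptions.

```shell
#!/bin/sh
# Illustrative sketch only: run a command while holding a lock directory so
# that two scheduled jobs never run at the same time. Names are hypothetical.

# run_exclusive LOCKDIR COMMAND [ARGS...]
#   mkdir is atomic, so only one caller can create LOCKDIR. A caller that
#   fails to create it prints a notice and returns 75 without running COMMAND.
run_exclusive() {
    lockdir=$1
    shift
    if mkdir "$lockdir" 2>/dev/null; then
        "$@"
        exit_code=$?
        rmdir "$lockdir"
        return "$exit_code"
    fi
    echo "skipped: $lockdir is held by another job" >&2
    return 75
}

# Example crontab entries (hypothetical paths and times), both using the same
# lock so an hourly update is skipped while the weekly full index is running:
#   0 3 * * 0   . /path/to/run_exclusive.sh; run_exclusive /tmp/plucene.lock /path/to/twiki/tools/plucindex
#   15 * * * *  . /path/to/run_exclusive.sh; run_exclusive /tmp/plucene.lock /path/to/twiki/tools/plucupdate
```

With this approach, a lock directory left behind by a crashed job has to be removed by hand before the next run can start.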
Attachment file types to be indexed
By default, PDF, HTML, Microsoft Word (DOC), Open Document (ODT), Microsoft Excel
(XLS), PowerPoint (PPT) and text attachments are indexed. To override this
setting, use the TWiki preference PLUCENEINDEXEXTENSIONS. The dot before each
extension is required. You can copy and paste the following line into your
TWiki.TWikiPreferences topic, listing whichever extensions you want:
- Set PLUCENEINDEXEXTENSIONS = .pdf, .html, .txt, .doc, .odt, .xls, .ppt
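To make the format of the extension list concrete, here is a minimal shell sketch that checks a file name against a comma-separated list written the same way as PLUCENEINDEXEXTENSIONS. It is not the add-on's own code. The function name has_indexed_extension is hypothetical, and the case-insensitive matching is an assumption of this sketch rather than documented behaviour.

```shell
#!/bin/sh
# Illustrative sketch only: decide whether a file name matches one of the
# extensions in a PLUCENEINDEXEXTENSIONS-style list. Not the add-on's code.

# has_indexed_extension FILENAME EXTENSION_LIST
#   EXTENSION_LIST is a comma-separated list such as ".pdf, .html, .txt".
#   Returns 0 if FILENAME ends in one of the listed extensions, else 1.
#   Matching is case-insensitive here, which is an assumption of this sketch.
has_indexed_extension() {
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    list=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]' | tr -d ' ')
    old_ifs=$IFS
    IFS=,
    for ext in $list; do
        # Ignore empty entries, e.g. from a doubled comma.
        [ -n "$ext" ] || continue
        case $name in
            *"$ext") IFS=$old_ifs; return 0 ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

# Example use:
#   has_indexed_extension "Manual.pdf" ".pdf, .html, .txt" && echo "would be indexed"
```

In this sketch the leading dot is part of each list entry, so an entry such as .pdf matches Manual.pdf but not a file that merely ends in the letters pdf.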
The indexer backends use a couple of external helper programs to read
proprietary file formats like .doc. See the list of dependencies in the Add-On
information below.
Searching with plucsearch
The plucsearch script uses the template plucsearch.tmpl (which can easily be
adapted to your site skin) or plucsearch.pattern.tmpl (if you use the pattern
skin). There is also a PluceneSearch topic with a form ready to use with the
plucsearch script.
The query syntax has been improved:
- you can use + for AND and - for AND NOT
- you can limit the search to the topic body or attachment body using the prefix text:, or just type the search string
- to search on meta data, use the prefix field: where field is the meta data name (like author)
- to search on a form field, use the prefix field: where field is the form's field name
- every topic has a web field and a topic field, so you can search for specific topic names in a web
- likewise, meta information is captured in the author, version and date fields
- there is a special combined field which holds all of the information about a topic in one field: its topic text, its title and all of its other attributes, including the comments on attachments. The combined field is the default field, so you do not have to name it explicitly in a query.
- Plucene adds the type field for indexed attachments, so you can use it to filter your results (like type:pdf)
- attachments also have a special field, attachment:yes, which the PluceneSearch topic uses to repeat a search displaying only attachments
Query examples (just type them into the PluceneSearch topic on your site):
- text:plucene searches for plucene in topic and attachment text
- plucene does the same as above
- author:JoanMVigo searches for topics and attachments by this author
- TopicClassification:ItemToDo searches for topics with a form field named TopicClassification with the value ItemToDo
- +perl -type:pdf +attachment:yes searches only attachments for the text perl, excluding PDF files
Other features
This new version provides some extra functionality:
- skip unwanted webs from the index (with the new preference PLUCENEINDEXSKIPWEBS)
- all other webs are indexed; however, if a web has Set NOSEARCHALL = on in its WebPreferences, no topic from that web is shown when displaying results
- skip annoying or unindexable attachments from the index (with the new preference PLUCENEINDEXSKIPATTACHMENTS)
- index variables per web (with the new preference PLUCENEINDEXVARIABLES). For example, if it is set to CONTACTINFO, a search for CONTACTINFO:JohnSmith returns the WebHome topics of the webs that have Set CONTACTINFO = JohnSmith in their WebPreferences.
- when displaying search results, show an option for displaying only attachments if PLUCENESEARCHATTACHMENTSONLY is enabled. You can set PLUCENESEARCHATTACHMENTSONLYLABEL to a text or an image.
Search form
The following form submits text to the
plucsearch script. The installation
instructions are detailed below.
Settings
- Set PLUCENEINDEXEXTENSIONS = .pdf, .html, .txt, .doc, .odt, .xls, .ppt
- Set PLUCENESEARCHATTACHMENTSONLY = 0
- Set PLUCENESEARCHATTACHMENTSONLYLABEL = Display only attachments
- Set PLUCENEINDEXVARIABLES = CONTACTINFO, JUSTANOTHERONE
- Set PLUCENEINDEXSKIPWEBS = Trash, Sandbox, TestCases
- Set PLUCENEINDEXSKIPATTACHMENTS = AnAttachment.txt, OtherAttachment.pdf
Add-On Installation Instructions
The following instructions are for the administrator who installs the
add-on on the server where TWiki is running.
- You can install Plucene and its dependencies by running:
  - perl -MCPAN -e "install Plucene"
  - perl -MCPAN -e "install Plucene::SearchEngine"
- Install third party text extraction tools, like xpdf (which provides pdftotext), antiword and odt2txt.
- Download the ZIP file from the Add-on Home (see below).
- Unzip SearchEnginePluceneAddOn.zip in your TWiki installation directory.
- Test if the installation was successful:
  - change the working directory to the tools directory of your TWiki installation
  - run ./plucindex
  - once it has finished, open a browser window and point it to the TWiki/PluceneSearch topic
  - type a query and check the results
- Create a new hourly crontab entry for the tools/plucupdate script.
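The command-line part of these steps can be collected into one shell function, sketched below. It is only an illustration: the function names run and install_plucene_addon, the DRY_RUN switch and the example paths are assumptions, and installing the text extraction tools is left out because it depends on the package manager of your system.

```shell
#!/bin/sh
# Illustrative sketch only: the installation steps as one function.
# All paths are placeholders; set DRY_RUN=1 to print the commands instead
# of executing them.

# run COMMAND [ARGS...]: execute the command, or only print it in dry-run mode.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# install_plucene_addon TWIKI_ROOT ZIP_FILE
#   The body runs in a subshell, so the cd does not affect the caller.
#   Each step only runs if the previous one succeeded.
install_plucene_addon() (
    twiki_root=$1
    zip_file=$2
    run perl -MCPAN -e 'install Plucene' &&
    run perl -MCPAN -e 'install Plucene::SearchEngine' &&
    run unzip "$zip_file" -d "$twiki_root" &&
    run cd "$twiki_root/tools" &&
    run ./plucindex
)

# Example use (hypothetical paths):
#   DRY_RUN=1
#   install_plucene_addon /path/to/twiki /path/to/SearchEnginePluceneAddOn.zip
```

Running it once with DRY_RUN=1 lets you review the commands before anything is installed or indexed.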
Add-On Info
This work is partly funded by the
DPA - Deutsche Presse Agentur GmbH.
| Add-on Author: | TWiki:Main/SopanShewale, TWiki:Main/JoanMVigo, TWiki:Main/MichaelDaum |
| License: | GPL (GNU General Public License) |
| Add-on Version: | v3.1 |
| Change History: | |
| 13 Sep 2007: | fastly recoded, added backends for common office file formats |
| 27 Jun 2006: | TWikiDakar (v2.200) - Searching issue solved when using template authentication, update index bug solved |
| 27 Jun 2006: | TWikiCairo (v1.500) - Update index bug solved |
| 21 Mar 2006: | TWikiDakar (v2.100) & TWikiCairo (v1.400) - Update index issue solved |
| 03 Mar 2006: | TWikiDakar (v2.000) & TWikiCairo (v1.300) |
| 15 Dec 2004: | Use of TWiki preferences for indexing path & attachment extensions (v1.210) |
| 26 Nov 2004: | TWikiCairo release compatible version (v1.200) |
| 23 Nov 2004: | Incremental version (v1.100) |
| 18 Nov 2004: | Initial version (v1.000) |
| CPAN Dependencies: | CPAN:Bit::Vector::Minimal, CPAN:IO::Scalar, CPAN:Plucene, CPAN:Plucene::SearchEngine, CPAN:Text::German, CPAN:Tie::Array::Sorted, CPAN:Spreadsheet::ParseExcel, CPAN:Time::Piece |
| Other Dependencies: | xpdf (pdftotext), antiword, odt2txt, and additional 3rd party tools for text extracting |
| Perl Version: | Tested with 5.8.0 |
| Add-on Home: | http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn |
| Feedback: | http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev |
| Appraisal: | http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal |
-- TWiki:Main/JoanMVigo - 27 Jun 2006
-- TWiki:Main/MichaelDaum - 13 Sep 2007