So here's a poor man's relatedness feature.
Every BlogEntry now has a "Related" attribute, which is a list of topic names that are considered to be related somehow to this article. So
it is a totally manual approach, i.e. it is 100% the author's responsibility
to establish relatedness. Nearly. The list of related articles has a tiny bit of automation
in that the relatedness relation is computed to be symmetric and transitive up to a given
depth. Heh, put plainly, every article that claims to be related to another establishes the
opposite as well: a simple search finds out who is related to me. Actually we compute
this to a depth of two by default, that is, an article related to an article that is in turn
related to the current one is taken to be related to the current one as well.
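To make the idea concrete, here is a minimal sketch of that computation. It is not the BlogPlugin's Perl code; it is written in Python for brevity, and the names `related_entries` and `related` as well as the topic names are made up for illustration. It assumes the "Related" attributes have already been collected into a dictionary mapping each topic to the topics it lists.

```python
from collections import defaultdict, deque


def related_entries(related, topic, depth=2):
    """Return the topics related to `topic`, nearest first.

    `related` maps a topic name to the list of topics named in its
    "Related" attribute (hypothetical input format). Links are followed
    in both directions, up to `depth` hops away from `topic`.
    """
    # Symmetry: if A lists B, then B is related to A as well.
    neighbours = defaultdict(set)
    for source, targets in related.items():
        for target in targets:
            neighbours[source].add(target)
            neighbours[target].add(source)

    # Bounded transitivity: breadth-first search that stops at `depth`.
    seen = {topic}
    queue = deque([(topic, 0)])
    result = []
    while queue:
        current, distance = queue.popleft()
        if distance == depth:
            continue  # do not expand beyond the depth limit
        for candidate in sorted(neighbours[current]):
            if candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
                queue.append((candidate, distance + 1))
    return result


# Hypothetical data: Entry1 lists Entry2, Entry3 lists Entry2, Entry4 lists Entry3.
related = {
    "Entry1": ["Entry2"],
    "Entry3": ["Entry2"],
    "Entry4": ["Entry3"],
}
print(related_entries(related, "Entry1"))
```

With the default depth of two, Entry1 picks up Entry2 (which it lists itself) and Entry3 (which lists Entry2), but not Entry4, which is three hops away.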
There is a new tag in the TWiki.BlogPlugin, RELATEDENTRIES, that implements all this
in Perl. No recursive INCLUDE-SEARCH orgies. Things have already piled up to about 500 lines in the BlogPlugin. IMHO, this is bad news for TWiki's readiness for TWikiApplications.
Anyway, frankly, there are tools to automate finding similar documents using Latent Semantic Indexing, see
Wikipedia:Latent_semantic_analysis and a nice hands-on article on perl.com,
"Building a Vector Space Search Engine in Perl" by Maciej Ceglowski. This implementation is a very rudimentary in-memory search engine, see
Search-VectorSpace.
But there's a more advanced search engine by the same author,
Search-ContextGraph, which is based on a spreading activation scheme that performs similarly to LSI. Hm, to
understand what's going on inside this beast, I will first purchase a copy of
"Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze.
I should have put that on my shelf long ago.