Pluf_Search Class Reference

Static Public Member Functions

static search ($query, $stemmer='Pluf_Text_Stemmer_Porter')
static stem ($words, $stemmer)
static searchDocuments ($wids)
static getWordIds ($words)
static index ($doc, $stemmer='Pluf_Text_Stemmer_Porter')


Detailed Description

Class implementing a small search engine.

Ideal for a small website with up to 100,000 documents.


Member Function Documentation

static Pluf_Search::search ( query, stemmer = 'Pluf_Text_Stemmer_Porter' ) [static]

Search.

Returns an array of arrays, each holding model_class, model_id and score. The list is already sorted by score in descending order.

You can then filter the list as you wish with another set of weights.

Parameters:
string Query string.
string Stemmer used. ('Pluf_Text_Stemmer_Porter')
Returns:
array Results.
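As an illustration of filtering the returned list with another set of weights, here is a minimal sketch. It assumes each result is an associative array keyed by model_class, model_id and score (the key layout is an assumption based on the description above). The helper reweight_results() and the model class names are hypothetical and not part of Pluf; the code uses PHP 7 syntax.

```php
<?php
/**
 * Hypothetical helper (not part of Pluf): multiply each score by a
 * per-class weight and sort the list again by score, descending.
 * Classes without an entry in $classWeights keep a weight of 1.0.
 */
function reweight_results(array $results, array $classWeights): array
{
    foreach ($results as &$row) {
        $weight = $classWeights[$row['model_class']] ?? 1.0;
        $row['score'] = $row['score'] * $weight;
    }
    unset($row); // break the reference left by the foreach

    usort($results, function ($a, $b) {
        return $b['score'] <=> $a['score'];
    });
    return $results;
}

// Stand-in for the array that Pluf_Search::search() is described as returning.
$results = [
    ['model_class' => 'Blog_Post', 'model_id' => 12, 'score' => 4.0],
    ['model_class' => 'Wiki_Page', 'model_id' => 3, 'score' => 3.0],
];

// Favor wiki pages over blog posts.
$ranked = reweight_results($results, ['Wiki_Page' => 2.0]);
```

With these sample values the wiki page ends up first, because its reweighted score exceeds the blog post's.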

static Pluf_Search::stem ( words, stemmer ) [static]

Stem the words with the given stemmer.

static Pluf_Search::searchDocuments ( wids  )  [static]

Search documents.

Only the total of the weighted occurrences is used to sort the results.

Parameters:
array Word ids.
Returns:
array Sorted by score, returns model_class, model_id and score.
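The ranking rule described above (sum the weighted occurrences per document, then sort) can be sketched as follows. This is not the Pluf implementation: the input layout, the weighted_occ key and the function name are assumptions made for illustration, and the real method works from word ids against the index storage.

```php
<?php
/**
 * Hypothetical sketch: each input row is one (document, word) pair with
 * its weighted occurrence count. Rows are summed per document and the
 * totals are returned sorted by score, descending.
 */
function rank_by_weighted_occurrences(array $occurrences): array
{
    $totals = [];
    foreach ($occurrences as $occ) {
        $key = $occ['model_class'] . '#' . $occ['model_id'];
        if (!isset($totals[$key])) {
            $totals[$key] = [
                'model_class' => $occ['model_class'],
                'model_id' => $occ['model_id'],
                'score' => 0.0,
            ];
        }
        $totals[$key]['score'] += $occ['weighted_occ'];
    }

    $ranked = array_values($totals);
    usort($ranked, function ($a, $b) {
        return $b['score'] <=> $a['score'];
    });
    return $ranked;
}

// Made-up sample data: document 1 matches two query words, document 2 one.
$ranked = rank_by_weighted_occurrences([
    ['model_class' => 'Doc', 'model_id' => 2, 'weighted_occ' => 0.5],
    ['model_class' => 'Doc', 'model_id' => 1, 'weighted_occ' => 0.5],
    ['model_class' => 'Doc', 'model_id' => 1, 'weighted_occ' => 0.25],
]);
```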

static Pluf_Search::getWordIds ( words  )  [static]

Get the id of each word.

Parameters:
array Words
Returns:
array Ids, null if no matching word.

static Pluf_Search::index ( doc, stemmer = 'Pluf_Text_Stemmer_Porter' ) [static]

Index a document.

The document must provide a _toIndex() method returning the document as a string for indexing. The string must be clean, as it will simply be tokenized by Pluf_Text::tokenize().

A recommended way to clean the string is to remove all the HTML tags and then, as a last step, run the following on it:

return Pluf_Text::cleanString(html_entity_decode($string, ENT_QUOTES, 'UTF-8'));
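The first two steps of that clean-up can be sketched with PHP built-ins only. The function name html_to_index_text() is hypothetical, and the sketch stops before the final Pluf_Text::cleanString() call so that it runs without Pluf; inside a real _toIndex() method you would wrap the result in that call as shown above.

```php
<?php
/**
 * Hypothetical helper: strip the HTML tags, then decode the entities
 * to UTF-8 text. Pluf_Text::cleanString() is deliberately not called
 * here, so this block has no dependency on Pluf.
 */
function html_to_index_text(string $html): string
{
    $text = strip_tags($html);
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

$plain = html_to_index_text('<p>Caf&eacute; &amp; <b>bar</b></p>');
```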

Indexing is resource intensive, so it is recommended to run it asynchronously. When you save a resource that needs indexing, just write a log entry such as "need to index resource x", and then index the logged resources every few minutes. Nobody cares if your index is not perfectly fresh, but your end users do care if the page takes 0.6s to come back instead of 0.1s.

To decide, take 500 average documents and index them while measuring the total time. Divide the total by 500; if the result is more than 0.1s, use a log/queue.
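The measurement described above might look like the following sketch. The function average_index_time() is hypothetical, and the indexer is passed in as a callable so the block runs without Pluf; in a Pluf application the callable would call Pluf_Search::index() on real documents instead of the stand-in used here.

```php
<?php
/**
 * Hypothetical helper: run $indexer over every document and return
 * the average wall-clock time per document, in seconds.
 */
function average_index_time(array $docs, callable $indexer): float
{
    if (count($docs) === 0) {
        return 0.0;
    }
    $start = microtime(true);
    foreach ($docs as $doc) {
        $indexer($doc);
    }
    return (microtime(true) - $start) / count($docs);
}

// 500 placeholder documents; use real, representative ones in practice.
$docs = array_fill(0, 500, 'sample document text');

$average = average_index_time($docs, function ($doc) {
    // Stand-in work; replace with Pluf_Search::index($doc) in a Pluf app.
    str_word_count($doc);
});

// Above 0.1s per document, defer indexing to a log/queue.
$useQueue = $average > 0.1;
```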

FIXME: Concurrency problem if the same document is indexed more than once at the same time.

Parameters:
Pluf_Model Document to index.
string Stemmer used. ('Pluf_Text_Stemmer_Porter')
Returns:
array Statistics.


Generated on Wed Feb 3 15:44:52 2010 for Pluf by  doxygen