Full–Text Index For MyISAM in MySQL

Full–Text Index For MyISAM

  • A query expansion search is a modification of a natural language search. The search string is used to perform a natural language search. Then words from the most relevant rows returned by the search are added to the search string and the search is done again. The query returns the rows from the second search. The WITH QUERY EXPANSION modifier specifies a query expansion search.

  • When MATCH() is used in a WHERE clause, as in the example shown earlier, the rows returned are automatically sorted with the highest relevance first. Relevance values are non-negative floating point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.

MySQL can perform boolean full-text searches using the IN BOOLEAN MODE modifier:

Boolean full-text searches have these characteristics:

They do not use the 50% threshold.

They do not automatically sort rows in order of decreasing relevance. You can see this from the preceding query result: The row with the highest relevance is the one that contains “MySQL” twice, but it is listed last, not first.

They can work even without a FULLTEXT index, although a search executed in this fashion would be quite slow.

The minimum and maximum word length full-text parameters apply.

The stopword list applies.

The boolean full-text search capability supports the following operators:

+

A leading plus sign indicates that this word must be present in each row that is returned.

-

A leading minus sign indicates that this word must not be present in any of the rows that are returned.

Note: The – operator acts only to exclude rows that are otherwise matched by other search terms. Thus, a boolean-mode search that contains only terms preceded by – returns an empty result. It does not return “all rows except those containing any of the excluded terms.”

(no operator)

By default (when neither + nor – is specified) the word is optional, but the rows that contain it are rated higher. This mimics the behavior of MATCH() … AGAINST() without the IN BOOLEAN MODE modifier.

> <

These two operators are used to change a word’s contribution to the relevance value that is assigned to a row. The > operator increases the contribution and the < operator decreases it. See the example following this list.

( )

Parentheses group words into subexpressions. Parenthesized groups can be nested.

~

A leading tilde acts as a negation operator, causing the word’s contribution to the row’s relevance to be negative. This is useful for marking “noise” words. A row containing such a word is rated lower than others, but is not excluded altogether, as it would be with the – operator.

*

The asterisk serves as the truncation (or wildcard) operator. Unlike the other operators, it should be appended to the word to be affected. Words match if they begin with the word preceding the * operator.

A phrase that is enclosed within double quote (‘”’) characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words, performs a search in the FULLTEXT index for the words. Prior to MySQL 5.0.3, the engine then performed a substring search for the phrase in the records that were found, so the match must include non-word characters in the phrase. As of MySQL 5.0.3, non-word characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, “test phrase” matches “test, phrase” in MySQL 5.0.3, but not before.

If the phrase contains no words that are in the index, the result is empty. For example, if all words are either stopwords or shorter than the minimum length of indexed words, the result is empty.

The following examples demonstrate some search strings that use boolean full-text operators:

‘apple banana’

Find rows that contain at least one of the two words.

‘+apple +juice’

Find rows that contain both words.

‘+apple macintosh’

Find rows that contain the word “apple”, but rank rows higher if they also contain “macintosh”.

‘+apple -macintosh’

Find rows that contain the word “apple” but not “macintosh”.

‘+apple ~macintosh’

Find rows that contain the word “apple”, but if the row also contains the word “macintosh”, rate it lower than if row does not. This is “softer” than a search for ‘+apple -macintosh’, for which the presence of “macintosh” causes the row not to be returned at all.

‘+apple +(>turnover <strudel)’

Find rows that contain the words “apple” and “turnover”, or “apple” and “strudel” (in any order), but rank “apple turnover” higher than “apple strudel”.

‘apple*’

Find rows that contain words such as “apple”, “apples”, “applesauce”, or “applet”.

‘”some words”‘

Find rows that contain the exact phrase “some words” (for example, rows that contain “some words of wisdom” but not “some noise words”). Note that the ‘”’ characters that enclose the phrase are operator characters that delimit the phrase. They are not the quotes that enclose the search string itself.

Full-Text Searches with Query Expansion

Full-text search supports query expansion (and in particular, its variant “blind query expansion”). This is generally useful when a search phrase is too short, which often means that the user is relying on implied knowledge that the full-text search engine lacks. For example, a user searching for “database” may really mean that “MySQL”, “Oracle”, “DB2”, and “RDBMS” all are phrases that should match “databases” and should be returned, too. This is implied knowledge.

Blind query expansion (also known as automatic relevance feedback) is enabled by adding WITH QUERY EXPANSION following the search phrase. It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few most highly relevant documents from the first search. Thus, if one of these documents contains the word “databases” and the word “MySQL”, the second search finds the documents that contain the word “MySQL” even if they do not contain the word “database”. The following example shows this difference:

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body) AGAINST ('database');
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)
 
mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body)
    -> AGAINST ('database' WITH QUERY EXPANSION);
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  3 | Optimizing MySQL  | In this tutorial we will show ...        |
+----+-------------------+------------------------------------------+
3 rows in set (0.00 sec)

Another example could be searching for books by Georges Simenon about Maigret, when a user is not sure how to spell “Maigret”. A search for “Megre and the reluctant witnesses” finds only “Maigret and the Reluctant Witnesses” without query expansion. A search with query expansion finds all books with the word “Maigret” on the second pass.

Note: Because blind query expansion tends to increase noise significantly by returning non-relevant documents, it is meaningful to use only when a search phrase is rather short.

Fine-Tuning MySQL Full-Text Search

MySQL’s full-text search capability has few user-tunable parameters. You can exert more control over full-text searching behavior if you have a MySQL source distribution because some changes require source code modifications.

Note that full-text search is carefully tuned for the most effectiveness. Modifying the default behavior in most cases can actually decrease effectiveness. Do not alter the MySQL sources unless you know what you are doing.

Most full-text variables described in this section must be set at server startup time. A server restart is required to change them; they cannot be modified while the server is running.

Some variable changes require that you rebuild the FULLTEXT indexes in your tables. Instructions for doing this are given at the end of this section.

· The minimum and maximum lengths of words to be indexed are defined by the ft_min_word_len and ft_max_word_len system variables. The default minimum value is four characters; the default maximum is version dependent. If you change either value, you must rebuild your FULLTEXT indexes. For example, if you want three-character words to be searchable, you can set the ft_min_word_len variable by putting the following lines in an option file:

·                [mysqld]
·                ft_min_word_len=3

Then you must restart the server and rebuild your FULLTEXT indexes. Note particularly the remarks regarding myisamchk in the instructions following this list.

· To override the default stopword list, set the ft_stopword_file system variable.The variable value should be the pathname of the file containing the stopword list, or the empty string to disable stopword filtering. After changing the value of this variable or the contents of the stopword file, restart the server and rebuild your FULLTEXT indexes.

The stopword list is free-form. That is, you may use any non-alphanumeric character such as newline, space, or comma to separate stopwords. Exceptions are the underscore character (‘_’) and a single apostrophe (‘'’) which are treated as part of a word. The character set of the stopword list is the server’s default character set.

· The 50% threshold for natural language searches is determined by the particular weighting scheme chosen. To disable it, look for the following line in myisam/ftdefs.h:

·                #define GWS_IN_USE GWS_PROB

Change that line to this:

#define GWS_IN_USE GWS_FREQ

Then recompile MySQL. There is no need to rebuild the indexes in this case. Note: By making this change, you severely decrease MySQL’s ability to provide adequate relevance values for the MATCH() function. If you really need to search for such common words, it would be better to search using IN BOOLEAN MODE instead, which does not observe the 50% threshold.

· To change the operators used for boolean full-text searches, set the ft_boolean_syntax system variable. This variable can be changed while the server is running, but you must have the SUPER privilege to do so. No rebuilding of indexes is necessary in this case.

· If you want to change the set of characters that are considered word characters, you can do so in two ways. Suppose that you want to treat the hyphen character (‘-’) as a word character. Use either of these methods:

o Modify the MySQL source: In myisam/ftdefs.h, see the true_word_char() and misc_word_char() macros. Add '-' to one of those macros and recompile MySQL.

o Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. . You can edit the <ctype><map> contents in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes.

After making the modification, you must rebuild the indexes for each table that contains any FULLTEXT indexes.

If you modify full-text variables that affect indexing (ft_min_word_len, ft_max_word_len, or ft_stopword_file), or if you change the stopword file itself, you must rebuild your FULLTEXT indexes after making the changes and restarting the server. To rebuild the indexes in this case, it is sufficient to do a QUICK repair operation:

mysql> REPAIR TABLE tbl_name QUICK;

Each table that contains any FULLTEXT index must be repaired as just shown. Otherwise, queries for the table may yield incorrect results, and modifications to the table will cause the server to see the table as corrupt and in need of repair.

Note that if you use myisamchk to perform an operation that modifies table indexes (such as repair or analyze), the FULLTEXT indexes are rebuilt using the default full-text parameter values for minimum word length, maximum word length, and stopword file unless you specify otherwise. This can result in queries failing.

The problem occurs because these parameters are known only by the server. They are not stored in MyISAM index files. To avoid the problem if you have modified the minimum or maximum word length or stopword file values used by the server, specify the same ft_min_word_len, ft_max_word_len, and ft_stopword_file values to myisamchk that you use for mysqld. For example, if you have set the minimum word length to 3, you can repair a table with myisamchk like this:

shell> myisamchk --recover --ft_min_word_len=3 tbl_name.MYI

To ensure that myisamchk and the server use the same values for full-text parameters, place each one in both the [mysqld] and [myisamchk] sections of an option file:

[mysqld]
ft_min_word_len=3
 
[myisamchk]
ft_min_word_len=3

An alternative to using myisamchk is to use the REPAIR TABLE, ANALYZE TABLE, OPTIMIZE TABLE, or ALTER TABLE statements. These statements are performed by the server, which knows the proper full-text parameter values to use.

Leave a Reply