MySQL 5.6 InnoDB Fulltext Round 2: Stopwords

A few months ago, in MySQL 5.6.4, I took a brief look at MySQL 5.6's new InnoDB Fulltext support; see Initial impressions of InnoDB Fulltext. There have been quite a few new point releases since then, and we're into RC releases of MySQL 5.6 now, so I thought I'd take another look.

One of my major concerns in my first look was the security implication of the way any user could see fulltext index details for the table set in the global innodb_ft_aux_table variable. Fortunately, this is "fixed", but not quite in the way I'd like. Now, a user needs the PROCESS privilege to be able to read from the INFORMATION_SCHEMA tables that may contain InnoDB data. So, a user needs SUPER to be able to set innodb_ft_aux_table and then PROCESS to be able to read from the various tables in INFORMATION_SCHEMA. The PROCESS privilege "pertains to display of information about the threads executing within the server", according to the manual, so I'm not really seeing the relationship between this privilege and InnoDB Fulltext index data, but at least the security hole I identified originally no longer exists.

Still, though, the requirement that a user have SUPER and now PROCESS privileges in order simply to see information about fulltext indexes on their own tables is very disappointing. A user with privileges to create tables with fulltext indexes cannot get any details about those indexes unless they are given not one but two global privileges completely unrelated to the analysis of InnoDB Fulltext index information. This is not a reasonable interface to this data nor a reasonable approach for making it available to users.

I got caught up last time in wrapping my head around this INFORMATION_SCHEMA functionality, so this time I wanted to take a look at some of the other functionality available with InnoDB Fulltext. I didn't make it very far, though...

One of the things that caught my attention this time around is all of the "auxiliary" tables that are required for InnoDB Fulltext. When you create a very simple InnoDB table with a single fulltext index, there are 8 additional auxiliary tables created, each with its own .ibd file on the filesystem (if you have innodb_file_per_table enabled, of course), and another 6 that don't have an accompanying .ibd.

mysql 5.6.9-rc (root) [test]> create table i1 (id int unsigned not null auto_increment primary key, c varchar(32), fulltext(c));
Query OK, 0 rows affected (0.06 sec)

mysql 5.6.9-rc (root) [test]> select table_id, name from information_schema.innodb_sys_tables where name like concat('%FTS_',lpad(conv((select table_id from information_schema.innodb_sys_tables where name='test/i1'),10,16),16,'0'),'%');
+----------+----------------------------------------------------+
| table_id | name                                               |
+----------+----------------------------------------------------+
|      202 | test/FTS_00000000000000c2_0000000000000114_DOC_ID  |
|      203 | test/FTS_00000000000000c2_0000000000000114_INDEX_1 |
|      204 | test/FTS_00000000000000c2_0000000000000114_INDEX_2 |
|      205 | test/FTS_00000000000000c2_0000000000000114_INDEX_3 |
|      206 | test/FTS_00000000000000c2_0000000000000114_INDEX_4 |
|      207 | test/FTS_00000000000000c2_0000000000000114_INDEX_5 |
|      208 | test/FTS_00000000000000c2_0000000000000114_INDEX_6 |
|      195 | test/FTS_00000000000000c2_ADDED                    |
|      198 | test/FTS_00000000000000c2_BEING_DELETED            |
|      199 | test/FTS_00000000000000c2_BEING_DELETED_CACHE      |
|      200 | test/FTS_00000000000000c2_CONFIG                   |
|      196 | test/FTS_00000000000000c2_DELETED                  |
|      197 | test/FTS_00000000000000c2_DELETED_CACHE            |
|      201 | test/FTS_00000000000000c2_STOPWORDS                |
+----------+----------------------------------------------------+
14 rows in set (0.00 sec)

kolbe@prosimmon 5.6 $ ls ./data/test/
FTS_00000000000000c2_0000000000000114_DOC_ID.ibd FTS_00000000000000c2_CONFIG.ibd                  i1.frm
FTS_00000000000000c2_ADDED.ibd                   FTS_00000000000000c2_DELETED.ibd                 i1.ibd
FTS_00000000000000c2_BEING_DELETED.ibd           FTS_00000000000000c2_DELETED_CACHE.ibd           i1_stopwords.frm
FTS_00000000000000c2_BEING_DELETED_CACHE.ibd     FTS_00000000000000c2_STOPWORDS.ibd               i1_stopwords.ibd

There are some concerns of course with all of these extra things being created for each InnoDB table where you use fulltext. Additional file descriptors, additional InnoDB data dictionary overhead, et cetera. But even more strange is the fact that not all of these even appear to be used by InnoDB. For some reason I was particularly drawn to the "STOPWORDS" table (FTS_00000000000000c2_STOPWORDS.ibd).

There are two sources of stopwords for MySQL 5.6 InnoDB Fulltext: the default stopwords, viewable in INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD (this is from a static array compiled into the server, apparently based on some minimally multilingual amalgam of some elements of "Google Stopwords"), and "user-defined stopwords", controlled with yet more system variables: innodb_ft_server_stopword_table and innodb_ft_user_stopword_table.

I mostly want to look at innodb_ft_user_stopword_table, because trying to control per-table behavior with MySQL system variables is pretty cumbersome and error-prone and I wanted to see how well this works for InnoDB Fulltext.

If you are creating a new table and you want to use some custom set of stopwords, you would do so like this:

mysql 5.6.9-rc (root) [test]> create table i1_stopwords (value varchar(255)) engine=innodb;
Query OK, 0 rows affected (0.02 sec)

mysql 5.6.9-rc (root) [test]> insert into i1_stopwords values ('apple'),('banana'),('coconut'),('durian');
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql 5.6.9-rc (root) [test]> set innodb_ft_user_stopword_table='test/i1_stopwords';
Query OK, 0 rows affected (0.00 sec)

mysql 5.6.9-rc (root) [test]> create table i1 (id int unsigned not null auto_increment primary key, c varchar(32), fulltext(c));
Query OK, 0 rows affected (0.06 sec)

mysql 5.6.9-rc (root) [test]> insert into i1 (c) values ('a crabby apple'),('the banana bonanza'),('some crazy coconuts'),('one daring durian');
Query OK, 4 rows affected (0.01 sec)
Records: 4  Duplicates: 0  Warnings: 0

We can use our old friends innodb_ft_aux_table and INFORMATION_SCHEMA.INNODB_FT_TABLE_INDEX/CACHE to see what made it into our index.

mysql 5.6.9-rc (root) [test]> set global innodb_ft_aux_table='test/i1';
Query OK, 0 rows affected (0.00 sec)

mysql 5.6.9-rc (root) [test]> select * from information_schema.innodb_ft_index_cache union all select * from information_schema.innodb_ft_index_table;
+----------+--------------+-------------+-----------+--------+----------+
| WORD     | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+----------+--------------+-------------+-----------+--------+----------+
| bonanza  |            2 |           2 |         1 |      2 |       11 |
| coconuts |            3 |           3 |         1 |      3 |       11 |
| crabby   |            1 |           1 |         1 |      1 |        2 |
| crazy    |            3 |           3 |         1 |      3 |        5 |
| daring   |            4 |           4 |         1 |      4 |        4 |
| one      |            4 |           4 |         1 |      4 |        0 |
| some     |            3 |           3 |         1 |      3 |        0 |
| the      |            2 |           2 |         1 |      2 |        0 |
+----------+--------------+-------------+-----------+--------+----------+
8 rows in set (0.01 sec)

We can also see the stopword table in INFORMATION_SCHEMA.INNODB_FT_CONFIG:

mysql 5.6.9-rc (root) [test]> select * from information_schema.innodb_ft_config;
+---------------------------+-------------------+
| KEY                       | VALUE             |
+---------------------------+-------------------+
| optimize_checkpoint_limit | 180               |
| synced_doc_id             | 0                 |
| last_optimized_word       |                   |
| deleted_doc_count         | 0                 |
| total_word_count          |                   |
| optimize_start_time       |                   |
| optimize_end_time         |                   |
| stopword_table_name       | test/i1_stopwords |
| use_stopword              | 1                 |
| table_state               | 0                 |
+---------------------------+-------------------+
10 rows in set (0.00 sec)

Alright, it looks like our user-defined stopword table worked just fine. So, what's the problem?

What if you need to dump/reload this table? You have to manually verify whether or not the table uses a user-defined stopword table, which requires consulting INFORMATION_SCHEMA.INNODB_FT_CONFIG, which requires setting innodb_ft_aux_table, which requires the SUPER privilege. You have to make sure you also dump/reload the stopword table, if you're moving things to another machine, and you are responsible for manually setting innodb_ft_user_stopword_table as appropriate when you import. But it's even worse than that. OPTIMIZE TABLE discards the stopword list and re-builds the entire fulltext index according to the current value of innodb_ft_user_stopword_table.

mysql 5.6.9-rc (root) [test]> select * from information_schema.innodb_ft_config;
+---------------------------+-------------------+
| KEY                       | VALUE             |
+---------------------------+-------------------+
| optimize_checkpoint_limit | 180               |
| synced_doc_id             | 0                 |
| last_optimized_word       |                   |
| deleted_doc_count         | 0                 |
| total_word_count          |                   |
| optimize_start_time       |                   |
| optimize_end_time         |                   |
| stopword_table_name       | test/i1_stopwords |
| use_stopword              | 1                 |
| table_state               | 0                 |
+---------------------------+-------------------+
10 rows in set (0.00 sec)

mysql 5.6.9-rc (root) [test]> optimize table i1;
+---------+----------+----------+-------------------------------------------------------------------+
| Table   | Op       | Msg_type | Msg_text                                                          |
+---------+----------+----------+-------------------------------------------------------------------+
| test.i1 | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| test.i1 | optimize | status   | OK                                                                |
+---------+----------+----------+-------------------------------------------------------------------+
2 rows in set (0.07 sec)

mysql 5.6.9-rc (root) [test]> select * from information_schema.innodb_ft_config;
+---------------------------+-------+
| KEY                       | VALUE |
+---------------------------+-------+
| optimize_checkpoint_limit | 180   |
| synced_doc_id             | 0     |
| last_optimized_word       |       |
| deleted_doc_count         | 0     |
| total_word_count          |       |
| optimize_start_time       |       |
| optimize_end_time         |       |
| stopword_table_name       |       |
| use_stopword              | 1     |
| table_state               | 0     |
+---------------------------+-------+
10 rows in set (0.00 sec)


mysql 5.6.9-rc (root) [test]> select * from information_schema.innodb_ft_index_cache union all select * from information_schema.innodb_ft_index_table;
+----------+--------------+-------------+-----------+--------+----------+
| WORD     | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+----------+--------------+-------------+-----------+--------+----------+
| apple    |            1 |           1 |         1 |      1 |        9 |
| banana   |            2 |           2 |         1 |      2 |        4 |
| bonanza  |            2 |           2 |         1 |      2 |       11 |
| coconuts |            3 |           3 |         1 |      3 |       11 |
| crabby   |            1 |           1 |         1 |      1 |        2 |
| crazy    |            3 |           3 |         1 |      3 |        5 |
| daring   |            4 |           4 |         1 |      4 |        4 |
| durian   |            4 |           4 |         1 |      4 |       11 |
| one      |            4 |           4 |         1 |      4 |        0 |
| some     |            3 |           3 |         1 |      3 |        0 |
+----------+--------------+-------------+-----------+--------+----------+
10 rows in set (0.00 sec)

A regular user (one without SUPER or PROCESS) cannot consult INFORMATION_SCHEMA.INNODB_FT_CONFIG to see the current stopword_table_name for the table and thus cannot safely optimize or dump/reload their own tables without external accounting (i.e. write down which stopword table(s) your tables use) if the table was created with a user-defined stopword table.

But remember that "FTS_00000000000000c2_STOPWORDS" auxiliary table we saw earlier? Well, that doesn't appear to be used at all.

kolbe@prosimmon 5.6 $ strings -a ./data/test/i1_stopwords.ibd 
infimum
supremum
apple
banana
.coconut
=durian

kolbe@prosimmon 5.6 $ strings -a ./data/test/FTS_00000000000000c2_STOPWORDS.ibd 
P'=+
infimum
supremum

Maybe there's some planned future support for more visible and manageable stopwords, but for now the facility for these is quite cumbersome and problematic.

Beyond these basic usability problems, there's a bug that's causing some memory related to the innodb_ft_user_stopword_table variable to be corrupted. So far it's made my server lock up and crash quite a few times while doing my simple testing. Here's some of what I saw:

mysql 5.6.9-rc (root) [test]> set innodb_ft_user_stopword_table='test/no_stopwords';
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql 5.6.9-rc (root) [test]> select @@innodb_ft_user_stopword_table;                                                                                         
ERROR 2006 (HY000): MySQL server has gone away
mysql 5.6.9-rc (root) [test]> show variables like 'innodb_ft_user_stopword_table';
+-------------------------------+-------+
| Variable_name                 | Value |
+-------------------------------+-------+
| innodb_ft_user_stopword_table |       |
+-------------------------------+-------+
1 row in set, 1 warning (0.00 sec)

mysql 5.6.9-rc (root) [test]> show warnings;
+---------+------+--------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                              |
+---------+------+--------------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\x98\xAC\x99\x1A\xE3' for column 'VARIABLE_VALUE' at row 1 |
+---------+------+--------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql 5.6.9-rc (root) [test]> select @@global.innodb_ft_user_stopword_table;
+----------------------------------------+
| @@global.innodb_ft_user_stopword_table |
+----------------------------------------+
| O?o?                                   |
+----------------------------------------+
1 row in set (0.00 sec)

mysql 5.6.9-rc (root) [test]> show variables like 'innodb_ft_user_stopword_table';
+-------------------------------+-------+
| Variable_name                 | Value |
+-------------------------------+-------+
| innodb_ft_user_stopword_table | O    |
+-------------------------------+-------+
1 row in set, 1 warning (0.00 sec)

mysql 5.6.9-rc (root) [test]> show warnings;
+---------+------+------------------------------------------------------------------------------+
| Level   | Code | Message                                                                      |
+---------+------+------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xB9o\xFE\x07' for column 'VARIABLE_VALUE' at row 1 |
+---------+------+------------------------------------------------------------------------------+
1 row in set (0.00 sec)

I've filed bug #67960 and the folks at Oracle have already been very good about analyzing it. I trust it will be fixed before the next release of MySQL 5.6.

About the Author

Kolbe Kegel is a Principal Support Engineer at SkySQL, where he has worked since 2011. Kolbe has worked with MySQL since 2005, first at MySQL, later at Sun Microsystems after its acquisition of MySQL Inc., then at Oracle after its acquisition of Sun.
James Day (not verified)

Thanks Kolbe, please keep up

Thanks Kolbe, please keep up the good feedback. Rough edge elimination is a good thing and we're very interested in as many reports of rough edges that need to be fixed as there are rough edges that need fixing.

Views are my own, for an official Oracle view, please consult a PR person.

James Day, MySQL Senior Principal Support Engineer, Oracle.

Justin Swanhart (not verified)

Awful

That is really awful. Requiring SUPER privs is terrible, because a user with SUPER privs can ignore the read_only flag, change GLOBAL settings and use up all connections because by using the reserved connection for a user with SUPER privs when max_connections is reached.

The PROCESS privilege allows users to view the SQL statements of all other users. It also allows creation of the InnoDB monitor tables which can fill up the disk with logs and make the error log useless.