kolbe's blog

Surprises in store with ndb_restore

While doing some routine fiddling regarding some topic I've now forgotten, I discovered that ndb_restore was doing something quite surprising. It's been common wisdom for some time that one can use ndb_restore -m to restore metadata into a new cluster and automatically have your data re-partitioned across the data nodes in the destination cluster. In fact, this was the recommended procedure for adding nodes to a cluster before online add node came along. Since MySQL Cluster 7.0, though, ndb_restore hasn't behaved that way, though that change in behavior doesn't seem to be documented and most don't know that the change ever took place.

I'll go through some of the methods you can use to find information about the partitioning strategy for an NDB table, talk a bit about why ndb_restore stopped working the way most everyone expected (and still expect) it to, and discuss some possible alternatives and workarounds.

Let's start out with an example of how ndb_restore worked in the pre-7.0 days. I'm going to create a 2-node cluster, create a table, put some rows in it, look at the partitioning strategy for that table, then take a backup and shut down my cluster.

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ cat ~/cluster_2.ini 
[ndb_mgmd]
Hostname=127.0.0.1
Datadir=/home/ndb/cluster-data
NodeId=1

[ndbd default]
#MaxNoOfExecutionThreads=4
Datadir=/home/ndb/cluster-data
NoOfReplicas=2
Hostname=127.0.0.1

[ndbd]
NodeId=3
[ndbd]
NodeId=4

[mysqld]
NodeId=11

[mysqld]
NodeId=12

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgmd -f ~/cluster_2.ini  
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndbd --initial;./bin/ndbd --initial;
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.1.56 ndb-6.3.45)
Node 4: started (mysql-5.1.56 ndb-6.3.45)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/mysqld_safe &
[1] 2489
120215 20:10:49 mysqld_safe Logging to '/home/ndb/mysql/mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23/data/ip-10-0-0-59.err'.
120215 20:10:49 mysqld_safe Starting mysqld daemon with databases from /home/ndb/mysql/mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23/data

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.1.56-ndb-6.3.45-cluster-gpl MySQL Cluster Server (GPL)

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.1.56-ndb-6.3.45-cluster-gpl (root) [test]> create table c1 (id int) engine=ndb;
Query OK, 0 rows affected (0.12 sec)

mysql 5.1.56-ndb-6.3.45-cluster-gpl (root) [test]> INSERT INTO c1 (id) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),
(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),
(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),
(91),(92),(93),(94),(95),(96),(97),(98),(99),(100);
Query OK, 100 rows affected (0.00 sec)
Records: 100  Duplicates: 0  Warnings: 0

mysql 5.1.56-ndb-6.3.45-cluster-gpl (root) [test]> Bye
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 5
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 206
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               56              56              32768                   0                       0               0                       3,4
1               44              44              32768                   0                       0               0                       4,3


NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e 'start backup'
Connected to Management Server at: localhost:1186
Waiting for completed, this may take several minutes
Node 3: Backup 1 started from node 1
Node 3: Backup 1 started from node 1 completed
 StartGCP: 88 StopGCP: 91
 #Records: 2156 #LogRecords: 0
 Data: 53208 bytes Log: 0 bytes
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ 
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/mysqladmin shutdown
120215 20:13:45 mysqld_safe mysqld from pid file /home/ndb/mysql/mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23/data/ip-10-0-0-59.pid ended
[1]+  Done                    ./bin/mysqld_safe
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e shutdown
Connected to Management Server at: localhost:1186
2 NDB Cluster node(s) have shutdown.
Disconnecting to allow management server to shutdown.

So, there we've created a 2-node cluster, created a table and put a few rows in it, created an NDB native backup, and then shut the cluster down. Now, we'll create a 4-node cluster, restore the backup, and see what our table looks like.

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ rm ./data/test/*
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ cat ~/cluster_4.ini 
[ndb_mgmd]
Hostname=127.0.0.1
Datadir=/home/ndb/cluster-data
NodeId=1

[ndbd default]
#MaxNoOfExecutionThreads=4
Datadir=/home/ndb/cluster-data
NoOfReplicas=2
Hostname=127.0.0.1

[ndbd]
NodeId=3
[ndbd]
NodeId=4

[ndbd]
NodeId=5
[ndbd]
NodeId=6

[mysqld]
NodeId=11

[mysqld]
NodeId=12
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgmd -f ~/cluster_4.ini  
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.1.56 ndb-6.3.45)
Node 4: started (mysql-5.1.56 ndb-6.3.45)
Node 5: started (mysql-5.1.56 ndb-6.3.45)
Node 6: started (mysql-5.1.56 ndb-6.3.45)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_restore -b 1 -r -n 3 -m ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 3
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.ctl'
Backup version in files: ndb-6.3.11 ndb version: mysql-5.1.56 ndb-6.3.45
Connected to ndb!!
Successfully restored table `mysql/def/ndb_apply_status`
Successfully restored table event REPL$mysql/ndb_apply_status
Successfully restored table `test/def/c1`
Successfully restored table event REPL$test/c1
Successfully restored table `mysql/def/ndb_schema`
Successfully restored table event REPL$mysql/ndb_schema
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.3.Data'
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(1) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(4) fragment 0
_____________________________________________________
Processing data in table: test/def/c1(5) fragment 0
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_2_3(3) fragment 0
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(0) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(2) fragment 0
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.log'
Restored 56 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_restore -b 1 -r -n 4 ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 4
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.ctl'
Backup version in files: ndb-6.3.11 ndb version: mysql-5.1.56 ndb-6.3.45
Connected to ndb!!
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.4.Data'
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(1) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(4) fragment 1
_____________________________________________________
Processing data in table: test/def/c1(5) fragment 1
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_2_3(3) fragment 1
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(0) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(2) fragment 1
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.log'
Restored 44 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 5
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 206
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               26              26              32768                   0                       0               0                       3,4
1               24              24              32768                   0                       0               0                       5,6
3               20              20              32768                   0                       0               0                       6,5
2               30              30              32768                   0                       0               0                       4,3


NDBT_ProgramExit: 0 - OK

Alright! We created a new cluster with 4 data nodes, restored the backup into the cluster, and confirmed with ndb_desc that the data was automatically re-partitioned to give the table a number of partitions equal to the number of data nodes in the cluster. Why is that important? This way, each data node can be primary for one partition.

You can see in the Nodes column on the very right-hand side of the Per partition info section which nodes hold each partition. The left-most node listed in that column for a given partition is the primary for that partition; any other nodes listed hold secondary replicas for that partition.

When the cluster is handling a request, data is only retrieved from the primary replica. If we had 4 data nodes but only 2 partitions, that would mean that half of our nodes were not primary for any partition, which means that they would never be responsible for sending any data to API/MySQL nodes. Clearly, that is not the best solution in terms of spreading load across the data nodes.

Unfortunately, that is exactly the behavior you get with this same operation starting with MySQL Cluster 7.0.

Here's a demo identical to the one above, but using MySQL Cluster 7.2.4:

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgmd -f ~/cluster_2.ini --config-dir=/home/ndb/cluster-config/ --initial
MySQL Cluster Management Server mysql-5.5.19 ndb-7.2.4
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndbd --initial;./bin/ndbd --initial
2012-02-15 20:29:17 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:29:17 [ndbd] INFO     -- Angel allocated nodeid: 3
2012-02-15 20:29:17 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:29:17 [ndbd] INFO     -- Angel allocated nodeid: 4
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.5.19 ndb-7.2.4)
Node 4: started (mysql-5.5.19 ndb-7.2.4)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/mysqld_safe &
[1] 3079
120215 20:29:35 mysqld_safe Logging to '/home/ndb/mysql/mysql-cluster-gpl-7.2.4-linux2.6-x86_64/data/ip-10-0-0-59.err'.
120215 20:29:35 mysqld_safe Starting mysqld daemon with databases from /home/ndb/mysql/mysql-cluster-gpl-7.2.4-linux2.6-x86_64/data

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.5.19-ndb-7.2.4-gpl MySQL Cluster Community Server (GPL)

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]> create table c1 (id int) engine=ndb;
Query OK, 0 rows affected (0.17 sec)

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]> INSERT INTO c1 (id) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),
(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),
(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),
(91),(92),(93),(94),(95),(96),(97),(98),(99),(100);
Query OK, 100 rows affected (0.00 sec)
Records: 100  Duplicates: 0  Warnings: 0

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]> Bye
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 9
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 204
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
FragmentCount: 2
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               56              56              32768                   0                       0               0                       3,4
1               44              44              32768                   0                       0               0                       4,3


NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e 'start backup'
Connected to Management Server at: localhost:1186
Waiting for completed, this may take several minutes
Node 3: Backup 1 started from node 1
Node 3: Backup 1 started from node 1 completed
 StartGCP: 25 StopGCP: 28
 #Records: 2157 #LogRecords: 0
 Data: 53592 bytes Log: 0 bytes
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/mysqladmin shutdown
120215 20:30:15 mysqld_safe mysqld from pid file /home/ndb/mysql/mysql-cluster-gpl-7.2.4-linux2.6-x86_64/data/ip-10-0-0-59.pid ended
[1]+  Done                    ./bin/mysqld_safe
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e shutdown
Connected to Management Server at: localhost:1186
3 NDB Cluster node(s) have shutdown.
Disconnecting to allow management server to shutdown.

OK, everything there looks about the same as before. We created the same table, inserted the same rows, and we have the same number of partitions that we did after the first half of the exercise on MySQL Cluster 6.3.45. Now, let's try the restore.


[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ rm ./data/test/*
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgmd -f ~/cluster_4.ini --config-dir=/home/ndb/cluster-config/ --initial
MySQL Cluster Management Server mysql-5.5.19 ndb-7.2.4
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;
2012-02-15 20:32:43 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:43 [ndbd] INFO     -- Angel allocated nodeid: 3
2012-02-15 20:32:43 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:43 [ndbd] INFO     -- Angel allocated nodeid: 4
2012-02-15 20:32:44 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:44 [ndbd] INFO     -- Angel allocated nodeid: 5
2012-02-15 20:32:44 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:44 [ndbd] INFO     -- Angel allocated nodeid: 6
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.5.19 ndb-7.2.4)
Node 4: started (mysql-5.5.19 ndb-7.2.4)
Node 5: started (mysql-5.5.19 ndb-7.2.4)
Node 6: started (mysql-5.5.19 ndb-7.2.4)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_restore -b 1 -r -n 3 -m ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 3
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.ctl'
File size 14088 bytes
Backup version in files: ndb-6.3.11 ndb version: mysql-5.5.19 ndb-7.2.4
Stop GCP of Backup: 27
Connected to ndb!!
Created hashmap: DEFAULT-HASHMAP-240-2
Successfully restored table `mysql/def/ndb_apply_status`
Successfully restored table event REPL$mysql/ndb_apply_status
Successfully restored table `test/def/c1`
Successfully restored table event REPL$test/c1
Successfully restored table `mysql/def/ndb_schema`
Successfully restored table event REPL$mysql/ndb_schema
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.3.Data'
File size 27448 bytes
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_7_3(8) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_sample(5) fragment 0
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(3) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(9) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_head(4) fragment 0
_____________________________________________________
Processing data in table: test/def/c1(10) fragment 0
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(2) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(7) fragment 0
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.log'
File size 52 bytes
Restored 56 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_restore -b 1 -r -n 4 ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 4
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.ctl'
File size 14088 bytes
Backup version in files: ndb-6.3.11 ndb version: mysql-5.5.19 ndb-7.2.4
Stop GCP of Backup: 27
Connected to ndb!!
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.4.Data'
File size 26688 bytes
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_7_3(8) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_sample(5) fragment 1
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(3) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(9) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_head(4) fragment 1
_____________________________________________________
Processing data in table: test/def/c1(10) fragment 1
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(2) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(7) fragment 1
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.log'
File size 52 bytes
Restored 44 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 9
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 204
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
FragmentCount: 2
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               56              56              32768                   0                       0               0                       3,4
1               44              44              32768                   0                       0               0                       5,6


NDBT_ProgramExit: 0 - OK

Uh oh, this didn't turn out quite the same as the example from MySQL Cluster 6.3.45. There are still only 2 partitions after the restore, even though there are 4 data nodes. Take a look at the Nodes column on the right of "Per partition info" and you can see, in fact, that the 2 partitions are actually on separate node groups. That's sort of interesting. It means that writes are still going to be scaled across all node groups, which is great, but it means that reads will not be scaled. All reads will have to come from nodes 3 and 5, because those nodes are the primaries for their respective partitions.

So, why did this change happen? It's not something that anyone decided to do consciously, I think; instead, I think it's the side effect of the implementation of the new HashMap partitioning algorithm that was introduced and made default in MySQL Cluster 7.0. Frazer Clement provides an exceptional discussion of the HashMap algorithm at http://messagepassing.blogspot.com/2011/03/mysql-cluster-online-scaling.....

It appears that the HashMap is stored as part of the schema data for the table; when the table metadata is restored with ndb_restore -m, the same HashMap is used. MySQL Cluster distributes the partitions across all the node groups in the destination cluster, but it does not change the number of partitions. (As a result, if you had a 6-node cluster, one node group would not hold any partitions for this table; that would mean 3 node groups, but there are only 2 partitions.)

Now we see how ndb_restore works starting in MySQL Cluster 7.0 and we can see that the results are not very desirable. What, then, can be done to get your table distributed across all nodes and node groups so that each data node in the cluster is primary for one partition? There are a couple options.

Part of the reason HashMap was put into place was to make it easier to redistribute data in the cluster in order to support online add node functionality. When using online add node, you execute an ALTER TABLE ... REORGANIZE PARTITION statement after creating the new node group(s) and starting the new data nodes. We can do the same, here, to reorganize the partitions of our table across all nodes in the cluster:

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]> select partition_name, table_rows from information_schema.partitions where table_schema='test' and table_name='c1';
+----------------+------------+
| partition_name | table_rows |
+----------------+------------+
| p0             |         56 |
| p1             |         44 |
+----------------+------------+
2 rows in set (0.00 sec)

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]> alter table c1 reorganize partition;
Query OK, 0 rows affected (7.46 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]> select partition_name, table_rows from information_schema.partitions where table_schema='test' and table_name='c1';
+----------------+------------+
| partition_name | table_rows |
+----------------+------------+
| p0             |         26 |
| p1             |         24 |
| p2             |         30 |
| p3             |         20 |
+----------------+------------+
4 rows in set (0.02 sec)

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]> Bye
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 16777217
Fragment type: 9
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 204
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
FragmentCount: 4
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               26              116             32768                   0                       0               0                       3,4
2               30              30              32768                   0                       0               0                       4,3
1               24              84              32768                   0                       0               0                       5,6
3               20              20              32768                   0                       0               0                       6,5


NDBT_ProgramExit: 0 - OK

That's a pretty easy way to re-partition a table across your data nodes. However, keep in mind that you'd need to do this for every table in the cluster. It's fairly easy to do that programatically by checking the number of partitions for a given table in information_schema.partitions and executing ALTER TABLE ... REORGANIZE PARTITON for any of them that have fewer partitions than the number of rows in ndbinfo.nodes. Still, though, I don't find that to be terribly appealing. There are also a couple big caveats for ALTER TABLE ... REORGANIZE PARTITION – it doesn't re-partition UNIQUE indexes or BLOBs. The first of those may not be such a big problem, because UNIQUE indexes (implemented in MySQL Cluster as a separate, hidden table) are not likely to be large in size to the point that scaling reads or spreading the data across additional node groups would be so important. BLOBs, on the other hand, (also implemented in MySQL Cluster as a separate, hidden table) can take up a lot of space, so having them relegated to only some nodes in the cluster might mean that those nodes would use considerably more DataMemory than other nodes.

Another solution, if ALTER TABLE ... REORGANIZE PARTITION doesn't strike your fancy, is to use mysqldump --no-data to backup and restore your schema instead of relying on ndb_restore -m. You'd still use ndb_restore to restore data, but you'd get the schema from mysqldump. When you execute the CREATE TABLE statements output by mysqldump, MySQL Cluster sees them as brand new tables and thus partitions them across all data nodes in the Cluster, as as would be the case for any new table created on the cluster.

Using mysqldump has the advantage of backing up triggers and stored routines, which you won't get if you use ndb_restore -m. If you are using those features, this is very important, of course; if you're not using them, there isn't a lot of practical value gained by using mysqldump. In fact, it means that you add an extra step for backup, and you add an extra step for restore. On top of that, you get no guarantee of consistency. Some DDL could be executed between the time that you run mysqldump and the time you start your NDB native backup. That means that there is no guarantee that the table structure in one part of your backup matches the structure of the data in the other part. That's a little bit scary, and it can only be worked around safely by essentially taking the cluster offline (single user mode) when executing a backup.

My hope is that the original (and still widely expected) behavior of ndb_restore will be ... restored. I've opened bug #64302 to track the issue. Let me know your thoughts here, and let the MySQL Cluster developers know your thoughts on the bug report.

MEMORY engine and Handler% status variables via gdb

While working through my previous blog post, rows_examined UNION Handler% ... ?!?, I learned about some unexpected behavior of the MEMORY (HEAP) storage engine and how it interacts with Handler_read_rnd_next. In particular, even a "deleted" row (one that never really made it into the table to begin with) still must be read over when scanning the table, thereby incrementing Handler_read_rnd_next when one might not expect that to happen. I thought this would be an interesting opportunity to use gdb to really watch what the server was doing when this situation is encountered.

I'm far from being a gdb expert. In fact, I've probably only used it a dozen times or so. Nevertheless, I found it pretty straightforward to get things set up and to understand how to make my way around. (I've probably missed out on some shortcuts and some useful functionality here; hopefully I'm not leading people too far astray!) Fortunately, mysqld binaries are "not stripped" these days. That means that debugging symbols are included in the binary, which gives gdb the ability to associate the activities of a running program with the lines in the source code used to build it. Verify that this is the case using file, which should say "not stripped" at the end of its output:

$ file ./bin/mysqld
./bin/mysqld: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.8, not stripped

Because we'll need (and want!) to look at the server source code while we do this, we need to download a copy of the source. It is crucial that the source code you have exactly matches the source code used to compile the version of the server you're using. I'm using MariaDB for this testing, so I need to get the version of MariaDB source used to build the version of MariaDB I'm using. I happen to be using MariaDB 5.2.4, so I grabbed mariadb-5.2.4.tar.gz from http://downloads.askmonty.org/MariaDB/5.2.4/.

It's also possible to use bzr to grab a copy of the source, if you're inclined to set that all up, or even to browse the source online on Launchpad. I know that I need revision #2894, because that's the highest revision number in the ChangeLog for MariaDB 5.2.4. You can also see the "Tag" of mariadb-5.2.4 on this revision in the Launchpad revision history.

Once you have obtained the source code that corresponds to the version of MySQL, MariaDB, or Percona server (SkySQL supports all of them!) that you're running, fire up the server, find its PID, and attach gdb. Make sure nobody else is using the server while you're doing this; I'm not taking any steps to narrow the scope of my gdb session to a particular thread. (On Ubuntu starting with 10.10, you either need to run gdb as root or you need to alter the setting of /proc/sys/kernel/yama/ptrace_scope.)

$ gdb -p $( cat ./data/ub-dev.pid )
GNU gdb (Ubuntu/Linaro 7.3-0ubuntu2) 7.3-2011.08
Copyright (C) 2011 Free Software Foundation, Inc.
[...snip...]
(gdb)

When you attach to a running process with gdb, it stops the process. You can then set breakpoints, print data structures and variables, et cetera. We have a mission, though, and that is to try to look at what happens when the MEMORY engine causes Handler_read_rnd_next to be incremented. In order to find any of this, it's important to remember that the MEMORY was historically called HEAP, and it retains that name in the server source code. Let's see what we can find out about Handler_read_rnd_next:

$ pwd
/home/kolbe/dev/src/mariadb-5.2.4
$ grep -r Handler_read_rnd_next * | grep -v mysql-test
sql/mysqld.cc:  {"Handler_read_rnd_next",    (char*) offsetof(STATUS_VAR, ha_read_rnd_next_count), SHOW_LONG_STATUS},

I eliminated results that match "mysql-test" because there are lots of tests that look at the value of Handler_read_rnd_next and those are of no interest to us. We see that Handler_read_rnd_next only shows up once in the entire source tree, and that it appears to act as a more human-friendly alias for a lower-level status counter called ha_read_rnd_next_count.

The various storage engines are implemented by way of code that lies in their respective subdirectory of the storage subdirectory of the source tree. We are interested only in the MEMORY engine here, which is called "heap" internally (and historically). Let's look for ha_read_rnd_next_count, but confine our search to the storage/heap directory. I'll use some options to grep to get some context around any matches.

$ grep -n -A 10 -B 10 ha_read_rnd_next_count storage/heap/*
storage/heap/ha_heap.cc-338-  return error;
storage/heap/ha_heap.cc-339-}
storage/heap/ha_heap.cc-340-
storage/heap/ha_heap.cc-341-int ha_heap::rnd_init(bool scan)
storage/heap/ha_heap.cc-342-{
storage/heap/ha_heap.cc-343-  return scan ? heap_scan_init(file) : 0;
storage/heap/ha_heap.cc-344-}
storage/heap/ha_heap.cc-345-
storage/heap/ha_heap.cc-346-int ha_heap::rnd_next(uchar *buf)
storage/heap/ha_heap.cc-347-{
storage/heap/ha_heap.cc:348:  ha_statistic_increment(&SSV::ha_read_rnd_next_count);
storage/heap/ha_heap.cc-349-  int error=heap_scan(file, buf);
storage/heap/ha_heap.cc-350-  table->status=error ? STATUS_NOT_FOUND: 0;
storage/heap/ha_heap.cc-351-  return error;
storage/heap/ha_heap.cc-352-}
storage/heap/ha_heap.cc-353-
storage/heap/ha_heap.cc-354-int ha_heap::rnd_pos(uchar * buf, uchar *pos)
storage/heap/ha_heap.cc-355-{
storage/heap/ha_heap.cc-356-  int error;
storage/heap/ha_heap.cc-357-  HEAP_PTR heap_position;
storage/heap/ha_heap.cc-358-  ha_statistic_increment(&SSV::ha_read_rnd_count);

We can see that the ha_heap::rnd_next function is the only place the HEAP engine modifies the ha_read_rnd_next_count. This function is pretty straightforward. Let's set a breakpoint right after heap_scan is called, so that we can see what's inside the file variable after the new row has been read.

(gdb) break ha_heap.cc:350
Breakpoint 1 at 0x7d0458: file ha_heap.cc, line 350.

We'll also cd inside of gdb to the location of the matching file so that we can look at the source code while we're in the debugger.

(gdb) cd /home/kolbe/dev/src/mariadb-5.2.4/storage/heap
Working directory /home/kolbe/dev/src/mariadb-5.2.4/storage/heap.

Alright, back on the MySQL side, we need some sample data to work with. You'll want to be using GNU screen or a separate terminal window or something in order to be able to easily look at both the MySQL client and your gdb session. Before we'll be able to execute any statements in MySQL, we have to tell gdb to let the server continue running.

(gdb) continue
Continuing.

Now, some sample data and a SELECT to kick things off.

mariadb 5.2.4> create table h1 (id int) engine=memory;
Query OK, 0 rows affected (0.03 sec)

mariadb 5.2.4> insert into h1 values (1),(2),(3);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mariadb 5.2.4> select * from h1;

You'll notice that the SELECT statement doesn't return. That's because the server hit the breakpoint we set in gdb!

Breakpoint 1, ha_heap::rnd_next (this=0x436f4e8, buf=<optimized out>) at ha_heap.cc:350
350       table->status=error ? STATUS_NOT_FOUND: 0;

Now, we can take a look around at some data structures to see what is going on. The ha_heap::rnd_next function had called heap_scan(file, buf), so let's see what file and buf are.

(gdb) print buf
$1 = <optimized out>
(gdb) print file
$2 = (HP_INFO *) 0x435c4a0

Well, buf isn't something we'll be able to use, because the compiler decided it wasn't actually going to bring that into scope here. file, on the other hand, is a pointer to a HP_INFO struct (see include/heap.h). We can dereference and print it right at the gdb command-line:

(gdb) set print pretty on
(gdb) print *file
$4 = {
  s = 0x432a490, 
  current_ptr = 0x438bd30 "\375\001", 
  current_hash_ptr = 0x0, 
  current_record = 0, 
  next_block = 3, 
  lastinx = -1, 
  errkey = -1, 
  mode = 2, 
  opt_flag = 0, 
  update = 50, 
  lastkey = 0x435c768 "\321\b", 
  recbuf = 0x435c768 "\321\b", 
  last_find_flag = HA_READ_KEY_EXACT, 
  parents = {0x0 <repeats 65 times>}, 
  last_pos = 0x0, 
  lastkey_len = 0, 
  implicit_emptied = 0 '\000', 
  lock = {
    owner = 0x43026d0, 
    next = 0x0, 
    prev = 0x432a5e0, 
    lock = 0x432a590, 
    cond = 0x0, 
    status_param = 0x0, 
    debug_print_param = 0x436ec00, 
    type = TL_READ, 
    priority = 0
  }, 
  open_list = {
    prev = 0x0, 
    next = 0x4373550, 
    data = 0x435c4a0
  }
}

A couple things look useful right away, current_record and current_ptr especially.

s also looks interesting (and it is!), and we can print it to see what it is, just like we did with file:

(gdb) print file->s
$5 = (HP_SHARE *) 0x432a490
(gdb) print *file->s
[...snip...]

It turns out file->s is a pretty large data structure, so I'm not going to include it here. You can browse the definition of HP_SHARE in include/heap.h. We won't be looking into it during this exercise.

We saw earlier that the return value of heap_scan was assigned to error:

(gdb) print error
$9 = 0

You don't have to take my word for it, though; gdb will gladly print the source code that surrounds the current execution context:

(gdb) list
345
346     int ha_heap::rnd_next(uchar *buf)
347     {
348       ha_statistic_increment(&SSV::ha_read_rnd_next_count);
349       int error=heap_scan(file, buf);
350       table->status=error ? STATUS_NOT_FOUND: 0;
351       return error;
352     }
353
354     int ha_heap::rnd_pos(uchar * buf, uchar *pos)

We'll keep an eye on error going forward. For now, though, let's continue execution of the program.

(gdb) c
Continuing.

Breakpoint 1, ha_heap::rnd_next (this=0x436f4e8, buf=<optimized out>) at ha_heap.cc:350
350       table->status=error ? STATUS_NOT_FOUND: 0;

It stops again right away. There are several rows in the table, and this function is called for every additional row the server reads. Let's make sure error is still 0 and let's see if anything has changed in file since we last looked at it:

(gdb) print error
$12 = 0
(gdb) print *file
$11 = {
  s = 0x432a490, 
  current_ptr = 0x438bd40 "\375\002", 
  current_hash_ptr = 0x0, 
  current_record = 1, 
  next_block = 3, 
  lastinx = -1, 
  errkey = -1, 
  mode = 2, 
  opt_flag = 0, 
  update = 50, 
  lastkey = 0x435c768 "\321\b", 
  recbuf = 0x435c768 "\321\b", 
  last_find_flag = HA_READ_KEY_EXACT, 
  parents = {0x0 <repeats 65 times>}, 
  last_pos = 0x0, 
  lastkey_len = 0, 
  implicit_emptied = 0 '\000', 
  lock = {
    owner = 0x43026d0, 
    next = 0x0, 
    prev = 0x432a5e0, 
    lock = 0x432a590, 
    cond = 0x0, 
    status_param = 0x0, 
    debug_print_param = 0x436ec00, 
    type = TL_READ, 
    priority = 0
  }, 
  open_list = {
    prev = 0x0, 
    next = 0x4373550, 
    data = 0x435c4a0
  }
}

Two things jump out, current_ptr is now "\375\002" (it was "\375\001" before), and current_record is 1 (it was 0 before). Let's keep going.

(gdb) c
Continuing.

Breakpoint 1, ha_heap::rnd_next (this=0x436f4e8, buf=<optimized out>) at ha_heap.cc:350
350       table->status=error ? STATUS_NOT_FOUND: 0;
(gdb) print error
$13 = 0
(gdb) print *file
$14 = {
  s = 0x432a490, 
  current_ptr = 0x438bd50 "\375\003", 
  current_hash_ptr = 0x0, 
  current_record = 2, 
  next_block = 3, 
  lastinx = -1, 
  errkey = -1, 
  mode = 2, 
  opt_flag = 0, 
  update = 50, 
  lastkey = 0x435c768 "\321\b", 
  recbuf = 0x435c768 "\321\b", 
  last_find_flag = HA_READ_KEY_EXACT, 
  parents = {0x0 <repeats 65 times>}, 
  last_pos = 0x0, 
  lastkey_len = 0, 
  implicit_emptied = 0 '\000', 
  lock = {
    owner = 0x43026d0, 
    next = 0x0, 
    prev = 0x432a5e0, 
    lock = 0x432a590, 
    cond = 0x0, 
    status_param = 0x0, 
    debug_print_param = 0x436ec00, 
    type = TL_READ, 
    priority = 0
  }, 
  open_list = {
    prev = 0x0, 
    next = 0x4373550, 
    data = 0x435c4a0
  }
}

OK, same as before. There was no error, current_ptr is now "\375\003" and current_record is 2.

(gdb) c
Continuing.

Breakpoint 1, ha_heap::rnd_next (this=0x436f4e8, buf=<optimized out>) at ha_heap.cc:350
350       table->status=error ? STATUS_NOT_FOUND: 0;
(gdb) print error
$15 = 137

Hey, we have an error! What went wrong? Well, nothing. We got to the end of the table. Take a look at include/my_base.h:

389-#define HA_ERR_OLD_FILE             132     /* old databasfile */
390-#define HA_ERR_NO_ACTIVE_RECORD 133 /* No record read in update() */
391-#define HA_ERR_RECORD_DELETED       134     /* A record is not there */
392-#define HA_ERR_RECORD_FILE_FULL 135 /* No more room in file */
393-#define HA_ERR_INDEX_FILE_FULL      136     /* No more room in file */
394:#define HA_ERR_END_OF_FILE  137     /* end in next/prev/first/last */
395-#define HA_ERR_UNSUPPORTED  138     /* unsupported extension used */
396-#define HA_ERR_TO_BIG_ROW   139     /* Too big row */
397-#define HA_WRONG_CREATE_OPTION      140     /* Wrong create option */
398-#define HA_ERR_FOUND_DUPP_UNIQUE 141        /* Dupplicate unique on write */
399-#define HA_ERR_UNKNOWN_CHARSET       142    /* Can't open charset */

The next time we let the program continue, we get our result set back on the client.

(gdb) c
Continuing.
mariadb 5.2.4> select * from h1;
+------+
| id   |
+------+
|    1 |
|    2 |
|    3 |
+------+
3 rows in set (10 min 29.13 sec)

Our breakpoint was hit 4 times, one for each row in the table and one final time for EOF. But I set out trying to figure out what would happen in the case of a "deleted" row in the MEMORY table. Let's test that out, now.

mariadb 5.2.4> delete from h1 where id=2;

Oops, that triggers our breakpoint! We can disable it like this:

Breakpoint 1, ha_heap::rnd_next (this=0x436f4e8, buf=<optimized out>) at ha_heap.cc:350
350       table->status=error ? STATUS_NOT_FOUND: 0;
(gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00000000007d0458 in ha_heap::rnd_next(unsigned char*) at ha_heap.cc:350
        breakpoint already hit 28 times
(gdb) disable breakpoint 1
(gdb) c
Continuing.

Back in the client, you should see that the row was deleted successfully:

mariadb 5.2.4> delete from h1 where id=2;
Query OK, 1 row affected (48.27 sec)

We want to SELECT from the table again, but our breakpoint is disabled. Instead of disabling it, we could have just kept hitting continue to keep gdb working through the file, or even done continue 4 to get all the way through it, since we know that we hit that breakpoint 4 times scanning through our 3-row table.

We've lost our gdb command line now, though, and we won't get it back until a breakpoint is hit or we manually interrupt the program we're debugging. (The normal way to do that is to hit Ctrl-C, but for whatever reason that doesn't seem to work most of the time when debugging mysqld.) Another technique is to send mysqld the TRAP signal (you may need to do it more than once!). Once we have the (gdb) command prompt back, we'll re-enable our breakpoint.

$ kill -TRAP $( cat ./data/ub-dev.pid )
Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0x7f8e7a7c6740 (LWP 26113)]
0x00007f8e78ec4913 in select () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) enable breakpoint 1
(gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00000000007d0458 in ha_heap::rnd_next(unsigned char*) at ha_heap.cc:350
        breakpoint already hit 28 times
(gdb) c
Continuing.

Now, let's SELECT from the table again and look at some of the data that was interesting the first time through.

mariadb 5.2.4> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.2.4> select * from h1;
Breakpoint 1, ha_heap::rnd_next (this=0x436f4e8, buf=<optimized out>) at ha_heap.cc:350
350       table->status=error ? STATUS_NOT_FOUND: 0;
(gdb) print error
$16 = 0
(gdb) print file->current_ptr
$17 = (uchar *) 0x438bd30 "\375\001"
(gdb) print file->current_record
$18 = 0
(gdb) c
Continuing.

Breakpoint 1, ha_heap::rnd_next (this=0x436f4e8, buf=<optimized out>) at ha_heap.cc:350
350       table->status=error ? STATUS_NOT_FOUND: 0;
(gdb) print error
$19 = 134
(gdb) print file->current_ptr
$20 = (uchar *) 0x438bd40 ""
(gdb) print file->current_record
$21 = 1

The first row came through as we expected, but the second resulted in an error. The snippet of include/ha_base.h I included above shows that 134 corresponds to HA_ERR_RECORD_DELETED. There it is, our deleted record (id=2), just where we would've expected to find it.

(gdb) c
Continuing.

Breakpoint 1, ha_heap::rnd_next (this=0x436f4e8, buf=<optimized out>) at ha_heap.cc:350
350       table->status=error ? STATUS_NOT_FOUND: 0;
(gdb) print error
$22 = 0
(gdb) print file->current_ptr
$23 = (uchar *) 0x438bd50 "\375\003"
(gdb) print file->current_record
$24 = 2
(gdb) c
Continuing.

Breakpoint 1, ha_heap::rnd_next (this=0x436f4e8, buf=<optimized out>) at ha_heap.cc:350
350       table->status=error ? STATUS_NOT_FOUND: 0;
(gdb) print error
$25 = 137
(gdb) print file->current_record
$27 = 3
(gdb) c
Continuing.

We can see that there are still 4 calls to ha_heap::rnd_next (each of which increments Handler_read_rnd_next), even though only 2 rows are returned to the client.

mariadb 5.2.4> select * from h1;
+------+
| id   |
+------+
|    1 |
|    3 |
+------+
2 rows in set (4 min 4.66 sec)

mariadb 5.2.4> show status like 'Handler_read_rnd_next%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_rnd_next | 4     |
+-----------------------+-------+
1 row in set (0.00 sec)

SHOW STATUS hits our breakpoint, too, which means SHOW STATUS is creating a temporary MEMORY table! Again, you'll either need to continue through the breakpoint or disable it.

So, there it is. We just used gdb to look at the activities of a running MariaDB server (though as I said, this will apply equally to MySQL and Percona servers) as it retrieved rows from a (non-indexed) MEMORY table. I was actually surprised how straightforward it was to do this. I look forward to spending more time tracing things like this to gain a better understanding of the server source, and I hope that this might encourage others to do the same!

rows_examined UNION Handler% ... ?!?

A customer recently asked if I could help clarify the relationship between rows_examined (in the slow query log) and the Handler% variables (in SHOW STATUS output) for a simple UNION. I didn't know the answer off the top of my head, but I didn't think it would be too terribly difficult to find it. Several days and many hours of work later, I think I finally have some idea of how this all works.

Let's first set up some sample data and get some information about it. (I'm using MariaDB 5.2.4 for these tests because that's the version the customer was using. This should basically all apply to other versions/vendors, except specific changes I refer to in MariaDB 5.3.)

maria 5.2.4> create table t1 (id int, key(id));
Query OK, 0 rows affected (0.01 sec)

maria 5.2.4> \! seq -f '(%.0f)' -s, 1 10
(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)
maria 5.2.4> insert into t1 values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
Query OK, 10 rows affected (0.00 sec)
Records: 10  Duplicates: 0  Warnings: 0

maria 5.2.4> set global slow_query_log=1, log_output='table', long_query_time=0;
Query OK, 0 rows affected (0.03 sec)

 5.2.4> flush status;
Query OK, 0 rows affected (0.00 sec)

maria 5.2.4> select * from t1 where id=5 union all select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
|    5 |
+------+
2 rows in set (0.00 sec)

maria 5.2.4> show status like 'Handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 0     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_next          | 2     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 3     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 4     |
+----------------------------+-------+
15 rows in set (0.01 sec)

maria 5.2.4> select * from mysql.slow_log where sql_text like 'select%' order by start_time desc limit 1\G
*************************** 1. row ***************************
    start_time: 2012-01-12 11:12:39
     user_host: root[root] @ localhost []
    query_time: 00:00:00
     lock_time: 00:00:00
     rows_sent: 2
 rows_examined: 7
            db: test
last_insert_id: 0
     insert_id: 0
     server_id: 0
      sql_text: select * from t1 where id=5 union all select * from t1 where id=5
1 row in set (0.01 sec)

So, we have one table (t1) with one column (id) and 10 rows. There's a key on id. There are two different things I want to look at here: the Handler% variables and rows_examined in the slow query log. Initially, they seem to match up perfectly. Handler_read_key + Handler_read_next + Handler_read_rnd_next = 7 = rows_examined. Case closed, right? Not so fast.

How many rows are actually being handled in order to work execute this query? There's one row from each of the base tables in the UNION (well, there are two instances of the same table, but UNION doesn't know or care about that). Those rows are written to a temporary table as part of the UNION, and then they're read back from the temporary table and sent to the client. One row from each base table is two rows, and then two rows from the temporary table make four rows total. Why, then, is rows_examined=7? And why do the Handler_read% variables seem to support this value? There's a lot more going on here than meets the eye. Not having much luck with the reference manual, I set off on an odyssey through the MariaDB source code to try to make sense out of all of this.

If you want to follow along, go grab a copy of the MariaDB source. You can find it at http://downloads.askmonty.org/MariaDB/5.2.4/.

I did lots of grepping and eventually found my way to st_select_lex_unit::exec in sql/sql_union.cc. This is the "meat" of where a UNION is executed. Right at the top of the function (line 443) you see this:

  ha_rows examined_rows= 0;

It looks like st_select_lex_unit::exec is going to be maintaining its own examined rows counter. The next mention of examined_rows is at line #515:

          examined_rows+= thd->examined_row_count;

Here, the examined_rows counter gets the current value that's stored in the thread information structure. The next thing that happens is at line #629:

        thd->examined_row_count+= examined_rows;

This shows the thread information structure that we saw above getting examined_rows added back to it. That can't be right... assigning that value back whence it came doesn't make any sense. Sure enough, in MariaDB 5.3.3 this is done differently; the code we saw at line #515 above (it's actually at line #660 in the MariaDB 5.3.3 source) looks like this, instead:

          examined_rows+= thd->examined_row_count;
          thd->examined_row_count= 0;

Instead of adding thd->examined_row_count back onto itself, thd->examined_row_count is reinitialized to 0. Later examined_rows is added back to thd->examined_row_count just as was done in MariaDB 5.2.4. That means that the value of rows_examined in the slow query log make a lot more sense in MariaDB 5.3.3 than they did in MariaDB 5.2.4:

mariadb 5.3.3> select * from t1 where id=5 union all select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
|    5 |
+------+
2 rows in set (0.07 sec)

mariadb 5.3.3> select * from mysql.slow_log order by start_time desc limit 1\G
*************************** 1. row ***************************
    start_time: 2012-01-12 11:36:52.136361
     user_host: root[root] @ localhost []
    query_time: 00:00:00.067328
     lock_time: 00:00:00.024490
     rows_sent: 2
 rows_examined: 4
            db: test
last_insert_id: 0
     insert_id: 0
     server_id: 0
      sql_text: select * from t1 where id=5 union all select * from t1 where id=5
1 row in set (0.00 sec)

So, that solves the conundrum of why rows_examined didn't match expectations for a UNION. There was a bug! But... rows_examined, even though it didn't really make much sense, did match the Handler% variables. Those are unchanged in MariaDB 5.3.3:

mariadb 5.3.3> show status like "handler%";
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 0     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_next          | 2     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 3     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 2     |
| Handler_update             | 0     |
| Handler_write              | 1     |
+----------------------------+-------+
17 rows in set (0.13 sec)

Whether we're using MariaDB 5.2.4 or MariaDB 5.3.3, Handler_read_key, Handler_read_next, and Handler_read_rnd_next still need to be accounted for. With the assistance of some of the absolutely wonderful people at Monty Program AB, I eventually got a grasp on these.

Handler_read_key means that the server is making a direct access to a specific index entry. In this case, it's going straight to the first entry in the index for value "5" (id=5). There are two tables in the UNION, and we're doing a very simple query against each of them. Each SELECT only matches a single row, so Handler_read_key is 1 for each table, and 2 in total.

Handler_read_next means that the server is looking for the next value in an index. Even though MySQL uses a direct lookup (Handler_read_key) to find the first entry in the index for a particular value, there might be other entries for that same value, so Handler_read_next is required. In this case, there is only one row that matches id=5, so Handler_read_next returns something like EOF. Again, this happens twice in the UNION, so Handler_read_next is 2. If more rows were to match, Handler_read_key would still be 1, but Handler_read_next would increase:

mariadb 5.2.4> insert into t1 values (5),(5),(5);
Query OK, 3 rows affected (0.03 sec)
Records: 3  Duplicates: 0  Warnings: 0

mariadb 5.2.4> select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
|    5 |
|    5 |
|    5 |
+------+
4 rows in set (0.00 sec)

mariadb 5.2.4> show status like 'Handler_read%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_first    | 0     |
| Handler_read_key      | 1     |
| Handler_read_next     | 4     |
| Handler_read_prev     | 0     |
| Handler_read_rnd      | 0     |
| Handler_read_rnd_next | 0     |
+-----------------------+-------+
6 rows in set (0.00 sec)

Handler_read_rnd_next indicates that the server is doing a table scan. This is happening for our UNION because of the temporary table. Even though there are only 2 rows in the temporary table, Handler_read_rnd_next is 3 for the same reason that Handler_read_next was set—the server must find the end of the data. The last time Handler_read_rnd_next is incremented is when the handler finds the end of the table.

That wasn't so bad, right? There's a little extra wrinkle to look out for (of course). We've been doing UNION ALL, but what changes if we do UNION DISTINCT? I'll use MariaDB 5.3.3 for this example, so that we can compare better values for rows_examined.

mariadb 5.3.3> select * from t1 where id=5 union all select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
|    5 |
+------+
2 rows in set (0.00 sec)

mariadb 5.3.3> show status like "handler%";
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 0     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_next          | 2     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 3     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 2     |
| Handler_update             | 0     |
| Handler_write              | 1     |
+----------------------------+-------+
17 rows in set (0.00 sec)

mariadb 5.3.3> select * from mysql.slow_log order by start_time desc limit 1\G
*************************** 1. row ***************************
    start_time: 2012-01-12 12:07:55.514920
     user_host: root[root] @ localhost []
    query_time: 00:00:00.000745
     lock_time: 00:00:00.000220
     rows_sent: 2
 rows_examined: 4
            db: test
last_insert_id: 0
     insert_id: 0
     server_id: 0
      sql_text: select * from t1 where id=5 union all select * from t1 where id=5
1 row in set (0.01 sec)

mariadb 5.3.3> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.3.3> select * from t1 where id=5 union distinct select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
+------+
1 row in set (0.02 sec)

mariadb 5.3.3> show status like "handler%";
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 0     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_next          | 2     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 3     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 2     |
| Handler_update             | 0     |
| Handler_write              | 1     |
+----------------------------+-------+
17 rows in set (0.00 sec)

mariadb 5.3.3> select * from mysql.slow_log order by start_time desc limit 1\G
*************************** 1. row ***************************
    start_time: 2012-01-12 12:08:23.308634
     user_host: root[root] @ localhost []
    query_time: 00:00:00.023508
     lock_time: 00:00:00.000156
     rows_sent: 1
 rows_examined: 3
            db: test
last_insert_id: 0
     insert_id: 0
     server_id: 0
      sql_text: select * from t1 where id=5 union distinct select * from t1 where id=5
1 row in set (0.00 sec)

When you do UNION [DISTINCT] (DISTINCT is the default if you don't specify either DISTINCT or ALL), the server creates the temporary table with a flag that tells it to create a unique key and thus reject duplicate rows. So, for UNION DISTINCT, why is Handler_read_rnd_next, the one that corresponds to reading from the temporary table, still equal to 3? If there is only 1 row in the table, shouldn't this variable be equal to 2? One for the row, and one for EOF, right? It turns out that the row is actually created in the temporary table before the unique key check is performed. After the unique check fails, the row is marked as "deleted", but it's still there. When working through the table, Handler_read_rnd_next is incremented every time it's called ... even if it finds a deleted row or EOF.

That's an internal feature of how HEAP/MEMORY tables work. If we force a non-heap (MyISAM or Aria) temporary table, we have Handler_read_rnd_next of 2 for UNION DISTINCT, as we might have expected to begin with:

mariadb 5.3.3> alter table t1 add column t text;
Query OK, 10 rows affected (0.01 sec)              
Records: 10  Duplicates: 0  Warnings: 0

mariadb 5.3.3> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.3.3> select * from t1 where id=5 union all select * from t1 where id=5;
+------+------+
| id   | t    |
+------+------+
|    5 | NULL |
|    5 | NULL |
+------+------+
2 rows in set (0.00 sec)

mariadb 5.3.3> show status like "Handler_read_rnd_next";
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_rnd_next | 3     |
+-----------------------+-------+
1 row in set (0.00 sec)

mariadb 5.3.3> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.3.3> select * from t1 where id=5 union distinct select * from t1 where id=5;
+------+------+
| id   | t    |
+------+------+
|    5 | NULL |
+------+------+
1 row in set (0.00 sec)

mariadb 5.3.3> show status like "Handler_read_rnd_next";
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_rnd_next | 2     |
+-----------------------+-------+
1 row in set (0.00 sec)

The Handler_read% variables now taken care of, we ought to take a look at Handler_write. You may have noticed some differences in those between MariaDB 5.2.4 and MariaDB 5.3.3:

mariadb 5.2.4> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.2.4> select * from t1 where id=5 union all select * from t1 where id=5;
+------+
| id   |
+------+
|    5 |
|    5 |
+------+
2 rows in set (0.07 sec)

mariadb 5.2.4> show status like 'Handler_write%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Handler_write | 4     |
+---------------+-------+
1 row in set (0.04 sec)
mariadb 5.3.3> flush status;
Query OK, 0 rows affected (0.00 sec)

mariadb 5.3.3> select * from t1 where id=5 union all select * from t1 where id=5;
+------+------+
| id   | t    |
+------+------+
|    5 | NULL |
|    5 | NULL |
+------+------+
2 rows in set (0.00 sec)

mariadb 5.3.3> show status like "Handler%write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 2     |
| Handler_write     | 1     |
+-------------------+-------+
2 rows in set (0.00 sec)

It's easiest to look at MariaDB 5.3.3 first, since it helpfully breaks out temporary table writes for us. We know the UNION writes to the temporary table, so Handler_tmp_write of 2 is easy to explain. I was quite confused about Handler_write at first, until someone reminded me that I had the slow query log turned on with log_output=table!

mariadb 5.3.3> set global log_output='file';
Query OK, 0 rows affected (0.00 sec)

mariadb 5.3.3> select * from t1 where id=5 union all select * from t1 where id=5;
+------+------+
| id   | t    |
+------+------+
|    5 | NULL |
|    5 | NULL |
+------+------+
2 rows in set (0.01 sec)

mariadb 5.3.3> show status like "Handler%write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 2     |
| Handler_write     | 0     |
+-------------------+-------+
2 rows in set (0.01 sec)

Still, though, MariaDB 5.2.4 showed Handler_write of 4, while here we've only accounted for 3 (2 for writing to the temporary table, 1 for writing to the slow query log). Then I noticed/remembered/realized that SHOW STATUS itself actually increments Handler_write.

mariadb 5.2.4> show status like 'Handler_write%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Handler_write | 4     |
+---------------+-------+
1 row in set (0.04 sec)

mariadb 5.2.4> show status like 'Handler_write%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Handler_write | 5     |
+---------------+-------+
1 row in set (0.00 sec)

mariadb 5.2.4> show status like 'Handler_write%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Handler_write | 6     |
+---------------+-------+
1 row in set (0.00 sec)

mariadb 5.2.4> show status like 'Handler_write%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Handler_write | 7     |
+---------------+-------+
1 row in set (0.00 sec)

That seems to be fixed in MariaDB 5.3:

mariadb 5.3.3> show status like "Handler%write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 2     |
| Handler_write     | 0     |
+-------------------+-------+
2 rows in set (0.00 sec)

mariadb 5.3.3> show status like "Handler%write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 2     |
| Handler_write     | 0     |
+-------------------+-------+
2 rows in set (0.00 sec)

mariadb 5.3.3> show status like "Handler%write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 2     |
| Handler_write     | 0     |
+-------------------+-------+
2 rows in set (0.00 sec)

What I thought would be a simple investigation into rows_examined and the Handler% variables actually ended up being a really interesting trek through the MariaDB source code, a chance to use gdb, a bug in sql_union.cc, some unexpected ancient low-level behavior exposed via status variables, and a chance to learn about some nice new changes in MariaDB 5.3. I hope I've gotten my facts straight, and I hope you'll let me know if I haven't!

Demystifying identifier mapping

While tinkering with some things the other day, I noticed that if I create a table named "ñ", MySQL creates a .frm file with the name of @0x.frm. Similar "encoding" is done for any file that needs to be created (.MYI, .ibd, .TRG, .TRN, et cetera). I'm not sure this is necessary or useful when my filesystem seems to support all kinds of fanciful characters in filenames, and especially when character_set_system is set by default (and unchangeably!) to utf8, but it's happening nevertheless and I decided to learn more about how it works.

This behavior is discussed in the reference manual at Mapping of Identifiers to File Names. There's even a chart that claims to show how the mapping is performed. The chart and text below it on the "Mapping of Identifiers to File Names" page show that characters in the range 00C0..017F should be mapped using the pattern [@][0..4][g..z], but it wasn't very obvious why. (I had no luck understanding any of it at first, in large part because I hadn't yet figured out that the values in the Code Range column are UCS-2 values, not UTF-8 values.)

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> select convert(0x00c0 using ucs2);
+-----------------------------+
| convert(0x00c0 using ucs2) |
+----------------------------+
| À                          |
+----------------------------+
1 row in set (0.00 sec)

mysql> create table À (id int);
Query OK, 0 rows affected (0.02 sec)
$ pwd
/home/kolbe/mysql/5.6/data/test
$ ls -l
total 108
-rw-rw---- 1 kolbe kolbe  8556 2012-01-03 13:51 @0G.frm
-rw-rw---- 1 kolbe kolbe 98304 2012-01-03 13:51 @0G.ibd

Well, that shows that it works as described, but I still didn't really understand why. I decided to start digging around the MySQL Server source code. First I looked at a lot of things in mysys, thinking that this functionality might be low-level enough to exist there. I was wrong. I looked around in sql, because that's where the vast majority of all the server code exists, but I also didn't have any luck there, initially, because there are tons of file operations that this algorithm applies to and I didn't know where to start.

Eventually, I stumbled across some use of build_table_filename in sql/sql_base.cc, which led me to sql/sql_table.cc, where that function is defined. Well, build_table_filename calls tablename_to_filename, which uses strconvert to actually perform the conversion of the string from one "character set" to another. The strconvert function, defined in sql/strfunc.cc, is called with a "to" character set (to_cs) of my_charset_filename. So, MySQL seems to be using some internal separate "character set" to control/define the mapping used for filenames. The to_cs argument of the strconvert function is a CHARSET_INFO struct that contains information about a given character set, including some methods that can be used to perform conversion operations. There's a good amount of information about the CHARSET_INFO structure in strings/CHARSET_INFO.txt. The strconvert function uses the wc_mb function in the cset member (type MY_CHARSET_HANDLER) of the CHARSET_INFO struct to perform the conversion.

I found my_charset_filename defined in strings/ctype-utf8.c. The cset member of my_charset_filename is my_charset_filename_handler, defined in the same file. The wc_mb function for this character set is my_wc_mb_filename, once again located in the same file. Here's the contents of that function:

static int
my_wc_mb_filename(const CHARSET_INFO *cs __attribute__((unused)),
                  my_wc_t wc, uchar *s, uchar *e)
{ 
  int code;
  char hex[]= "0123456789abcdef";
  if (wc < 128 && filename_safe_char[wc])
  { 
    *s= (uchar) wc;
    return 1;
  }

  if (s + 3 > e)
    return MY_CS_TOOSMALL3;

  *s++= MY_FILENAME_ESCAPE;
  if ((wc >= 0x00C0 && wc <= 0x05FF && (code= uni_0C00_05FF[wc - 0x00C0])) ||
      (wc >= 0x1E00 && wc <= 0x1FFF && (code= uni_1E00_1FFF[wc - 0x1E00])) ||
      (wc >= 0x2160 && wc <= 0x217F && (code= uni_2160_217F[wc - 0x2160])) ||
      (wc >= 0x24B0 && wc <= 0x24EF && (code= uni_24B0_24EF[wc - 0x24B0])) ||
      (wc >= 0xFF20 && wc <= 0xFF5F && (code= uni_FF20_FF5F[wc - 0xFF20])))
  { 
    *s++= (code / 80) + 0x30;
    *s++= (code % 80) + 0x30;
    return 3;
  }

  /* Non letter */
  if (s + 5 > e)
    return MY_CS_TOOSMALL5;

  *s++= hex[(wc >> 12) & 15];
  *s++= hex[(wc >> 8) & 15];
  *s++= hex[(wc >> 4) & 15];
  *s++= hex[(wc)      & 15];
  return 5;
}

There it is, after all the hunting, the actual code that does the conversion from À to @0G. Let's see how it works.

There are numerous arrays of code points defined in strings/ctype-utf8.c (uni_0C00_05FF, uni_1E00_1FFF, et cetera); they look like this:

/* 00C0-05FF */
static uint16 uni_0C00_05FF[1344]=
{
  0x0017,0x0018,0x0019,0x001A,0x001B,0x001C,0x001D,0x001E,
  0x001F,0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,
  0x0027,0x0028,0x0029,0x002A,0x0067,0x0068,0x0069,0x0000,
  0x006B,0x006C,0x006D,0x006E,0x006F,0x0070,0x0071,0x008A,
  0x0037,0x0038,0x0039,0x003A,0x003B,0x003C,0x003D,0x003E,
  0x003F,0x0040,0x0041,0x0042,0x0043,0x0044,0x0045,0x0046,
  0x0047,0x0048,0x0049,0x004A,0x0087,0x0088,0x0089,0x0000,
  0x008B,0x008C,0x008D,0x008E,0x008F,0x0090,0x0091,0x0092,
...

(You may notice that the uni_0C00_05FF is misnamed. It should be uni_00C0_05FF, which is even what the comment above its definition says! Oops...)

The uni_0C00_05FF array covers 11 Unicode blocks, including part of Latin-1 Supplement, IPA Extensions, Cyrillic, and Hebrew, among others. Only part of the Latin-1 Supplement block is handled by my_wc_mb_filename, and the entirety of that is shown above, 0x00C0 through 0x00FF, or elements 0 through 63 (since 0x00FF - 0x00C0 = 63).

There's not really a consistent pattern to the contents of the Latin-1 Supplement, which is one reason this array has to exist in the source as a static array with over a thousand elements. Within the second half of the Latin-1 Supplement block, the portion starting at 0x00C0 handled by this function, there are some uppercase vowels with consonants and math operators interspersed, and then the lowercase versions of the same; in other words, the uppercase letters are grouped, and the lowercase letters follow in a separate group, but there's other non-letter stuff in there, too.

The snippet above of the uni_0C00_05FF array consists of the UCS-2 values for G through Z and g through z, with 0x30 subtracted from them, with varying offsets of 80, skipping various non-letter characters. (One such character is 0x00D7, MULTIPLICATION SIGN, element 23 in the array, at the end of the 3rd line above. Such characters exist in filenames using their full unicode representation. In this case, a table named "×" —the MULTIPLICATION SIGN, not the letter "x"!— gets a .frm named @00d7.frm.)

In the Latin Extended-A Block (0x0100 to 0x017F), by contrast, each uppercase letter is directly followed by its lowercase equivalent (Ā, ā, Ă, ă, et cetera). We can see this reflected in another snippet of uni_0C00_05FF directly following the one shown above:

static uint16 uni_0C00_05FF[1344]=
{
...
  0x0073,0x0093,0x0074,0x0094,0x0075,0x0095,0x0076,0x0096,
  0x0077,0x0097,0x0078,0x0098,0x0079,0x0099,0x007A,0x009A,
  0x00B7,0x00D7,0x00B8,0x00D8,0x00B9,0x00D9,0x00BA,0x00DA,
  0x00BB,0x00DB,0x00BC,0x00DC,0x00BD,0x00DD,0x00BE,0x00DE,
...

There's a much clearer pattern there. Each value is followed by itself + 0x20 (decimal 32), which is the distance between an uppercase letter and its lowercase equivalent in Basic Latin (ASCII!):

mysql> select ord('A');
+----------+
| ord('A') |
+----------+
|       65 |
+----------+
1 row in set (0.01 sec)

mysql> select ord('a');
+----------+
| ord('a') |
+----------+
|       97 |
+----------+
1 row in set (0.00 sec)

mysql> select ord('a') - ord('A');
+---------------------+
| ord('a') - ord('A') |
+---------------------+
|                  32 |
+---------------------+
1 row in set (0.00 sec)

As explained in the manual, this conversion/encoding scheme for "identifier mapping" is designed to preserve case sensitivity on the filesystem. That is, À and à become @0G and @0g respectively. That means that table "À" is equal to table "à" on a case-insensitive filesystem (because @0G is equal to @0g) while they're not equal on a case-sensitive filesystem.

The code to perform the conversion essentialy determines which of the arrays to consult, subtracts the offset of the beginning of the range, does a lookup in the appropriate array, adds '0' (0x30), shifts things around a bit with division and modulus to get everything to fit into the scheme of allowed characters ([0..4][g..z]), and that's that.

Because my brain handles character sets/encodings (and arithmetic, I guess) a lot better at the SQL prompt than elsewhere, I put this together:

mysql> create table c (id int not null auto_increment primary key, c int);
Query OK, 0 rows affected (0.10 sec)

mysql> delimiter //
mysql> DROP PROCEDURE IF EXISTS insert_ucs2;
    -> CREATE PROCEDURE insert_ucs2 (c VARCHAR(255)) BEGIN
    ->         DECLARE i INT DEFAULT 1;
    ->         DECLARE n INT DEFAULT 1;
    ->
    ->         WHILE i < LENGTH(c) AND n <> 0 DO
    ->                 SET n = LOCATE(',', SUBSTRING(c FROM i));
    ->                 IF n = 0 THEN SET n = LENGTH(c)-i+2; END IF;
    ->                 INSERT INTO c (c) VALUES ( CONV(SUBSTRING(c FROM i+2 for n-3), 16, 10) );
    ->                 SET i = i + n;
    ->         END WHILE;
    -> 
    -> END//
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

mysql> delimiter ;

mysql> call insert_ucs2('0x0073,0x0093,0x0074,0x0094,0x0075,0x0095,0x0076,0x0096,0x0077,0x0097,0x0078,0x0098,0x0079,0x0099,0x007A,0x009A');
Query OK, 1 row affected (0.04 sec)

That's a little procedure to parse the lines from the conversion mapping arrays and put them into a very basic table. I used the first few line that correspond to the Latin Extended-A Block. Now, we can see how those actually map to filenames:

mysql> select concat('@',floor(c / 80), char(convert(c % 80 + 0x30 using ucs2))) as v from c;
+------+
| v    |
+------+
| @1S  |
| @1s  |
| @1T  |
| @1t  |
| @1U  |
| @1u  |
| @1V  |
| @1v  |
| @1W  |
| @1w  |
| @1X  |
| @1x  |
| @1Y  |
| @1y  |
| @1Z  |
| @1z  |
+------+
16 rows in set (0.00 sec)
mysql> create table Ā (id int);
Query OK, 0 rows affected (0.03 sec)

$ ls /home/kolbe/mysql/5.6/data/test
@1S.frm  @1S.ibd  c.frm  c.ibd

Looks like we guessed right! Ā is the first character of Latin Extended-A, and MySQL does map it to @1S as expected.

So, here's something even more silly, a re-implementation of (part of) the conversion algorithm using SQL:

$ tail -n +6897 < ./ctype-utf8.c | head -n $((7065 - 6897)) | perl -e 'my $i = 0; while (<STDIN>) { chomp; print "call insert_ucs2(\"" if $i == 0; s/\s//g; s/,$// if $i == 3; print; print "\");\n" if $i == 3; $i = ($i + 1) %4 }' > inserts.sql

$ mysql < ~/inserts.sql

mysql> select count(*) from c;
+----------+
| count(*) |
+----------+
|     1344 |
+----------+
1 row in set (0.00 sec)

mysql> select 0x05ff - 0x00c0;
+-----------------+
| 0x05ff - 0x00c0 |
+-----------------+
|            1343 |
+-----------------+
1 row in set (0.00 sec)

mysql> create table u (c char(1) character set utf8);
Query OK, 0 rows affected (0.09 sec)

mysql> insert into u values (convert(0x00c0 using ucs2));
Query OK, 1 row affected (0.01 sec)

mysql> insert into u select convert(char(max(ord(convert(c using ucs2)))+1) using ucs2) from u;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

...repeat a few more times...

mysql> select * from u;
+------+
| c    |
+------+
| À    |
| Á    |
| Â    |
| Ã    |
| Ä    |
| Å    |
| Æ    |
| Ç    |
| È    |
| É    |
+------+
10 rows in set (0.00 sec)

mysql> select concat('@',floor(c.c / 80), char(convert(c.c % 80 + 0x30 using ucs2))) as v from c join u on c.id = ord(convert(u.c using ucs2)) - 0x00c0 +1  ;
+------+
| v    |
+------+
| @0G  |
| @0H  |
| @0I  |
| @0J  |
| @0K  |
| @0L  |
| @0M  |
| @0N  |
| @0O  |
| @0P  |
+------+
10 rows in set (0.00 sec)

mysql> truncate u;
Query OK, 0 rows affected (0.08 sec)

mysql> insert into u values ('Ā'),('ā'),(convert(0x00d7 using ucs2));
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from u;
+------+
| c    |
+------+
| Ā    |
| ā    |
| ×    |
+------+
3 rows in set (0.00 sec)

mysql> select concat(
        '@',
        if(
            concat(
                floor(c.c / 80),
                char(convert(c.c % 80 + 0x30 using ucs2))
            ) <> '00'
        ,
            concat(
                floor(c.c / 80),
                char(convert(c.c % 80 + 0x30 using ucs2))
            )
               ,
                       lower( hex(convert(u.c using ucs2)))
               )
       ) as v  
               from c join u
                       on
               c.id = ord(convert(u.c using ucs2)) - 0x00c0 +1
       ;
+-------+
| v     |
+-------+
| @1S   |
| @1s   |
| @00d7 |
+-------+
3 rows in set (0.00 sec)

And, finally, in case you want that as a stored function...

delimiter //
DROP FUNCTION IF EXISTS my_wc_mb_filename//
CREATE FUNCTION my_wc_mb_filename(wc char(1) character set ucs2) RETURNS CHAR(5)
BEGIN
    DECLARE output CHAR(5);
    SELECT concat(
                '@',
                if(
                        concat( 
                                floor(c.c / 80),
                                char(convert(c.c % 80 + 0x30 using ucs2))
                        ) <> '00'
                ,
                        concat( 
                                floor(c.c / 80),
                                char(convert(c.c % 80 + 0x30 using ucs2))
                        )
               ,
                       lower( hex(wc) )
               )
    ) as v  INTO output
    from c
    where 
    id = ord(wc) - 0x00c0 +1
    ;
    RETURN output;
END
//
delimiter ;
mysql> select my_wc_mb_filename('Ý');
+-------------------------+
| my_wc_mb_filename('Ý')  |
+-------------------------+
| @1P                     |
+-------------------------+
1 row in set (0.00 sec)

mysql> select my_wc_mb_filename('÷');
+-------------------------+
| my_wc_mb_filename('÷')  |
+-------------------------+
| @00f7                   |
+-------------------------+
1 row in set (0.00 sec)

It's worth mentioning that Drizzle does all of this quite differently. (They seem to take some pride in having deliberately torn strconvert out of the source tree entirely.) Instead of the complicated algorithm used by MySQL, Drizzle uses a succinct function that deals in UTF-8. You can find it in drizzled/util/tablename_to_filename.cc. Here's the part of it that performs the conversion of non-ASCII characters:

    to.push_back('@');
    to.push_back(hexchars[(it >> 4) & 15]);
    to.push_back(hexchars[it & 15]);

Here are the effects:

drizzle> select hex('ữỗỿḵḝ');
+--------------------------------+
| hex('ữỗỿḵḝ')                   |
+--------------------------------+
| E1BBAFE1BB97E1BBBFE1B8B5E1B89D |
+--------------------------------+
1 row in set (0.000418 sec)

drizzle> create table ữỗỿḵḝ (id int);
Query OK, 0 rows affected (0.010721 sec)

...

$ ls /var/lib/drizzle/local/test
db.opt  @e1@bb@af@e1@bb@97@e1@bb@bf@e1@b8@b5@e1@b8@9d.dfe

I'm not trying to make a case that re-implementing this kind of thing in SQL has any practical purpose, but it was a fun way to poke around in the MySQL source code, learn more about how this algorithm works, and to keep my character set and stored functions chops honed up a bit.

Initial impressions of InnoDB Fulltext

The much-anticipated, long-awaited, and possibly irrelevant (Sphinx? Lucene?) InnoDB Fulltext finally makes a semi-official appearance in MySQL 5.6.4! MySQL 5.6.4, release December 20, 2011, is the newest "development milestone release" ("DMR") of MySQL 5.6.

It's important to bear in mind that this is the first release of InnoDB Fulltext. Hopefully, that means that considerable improvements will be forthcoming; for now, though, we are left to try to make sense of what is available at present.

Creating an InnoDB table that uses a fulltext index is dead easy:

mysql> create table ibft1 (id int unsigned not null auto_increment primary key, c1 varchar(255), fulltext(c1)) engine=innodb;
Query OK, 0 rows affected (0.08 sec)

mysql> insert into ibft1 (c1) values ('brown cow'),('brown dog'),('white cow'),('white dog'),('yellow horse'),('green iguana');
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from ibft1 where match (c1) against ('brown');
+----+-----------+
| id | c1        |
+----+-----------+
|  1 | brown cow |
|  2 | brown dog |
+----+-----------+
2 rows in set (0.00 sec)

mysql> select * from ibft1 where match (c1) against ('cow');
+----+-----------+
| id | c1        |
+----+-----------+
|  1 | brown cow |
|  3 | white cow |
+----+-----------+
2 rows in set (0.01 sec)

Great! I'm happy to see it works, through I was a bit surprised to see that the default minimum word length was less than the value of ft_min_word_len. Those legacy ft_% variables continue to apply only to MyISAM fulltext indexes, evidently. There's a new bundle of system variables to control InnoDB Fulltext:

mysql> show variables like '%ft%';
+---------------------------------+----------------+
| Variable_name                   | Value          |
+---------------------------------+----------------+
| ft_boolean_syntax               | + -><()~*:""&| |
| ft_max_word_len                 | 84             |
| ft_min_word_len                 | 4              |
| ft_query_expansion_limit        | 20             |
| ft_stopword_file                | (built-in)     |
| innodb_ft_aux_table             |                |
| innodb_ft_cache_size            | 32000000       |
| innodb_ft_enable_diag_print     | ON             |
| innodb_ft_enable_stopword       | ON             |
| innodb_ft_max_token_size        | 84             |
| innodb_ft_min_token_size        | 3              |
| innodb_ft_num_word_optimize     | 2000           |
| innodb_ft_server_stopword_table |                |
| innodb_ft_sort_pll_degree       | 2              |
| innodb_ft_user_stopword_table   |                |
+---------------------------------+----------------+
15 rows in set (0.01 sec)

You'll also notice that there are a number of new tables in information_schema related to InnoDB Fulltext:

mysql> show tables from information_schema like 'INNODB_FT%';
+-------------------------------------------+
| Tables_in_information_schema (INNODB_FT%) |
+-------------------------------------------+
| INNODB_FT_CONFIG                          |
| INNODB_FT_DELETED                         |
| INNODB_FT_DEFAULT_STOPWORD                |
| INNODB_FT_INDEX_CACHE                     |
| INNODB_FT_BEING_DELETED                   |
| INNODB_FT_INDEX_TABLE                     |
| INNODB_FT_INSERTED                        |
+-------------------------------------------+
7 rows in set (0.00 sec)

Curiously, most of those tables are completely empty!

mysql> select * from INNODB_FT_CONFIG;
Empty set (0.00 sec)

mysql> select * from INNODB_FT_DELETED;
Empty set (0.00 sec)

mysql> select * from INNODB_FT_DELETED;
Empty set (0.00 sec)

mysql> select * from INNODB_FT_INDEX_CACHE;
Empty set (0.00 sec)

mysql> select * from INNODB_FT_BEING_DELETED;
Empty set (0.00 sec)

mysql> select * from INNODB_FT_INDEX_TABLE;
Empty set (0.00 sec)

mysql> select * from INNODB_FT_INSERTED;
Empty set (0.00 sec)

Well, that's not so curious if you've read the manual, which explains that you must set the value of innodb_ft_aux_table to the database & table that you wish to examine. That seems a bit of a round-about way to get at information in information_schema, but alright:

mysql> set innodb_ft_aux_table='test/ibft1';
ERROR 1229 (HY000): Variable 'innodb_ft_aux_table' is a GLOBAL variable and should be set with SET GLOBAL

Ah, yes. This takes me back to the good old 5.0 days when so much of the new functionality of 5.0 relied on the SUPER privilege, which carries with it all sorts of things that don't really logically go along with the analysis of InnoDB Fulltext indexes.

Well, alright, let's set the thing and take a look at some of these tables.

mysql> set global innodb_ft_aux_table='test/ibft1';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from information_schema.INNODB_FT_CONFIG;
+---------------------------+-------+
| KEY                       | VALUE |
+---------------------------+-------+
| optimize_checkpoint_limit | 180   |
| synced_doc_id             | 7     |
| last_optimized_word       |       |
| deleted_doc_count         | 0     |
| total_word_count          | 8     |
| optimize_start_time       |       |
| optimize_end_time         |       |
| stopword_table_name       |       |
| use_stopword              | 1     |
| table_state               | 0     |
+---------------------------+-------+
10 rows in set (0.00 sec)

mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| brown  |            1 |           2 |         2 |      1 |        0 |
| brown  |            1 |           2 |         2 |      2 |        0 |
| cow    |            1 |           3 |         2 |      1 |        6 |
| cow    |            1 |           3 |         2 |      3 |        6 |
| dog    |            2 |           4 |         2 |      2 |        6 |
| dog    |            2 |           4 |         2 |      4 |        6 |
| green  |            6 |           6 |         1 |      6 |        0 |
| horse  |            5 |           5 |         1 |      5 |        7 |
| iguana |            6 |           6 |         1 |      6 |        6 |
| white  |            3 |           4 |         2 |      3 |        0 |
| white  |            3 |           4 |         2 |      4 |        0 |
| yellow |            5 |           5 |         1 |      5 |        0 |
+--------+--------------+-------------+-----------+--------+----------+
12 rows in set (0.01 sec)

Well, there's some good stuff in there, no doubt. Before I got a sense of what these various things mean, I started wondering about how privileges affect what of this information can be viewed by various users. I created a new database (newdb) and a new table (ibft2):

mysql> create database newdb;
Query OK, 1 row affected (0.00 sec)

mysql> use newdb
Database changed
mysql> create table ibft2 like test.ibft1;
Query OK, 0 rows affected (0.03 sec)

mysql> insert into ibft2 select * from test.ibft1;
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> show create table ibft2\G
*************************** 1. row ***************************
       Table: ibft2
Create Table: CREATE TABLE `ibft2` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c1` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `c1` (`c1`)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=latin1
1 row in set (0.01 sec)

mysql> select * from ibft2;
+----+--------------+
| id | c1           |
+----+--------------+
|  1 | brown cow    |
|  2 | brown dog    |
|  3 | white cow    |
|  4 | white dog    |
|  5 | yellow horse |
|  6 | green iguana |
+----+--------------+
6 rows in set (0.00 sec)

mysql> set global innodb_ft_aux_table='newdb/ibft2';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from information_schema.INNODB_FT_CONFIG;
+---------------------------+-------+
| KEY                       | VALUE |
+---------------------------+-------+
| optimize_checkpoint_limit | 180   |
| synced_doc_id             | 1     |
| last_optimized_word       |       |
| deleted_doc_count         | 0     |
| total_word_count          |       |
| optimize_start_time       |       |
| optimize_end_time         |       |
| stopword_table_name       |       |
| use_stopword              | 1     |
| table_state               | 0     |
+---------------------------+-------+
10 rows in set (0.00 sec)

Whoa, hold on. Why is the contents of information_schema.INNODB_FT_CONFIG different than when I had innodb_ft_aux_table='test/ibft1'? Let's confirm that...

mysql> set global innodb_ft_aux_table='test/ibft1';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from information_schema.INNODB_FT_CONFIG;
+---------------------------+-------+
| KEY                       | VALUE |
+---------------------------+-------+
| optimize_checkpoint_limit | 180   |
| synced_doc_id             | 7     |
| last_optimized_word       |       |
| deleted_doc_count         | 0     |
| total_word_count          | 8     |
| optimize_start_time       |       |
| optimize_end_time         |       |
| stopword_table_name       |       |
| use_stopword              | 1     |
| table_state               | 0     |
+---------------------------+-------+
10 rows in set (0.00 sec)

Sure enough, look at total_word_count ... for newdb/ibft2 it was undefined, but for test/ibft1 it is 8. I wonder what INNODB_FT_INDEX_TABLE shows for the new table.

mysql> set global innodb_ft_aux_table='newdb/ibft2';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
Empty set (0.00 sec)

mysql> select * from newdb.ibft2 where match(c1) against ('dog');
+----+-----------+
| id | c1        |
+----+-----------+
|  2 | brown dog |
|  4 | white dog |
+----+-----------+
2 rows in set (0.00 sec)

So... INNODB_FT_INDEX_TABLE is empty(!) for the new table, but I can still make queries against this table? Hmm...

mysql> select * from information_schema.INNODB_FT_INDEX_CACHE;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| brown  |            1 |           2 |         2 |      1 |        0 |
| brown  |            1 |           2 |         2 |      2 |        0 |
| cow    |            1 |           3 |         2 |      1 |        6 |
| cow    |            1 |           3 |         2 |      3 |        6 |
| dog    |            2 |           4 |         2 |      2 |        6 |
| dog    |            2 |           4 |         2 |      4 |        6 |
| green  |            6 |           6 |         1 |      6 |        0 |
| horse  |            5 |           5 |         1 |      5 |        7 |
| iguana |            6 |           6 |         1 |      6 |        6 |
| white  |            3 |           4 |         2 |      3 |        0 |
| white  |            3 |           4 |         2 |      4 |        0 |
| yellow |            5 |           5 |         1 |      5 |        0 |
+--------+--------------+-------------+-----------+--------+----------+
12 rows in set (0.00 sec)

Aha! There's the info I'm looking for; it's in the _CACHE table because ... well, just because it is, I guess. I don't know why it's there now and wasn't there after I created my first table. It turns out that you need to OPTIMIZE TABLE in order to rebuild the index and I guess merge the "new" index data with the "main search index".

mysql> optimize table ibft2;
+-------------+----------+----------+-------------------------------------------------------------------+
| Table       | Op       | Msg_type | Msg_text                                                          |
+-------------+----------+----------+-------------------------------------------------------------------+
| newdb.ibft2 | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| newdb.ibft2 | optimize | status   | OK                                                                |
+-------------+----------+----------+-------------------------------------------------------------------+
2 rows in set (0.07 sec)

Oh, first you need to set a different global variable ... innodb_optimize_fulltext_only.

mysql> set global innodb_optimize_fulltext_only=1;
Query OK, 0 rows affected (0.01 sec)

mysql> optimize table ibft2;
+-------------+----------+----------+----------+
| Table       | Op       | Msg_type | Msg_text |
+-------------+----------+----------+----------+
| newdb.ibft2 | optimize | status   | OK       |
+-------------+----------+----------+----------+
1 row in set (0.05 sec)

mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| brown  |            1 |           2 |         2 |      1 |        0 |
| brown  |            1 |           2 |         2 |      2 |        0 |
| cow    |            1 |           3 |         2 |      1 |        6 |
| cow    |            1 |           3 |         2 |      3 |        6 |
| dog    |            2 |           4 |         2 |      2 |        6 |
| dog    |            2 |           4 |         2 |      4 |        6 |
| green  |            6 |           6 |         1 |      6 |        0 |
| horse  |            5 |           5 |         1 |      5 |        7 |
| iguana |            6 |           6 |         1 |      6 |        6 |
| white  |            3 |           4 |         2 |      3 |        0 |
| white  |            3 |           4 |         2 |      4 |        0 |
| yellow |            5 |           5 |         1 |      5 |        0 |
+--------+--------------+-------------+-----------+--------+----------+
12 rows in set (0.00 sec)

mysql> select * from information_schema.INNODB_FT_CONFIG;
+---------------------------+-------+
| KEY                       | VALUE |
+---------------------------+-------+
| optimize_checkpoint_limit | 180   |
| synced_doc_id             | 7     |
| last_optimized_word       |       |
| deleted_doc_count         | 0     |
| total_word_count          | 8     |
| optimize_start_time       |       |
| optimize_end_time         |       |
| stopword_table_name       |       |
| use_stopword              | 1     |
| table_state               | 0     |
+---------------------------+-------+
10 rows in set (0.00 sec)

Alright, we're finally back to where we started! The thing I had planned to look at is how privileges interact with this information_schema metadata strategy. Basically, you need SUPER to be able to look at the data, which seems like an obvious potential problem in terms of unprivileged users being able to access information about InnoDB Fulltext indexes on their own tables. The more interesting thing, though, is that unprivileged users can learn quite a bit abount the contents of whatever table innodb_ft_aux_table happens to be set to.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| test               |
+--------------------+
2 rows in set (0.00 sec)

mysql> use newdb;
ERROR 1044 (42000): Access denied for user 'noprivs'@'localhost' to database 'newdb'
mysql> select * from newdb.ibft2;
ERROR 1142 (42000): SELECT command denied to user 'noprivs'@'localhost' for table 'ibft2'
mysql> select table_schema,table_name from information_schema.tables where table_name='ibft2';
Empty set (0.01 sec)
mysql> select @@innodb_ft_aux_table;
+-----------------------+
| @@innodb_ft_aux_table |
+-----------------------+
| newdb/ibft2           |
+-----------------------+
1 row in set (0.00 sec)

mysql> select * from information_schema.innodb_ft_config;
+---------------------------+-------+
| KEY                       | VALUE |
+---------------------------+-------+
| optimize_checkpoint_limit | 180   |
| synced_doc_id             | 7     |
| last_optimized_word       |       |
| deleted_doc_count         | 0     |
| total_word_count          | 8     |
| optimize_start_time       |       |
| optimize_end_time         |       |
| stopword_table_name       |       |
| use_stopword              | 1     |
| table_state               | 0     |
+---------------------------+-------+
10 rows in set (0.00 sec)

mysql> select * from information_schema.innodb_ft_index_table;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| brown  |            1 |           2 |         2 |      1 |        0 |
| brown  |            1 |           2 |         2 |      2 |        0 |
| cow    |            1 |           3 |         2 |      1 |        6 |
| cow    |            1 |           3 |         2 |      3 |        6 |
| dog    |            2 |           4 |         2 |      2 |        6 |
| dog    |            2 |           4 |         2 |      4 |        6 |
| green  |            6 |           6 |         1 |      6 |        0 |
| horse  |            5 |           5 |         1 |      5 |        7 |
| iguana |            6 |           6 |         1 |      6 |        6 |
| white  |            3 |           4 |         2 |      3 |        0 |
| white  |            3 |           4 |         2 |      4 |        0 |
| yellow |            5 |           5 |         1 |      5 |        0 |
+--------+--------------+-------------+-----------+--------+----------+
12 rows in set (0.00 sec)

My user can't even see whether a particular table or database exists, but he can get all kinds of information about the contents of the table by looking at information_schema.INNODB_FT_INDEX_TABLE/CACHE when innodb_ft_aux_table is set. Not too big a problem, though, right, because any user who sets innodb_ft_aux_table can just unset it after they've gathered whatever metadata/statistics they need. Not so fast!

mysql> set global innodb_ft_aux_table='';
ERROR 1231 (42000): Variable 'innodb_ft_aux_table' can't be set to the value of ''

Yikes.

I am trying to be careful to remind myself that this is the first appearance of InnoDB Fulltext in a public release, but I hope these quirks are just side-effects of immaturity and not design decisions that are intended to be kept around for very long!

optimizer_switch in MariaDB

I've been spending some time looking through the new features of MariaDB 5.3, and one of the big items is an improved and configurable optimizer. Much of the functionality of the new optimizer is controlled using the optimizer switch variable [http://kb.askmonty.org/en/mariadb-53-optimizer_switch].

There's been some discussion (see [http://s.petrunia.net/blog/?p=61] and [https://lists.launchpad.net/maria-developers/msg04142.html]) about splitting optimizer switch into multiple separate variables, but for now we seem to be stuck with this:

mysql 5.3.3-MariaDB-rc (root) [test]> select @@optimizer_switch;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| @@optimizer_switch                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql 5.3.3-MariaDB-rc (root) [test]> select length(@@optimizer_switch);
+----------------------------+
| length(@@optimizer_switch) |
+----------------------------+
|                        547 |
+----------------------------+
1 row in set (0.00 sec)

547 characters! Ouch. What do you do if you want to get the value of a single one of these switches? I wrote a function to help:

mysql 5.3.3-MariaDB-rc (root) [test]> create function get_optimizer_switch(switch_name char(32)) returns char(3) return
     substring_index(substring_index(substring(@@optimizer_switch from locate(switch_name,@@optimizer_switch)),',','1'),'=','-1');
Query OK, 0 rows affected (0.00 sec)

mysql 5.3.3-MariaDB-rc (root) [test]> select get_optimizer_switch('derived_with_keys');
+-------------------------------------------+
| get_optimizer_switch('derived_with_keys') |
+-------------------------------------------+
| on                                        |
+-------------------------------------------+
1 row in set (0.01 sec)

That was still a little bit annoying, so I wrote a stored procedure, too.

DROP TEMPORARY TABLE IF EXISTS optimizer_switch_tmp;
DROP PROCEDURE IF EXISTS get_optimizer_switches//
CREATE PROCEDURE get_optimizer_switches () BEGIN
        DECLARE i INT DEFAULT 1;
        DECLARE n INT DEFAULT 1;
        DECLARE c VARCHAR(64);

        CREATE TEMPORARY TABLE optimizer_switch_tmp (switch_name char(64), val char(3));

        WHILE i < LENGTH(@@optimizer_switch) AND n <> 0 DO
                SET n = LOCATE(',', SUBSTRING(@@optimizer_switch FROM i));
                IF n = 0 THEN SET n = LENGTH(@@optimizer_switch)-i+2; END IF;
                INSERT INTO optimizer_switch_tmp VALUES (
            SUBSTRING_INDEX(SUBSTRING(@@optimizer_switch FROM i FOR n-1),'=',1), 
            SUBSTRING_INDEX(SUBSTRING(@@optimizer_switch FROM i FOR n-1),'=',-1)
        );
                SET i = i + n;
        END WHILE;

        SELECT * FROM optimizer_switch_tmp;
        DROP TEMPORARY TABLE optimizer_switch_tmp;
END//
mysql 5.3.3-MariaDB-rc (root) [test]> call get_optimizer_switches;
+-------------------------------+------+
| switch_name                   | val  |
+-------------------------------+------+
| index_merge                   | on   |
| index_merge_union             | on   |
| index_merge_sort_union        | on   |
| index_merge_intersection      | on   |
| index_merge_sort_intersection | off  |
| index_condition_pushdown      | on   |
| derived_merge                 | on   |
| derived_with_keys             | on   |
| firstmatch                    | on   |
| loosescan                     | on   |
| materialization               | on   |
| in_to_exists                  | on   |
| semijoin                      | on   |
| partial_match_rowid_merge     | on   |
| partial_match_table_scan      | on   |
| subquery_cache                | on   |
| mrr                           | off  |
| mrr_cost_based                | off  |
| mrr_sort_keys                 | off  |
| outer_join_with_cache         | on   |
| semijoin_with_cache           | on   |
| join_cache_incremental        | on   |
| join_cache_hashed             | on   |
| join_cache_bka                | on   |
| optimize_join_buffer_size     | off  |
| table_elimination             | on   |
+-------------------------------+------+
26 rows in set (0.01 sec)

Query OK, 0 rows affected (0.01 sec)

mysql 5.3.3-MariaDB-rc (root) [test]> 

If you know or can think of a better way, let me know in the comments.

Learning to love the InnoDB Lock Monitor

A customer opened a support issue to ask about some help determining why they were seeing a lot of Lock Wait Timeouts. I asked them to enable the InnoDB Lock Monitor so that I could get a look at what was going on in their transactions and whether there might be some locks held longer than necessary.

The customer sent in a 184MB MySQL error log with 4773836 lines. I started looking through it, but I could tell I was going to need a better way to get a better overview of the file than what I'd be able to piece together trying to poke through it and look for individual lines. I started piping the file through a variety of UNIX tools to narrow down what I was seeing.

I ended up with this mess:

 < mysqld.err grep ACTIVE | cut -d' ' -f 2,4 | sort -rn -k 2 | perl -F, -ane 'print "$F[0] $F[1]" if not $v{$F[0]}; $v{$F[0]} = $F[1];' | head

It's hideous, but it's pretty helpful. Here's the output:

1EC080F1  24284
1EC84325  18196
1ECAC476  13430
1ED57B19  4880
1ECFF656  4198
1ED54BF5  1229
1ED59197  193
1ED58A84  164
1ED58A80  164
1ED50FDA  134

It's just a list of the longest-running transactions, showing the transaction ID and the time it's been active.

I decided to re-write it in perl, since I was already using perl in my pipeline and wanted the opportunity to get some other output in a slightly more reasonable way. Here's the perl version:

#!/usr/bin/perl -n
if ( /ACTIVE (\d+) sec/ ) { 
    @r = split(' '); 
    $t{substr($r[1],0,8)} = $r[3] if not defined $t{$r[1]} and $t{$r[1]}<$r[3]; 

}  

END { 
    map { 
        print "$_ $t{$_}\n" 
    } (
        sort { 
            $t{$b} <=> $t{$a} 
        } keys %t
    )[0..9]  
}

$ perl toptrans.pl < mysqld.err
1EC080F1 24284
1EC84325 18196
1ECAC476 13430
1ED57B19 4880
1ECFF656 4198
1ED54BF5 1229
1ED59197 193
1ED58A80 164
1ED58A84 164
1ED50FDA 134

That gives exactly the same output as the first version, but I figure it might be easier to change the criteria I'm looking for or the output contents/format.

But really parsing through InnoDB status or monitor information is not very much fun. I got lucky that all the information I was looking for was on a single line. If I want to get anything more, though, such as the number of locks being held, any information about the locks, the MySQL thread ID, et cetera, it'd be really nasty. I thought that there must be something better out there, and a bit of Google quickly reminded me that Baron Schwartz [http://www.xaprb.com ] wrote a really excellent InnoDB status parser for use in innotop [http://code.google.com/p/innotop/ ].

The innotop script includes the InnoDBParser module inside of the file. You can just do a require to get access to the goods, then instantiate a new InnoDBParser . I ended up using Data::Dumper to make sense of the structure of the data:

#!/usr/bin/perl
use Data::Dumper;
require "innotop";
my $innodb_parser = new InnoDBParser;


my $contents;
{
   local $INPUT_RECORD_SEPARATOR = undef;
   $contents = <STDIN>;
}

my $innodb_status = $innodb_parser->parse_status_text(
   $contents,
   0,
   # Omit the following parameter to get all sections.
#   {  tx => 1 },
);

$Data::Dumper::Indent = 2;
print Dumper $innodb_status;

$ START=191; END=319; tail -n +$START < mysqld.err | head -n $(( $END - $START + 1 )) | perl txinfo.pl

$VAR1 = {
          got_all => 1,
          last_secs => '20',
          sections => {
                        bp => {
                                add_pool_alloc => '0',
                                awe_mem_alloc => 0,
                                buf_free => '7822',
                                buf_pool_hit_rate => undef,
                                buf_pool_hits => undef,
                                buf_pool_reads => undef,
                                buf_pool_size => '8192',
                                complete => 1,
                                dict_mem_alloc => '41875',
                                page_creates_sec => '0.00',
                                page_reads_sec => '0.00',
                                page_writes_sec => '0.00',
                                pages_created => '1',
                                pages_modified => '0',
                                pages_read => '368',
                                pages_total => '369',
                                pages_written => '32',
                                reads_pending => '0',
                                total_mem_alloc => '137363456',
                                writes_pending => '0',
                                writes_pending_flush_list => 0,
                                writes_pending_lru => 0,
                                writes_pending_single_page => 0
                              },
                        bt => {
                                complete => 1,
                                fulltext => 'srv_master_thread loops: 4 1_second, 4 sleeps, 0 10_second, 6 background, 6 flush
srv_master_thread log flush and writes: 4'
                              },
                        ib => {
                                bufs_in_node_heap => undef,
                                complete => 1,
                                free_list_len => undef,
                                hash_searches_s => '0.00',
                                hash_table_size => undef,
                                inserts => undef,
                                merged_recs => undef,
                                merges => undef,
                                non_hash_searches_s => '0.00',
                                seg_size => undef,
                                size => undef,
                                used_cells => undef
                              },
                        io => {
                                avg_bytes_s => '0',
                                complete => 1,
                                flush_type => 'fsync',
                                fsyncs_s => '0.00',
                                os_file_reads => '379',
                                os_file_writes => '30',
                                os_fsyncs => '11',
                                pending_aio_writes => undef,
                                pending_buffer_pool_flushes => '0',
                                pending_ibuf_aio_reads => '0',
                                pending_log_flushes => '0',
                                pending_log_ios => '0',
                                pending_normal_aio_reads => undef,
                                pending_preads => 0,
                                pending_pwrites => 0,
                                pending_sync_ios => '0',
                                reads_s => '0.00',
                                threads => {
                                             '0' => {
                                                      event_set => 0,
                                                      purpose => 'insert buffer thread',
                                                      state => 'waiting for i/o request',
                                                      thread => '0'
                                                    },
                                             '1' => {
                                                      event_set => 0,
                                                      purpose => 'log thread',
                                                      state => 'waiting for i/o request',
                                                      thread => '1'
                                                    },
                                             '2' => {
                                                      event_set => 0,
                                                      purpose => 'read thread',
                                                      state => 'waiting for i/o request',
                                                      thread => '2'
                                                    },
                                             '3' => {
                                                      event_set => 0,
                                                      purpose => 'read thread',
                                                      state => 'waiting for i/o request',
                                                      thread => '3'
                                                    },
                                             '4' => {
                                                      event_set => 0,
                                                      purpose => 'read thread',
                                                      state => 'waiting for i/o request',
                                                      thread => '4'
                                                    },
                                             '5' => {
                                                      event_set => 0,
                                                      purpose => 'read thread',
                                                      state => 'waiting for i/o request',
                                                      thread => '5'
                                                    },
                                             '6' => {
                                                      event_set => 0,
                                                      purpose => 'write thread',
                                                      state => 'waiting for i/o request',
                                                      thread => '6'
                                                    },
                                             '7' => {
                                                      event_set => 0,
                                                      purpose => 'write thread',
                                                      state => 'waiting for i/o request',
                                                      thread => '7'
                                                    },
                                             '8' => {
                                                      event_set => 0,
                                                      purpose => 'write thread',
                                                      state => 'waiting for i/o request',
                                                      thread => '8'
                                                    },
                                             '9' => {
                                                      event_set => 0,
                                                      purpose => 'write thread',
                                                      state => 'waiting for i/o request',
                                                      thread => '9'
                                                    }
                                           },
                                writes_s => '0.00'
                              },
                        lg => {
                                complete => 1,
                                last_chkp => '139205502',
                                log_flushed_to => '139205502',
                                log_ios_done => '12',
                                log_ios_s => '0.00',
                                log_seq_no => '139205502',
                                pending_chkp_writes => '0',
                                pending_log_writes => '0'
                              },
                        ro => {
                                complete => 0,
                                del_sec => '0.00',
                                ins_sec => '0.00',
                                main_thread_id => '4496171008',
                                main_thread_proc_no => 0,
                                main_thread_state => 'waiting for server activity',
                                n_reserved_extents => 0,
                                num_rows_del => '0',
                                num_rows_ins => '0',
                                num_rows_read => '12',
                                num_rows_upd => '0',
                                queries_in_queue => '0',
                                queries_inside => '0',
                                read_sec => '0.30',
                                read_views_open => '1',
                                upd_sec => '0.00'
                              },
                        sm => {
                                complete => 1,
                                mutex_os_waits => '0',
                                mutex_spin_rounds => '0',
                                mutex_spin_waits => '0',
                                reservation_count => '4',
                                rw_excl_os_waits => undef,
                                rw_excl_spins => undef,
                                rw_shared_os_waits => undef,
                                rw_shared_spins => undef,
                                signal_count => '4',
                                wait_array_size => 0,
                                waits => []
                              },
                        tx => {
                                complete => 1,
                                history_list_len => '105',
                                is_truncated => 0,
                                num_lock_structs => undef,
                                purge_done_for => '163A',
                                purge_undo_for => '0',
                                transactions => [
                                                  {
                                                    active_secs => 0,
                                                    has_read_view => 0,
                                                    heap_size => 0,
                                                    hostname => 'localhost',
                                                    ip => '',
                                                    lock_structs => 0,
                                                    lock_wait_status => '',
                                                    lock_wait_time => 0,
                                                    mysql_thread_id => '22',
                                                    os_thread_id => '4529143808',
                                                    proc_no => 0,
                                                    query_id => '339',
                                                    query_status => '',
                                                    query_text => '',
                                                    row_locks => 0,
                                                    tables_in_use => 0,
                                                    tables_locked => 0,
                                                    thread_decl_inside => 0,
                                                    thread_status => '',
                                                    txn_doesnt_see_ge => '',
                                                    txn_id => '0',
                                                    txn_sees_lt => '',
                                                    txn_status => 'not started',
                                                    undo_log_entries => 0,
                                                    user => 'root'
                                                  },
                                                  {
                                                    active_secs => '13',
                                                    has_read_view => 0,
                                                    heap_size => '376',
                                                    hostname => 'localhost',
                                                    ip => '',
                                                    lock_structs => '2',
                                                    lock_wait_status => '',
                                                    lock_wait_time => 0,
                                                    locks => [
                                                               {
                                                                 db => 'test',
                                                                 index => '',
                                                                 insert_intention => 0,
                                                                 lock_mode => 'IX',
                                                                 lock_type => 'TABLE',
                                                                 n_bits => 0,
                                                                 page_no => 0,
                                                                 space_id => 0,
                                                                 special => '',
                                                                 table => 't1',
                                                                 txn_id => '1802',
                                                                 waiting => 0
                                                               },
                                                               {
                                                                 db => 'test',
                                                                 index => 'PRIMARY',
                                                                 insert_intention => 0,
                                                                 lock_mode => 'X',
                                                                 lock_type => 'RECORD',
                                                                 n_bits => '80',
                                                                 page_no => '386',
                                                                 space_id => 0,
                                                                 special => '',
                                                                 table => 't1',
                                                                 txn_id => '1802',
                                                                 waiting => 0
                                                               }
                                                             ],
                                                    mysql_thread_id => '21',
                                                    os_thread_id => '4528869376',
                                                    proc_no => 0,
                                                    query_id => '333',
                                                    query_status => '',
                                                    query_text => '',
                                                    row_locks => '7',
                                                    tables_in_use => 0,
                                                    tables_locked => 0,
                                                    thread_decl_inside => 0,
                                                    thread_status => '',
                                                    txn_doesnt_see_ge => '',
                                                    txn_id => '1802',
                                                    txn_sees_lt => '',
                                                    txn_status => 'ACTIVE',
                                                    undo_log_entries => 0,
                                                    user => 'root'
                                                  }
                                                ],
                                trx_id_counter => '1803'
                              }
                      },
          timestring => '2011-12-22 18:48:26',
          ts => [
                  2011,
                  12,
                  22,
                  18,
                  48,
                  26
                ]
        };

The output of that is 278 lines just for a single transaction holding a couple row locks. After digging around the output for a while, I came up with this script that provides a pretty flexible way to print out various "fields" related to transactions that satisfy particular criteria (note that here I'm only getting the output of the "tx" section, which would give 115 lines of output in the Data::Dumper example above):

#!/usr/bin/perl
use Data::Dumper;
require "innotop";
$Data::Dumper::Indent = 2;
my $innodb_parser = new InnoDBParser;

my @fields = (
        "txn_id",
        "active_secs",
        "lock_structs",
        "row_locks",
        "hostname"
);

my $contents;
{  
   local $INPUT_RECORD_SEPARATOR = undef;
   $contents = <STDIN>;
}

my $innodb_status = $innodb_parser->parse_status_text(
   $contents,
   0,
   # Omit the following parameter to get all sections.
   {  tx => 1 },
);

#print Dumper $innodb_status; exit;

for my $tx (@{$innodb_status->{sections}->{tx}->{transactions}}) {
        if (
                $tx->{active_secs} > 0
        ) {
                print join( "\t",
                        map { $tx->{$_} } @fields
                ), "\n";
        }
}

I ran that against another status snippet that includes a couple different transactions each with some locks. Here's the output (along with what might be a helpful way to extract just a single status output from among a file filled with them):

$ grep -hn 'INNODB MONITOR' mysqld.err | tail -n 2
1703104:111222 19:16:12 INNODB MONITOR OUTPUT
1741522:END OF INNODB MONITOR OUTPUT

$ START=1703104; END=1741522; tail -n +$START < mysqld.err | head -n $(( $END - $START + 1 )) | perl txinfo.pl
1808    380     5       2003    localhost
1806    864     11      6206    localhost

Hey, that's pretty neat, we've actually got some nice info, somewhat programmatically and reliably retrieved from the InnoDB Lock Monitor. If you want to get additional fields from the parsed data structure, you can just add the names of the fields to the @fields array at the top of the program. I'll probably get carried away and add those as command-line options to this thing at some point. I'm looking forward to being able to use this the next time I've got almost 5 million lines of InnoDB status output to make sense of!

Taming the EC2 API

I've been spending some time lately familiarizing myself with EC2, setting up some MySQL servers & clusters here and there, and doing some really basic configuration testing. One situation you'll run into when interacting with EC2 is that it gets unwieldy to use the AWS Management Console web interface for interacting with your instances. There ends up being lots of scrolling, lots of staring, and lots of sighs. Since I'm using SSH to connect to and interact with my instances, I want a reasonable way to find information about them on the Unix command line.

Amazon has an official set of tools [http://aws.amazon.com/developertools/351] that give you this information , at least theoretically. It is some gigantic distribution of shell scripts and Java madness that, if you are very patient, will eventually give you some information about your instances, in a format that is very difficult to work with.

$ time ./bin/ec2-describe-instances i-83c5d4e0
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
RESERVATION     r-7db4731c      801025846226
INSTANCE        i-83c5d4e0      ami-31814f58                    stopped skysql-ec2      0               m1.small        2011-12-09T20:41:39+0000        us-east-1c    aki-805ea7e9                    monitoring-disabled             10.0.0.164      vpc-cd4fafa5    subnet-c44fafac ebs                                  paravirtual      xen             sg-134b547f     default
BLOCKDEVICE     /dev/sda1       vol-19ec6174    2011-12-10T01:30:32.000Z
TAG     instance        i-83c5d4e0      Name    ndb32-02

real    0m7.693s
user    0m10.119s
sys     0m0.451s

OK, it takes me about 7.5 seconds to get data about an instance, and it's given to me in 4 lines. If I get information about all of my data, I have no idea how I would be able to successfully grep through that to interact with any of it programatically. I went looking for a different solution, preferably one that would be faster, more flexible, and easier to use.

I found a great script called, simply, aws, written by Timothy Kay [http://timkay.com/aws/].

$ du -hsc ec2-api-tools*
 14M    ec2-api-tools-1.5.0.1-2011.11.30
 11M    ec2-api-tools.zip
 26M    total

$ ls -sk aws 
 76 aws

Ahem. I'll take a 76K perl script over a 14M mess any day. Let's see how it performs.

$ time aws din i-83c5d4e0
+------------+--------------+----------------------+------------------------------------------+------------+--------------+--------------------------+---------------------------------------------+--------------+----------------+-----------------+--------------+------------------+-----------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+--------------+----------------+----------------+------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------+----------+
| instanceId |   imageId    |    instanceState     |                  reason                  |  keyName   | instanceType |        launchTime        |                  placement                  |   kernelId   |   monitoring   |    subnetId     |    vpcId     | privateIpAddress | sourceDestCheck |                  groupSet                   |                                           stateReason                                           | architecture | rootDeviceType | rootDeviceName |                                                         blockDeviceMapping                                                         | virtualizationType | tagSet | key  |  value   |
+------------+--------------+----------------------+------------------------------------------+------------+--------------+--------------------------+---------------------------------------------+--------------+----------------+-----------------+--------------+------------------+-----------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+--------------+----------------+----------------+------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------+----------+
| i-83c5d4e0 | ami-31814f58 | code=80 name=stopped | User initiated (2011-12-10 01:29:51 GMT) | skysql-ec2 | m1.small     | 2011-12-09T20:41:39.000Z | availabilityZone=us-east-1c tenancy=default | aki-805ea7e9 | state=disabled | subnet-c44fafac | vpc-cd4fafa5 | 10.0.0.164       | true            | item= groupId=sg-134b547f groupName=default | code=Client.UserInitiatedShutdown message=Client.UserInitiatedShutdown: User initiated shutdown | i386         | ebs            | /dev/sda1      | item= deviceName=/dev/sda1 ebs= volumeId=vol-19ec6174 status=attached attachTime=2011-12-10T01:30:32.000Z deleteOnTermination=true | paravirtual        |        |      |          |
|            |              |                      |                                          |            |              |                          |                                             |              |                |                 |              |                  |                 |                                             |                                                                                                 |              |                |                |                                                                                                                                    |                    |        | Name | ndb32-02 |
+------------+--------------+----------------------+------------------------------------------+------------+--------------+--------------------------+---------------------------------------------+--------------+----------------+-----------------+--------------+------------------+-----------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+--------------+----------------+----------------+------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------+----------+

real    0m1.546s
user    0m0.123s
sys    0m0.035s

Well, the output format isn't exactly any more appealing than what you get from the Amazon tool, but it sure gives it to you a lot faster! A little poking around showed me that the aws tool allows you to forego the pretty-printing and get the actual XML that the tool receives from the AWS API.

$ aws --xml din i-83c5d4e0
<?xml version="1.0" encoding="UTF-8"?>
<DescribeInstancesResponse xmlns="http://ec2.amazonaws.com/doc/2011-11-01/">
    <requestId>4e1bf76d-ad02-439b-b255-108e09713251</requestId>
    <reservationSet>
        <item>
            <reservationId>r-7db4731c</reservationId>
            <ownerId>801025846226</ownerId>
            <groupSet/>
            <instancesSet>
                <item>
                    <instanceId>i-83c5d4e0</instanceId>
                    <imageId>ami-31814f58</imageId>
                    <instanceState>
                        <code>80</code>
                        <name>stopped</name>
                    </instanceState>
                    <privateDnsName/>
                    <dnsName/>
                    <reason>User initiated (2011-12-10 01:29:51 GMT)</reason>
                    <keyName>skysql-ec2</keyName>
                    <amiLaunchIndex>0</amiLaunchIndex>
                    <productCodes/>
                    <instanceType>m1.small</instanceType>
                    <launchTime>2011-12-09T20:41:39.000Z</launchTime>
                    <placement>
                        <availabilityZone>us-east-1c</availabilityZone>
                        <groupName/>
                        <tenancy>default</tenancy>
                    </placement>
                    <kernelId>aki-805ea7e9</kernelId>
                    <monitoring>
                        <state>disabled</state>
                    </monitoring>
                    <subnetId>subnet-c44fafac</subnetId>
                    <vpcId>vpc-cd4fafa5</vpcId>
                    <privateIpAddress>10.0.0.164</privateIpAddress>
                    <sourceDestCheck>true</sourceDestCheck>
                    <groupSet>
                        <item>
                            <groupId>sg-134b547f</groupId>
                            <groupName>default</groupName>
                        </item>
                    </groupSet>
                    <stateReason>
                        <code>Client.UserInitiatedShutdown</code>
                        <message>Client.UserInitiatedShutdown: User initiated shutdown</message>
                    </stateReason>
                    <architecture>i386</architecture>
                    <rootDeviceType>ebs</rootDeviceType>
                    <rootDeviceName>/dev/sda1</rootDeviceName>
                    <blockDeviceMapping>
                        <item>
                            <deviceName>/dev/sda1</deviceName>
                            <ebs>
                                <volumeId>vol-19ec6174</volumeId>
                                <status>attached</status>
                                <attachTime>2011-12-10T01:30:32.000Z</attachTime>
                                <deleteOnTermination>true</deleteOnTermination>
                            </ebs>
                        </item>
                    </blockDeviceMapping>
                    <virtualizationType>paravirtual</virtualizationType>
                    <clientToken/>
                    <tagSet>
                        <item>
                            <key>Name</key>
                            <value>ndb32-02</value>
                        </item>
                    </tagSet>
                    <hypervisor>xen</hypervisor>
                </item>
            </instancesSet>
            <requesterId>058890971305</requesterId>
        </item>
    </reservationSet>
</DescribeInstancesResponse>

Sweet, sweet data! Hold on, though, I can't use grep to get at that. I'm going to have to remember how to interact with XML documents; I decided I had better see if I could dig up any XPath knowledge.

The next question was what tool I wanted to use to execute some XPath expressions against. I was not very keen on having to write an entire perl or python script to read the XML, build it into some DOM, and then loop several times over crusty data structures to get the data I wanted. I wanted to be able to do some more generalized things that are very easily accomplished in XPath, such as getting a list of instances based on a prefix of their Name, get a list of "stopped" instances, get a list of instances with public IP addresses, et cetera.

I figured there must be some command-line tool that would let me execute arbitrary XPath against an XML file. After poking around a while, I found XMLStarlet [http://xmlstar.sourceforge.net/]. Installing this on my MacBook Pro using Homebrew [http://mxcl.github.com/homebrew/] was very easy and I was off to the races.

After grappling for a very annoying amount of time with XML namespaces, I eventually figure I'd just strip the thing out so that I didn't have to deal with it. (If you leave the namespace in, you have to give it an alias and then specify that before every tag in your XPath expressions. No, thanks.)

$ cat strip_xmlns 
sed 's/ xmlns="[^>]*"//'

The xmlstarlet/xmlstar/xml tool works by specifying a template that includes some expression to match and some expressions to generate output. The tool does a lot, so some of the options can appear to be a bit verbose at first. Here's a very basic use of the tool to get just a list of instance IDs:

$ aws --xml din | strip_xmlns | xml sel -T -t -m '//instancesSet/item' -v 'instanceId' -n
i-d1dbceb2
i-afdacfcc
i-cbc2d7a8
i-99bfaafa
i-1d40547e
i-f7c5d494
i-83c5d4e0
i-77c4d514
i-75c4d516
i-47feee24
i-707d9512

You can see the XSLT that the tool is applying internally by using the -C option:

$ aws --xml din | strip_xmlns | xml sel -C -t -m '//instancesSet/item' -v 'instanceId' -n
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output omit-xml-declaration="yes" indent="no"/>
<xsl:template match="/">
<xsl:for-each select="//instancesSet/item">
<xsl:value-of select="instanceId"/>
<xsl:value-of select="'&#10;'"/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

OK, so, there's a tool that will let me execute some XPath and get back information about my instances, that's nice. Instead of trying to parse some formatted output, I should be able to select the XML elements I want for a particular task.

Say I want the instance IDs of all instances that are stopped:

  aws --xml din | strip_xmlns | xml sel -T -t -m '//instancesSet/item[instanceState/name="stopped"]' -v 'instanceId' -n

Or maybe I want all instances with Names that start with the string "ndb":

  aws --xml din | strip_xmlns | xml sel -T -t -m '//instancesSet/item[starts-with(tagSet/item[key="Name"]/value, "ndb32")]' -v 'instanceId' -n

Instead of having to write several loops in perl or python, I'm able to write a very straightforward expression that matches just the nodes I want. Instead of writing that XPath every time, of course, I'll put a few of the more popular ones into a script along with some flexibility to provide arbitrary filtering. (I call this WHERE in the script because that's the first thing my DBMS-addled brain came up with!)

#!/bin/bash
while getopts "p:s:w:" OPTION
do
    case $OPTION in
        p)
            WHERE="[starts-with(tagSet/item[key='Name']/value, '$OPTARG')]"
            ;;
        s)
            WHERE="[instanceState/name = '$OPTARG']"
            ;;
        w)
            WHERE="$OPTARG"
            ;;
    esac
done

MATCHEXPR="/DescribeInstancesResponse/reservationSet/item/instancesSet/item$WHERE"

aws --xml din | strip_xmlns | xml sel -T -t -m "$MATCHEXPR" -o "instanceId    " -v instanceId -n -o "instanceName    " -v tagSet/item[key=\"Name\"]/value -n -o "privateIp    " -v privateIpAddress -n -o "ipAddress    " -v ipAddress -n -o "instanceState    " -v instanceState/name -n -n

$ ec2-ls -p ndb
$ ec2-ls -s stopped
$ ec2-ls -w "[instanceType='m1.small']"

My script returns several items that may or may not be of interest to others. Further extension to the script could easily make the list of items returned a bit more useful. From that basically reasonable if limited script, I vastly overreached my bash skills and turned it into this monstrosity:

#!/bin/bash
OUTPUT="instanceId;instanceName:tagSet/item[key='Name']/value;privateIp:privateIpAddress;ipAddress;instanceState:instanceState/name"
DELIM=" " #there might be a <tab> in there!
declare -a XMLARGS

push()            # Push item on stack.
{
        if [ -z "$1" ]    # Nothing to push?
        then
          return
        fi
        XMLARGS[${#XMLARGS[*]}]="$1"
        return
}


while getopts "p:s:w:o:d:D" OPTION
do
        case $OPTION in
                p)
                        WHERE="[starts-with(tagSet/item[key='Name']/value, '$OPTARG')]"
                        ;;
                s)
                        WHERE="[instanceState/name = '$OPTARG']"
                        ;;
                w)
                        WHERE="$OPTARG"
                        ;;
                o)
                        OUTPUT="$OPTARG"
                        ;;
                d)
                        DELIM="$OPTARG"
                        ;;
                D)
                        DEBUG=1
                        ;;
                
        esac
done
shift $((OPTIND-1)) #something about argument processing, supposedly

for i in "sel" "-T" "-t" "-m"; do
        push "$i"
done

MATCHEXPR="/DescribeInstancesResponse/reservationSet/item/instancesSet/item$WHERE"
push "$MATCHEXPR";

OLDIFS=$IFS;
IFS=";"
for f in $OUTPUT; do
        FIELDNAME=$(echo $f | cut -d':' -f 1)
        FIELDEXPR=$(echo $f | cut -d':' -f 2)
        if [[ -z $FIELDEXPR ]]; then
                FIELDEXPR=$FIELDNAME
        fi
        push "-o";
        push "$FIELDNAME$DELIM";
        push "-v";
        push "$FIELDEXPR";
        push "-n";
done
push "-n";
IFS=$OLDIFS


if [[ $DEBUG -eq 1 ]]; then
        echo "$MATCHEXPR" >&2 
        echo "${XMLARGS[@]}" >&2
fi

aws --xml din | strip_xmlns | xml "${XMLARGS[@]}"

I'm sure there are plenty of problems with that script, but at least now I can finally get the information I want about my EC2 instances!

$ ec2-ls -p ndb32 -o "instanceId;privateIp:privateIpAddress"
Syndicate content
There are currently 1 user and 36 guests online.

Online users

  • anthony.flynn@s...
Training
Events