FieldTop - find columns that might overflow soon
Table of Contents
Intro
In the case of a database that's in use for a long time and contains user-generated data, some columns might outgrow the data types that were used to store the data.
This blog post will describe a tool called fieldtop that checks for overflows and underflows in all tables and databases stored in a MySQL server.
Use-case
Gnoosic.com has been running for over 10 years. A user first tells Gnoosic artists they like. Then the user is presented with other artists he might like and he can pick which ones he likes and which ones he doesn't. As time goes by, Gnoosic learns more about what users like and stores those preferences in a database.
Among other information, Gnoosic also stores the popularity of
different bands. This information is stored as an integer data type.
So, the popularity field for the entry about Pink Floyd reached 1.3 billion.
The maximum allowed for the INT data type is 2.1 billion (more
precisely, 2,147,483,647) Because of this, a collaboration was made
to develop a tool that would check for situations like these so they
can be avoided.
Running the tool uncovered a few other columns in different tables of the Gnoosic database that required attention, and since this tool was built to support decisions about the schema of MySQL database, it was opensourced under MIT license.
In addition to the main use-case, there is a dual use-case. If the tool shows that the maximum values stored in a column are a long way from reaching the maximum, and if the data type allows (for text data types), the length of these columns can be fine-tuned for more efficient use of disk space.
What data types are applicable
While MySQL supports many data types. The tool is applicable to:
- signed and unsigned integer data types
- char, varchar
- text data types
This tool is not applicable to:
- BIT fields
- columns storing UUID values
- columns storing IP addresses
- columns storing hash values (MD5, SHA1 or other ones)
The inner-workings of fieldtop
The tool gets the data types of each column from information_schema, then it computes how close those values are to the maximum allowed values for their respective data types.
The following query is used to fetch information from information_schema:
SELECT b.COLUMN_NAME, b.COLUMN_TYPE, b.DATA_TYPE, b.signed, a.TABLE_NAME, a.TABLE_SCHEMA FROM ( -- get all tables SELECT TABLE_NAME, TABLE_SCHEMA FROM information_schema.tables WHERE TABLE_TYPE IN ('BASE TABLE', 'VIEW') AND TABLE_SCHEMA NOT IN ('mysql', 'performance_schema') ) a JOIN ( -- get information about columns types SELECT TABLE_NAME, COLUMN_NAME, COLUMN_TYPE, TABLE_SCHEMA, DATA_TYPE, (!(LOWER(COLUMN_TYPE) REGEXP '.*unsigned.*')) AS signed FROM information_schema.columns ) b ON a.TABLE_NAME = b.TABLE_NAME AND a.TABLE_SCHEMA = b.TABLE_SCHEMA ORDER BY a.TABLE_SCHEMA DESC;
The results of this query are fetched inside a PHP program 1 that analyzes them and computes the percentage of the maximum value for the data type that was found in that each column.
Demo
Below is a short demo that showcases the tool described in the post:
Conclusion
We've covered the usage of a database tool that can be used to prevent overflowing columns in a proactive manner.
Footnotes:
It's possible to have all the logic inside of a stored procedure, in a similar way to this SO thread, but this tool covers more situations and a flexible language (PHP) was required to describe and handle all those situations