Intro

In the case of a database that's in use for a long time and contains user-generated data, some columns might outgrow the data types that were used to store the data.

This blog post will describe a tool called fieldtop that checks for overflows and underflows in all tables and databases stored in a MySQL server.

Use-case

Gnoosic.com has been running for over 10 years. A user first tells Gnoosic artists they like. Then the user is presented with other artists he might like and he can pick which ones he likes and which ones he doesn't. As time goes by, Gnoosic learns more about what users like and stores those preferences in a database.

Among other information, Gnoosic also stores the popularity of different bands. This information is stored as an integer data type. So, the popularity field for the entry about Pink Floyd reached 1.3 billion. The maximum allowed for the INT data type is 2.1 billion (more precisely, 2,147,483,647) Because of this, a collaboration was made to develop a tool that would check for situations like these so they can be avoided.

Running the tool uncovered a few other columns in different tables of the Gnoosic database that required attention, and since this tool was built to support decisions about the schema of MySQL database, it was opensourced under MIT license.

In addition to the main use-case, there is a dual use-case. If the tool shows that the maximum values stored in a column are a long way from reaching the maximum, and if the data type allows (for text data types), the length of these columns can be fine-tuned for more efficient use of disk space.

What data types are applicable

While MySQL supports many data types. The tool is applicable to:

This tool is not applicable to:

  • BIT fields
  • columns storing UUID values
  • columns storing IP addresses
  • columns storing hash values (MD5, SHA1 or other ones)

The inner-workings of fieldtop

The tool gets the data types of each column from information_schema, then it computes how close those values are to the maximum allowed values for their respective data types.

The following query is used to fetch information from information_schema:

SELECT
    b.COLUMN_NAME,
    b.COLUMN_TYPE,
    b.DATA_TYPE,
    b.signed,
    a.TABLE_NAME,
    a.TABLE_SCHEMA
FROM (
    -- get all tables
    SELECT
    TABLE_NAME, TABLE_SCHEMA
    FROM information_schema.tables
    WHERE 
    TABLE_TYPE IN ('BASE TABLE', 'VIEW') AND
    TABLE_SCHEMA NOT IN ('mysql', 'performance_schema')
) a
JOIN (
    -- get information about columns types
    SELECT
    TABLE_NAME,
    COLUMN_NAME,
    COLUMN_TYPE,
    TABLE_SCHEMA,
    DATA_TYPE,
    (!(LOWER(COLUMN_TYPE) REGEXP '.*unsigned.*')) AS signed
    FROM information_schema.columns
) b ON a.TABLE_NAME = b.TABLE_NAME AND a.TABLE_SCHEMA = b.TABLE_SCHEMA
ORDER BY a.TABLE_SCHEMA DESC;

The results of this query are fetched inside a PHP program 1 that analyzes them and computes the percentage of the maximum value for the data type that was found in that each column.

Demo

Below is a short demo that showcases the tool described in the post:

Conclusion

We've covered the usage of a database tool that can be used to prevent overflowing columns in a proactive manner.

Footnotes:

1

It's possible to have all the logic inside of a stored procedure, in a similar way to this SO thread, but this tool covers more situations and a flexible language (PHP) was required to describe and handle all those situations