
May 9, 2022

Adding new columns to large Magento core database tables

Yireo Blog Post

A common belief amongst Magento developers is that adding a new column to a large database table will cause the Magento setup upgrade scripts to run for a long time, thus causing downtime. I thought this as well in the past. But is this so? As far as I can see, things are different today.

The origin

In a previous blog - Modifying Magento core database tables? - I blogged about how extending core tables might simplify Magento development, compared to enforcing additional tables for extension attributes: adding your own table might be clean, but it causes so much additional work that it might be easier to just add a new column to the existing tables. And thanks to Magento's declarative schemas, this can be done easily and in a maintainable way.

One reply I got - I'm not going to call out Luuk by name - was that adding new columns to huge core tables (tons of records) would take hours. And because such an upgrade is connected to code, it needs to be performed during a bin/magento setup:upgrade while in maintenance mode. In short, it leads to hours of downtime as well.

But is this so? Does such a change lead to so much downtime? And is that downtime really needed? Let's see.

The setting

With Magento, we can generate performance fixtures nicely via the CLI. I chose the extra-large set to make sure my database was as large as possible.

bin/magento setup:perf:generate-fixtures -s ./setup/performance-toolkit/profiles/ce/extra_large.xml

I have a computer with 32 GB of RAM and plenty of SSD space. Generating the data of this set only took 7 hours or so and it produced a database of 7.5 GB in total. I cranked up the MySQL instance with InnoDB buffering and such, but that's a bit beside the point. The performance fixture profile is documented to create about 600,000 products plus some configurable products. In the end, those configurables turned out to also require their own simple products. I got 995,928 SKUs in total, meaning 995,928 entries in the catalog_product_entity database table, for instance.

Needless to say, running bin/magento indexer:reindex took some time as well (3 hours).

MySQL 5.6+

Now, the claim was that a simple ADD COLUMN would also take hours. However, even though I'm not a database expert, I do know from experience that these issues happened with older MySQL versions. But I also remember having read that since MySQL 5.6 things are different. Reading back the MySQL manual, I found that for MySQL versions older than 5.6 locks are required, but with MySQL 5.6+ and InnoDB, locks can be avoided for many ALTER TABLE operations. The Percona Toolkit offers a tool, pt-online-schema-change, to work around this with MySQL 5.5 or older. But I have to say that - since MySQL 5.6 - I haven't really bumped into issues like this anymore.
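
For the record, such an online schema change with the Percona tool looks roughly like this (a sketch only, assuming a database named magento; adjust the DSN to your own environment):

pt-online-schema-change --alter "ADD COLUMN example_custom VARCHAR(255)" D=magento,t=catalog_product_entity --execute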

When locks are required, expect hours of work, comparable to the reindexing jobs. But when does a lock occur and when not? And does the speed of a column addition actually relate to locking, or is something else going on performance-wise?

Again, I don't see myself as a database expert, so I started to test things out on my fresh new database to see for myself. Note that these tests are not run on MySQL 5.6, but on MySQL 5.7 instead.

Seconds, not hours

I added a new module with a simple db_schema.xml file that adds a varchar column to catalog_product_entity, the table containing about a million entries:

<?xml version="1.0"?>
<schema xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="urn:magento:framework:Setup/Declaration/Schema/etc/schema.xsd">
    <table name="catalog_product_entity" resource="default">
        <column xsi:type="varchar" name="example_custom"/>
    </table>
</schema>

After the module was enabled, I ran bin/magento setup:upgrade so that the schema would be applied. This took 30 seconds in total, including the 25 seconds or so that the command normally takes anyway. In other words, adding a fresh column to an existing table with a million records took merely a few seconds.
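
For completeness, the steps were roughly the following (with Yireo_ExampleColumn being a made-up module name for this experiment):

bin/magento module:enable Yireo_ExampleColumn
bin/magento setup:upgrade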

Alright, this is how I understood it: even though MySQL maintains indexes and keys for a table, adding a new column simply adds a new field without changing the existing table structure. Therefore, little calculation is needed.
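
Under the hood, the declarative schema change above boils down to a plain ALTER TABLE statement, roughly like the following (the exact column definition Magento generates may differ, for instance in the default length):

ALTER TABLE catalog_product_entity ADD COLUMN example_custom VARCHAR(255) NULL;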

Adding a column with indexing

But what if that new column is also combined with an existing column like entity_id to build a new index? In that case, MySQL would be forced to go through all one million records and recalculate things, right?

<?xml version="1.0"?>
<schema xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="urn:magento:framework:Setup/Declaration/Schema/etc/schema.xsd">
    <table name="catalog_product_entity" resource="default">
        <column xsi:type="varchar" name="example_custom"/>
        <index referenceId="EXAMPLE_CUSTOM" indexType="btree">
            <column name="entity_id"/>
            <column name="example_custom"/>
        </index>
    </table>
</schema>
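
For reference, this roughly corresponds to one combined statement (the index name here is illustrative; Magento derives its own name from the table and column names):

ALTER TABLE catalog_product_entity ADD COLUMN example_custom VARCHAR(255) NULL, ADD INDEX CATALOG_PRODUCT_ENTITY_ENTITY_ID_EXAMPLE_CUSTOM (entity_id, example_custom);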

The outcome here was actually unchanged: a few seconds and bin/magento setup:upgrade was done. This already surprised me. Apparently, MySQL has simply become a lot smarter when it comes to indexing and recalculations on a table.

Changing an existing column

Last but not least, I went in and changed the existing column sku from 64 to 255 characters. First of all, don't do this. In my previous blog I pointed out that there are various evil things that an extension could do, and this is one of them.

<?xml version="1.0"?>
<schema xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="urn:magento:framework:Setup/Declaration/Schema/etc/schema.xsd">
    <table name="catalog_product_entity" resource="default">
        <column xsi:type="varchar" name="sku" nullable="false" length="255" comment="SKU"/>
    </table>
</schema>

But still, applying this to the Magento database again took only a few seconds.
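
For reference, this change corresponds roughly to the following statement:

ALTER TABLE catalog_product_entity MODIFY COLUMN sku VARCHAR(255) NOT NULL COMMENT 'SKU';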

Ok, I hear you

So in my simplified tests, I was not able to confirm that you would need hours of downtime when adding a new column to an existing Magento core table. Perhaps I'm overlooking something, perhaps I'm not seeing a certain configuration that is going to be harmful. But in general, I would simply say that adding custom columns to core tables is not a bad thing - see also the other points mentioned in my blog Modifying Magento core database tables?.

But let's say that there is still such an edge-case scenario. First of all, I would say that it is the job of an implementor (an agency, the developer aiding the merchant) to test this out. Are you scared of hours of downtime? Test it first.

But is there anything else we can do?

MySQL 8 instant column adds

Interestingly, MySQL 8 has a new feature: instant column adds. While in my cases the columns were added with only a little bit of downtime, there could be other cases where the downtime is still undesired: simply because zero downtime is most important, or perhaps because the database tables are larger, or because there is a combination of columns that I didn't test for. In those cases, MySQL 8 might come to the rescue.

With many ALTER TABLE statements in MySQL 8, you can add an ALGORITHM clause, which can be set to various values including INSTANT and INPLACE.

ALTER TABLE foo ADD bar VARCHAR(256), ALGORITHM=INSTANT
ALTER TABLE foo ADD bar VARCHAR(256), ALGORITHM=INPLACE

For every algorithm, there are caveats and things to think of; refer to the MySQL docs. When you specify an algorithm explicitly and the operation does not support it, MySQL refuses the statement with an error rather than silently falling back. Also, a possible requirement could be to temporarily turn off unique_checks and foreign_key_checks. It makes this a cool thing to consider, but with Magento it would be quite complicated. In general, the algorithm is something to apply on a per-query basis, and therefore just hacking it into vendor/magento/framework/DB/Adapter/Pdo/Mysql.php will likely leave you worse off.
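
As a sketch only (not something to run blindly in production), an instant column addition with those checks temporarily disabled for the session could look like this:

SET SESSION unique_checks = 0;
SET SESSION foreign_key_checks = 0;
ALTER TABLE catalog_product_entity ADD COLUMN example_custom VARCHAR(255) NULL, ALGORITHM=INSTANT;
SET SESSION unique_checks = 1;
SET SESSION foreign_key_checks = 1;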

Blue Green deployment

Another way to look at the theoretical problem is this: if locking is no longer an issue, another problem could be that a newly created column is going to be referred to by code. And if the code refers to the column while the database has not yet finished applying the column, PHP Fatal Errors could be the end result. However, the database only alters the structure of the table after the work is done, so if your code is intelligent enough, there should not be an issue here.

A model (DTO) could include a setter and getter for the new column, but these could simply return an empty value if the database is not ready for that column yet. Likewise, collections and repositories could also work around this. In other words, you could update the code first and then apply the database change, and things could still work.

With Magento 2.4.4, a new deployment option was added: Blue Green deployment. The term could mean many things, but in the case of Magento it seems to refer to a deployment where the code goes live while the database is not yet up-to-date (bin/magento setup:upgrade has not completed yet), without exceptions being thrown. When adding a new column, I believe there are other caveats, like the DDL definitions being cached and such. I haven't tried this out, but I'm quite optimistic that this could work somehow, assuming you're willing to experiment with an alternative deployment to your average DeployerPHP recipe.

Back to the beginning

Bottom line: in my previous blog, I proposed that it is no longer evil to let extensions add new columns to core tables. In this blog, I went through some testing with large core tables and still found no evil. Disagree with me? Let me know!

EDIT 1 (May 28th 2022)

Following comments on LinkedIn, I dived in and benchmarked which kind of change causes the longest waiting time: for adding a new column, adding a new index or changing columns or indexes, a recalculation of all records is not needed (as mentioned in this blog), and therefore the Magento downtime is minimal. However, adding foreign key constraints does have a huge impact. It is in that specific situation that MySQL needs to go through all rows, and with millions of records the downtime is the largest.
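
To illustrate, it is a statement of this kind (with a made-up column name and constraint name) that forces MySQL to validate every existing row:

ALTER TABLE catalog_product_entity ADD COLUMN example_store_id SMALLINT UNSIGNED NULL, ADD CONSTRAINT FK_EXAMPLE_PRODUCT_STORE FOREIGN KEY (example_store_id) REFERENCES store (store_id);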

EDIT 2 (June 1st 2022)

Following a comment on Twitter that modifying a core table like catalog_product_entity is a bit of a dumb scenario, and that a table like sales_order would be better to compare with: I redid the benchmark, but as expected, the timing of adding a column (or a variation thereof) does not differ here. In other words, the findings of this blog also hold for the order tables, which change the most, even when they contain millions of records.


About the author


Jisse Reitsma is the founder of Yireo, extension developer, developer trainer and 3x Magento Master. His passion is for technology and open source. And he loves talking as well.
