Transparent Database Sanitization with GDPR-dump
With GDPR in full effect, sanitization of user data is a fairly hot topic. Here at Amazee we take our clients and our clients’ clients privacy seriously, so we have been investigating several possible approaches to anonymizing data.
In the Drupal world, and the PHP world more generally, there are several options available. Here, though, I’d like to discuss one we think is particularly cool.
At Amazee Labs’ Global Maintenance, we work with several different projects per day. We move data from our production to staging and dev servers, and from our servers to our local development environments. Especially on legacy systems, site-specific configuration details often exist only in the databases, and even if that weren’t the case, the issues we’re investigating routinely require that we dig into the database as it (more or less) is on the production servers. Anonymization is crucial for our day to day work.
So our considerations here are, how do we balance productivity while keeping things anonymous?
One way of achieving this is to make Anonymization transparent to the developer. Essentially, we want our developers to be able to pull down the live database as it exists at the moment that they pull it down, and have it be anonymized.
How can we achieve this?
Well, one way is to analyse the daily workflow to see if there are any points at which the data has to flow through before it reaches the developer?
It turns out that, if you’re working with mysql, this “final common path” that the data flows through is the mysqldump utility.
If you’re running backups, chances are you’re using mysqldump.
If you’re doing a drush sql-sync there’s a call to mysqldump right at the heart of that process.
Mysqldump is everywhere.
The question is, though, how do we anonymize data using myqldump?
The standard mysqldump binary doesn’t support anonymization of data, and short of writing some kind of plugin, this is a non-starter.
Fortunately for us, Axel Rutz came up with an elegant solution, namely, a drop in replacement for the mysqldump binary, which he called gdpr-dump. A few of us here at Amazee loved what he was doing, and started chipping in.
The central idea is to replace the standard mysqldump with gdpr-dump so that any time the former is called, the latter is called instead.
Once the mysqldump call has been hijacked, so to speak, the first order of business is to make sure that we are actually able to dump the database as expected.
This is where mysqldump-php comes in. It’s the library on which the entire gdpr-dump project is based. It provides a pure PHP implementation of mysqldump as a set of classes. On its own, it simply dumps the database, just as the native mysqldump cli tool does.
A great starting point, but it only gets us part of the way.
What we’ve added is the ability to describe which tables and columns in the database being dumped you would like to anonymize. If, for instance, you have a table describing user data with their names, email, telephone numbers, etc. You can describe the structure of this table to gdpr-dump and it will generate fake, but realistic looking, data using the Faker library.
This requires some upfront work, mapping the tables and columns, but once it is done you’re able to call mysqldump in virtually any context, and it will produce an anonymized version of your database.
There is still a lot of thinking and work to be done, but we think it’s worth investing time in this approach. The fact that it can be used transparently is its most compelling aspect - being able to simply swap out mysqldump with gdpr-dump and have the anonymization work without having to change any of the dependent processes.
If any of this piques your interest and you’re looking for more details about how you might be able to use gdpr-dump in your own workflow, feel free to check out the project (and submit PRs): https://github.com/machbarmacher/gdpr-dump.