Schema Evolution with Hive and Parquet using partitioned views

One cool feature of Parquet is that it supports schema evolution. But let’s take a step back and discuss what schema evolution means.

As every DBA knows, data definitions change over time: we may want to add a new column, remove an obsolete one, or do more complex things, for instance breaking one column into multiple columns, like splitting a string address “1234 Spring Road, Springfield, MA” into its components (number, street, city, state).

Schema changes can, unfortunately, break queries that users run. If, for instance, we change the type or name of a column, any query that uses that column will fail.

It’s worth noting that if we add a new column, old queries will still work, and already existing records will have NULL as the value for the new column, unless we populate it ourselves.
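For example, in Hive this is a cheap, metadata-only change. A minimal sketch, using an illustrative table and column name:

    -- Hypothetical table: add a new column to it.
    ALTER TABLE people ADD COLUMNS (phone STRING);

    -- Old queries keep working; rows written before the
    -- change come back with phone = NULL.
    SELECT name, phone FROM people;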

Parquet supports this kind of mild schema evolution, with some caveats described in this excellent article: Data Wrangling at Slack

Now, the techniques I will talk about are not actually tied to Parquet; they also work for other formats (text, JSON, ...).

Basically, different versions of Parquet used in different tools (Presto, Spark, Hive) may handle schema changes slightly differently, causing a lot of headaches. At Slack, they solved this by forking their own ParquetInput/OutputFormat and using that in all their tools (Spark, Hive, Presto, ...) in order to avoid discrepancies between them.

What if we want to evolve the schema in a way that Parquet can’t support?

We just said that Parquet basically only supports the addition of new columns, but what if we have a change like the following:

  • renaming a column
  • changing the type of a column, including from scalar to array
  • keeping the type, but changing the meaning of a column (for instance, switching from pounds to kilograms)

Now, we want to support compatibility both ways, that is, after we upgrade the schema:

  • backward compatibility: old queries still run and return correct results on new data
  • forward compatibility: old data works correctly with new queries

Let’s go through this exercise with a practical example. Let’s say we have a partitioned table like this:
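A minimal sketch of what the DDL could look like (the names are illustrative; note that weight is stored in pounds, which will matter later):

    CREATE TABLE people_data_v1 (
      name   STRING,
      weight DOUBLE,   -- in pounds
      phone  STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET;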

Note that the table is partitioned by date. For partition pruning in Hive to work, it is really important that the views are aware of the partitioning scheme of the underlying tables.

Hive will do the right thing: when a query filters on the partition column, it will look through the views and use the partitioning information to limit the amount of data it reads from disk.
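A sketch of such a view, assuming the table above; as long as the view selects the dt partition column, a query filtering on it only reads the matching partitions:

    CREATE VIEW people_v1 AS
    SELECT name, weight, phone, dt
    FROM people_data_v1;

    -- Partition pruning works through the view: only the
    -- dt='2016-01-01' partition is read from disk.
    SELECT name, phone FROM people_v1 WHERE dt = '2016-01-01';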

Now we are in this situation, with the square box denoting a view, while the ellipse denotes an actual table:

We are going to create a new table for each schema, to make sure we don’t unintentionally mix data with different schemata.

We are going to name the tables with the pattern people_data_v?? and the views people_v??, with the implicit contract that under a view of a given version, the data will look like that version of the schema.

This way we can keep backward compatibility. We can also keep an extra top-level view called people that points to the most recent view.
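A sketch of the top-level view, given the names above:

    -- 'people' always points at the most recent schema.
    CREATE VIEW people AS
    SELECT * FROM people_v1;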

Let’s suppose that the weight is in pounds, but we are asked to change the field to report kilograms.

We are now in this situation (again, square boxes are views while round ones are tables):
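A sketch of how this could be wired up, assuming a new people_data_v2 table that stores weight in kilograms and a conversion factor of 0.4536 kg per pound:

    CREATE TABLE people_data_v2 (
      name      STRING,
      weight_kg DOUBLE,
      phone     STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET;

    -- New schema over both generations of data:
    -- old rows are converted from pounds to kg on the fly.
    CREATE VIEW people_v2 AS
    SELECT name, weight_kg, phone, dt FROM people_data_v2
    UNION ALL
    SELECT name, weight * 0.4536 AS weight_kg, phone, dt
    FROM people_data_v1;

    -- Backward compatibility: the old view keeps its old
    -- schema, converting new data back to pounds.
    ALTER VIEW people_v1 AS
    SELECT name, weight, phone, dt FROM people_data_v1
    UNION ALL
    SELECT name, weight_kg / 0.4536 AS weight, phone, dt
    FROM people_data_v2;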

It’s worth noting that your strategy may vary: your people view may at first keep pointing to the old schema instead of the new one. Using views to abstract schema changes is cheap and extremely easy to roll back in case your users are not ready to upgrade to the new schema. So your strategy may differ depending on what you have to prioritize – backward compatibility, or prompt availability of the new schema.
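Since no data moves, switching (or rolling back) the top-level view is a single statement:

    -- Promote the new schema...
    ALTER VIEW people AS SELECT * FROM people_v2;
    -- ...or roll back if your users aren't ready for it.
    ALTER VIEW people AS SELECT * FROM people_v1;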

Finally, let’s suppose that we may have multiple phone numbers, so one column has to change type.

For forward compatibility, we just need to wrap the phone number in an array, so data from previous versions will look like either a NULL value or an array of size 1.
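A sketch, assuming a new people_data_v3 table whose phones column is an ARRAY&lt;STRING&gt;; Hive’s array() wraps the old scalar values:

    CREATE TABLE people_data_v3 (
      name      STRING,
      weight_kg DOUBLE,
      phones    ARRAY<STRING>
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET;

    -- New schema over all generations of data: an old scalar
    -- phone becomes NULL or an array of size 1.
    CREATE VIEW people_v3 AS
    SELECT name, weight_kg, phones, dt FROM people_data_v3
    UNION ALL
    SELECT name, weight_kg,
           IF(phone IS NULL, NULL, array(phone)) AS phones, dt
    FROM people_data_v2
    UNION ALL
    SELECT name, weight * 0.4536 AS weight_kg,
           IF(phone IS NULL, NULL, array(phone)) AS phones, dt
    FROM people_data_v1;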

So you can grow a tree of views, storing the logic for each schema reconciliation at each step. Note that if you want to preserve all the schema versions in both a backward- and a forward-compatible way, you’ll actually end up with two trees of views. In practice, backward compatibility is more important – new, more complex data has more value than old data – so you should adapt these techniques to your case.

A little bit more about backward compatibility (in the form of the view people_v2_back), that is, having new data work with the old schema. This may not always be possible if we don’t have a way to translate one value to the other.

Also, there may be more than one way to manage the transformation. Which one would make more sense?

In this case, when querying the old schema, we are expecting only one phone number, not many, so what can we do? I can think of two solutions:

  • join all the phone numbers into a single string using a separator (,) – this returns all the data, but data consumers may not expect the change of format.
  • return the first available phone number – fully compatible, but masks some data.

Which one is the best? It really depends on your users. I chose the one that won’t break things for sure (returning the first phone number).
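A sketch of people_v2_back under that choice, reusing the v3 table above; the joined-string alternative is left as a comment:

    -- Old (v2) schema over the new data: expose only the
    -- first phone number so old queries keep working.
    CREATE VIEW people_v2_back AS
    SELECT name, weight_kg, phones[0] AS phone, dt
    FROM people_data_v3;

    -- The alternative would return all numbers as one string:
    --   concat_ws(',', phones) AS phone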
