Git in My SQL's Stead
Transcript
Good evening, good afternoon, everybody. My name is Jan Stępień, and I work at INNOQ here in Berlin. We’re a software consultancy; we help people build better software. And in this talk, I’d like to share one of those stories with you.
It’s a recent story from one of our customers, where we faced some interesting problems and came up with some exciting solutions. The customer is a mail-order firm: as a client you get a catalog, and then you can call them and order your items by telephone. There are people on the other side who use software systems to take care of your orders. And this is where we come in and try to help them build a piece of software to manage customer data.
Let me present the landscape of the project. Our system is there to serve the agents of our client, the people who take those phone calls and need to see all the information about each customer: relevant data about shipping, invoicing and so on. They have a web UI, traditional HTML and CSS, peppered with JavaScript. On the other hand, we have an existing legacy system, which is handling everything. We are slowly replacing this system, or at least part of its functionality. Our goal is to replace the parts responsible for customer-specific data. Now let’s zoom in and take a look at our system. Inside we have two web applications. One of them exposes an HTML user interface and the other serves JSON-encoded data to other machines, including the legacy system. They all speak to each other, and the core API is connected to a MariaDB instance where all of the data is stored. Nearly all of the data, that is; there is also an Elasticsearch somewhere for auto-completion, but that’s a detail.
The important thing is that we have just a single instance of the database, which is really lovely because, thanks to it, we don’t need to worry about consistency, replication and similar distributed-systems issues. What I’m showing you right now is a simplification. The real system is a bit more complex, but luckily those details do not matter for now, because we’re going to focus on the database and what’s happening inside it.
What data do we have there? Our world consists of persons. Each person has a person identifier, which is a business key, a country where they’re based, a name and other personal details. That’s the first entity. And this entity is connected to several other entities representing brand-specific information. Our client has several brands in various countries, and you as a customer can place orders with one or several of those brands. At each of those brands, you’re identified by a dedicated customer number. Those are the entities, and we attach a couple more values to them. We need some addresses: one address where you get invoiced, and one, potentially several, to ship the goods you order.
Now there’s one important detail here. The things you see on this slide form our aggregates, aggregates in the sense of domain-driven design. That is, they are the boundaries of our consistency. Those things make no sense separately; they only make sense in the context of this entire aggregated piece of information. That’s important. So that’s what we work with.
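To make the shape of that aggregate a bit more concrete, here is a minimal relational sketch. The table and column names are my assumptions for illustration, not the client’s actual schema:

```sql
-- A minimal sketch of the aggregate; names and types are illustrative only.
CREATE TABLE persons (
  id        BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  person_id VARCHAR(32)  NOT NULL UNIQUE,   -- the business key
  country   CHAR(2)      NOT NULL,
  name      VARCHAR(255) NOT NULL
);

-- One row per brand a person orders from, identified by a customer number.
CREATE TABLE customer_details (
  id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  person_id       BIGINT UNSIGNED NOT NULL,
  brand           VARCHAR(32)     NOT NULL,
  customer_number VARCHAR(32)     NOT NULL,
  FOREIGN KEY (person_id) REFERENCES persons (id)
);

-- One invoice address and potentially several shipping addresses per brand.
CREATE TABLE invoice_addresses (
  id                  BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  customer_details_id BIGINT UNSIGNED NOT NULL,
  street              VARCHAR(255)    NOT NULL,
  city                VARCHAR(255)    NOT NULL,
  FOREIGN KEY (customer_details_id) REFERENCES customer_details (id)
);

CREATE TABLE shipping_addresses (
  id                  BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  customer_details_id BIGINT UNSIGNED NOT NULL,
  street              VARCHAR(255)    NOT NULL,
  city                VARCHAR(255)    NOT NULL,
  FOREIGN KEY (customer_details_id) REFERENCES customer_details (id)
);
```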
Now, we use this data to fulfill various needs. As an agent, I need to look at this data to be able to tell: okay, this is the customer, this is the invoice address. Maybe I need to update something in the shipping addresses, but it’s important that I also see the history of changes within that aggregate, including details like which field was changed when, by whom, and for which reason. And as a supervisor, I can see all the changes for a given brand. On the other side, we have the legacy system, which keeps running as we replace it. It still operates, it still submits changes to us, and we have to accept them. It can also download an entire dump of our database, like a snapshot. And after it gets a snapshot, it subscribes to updates with all the things that happened since that snapshot. Think of it like an RSS feed: a stream, a feed of updates since that point, each with an ID.
How do we represent this data to fulfill those use cases? We have our traditional normalized database tables: customer details, persons, invoice addresses, shipping addresses. That’s enough for the current state, but we also need historical information. So we have another table for the whole history and all the differences between individual fields, to be able to show you this history of changes as a supervisor. And then we have feeds, to be able to produce those updates which happened since the last time our legacy neighbor downloaded a dump of the database. To keep them all in sync, we need a couple of listeners, or callbacks if you like, and even more of them to keep the other tables in sync. You can probably already see that there will be problems, right?
We quickly ran into problems, into difficulties maintaining this entire system and keeping it understandable. It was no longer clear what was writing and when it was writing. There was too much automation. And the mistake we made: we automated the complexity instead of solving the complexity in the first place.
So we looked for solutions. There’s always solution zero, just do nothing. But we quickly rejected it because we noticed that there were problems. We were facing a slowdown of our development process. We just didn’t understand what was going on anymore. So we kept looking.
We looked for some off-the-shelf library or framework for our programming language and our technology. Our technological choices make no difference here. But some of the libraries we looked at resembled tools like PaperTrail, a Ruby gem you might know, which keeps track of changes to your ActiveRecord instances. It’s a wonderful piece of technology, but the problem is that, just like the other tools we found, PaperTrail focuses on changes within a single entity, within a single table. And those, while valuable, were meaningless to us, because we needed changes and modifications spanning the entire aggregate within a single transaction.
So that tool, and the other things we found within our technology environment, did not fulfill our requirements. Maybe we could reevaluate the choice of our database and, instead of the relational approach with MariaDB, replace it with some sort of temporal database. Temporal databases are a very exciting piece of technology because they treat time as a first-class value in your database. They keep track of all the historical changes you’ve made to your data, so you can always ask, “What did I know two months ago?” It’s fantastic and fulfills most of the criteria we needed. The problem is, it turns your world upside down, because it’s not a SQL database anymore. You need a different model. You need different repositories around it. We decided we were not ready to teach ourselves all of this new stuff and replace half of our application, the persistence layer. We kept looking.
There’s event sourcing. Event sourcing is an interesting architectural pattern where you represent your data, your model, not as a point-in-time current state, but as a sequence of business commands which alter the state of your entities. And then when you want the current state, you take all the events, all the commands which affected your entity, and run them to get the final state. It’s super useful, but this would have required essentially a complete rewrite. We had neither the budget nor the time for that. And then we thought: can we introduce some kind of persistent log and keep it within MariaDB? Not replacing our database, keeping it there, and as we write our normalized tables, just keeping track of all the other things we’re doing. And as we gazed into this solution, into this idea, we realized that the actual solution had been gazing back at us from our terminal window the entire time.
Git is a distributed version control system. git clone allows me to obtain a fresh copy of data stored on some remote server. Then, once I have a copy, I can pull new changes afterwards and get all the stuff which happened in the meantime. Given git log, I can take a look at everything which happened since I last looked, both across the entire repository and on a per-file basis, so I can see changes within a single unit. Then, using diff, I can see exactly what changed, which lines changed within a given file. And finally, git checkout allows me to get a working copy so that I can operate on what I cloned and push it up again. So the proposal to replace MariaDB with Git was swiftly rejected, and probably for good reasons, but Git still remained an important source of inspiration as we sketched our solution.
So what we came up with looks as follows. We preserved our existing tables; the normalized ones remain in place. And on top of that, we introduced the action log. The action log is an entire history of every single change, every modification made to the database. It looks as follows. We have our artificial, technical primary key and a timestamp: when did we insert a given row? We have the action we performed and the entity we performed that action on. We still focus on a single aggregate, but for business reasons, there are requirements for that, we need to know exactly which entity we modified. Then we have a whole lot of foreign keys allowing us to pinpoint exactly which row in the normalized tables was modified. There is metadata, which tells us who made the change, when, and what the reason was, all the stuff describing the actual modification. And finally there is the aggregate state: a big JSON with the entire aggregate serialized. Since we never need the contents of this aggregate for any joins or other operations, we can just store it in an opaque JSON format. So that’s the gist of the database. Let’s take a look at what we do with it when we write.
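A rough sketch of what such an action log table might look like in MariaDB; the names and types are my assumptions based on the description above, not the production schema:

```sql
CREATE TABLE action_log (
  id                  BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,        -- artificial technical key
  created_at          TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,      -- when the row was inserted
  action              VARCHAR(32) NOT NULL,   -- e.g. 'create', 'update', 'delete'
  entity              VARCHAR(64) NOT NULL,   -- which entity within the aggregate was touched
  -- Foreign keys pinpointing the modified rows in the normalized tables.
  person_id           BIGINT UNSIGNED NOT NULL,
  customer_details_id BIGINT UNSIGNED NULL,
  invoice_address_id  BIGINT UNSIGNED NULL,
  shipping_address_id BIGINT UNSIGNED NULL,
  -- Metadata: who made the change and why.
  changed_by          VARCHAR(64)  NOT NULL,
  change_reason       VARCHAR(255) NULL,
  -- The entire aggregate, serialized as opaque JSON.
  aggregate_state     JSON NOT NULL,
  INDEX idx_action_log_person (person_id),
  FOREIGN KEY (person_id)           REFERENCES persons (id),
  FOREIGN KEY (customer_details_id) REFERENCES customer_details (id),
  FOREIGN KEY (invoice_address_id)  REFERENCES invoice_addresses (id),
  FOREIGN KEY (shipping_address_id) REFERENCES shipping_addresses (id)
);
```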
When we write into this new database, we begin with the standard insert, update or delete in our normalized tables. And then we record what has been modified by inserting an extra log row into the action log, with all the foreign keys pointing to the right entity. The important thing is that this happens within a single database transaction. So if anything goes wrong, if we accidentally violate some constraint or any kind of inconsistency would be introduced, the transaction is rolled back, we end up in a stable state, and we’re super happy.
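As a sketch of that write path (the identifiers and values are made up for illustration): the change to the normalized table and the corresponding log entry share one transaction, so either both land or neither does.

```sql
START TRANSACTION;

-- 1. The ordinary write to the normalized tables.
UPDATE shipping_addresses
   SET street = 'Neue Strasse 1', city = 'Berlin'
 WHERE id = 42;

-- 2. The log entry describing that modification, pointing at the affected rows.
INSERT INTO action_log
  (action, entity, person_id, customer_details_id, shipping_address_id,
   changed_by, change_reason, aggregate_state)
VALUES
  ('update', 'shipping_address', 7, 11, 42,
   'agent:jdoe', 'customer moved',
   '{"personId": "P-100", "shippingAddresses": [{"street": "Neue Strasse 1", "city": "Berlin"}]}');

COMMIT;  -- any failure before this point rolls back both writes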
Now let’s try to read stuff. All the existing use cases we had, which relied on those normalized tables, remain unchanged, right? Nothing has changed about the data and its representation, so this code is untouched. That’s super nice. It gets more interesting when you start to think about all those historical use cases. For example, when, as an agent, I want to take a look at all the changes which happened to a given customer and see which fields were affected, we just use the action log to select all the rows related to a given aggregate. We have foreign keys with an index on them for that. And then we have to perform the diff to calculate the actual changes within fields. But that’s not an expensive operation, because typically there aren’t many rows per single customer. It gets more exciting when we get to the other side of our system, that is, our legacy neighbor.
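The agent-facing history can then be read with a query roughly like this sketch (names assumed as above); the field-level diff is computed in application code by comparing consecutive aggregate states:

```sql
SELECT id, created_at, action, entity, changed_by, change_reason, aggregate_state
  FROM action_log
 WHERE person_id = 7        -- indexed foreign key to the aggregate root
 ORDER BY created_at, id;   -- oldest first; diff neighboring rows in the application
```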
When our neighbor needs a feed of updates since the last snapshot, it’s quite simple, because it turns out all we need is to find the ID they have already seen, the ID of the snapshot they saw, and then give them all the items from the action log which are newer than that ID, because those are the changes they haven’t seen. That’s fantastic. They just say, “Hey, the last snapshot I’ve seen was the one with this ID. Give me anything newer.” Fantastic. Then we just select this data, stream it, return the serialized JSON from the aggregate state, and they have the data they need. It gets more interesting when we need to prepare that snapshot for them, if our consumer hasn’t seen any data yet, because then they say, “Hey, what’s the data? Where is it?”
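The feed itself boils down to something like this sketch, where 123456 stands in for the last action log ID the consumer reported having seen:

```sql
SELECT id, created_at, action, entity, aggregate_state
  FROM action_log
 WHERE id > 123456   -- everything the consumer has not seen yet
 ORDER BY id;
```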
So then we need to find all the rows in the action log which represent the newest state of each aggregate. We cannot use our normalized tables for that, because there is a risk that while we stream the snapshot to them, the data underneath changes, leading to inconsistencies, and that’s unacceptable. So we need to rely on the log, on the data which has already been written. To achieve that goal, we introduce one more column, and this is where the analogy with Git breaks down a bit.
We introduce superseded-by, another foreign key which points to the same table, to the row which supersedes the current row. This means that for a given aggregate we have something like a linked list of rows reflecting the updates to that individual aggregate. Let’s take a look at the write path again. When we write, the first entry for a given person has nothing to supersede, because we see the person for the first time. So we just write a row into the action log. That’s my row; superseded-by is null. Then a new change comes, and I insert that new item into the action log for the same person, the person with customer ID 100, and then I make the previous row’s superseded-by point at the newly inserted change. And as I keep changing this person, that is, as I keep adding items to my action log, I keep updating superseded-by so that it always points to the immediately following action log item. And then it turns out that the current state can be obtained by finding all the rows in the action log where superseded-by is null. And with an appropriate index, that’s something we can do relatively cheaply.
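Sketched in SQL, with assumed names, the mechanism could look like this: the column and its index (here provided by the self-referencing foreign key), the pointer update on every write, and the snapshot query over the rows that nothing supersedes yet.

```sql
-- The self-referencing column; the foreign key also gives us an index on it.
ALTER TABLE action_log
  ADD COLUMN superseded_by BIGINT UNSIGNED NULL,
  ADD CONSTRAINT fk_superseded_by FOREIGN KEY (superseded_by) REFERENCES action_log (id);

-- Appending a change for person 7, inside the same transaction as the
-- normalized write: insert the new entry, then point the previous head at it.
INSERT INTO action_log (action, entity, person_id, changed_by, aggregate_state)
VALUES ('update', 'person', 7, 'agent:jdoe', '{"personId": "P-100"}');

UPDATE action_log
   SET superseded_by = LAST_INSERT_ID()
 WHERE person_id = 7
   AND superseded_by IS NULL
   AND id <> LAST_INSERT_ID();

-- The snapshot: the current state of every aggregate is simply every row
-- that has not been superseded by anything.
SELECT id, person_id, aggregate_state
  FROM action_log
 WHERE superseded_by IS NULL;
```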
And this allows us to implement all the use cases: all the traditional use cases using the normalized tables, and all the use cases which rely on the historical data. We’re super happy. That’s the description of how it all looks.
Now let’s apply some healthy criticism and see what the good, the bad, and the remaining parts of this solution are. Well, the good thing is that we understand what is going on again. There is no implicit magic. There are no callbacks or other implicit, auto-magical ways of updating rows in the database. Everything is explicit; you can just read the code and understand what is going on. There’s nothing magical happening behind the scenes. Our old tables are still there, so all the code which relied on the old tables remains unmodified. That’s super useful. We didn’t need to rewrite as much as we feared we would have had to. We still have a single source of truth. By not introducing any additional database or splitting data between two places, we keep everything in sync in a single database. That’s immensely useful; no caching is necessary, for instance. And no technology had to be replaced. We didn’t need to learn a new database. We didn’t need to learn how to operate Kafka and ZooKeeper, or what have you. It’s still the same familiar technological universe. On the flip side, there are some problems.
When we write something into the database, we write both to our normalized tables and to the action log, which incurs a small but still noticeable cost. But that’s nothing compared to the inconsistencies which might be introduced by concurrent writes. In a very unlikely scenario, if you have two agents on the phone with the same customer, both performing a modification to that customer, they might try to insert an action log item at the same time, leading to an inconsistent superseded-by value: a race condition. Luckily, that is something you can avoid with an adequate transaction isolation level. And we were able to address this problem after noticing it and spending quite some time figuring out what was going on.
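The talk leaves the exact fix open, so the following is only one possible sketch of how such a race could be serialized in MariaDB: lock the current head of the aggregate’s chain before appending, so that a second concurrent writer has to wait.

```sql
START TRANSACTION;

-- Lock the row that is currently the head of this aggregate's chain.
-- A second writer for the same person blocks here until we commit.
SELECT id
  FROM action_log
 WHERE person_id = 7
   AND superseded_by IS NULL
 FOR UPDATE;

-- ... insert the new action_log row and repoint superseded_by as shown above ...

COMMIT;
```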
Generating test data is suddenly more complex, because we cannot generate just the current state. We need to generate all the historical data to be able to reliably test the entire history functionality. So we need to generate, I’d say, a sequence of commands, of updates to our business entities, and then turn them into insert statements for the database. And finally, there is a certain coupling between feeds and history, between those two functionalities: the history for our agents and the feeds for our neighboring system. Their use cases aren’t identical, and we’re solving those two problems from the same source of data, which might lead to some cost if those two use cases diverge. We’ll see about that. So far, everything seems to be working, and working well together.
And the last thing which I wanted to share with you is the remaining bit: the database.
So, MariaDB is an open-source database. That’s how the website looks, or at least how it looked a couple of months ago. And MariaDB has a number of shortcomings. My critique also applies to MySQL. But the biggest problem we had was the fact that MariaDB does not have materialized views.
Views in the database allow you to save a SELECT query in the database and then come back to it and say, “Hey, can I select from this view?”, which is as if you were selecting from that saved query. It’s nice. It saves you some time. It simplifies your queries. But still, it runs the query every single time. A materialized view, on the other hand, persists on disk the data which the query fetches, and you can even add an index to make it faster. And the best part is, it can be kept in sync with the other tables in the database.
Given a different database which has supported materialized views for several years now, such as Postgres, we wouldn’t need superseded-by at all, because we would have a materialized view representing the current state of each entity, the current state of the database.
And all this work with keeping superseded-by consistent and in sync, and all the locking and transaction isolation, those problems would have been gone because Postgres fulfills this need out of the box. Excellent.
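For illustration, here is a sketch of how that could look in Postgres, with assumed names; note that a Postgres materialized view is refreshed explicitly rather than being updated automatically on every write.

```sql
-- The newest action log entry per aggregate, persisted and indexable.
CREATE MATERIALIZED VIEW current_aggregate_state AS
  SELECT DISTINCT ON (person_id)
         id, person_id, created_at, aggregate_state
    FROM action_log
   ORDER BY person_id, id DESC;

-- A unique index, which also allows refreshing without locking out readers.
CREATE UNIQUE INDEX ON current_aggregate_state (person_id);

-- Run after writes, e.g. triggered by the application or on a schedule.
REFRESH MATERIALIZED VIEW CONCURRENTLY current_aggregate_state;
```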
This is all I wanted to show you tonight. This was an interesting problem to solve, and the solution we came up with still appears to be working. I’d be really curious: what has your experience been when it comes to modeling historical changes and historical data like this? I’d love to talk about it during the virtual socializing in the breakout rooms. And until then, I’ll be very happy to take your questions, if we have any time remaining for those.
Thank you all very much. Tobi, back to you.