Evolving data storage and querying at Customer.io 

As our business grows, our data management strategy needs to keep pace. Learn how we explored, iterated, and evolved our way to a scalable, high-performing solution.

Ian Dawes
Software Engineer

SaaS products generate a lot of data. And as the business grows, so does that store of data. The more customers you have and the more they use your product, the more your systems are inundated. Managing it all requires a scalable solution for storing and querying an ever-expanding collection.

What works at the startup stage often falters as data volume increases: query performance declines, maintenance becomes harder, and costs rise. Without a crystal ball to predict future needs, you must be ready to evolve as growth strains your data management capacity.

That’s the challenge we faced at Customer.io last year—and we solved it by iterating on different approaches until we landed on a scalable, cost-effective solution.

How it started, how it’s going: A brief overview

Each workspace in our platform generates data for every message delivery—details like the delivery ID, customer, campaign, newsletter, and multiple associated metrics. With billions of weekly deliveries across our platform, this quickly adds up to an enormous amount of data that must be stored, indexed, and queried efficiently. No matter how much data we manage, we need a system that can perform well (and be budget-smart).

The challenge: As our data store grew, our existing system for indexing and querying it couldn’t keep up. The indexes became so large that query performance tanked, we couldn’t do proper maintenance, and our transactions were constantly deadlocking.

The (eventual) solution: We replaced our original bitmap-indexed MySQL database with a multi-tiered storage solution based on timeboxed data, leveraging both MySQL and Google Cloud Storage (GCS), with uniform bitmap indexing.

The outcome: Our solution delivered significant cost savings and an immense improvement in the manageability of our data—and it enhanced the customer experience too. A win all around!

The process: Of course, we didn’t just jump straight to the ultimate solution. This was a journey of exploring, iterating, doubling back, and building on the realizations we encountered along the way. Our evolution breaks down into three stages—read on for the story of how we progressed through them.

Stage zero: Where we began

If we think of this as an evolution, we should start with the primordial ooze: our initial data storage and indexing setup. We started with one primary MySQL database using bitmap-based indexes. Our mapping strategy was to store bitmaps in chunks of 30,000 bits. The approach actually gave us some nice features:

  • We could combine bitmaps in a variety of elegant ways
  • We used roaring bitmaps, which provided nice compression and optimizations
  • We didn’t have to load an enormous bitmap when we were just modifying a couple bits
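To make that concrete, here’s a minimal sketch of the chunking scheme in Go using the github.com/RoaringBitmap/roaring library. The in-memory map stands in for the MySQL table that held one serialized bitmap per chunk; it’s an illustration of the idea, not our production code.

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

// chunkSize matches the 30,000-bit chunks described above.
const chunkSize = 30000

// chunkStore stands in for the MySQL table that held one serialized
// roaring bitmap per chunk (the real storage layer isn't shown).
type chunkStore map[uint64]*roaring.Bitmap

// set flips a single bit by touching only the chunk that contains it,
// so small updates never have to load an enormous bitmap.
func (s chunkStore) set(id uint64) {
	chunk, bit := id/chunkSize, uint32(id%chunkSize)
	if s[chunk] == nil {
		s[chunk] = roaring.New()
	}
	s[chunk].Add(bit)
}

// contains checks one bit, again touching only a single chunk.
func (s chunkStore) contains(id uint64) bool {
	bm := s[id/chunkSize]
	return bm != nil && bm.Contains(uint32(id%chunkSize))
}

func main() {
	deliveries := chunkStore{}
	deliveries.set(42)
	deliveries.set(123_456) // lands in a different 30,000-bit chunk
	fmt.Println(deliveries.contains(42), deliveries.contains(99)) // true false
}
```

Because each chunk is its own row, flipping a couple of bits reads and writes at most a couple of rows, and roaring’s compression keeps each row small.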

And for a while, it was great! MySQL was doing a good job of holding the data and letting us access it. We could update all these bitmap indexes as new data came in, updating profiles and deliveries in lockstep, and queries performed well. So we just kept throwing stuff in there—because early on, we didn’t know any better.

When good bitmaps go bad

What we couldn’t see at the outset was that this setup wouldn’t scale as more data came in. We were chunking data into two categories, profile IDs and delivery IDs, but each incoming batch included data for both profiles and deliveries.

In one transaction, we’d try to update nicely separated profile bitmap chunks while simultaneously updating a wide variety of overlapping delivery chunks. Because those chunks didn’t align, we had constant read/write conflicts—and, in many cases, wound up deadlocking our transactions.

Evolution stage one: MySQL indexing for everything

Our first attempt at a solution was to say goodbye to delivery bitmaps. If we did away with chunking data into delivery versus profile bitmaps, we’d eliminate the problems with updating deliveries. We were storing everything in MySQL anyway, so we figured we could get it to index everything properly.

We started with all the data and indexes on one table, but soon realized it was unsustainable. Because we had so many different queries we wanted to do, we wound up with hundreds of indexes, and the whole thing just performed terribly.

The two-table approach: Our next strategy was to split the table into two pieces—one for deliveries and one for metrics—and combine queries across them. Whenever a query involved a delivery metric, we retrieved the matching delivery IDs from the delivery_metrics table, then used those IDs to do point queries in the deliveries table; otherwise, we resolved the query directly against the deliveries table. This worked fairly well, generally speaking. Even though we were generating new deliveries constantly, at very high volume, most query performance was acceptable. The deliveries tables were very write-heavy and read-light, and we didn’t need to query old deliveries very often. We did have queries with heavy read loads, but they tended to be point loads, like when a customer was pulling metrics for a specific person or campaign.
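In code, that two-step pattern looked roughly like the following Go sketch. The delivery_metrics and deliveries tables come from the description above, but the column names and types are illustrative guesses, not our real schema.

```go
package deliveries

import (
	"context"
	"database/sql"
)

// Delivery holds the subset of columns we need; illustrative only.
type Delivery struct {
	ID         int64
	CampaignID int64
	Subject    string
}

// byMetric resolves a metric-filtered query in two steps, mirroring
// the two-table approach: gather delivery IDs from delivery_metrics,
// then point-query the deliveries table for each one.
func byMetric(ctx context.Context, db *sql.DB, campaignID int64, metric string) ([]Delivery, error) {
	// Step 1: collect matching delivery IDs from delivery_metrics.
	rows, err := db.QueryContext(ctx,
		`SELECT delivery_id FROM delivery_metrics WHERE campaign_id = ? AND metric = ?`,
		campaignID, metric)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}

	// Step 2: point queries against the deliveries table.
	out := make([]Delivery, 0, len(ids))
	for _, id := range ids {
		var d Delivery
		err := db.QueryRowContext(ctx,
			`SELECT id, campaign_id, subject FROM deliveries WHERE id = ?`,
			id).Scan(&d.ID, &d.CampaignID, &d.Subject)
		if err != nil {
			return nil, err
		}
		out = append(out, d)
	}
	return out, nil
}
```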

Endlessly growing tables = endlessly mounting issues

Handling everything with MySQL indexes kind of worked—emphasis on kind of. We found ourselves chasing after a variety of recurring issues that made this setup unsustainable in the long term.

Indexing issues: First, the indexing wasn’t performing beautifully. We had a constant dribble of issues in which a query would perform poorly—the joins weren’t working quite right or a query wasn’t finding the correct index. Or we’d have an edge case that wasn’t getting the right index, and when we updated the code to fix it, it would cause a different edge case to get the wrong index. Fixing that stream of issues got pretty hairy pretty quickly.

Data volume issues: It became clear that the setup couldn’t reliably support the amount of data we had, much less be scalable as our data continued to balloon on a daily basis. At this point, many workspaces had been around for years, so they were already huge—and they were being actively written to. With workspaces sending millions of deliveries every week, we had tables with billions and billions of rows (and mounting!). Trying to manage that much data with just MySQL indexes surfaced a few issues:

  • Customer experience: When a customer wants to display metrics in the platform's UI, that query performance has got to be snappy. But some of our point queries were performing poorly in UI cases. Export cases—when a customer is exporting records out of the platform—were usually pretty smooth, but there were occasions where a transaction’s performance would fall off a cliff. Even if only a small percentage of transactions take 30-60 seconds to complete, it’s not an acceptable customer experience.
  • Maintenance and management: The tables were getting so ridiculously big that we couldn’t manage them properly in MySQL. For instance, when we’d try to do a table migration, it often simply wouldn’t complete because the size and write load were too large to process.
  • Optimization: The management issues really hamstrung us in terms of improving performance or adding any new functionality. Even adding a new column or index became impossible. And that meant we couldn’t really solve the indexing issues either.

We realized that having tables in MySQL that just grow forever is simply a bad idea. Maybe it would have been a workable solution if we were small and planned to stay that way—but business growth demanded another leap up the evolutionary ladder for our data management strategy.

Evolution stage two: Dipping into data warehouses

We started to wonder if the crux of the problem might come down to data warehousing. After all, the promise of data warehouses is really fast queries, and that’s ultimately what we needed from a customer experience perspective. So we started exploring data warehouses with that in mind: What information do we need to propagate for our customers?

We looked closely at how data warehousing capabilities could line up with the unique needs of our platform’s reporting features and functionality. It boiled down to balancing the needs of three competing requirements:

  • Historical deliveries: We need to update metrics information on deliveries months after they were first sent. We also need to be able to retrieve a time-ordered list of deliveries sent years ago, filtered by customer ID, campaign ID, and/or delivery state.
  • Recent deliveries: We also need to be able to retrieve those filtered lists for deliveries that were generated recently and do so quickly. Customers expect information on deliveries they've just sent to show up in the UI within seconds.
  • Specific deliveries: We need to be able to find all of the data on a particular delivery identified by nothing other than its delivery ID—whether it was sent years ago or one second ago.

Where data warehouses let us down

We tested a couple of data warehouse options, but they didn't deliver the performance we were looking for.

  • Point query speed: The data warehouses that could run fast point queries relied on cached data. That works well if you can hold an often-queried data set in memory. But we have terabytes of infrequently accessed data—and need to resolve point queries in under 200 milliseconds to avoid an unacceptable lag in the UI. 
  • Real-time processing: Data warehouses are optimized for batch data loading operations, which invariably causes a lag before data is available for querying. Some warehouses offer more real-time options, but they’re costly when dealing with large volumes of data.
  • Cost: Even if we could get an acceptable level of speed, data warehouses are expensive! The cost of maintaining a large data set is high in a data warehouse, and it becomes prohibitive with the on-demand processing capability required for the queries we need to do.

We concluded that a data warehouse would not solve our problem. While they’re fantastic at performing big queries across enormous data sets, none of them offered a silver bullet for running super-fast queries across a huge data store.

But hang on a minute: After digging into all these data warehouses, we realized that under the hood, they were all just leveraging an object storage engine from one of the major cloud providers. If they could do that, we could do that! And that realization set us off on the path toward a whole new data management configuration.

Evolution stage three: Timeboxed data in multi-tiered storage

Neither our MySQL approach nor a data warehouse could deliver both fast query performance and recent data visibility with our ever-growing volume. Was there a way to get the best of both worlds?

The missing piece: We realized we’d been trying to grapple with many years’ worth of data in a single model. But how far back in time were customers actually digging? Could we find a cutoff point that would allow us to shrink the amount of data indexed in MySQL without making historical data inaccessible? If we analyzed our access patterns, we could optimize for them!

As it turned out, access patterns revealed that the vast majority of queries were concentrated on recent data, with few queries for deliveries older than eight months. After working closely with the product team to validate that our assessment lined up with customer needs, we had the foundation for a new strategy: timeboxing data based on age.

The final setup

These realizations led us to the framework that could handle a massive data store without compromising query performance. We built a multi-tiered storage solution, splitting up data storage based on age:

  • MySQL database for primary data: For deliveries within the last eight months, all data is indexed in MySQL. The data store is now manageable for optimal query performance. We don't have to worry about issues with super-fast point queries because a colossal volume of data doesn't bog the system down.
  • GCS storage for archived data: Deliveries older than eight months become immutable—we stop collecting metrics on them—and all data is moved to GCS. We can easily query the data and manage it effectively there. While this limited historical metrics, we found that it didn't undermine what customers needed.

The essential insight: Developing a natural chunking strategy based on time is the key to ensuring performance when dealing with a large, constantly growing data set.
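At query time, that split boils down to a simple age check. Here’s a hedged sketch of the routing; the cutoff constant, interface, and type names are illustrative assumptions, not our actual implementation.

```go
package tiered

import (
	"context"
	"time"
)

// archiveCutoff approximates the eight-month boundary described above.
const archiveCutoff = 8 * 30 * 24 * time.Hour

// DeliveryReader abstracts a storage tier; both backends are hidden
// behind this illustrative interface.
type DeliveryReader interface {
	Get(ctx context.Context, deliveryID string) ([]byte, error)
}

// Store routes reads by delivery age: recent data lives in MySQL,
// and immutable data older than the cutoff lives in GCS.
type Store struct {
	recent  DeliveryReader // backed by MySQL
	archive DeliveryReader // backed by GCS objects
}

func (s *Store) Get(ctx context.Context, deliveryID string, sentAt time.Time) ([]byte, error) {
	if time.Since(sentAt) < archiveCutoff {
		return s.recent.Get(ctx, deliveryID)
	}
	return s.archive.Get(ctx, deliveryID)
}
```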

Bringing back the bitmaps

Bitmap indexes had a lot of benefits in our initial system, and we certainly couldn’t rely on MySQL indexes for the data archived in GCS. We were looking for the following key characteristics for a new indexing scheme, and bitmaps promised a great solution:

  • Seamless queries across time: Uniform indexing means we can do a query that spans recent and archived data—there is no divergent functionality across different storage media. So, from a customer experience perspective, it's totally seamless.
  • Increased metrics functionality: We now deliver more nuanced and granular metrics. For instance, instead of simply showing a customer all opened deliveries, we highlight deliveries that were opened but not clicked (see the sketch after this list).
  • Scalability: Roaring bitmaps compress extremely well, which is one reason we used them in the first place. So, while theoretically we could run into limitations someday, it would have to be something like hundreds of billions of deliveries in a single month—we should be good for the foreseeable future.
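That “opened but not clicked” view is a single set-difference operation on bitmaps. A minimal sketch using the github.com/RoaringBitmap/roaring library (the sample IDs are invented):

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

// openedNotClicked returns the delivery IDs present in the "opened"
// bitmap but absent from the "clicked" one. The difference runs
// directly on the compressed representation, wherever the bitmaps
// were loaded from (the MySQL tier or the GCS tier).
func openedNotClicked(opened, clicked *roaring.Bitmap) *roaring.Bitmap {
	result := opened.Clone()
	result.AndNot(clicked) // in-place difference: opened minus clicked
	return result
}

func main() {
	opened := roaring.BitmapOf(1, 2, 3, 4) // invented sample delivery IDs
	clicked := roaring.BitmapOf(2, 4)
	fmt.Println(openedNotClicked(opened, clicked).ToArray()) // [1 3]
}
```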

Asynchronous yet accurate batch updates

To leverage bitmap indexes in our new solution, we had to solve the old problem of deadlocking transactions when updating bitmap indexes. Again, the answer boiled down to reframing how we thought about time—in this case, synchronous versus asynchronous timing for batch updates.

This was an interesting realization: We don’t need bitmap indexes for most queries. Point queries—finding the specific delivery record for any given delivery—are the only instances where we truly need to be transactionally correct for delivery indexing. But for UI queries, like pulling up all the deliveries for a particular campaign, eventual consistency is invisible in the customer experience.

So now we batch the updates: we load and store the bitmaps transactionally but asynchronously, which lets us absorb a massive influx of data without deadlocking transactions.
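A simplified sketch of that batching idea follows. The flush callback is assumed to load, update, and store the affected bitmaps in a single transaction; the types and the one-second cadence are illustrative.

```go
package indexer

import (
	"context"
	"time"
)

// BitUpdate marks one bit to set in one bitmap chunk.
type BitUpdate struct {
	Chunk uint64
	Bit   uint32
}

// batchUpdates drains incoming updates and flushes them in groups, so
// each flush touches the affected bitmaps once instead of contending
// bit by bit. Eventual consistency is acceptable here: only point
// queries need to be transactionally exact.
func batchUpdates(ctx context.Context, in <-chan BitUpdate, flush func([]BitUpdate) error) error {
	ticker := time.NewTicker(time.Second) // illustrative flush cadence
	defer ticker.Stop()

	var pending []BitUpdate
	for {
		select {
		case u := <-in:
			pending = append(pending, u)
		case <-ticker.C:
			if len(pending) == 0 {
				continue
			}
			if err := flush(pending); err != nil {
				return err
			}
			pending = pending[:0]
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```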

Performant, scalable, and ready to keep evolving

We started this little adventure with one goal in mind: the ability to manage our database effectively. By intentionally evolving our data management strategy, we achieved that, along with some other pretty great outcomes.

  • Operational efficiency: Timeboxing data taught us that historical data is always weird, but we can now make realistic estimates of how long migrations will take and know how to avoid production outages when doing huge reads.
  • Scalability: When it comes to big data, nothing is performant if you have to do a full table scan. Our timeboxed solution can handle as much data as we can imagine—and more.
  • Cost savings: Storing archival data in GCS is incredibly cheap, so the cost to store data is now tremendously lower. That’s a benefit for customers, too—it keeps product costs lower and enables us to make a tremendous amount of historical data available to them.
  • Better customer experience: Customers have reliably speedy queries for both UI and export cases. Beyond that, we now have the freedom to build out more functionality to deliver more value. 

So here we are, at the end of our evolutionary journey—at least for now. We emerged from the primordial soup, sprouted legs and hair and opposable thumbs, and have a big-brained Homo sapiens form of data management (or maybe it’s a capybara—whatever the epitome of evolutionary excellence may be).

What’s next? Evolution is never a straight line—in nature or technology. We wandered down some flawed (albeit interesting) paths to our current incarnation. We started implementing our solution into production over a year ago, and we’re just about to complete archive work on the final MySQL table. After that, we'll keep exploring optimizations and look for other ways to apply this model. Who knows? Maybe next we'll morph into a yet-to-be-imagined data management lifeform.
