At GameChanger we take in tons and tons of data, and have to serve it up really fast in different kinds of formats. Our core tools are Python, MongoDB, and Redis. Our big focus for the Spring baseball/softball season right now is “scaling”. I’ve been talking a bunch about scaling recently (at CTO School and the NYC Python Meetup, specifically), and I’ve started using the following definition of “Scaling”:
Decoupling Performance from Load
Note that it doesn’t have anything to do with being fast. It really has to do with being consistent: 1,000,000 users on your site or your API shouldn’t slow it down appreciably. This is hard. While coming up with ways to address parts of this problem here, we inevitably ended up talking about queuing and asynchronous processing, and suddenly realized we were walking into the Eventually Consistent world. I hadn’t quite grokked the buzzword until it was sitting in front of me.
Initially I was terrified of inconsistency, so I ended up concocting my “3 rules of eventual consistency” that underpin how we’re building an architecture that can do this. While these are somewhat specific to our domain, I thought they might be more broadly useful:
- The Canonical Location Rule
- The Write Contract
- The Propagation Completeness Promise
The Canonical Location Rule states that any given piece of data has a single canonical (authoritative / original) location in our DB where it lives. We use MongoDB, and much of what we’re doing for read-side scaling is building documents that combine data from multiple sources (if you’re in the RDBMS world, think of materialized views). This means that, for instance, a team’s name might also exist in a schedule document or a league document elsewhere in our DB. But the team document itself is Canonical for the team name. That’s the root place it lives, and this ensures that if all else goes to hell, we can go back to the team document and get the right value.
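To make the rule a bit more concrete, here is a minimal sketch of repairing a stale copy from its canonical source. It uses plain Python dicts as stand-ins for MongoDB collections; the document shapes, the field names, and the `repair_schedule` helper are all assumptions for illustration, not our actual schema.

```python
# In-memory stand-ins for MongoDB collections (hypothetical shapes).
# The team document is canonical for the team name.
teams = {"t1": {"name": "Tigers"}}

# The schedule document carries a derived copy of the name, which has gone stale.
schedules = {"s1": {"team_id": "t1", "team_name": "Tigres", "games": []}}


def repair_schedule(schedule_id):
    """Overwrite the derived team_name with the value from the canonical team document."""
    schedule = schedules[schedule_id]
    schedule["team_name"] = teams[schedule["team_id"]]["name"]
    return schedule


repair_schedule("s1")
```

The point is only that the copy never has to be trusted: whenever it looks wrong, it can be rebuilt from the one place that is authoritative.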
The Write Contract states that when we get a “write” (an API call that adds or updates data), two things must happen before that call returns a success response: (a) the Canonical Location is updated/created to reflect the new data, and (b) Propagation has been queued so that the canonical data gets replicated / used in calculations / whatever else is needed to make the rest of the system consistent.
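In code, the contract might look roughly like the sketch below. A dict stands in for the canonical MongoDB collection and a `deque` stands in for the real queue; `write_team_name`, the event shape, and the response dict are hypothetical names made up for this example.

```python
from collections import deque

# Stand-ins for the canonical store and the propagation queue (assumptions).
canonical_teams = {}
propagation_queue = deque()


def write_team_name(team_id, name):
    """Return success only after both halves of the Write Contract are done."""
    # (a) Update or create the canonical location.
    canonical_teams.setdefault(team_id, {})["name"] = name
    # (b) Queue propagation so the derived documents get brought up to date later.
    propagation_queue.append({"type": "team_name_changed", "team_id": team_id})
    # Only now is it safe to report success. If either step raised,
    # we would never reach this line and the caller would see a failure.
    return {"status": "ok"}


response = write_team_name("t1", "Tigers")
```

Note what the sketch leaves out: it does no propagation itself, and it does not say what to do if step (b) fails after step (a) has already landed. With two separate stores that case needs its own handling.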
Lastly, the Propagation Completeness Promise is that for any queued Propagation, that Propagation is not fully removed from its queue until ALL propagation is finished.
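One way to keep that promise is to peek at the head of the queue, run every propagation step, and only dequeue once all of them have succeeded. The sketch below assumes an in-memory `deque` in place of a real queue, and the handler functions and the simulated failure are invented for the example.

```python
from collections import deque


def process_next(queue, handlers):
    """Run every handler for the head item; dequeue only if all of them succeed."""
    if not queue:
        return False
    item = queue[0]        # peek, do not remove yet
    for handler in handlers:
        handler(item)      # an exception here leaves the item on the queue
    queue.popleft()        # all propagation finished, safe to remove
    return True


# Hypothetical derived stores and handlers.
schedule_names = {}
league_names = {}
attempts = {"count": 0}


def update_schedule(item):
    schedule_names[item["team_id"]] = item["name"]


def update_league(item):
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated transient failure")
    league_names[item["team_id"]] = item["name"]


queue = deque([{"team_id": "t1", "name": "Tigers"}])
handlers = [update_schedule, update_league]

try:
    process_next(queue, handlers)
except RuntimeError:
    pass  # the item is still at the head of the queue

still_queued = len(queue)
process_next(queue, handlers)  # the retry runs every handler again
```

Because a retry re-runs handlers that already succeeded, this approach assumes each handler is idempotent, which is easy when the handler just copies the canonical value.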
If these rules seem simple, they are. But they help me sleep at night as we move more and more processing “out of line”, and distribute our data across more and more representations. It’ll all get consistent. Eventually.