All that glitters
Document Stores and Sharding

I was inspired to jot this down by this tweet from Chris Dixon:

“I’m still trying to wrap my head conceptually around how getting rid of sql structure helps with sharding / replication.”

https://twitter.com/cdixon/status/160937251528916992

The answer is data locality. Relational stores, by definition, formalize not just the schema of individual tables but of their relations to each other. The reason they do this, of course, is so that you can run joins, queries that pull related data from multiple tables/collections at once.

And that’s exactly what document stores don’t do. So why is that better for sharding? Because in order to provide efficient access to data in a relational world, you’ll want related data to exist in one location. Once the tables are sharded, there’s no guarantee that the rows a join needs live on the same node, so you’re stuck having to access an unpredictable number of nodes in order to complete a single query. That’s going to hurt a lot.

Document stores ensure that you are only ever retrieving a single document at a time, which means that no matter how your data is partitioned across a network, you’ll only be talking to a single node for a single query. This means that any reasonable partitioning scheme that allows you to efficiently map a primary key to a node will give you constant performance in a sharded situation.
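To make the key-to-node mapping concrete, here is a minimal sketch in Python. It assumes a fixed list of nodes and simple hash-modulo partitioning; the node names, `node_for`, and `ShardedStore` are all made up for illustration, and the "nodes" are just in-memory dictionaries rather than real servers.

```python
import hashlib

# Hypothetical node names; a real cluster would have network addresses here.
NODES = ["node-a", "node-b", "node-c"]


def node_for(key, nodes=NODES):
    """Map a primary key to exactly one node using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


class ShardedStore:
    """Toy document store: each node is a dict, each query hits one node."""

    def __init__(self, nodes=NODES):
        self.nodes = list(nodes)
        self.shards = {name: {} for name in self.nodes}
        self.node_hits = 0  # how many node lookups have been made in total

    def put(self, key, document):
        self.shards[node_for(key, self.nodes)][key] = document

    def get(self, key):
        # One key, one node: the cost does not depend on the cluster size.
        self.node_hits += 1
        return self.shards[node_for(key, self.nodes)].get(key)


store = ShardedStore()
store.put("user:42", {"name": "Ada", "langs": ["python", "go"]})
document = store.get("user:42")
```

Hash-modulo is only one possible scheme, and it reshuffles most keys whenever the node count changes. The point of the sketch is just that a lookup by primary key needs to talk to one node, whatever the scheme.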

And because there are no explicit (and thus relied-upon) relationships between documents, replication is as simple as copying individual objects. There are no relationships to maintain or foreign keys to check in the process.
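As a rough illustration of "copying individual objects", the sketch below treats a primary and a replica as plain Python dictionaries. The `replicate` function is hypothetical and deliberately ignores everything a real system has to handle (ordering, failures, conflicts); it only shows that a document can be copied without consulting any other document.

```python
import copy


def replicate(primary, replica, key):
    """Copy one document to the replica; no other documents are consulted."""
    replica[key] = copy.deepcopy(primary[key])


primary = {"post:1": {"title": "Document Stores and Sharding", "tags": ["nosql"]}}
replica = {}

replicate(primary, replica, "post:1")

# A later change on the primary does not affect the copy until it is replicated again.
primary["post:1"]["tags"].append("sharding")
```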

The implications for schema design should be obvious (but took a long time to fully set in for me): if there is data you need to retrieve together, that data should reside fully in one (or more than one) document. These kinds of schemas are a bit tricky to design well (and require tackling interesting consistency problems), but the payoff in consistent performance and simple sharding is worth it.
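Here is a small sketch of that difference, using a blog post and its comments as a made-up example. The record layouts and the `load_normalized` / `load_embedded` helpers are assumptions for illustration only; the lookup counts stand in for the number of records (and potentially nodes) a single read has to touch.

```python
# Relational-style layout: the post and its comments are separate records.
posts = {"post:1": {"title": "Document Stores and Sharding", "comment_ids": ["c:1", "c:2"]}}
comments = {
    "c:1": {"author": "ann", "text": "Nice."},
    "c:2": {"author": "bob", "text": "Agreed."},
}


def load_normalized(post_id):
    """Assemble a post from separate records; each lookup could be another node."""
    lookups = 1
    post = posts[post_id]
    loaded = []
    for comment_id in post["comment_ids"]:
        loaded.append(comments[comment_id])
        lookups += 1
    return {"title": post["title"], "comments": loaded}, lookups


# Document-style layout: everything retrieved together lives in one document.
documents = {
    "post:1": {
        "title": "Document Stores and Sharding",
        "comments": [
            {"author": "ann", "text": "Nice."},
            {"author": "bob", "text": "Agreed."},
        ],
    }
}


def load_embedded(post_id):
    """Fetch the whole post with a single lookup by primary key."""
    return documents[post_id], 1


normalized_post, normalized_lookups = load_normalized("post:1")
embedded_post, embedded_lookups = load_embedded("post:1")
```

Both loaders return the same data, but the normalized one needs a lookup per comment while the embedded one always needs exactly one. The cost of embedding shows up elsewhere: if a comment's author is also stored in other documents, keeping those copies in sync is the consistency problem mentioned above.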
