All that glitters
Document Stores and Sharding

I was inspired to jot this down by this tweet from Chris Dixon:

“I’m still trying to wrap my head conceptually around how getting rid of sql structure helps with sharding / replication.”

https://twitter.com/cdixon/status/160937251528916992

The answer is data locality. Relational stores, by definition, formalize not just the schema of individual tables but of their relations to each other. The reason they do this, of course, is so that you can run joins, queries that pull related data from multiple tables/collections at once.

And that’s exactly what document stores don’t do. So why is that better for sharding? Because in order to provide efficient access to data in a relational world, you’ll want related data to exist in one location. You’re stuck having to access an unpredictable number of nodes in order to complete a single query. That’s going to hurt a lot.

Document stores ensure that you can only be retrieving a single document at any one time, which means that no matter how your data is portioned across a network, you’ll only be talking to a single node for a single query. This means that any reasonable partitioning scheme that allows you to efficiently map a primary key to a node will give you constant performance in a shared situation.

And because there are no explicit (and thus relied-upon) relationships between documents, replication is as simple as copying individual objects. There are no relationships to maintain and foreign keys to check in the process.

The implications for schema design should be obvious (but took me a long time to fully set in): if there is data you need to retrieve together, that data should reside full in one (or more than one) document. These kinds of schemas are a bit tricky to design well (and require tackling interesting consistency problems), but the payoff in consistent performance a simple sharding is worth it.

Scaling Source Code

Code structure must scale as an organization grows. I’ve said before that Scaling is Performance decoupled from Load, and I think the analogous statement might be that scalable code decouples maintainability from team size.

At an early stage, we’re all just strapping things together with duct tape and working to get our ideas validated by our markets. The pressure on code at this stage is to be minimal and functional, and directions change, as does structure, with great frequency. You’re usually optimizing for a team of 1, maybe 2 at this point, who can know the length and breadth of the code base by rote.

Inevitably there will come a point where you’ve hired 2 more programmers, your customers are clammoring for new features, and you’re still saddled with a code base designed for 1. The next big release goes far beyond its deadline. Programmers complain about APIs. New bugs are cropping up left and right.

What makes code maintainable by a larger group? Standards and well-design interfaces. There are heavy-handed processes you can implement to make sure every line of code is linted, reviewed, etc, but in a startup (even a growing one) your focus should still be individually empowerment and speed to market.

But make sure at least the following are in place:

  • QA. You’re going to break a lot, do find a detail obsessed person who can tell it like it is.

  • Naming conventions. The first thing you need to do is get everyone speaking the same language. Semantics is the core of software, so encourage the arguments and negotiation to happen at this level.

  • Interface design. And I mean interfaces like as in API, not UI. Your senior players are going to have to get good at thinking through interfaces. This requires treating code as a product, and applying UX principles, and thinking about consumption of your APIs by other team members.

  • Explicit architectural goals. Its probably too soon to have an Architect dedicated role, but your team needs to be able to make sound decisions as they work, in accordance with a set of common principles and roles.

  • Code review. There’s a whole other post in my head on Scaling Process, but for this one suffice it to say that all of the above will require you to keep yourselves honest, and communicate your progress and standards, and code review is the best way to get it going.

Good luck. This is the fun part.

Should You Learn to Code?

Maybe. More importantly, what do you need to do to succeed at it if you try?

Programming isn’t something you can take a couple of courses in and become useful. It’s pretty much the opposite of that (bias alert: I never studied CS, and taught myself to code, so there’s that). Programming is a trade. It’s like making shoes, or singing, or playing a sport. You can’t take a class on Soccer and expect to be able to play. You’ve got to get out on the field, and put in the hours of practice.

And I think that there’s a kind of person who is drawn to it, and flourishes in the trade. So here’s my list of what makes that kind of person (and if you’re diving in, what you should aim for to succeed):

Insane Focus

Ever watch a programmer at work? She’ll be the one with headphones on not noticing if the ceiling caves in. That’s not just a coincidence, not just nerdiness, it’s a requirement of the job. Programming requires you to contemplate all aspects of a problem at once, and create a little universe of clarity around it. This is shockingly hard, and it’s the reason that (productive) programmers will defend their focus time so strongly, and develop mechanisms (darkness, headphones, isolation, beards) that preserve that focus.

You’ll know you’re on the right track if your friends and loved ones start getting annoyed at how antisocial you’ve become.

A Love of Failure

Your first “hello world” program is going to feel great. Wow, I just made a computer do something! Somewhere shortly after that you’re going to be bludgeoned with failures. Odds are it’ll take multiple attempts to get something that even works in a rudimentary way. A lot of people walk away at this point. It’s demoralizing. But you’ve got to be able to fail over and over, until you find a solution that works. And then you’ve got to be willing to destroy that in favor of finding an even better solution.

Each time you fail, you’ll have learned something, and your next failure will be a more interesting one.

Reckless Abandon

Go forth and write bad code (and yes, it’s going to be terrible). A good programmer would probably laugh (rightfully) at your your whole approach and even your style. But did you make it work? Then good. Now do it again. The only way you’re going to learn how to do things well is to do them badly many many times, and build up the intuitive feel for how to construct programs and solve problems. This doesn’t mean you shouldn’t learn “best practices” and aim for “beautiful” code (both do matter), but don’t let worrying about that slow you down when you’re learning.

All that matters, at the end of the day, is whether your program does what you want.

Imagination

Programming is an act of pure creation- inside this little digital world, the programmer is God, able to conjure up substance from nothing at all and shape it to her liking. It’s really fun, but also really daunting, to be able to do this. There is no limit on what you can accomplish, but it’s going to be up to you to do it.

There’s no sense worrying about the limits, they’re going to become apparent whether you want them to or not- so aim big.

Software Architecture

It means taking a stand.

It’s easy to think of architecture as a series of choices: which database, which operating system, which programming language, which framework. All of those decisions should flow naturally from having a clear idea of what you’re building and why.

What are you going to hang your hat on? Responsiveness? Reliability? Durability? Throughput?

A good system will try to accomplish all of the above and more, but it had better know what it’s trying to be the best at. Once you’re clear on what you’re trying to accomplish, your decisions are justifiable and their success measurable.

chrisricca:

If you have an iPhone 4S you know that while Siri isn’t perfect yet, the spoken interface can be amazingly useful, and it’s definitely the future. While Apple works out the kinks on Siri from the software side, I’ve been searching for the best accessory setup for talking with her on the go or when…

The Hard Part, the Fun Part, and The Details

I’m not a big fan of estimation.  It’s hard to get right, and in software, today’s “easy” is tomorrow’s “more complicated than I thought.”  But we still have to estimate from time to time, and I’ve started to notice a pattern to one form of estimation error:

When a software engineer is looking at a project, she immediately grasps The Hard Part, and The Fun Part.

The Hard Part.  When a programmer first starts to grapple with the problem, she looks for the complicated bit that’s full of gotchas.  This is where she starts muttering for a whiteboard, and after an hour (or a few hours) of thought, there’s this AHA moment when she’s figured out a strategy through the crux of the problem.  Once you’ve done this a few time, you start to recognize ahead of time The Hard Part, and you’ll have a quickly formulated strategy for solving it.

The Fun Part.  There’s some cool new piece of technology to incorporate, a chance to use that algorithms class you took in school (I swear my degree was worth it all now!), or an opportunity to finally refactor that ugly code that the previous developer wrote.  This part is easy to see and reason about, and so it factors into your estimation early.

So all thats left is The Details, but give me a couple of days and I’ll bang out 90% of it… we’ll be done in no time!

And that’s exactly where it all goes wrong.  When you’re building something real, something that integrates with a database, an API, an existing flow, a user’s expectations, The Details aren’t the last 10% of a project.  They’re more like the last 75%.

So what I encourage anyone to do when trying to estimate how long a project will take, or how much is left to do, is sit down and try to catalog the details.  Tests, usability, copy, design, integration, backwards compatibility.  You’ll find that the list of things to do beyond The Hard Thing and The Fun Thing might dwarf both of them.

The Details are where things actually get done.  Ignoring them ensures that you’re going to be unhappily surprised when things take longer than you hope, or you’re going to shove something unfinished out the door, and both results suck.

The Rules of Eventual Consistency

At GameChanger we take in tons and tons of data, and have to serve it up really fast in different kinds of formats.  Our core tools are Python, MongoDB, and Redis.  Our big focus for the Spring baseball/softball season right now is “scaling”.  I’ve been talking a bunch about scaling recently (at CTO School and the NYC Python Meetup, specifically), and I’ve started using the following definition of “Scaling”:

Decoupling Performance from Load

Note that it doesn’t have anything to do with being fast.  It really has to do with being consistent: 1,000,000 users on your site or your API not causing it to slow down appreciably as a result of that load.  This is hard.  While coming up with ways we can address parts of this problem here, we ended up inevitably talking about queuing and asynchronous processing, and suddenly realized we were walking into the Eventually Consistent world.  I hadn’t quite groked the buzz word until it was sitting in front of me.

So, initially I was terrified of inconsistency, and ended up concocting my “3 rules of eventual consistency” that underpin how we’re building an architecture that can do this.  So while these are somewhat specific to our domain, I thought they might be somewhat more broadly useful:

  1. The Canonical Location Rule
  2. The Write Contract
  3. The Propagation Completeness Promise

The Canonical Location Rule states that for any given piece of data, it has a single canonical (authoritative / original) location in our DB that it exists.  We use MongoDB, and much of what we’re doing for read-side scaling is building documents that share data from multiple sources (if you’re in the RDBMS world, think of materialized views).  This means that for instance, a team’s name might also exist in a schedule document or a league document elsewhere in our DB.  But the team document itself is Canonical for the team name.  That’s the root place it lives, and this ensures that if all else goes to hell, we can go back to the team document and get the right value.

The Write Contract states that when we get a “write” (an API call that adds or updates data), we ensure that two things happen in order for that call to return a success response:  (a) that the Canonical Location is updated/created to reflect the new data, and (b) that we ensure that Propagation has been queued for that canonical data to be replicated / used in calculations / whatever in order to get the rest of the system consistent.

Lastly, the Propagation Completeness Promise is that for any queued Propagation, that Propagation is not fully removed from it’s queue until ALL propagation is finished.

If these rules seem simple, they are.  But they help me sleep at night as we move more and more processing “out of line”, and distribute our data across more and more representations.  It’ll all get consistent.  Eventually.

The Lion Window Manager is Broken
  • Command-Tab Foregrounding: when I switch to an app, I expect it to be foregrounded.  For non-full screen apps, I am more often moved to the space it’s in, but left looking at the most recently foregrounded app in that space.
  • Command-Tab Recent Apps: I have 2 full-screen Safari windows open.  I am looking at one of these, and Command-Tab over to Terminal.  I now Command-Tab back.  I get the other Safari window presented, but if I look hard, it has the backgrounded look (grayscale buttons, low contrast, etc.).  If I hit Command-L, to jump to the URL bar, in this state, Lion will now switch me to the other Safari window.  It knows it’s wrong!  Hitting Command-T to open a new tab also seems to have the effect of opening a tab in the non-visible Safari window in this state, though it doesn’t cause a switch to the other window.
  • Intra-App Window Switching: more generally, if there are multiple full-screen windows for a given app, only the “foreground” window of that app (which I seem to have no control over, as noted above) is available through Command-Tab switching.  Combine that with the fact that Command-Backtick no longer works between full-screen windows in an app, and I now must use gestures to access certain windows.  Productivity killer for a programmer. 
  • Desktop Inaccessible: None of the Expose gestures that get me to the desktop work while I’m in a full-screen app.  Picky, maybe, but I had a big use case here, which was that I’d grab an image out of a web page in Safari, flick a gesture to expose the desktop, and deposit it there.  The default four-finger gesture to expose the desktop also doesn’t work when you’re dragging anything (it seems to interpret your other fingers’ movement as a 3-fingered swipe, ignoring the finger that’s “holding” the item you’re dragging).  I think the desktop metaphor is awesome, but if I can’t easily drag content from an app onto my desktop, I’ve effectively lost it.

It feels like Lion is a half-way point to a full-screen OS a la iOS, but it is horribly painful to use in the mean time.  There’s a loss in reliability and transparency, and much of what I’m seeing has to qualify as pure bugginess.

I could write a list of the things I think are amazing, well designed, and pretty (the gestures are so great, and OMG the animation for sliding back in Safari history is a work of art), but right now all I want to do is vent. :)

Lion feels like OSY 1.0.  A kind of a new thing, not entirely thought-through, and in desperate need of some polish.

Standing Desk experiment  (Taken with Instagram at Grandview II)

Standing Desk experiment (Taken with Instagram at Grandview II)

MongoDB Gotchas (a couple)

A friend asked me last night for my favorite “Mongo Gotcha.”  He of course had one in mind, but I came up with a few, so I thought I’d jot a down the ones that I came up with on the spot (if you have more good ones, feel free to add in comments, and I’ll update)

Replication Gotcha #1: (this was his) If you lose a majority of your replica nodes, those left will be unable to assume Primary roles (because they can’t come up with a voting majority), and your cluster will be down.  This is a good thing (prevents nasty segmentation problems), but if you’re in that situation, the only way to recover is to restart a node in a non-replica-set mode, alter its config manually to grant it more votes, and restart.  You won’t be able to do anything that writes to config without a Primary, so you can’t add arbiters or anything else in this mode.

Replication Gotcha #2: When adding new replicas, pay very close attention to the size of your dataset and the size of your oplog.  You can run `db.printReplicationStatus()` on the primary node to see how many hours of oplog the node has.  When you bring a new node into a replica set cold, it will first load all data from Primary (or a Secondary, if you tell it to- and I highly recommend this), and then start “catching up” by getting oplog data shipped to it.  I’ve gone to the trouble of dropping a new node into a cluster, only to have it overrun available oplog by 15 minutes (if the initial sync takes so long that the capped oplog rolls over in the interim, the node is permanently unable to catch up).  Best strategy is to take a snapshot/backup of an existing replica, and start new nodes using —fastsync (they skip the initial download, and can get into oplog syncing quickly).  Also, try to make sure you have a ton of extra oplog (—oplogSize on the command line, I think).

Sharding Gotcha: Understand the heck out of your shard keys, and think about them very carefully.  Mongo distributes data according to key ranges within the total keyspace of all shard keys.  If your keys are too random, you’ll get nicely distributed writes but put excessive load on the balancer when it has to act, because there is no simple way to segment the keyspace in order to re-jigger chunk distribution.  If they’re too sequential, you’ll end up writing all new data to only 1 shard, because all new keys will end up in one keyrange, and then forcing the balancer to do all the work of distributing data across the cluster.  I’ve seen the absolute worst of both worlds, where 90% of my existing data used a random value for the shard key, and then I changed it to a semi-sequential key (which was actually a great key construction, by itself).  In the overall keyspace including all of my random keys, all of my new “good” shard keys ended up writing to a single node- and thus forcing the balancer to do the work.  But of course, all of the rest of my data was randomly distributed, so the balancer became (a) overworked and (b) horribly inefficient.  Ouch.

I should be clear that MongoDB is an awesome piece of technology- and 10gen is already working to make some of these gotchas less gotcha-ish.  But knowing the details of your technology stack is crucial to managing anything at scale (and at GameChanger, that’s the priority of the hour).

Fun.