
An Elegant Puzzle: Systems of Engineering Management


by Will Larson


  2.4 Productivity in the age of hypergrowth

  You don’t hear the term hypergrowth12 quite as much as you did a couple years ago. Sure, you might hear it during any given week, but you also might open up Techmeme and not see it, which is a monumental return to a kinder, gentler past. (Or perhaps we’re just unicorning13 now.)

  Fortunately for engineering managers everywhere, the challenges of managing within quickly growing companies still very much exist.

  When I started at Uber, we were almost 1,000 employees and were doubling the headcount every six months. An old-timer summarized their experience as: “We’re growing so quickly that every six months we’re a new company.” A bystander quickly added a corollary: “Which means our process is always six months behind our head count.”

  Helping my team be successful while an always-outdated process collides with a constant influx of new engineers and system load has been one of the most rewarding opportunities I’ve had in my career. This is an attempt to explore those challenges and propose some strategies I’ve seen for mitigating and overcoming them.

  2.4.1 More engineers, more problems

  All real-world systems have some degree of inherent self-healing properties: an overloaded database will slow down enough that someone fixes it, and overwhelmed employees will get slow at finishing work until someone finds a way to help.

  Very few real-world systems have efficient and deliberate self-healing properties, and this is where things get exciting as you double engineers and customers year after year after year.

  Figure 2.7

  Employee growth rate of fast-growing companies.

  Productively integrating large numbers of engineers is hard.

  Just how challenging this is depends on how quickly you can ramp engineers up to self-sufficient productivity, but if you’re doubling every six months and it takes six to twelve months to ramp up, then you can quickly find a scenario in which untrained engineers increasingly outnumber the trained engineers, and each trained engineer is devoting much of their time to training a couple of newer engineers.

  Imagine a scenario in which training a single new engineer takes about 10 hours per week from each trained engineer, and in which untrained engineers are one-third as productive as trained engineers. The result is the right-hand chart’s (admittedly, a pretty worst-case scenario) ratio of two untrained engineers to one trained. Worse, for those three people you’re only getting the productivity of 1.16 trained engineers (2 × 0.33 for the untrained engineers plus 0.5 × 1 for the trainer).
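
  To make that arithmetic concrete, here’s a purely illustrative back-of-the-envelope sketch; the 10-hours-per-trainee and one-third-productivity figures are just the assumptions stated above, not data.

```python
# Back-of-the-envelope model of the training scenario above. Assumptions from the
# text: untrained engineers run at roughly 0.33 productivity, and each trainee
# costs a trained engineer about 10 hours of a 40-hour week.
WEEK_HOURS = 40
UNTRAINED_PRODUCTIVITY = 0.33
TRAINING_HOURS_PER_TRAINEE = 10

def effective_output(trained: int, untrained: int) -> float:
    """Group output, measured in trained-engineer equivalents."""
    training_hours = untrained * TRAINING_HOURS_PER_TRAINEE
    trainer_output = trained - training_hours / WEEK_HOURS
    return trainer_output + untrained * UNTRAINED_PRODUCTIVITY

print(round(effective_output(trained=1, untrained=2), 2))  # 1.16, out of a possible 3
```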

  You also need to factor in the time spent on hiring.

  If you’re trying to double every six months, and about 10 percent of candidates undergoing phone screens eventually join, then you need to do ten interviews per existing engineer in that time period, with each interview taking about two hours to prep, perform, and debrief.

  Figure 2.8

  More employees, more customers, more problems.

  That’s less than four hours per engineer per month if you can leverage your entire existing team, but training comes up again here: if it takes you six months to get the average engineer onto your interview loop, each trained engineer is now doing three to four hours of hiring-related work per week, and your trained engineers are down to approximately 0.4 efficiency. The overall team is getting 1.06 engineers’ worth of work out of every three engineers.
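
  Continuing the same illustrative model, and taking as given the text’s figure of roughly four hours of hiring work per trained engineer per week:

```python
# Layering hiring onto the sketch: ten interviews per existing engineer per six
# months at about two hours each, with the interview load concentrated on trained
# engineers (roughly four hours per week, per the text).
hours_per_engineer_per_month = 10 * 2 / 6      # ~3.3 if the whole team interviews
trained_efficiency = 0.5 - 4 / 40              # training already costs half their week
team_output = 2 * 0.33 + trained_efficiency
print(round(hours_per_engineer_per_month, 1), round(team_output, 2))  # 3.3 and 1.06
```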

  It’s not just training and hiring, though:

  For every additional order of magnitude of engineers, you need to design and maintain a new layer of management.

  For every ~10 engineers, you need an additional team, which requires more coordination.14

  Each engineer means more commits and deployments per day, creating load on your development tools.

  Most outages are caused by deployments, so more deployments drive more outages, which in turn require incident management, mitigations, and postmortems.

  Having more engineers leads to more specialized teams and systems, which require increasingly small on-call rotations so that your on-call engineers have enough system context to debug and resolve production issues. Consequently, relative time invested in on-call goes up.

  Let’s do a bit more handwavy math to factor these in.

  Only your trained engineers can go on-call. They’re on-call one week a month, and are busy about half their time on-call. So that’s a total impact of five hours per week for your trained engineers, who are now down to 0.275 efficiency, and your team is now getting less than the output of a single trained engineer for every three engineers you’ve hired.

  (This is admittedly an unfair comparison because it’s not accounting for the on-call load on the smaller initial teams, but if you accept the premise that on-call load grows as engineer head count grows, and that load grows as the number of rotations grows, then the conclusion should still roughly hold.)
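
  Folding the on-call assumptions into the same handwavy sketch:

```python
# On-call layer: trained engineers are on-call one week in four and busy about
# half of that week, which averages out to roughly five hours per week.
oncall_hours_per_week = 0.5 * 40 / 4                   # ~5 hours/week on average
trained_efficiency = 0.4 - oncall_hours_per_week / 40  # 0.4 already reflects training + hiring
team_output = 2 * 0.33 + trained_efficiency
print(round(trained_efficiency, 3), round(team_output, 3))  # 0.275 and 0.935
```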

  Although it’s rarely quite this extreme, this is where the oft-raised concern that “hiring is slowing us down” comes from: at high enough rates, the marginal added value of hiring gets very low, especially if your training process is weak.

  Sometimes very low means negative!

  2.4.2 Systems survive one magnitude of growth

  We’ve looked a bit at productivity’s tortured relationship with engineering head count, so now let’s also think a bit about how the load on your systems is growing.

  Understanding the overall impact of increased load comes down to a few important trends:

  Most systems are designed to support one to two orders of magnitude of growth from the current load. Even systems designed for more growth tend to run into limitations within one to two orders of magnitude.

  If your traffic doubles every six months, then your load increases an order of magnitude every 18 months. (And sometimes new features or products cause load to increase much more quickly.)

  The cardinality of supported systems increases over time as you add teams, and as “trivial” systems go from unsupported afterthoughts to focal points for entire teams as the systems reach scaling plateaus (things like Apache Kafka, mail delivery, Redis, etc.).

  If your company is designing systems to last one order of magnitude and is doubling every six months, then you’ll have to re-implement every system twice every three years. This creates a great deal of risk—almost every platform team is working on a critical scaling project—and can also create a great deal of resource contention to finish these concurrent rewrites.

  However, the real productivity killer is not system rewrites but the migrations that follow those rewrites. Poorly designed migrations expand the consequences of this rewrite loop from the individual teams supporting the systems to the entire surrounding organization.

  If each migration takes a week, each team is eight engineers, and you’re doing four migrations a year, then you’re losing about 1 percent of your company’s total productivity. If each of those migrations takes closer to a month, or if they are only possible for your small cadre of trained engineers—whose time is already tightly contended for—then the impact becomes far more pronounced.
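
  As a quick sanity check on that 1 percent, here’s the arithmetic as a tiny sketch; it assumes each migration costs a team roughly one engineer-week, which is the reading consistent with the stated figure.

```python
# Rough cost of the migration treadmill: four one-engineer-week migrations a year
# spread across an eight-person team.
team_size = 8
migrations_per_year = 4
engineer_weeks_lost = migrations_per_year * 1   # ~1 engineer-week per migration (assumed)
engineer_weeks_available = team_size * 52
print(round(engineer_weeks_lost / engineer_weeks_available, 3))  # 0.01, about 1 percent
```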

  There is a lot more that could be said here—companies that mature rapidly often have tight and urgent deadlines around pursuing various critical projects, and around moving to multiple data centers, to active-active designs, and to new international regions—but I think we’ve covered our bases on how increasing system load can become a drag on overall engineering throughput.

  The real question is, what do we do about any of this?

  2.4.3 Ways to manage entropy

  My favorite observation from The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford15 is that you only get value from projects when they finish: to make progress, above all else, you must ensure that some of your projects finish.

  That might imply that there is an easy solution, but finishing projects is pretty hard when most of your time is consumed by other demands.

  Let’s tackle hiring first, as hiring and training are often a team’s biggest time investment.

  When your company has decided that it is going to grow, you cannot stop it from growing, but, on the other hand, you absolutely can concentrate that growth, such that your teams alternate between periods of rapid hiring and periods of consolidation and gelling. Most teams work best when scoped to approximately eight engineers, so as each team reaches that size, you can move the hiring spigot to another team (or to a new team). As the post-hiring team gels, eventually the entire group will be trained and able to push projects forward.

  You can do something similar on an individual basis, rotating engineers off of interviewing periodically to give them time to recuperate. With high interview loads, you’ll sometimes notice last year’s solid interviewer giving a poor experience to a candidate or rejecting every incoming candidate. If your engineer is doing more than three interviews a week, it is a useful act of mercy to give them a month off every three or four months.

  Figure 2.9

  Candidates get offers, become untrained, and then learn.

  I have less evidence on how to tackle the training component of this, but generally you start to see larger companies make major investments in both new-hire bootcamps and recurring education classes.

  I’m optimistically confident that we’re not entirely cargo-culting this idea from each other, so it probably works, but I hope to get an opportunity to spend more time understanding how effective those programs can be. If you could get training down to four weeks, imagine how quickly you could hire without overwhelming the existing team!

  The second most effective time thief that I’ve found is ad hoc interruptions: getting pinged on HipChat or Slack, taps on the shoulder, alerts from your on-call system, high-volume email lists, and so on.

  The strategy here is to funnel interruptions into an increasingly small area, and then automate that area as much as possible. Ask people to file tickets, create chatbots that automate filing tickets, create a service cookbook, and so on.

  With that setup in place, create a rotation for people who are available to answer questions, and train your team not to answer other forms of interruptions. This is remarkably uncomfortable because we want to be helpful humans, but it becomes necessary as the number of interruptions climbs higher.

  One specific tool that I’ve found extremely helpful here is an ownership registry, which allows you to look up who owns what, eliminating the frequent “Who owns X?” variety of question. You’ll need this sort of thing to automate paging the right on-call rotation, so you might as well get two useful tools out of it!
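
  As a purely illustrative sketch (the service, team, and rotation names are hypothetical), an ownership registry can start as little more than a lookup table:

```python
# Minimal ownership registry: one lookup answers "who owns X?" and the same data
# can drive automated paging of the right on-call rotation.
from dataclasses import dataclass

@dataclass
class Ownership:
    team: str
    oncall_rotation: str
    chat_channel: str

REGISTRY = {
    "payments-api": Ownership("Payments", "payments-oncall", "#payments"),
    "search-index": Ownership("Discovery", "discovery-oncall", "#discovery"),
}

def who_owns(service: str) -> Ownership:
    """Look up the owning team for a service, e.g. to route a ticket or a page."""
    return REGISTRY[service]

print(who_owns("payments-api").team)  # Payments
```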

  A related variant of this is the ad hoc meeting request. The best tool that I’ve found for this is to block out a few large chunks of time each week to focus. This can range from telecommuting on Thursday, to blocking out Monday and Wednesday afternoons, to blocking out from 8–11 each morning. Experiment a bit and find something that works well for you.

  Finally, the one thing that I’ve found at companies with very few interruptions and have observed almost nowhere else: really great, consistently available documentation. It’s probably even harder to bootstrap documentation into a non-documenting company than it is to bootstrap unit tests into a non-testing company, but the best solution to frequent interruptions I’ve seen is a culture of documentation, documentation reading, and a documentation search that actually works.

  There are a non-zero number of companies that do internal documentation well, but I’m less sure if there are a non-zero number of companies with more than 20 engineers that do this well. If you know any, please let me know so that I can pick their brains.

  In my opinion, probably the most important opportunity is designing your software to be flexible. I’ve described this as “fail open and layer policy”; the best system rewrite is the one that didn’t happen, and if you can avoid baking in arbitrary policy decisions that will change frequently over time, then you are much more likely to be able to keep using a system for the long term.

  If you’re going to have to rewrite your systems every few years due to increased scale, let’s avoid any unnecessary rewrites, ya know?

  Along these lines, if you can keep your interfaces generic, then you are able to skip the migration phase of system re-implementation, which tends to be the longest and trickiest phase, and you can iterate much more quickly and maintain fewer concurrent versions. There is absolutely a cost to maintaining this extra layer of indirection, but if you’ve already rewritten a system twice, take the time to abstract the interface as part of the third rewrite and thank yourself later. (By the time you’d do the fourth rewrite, you’d be dealing with migrating six times as many engineers.)
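
  To sketch what a generic interface buys you, here’s a hypothetical example: callers depend only on a small abstract interface, so a later rewrite slots in behind it and the caller-facing migration mostly disappears.

```python
# Hypothetical sketch: callers program against KeyValueStore, so the backend can
# be rewritten (say, sharded) without migrating every caller.
from abc import ABC, abstractmethod
from typing import Optional

class KeyValueStore(ABC):
    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

class InMemoryStore(KeyValueStore):
    """First implementation; good enough for an order of magnitude or so of growth."""
    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

def record_signup(store: KeyValueStore, user_id: str) -> None:
    # Callers only ever see KeyValueStore, so they never notice the rewrite.
    store.put(f"signup:{user_id}", b"1")

record_signup(InMemoryStore(), "user-123")
```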

  Finally, a related antipattern is the gatekeeper pattern. Having humans who perform gatekeeping activities creates very odd social dynamics, and is rarely a great use of a human’s time. When at all possible, build systems with sufficient isolation that you can allow most actions to go forward. And when they do occasionally fail, make sure that they fail with a limited blast radius.

  There are some cases in which gatekeepers are necessary for legal or compliance reasons, or because a system is deeply frail, but I think that we should generally treat gatekeeping as a significant implementation bug rather than as a stability feature to be emulated.

  2.4.4 Closing thoughts

  None of the ideas here are instant wins. It’s my sense that managing rapid growth is more about stacking small wins than identifying silver bullets. I have used all of these techniques, and am using most of them today to some extent or another, so hopefully they will at least give you a few ideas.

  Something that is largely ignored here is how to handle urgent project requests when you’re already underwater with your existing work and maintenance. The most valuable skill in that situation is learning to say no in a way that is appropriate to your company’s culture. That probably deserves its own chapter. There are probably some companies where saying no is culturally impossible, and in those places I guess you either learn to say your noes as yeses, or maybe you find a slightly easier environment to participate in.

  How do you remain productive in times of hypergrowth?

  2.5 Where to stash your organizational risk?

  Lately, I’m increasingly hearing folks reference the idea of organizational debt. This is the organizational sibling of technical debt, and it represents things like biased interview processes and inequitable compensation mechanisms. These are systemic problems that are preventing your organization from reaching its potential. Like technical debt, these risks linger because they are never the most pressing problem. Until that one fateful moment when they are.

  Within organizational debt, there is a volatile subset most likely to come abruptly due, and I call that subset organizational risk. Some good examples might be a toxic team culture, a toilsome fire drill, or a struggling leader.

  These problems bubble up from your peers, skip-level one-on-ones,16 and organizational health surveys. If you care and are listening, these are hard to miss. But they are slow to fix. And, oh, do they accumulate! The larger and older your organization is, the more you’ll find perched on your capable shoulders.

  How you respond to this is, in my opinion, the core challenge of leading a large organization. How do you continue to remain emotionally engaged with the challenges faced by individuals you’re responsible to help, when their problem is low in your problems queue? In that moment, do you shrug off the responsibility, either by changing roles or picking powerlessness? Hide in indifference? Become so hard on yourself that you collapse inward?

  I’ve tried all of these! They weren’t very satisfying.

  What I’ve found most successful is to identify a few areas to improve, ensure you’re making progress on those, and give yourself permission to do the rest poorly. Work with your manager to write this up as an explicit plan and agree on what reasonable progress looks like. These issues are still stored with your other bags of risk and responsibility, but you’ve agreed on expectations.

  Now you have a set of organizational risks that you’re pretty confident will get fixed, and then you have all the others: known problems, likely to go sideways, that you don’t believe you’re able to address quickly. What do you do about those?

  I like to keep them close.

  Typically, my organizational philosophy is to stabilize team-by-team and organization-by-organization, ensuring that any given area is well on the path to health before moving my focus. I try not to push risks onto teams that are functioning well. You do need to delegate some risks, but generally I think it’s best to only delegate solvable risk. If something simply isn’t likely to go well, I think it’s best to hold the bag yourself. You may be the best suited to manage the risk, but you’re almost certainly the best positioned to take responsibility.

  As an organizational leader, you’ll always have a portfolio of risk, and you’ll always be doing very badly at some things that are important to you. That’s not only okay, it’s unavoidable.

  2.6 Succession planning

  Two or three years into a role, you may find that your personal rate of learning has trailed off. You know your team well, the industry particulars are no longer quite as intimidating, and you have solved the mystery of getting things done at your company. This can be a sign to start looking for your next role, but it’s also a great opportunity to build experience with succession planning.

  Succession planning is thinking through how the organization would function without you, documenting those gaps, and starting to fill them in. It’s awkward enough to talk about that it doesn’t get much discussion, but it’s a foundational skill for building an enduring organization.

 
