The Unicorn Project

by Gene Kim


  Maxine is still at the office at ten that evening. By now, there’s a sense of genuine panic that things are going very, very wrong. So spectacularly wrong that even Dwayne, who was the most pessimistic in the Phoenix release betting pool, mutters to Maxine, “This is going worse than I thought it would.”

  That’s when Maxine becomes genuinely frightened.

  By midnight, it’s clear that a database migration is going to take five hours to complete instead of five minutes, with no way to stop it or restart it. Maxine tries to be helpful, but she isn’t familiar enough with the Phoenix systems to know where she would be the most useful.

  In contrast, Brent is being pulled every which way, needed for just about every problem, from the huge database meltdown in progress to helping people fix their configuration files. Seeing this, Maxine organizes a team to play goalie, protecting Brent from interruptions and fielding problems that don’t require him.

  Maxine notices something else. There must be two hundred people responsible for some portion of the release, but for most of them, it’s only about five minutes of work. So, they have to wait around for hours to perform their little part in this excruciatingly long, complex, and dangerous operation. The rest of their time is spent watching and … waiting.

  Even in the middle of this crisis, people are just sitting around, waiting.

  By two a.m. everyone realizes there is a very real risk that they are going to break every point-of-sale register in every one of the nearly a thousand stores, knocking Parts Unlimited back into the Stone Age. And with all the promotion that Marketing has been doing, the stores will be filled with angry customers unable to buy what they were promised.

  Brent asks her to join a SWAT team to figure out how to speed up the database queries, still nearly a thousand times slower than they need to be in order to handle the expected load when stores open up later that morning.

  For hours, she works with a bunch of Phoenix developers and Ops DBAs with her IDE and browser open. They are stunned when they discover that clicking the product category drop-down box floods the database with 8,000 SQL queries.

  They are still working on fixing this when Wes pokes his head in the room, “Brent, we’ve got a problem.”

  “I’m busy, Wes,” Brent replies, not even looking up from his laptop. “No, this is serious,” Wes says. “The prices have disappeared from at least half of our products on the e-commerce site and mobile apps. Where the price should be displayed, either nothing shows up or it says ‘null.’ Screenshots are in the #launch channel.”

  Maxine blanches, pulling up the screenshot. This is much more serious than slow database queries, she thinks.

  “Dammit, I bet it’s another bad upload from the pricing team,” Brent says after staring at his screen for several moments. Maxine leans over as Brent pulls up various administrative screens and product tables—some are inside of Phoenix and others on systems she doesn’t recognize.

  Maxine takes notes as Brent pulls up log files, runs SQL queries against a production database, pulls up more tables in various applications … Only when he opens up a terminal window and logs into a server does Maxine ask, “What are you doing now?”

  “I need to inspect the CSV file that they uploaded into the app,” he says. “I think I can find one in the temporary directory on one of the application servers.” Maxine nods.

  When Brent squints at his screen, Maxine does as well. It’s a comma-separated text file with column names in the first row: product SKU, wholesale price, list price, sale promotion price, promotion start date … “It looks fine,” Brent mutters.

  Maxine agrees. She says, “Can you copy that file into the chat room? I’d like to take a look at it.”

  “Good idea,” he says. She imports it into Excel and several other of her favorite tools. It looks fine.

  While Wes tries to get one of the development managers on the phone, Brent tries to figure out what is going wrong. It’s almost thirty minutes later when he curses. “I can’t believe it. It’s a BOM!”

  Seeing Maxine’s confused expression, he says, “A byte-order mark!”

  “No way,” mutters Maxine, pulling up the file again, this time in a binary file editor. She stares at her screen, stunned that she missed it. A BOM is an invisible first character that some programs put in a CSV file to indicate whether it’s big-endian or little-endian. She’s been bitten by this before.

  Years ago, a colleague gave her a file exported from the SPSS statistical analysis application, and she spent half a day trying to figure out why her application couldn’t load it as expected. She finally discovered that the file had a BOM, which got interpreted as part of the first column name, which caused all her programs to fail. Which is almost certainly what is happening here, she thinks.

  Any intellectual satisfaction she feels at understanding this particular puzzle quickly disappears. She asks Brent, “This has happened before?”

  “You have no idea,” Brent says, rolling his eyes. “Different problem every time, depending on who generated the file. The most common problems lately are zero-length files, or files with no rows in them. And it’s not just the pricing team—we have data problems like this all over the place.”

  Maxine is appalled. The first thing she would have done right away is write some automated tests to ensure that all input files are correctly formed before they allow them to corrupt their production database, and that the correct number of rows are actually in the file.

  “Let me guess. You’re the only one who knows how to correct these bad uploads?” Maxine asks.

  “Yep,” she hears Wes say from behind her. “All roads lead to Brent.” Maxine jots down more notes, determined to investigate this and do something about it later.

  It’s almost two hours later before the pricing tables are corrected. Because of what Brent said, Maxine double-checks the file and is certain that it’s missing a significant number of product entries. And because the pricing team wasn’t part of the release, no one knows how to get a hold of them in the middle of the night (or early morning as it seems to be). Maxine adds some more things to her list of things that she’ll insist on building so that this won’t happen again.

  At seven a.m., Maxine rejoins the database team. They’re still working on speeding up queries—but it’s too late. An announcement is made that stores are beginning to open on the East Coast.

  The Phoenix release is still nowhere near complete. “We’re fourteen hours into the launch, and the missile is still stuck in the tube,” Dwayne says glumly.

  Maxine doesn’t know whether to laugh, smirk, or throw up—when missiles are stuck in the launch tube, it’s a very dangerous scenario, because at that point the missile is already armed and too dangerous to approach.

  At eight a.m., they are still hours away from having a working point-of-sale system. Sarah and her team are forced to train every store manager on how to use carbon paper imprints, and some stores are forced to only accept cash or personal checks.

  For Maxine, the rest of Saturday goes by in a blur. She’s unable to go home. The Phoenix rollout was more than just a spectacular outage … it was the most amazing example of production data loss Maxine has ever seen.

  Somehow, they managed to corrupt incoming customer orders. Tens of thousands of customer orders were lost, and an equal number of customer orders were somehow duplicated—sometimes three or four times. Hundreds of order administrators and accountants were mobilized, reconciling database entries against paper order slips being emailed or faxed from stores.

  Shannon texts everyone in the Rebellion, horrified that boxes of customer credit card numbers are being transmitted in the clear—but in the grand scheme of things, it’s just another blip in the Phoenix disaster.

  At three p.m., Kurt texts everyone:

  Not to put light on this big pile of suck, but Dwayne wins the betting pool. Congratulations, Dwayne.

  Dwayne replies:

  Not worth it! FUUUUUUUUUUUU …

  He posts an image of a burning tire fire.

  By Saturday night, Maxine finally manages to go home and sleep for six hours before coming back to the office. Dwayne was right, this will go down in the record books, she thinks glumly.

  On Monday morning, Maxine is shocked to see her reflection in a mirror. She looks like crap, just like everyone around her—bags under her eyes, hair stringy. Long gone are her carefully pressed blazers. Now it’s jeans and a wrinkled jacket to cover up a stain on her equally wrinkled blouse. Today she doesn’t look classy. Like everyone else, she looks like she’s recovering from a hangover, having slept in her outfit from the night before.

  Since Saturday morning, their e-commerce site has been continually crashing under the unprecedented levels of customer traffic. In a status update meeting, Sarah crowed about what a great job Marketing did promoting Phoenix, then demanded that IT pull their weight.

  “She’s unbelievable,” Shannon mutters. “She created this whole disaster! Is anyone ever going to call her on this?” Maxine just shrugs.

  The carnage is unbelievable. Most of the in-store systems are still down—not just the point-of-sale registers, but nearly all of the back-office applications that support the in-store staff.

  For reasons that continue to mystify everyone, even the corporate website and email servers are having problems, further hampering their ability to get critical information to people who need it—not everyone has access to the developer chat rooms.

  In situations like this, technology failures cascade through the organization, like water flooding through a sinking submarine.

  Trying to stay alert, Maxine goes to get more coffee from the kitchen. Dwayne’s there doing the same thing. They nod at each other, and he says, “Did you hear we have hundreds of people who can’t even get into their buildings because their keycards won’t work?”

  “What?!” Maxine exclaims, exhausted but laughing. She says, “I was just talking to someone who’s trying to figure out why a bunch of batch jobs aren’t running. He’s even saying payroll might be delayed again—umm, I’ll leave that to other people to fix,” she concludes with a small laugh.

  “Huh,” he muses. “I wonder if we managed to knock out an interface to an HR application. That might explain these strange errors. We managed to screw up everything else.”

  All day during the recovery efforts, she hears questions like: Why are all those transactions failing? Where are they failing? How did it get into that state? Of the three ideas that might fix the problem, which one should we try? Will it make it worse? We think we fixed it, but is it really fixed?

  Once again, Maxine’s sensibilities are offended by how entangled all these systems are with each other. It’s so difficult to understand any part of the system in isolation.

  At times, it was difficult not to feel panicked. Earlier in the day, it looked like the Parts Unlimited e-commerce site was being attacked by an external party actively stealing credit card numbers. It took over an hour for Shannon and the security team to send out an email concluding that it was an application error—if someone refreshed the shopping cart at the wrong time, the full credit card number and three-digit CVV code of a random customer was displayed.

  The good news was that it wasn’t an external hack. The bad news was that it was a genuine cardholder data exposure event and likely another reason to be front-page news. All the attention and ridicule exploding on social media only added to everyone’s stress.

  Taking a break, Maxine walks back to her desk. She sees the developer who was so unconcerned with the release last week. He’s wearing fresh clothes and appears to be well-rested.

  “Rough weekend, I’m guessing?” he says to Maxine.

  Maxine stares at him, speechless. He’s still working on features for the next release. The only big change for him is that all his meetings have been canceled because most people have been sucked into the Phoenix crisis.

  He turns back to his screen to work on his piece of the puzzle, not caring that none of the pieces actually fit together. Or that the entire puzzle has caught on fire over the weekend, along with the house and the entire neighborhood.

  From:

  Alan Perez (Operating Partner, Wayne-Yokohama Equity Partners)

  To:

  Dick Landry (CFO), Sarah Moulton (SVP of Retail Operations)

  Cc:

  Steve Masters (CEO), Bob Strauss (Board Chair)

  Date:

  8:15 a.m., September 15

  Subject:

  Phoenix Release **CONFIDENTIAL**

  Sarah and Dick,

  I’ve been reading the news headlines about the Phoenix release. Not a great start. Again, I question whether software is a competency Parts Unlimited can create. Maybe we explore outsourcing IT?

  Sarah, you mentioned the large number of developers you’ve contracted to help. How long until they are fully contributing? When you grow a sales team, it takes time for new salespeople to carry full quota capacity. Can new developers really be onboarded fast enough to make a difference? Or are we just throwing good money after bad?

  Sincerely, Alan

  From:

  Sarah Moulton (SVP, Retail Operations)

  To:

  All IT Employees

  Cc:

  All Company Executives

  Date:

  10:15 a.m., September 15

  Subject:

  New production change policy

  Thank you for all your hard work helping deliver Phoenix to our customers. This is a badly needed step for us to regain parity in the marketplace.

  However, due to the harm that we did to our customers because of unanticipated problems caused by poor judgment exercised by certain members of the IT organization, all production changes must be approved by me, as well as Chris Allers and Bill Palmer.

  Changes made without approval will result in disciplinary action.

  Thank you, Sarah Moulton

  Maxine reads the email from Sarah. There’s a new, maybe even sinister, dynamic creeping into the Phoenix Project. In each of the outage calls and crisis management meetings, senior leaders seem to be going out of their way to posture about how they did their job but other people didn’t do their jobs, sometimes subtly, sometimes very blatantly.

  While the redshirts battle to contain the raging engine fire that is threatening the entire ship, the bridge officers continue to cover their asses, Maxine observes. Some are even using the disaster to their political advantage, often to punish individual engineers or entire departments for supposed dereliction of duty.

  Apparently, no one in IT leadership is safe—Maxine hears whispers that both Chris and Bill, as the heads of Dev and Ops, are in jeopardy of being fired, and there are rumors of all of IT being outsourced again. However, most believe William, as head of QA, is most likely to be axed.

  Which is bullshit, thinks Maxine. William was assigned to head up the release team less than twenty-four hours before the release! No one can get fired for trying to avert a disaster, right?

  “It’s like the TV show Survivor,” says Shannon. “All the technology executives are just trying to last one more episode. Everyone is freaking out. Steve has been demoted, and Sarah is trying to convince everyone that she can save the company.”

  Later that afternoon, Brent invites Maxine to join a meeting. “We’ve got nearly sixty thousand erroneous and/or duplicate orders in the database, and we’ve got to fix them so that the finance people can get accurate revenue reports.”

  Maxine helps the group wrangle the problem for an hour. At the end, once they find a solution, one of the Marketing managers says, “This is above my paygrade. Sarah is super-sensitive about changes right now. I’ve got to get her approval.”

  Ah, the Square in action, just like Cranky Dave described. But decisions that might have needed only to go “up and over one” now have to go “up and over two.” Now, all product managers need to run everything by Sarah. Someone mutters, “Don’t hold your breath. She never responds right away.”

  Great, Maxine thinks. Sarah has effectively paralyzed everyone in this room even further.

  Throughout the day, all decisions and escalations quickly grind to a standstill, even for emergencies, which Maxine didn’t expect. She discovers why: every manager insists on being a part of the communication plan. Why? They want to hear any bad news first, so they don’t appear out of touch and can massage any messages up the chain.

  Maxine is sharing this observation with Kurt when his phone buzzes. Seeing his sour expression, she asks, “What’s up?”

  “It’s Sarah,” he says. “She says she’s getting conflicting information from Wes and me about the corrupted order data. I need to spend thirty minutes explaining it to her when I’ve got two actual emergencies going on.”

  Kurt storms off before she can even wish him good luck. Maxine shakes her head. The lack of trust and too much information flowing around is causing things to go slower and slower.

  On Tuesday, Maxine joins a meeting led by Wes about more mysterious, intermittent outages for both the e-commerce site and the point-of-sale systems.

  Sarah has been sending out emails, sometimes in all caps, reminding people how important this is. But everyone already knows how important this is—processing orders is one of the most important functions for any retailer.

  The room is almost empty, even though this is a Sev 1 outage.

  Apparently, everyone has had to go home sick. The Phoenix release forced people to work long hours together in close proximity all day and night, and with little sleep. Now everyone is dropping like flies. Of the people needed on this call, no one is healthy enough to be in the office. In fact, only two people are healthy enough to even be on the conference line.

  Maxine looks up when she hears Sarah shouting, “What can you do about this? Who can fix this? Our store managers need our help! Don’t people realize how important this is?”

 
