by Gene Kim
When no one says anything, I tentatively add, “How about we take those same level 3 engineers that are dedicated to protect Brent from break-fix to help with these change issues?”
Wes quickly responds, “Maybe. But it’s not a long-term fix. We need the people doing the work to know what the hell they’re doing, not enable more people to hoard knowledge.”
I listen to Wes and Patty brainstorm ideas to reduce yet another dependency on Brent when something starts to bother me. Erik called WIP, or work in process, the “silent killer,” and that inability to control WIP on the plant floor was one of the root causes for chronic due-date problems and quality issues.
We just discovered that sixty percent of our changes didn’t complete as scheduled.
Erik had pointed to the ever-growing mountain of work on the plant floor as an indication that the plant floor managers had failed to control their work in process.
I look at the mountain of change cards piled up on today’s date on the calendar, as if a giant snowplow had pushed them all forward. Suddenly, it’s starting to seem like the picture Erik painted on the plant floor eerily describes the state of my organization.
Can IT work really be compared to work on a plant floor?
Patty interrupts my deep contemplation as she asks, “What do you think?”
I look back up at her. “For the last couple of days, only forty percent of the scheduled changes were completed. The rest are being carried forward. Let’s assume that this continues for a bit longer, while we figure out how to disseminate all the Brent knowledge.
“We have 240 incomplete changes this week. If we have four hundred new changes coming in next week, we’ll have 640 changes on the schedule next week!
“We’re like the Bates Motel of changes,” I say in disbelief. “Changes go in but never come out. Within a month, we’ll have thousands of changes that we’ll be carrying around, all competing to get implemented.”
Patty nods, “That’s exactly what’s bothering me. We don’t have to wait a month to see thousands of changes—we’re already tracking 942 changes. We’ll cross over one thousand pending changes sometime next week. We’re running short of space to post and store these change cards. So why are we going through all this trouble if the changes aren’t even going to get implemented!”
I stare at all the cards, willing them into giving me an answer.
An ever-growing pile of inventory trapped on the plant floor, as high as the forklifts could stack it.
An ever-growing pile of changes trapped inside of IT Operations, with us running out of space to post the change cards.
Work piling up in front of the heat treat oven, because of Mark sitting at the job release desk releasing work.
Work piling up in front of Brent, because of…
Because of what?
Okay, if Brent is our heat treat oven, then who is our Mark? Who authorized all this work to be put in the system?
Well, we did. Or rather, the CAB did.
Crap. Does that mean we did this to ourselves?
But changes need to get done, right? That’s why they’re changes. Besides, how do you say no to the onslaught of incoming work?
Looking at the cards piling up, can we afford not to?
But when was the question ever asked whether we should accept the work? And on what basis did we ever make that decision?
Again, I don’t know the answer. But, worse, I have a feeling that Erik may not be a raving madman. Maybe he’s right. Maybe there is some sort of link between plant floor management and IT Operations. Maybe plant floor management and IT Operations actually have similar challenges and problems.
I stand up and walk to the change board. I start thinking aloud, “Patty is alarmed that more than half our changes aren’t completing as scheduled, to the extent that she’s wondering whether this whole change process is worth the time we’re investing in it.
“Furthermore,” I continue, “she points out that a significant portion of the changes can’t complete because Brent is somehow in the way, which is partially because we’ve directed Brent to reject all non-Phoenix work. We think that reversing this policy is the wrong thing to do.”
I take a mental leap, following my intuition. “And I’d bet a million dollars that this is the exact wrong thing to do. It’s because of this process that, for the first time, we’re even aware of how much scheduled work isn’t getting done! Getting rid of the process would just kill our situational awareness.”
Feeling like I’m getting on a roll, I say adamantly, “Patty, we need a better understanding of what work is going to be heading Brent’s way. We need to know which change cards involve Brent—maybe we even make that another piece of information required when people submit their cards. Or use a different color card—you figure it out. You need to inventory what changes need anything from Brent, and try to satisfy it instead with the level 3 engineers. Failing that, try to get them prioritized so we can triage them with Brent.”
As I’m talking, I’m more confident that we’re heading down the right path. At this point, we might not be fixing the problem, but at least we’ll be getting some data.
Patty nods, her expression of concern and despair now gone. “You want me to get my arms around the changes that are heading to Brent, indicating them on the change cards and maybe even requiring this information on all new cards. And to get back to you when we know how many changes are Brent-bound, what the changes are, and so forth, along with a sense of what the priorities are. Did I get that right?”
I nod and smile.
She types away on her laptop. “Okay, I’ve got it. I’m not sure what we’ll find out, but it’s better than anything I came up with by a long shot.”
I look over at Wes, “You look concerned—anything you want to share?”
“Uh…” Wes says eventually. “There’s not much to share, really. Except that this is a very different way of working than anything I’ve seen in IT. No offense, but did you switch medication recently?”
I smile wanly, “No, but I did have a conversation with a raving madman on a catwalk overlooking the manufacturing plant floor.”
But if Erik was right about WIP in IT Operations, what else was he right about?
CHAPTER 12
• Friday, September 12
It’s 7:30 p.m. on Friday, two hours after the Phoenix deployment was scheduled to start. And things are not going well. I’m starting to associate the smell of pizza with the futility of a death march.
The entire IT Operations team was assembled in preparation for the deployment at 4 p.m. But there was nothing to do because we hadn’t received anything from Chris’ team; they were still making last minute changes.
It’s not a good sign when they’re still attaching parts to the space shuttle at liftoff time.
At 4:30 p.m., William had stormed into the Phoenix war room, livid and disgusted that no one could get all of the Phoenix code to run in the test environment. Worse, the few parts of Phoenix that were running were failing critical tests.
William started sending back critical bug reports to the developers, many of whom had already gone home for the day. Chris had to call them back in, and William’s team had to wait for the developers to send them new versions.
My team wasn’t just sitting around, twiddling our thumbs. Instead, we were frantically working with William’s team to try to get all of Phoenix to come up in the test environment. Because if they couldn’t get things running in a test environment, we wouldn’t have a prayer of being able to deploy and run it in production.
My gaze shifts from the clock to the conference table. Brent and three other engineers are huddling with their QA counterparts. They’ve been working frantically since 4 p.m., and they already look haggard. Many have laptops open to Google searches, and others are systematically fiddling with settings for the servers, operating systems, databases, and the Phoenix application, trying to figure out how to bring everything up, which the developers had assured them was possible.
One of the developers had actually walked in a couple of minutes ago and said, “Look, it’s running on my laptop. How hard can it be?”
Wes started swearing, while two of our engineers and three of William’s engineers started poring through the developer’s laptop, trying to figure out what made it different from the test environment.
In another area of the room, an engineer is talking heatedly to somebody on the phone, “Yes, we copied the file that you gave us… Yes, it’s version 1.0.13… What do you mean it’s the wrong version… What? When did you change that?… Copy it now and try again… Okay, look, but I’m telling you this isn’t going to work… I think it’s a networking problem… What do you mean we need to open up a firewall port? Why the hell didn’t you tell us this two hours ago?”
He then slams the phone down hard, and then pounds the table with his fist, yelling, “Idiots!”
Brent looks up from the developer laptop, rubbing his eyes with fatigue. “Let me guess. The front-end can’t talk to the database server because someone didn’t tell us we need to open a firewall port?”
The engineer nods with exhausted fury, and says, “I cannot freaking believe this. I was on the phone with that jackass for twenty minutes, and it never occurred to him that it wasn’t a code problem. This is FUBAR.”
I continue to listen quietly, but I’m nodding in agreement at his prognosis. In the Marines, we used the term FUBAR.
Watching tempers fray, I look at my watch: 7:37 p.m.
It’s time to get a management gut check from my team. I round up Wes and Patty and look around for William. I find him staring over the shoulder of one of his engineers. I ask him to join us.
He looks puzzled for a moment, because we don’t normally interact, but he nods and follows us to my office.
“Okay, guys, tell me what you think of this situation,” I ask.
Wes speaks up first, “Those guys are right. This is FUBAR. We’re still getting incomplete releases from the developers. In the past two hours, I’ve already seen two instances when they’ve forgotten to give us several critical files, which guaranteed that the code wouldn’t run. And as you’ve seen, we still don’t know how to configure the test environment so that Phoenix actually comes up cleanly.”
He shakes his head again. “Based on what I’ve seen in the last half hour, I think we’ve actually moved backward.”
Patty just shakes her head with disgust and waves her hand, adding nothing.
I say to William, “I know we haven’t worked much together, but I’d really like to know what you think. How’s it looking from your perspective?”
He looks down, exhaling slowly and then says, “I honestly have no idea. The code is changing so fast that we’re having problems keeping up. If I were a betting man, I’d say Phoenix is going to blow up in production. I’ve talked with Chris a couple of times about stopping the release, but he and Sarah ran right over me.”
I ask him, “What do you mean by you ‘can’t keep up’?”
“When we find problems in our testing, we send it back to Development to have them fix it,” he explains. “Then they’ll send back a new release. The problem is that it takes about a half hour to get everything set up and running, and then another three hours to execute the smoke test. In that time, we’ll have probably gotten three more releases from Development.”
I smirk at the reference to smoke tests, a term circuit designers use. The saying goes, “If you turn the circuit board on and no smoke comes out, it’ll probably work.”
He shakes his head and says, “We have yet to make it through the smoke test. I’m concerned that we no longer have sufficient version control—we’ve gotten so sloppy about keeping track of version numbers of the entire release. Each time they fix something, they’re usually breaking something else. So, they’re sending single files over instead of the entire package.”
He continues, “It’s so chaotic right now that even if by some miracle Phoenix does pass the smoke test, I’m pretty sure we wouldn’t be able to replicate it, because there are too many moving parts.”
Taking off his glasses, he says with finality, “This is probably going to be an all-nighter for everyone. I think there’s genuine risk that we won’t have anything up and running at 8 a.m. tomorrow, when the stores open. And that’s a big problem.”
That is a huge understatement. If the release isn’t finished by 8 a.m., the point of sale systems in the stores used to check out customers won’t work. And that means we won’t be able to complete customer transactions.
Wes is nodding. “William is right. We’re definitely going to be here all night. And performance is worse than even I thought it would be. We’re going to need at least another twenty servers to spread the load, and I don’t know where we can find so many on such short notice. I have some people scrambling to find any spare hardware. Maybe we’ll even have to cannibalize servers in production.”
“Is it too late to stop the deployment?” I ask. “When exactly is the point of no return?”
“That’s a very good question.” Wes answers slowly. “I’d have to check with Brent, but I think we could stop the deployment now with no issues. But when we start converting the database so it can take orders from both the in-store POS systems and Phoenix, we are committed. At this rate, I don’t think that will be for a couple of hours yet.”
I nod. I’ve heard what I’ve needed to hear.
“Guys, I’m going to send out an e-mail to Steve, Chris, and Sarah to see if I can delay the deployment. And then I’m going to find Steve. Maybe I can get us one more week. But, hell, even getting one more day would be a win. Any thoughts?”
Wes, Patty, and William all just shake their heads glumly, saying nothing.
I turn to Patty. “Go work with William to figure out how we can get some better traffic coordination in the releases. Get over to where the developers are and play air traffic controller, and make sure everything is labeled and versioned on their side. And then let Wes and team know what’s coming over. We need better visibility and someone to keep people following process over there. I want a single entry point for code drops, controlled hourly releases, documentation… Get my drift?”
She says, “It would be my pleasure. I’ll head up to the Phoenix war room for starters. I’ll kick down the door if that’s what it takes and say, ‘We’re here to help…’”
I give them all a nod of thanks and head to my laptop to write my e-mail.
From: Bill Palmer
To: Steve Masters
Cc: Chris Anderson, Wes Davis, Patty McKee, Sarah Moulton, William Mason
Date: September 12, 7:45 PM
Priority: Highest
Subject: URGENT: Phoenix deployment in major trouble—my recommendation: 1 week delay
Steve,
First off, let me state that I want Phoenix in production as much as anyone else. I understand how important it is to the company.
However, based on what I’ve seen, I believe we will not have Phoenix up by tomorrow’s 8 AM deadline. There is SIGNIFICANT RISK that this may even impact the in-store POS systems.
After discussions with William I recommend that we delay the Phoenix launch by one week to increase the likelihood that Phoenix achieves its goals and avert what I believe will be a NEAR-CERTAIN disaster.
I think we’re looking at problems on the scale of the “November 1999 Thanksgiving Toys R Us” train-wreck, meaning multiday outages and performance problems that potentially put customer and order data at risk.
Steve, I will be calling you in just a couple of minutes.
Regards,
Bill
I take a moment to collect my thoughts and call Steve, who answers on the first ring.
“Steve, it’s Bill. I just sent out an e-mail to you, Sarah, and Chris. I cannot overstate how badly this rollout has gone so far. This is going to bite us in the ass. Even William agrees. My team is now extremely concerned that the rollout will not complete in time for the stores to open at 8 a.m. Eastern time tomorrow. That could disrupt the stores’ ability to take sales, as well as probably cause multiday outages to the website.
“It’s not too late to stop this train wreck,” I implore. “Failure means that we’ll have problems taking orders from anyone, whether they’re in the stores or on the Internet. Failure could mean jeopardizing and screwing up order data and customer records, which means losing customers. Delaying by a week would just mean disappointing customers, but at least they’ll come back!”
Steve breathes into the phone and then replies, “It sounds bad, but at this point, we don’t have a choice. We have to keep going. Marketing already bought weekend newspaper ads announcing the availability of Phoenix. They’re bought, paid for, and being delivered to homes across the country. Our partners are all lined up and ready to go.”
Flabbergasted, I say, “Steve, just how bad does it have to be for you to delay this release? I’m telling you that we could be taking a reckless level of risk in this rollout!”
He pauses for several moments. “Tell you what. If you can convince Sarah to postpone the rollout, let’s talk. Otherwise, keep pushing.”
“Are you kidding me? She’s the one who’s created this kamikaze mess.”
Before I can stop myself, I hang up on Steve. For a brief moment, I consider calling him back to apologize.
As much as I hate to, I feel like I owe the company one last try to stop this insanity. Which means talking to Sarah in person.
Back in the Phoenix war room it’s stuffy and rank from too many people sweating from tension and fear. Sarah is sitting by herself, typing away on her laptop.
I call out to her, “Sarah, can we talk?”
She gestures to the chair next to her, saying, “Sure. What’s up?”
I say in a lowered voice, “Let’s talk in the hallway.”