Comments about technological history, system fractures, and human resilience from James R. Chiles, the author of Inviting Disaster: Lessons from the Edge of Technology (HarperBusiness 2001; paperback 2002) and The God Machine: From Boomerangs to Black Hawks, the Story of the Helicopter (Random House, 2007, paperback 2008)

Friday, June 22, 2012

Root Cause Analysis: The Traveler's Tale, Part 1

Following is an overview of root-cause analysis, in the manner of a fable ... with a time machine! “Root cause analysis” is a common term in industry and in news articles like the recent one about the origins of cracks in a shield building at the Davis-Besse nuclear power reactor, but it's impossible to summarize in a few sentences, so most writers just assume that readers know the method. In fact, root cause analysis is a whole array of methods, developed over decades by many people.

And now for "The Traveler's Tale," Part 1.

From: Richard C. Asplundh, Staff Engineer, Rapid Prototyping Div.
To: Boss
Re: Mixed Results with Prototype X-1A Time Machine
Date: September 4, 1877
Via: Catamount Brand Whiskey bottle

Well, boss, the new time machine works! In one direction, anyway! I'm leaving my progress report in a bottle and burying it where I hope someone will find it in your time zone, and send it to the home office. Meanwhile, I'm staying busy back here and trusting that you still have the meter running, paycheck-wise.

Remember how you asked me to do just one little test run before lunch with the X-1A? As in, “Ricky, old boy, how about you go backward just a few minutes, and try out the Back-to-Base Homing Mode?”

I'm here to say the homing mode doesn't work yet. Also, you need to tell the lab rats that they miscalibrated the ChronoCounter, because that "couple of minutes" they dialed in was considerably more than a century.

Fortunately, because of my deep training in all forms of Root Cause Analysis, I've been able to make myself useful back here, pending some help from your direction.

Before I get down to details, here's the view from 40-K feet: I get a job in a hard-rock gold mine. I immediately learn that the boys at the Acme Mine are smack in the middle of an all-out business crisis. But it can't withstand my root-cause skills for long, and that's me without my laptop, or my notebook, or the stack of proprietary software. 

But as I always like to say, root cause analysis is more headware than hardware. Me and my rough-hewn buddies get things sorted out, against all odds.

The big adventure starts like this: There's a haze, I feel dizzy, then find myself in a mountain forest. The machine's control panel says it's just a few minutes before I left, but I can see that this doesn't look like the inside of our company's Southwest South Dakota Warehouse at all. I wait a day in case you might send a rescue squad from the future, but no dice, so I decide to trudge off and meet the natives, whoever they may be.

I have to say it's pretty exciting to go off exploring when you don't know if you're fifty years off the beam or fifty thousand. But I find a grubby town and get my bearings: I've touched down outside of Deadwood, Dakota Territory. It's August 16, 1877, about three years into the Black Hills gold rush.

Nothing too glamorous about this side of the Old West: unpainted slab-sided buildings up high, and mud down low. I head for a “help wanted” sign, walk inside, and a thin geezer with a visor says they need a man to tend the couple of dozen Missouri mules that live in a stable at the bottom of the Acme Mine.

Graybeard identifies himself as Too-Tall Johanson. He says that mules pull the ore cars from the heading down a little iron track, back to the main shaft, where a steam engine drags the rock to the surface. The mules live down there, never seeing daylight … like our IT people.

I tell the guy behind the counter that I don't know one end of a mule from the other. But Too-Tall hires me for two dollars a day, hands me a shovel, and says I'll pick it up. “The way things are going at this mine, it won't be for long anyway,” Too-Tall says. This is about when my fact-finding antennae go into hyper-activity.

“The owners in Frisco are about to close us down and everything else is going wrong too!” says he. “Cain't hardly understand it!” I clap Too-Tall on the shoulder and tell him that help has arrived from an unexpected direction. He shakes his head and I go off to grab some worn-out old miner togs from a heap in the back, which feel like they were hacked out of old pine shingles. I buy a carbide lamp on credit at the company store, and down into the dark I go.

After a week I switch to the midnight shift. That way, I can give a few pointers to the mine's baseball team at batting practice after dinner. It's in the cellar, with 43 losses and 26 wins.

The mine is in even worse shape. According to the boys, things seemed to go south all of a sudden. Starting about two weeks before I drop in, gold-ore production took a nosedive first in quantity and then quality. And all of a sudden there were new, weird problems nobody had seen before. Miners are a superstitious bunch and morale took a tumble. 

I find out why nobody wanted the mule-tending job: there was an unexplained explosion in the mule stable a week back, and it's left the men convinced that another mule is about to blow up. Meanwhile, all the mules have bellyaches and make a lot of noise. We are buying gallons of Brother Jubal's patent medicine and mixing it in the water trough, but it doesn't help.

I find out plenty of other things the first week. I buy some paper and a box of pencils and make a stack of notes back at the bunkhouse. Soon it's time to start my root-cause analysis, frontier-style. It's the world's first. (As I explained at a staff picnic last year, while the core concepts behind root causes are recognizable in Aristotle's notions of moral responsibility and determinism, the method didn't get going until Operations Research during World War II and the postwar study of loss control. So right now I'm seventy years ahead of the competition.)

I initiate my work one Saturday night when I'm whooping it up with the graveyard shift in the Dirty Dog Saloon. I shove aside the shot glasses and peanut shells, pull out two sheets of foolscap with my incident description, and pass them around to the boys for a review.

There isn't room for all my work product in one whiskey bottle, but it went like this:

“Series of mishaps and problems at the Acme Mine beginning around August 2, continuing to date. Most time-critical is rapid deterioration in gold-ore production. Problem first noticed with downward trend in ore deliveries in tons/day, falling to 33% below targets. Persisted for two weeks. Tonnage recovered to acceptable range by August 12 but starting August 7, assayed ore quality at the stamp mill fell from 10 troy oz/ton to 1 troy oz/ton. Monthly gross revenue from stamp mill dropped 32% year over year. Owners plan to close mine at end of fiscal.”

I was careful to word this like the classic I.D.: focus on describing the most serious symptom and don't point fingers or guess about solutions. Too early for that!

Well, incident descriptions are something new to the crowd at the Dirty Dog Saloon, so I buy another round of rotgut liquor and warm them up to the idea. The audience even adds a couple of bullet points, along with a bullet hole following gunplay between a tinhorn gambler and a placer miner two tables over.

As you know, my next task is the Problem Statement, describing the deviation from the desired state, and of course it's scoped to stay within our field of control. I draft a short paragraph, in declarative sentences, stating the goal to achieve. 

Now the hard part is about to begin: getting buy-in from the powers that be. Something tells me that they don't place a lot of faith (yet) in statistics and decision trees.

That follows in the second whiskey bottle, so stay tuned for Part 2 of "Root Cause Analysis: The Time Traveler's Tale."
