"Brent": the Story of the Volunteer Warrior, and The Mayhem That Ensued

If you haven't read The Phoenix Project, I gotta tell ya, it's a good book (even though there is some language in it, admittedly true to life in many IT organizations). It tells the story of an engineer called Brent. Brent is crucial to nearly every IT operation. Everything he knows is only in his head, and documented nowhere. If anything is to get done, Brent needs to be involved.

A real time-bomb situation.



The book goes on to talk about how to defuse this time bomb, but it doesn't talk about how to prevent it in the first place.

The Perfect Storm for Brent


This happened because of the "Volunteer Vicious Cycle". It arises when problems are dispatched to engineers on a volunteer basis.

The following diagram shows, in essence, what happens in the volunteer workflow:


This is exactly what creates a Brent: an uncontrolled, volunteer-based dispatch system when problems arise. A team of engineers hesitant to jump in and help, plus one eager volunteer, constitutes a perfect storm for new Brents to form.

The New Problem



When a new problem arises -- a bug, a ticket, or an issue, depending on your issue tracking system -- engineers on a team often hesitate to take it, given that they can choose which problems to take and which to fix. They are not hesitant because they are lazy or dumb. They are most often well-educated, highly-trained, seasoned engineers; however, new problems, never seen before, still daunt them. Any engineer who takes one will make lots of mistakes, become frustrated, and spend lots of time fixing it. Further, they could use the same time to fix three problems that they already know how to fix. So, they wait and see if someone else will volunteer to take this particular problem.

Enter Brent, a guy who just wants to get stuff done. He may be highly educated, or he may be just out of high school, but he has drive. He'll volunteer for anything, either because he's just trying to be helpful, or because he's trying to show that he is valuable, or simply because no one else will take the ticket. He's a first-rate volunteer, a real go-getter. And he's about to become a dumping ground for all sorts of problems.


Brent sees that others do not feel comfortable hacking at a problem and helpfully, proudly, or even meekly, volunteers. It takes him a while, but he eventually resolves the issue and everyone's happy.

Brent starts taking on so many of these new issues that people just assume it's "Brent's job" or "Brent's niche" to take on new or unusual tasks, and so come to expect him to pick up the slack.


The Returning Bug (Uh-Oh)

 



Brent eventually takes on enough issues that a few of them reappear, or new issues appear which are related to older ones. This time, something in the server environment changed, which breaks a service that Brent has previously worked on. Everyone instantly sees this server is broken and remembers how hard Brent worked on this problem last time. They think that he'll be able to fix it faster than they could, and so it is assigned to him. He may even volunteer to take the issue, trying to be helpful. Brent fixes it for the team. He learns even more about the server than he knew from the last time he fixed it. The problem is, others on his team know even less, because they stopped worrying about the server once the issue was given to Brent.

The Silo Effect

 

A long time passes with this process in place. It is now too difficult for all of Brent's knowledge to be documented. There's simply too much to write down, and quite frankly, too many fires that Brent needs to fight. When problems arise that have to do with "Brent" servers, the only one who can fix them is Brent. He is now considered a "crucial resource". It is seen as a waste of Brent's valuable time to document issues already fixed, when there are so many current issues that only he can fix. There simply isn't time.
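One way to spot a silo before it hardens is to look at who has actually resolved tickets for each system. The following is a minimal sketch, not something from the book: the `find_silos` function, the `(system, engineer)` tuple format, and the ticket history itself are all hypothetical, and a real issue tracker would need its own export step.

```python
from collections import defaultdict


def find_silos(resolved_tickets):
    """Return {system: engineer} for systems only one engineer has ever fixed.

    `resolved_tickets` is an iterable of (system, engineer) pairs; this
    format is an assumption made for illustration.
    """
    fixers = defaultdict(set)
    for system, engineer in resolved_tickets:
        fixers[system].add(engineer)
    # A system with exactly one fixer is a candidate "Brent" server.
    return {
        system: next(iter(people))
        for system, people in fixers.items()
        if len(people) == 1
    }


# Hypothetical ticket history: (system, engineer who resolved the ticket).
history = [
    ("mail-server", "brent"),
    ("mail-server", "brent"),
    ("build-farm", "brent"),
    ("build-farm", "alice"),
    ("payroll-db", "brent"),
]

silos = find_silos(history)
print(silos)
```

In this made-up history, the mail server and the payroll database have only ever been fixed by one person, so they are the ones to worry about when he goes on vacation.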

Defusing

The Phoenix Project goes into some good detail about how to defuse this situation. As a recap:
  1. Limit Brent's work-in-process by using the kanban process.
  2. Get engineers shadowing Brent.
  3. Disallow Brent the keyboard. His shadows must do all the work.
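The first step in the list above, limiting work-in-process, can be sketched as a lane that refuses new tickets once it is full. This is only an illustration under assumptions: the `KanbanLane` class, the `WipLimitExceeded` exception, the limit of 2, and the ticket names are hypothetical and not taken from the book or from any kanban tool.

```python
class WipLimitExceeded(Exception):
    """Raised when someone tries to pull work into a full lane."""


class KanbanLane:
    def __init__(self, owner, wip_limit):
        self.owner = owner
        self.wip_limit = wip_limit
        self.in_progress = []

    def pull(self, ticket):
        # Refuse new work until something in progress is finished.
        if len(self.in_progress) >= self.wip_limit:
            raise WipLimitExceeded(
                f"{self.owner} already has {self.wip_limit} tickets in progress"
            )
        self.in_progress.append(ticket)

    def finish(self, ticket):
        self.in_progress.remove(ticket)


lane = KanbanLane(owner="brent", wip_limit=2)
lane.pull("TICKET-1")
lane.pull("TICKET-2")

try:
    lane.pull("TICKET-3")
    refused = False
except WipLimitExceeded:
    refused = True  # the third ticket has to wait

lane.finish("TICKET-1")
lane.pull("TICKET-3")  # now there is room
```

The point of the limit is that work queues up visibly in front of Brent instead of silently piling onto him.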

Prevention

In order to defuse Brent, you must put multiple shadow engineers on him; with a single shadow, that shadow simply becomes "just another Brent" if the original quits. More than one engineer at a time must know how to fix any given problem in their area of responsibility, and ideally in areas beyond it.



A great way to ensure this from the start is to institute round-robin assignment of issues to members of the team. Rules:

  1. Issues are assigned to members of a team in a round-robin fashion. This way everyone knows about everything.
  2. If an engineer needs help on an issue from another engineer, he may request that the other engineer document the process and send him a link to the wiki page, but only the engineer assigned the ticket, issue, task, or subtask may actually do the work. This way, everything currently being worked on is also being documented.

Rule 2 above is especially useful. If the wiki page is unclear, the author must clarify it so the assigned engineer can do the work. If a particular engineer is called on to write multiple wiki pages, it is probably a sign that he has been a "Brent" in the past, so Rule 2 naturally protects against such volunteer warriors. If everyone already knows how to fix a particular issue, an engineer who needs help on it can ask the whole team, or a more or less random teammate. The main point is that whoever holds the knowledge is asked to write it down, not to do the fix.

Rule 1 may seem inefficient, since some engineers naturally find it easier to solve some problems than others. It is not meant to be efficient; rather, it gives a team resilience. I have rarely seen an IT team with efficiency problems, but I have yet to find one without resiliency problems. Lots of IT professionals will brag about how many VMs they can manage, or how fast they can fix a problem, but those same IT techs will tell you horror stories about when something went down and Brent was on vacation. IT organizations that follow Rule 1 will find that the tradeoff between efficiency and resiliency is worth it. Bugs will be fixed more slowly, but they will always be fixable, come rain, shine, or even (gasp!) a sick Brent.
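The two rules above can be sketched in a few lines. This is a rough illustration, not a real ticketing integration: the `RoundRobinDispatcher` class, its method names, the team, the ticket IDs, the wiki paths, and the threshold of two pages for flagging a possible Brent are all assumptions made for the example.

```python
from itertools import cycle


class RoundRobinDispatcher:
    def __init__(self, engineers):
        if not engineers:
            raise ValueError("need at least one engineer")
        self._rotation = cycle(engineers)
        self.assignments = {}  # ticket -> assigned engineer
        self.wiki_pages = {}   # helper -> list of (ticket, wiki page)

    def assign(self, ticket):
        # Rule 1: the next engineer in the rotation gets the ticket,
        # regardless of who knows the system best.
        engineer = next(self._rotation)
        self.assignments[ticket] = engineer
        return engineer

    def may_work_on(self, engineer, ticket):
        # Rule 2: only the assignee does the hands-on work.
        return self.assignments.get(ticket) == engineer

    def record_help(self, ticket, helper, wiki_page):
        # Rule 2: a helper contributes documentation, not keystrokes.
        self.wiki_pages.setdefault(helper, []).append((ticket, wiki_page))

    def likely_brents(self, threshold=2):
        # Engineers asked to write many pages probably hold siloed knowledge.
        # The threshold is an arbitrary value chosen for this sketch.
        return [h for h, pages in self.wiki_pages.items() if len(pages) >= threshold]


dispatcher = RoundRobinDispatcher(["alice", "bob", "brent"])
assigned = [dispatcher.assign(t) for t in ["T1", "T2", "T3", "T4"]]

# Alice and Bob both need Brent's knowledge, so he writes it down.
dispatcher.record_help("T1", helper="brent", wiki_page="wiki/mail-server-restart")
dispatcher.record_help("T2", helper="brent", wiki_page="wiki/payroll-db-failover")

print(assigned)
print(dispatcher.likely_brents())
```

Here the fourth ticket wraps back around to the first engineer, and Brent shows up in `likely_brents()` because two different tickets needed a wiki page from him, even though he never touched the keyboard for either.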

EDIT: An earlier version of this article used "Brett" instead of "Brent". The character in The Phoenix Project is "Brent", so the name "Brett" in this article was changed to match the character, consistent with what I was trying to do on the first draft.
