-
Treating OPS Teams Like Product Teams
Platform as a Product
Operations is one of those areas that many people in the company struggle to fully understand. The depth and breadth of responsibility varies per organization, with production support being the only thread you find consistently across companies. (And even that is changing as “you build it, you run it” becomes popular)
But infrastructure operations is too important a component to be relegated to the annals of cost-center accounting. Smart organizations understand this and invest heavily in Operations teams. My job as a leader is not only to evangelize what it is we do, but to tell a story that’s relatable to stakeholders so that they understand how our role impacts their day-to-day lives. Most people don’t spend a lot of time thinking about disaster recovery, high availability or even ongoing maintenance of the things we build. Everything that is built and operating has some sort of maintenance cost associated with it. For many, it’s easy to think that once a product is launched, it just exists on its own with no real need for future management. But software is never finished, just abandoned.
With this lack of clarity on the value my team brings, I’ve been working through different ways to more effectively evangelize what it is we do. This led me to the idea of managing the Operations team like a product team: using similar techniques and roles, and producing similar artifacts as part of how we manage what we do.
Over the next few months I’ll be working to make this shift within my team and chronicling some of the experiences we have, the challenges and the thoughts around this transformation. I’m still working on a way to tie these things together into some sort of easily searchable series, but know that this won’t be the end of the conversation. I’ll have some sort of tag to use across the entire series.
What’s the Product?
The first thing I had to ask myself when I cooked up this idea is: exactly what is the product that I’m “selling”? That question was actually easier to answer than I thought. At Basis we’ve been pretty adamant about building as much of a self-service environment as possible for engineers. Unfortunately the self-service approach never took on a polished, holistic view the way a product would. We would solve problems using a familiar set of patterns, but we’d never actually think about them from the perspective of a product. What you end up with is a bunch of utilities that look kind of the same, but not enough for you to make strong assumptions about their behavior, the way you might with, say, Linux command line utilities.
With Linux command line tools, whether you realize it or not, you make a bunch of assumptions about how the command functions. Even if you’ve never used the command before in your life, you know that it probably takes a bunch of flags to modify its behavior. The flags are most likely in the format of “-” or “--”. You know that the output of the command is most likely going to be text. You know that you can pipe the output of that command to another command that you might be more familiar with, like grep. Leveraging these behaviors almost becomes second nature because you can count on them. But that didn’t just happen naturally. It took a deliberate set of rules, guidelines, expectations, etc. This is what’s missing from my team’s current approach to self-service.
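To make that concrete, here’s a small illustration of those conventions at work. (There’s nothing special about these particular commands; they’re just familiar examples.)

# Flags modify behavior: short flags use "-", long flags use "--".
ls -l --human-readable /var/log

# Output is text, so it can be piped to a command you already know, like grep.
ps aux | grep nginx

None of this requires reading a manual first, because the conventions hold across tools. That’s the kind of predictability a deliberate product mindset can bring to internal tooling.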
So back to the original question of “What’s the product?” I’ve been working on a definition that helps to frame all of the questions that follow, like product strategy, vision, etc.
What is the product?
A suite of tools and services designed to support the creation, delivery and operation of application code through all phases of the software development lifecycle.
It’s a bit of a mouthful at the moment and I’m still tooling around with it a bit but I think it’s important to conceptualize what the product is that Operations is selling. The best analogy I’ve been able to come up with is the world of manufacturing.
As an inventor or product creator, you might design your product in a lab, under ideal conditions. But you have no idea how to mass produce it. You have no idea how to source the materials effectively for it. You have no idea the nuanced problems that your design might create when you’re attempting to create 200,000 of whatever you created.
If you’re a solo creator, you’d probably start talking to a manufacturer so that you can leverage their expertise as well as their production facilities to help turn your dream into a reality. If you work for a large enough company, you might have your own internal manufacturing team that specializes in various types of product creation. This is the analogy for operations. We take application code that the developers have created, then using our infrastructure and processes, get it to a state that’s production ready.
I’m sure given a little scrutiny this analogy will show some holes, but I think it does a good job of at least getting people in the mindset for viewing infrastructure and the supporting services as a product. The final product of a manufacturing line is a blend of design and production, similar to the way the quality of the application is a blend of design and production.
How does this change the way we look at ProdOps?
I can imagine many people are reading this and thinking “What’s the big deal? So it’s a product. How does that change anything?” Depending on your team, it might not change anything. But for many groups, once you start looking at operations as a product team, it really starts to change your perspective on the management of your infrastructure. But most importantly, if we get our minds right, thinking about Operations as a product opens us up to a world of best practices, workflow management techniques, reports and communication patterns, just to name a few. A perfect example is the idea of user personas.
In the operations world, we have a vague idea of who our “customers” are internally. Not only that, but we have a very specific idea of what a developer should know and care about. Our expectations manifest themselves in how we interact with developers. Our forms, our workflows, our RTFM approach: all are based on our elevated expectations of developers. But if we approach this from a product-centric viewpoint, we’re forced into a customer-centric viewpoint as well. Nobody would tell their customers “you should just be more sophisticated” or “you should just know that”. It wouldn’t be great for sales. This is one of the reasons why product teams develop user personas as a way to represent their target customer. They might even create multiple user personas to represent the breadth of their potential customer base, as well as how those personas might use the same tool differently than other customers. User personas are in no way revolutionary, but thinking in terms of a product makes their adoption in an operations setting a much more natural transition.
Wrap up
As I mentioned previously, this is really a big experiment on my part. At the time of this writing, I’m very early in the process. But I hope to use this blog to share parts of the journey with you. Hopefully you’ll be able to learn from some of my missteps.
In the next part of this series, I’ll be writing about the establishment of the product vision, product strategy and product principles and how they play their parts in building the roadmap for the Operations infrastructure.
-
Organizing Tickets for OPS Teams Part 2
In my previous article I laid out some of the groundwork for how I set up my team’s workflow management. In this article I’ll go a little deeper, specifically around ticket types and my labeling process, in order to get more data from our ticket work so that I can effectively manage the team.
Ticket Types
As previously mentioned, my team uses JIRA for ticket management. Any ticket system worth a damn will have some concept of ticket types, so the lessons presented should still be applicable. That said, I’ll be writing directly about my JIRA experience, so your mileage may vary.
The first thing I consider when deciding what ticket types to create is how I want to report on this data in the future. If I don’t care about the difference between a Defect and a User Story, there may not be much value in separating the two ticket types. With reporting in mind, I go about laying out the different ticket types I want as my first layer of reporting.
- ProdOps Tasks — This ticket type is designed for end users (developers, QA staff, etc.) who need support from my team for something that is needed in “quick” fashion. Quick might be minutes, it might be days, but the important thing is that it can’t wait for my team’s normal iteration planning process to happen. This is interrupt-driven work. As a result, the workflow for ProdOps Tasks has these tickets skip over the backlog and land directly in the Input Queue.
- Stories — These are larger requests that are going to take time, planning and effort. They might come from customers (again, developers, QA staff, product owners etc) but they’re often generated from within our team. Stories are always capable of being scheduled and therefore go directly to the Backlog upon creation.
- Defects — When a piece of infrastructure or automation that my team supports isn’t working as intended but is not blocking a user’s ability to do their job, we mark this as a defect. An example might be that our automation does an unnecessary restart of the Sidekiq Service, which results in a longer environment creation process. It is a pain for sure, but the user will live. It’s still something we should address, hence the defect ticket. Defects go directly to the backlog.
- Incidents — When a problem is occurring, there’s no workaround and there’s a direct impact to a group of people’s ability to work, that’s considered an incident. An incident exists regardless of the environment it happens in. (No matter the environment, it’s always production for somebody) Incidents skip the backlog and go straight to the input queue. Incidents are often generated automatically via PagerDuty since all of our alerting happens through the Datadog/PagerDuty integration.
- Outage — When we have large system wide outages we create an outage ticket to track the specifics of the larger impact. Because incidents are generated by alerting, when there’s an outage we will often have multiple incident tickets that are all related to the same problem. The outage ticket allows us to relate all of those tickets to a master ticket, as well as use the outage ticket to track the specific timings and events of the larger incident. Outage tickets are generated manually at the declaration of an outage.
- Epics — I use epics to tie multiple stories into larger efforts. I also use epics as a way to communicate what the team is working on in a higher level fashion to my management. My boss doesn’t care that we’re working on moving away from the deprecated “run” module in Salt Stack. (That’s too low level) Leadership wants larger chunks of work to understand what’s happening on the team. Having an epic with a business level objective at its definition is much easier for leaders to follow and understand.
Each of these ticket types was created with two primary things in mind:
- How do I want to report on tickets?
- How do I want these tickets to behave as it relates to the backlog and input queue?
How do I want to report on tickets?
I create the ticket types based on how I want to report. ProdOps Task tickets were created to give me an understanding of not only the demands that other teams are placing on my team, but also the urgency of those demands. These range from something material like “Need help with a new Jenkins Pipeline” to something routine like “New hire needs access to Kubernetes.” Having these types of requests separated into their own ticket type allows me to very easily create reports around them. (Even with JIRA’s horrible reporting abilities)
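As a sketch, a JQL filter for that kind of report might look like the following, assuming the type is named “ProdOps Task” in JIRA and using the same project name as the label filters shown later in this post:

project = "Prod Ops Support" AND issuetype = "ProdOps Task" AND created >= startOfMonth(-1) ORDER BY created DESC

A saved filter like this becomes the basis for a dashboard gadget or a quick monthly review of interrupt volume.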
Stories and defects, when compared to incidents and ProdOps Tasks, allow me to get a sense of how much planned work the team is doing versus work that bullies its way into the queue and demands our immediate attention.
Something to consider about ticket reporting: it can be an inexact science. Much of it is subjective when you start looking at the details of a ticket. The thing to keep in mind with this sort of reporting is that we’re looking at the data for themes, not for precision. Do I care that I had 3 tickets get categorized incorrectly as defects? Not when 60% of my tickets are defects. The 60% number (if true) helps to draw my focus. When it comes to reporting, look for a signal, but then validate that signal. Don’t just assume the data is accurate and start making changes. It’s just too difficult to keep the data completely accurate, so you should always look at your ticketing reports through that lens.
How do I want these tickets to behave as it relates to the backlog and input queue?
Tickets that are too urgent to go through the planning and prioritization process need to be made available to the team for work immediately. By creating those as separate ticket types, it’s easy for me to create a different workflow that allows these tickets to jump straight into the Input queue. I can also add functionality to flag these items or take other actions to raise their visibility to the team. But the ticket type drives my ability to handle them differently.
Different ticket types for end users to leverage also make it much easier for them to interact with us as a team. Almost exclusively, we tell our users to create their tickets as ProdOps Tasks. The majority of the time, they’re items that need to be addressed sooner rather than later. In the cases where their tickets actually can be scheduled, we just convert the ticket to the appropriate ticket type (based on our reporting needs) and move it to the backlog for the next planning meeting. This spares the user the anxiety of choosing the wrong ticket type. Create it as a ProdOps Task and we’ll do the rest.
Ticket types can go a long way in helping you to create meaningful reports on the activity of your teams. They also give you a way to slice your workload to see how different areas are impacted. The average time to close a ticket might be 14 days, but then you find out that if you separate that by ticket type, the incident tickets are the outlier for resolution time. Maybe your team isn’t consistent about closing those particular ticket types for some reason. Or perhaps the automation that you use to resolve the tickets through monitoring isn’t setting the “Resolution” field on the ticket appropriately.
Sometimes, though, you want a level of reporting that goes beyond what ticket types allow for. This is where I use labels.
Using Labels for Reporting
Labels are pieces of metadata that you can add to tickets to give them a bit more description. The beautiful thing about labels (and metadata generally) is that they’re so flexible. The horrible thing about labels is that they’re so flexible.
The reporting on labels in JIRA isn’t the greatest, but the pain of pulling this data into a separate tool and figuring out the JIRA data model is much higher than just dealing with the reporting shortcomings, so here we are. When it comes to labels, relying on team members to always label tickets has varying levels of success. Some team members will be extremely diligent about it while others will be more lax. It’s good to have a process where you can validate that labels have been applied to tickets appropriately.
The issue I find with labels is that it can be difficult to know whether the label is just missing on a ticket or whether that ticket doesn’t meet the criteria for the label. In order to combat this, I’ve designed my label strategy so that I understand what my label is trying to communicate, and I ensure that the positive label (i.e., this ticket matches the criteria) has an opposite label denoting that it doesn’t meet the criteria. For example, a label that I want all my tickets to have is whether the ticket was a PLANNED ticket, meaning the team decided when it would be done, versus an UNPLANNED ticket, which had its schedule forced on us for one reason or another. Instead of just having a “PLANNED” label for those tickets, we also use an “UNPLANNED” label for the others. This way I can always know whether a ticket was processed (for this criteria at least) because it should have one of these two labels.
Processing Tickets for Labeling
For the labels that I absolutely want to ensure every ticket has, I create filters to identify tickets that do not have those labels. For example, my planned/unplanned filter looks like this:
project = "Prod Ops Support" AND created >= startOfYear() AND (NOT labels in (UNPLANNED, PLANNED, TEST-TICKET) OR labels is EMPTY)
This will give me a list of tickets that haven’t been labeled yet. Using the Bulk Change tool, I can quickly scan through the tickets, “check” the items that I consider UNPLANNED, and then apply the label to every ticket I’ve selected.
After going through the Bulk Edit wizard and adding the label, the query should now return fewer results, since we’ve updated all of the UNPLANNED tickets. Now we can select all of the remaining tickets and add the PLANNED label to them. Repeat the same process with the Bulk Change tool and you’re good to go.
NOTE: Make sure you disable notifications for your bulk edit change, or lots of people will be frustrated with you.
I repeat this process for all label sets that I want to add. Each label set has a query similar to the one I used for PLANNED/UNPLANNED tickets which allows me to quickly identify tickets that need to be processed.
Another label pair I add is TOIL/VALUEADD. This identifies which tickets are work that we shouldn’t be doing as a team and need to automate or transition to another group. An example of TOIL work would be user creation.
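The processing filter for this pair follows the same pattern as the PLANNED/UNPLANNED one; a sketch, assuming the same project and test-ticket label:

project = "Prod Ops Support" AND created >= startOfYear() AND (NOT labels in (TOIL, VALUEADD, TEST-TICKET) OR labels is EMPTY)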
All of this might sound like a lot of work, but I assure you I spend no more than 15 minutes per week doing this type of labeling work. I do it on Monday mornings every week in order to keep the volume relatively low. And again, the aim for me isn’t 100% accuracy, but to get the broad strokes so that I can see the signal start to bubble up.
Wrap Up
Now that I’ve explained my ticket types as well as my labeling process we can discuss the different types of dashboards that can be built in a future blog post.
-
Change is Scary, Even When It’s Fun
Second-order thinking can help us evaluate the consequences of our consequences
One thing I’ve learned since having children is just how early in life many of the faults in humanity show up. Children are reflections of ourselves but in the purest form. When children reveal behaviors like greed, biases and violence, it makes you start to view these behaviors as a natural part of human nature that can only be controlled through societal norms.
I say these things to prepare you for the fact that you are not immune to these behaviors. None of us are immune to biases and it’s easy to accept the reality our biases create. Biases can also hide our true motivation for taking (or not taking) a course of action.
My children are at the age where we can play video games together, which is way better than playing with Paw Patrol action figures. The artificial world of video games is starting to reveal the dark underbelly of human behavior. When I see this darkness manifest in my children, it makes me look at these behaviors critically. In this post I’ll talk about an experience I shared with my daughter Ella, who is 10 years old, and the parallels I see in the workplace.
The Video Game
Satisfactory is a video game where players work together to extract resources from an alien planet and build various components out of those resources for their employer, FICSIT. To do this, the players move through a series of improving capabilities and skills that allow them to build factories to automate a lot of this work. Factories are a collection of machines that automate tasks in a pipeline-like fashion, starting with one type of input (e.g. iron ore) and ending the pipeline with a type of output, like iron rods.
One of the key tasks early in the game is making fuel for your generators. Generators are used to power the other components of your factory. My daughter Ella’s very first factory was created to produce bio fuel, which is the most efficient type of fuel in the early stage of the game. In order to make bio fuel, Ella created a factory pipeline that would take leaves and grass, convert them into bio mass, and then convert that bio mass into bio fuel.
When she built the factory, she had the idea of keeping 50% of her bio mass as-is and storing it, and then sending the other 50% down the pipeline to be converted into bio fuel. Early on this technique made sense, but over time we realized that anything that would take bio mass as a fuel source would also take bio fuel as a fuel source. The difference is that bio mass burns a lot faster, so a generator might consume 18 units of bio mass per minute, but only 4 units of bio fuel per minute for the same power output.
Recommending change to the way things are done
Once I realized that bio fuel could be used in everything, I suggested to Ella that we just focus her factory on creating bio fuel instead of storing 50% of our bio mass as is. With many different factories running, you can spend a lot of time making sure your generators are fueled. Having to fuel them less often is a huge productivity boost for your game play. To my surprise, Ella was very resistant to the idea. Like any proposed change in any setting, Ella had a laundry list of defenses for why things should remain the same.
“We might need bio mass later in the game” was her first retort. A fair one for someone not familiar with these types of gameplay loops. But I leaned on my 20+ years of experience playing these types of games to try to explain to her why this wasn’t likely. I explained how the gameplay progression typically has us moving forward and that it wouldn’t be long before we probably wouldn’t be using bio fuel either. And bio mass is so easy to acquire that it wouldn’t be a problem if we needed to build a new factory later. But sometimes experience isn’t convincing to people.
“But there’s no downside to us just storing it” came next from her. That’s true, except it’s horribly inefficient. We almost never opt for bio mass unless we’re out of bio fuel. And often what would happen is that the storage container we used to store the bio mass would fill up, which would force us to convert it to bio fuel anyway to make space in the container. But this was a manual process, so again it hit our productivity and the productivity of the factory as a whole. Inefficiency, though, can get so embedded in the process that people just live with it because it seems easier than the alternative.
“Bio mass is just as good as bio fuel.” Here’s a scenario where I thought data would surely win the day. As I mentioned earlier, the game tells us the burn rate of fuel types. Bio mass burns at 18 units per minute while bio fuel burns at 4 units per minute. Each generator can accept a stack of 200 of either fuel type. Doing the math means we need to refill bio fuel generators every 50 minutes, but bio mass generators roughly every 11 minutes. I thought the data would make this an easy conversation, but if you work in any office setting, you probably already know where this is going.
“I don’t know if that data is right.” Now she challenges the validity of the data provided by the video game developers. I don’t know if she’s thinking there’s a global conspiracy against the bio mass industry or if the developer is staffed by activists pushing an agenda. She claims that when she watched the burners, it felt like both fuels burned for around the same amount of time. Now I’m starting to lose my patience a little bit.
“I just don’t think it’s worth changing the entire factory for this.” We’re finally getting to the root of the issue now! She just doesn’t feel like doing the work. I don’t think the work is actually that much, but I’m a bit more experienced than she is, so I can see how she might think it’s a bigger task. I offer to do it for her. Finally, the last wall of resistance crumbles. She agrees to the change and decides she’ll implement it as soon as she finishes a few high-priority factory tasks, one of which, ironically, is refueling a bunch of bio mass burning generators.
Implementing the change
Ella implemented the change in the most efficient manner possible. The conveyor belt that carries the bio mass ran into a conveyor belt splitter, sending half the bio mass to storage and half to be converted into bio fuel. She opted to just delete the conveyor belt that shipped the bio mass into the storage container. One minor tweak, and suddenly the reality we had been fighting about finally came to fruition. We’re only producing bio fuel, and we’re producing it at a much higher rate because all of the bio mass now feeds fuel production instead of half of it being split off to storage.
If you’ve been reading this from the perspective of an employee at a company, a lot of this probably resonates with you. Remove video games, replace them with whatever it is your company does, and you’ve probably had a lot of these very same conversations with co-workers. And it’s easy to assign laziness, ambivalence, lack of empathy or any other host of adjectives to describe that co-worker’s work ethic.
The case with my daughter is the ideal scenario. The entire exercise was one of fun and recreation. The work that needed to be done was literally part of the game loop, the very thing that makes the game fun. The task was completely owned by Ella from beginning to end, so she could implement it any way she wanted. Despite all these things going for it, resistance still crept in. Why? Because it’s not the work, it’s the change.
Change is a funny thing for some people. It brings in uncertainty and doubt for the future. The devil you know versus the devil you don’t. Dealing with the inefficiencies of the current factory was a lot easier for Ella to get her head around than the potential problems that could be created by redesigning the factory from the ground up. What if she ran out of materials during the rebuild? What if she couldn’t get the pieces lined up properly? What if we ran out of fuel in our generators while the fuel factory was being rebuilt? I’m sure all these things were swirling in her mind at a subconscious level, which then consciously manifested themselves as resistance to change, with a set of adopted biases to justify that resistance.

Confirmation bias is what happens when we interpret information in a way that confirms or supports a set of prior held beliefs. It’s what allowed Ella to replace hard data with her general feeling of how fast fuel burned. Keeping an eye out for when we might fall victim to confirmation bias is a part of being “data driven”. I put that in quotes because many people and organizations are “data driven as long as it supports what I wanted to do anyway”, which isn’t exactly the same thing. Confirmation bias plays a huge part in that mindset.
Chesterton’s Fence
Another observation I made was how the factory was left in a modified state that might not make a ton of sense to the next set of factory workers. With the intent of the factory going from making bio mass and bio fuel to just making bio fuel, many of the components of the factory don’t serve a functional purpose anymore. We have a conveyor belt splitter that doesn’t split to anything. We have a storage container that isn’t connected to the factory at all anymore. We have an extra storage container in the pipeline that doesn’t make sense with just a single fuel type being produced. If I were a new employee at this factory, I’d be a little baffled as to why these things exist. This made me think of Chesterton’s Fence and how it plays into our comfort levels when making changes.
Chesterton’s Fence is a concept of second-order thinking where we not only think about the consequences of our decisions, but the consequences of those consequences. The phrase comes from the book The Thing by G.K. Chesterton. In the book, a character sees a fence but fails to see why it exists. Before removing the fence, he must first understand why it was put there in the first place.
As a new factory worker trying to make the fuel factory more efficient, I might be confused by these extra components scattered about the system. What if they had a purpose that I’m unaware of? What would removing these things from the system do? Their uselessness seems so obvious that removing them becomes an even more daunting task, because you have no idea why they exist.
This is a common problem we see with hastily implemented changes. The change is designed to deliver the value needed now as quickly as possible, but sometimes at the expense of clarity for future operators of the system. Thinking about the consequences of our consequences can create a more sustainable future but, at the same time, put more work on our plates in the present.
Wrap up
This post ended up going on way longer than I expected, and if you’ve reached the end you deserve a cookie or a smart tart or something. The parallels in behavior between my video game-playing daughter and senior people in large organizations are startling. The truth is these behaviors are our default state of mind. Only with the awareness of our faults can we improve.
Some key takeaways from this lesson for me are:
- Biases exist early on in life and you’re not immune to them.
- Keep an eye out for confirmation bias. It can make you believe some crazy stuff.
- People fear change, even in the most optimal of situations.
- Second-order thinking can help us evaluate the consequences of our consequences. It also pressures us to understand the intent behind something before we go about changing it.
This post was a bit off the beaten path, but seeing these behaviors in my daughter, whom I love and who is perfect, gives me room for critical thought about humans in the workforce.
-
Organizing Tickets for OPS Teams — Part 1
Ticket management is one of those boring topics that comes up every now and again in OPS circles. A lot of teams that I’ve chatted with try to model their ticket management process after the development process, using Sprints/Scrum. I’ve found Scrum to be limiting in an Operations setting. The amount of unplanned work that comes into the queue for OPS teams makes it imperative that your workflow accommodates and expects that work. In this first of several posts, I’ll talk about how my team manages its work.
I should start with a little sales job on why you need tickets. It goes beyond just tracking your work. It’s about making your work visible to you and your team, but also to others around you who have a vested interest in what you’re working on and when you intend to work on it. A Kanban board can help to organize and communicate what the team is currently focused on. There are plenty of great posts about how Kanban works and its goals, so I won’t dive too deep into that. I’ll just highlight a few key points.
- Limit the Work in Process (WIP) at any one time
- Make sure all work is visible and has a ticket associated with it
- Work should flow left-to-right through the process
Limiting Work in Process
One of the key tenets of Kanban is to make sure you’re limiting the amount of work in process at any one time. The knee-jerk reaction is to pull in more work to increase the throughput of the team, but it’s counterproductive. Little’s Law speaks to this particular phenomenon well: with throughput fixed, the more items you have in process, the longer each item takes to finish. The best way I’ve found to limit WIP is to limit how many tickets each person can have in process at any one time.
For my team, we’ve opted for a maximum of two tickets per person in process. This allows engineers to hop between tickets in the event one of their tickets is blocked and it’s beyond the engineer’s ability to unblock it. This limit also helps us to gauge how many tickets we can handle at any one time in the input queue. (More on that later)
The Backlog
Like Scrum/Sprints, Kanban workflows have the concept of a backlog. The backlog is a queue of work that you may or may not deliver. When it comes to the backlog, there are no firm commitments.
I’ve seen some Kanban boards where the Backlog is the left-most column on the Kanban board. Personally, I prefer not to display the Backlog at all on the Kanban Board, saving it for a separate board. The reason is that humans have short attention spans.
I want my team laser-focused on the things that are in the input queue, because those are the things we’ve given priority to as a team. With the backlog visible, it’s too easy for someone to see a ticket they think is important and pop it into the work queue right away. The problem with this self-prioritization? Something else stops getting worked on. I know we all believe that multi-tasking is a thing, but it’s not. This leads to missed commitments, more work in the queue than is necessary and confusion from your stakeholders as certain items seem to jump the line without explanation. (Mainly the work engineers prefer to work on) Removing the backlog from the primary working Kanban board helps to stop this from happening.
Another benefit to hiding the backlog is the amount of noise it reduces. Backlogs always grow. Even a well-groomed backlog can be intimidating to teams. You don’t want the crushing weight of expectations constantly in the face of the team. There’s no sense of progress when you see an ever-growing queue to the left of your screen. Just think of your own personal to-do system and you’ll get that feeling of dread creeping over you. Protect your team from that feeling. Hide your backlog.
Prioritizing the Backlog
Now that I’ve safely hidden the backlog, my next step is to prioritize it. In my current role we use JIRA for ticket management, which allows me to easily order the tickets in the backlog visually. The ordering updates a ranking value on the ticket, which is how JIRA keeps track of priority internally. Keeping the backlog ordered by priority makes it easy to select what gets worked on next. Of course, priorities can change daily, so there’s a level of discipline that has to be exercised to keep the ordering honest and up-to-date. I prefer that new tickets get added to the bottom of the priority list, which makes it incumbent on me to re-prioritize the ticket if deemed necessary. If new tickets don’t automatically go to the bottom of the priority queue, you’ll find yourself in a last-in, first-out queue setup, which will eventually starve all of your older tickets.
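For reference, a rank-ordered backlog view in JQL might look like this; a sketch, assuming your backlog tickets sit in a Backlog status and using the same project name as my label filters:

project = "Prod Ops Support" AND status = Backlog ORDER BY Rank ASC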
The Input Queue
With the backlog safely tucked away and prioritized, the input queue becomes the left-most column on our Kanban board. The input queue holds all of the tickets that we’ve currently committed to for this iteration. What’s an iteration? For our team, an iteration is the cadence at which we make fresh commitments to new tickets. Every week, we try to commit to a new round of tickets to bring our input queue back to its maximum capacity. If we agree that we’ll commit to 10 tickets per week, at the end of the week we’ll replenish the input queue to get it back to 10. (Or sooner if we run out of tickets)
If you do a good job of keeping your backlog prioritized then it becomes really easy to populate the queue by just taking the top X number of tickets in the backlog and moving them to the input queue. Following this pattern, your input queue should also be ordered by priority. (There are several scenarios where that might not be true, which I will address in a subsequent post) Now your team members can begin pulling tickets from the top of the queue and beginning work on them.
The Columns
Each Kanban board has at minimum 3 columns that represent the phase work is in. They roughly fall into the categories of:
- To Do
- In Progress
- Done
For simple boards, that might be all you need. Personally, I like to have a little bit more information about where a ticket is in the workflow. More columns means a better idea of where tickets might be bottlenecking when the team starts to slow down. But the more columns there are, the more of a burden it can put on the team as they try to figure out the minutiae of where a task is. Unless there’s clear value in a column, avoid getting too detailed in the phases of a ticket. My team uses the following columns.
- To Do
- In Progress
- Waiting For
- Needs PR Approval
- QA/Verification
- Done
The titles of these categories are pretty self-explanatory, except for maybe “Waiting For”. This column is for tickets that are waiting on some sort of external information, time or action. For example, if we’re waiting for Saturday night because that’s when the approved maintenance window is, there isn’t much the engineer can do to move time forward. The ticket gets moved to the Waiting For column until we can implement the change. (I could probably eliminate the Waiting For column in favor of a flag status to indicate the ticket is blocked. More on that later)
Tickets will generally flow left-to-right on this Kanban board, showing progression towards being complete. Each phase is important (for my reporting anyway) with regard to where the ticket is in the process and how I can be of assistance. Do I need to wrangle people to get the PR approved? Is a ticket blocked waiting for someone to respond to an email? Has the change been released and we need another team to sign-off saying it’s complete? This flow gives me insight into where we’re at.
Something to keep in mind when you’re designing your workflow: you have to think about the data that you want out of the system, including the reports you intend to run. That will ultimately drive how you structure your system. If you don’t intend to report on or otherwise leverage a status, then in my opinion there’s really no need to have it as a separate status. Each of the categories I’ve listed above was created to express something I wanted to be able to report on or get a quick status of. This applies to everything in your ticketing system, not just statuses. Labels, ticket types, components, tags: all these things should be driven by some sort of reporting you intend to do.
Wrap Up
When I set off writing this I thought I’d get it all done in a single blog post. But this will clearly be something I need to write about over several blog posts. In my next post I’ll discuss the various issue types I use as well as additional swim lanes that can help to add context to tickets.
-
Ask the wrong people and you build the wrong thing
Not long ago my wife and I received an email from our kids’ school. (They attend a Chicago Public Schools (CPS) school.) The email was a survey of some sort that would be used to make decisions about the curriculum in next year’s school program. It always excites me when parents, teachers and administrators get to collaborate on school programs.
You can imagine my frustration as I clicked on the link to the survey and was greeted with a cryptic error message from Google Forms. I wish I could remember what the error said, but even as someone with a technical background, the error didn’t point to any specific action that could be taken to resolve it. Thanks to the pandemic, I’m well equipped to handle the idiosyncrasies of CPS’ implementation of Google Apps. I logged out of all my Google accounts and then logged back in with my daughter’s CPS email address, and I was granted access to the survey.
The question being asked was whether we would prefer more STEM classes or more Arts-related classes for extracurricular items next year. I quickly suspected that this was probably going to lead to a case of selection bias, as the people who figured out how to participate in the survey probably lean more technical than those who just gave up. Out of curiosity I asked a few people in my circle. The people who I’d consider technical poked around and figured things out, while other people just gave up, assuming that there was something broken on the site, which was a fair assumption given the generic nature of the error message.
This experience got me thinking about how often we make “educated” decisions based on poor information. CPS could think that they were implementing the wishes of their student community only to find out they were addressing a subset. I’ve fallen victim to this mistake myself.
When my team and I were designing our infrastructure platform at Basis, we had a tendency to talk to the loudest developers in the room. Those developers had very specific needs and requirements. But we failed to learn a lesson that I’m sure every product manager in the world knows: the loudest people aren’t always representative of the larger user body. This is exactly what we encountered as we built out our chat bot. Using feedback from the noisy developers pushed us towards a model where there were many different options for building environments and packages. Instead of creating a tight, streamlined process, we created different avenues for people to build and manage their environments. We supported custom datasets that were seldom used. We created different methods of creating environments, so maybe you only needed the database server, or maybe you wanted the database server and the jobs server. This created headaches for the people that didn’t want that functionality, which forced us to create omnibus commands that strung together multiple commands.
I’d really like to be angry at the developers for this, but the truth is the mistake was all my own. Developers, like anyone else, have different things that they’re attracted to. Some developers love to understand the stack from top to bottom and want configurability at every level. Others are wholly disinterested in infrastructure and want to just point to a repository and say “make an environment out of this”. Whatever my personal feelings are on how much or how little interest they should have, the reality is you’re probably not moving them from their stated position. And even if you do, it’s likely to take a very heavy hand, which will just alienate them.
The lesson learned is to make sure that you’re talking to the audience that you actually want to talk to. Who are you asking? How are you asking them? Think about how they might self-select out of your surveys or questions and see if you can mitigate that. Engaging with the entire audience means you might need multiple methods of interviewing people. Developers who respond to surveys might not be the same developers that will respond to 1-on-1 interviews. Don’t make the mistake of optimizing for a subset of your audience or user base. Put care and thought into reaching your target audience.
-
Benefits of Conferences
I love meeting new people at conferences, especially when they’re first-time conference attendees. One of my favorite questions to ask is “What did you have to do to get approval to attend?” The answer reveals a lot about their employer and the person’s direct manager.
In many organizations, conference attendance is seen as a transactional affair, with only specific line items in the transaction providing any sort of intrinsic value. These organizations saddle their employees with requirements that must be met in order to attend the conference: note-taking requirements, presentations to give when they return, required talks to attend while at the conference. These are just a few of the requirements I’ve heard of in my years of attending. It can be easy to dress these requirements up as “due diligence”, but in most cases I’ve come across, this level of rigor only seems to apply to conferences. What is more likely happening is that these organizations don’t see the concrete value they expect from attending conferences and therefore discount them. But conferences deliver an impact that can be clearly felt, even if their concrete value is difficult to calculate and put on a ledger.
The Hallway Track
Anyone who has attended a conference will tell you that the hallway track is often the most valuable part of the conference. The hallway track is the part of the conference that is unscheduled and unscripted. As people make their way from one talk to another, they inevitably bump into each other and start a conversation that slowly balloons into something larger. Sometimes the conversation is so interesting that you forgo your next talk in favor of this impromptu conversation in-between sessions.
The magic of these conversations is that they tend to take on a life of their own, bending and weaving with the desires of the participants. Something that starts as a follow-up question on distributed locking techniques can quickly evolve into questions that are deeper and more specific to your particular problem. And despite everyone’s desire to be special, conferences make you realize that most of us are solving similar sets of problems. Even if you don’t get a definitive solution out of these talks, I assure you that you’ll get a briefing on how not to solve the problem.
The hallway track has been difficult to replicate virtually. Since the onset of the pandemic, many groups have tried and found very inventive ways to imitate it, but there’s nothing quite like the real thing. Equally difficult is putting a dollar value on the track. There’s no time slot you can point at to show your boss why you want to attend. It’s something organic that evolves, but more importantly, that you have some semblance of control over. Yes, your mileage may vary, but that’s really the case for everything.
Introduction to new thoughts and ideas
Albert Einstein is often credited with the quote:
“We cannot solve our problems with the same thinking we used when we created them.”
This trap exists within engineering groups. They get locked into their standard way of thinking and can’t see how a different approach might work. “That would never work here” is a common retort to new ideas. But continued and expanded exposure to new ideas and their successful implementations makes people question the way they do things. Again, never be surprised by just how many people have the same problems you have. Unless you’re Facebook, Apple, Netflix or Google, your company probably has the same types of problems as everyone else. It’s hard to accept that you’re not a special, magical snowflake, but attending a conference can force that acceptance pretty quickly.
Sometimes these new ideas and approaches to your problem are not packaged in a flashy title that draws your attention. In my experience, some of the best tidbits of information come from talks that I would have never attended or watched on my own. But when I’m at a conference, there’s always a block of time that doesn’t have a talk that speaks directly to my problems. When attending a conference in person, I’m more compelled to attend a random talk in that situation. It’s incredible how often that random talk pays dividends. Would I have spent 45 minutes on that talk if I just came across it on YouTube? Probably not. But broadening the scope of what I hear and attend helps with problems that are not top of mind. Better yet, you realize that some of your underlying problems are related to activities, actions or systems that you hadn’t previously considered. Exposure to people, their problems, ideas and solutions helps to expand your thinking about your own problems.
Getting your company name out in the community
You might work for a small or medium-sized company that just isn’t on the minds of technical professionals. Attending conferences (and even better, speaking at them) helps to get your company’s name into the tech community. With remote work opportunities continuing to grow, the number of potential prospects skyrockets with conference attendance.
In addition to socializing the company name, you’re also socializing the company’s values by the fact that you have employees in attendance! You’d be surprised how valuable that can be to potential job seekers. I’m always surprised when I’m at a DevOps Days conference and I meet someone working at a bank or a hospital, industries that I associate with old-world thinking and mentalities. But talking to those attendees and hearing that their teams are experimenting with DevOps practices, using modern technologies and work management techniques helps to change my biased view of them.
Energizing your employees
The post-conference buzz is real. Once you’ve gotten all of this new information, you’re eager to see how it can be applied to your day-to-day work. Many people come back to the office with a basket of ideas, some of them completely crazy, but many of them completely practical and achievable. As a team you’ll have to figure out which are which. With the support of management, that buzz can be channeled into making real change and providing employees with immense job satisfaction as they do it.
Job satisfaction = Retention
No amount of healthy snacks, ping-pong tables and free soda can replace the joy engineers get when they can effect change.
Virtual attendance
A quick note about virtual attendance. During the pandemic, conference organizers tried very hard, with varying degrees of success, to replicate the in-person conference feel virtually. But regardless of how well conference organizers do this, remote conferences can be difficult.
For starters, networking virtually can be hard. It requires a level of intentionality on the part of either the conference organizers or you as an attendee. Chat rooms during a conference talk are a common way of trying to generate those networking opportunities, but they can distract you from the speaker. Hanging out in chat rooms after the talk can sometimes be effective, but again, it’s just not quite the same as in person.
Another thing to consider with virtual attendance is how you attend. Many people attend conferences virtually but remain logged into all of their usual modes of communication for work, which effectively means you’re working. Without a clear separation from your work duties, virtual attendance can give way to the usual pressures of the “office”.
These are just a couple of reasons why I favor in-person conferences to virtual conferences. Are virtual conferences better than nothing? Absolutely. But I caution you to not evaluate the value of conferences based solely on virtual conferences.
Wrap-up
Conferences can be a great resource for your employees to engage in the communities that they’re a part of. Networking is crucial to building relationships and knowledge, and that is an activity that is much easier to do in person.
Conferences help expose people to new ideas and new ways to solve problems other than the standard approach the company may take. When you attend conferences you quickly learn that your problems are not as unique as you thought. You’ll without a doubt run into people that have the same problems as you. You’ll probably even meet people who have tried the same solutions and can save you from a wasted journey.
Conferences also help to energize employees. You come back from a conference and you’re excited to experiment with a lot of the techniques and technologies you learned about.
If your company won’t send you to a conference, here are a few quick tips that might help.
- Some conferences have free tickets, especially for underrepresented groups. If you’re curious about a conference but can’t attend, definitely look into this option. I’ve seen some conferences even cover hotel and airfare.
- Speaking at conferences is another way to get into the event for free. Many conferences have a public Call for Proposals (CFP) process that you can submit to. Don’t think you need some crazy, mind-bending thing to give a talk about. Your personal experience, communication style and touch can’t be replicated and are something unique you have to offer. Try it out!
- Try to show the value of the conference to your management. Highlight why you want to attend the conference and some of the soft benefits beyond what watching the YouTube videos after the conference can provide. You can use some of the points highlighted in this article.
- Pay for the conference yourself. Be sure to talk to your manager and let them know that you’re willing to pay for the conference yourself if they can support you with time off and/or some help with the travel expense. This technique depends heavily on your personal situation and the size of the conference.
- Find a new job. (Seriously) I’m not suggesting you quit right now over it, but you might want to consider adding a question about conference attendance to your list of interview questions.
-
Authoring K8s Manifests
Note: This is an internal blog post that I wrote at our company. When I interact with people in the tech community, they’re often curious about how different teams think about these problems more broadly, so I thought I’d include this. The audience was internal Basis employees, so some of the references may not make sense.
There are no right solutions
As humans we’re obsessed with not making the wrong choice. Everything from where you go to school to whether you should order the chicken or the steak is besieged by the weight of making “the wrong” choice. But that framing suggests that right and wrong are absolutes, as if you could plug in all the variables of a given situation and arrive at a conclusive answer. This couldn’t be further from the truth. Not in life, and definitely not in engineering.
Choices are about trade-offs. Depending on what you’re optimizing for, one set of trade-offs seems more practical than another. For example, investing your savings is a good idea, but the vehicles you use to invest differ based on your goals. If you need the money soon, a money market account offers flexibility, but at the expense of good returns. The stock market might offer higher returns, but at the risk of losing some of the principal. Do you need the money in 2 years or 20 years? How much do you need it to grow? How quickly?
The economist Thomas Sowell famously said “There are no solutions, there are only trade-offs; and you try to get the best trade-off you can get, that’s all you can hope for.”
This statement holds true in software engineering as well.
Imperative vs Declarative Manifest Authoring
When it comes to Kubernetes manifests, there really is only one method of applying those manifests and that’s using a declarative model. We tell Kubernetes what it is we want the final state to look like (via the manifests) and we rely on Kubernetes to figure out the best way to get us to that state.
With Kubernetes, all roads lead to a declarative document being applied to the cluster, but how we author those manifests can take on an imperative bent if we want it to, using various template engines like Helm, Jsonnet or the now-defunct Ksonnet. But templating languages provide a power and flexibility that allow us to do some things that, given our past experiences, we probably shouldn’t do. Templating opens the door to impeding some of the goals we have around the Kubernetes project and the experience we’re specifically optimizing for. I’d prefer to stay away from templating layers as much as possible and be explicit in our manifest declarations.
What are we optimizing for?
In order to really evaluate the tools, we’ve got to discuss what it is we’re optimizing for. These optimizations are in part due to past experiences with infrastructure tools, as well as acknowledgements of the new reality we’ll be living in with this shared responsibility model for infrastructure.
Easy to read manifests to increase developer involvement
With the move to Kubernetes we’re looking to get developers more involved with the infrastructure that runs their applications. There won’t be a complete migration of ownership to development teams, but we do anticipate more involvement from more people. The team that works on infrastructure now is only 6 people. The development org is over 40 people. That said, the reality is that many of these developers will only look at the infrastructure side of things 4 or 5 times a year. When they do look at it, we want that code to be optimized for reading rather than writing. The manifests should be clear and easy to follow.
This will require us to violate some principles and practices like code reuse and DRY, but after years of managing infrastructure code we find that, more often than not, each case requires enough customization that the number of parameters and inputs needed to make code actually reusable balloons quickly and becomes unwieldy. Between our goals and the realities of infrastructure code reuse, using clear and plain manifest definitions is a better choice for us. We don’t currently have the organizational discipline to be able to reject certain customizations of an RDS instance. And honestly, rejecting a customization request because we don’t have the time to modify the module/template doesn’t feel like a satisfying path forward.
A single deployment tool usable outside the cluster
Because of the application awareness our current orchestration code has, we end up with multiple deployment code bases that are all fronted by a common interface. (Marvin, the chatbot) Even with Marvin serving as an abstraction layer, you can see chinks in the facade as different deployment commands have slightly different syntax and/or feature support. In the Kubernetes world we want to rely on a single deploy tool that tries to keep things as basic as
kubectl apply
when possible. Keeping the deploy tool as basic as possible will hopefully allow us to leverage the same tool in local development environments. In order to achieve this goal, we’ll need to standardize on how manifests are provided to the deployment tool. There is a caveat to this, however. The goal of a single method to apply manifests is separate and distinct from how the manifests are authored. One could theoretically author the manifests with a template tool like Helm, but then provide the final rendered output to the deploy tool. That would violate our other goal of easy-to-read manifests; I just wanted to call out that it could be done. Having some dynamic preprocessor that runs ahead of the deploy tool and commits the final version of the manifest to the application repository could be a feasible solution.
Avoiding lots of runtime parameters
Another issue that we see in today’s infrastructure is that our deploy tool requires quite a bit of runtime information. A lot of this runtime information happens under the hood, so while the user isn’t required to provide it, Marvin infers a lot of information based on what the user does provide. For example, when a user provides the name “staging0x” as the environment, Marvin then recognizes that he needs to switch to the production account vs the preproduction account. He knows there’s a separate set of Consul servers that need to be used. He knows the name of the VPC that it needs to be created in as well as the Class definition of the architecture. (Class definitions are our way to scope the sizing requirements of the environment. So a class of “production” would give you one sizing and count of infrastructure, while a class of “integration” or “demo” will give you another)
This becomes problematic when we’re troubleshooting things in the environment. For example, if you want to manually run
terraform apply
or even a
terraform destroy
, many times you have to look at previously run commands to get a sense of what some of the required values are. In some cases, like during a period of Terraform upgrading, you might need to know precisely what was provided at runtime for the environment in order to continue to properly manage the infrastructure. This has definitely complicated the upgrades of components that are long lived, especially areas where state is stored. (Databases and ElastiCache, for example)

Much of the need for this comes from the technical debt incurred when we attempted to create reusable modules for various components. Each reusable module would create a sort of bubble effect, where input parameters for a module at level 3 in the stack would necessitate that we ask for that value at level 1 so that we could propagate it down. As we added support for a new parameter for a specific use case, it would have the potential to impact all of the other pieces that use the module. (Some of this is caused by limitations of the HCL language that Terraform uses)
Nevertheless, when we use templating tools we open the door for code reuse as well as levels of inference that make the manifests harder to read. (I acknowledge that putting “code reuse” in a negative context seems odd) This code reuse in particular tends to be the genesis of parameterization that ultimately bubbles its way up the stack. Perhaps not on day one, but by day 200 it seems almost too tempting to resist.
As an organization, we’re relatively immature as it relates to this shared responsibility model for infrastructure. A lot of the techniques that could mitigate my concerns haven’t been battle tested in the company. After some time running in this environment and getting used to the developer and operations interactions, my stance may soften. But for day one, adding additional processes to work around the shortcomings is a bit too much.
Easily repeated environment creation
In our internal infrastructure as code (IaC) testing we would often have a situation where coordinating infrastructure changes that needed to be coupled with code changes was a bit of a disaster. Terraform was versioned in one repository and SaltStack code in another, but the two changes needed to be tested together. This required either a lot of coordination or a ton of manual test environment setup. To deal with the issue more long-term we started to include a branch parameter on all environment creation commands, so that you could specify a custom SaltStack server, a specific Terraform branch and a specific SaltStack branch. The catch was you had to ensure that these parameters were carried all the way down the pipeline. The complexity that this created is one of the reasons I’ve been leaning towards having the infrastructure code and the application code exist in the same repository.
Having the two together also allows us to hardcode information to ensure that when we deploy a branch, we’re getting a matching set of IaC and application code by setting the image tag in the manifest to match the image built. (There are definite implementation details to work out on this) This avoids the issue of infrastructure code being written for the expectations of version 3.0 of application code, but then suddenly being provided with version 2.0 of application code and things breaking.
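One possible way to hardcode that pairing (sketched here with hypothetical names, since the implementation details are still open) is Kustomize’s images transformer, with CI rewriting the tag to match the image it just built:

# kustomization.yaml (hypothetical sketch)
resources:
  - deployment.yaml
images:
  - name: registry.example.com/example-api
    newTag: "2.0.0"   # written by CI to match the image built from this commit

Because the tag lives in the committed manifest rather than in a runtime parameter, checking out a branch gives you a matching pair. Without that pinning, infrastructure code written for version 3.0 expectations can silently be handed version 2.0 of the application.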
We see this when we’re upgrading core components that are defined at the infrastructure layer or when we roll out new application environment requirements, like AuthDB. When AuthDB rolled out, it required new infrastructure, but only for versions of the software that were built off the AuthDB branch. It resulted in us spinning up AuthDB infrastructure whether you needed it or not, prolonging and sometimes breaking the creation process for environments that didn’t need AuthDB.
Assuming we can get over a few implementation hurdles, this is a worthwhile goal. It will create a few headaches for sure. How do we make a small infrastructure change in an emergency (like a replica count) without triggering an entire CI build process? How do we ensure OPS is involved with changes to the /infrastructure directory? All things we’ll need to solve for.
Using Kustomize Exclusively
The mixture of goals and philosophies has landed us on using Kustomize exclusively in the environment. Along with that, we’d like to adopt many of Kustomize’s philosophies around templating, versioning and manifest management.
While Helm has become a popular method for packaging Kubernetes applications, we’ve avoided authoring Helm charts in order to minimize not just the tools, but also the number of philosophies at work in the environment. By using Kustomize exclusively, we acknowledge that some things will be incredibly easy and some will be more difficult than they need to be. But that trade-off is part of adhering to an ideology consistently. Some of those trade-offs are spelled out in the Kubernetes team’s Eschewed Features document. Again, this isn’t to say one approach is right and one is wrong. The folks at Helm are serving the needs of many operators. But the Kustomize approach aligns more closely with the ProdOps worldview of running infrastructure.
We’re looking to leverage Kustomize so that we:
- Don’t require preprocessing of manifests outside of the Kustomize commands
- Are as explicit as possible in manifest definitions, making it easy for people who aren’t in the code base often to read them and get up to speed
- Can easily recreate environments without needing to store or remember runtime parameters that were passed
- Minimize the number of tools used in the deployment pipeline
I’m not saying it’s the right choice. But for ProdOps it’s the preferred choice. Some pain will definitely follow.
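For the curious, here’s a minimal sketch of the base-plus-overlay shape this points us toward; the directory names and patch file are hypothetical:

# base/kustomization.yaml
resources:
  - deployment.yaml
  - service.yaml

# overlays/production/kustomization.yaml
resources:
  - ../../base
patchesStrategicMerge:
  - replica-count.yaml   # a small, explicit, readable override

A deploy then stays close to a single command, something like kubectl apply -k overlays/production, with no preprocessing step outside of Kustomize itself.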
-
Organizing your todos for better effectiveness
Organizing your todos for better effectiveness
If I’ve learned anything during the pandemic it’s this: time is not my constraining resource. The lockdown has forcibly removed many of the demands on my time that I’ve conveniently used as an excuse. My 35-minute commute each way is gone. My evening social commitments have all evaporated. Time spent shuffling kids between extracurricular activities has now become a Zoom login. What am I doing with all of this extra time?
After a few work days that felt incredibly productive, I decided to deeply examine what made those days more effective than others. I didn’t necessarily accomplish more. I spent most of the time doing a rewrite of some deployment code. At the end of the day I had a bunch of functions and unit tests written, but I didn’t have anything impactful to share just yet. That’s when I realized it wasn’t the deliverable of a task that made me feel productive but the level of purpose with which I worked.
What was it about the other days that made me feel so unproductive? The one thing they all had in common was a heavy sense of interruption. Sometimes the interruptions were driven by the meetings that seem to invade my calendar, spreading like a liquid to fill every available slice of time. Other times it was the demands of my parallel full-time job as a parent/teacher/daycare provider, now that my kids are permanently trapped inside with me. The consistent theme was that when I only had 30 minutes of time, it seemed impractical to work on a task named “Rewrite the deployment pipeline”. My problem consisted of two major issues: the size of the work and how the work was presented to me.
We tend to think of tasks in terms of a deliverable. A large task gets unfairly summarized as a single item, when in fact, it’s many smaller items. I learned this quite some time ago but the issue still shows up in my task list from time to time. The first step was to make sure that my tasks were broken down into chunks that could be accomplished in a maximum of 30 to 60 minutes. Breaking down “Rewrite the deployment pipeline”, could be separated into tasks like:
- Write unit tests for the metadata retrieval function
- Write the metadata retrieval function
- Move common functions into a standard library
- Update references of the common functions to the new standard library
You get the idea. These are all small tasks that I should be able to tackle in a 60 minute period.
The more pressing issue I would encounter, however, was having work presented without regard for the context I’m currently in. If I’ve only got 15 minutes before my next meeting, it takes a lot of energy to start a task that I know I can’t finish in that period. Because I didn’t have time to finish any of the items on my important list, I’d decide to play hero and go looking in Slack channels to see whose questions I could answer. But for some reason, at the end of the week when I reviewed my list of tasks, I’d still have all these small tasks that I hadn’t made any progress on.
This is where OmniFocus’s perspectives functionality saves me. Perspectives allow me to look at tasks that meet specific criteria. I have a perspective I use called “Focus” that shows me which tasks I’ve flagged as important and which tasks are “due” soon. (In my system, due means that I’ve made an external commitment to a date or there is some other time-based constraint on the task)
While this is great for making sure that I’m on top of things I’ve made commitments to, it doesn’t do a great job of showing me what I can actually work on given the circumstances. There’s no indication of how much time a task will take. Having a separate category for phone calls is great when I’m in phone-calling mode. But there are different levels of time commitment between “Call Mom and make sure she got the gift” and “Call your mortgage broker to discuss refinancing options”. I needed a way to also distinguish those tasks from each other.
A while ago I had started leveraging an additional context/tag of “Short Dashes” and “Full Focus”. This was just a quick hint of how much energy was required for the task. By using those contexts/tags, I can create a new filter that highlights short-dash items I can do between meetings. And now that OmniFocus supports multiple tags, I can also add a tag based on the tool that I need to complete the task. (e.g. Email, Phone, Computer, Research)
Now when I have a short amount of time, I can quickly flip to this perspective of work, which allows me to wrap up a lot of the smaller tasks that I need to do. This helps me to maximize those few minutes that I would normally waste checking Twitter because I didn’t have enough time to complete a larger task.
Another common scenario I’d run into was where my physical presence was tied up, but my mind was free. (Think of waiting for a doctor’s appointment to start. Back when we did those crazy things) I created a mobile perspective specifically for that purpose! It looks at all the tasks that I could complete on a mobile device.
These small changes have helped me to become more effective in those smaller slices of time. Now I know what I can make progress on regardless of my situation and begin to make some of that extra time I’ve got useful.
If you don’t have a to-do management system, I’d highly recommend OmniFocus and reading the book Getting Things Done by David Allen.
-
ChatOps/ChatBots at Centro
ChatOps/ChatBot at Centro
“white robot action toy” by Franck V. on Unsplash During DevOpsDays PDX I chatted with a number of people who were interested in doing ChatOps in their organizations. It was the motivation I needed to take this half-written blog post and put the finishing touches on it.
Why a ChatBot?
The team at Centro had always flirted with the idea of doing a chatbot, but to be honest we stumbled into it, which accounts for a bunch of the problems we encountered down the road. When we were building out our AWS infrastructure, we had envisioned an Infrastructure OPS site that would allow users to self-service requests. A chatbot seemed like a novelty side project. One day we were spitballing about what a chatbot would look like. I mentioned a tool I had been eyeing for a while called StackStorm. StackStorm has positioned itself as an “Event-Driven Automation” tool, the idea being that an event in your infrastructure could trigger an automation workflow. (Auto-remediation anyone?) The idea seemed solid based on the team’s previous experience at other companies. You always find that you have some nagging problem that’s going to take time to get prioritized and fixed. The tool also had a ChatOps component, since when you think about it, a chat message is just another type of event.
To make a long story short, one of our team members did a spike on StackStorm out of curiosity and in very short order had a functioning chatbot ready to accept commands and execute them. We built a few commands for Marvin (our chatbot) with StackStorm and we instantly fell in love. Key benefits:
- Slack is a client you can use anywhere. The more automation you put in your chatbot the more freedom you have to truly work anywhere.
- The chatbot serves as a training tool. People can search through history to see how a particular action is done.
- The chatbot (if you let it) can be self-empowering for your developers
- Unifies context. (again if you let it) The chatbot can be where Ops/Devs/DBAs all use the same tool to get work done. There’s a shared pain, a shared set of responsibility and a shared understanding of how things are operated in the system. The deploy to production looks the same way as the deploy to testing.
Once you get a taste for automating workflows, every request will go under the microscope with a simple question: “Why am I doing this, instead of the developer asking a computer to do it?”
Chatbot setup
StackStorm is at the heart of our chatbot deployment. The tool gave us everything we needed to start writing commands. The project ships with Hubot, but unless you run into problems you don’t need to know anything about Hubot itself. The StackStorm documentation has a ChatOps tutorial that gets into the specifics of how to set it up.
The StackStorm tool consists of various workflows that you create. It uses the Mistral workflow engine from the OpenStack Project. It allows you to tie together individual steps to create a larger workflow. It has the ability to launch separate branches of the workflow as well, creating some parallel execution capabilities. For example, if your workflow depends on seeding data in two separate databases, you could parallelize those tasks and then have the workflow continue (or “join” in StackStorm parlance) after those two separately executing tasks complete. It can be a powerhouse option and a pain in the ass at the same time. But we’ll get into that more later in the post.
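As a rough sketch of that parallel pattern (the workflow and action names here are hypothetical, not one of our production workflows), a Mistral v2 workflow with a join looks something like this:

version: '2.0'

examples.seed_environment:        # hypothetical pack.workflow name
  type: direct
  tasks:
    seed_primary_db:
      action: core.local          # StackStorm's built-in local shell action
      input:
        cmd: "echo 'seeding primary'"
      on-success:
        - finish_setup
    seed_reporting_db:            # no inbound transition, so it starts in parallel
      action: core.local
      input:
        cmd: "echo 'seeding reporting'"
      on-success:
        - finish_setup
    finish_setup:
      join: all                   # waits for both seeding branches to complete
      action: core.local
      input:
        cmd: "echo 'done'"

Both seeding tasks kick off in parallel, and finish_setup only runs once they have both completed.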
The workflows are then connected to StackStorm actions, which allow you to execute them using the command line tool or the Chatbot. An action definition is a YAML file that looks like
---
name: "create"
pack: platform
runner_type: "mistral-v2"
description: "Creates a Centro Platform environment"
entry_point: "workflows/create.yaml"
enabled: true
parameters:
  environment:
    type: "string"
    required: true
    description: "The name of the environment"
  requested_version:
    type: "string"
    default: "latest"
    description: "The version of the platform to deploy"

Workflows and actions are packaged together in StackStorm via “packs”. Think of it as a package in StackStorm that provides related functionality to a product. For us, we group our packs around applications, along with a few shared libraries for actions we perform from multiple packs. The above action is from the platform pack, which controls management of our primary platform environment. There are a bunch of community-supported packs available via the StackStorm Exchange.
Then to finally make this a chat command, we define an alias. The alias identifies what messages in chat will trigger the associated action.
---
name: "create"
action_ref: "platform.create"
description: "Creates a Platform environment"
formats:
- "create platform environment named {{ environment }}( with dataset {{ dataset }})?( with version {{ requested_version='latest' }})"
ack:
  format: "Creating platform environment {{ execution.parameters.environment }}"
  append_url: false
result:
  format: "Your requested workflow is complete."

The formats section of the alias is a slightly modified regular expression. It can be a bit difficult to parse at times as commands become more complex with more optional parameters. The {{ environment }} notation expresses a parameter that will be passed on to the associated action. You can also set that parameter to a default value via assignment, as in {{ requested_version='latest' }}. This means if a user doesn’t specify a requested_version, “latest” will be passed as the value for that parameter. Between regex and default parameters, you have a lot of control over the parameters a user can specify. You can also have multiple formats that trigger the same action. You can see which action will be invoked by looking at the action_ref line. It’s in a pack.action_name format.
StackStorm brings a lot to the table
This might seem like a lot to get setup per command, but it’s actually quite nice to have StackStorm as this layer of abstraction. Because StackStorm is really an event-automation tool, it exposes these workflows you create in 3 different ways.
- The chatbot allows you to execute commands via your chat tool. Hubot supports a number of chat tools, which I believe translates to StackStorm support as well.
- The packs and actions you create can be executed from the StackStorm run command manually. This is extremely useful when there’s a Slack outage. The command syntax is
st2 run platform.create environment=testing requested_version=4.3
And just like in chat, optional parameters will get default values.
- The StackStorm application also provides API access. This gives you the ability to call workflows from just about any other application. This is great when another application needs to do the exact same thing a user might do themselves via the chatbot. That whole shared-context thing showing up again.
What do you run via Chatbot?
To put it simply, as much as we can. Anytime there’s a request to do something more than once, we start to ask ourselves, “Are we adding value to this process or are we just gatekeepers?” If we’re not adding value, we put it in a chat command. Some examples of things we have our chatbot do:
- Create an environment
- Restore an environment
- Take a DB Snapshot of an environment
- Scale nodes in an Autoscaling group
- Execute Jenkins build jobs
- Scale the Jenkins worker instance count
- Run migrations
- Pause Sidekiq
- Restart services
- Deploy code
- Put an environment in maintenance mode
- Turn on a feature toggle
- Get a config value from Consul
- Set a config value in Consul
In all, we have over 100 chat commands in our environment.
But what about Security?
Yes, security is a thing. Like most things security related, you need to take a layered approach. We use SSO to authenticate to Slack, so that’s the first layer. The second layer is provided inside the workflows that we create. You have to roll your own RBAC, but most organizations have some sort of directory service for group management. For Slack in particular, the RBAC implementation can be a bit messy. The chatbot variables you get as part of each message event include the user’s username, which is changeable by the user. So you really need to grab the user’s token, look up the user’s info with the token to get the email address of the account and then use that to look up group information in whatever your directory service is.
We also ensure that dangerous actions have other out-of-band workflow controls. For example, you can’t just deploy a branch to production. You can only deploy an RPM that’s in the GA YUM repository. In order to get a package to the GA repository, you need to build from a release branch. The artifact of the release branch gets promoted to GA, but only after the promotion confirms that the release branch has a PR that has been approved to go to master. These sorts of out-of-band checks are crucial for some sensitive actions.
Push-based two-factor authentication for some actions is desired too. The push-based option is preferred because you don’t want a two-factor code submitted via chat that is technically still live for another 60–120 seconds. We’re currently working on this, so keep an eye out for another post.
Lastly, there are some things you simply can’t do via the Chatbot. No one can destroy certain resources in Production via Chat. Even OPS has to move to a different tool for those commands. Sometimes the risk is just too great.
Pitfalls
A few pitfalls with chatbots that we ran into:
- We didn’t define a common lexicon for command families. For example, a deploy should have very similar nomenclature everywhere. But because we didn’t define a specific structure, some commands are
create platform environment named demo01
and some are
create api environment demo01
. The simple omission of
name
can trip up people who need to operate in both the platform space and the api space.
- The Mistral workflow is a powerful tool, but it can be a bit cumbersome. The workflow engine uses a polling mechanism to move between steps. (Step 1 completes, but step 2 doesn’t start until the polling interval occurs and the system detects step 1 finished) As a result, during heavy operations a considerable amount of time can be wasted on steps that have completed but are waiting on a successful poll before moving on.
- Share the StackStorm workflow early with all teams. Empower them to create their own commands early on in the process, before the tools become littered with special use cases that makes you hesitant to push that work out to other teams.
- Make libraries of common actions early. You can do it by creating custom packs so that you can call those actions from any pack.
- Use the Mistral workflow sparingly. It’s just one type of command runner StackStorm offers. I think the preferred method of execution, especially for large workflows, is to have most of the logic in a script, so that the action becomes just executing the script; see the sketch below. The Mistral tool is nice, but it becomes extremely verbose when you start executing a lot of different steps.
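For instance, a script-backed action definition can stay tiny. This is a hypothetical sketch (the pack, name and script path are made up) using StackStorm’s local-shell-script runner:

---
name: "create_scripted"
pack: "platform"
runner_type: "local-shell-script"       # delegates execution to a script
description: "Creates an environment by delegating the heavy lifting to a script"
entry_point: "scripts/create.sh"        # hypothetical script that does the real work
enabled: true
parameters:
  environment:
    type: "string"
    required: true

All of the branching and sequencing lives in the script, where it’s far easier to read and test than an equivalent wall of workflow YAML.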
Conclusion
We’re pretty happy with our chatbot implementation. It’s not perfect by any means, but it has given us back a lot of time previously lost to toil. StackStorm has been a tremendous help. The StackStorm Slack is where a lot of the developers hang out, and they’re amazing. If you’ve got a problem, they’re more than willing to roll up their sleeves and help you out.
While not in-depth, I hope this brief writeup has helped someone out there in their Chatbot journey. Feel free to ping me with any questions or leave comments here.
-
Stories vs Facts in Metrics
You need to measure your processes. It doesn’t matter what type of process, whether it be a human process, a systems process or a manufacturing process, everything needs to be measured. In my experience, you’ll often find humans resistant to metrics that measure themselves. There’s a lot of emotion that gets caught up in collecting metrics on staff because unlike computers, we intuitively understand nuance. I’ve worked hard to be able to collect metrics on staff performance while at the same time not adding to the team’s anxiety when the measuring tape comes out. A key to that is how we interpret the data we gather.
At Centro, we practice Conscious Leadership, a methodology for approaching leadership and behaviors throughout the organization. One of the core tenets of Conscious Leadership is the idea of Facts vs Stories. A fact is something that is completely objective, something that could be revealed by a video camera. For example, “Bob rubbed his forehead, slammed his fist down and left the meeting”. That account is factually accurate. Stories are interpretations of facts. “Bob got really angry about my suggestion and stormed out of the meeting.” That’s a story around the fact that Bob slammed his fist down and left the meeting, but it’s not a fact. Maybe Bob remembered he left his oven on. Maybe he realized at that exact moment the solution to a very large problem and had to go test it out. The point is, the stories we tell ourselves may not be rooted in reality, but simply be a misinterpretation of the facts.
This perspective is especially pertinent with metrics. There are definitely metrics that are facts. An example is the number of on-call pages sent to an employee. That’s a fact. The problem is when we take that fact and develop a story around it. The story we may tell ourselves is that we have a lot of incidents in our systems. But the number of pages a person gets may not be directly correlated to the number of actual incidents that have occurred. There is always nuance there. Maybe the person kept snoozing the same alert and it was just re-firing, creating a new page.
There are, however, some metrics that are not facts, but merely stories in a codified form. My favorite is automatically generated stats around Mean Time to Recovery. This is usually a metric generated by measuring the length of an incident or incidents related to an outage. But this metric is usually a story and not a fact. The fact is the outage incident ticket was opened at noon and closed at 1:30pm. The story around that is it took us 1.5 hours to recover. But maybe the incident wasn’t closed the moment service was restored. Maybe the service interruption started long before the incident ticket was created. Just because our stories can be distilled into a metric doesn’t make them truthful or facts.
Facts versus stories is important in automated systems, but even more so when dealing with human systems and their related workflows. Looking at a report and seeing that Fred closed more tickets than Sarah is a fact. But that doesn’t prove the story that Fred is working harder than Sarah or that Sarah is somehow slacking in her responsibilities. Maybe the size and scope of Fred’s tickets were smaller than Sarah’s. Maybe Sarah had more drive-by conversations than Fred, which reduced her capacity for ticket work. Maybe Sarah spent more time mentoring co-workers in a way that didn’t warrant a ticket. Maybe Fred games the system by creating tickets for anything and everything. There are many stories we could make up around the fact that Fred closed more tickets than Sarah. It’s important as leaders that we don’t let our stories misrepresent the work of a team member.
The fear of the stories we make out of facts is what drives the angst that team members have when leaders start talking about a new performance metric. Be sure to express to your teams the difference between facts and stories. Let them know that your measurements serve as signals more than truths. If Fred is closing a considerably larger number of tickets, it’s a signal to dig into the factors behind the fact. Maybe Fred is doing more than Sarah, but more than likely, the truth is more nuanced. Digging in may reveal corrective action on how work gets done or it might reveal a change in the way that metric is tracked. (And subsequently, how that fact manifests) Or it might confirm your original story.
Many people use metrics and dashboards to remove the nuance of evaluating people. They should instead serve as the prompt to reveal the nuance. When you take your issue to your team, make sure you are open about the facts and your story around those facts. Be sure to separate the two and have an open mind as you explore your story. The openness and candor will provide a level of comfort around the data being collected, because your team knows it’s not the end of the conversation.
-
How You Interview is How You Hire
“Of course you can’t use Google, this is an interview!” That was the response a friend got when he attempted to search for the syntax of something he hadn’t used in a while during an interview. After the interview was over, he texted me and told me about the situation. My advice to him was to run as fast as he could and not think twice about the opportunity, good or bad. I haven’t interviewed at a ton of places as a candidate, so my sample size is statistically insignificant, but it seems insane in today’s world that this would be how you interview a candidate.
As a job candidate, you should pay special attention to how your interview process goes. You spend so much time focused on getting the answers and impressing the tribunal, that you can sometimes fail to evaluate the organization based on the nature of questions being asked and how they are asked. As an organization, how you interview is how you hire, how you hire is how you perform. This is an important maxim, because it can give you, the job seeker, a lot of insight into the culture and personalities you might be working with soon.
The interview process I described earlier seems to put more emphasis on rote memorization than actual problem-solving ability. Coming from someone who still regularly screws up the syntax for creating symlinks, I can attest that your ability to memorize syntax has no bearing on your performance as an engineer.
What does an emphasis on memorization tell me about an organization? They may fear change. They may demand the comfort of tools they know extremely well, which on the face of it isn’t a bad thing. Why use the newest whizzbangy thing when the old tried and true works? Well, sometimes the definition of “works” changes. Nagios was fine for me 20 years ago, but it isn’t the best tool for the job with the way my infrastructure looks today, regardless of how well I know Nagios. (On a side note, I think this describes VIM users. We labor to make VIM an IDE because we’ve spent so many years building up arcane knowledge that starting over seems unpalatable. But I digress)
No one expects to work at a place where Google isn’t used extensively to solve problems. So what exactly are interviewers attempting to accomplish by banning it? Creating a power dynamic? Seeing how you work under pressure? Simulating an environment where Internet access is heavily restricted? These goals could well be pertinent to the job, but how you evaluate those things is just as important as the results you get from them.
I wish there were a Rosetta Stone mapping interview formats to personality types, but this is just one example of the kind of thing I look for when interviewing and try to actively avoid when giving an interview. Things to also look out for:
- Are they looking for a specific solution to a general problem? Maybe you have an answer that works, but you feel them nudging you to a predetermined answer. (e.g. Combine two numbers to get to 6. They might be looking for 4 and 2, but 7 and –1 are also valid)
- Did the interview challenge you at all technically? Will you be the smartest person in the room if you’re hired?
- Are you allowed the tools that you would fully expect to use on the job? (Google, IDE help documentation etc)
- Are they asking questions relevant to the actual role? Preferably in the problem space you’ll be working in.
Paying attention to how a company evaluates talent gives you insight into the type of talent they have. The assumption is always that the people giving the interview have the right answers, but plenty of unqualified people have jobs and those same unqualified people often sit in on interviews.
Remember that the interview is a two-way street. Look to the interview process as a way to glean information about all of the personalities, values and priorities that make the process what it is. And then ask yourself, is it what you’re looking for?
-
Hubris — The Interview Killer
Hubris — The Interview Killer
Interviewing engineers is a bit more art than science. Every hiring manager has that trait that they look for in a candidate. As an interviewer, you subconsciously recognize early on if the candidate has that magical quality, whatever it may be for you. It either qualifies or disqualifies a candidate in an instant. The trait that I look for to disqualify a candidate is hubris.
Self-confidence is a wonderful thing, but when self-confidence becomes excessive, it’s toxic and dangerous. That danger is never more prevalent than during the build vs buy discussion. The over-confident engineer doesn’t see complex problems, just a series of poor implementations. The over-confident engineer doesn’t see how problems can be interconnected or how use cases change. Instead they say things like “That project is too heavy. We only need this one small part” or “There aren’t any mature solutions, so we’re going to write our own.”
The cocky engineer to the rescue Humility in an engineer is not a nicety, it’s a necessity. Respect for the problem space is a required ingredient for future improvements. But as important as respect for the problem is, respect for the solutions can be even more important. Every solution comes with a collection of trade-offs. The cocky engineer doesn’t respect those trade-offs or doesn’t believe they were necessary in the first place. The cocky engineer lives in a world without constraints, without edge cases and with an environment frozen in time, forever unchanging.
But why does all this matter? It matters because our industry is full of bespoke solutions to already solved problems. Every time you commit code to your homegrown log-shipping tool, an engineer that solves this problem as part of their full-time job dies a little bit on the inside. Every time you have an easy implementation for leader election in a distributed system, a random single character is deleted from the Raft paper.
I’m not suggesting that problems are not worth revisiting. But a good engineer will approach the problem with a reverence for prior work. (Or maybe they’re named Linus) An arrogant engineer will trivialize the effort, over promise, under deliver and saddle the team with an albatross of code that always gets described as “some dark shit” during the on-boarding process to new hires.
If problems were easy, you wouldn’t be debating the best way to solve them because the answer would be obvious and standard. When you’re evaluating candidates, make sure you ask questions that involve trade-offs. Ask them for the flaws in their own designs. Even if they can’t identify the flaws, how they respond to the question will tell you a lot, so listen closely. If you’re not careful, you’ll end up with a homegrown Javascript framework….or worse.
-
I really like the concept here, but I’m not sure I’m fully getting it.
I really like the concept here, but I’m not sure I’m fully getting it. Adaptive Capacity *can* be pretty straightforward from a technology standpoint, especially in a cloud type of environment where the “buffer” capacity doesn’t incur cost until it’s actually needed. When it comes to the people portion, I’m not sure if I’m actually achieving the goal of “adaptive” or not.
My thought is basically building in “buffer” in terms of work capacity, but still allocating that buffer for work and using prioritization to know what to drop when you need to shift. (Much like the buffers/cache of the Linux filesystem) The team is still allocated for 40 hours worth of work, but we have mechanisms in place to re-prioritize work to take on new work. (i.e. You trade this ticket/epic for that ticket/epic or we know that this lower value work is the first to be booted out of the queue)
This sounds like adaptive capacity to me, but I’m not sure if I have the full picture, especially when I think of Dr. Cook’s list of 7 items from Poised to Deploy. The combination of those things is exactly what makes complex systems so difficult to deal with. People understand their portion, but not the system as a whole, so we’re always introducing changes/variance with unintended ripple effects. And I think that’s where it feels like I have a blindspot when it comes to the concept.
I might have jumped the gun on this post, because I still have one of the keynotes you linked in the document to watch as well as a PDF that Allspaw tweeted, but figured I’d just go ahead and get the conversation rolling before it fell off my to-do list. =)
-
A Post-mortem on the Mental Model of a System
On Thursday, December 14th we suffered a small incident that led to various user notifications and actions not being triggered as they would during normal system operations. The issue was caught by alerting, so staff could react prior to customers being impacted, but the MTTR (approx 4 hours) was higher than it should have been given the nature of the error and the corrective action taken to resolve it. This seemed like an opportune time to evaluate the incident with regards to our mental models of how we think the system operates versus how it actually operates. This post-mortem is much more focused on those components of the failure than our typical march towards the ever-elusive “root cause”.
Below is a timeline of the events that transpired. After that, we’ll go into the different assumptions made by the participants and how they were misaligned with the way the system actually behaves.
Timeline
- Datadog alert fires stating that the activity:historyrecordconsumer queue on the RabbitMQ nodes is above thresholds.
- Operator on-call receives the alert, but doesn’t take immediate action
- Second Datadog alert fires at 5:37am for direct::deliveryartifactavailableconsumer-perform
- Operator is paged and begins to diagnose. Checks system stats for any sort of USE related indicators. The system doesn’t appear to be in any duress.
- The operator decides to restart the Sidekiq workers. This doesn’t resolve the issue, so the operator decides to page out to a developer.
- The operator checks the on-call schedule but finds the developer on-call and the backup-developer on-call have no contact information listed. With no clear escalation path on the developer side, the operator escalates to their manager.
- Management creates an incident in JIRA and begins assisting in the investigation.
- Manager requests that the operator restart the Sidekiq workers. This doesn’t resolve the issue.
- Developers begin to log in as the work day begins
- Developer identifies that the work queues stopped processing at about 2:05am
- Developer suggests a restart of the Consumer Daemon
- Operator restarts Consumer Daemon
- Alerts clear and the queue begins processing
As you can see, the remedy for the incident was relatively straightforward. But there were a lot of assumptions, incorrect mental models and bad communication that led to the incident taking so long to resolve. Below is a breakdown of the different actions that were taken and the thought process behind them. This leads us to some fascinating insights on ways to make the system better for its operators.
Observations
Some brief context about the observations. The platform involved in the incident has recently been migrated from our datacenter hosting provider to AWS. Along with that migration came a retooling of our entire metrics and alerting system, moving away from InfluxDB, Grafana and Sensu to Datadog.
The platform is also not a new application and precedes the effort of having ProdOps and Development working more closely together. As a result, Operations staff do not yet have the in-depth knowledge of the application they might otherwise have.
Operator on-call receives the alert, but doesn’t take immediate action
The operator received the page, but noticed that the values for the queue size were just above the alerting threshold. Considering the recent migration and this being the first time the alert had fired in Production, the operator made the decision to wait, assuming the alert was a spike that would clear itself. You can notice a clear step change in the graph below.
We have many jobs that run on a scheduled basis and these jobs drop a lot of messages in a queue when they start. Those messages usually get consumed relatively quickly, but due to rate limiting by 3rd parties, the processing can slow down. In reality the queue that’s associated with this alert does not exhibit this sort of behavior.
Second Datadog alert fires at 5:37am for direct::deliveryartifactavailableconsumer-perform
After this second alert the Operator knew there must be a problem and began to take action. The operator wasn’t sure how to resolve the problem and said a feeling of panic began to set in. Troubleshooting from a point of panic can lead to clouded decision making and evaluation. There were no runbooks on this particular alert, so the impact to the end user was not entirely clear.
The operator decided to restart Sidekiq because the belief was that they were consumers of the queue. The operator had a mental model that resembled Celery, where work was processed by RabbitMQ workers and results were published to a Redis queue for notification. The actual model is the reverse. Workers work off of the Redis queue and publish their results to RabbitMQ. As a result, the Sidekiq workers only publish messages to RabbitMQ but they do not consume messages from RabbitMQ, therefore the Sidekiq restart was fruitless.
The Operator began to troubleshoot using the USE methodology (Utilization, Saturation, Errors) but didn’t find anything alarming. In truth, it turned out that the absence of log messages was the indicator that something was wrong, but the service that should have been logging wasn’t known to the operator. (The Operator assumed it would be Sidekiq workers based on their mental model described above. Sidekiq workers were logging normally)
Operator checks the on-call schedule and notices the Developer on call doesn’t have a phone number listed
The Operator checked the Confluence page but didn’t find any contact information or any general guidance on who to escalate to if the contact listed didn’t respond. This is a solved problem with tools like PagerDuty, where we programmatically handle on-call and escalations. Due to budget concerns, though, we leverage the Confluence page. It could be worthwhile to invest some development effort into automating the on-call process in lieu of adding more users to PagerDuty. ($29 per user)
Management creates an incident in JIRA and begins assisting in the investigation.
Management began investigating the issue and requested another restart of Sidekiq. The manager assumed that the operator was using Marvin, the chatbot, to restart services. The operator however was unsure of the appropriate name of the services to restart. The help command for the restart service command reads
restart platform service {{ service }} in environment {{ environment }}
This was confusing because the operator assumed that {{ service }} meant a systemd-managed service. We run Sidekiq workers as systemd services, so each worker has a different name, such as int_hub or bi_workers. Because the operator didn’t know the different names, it was much easier to SSH into a box, run the appropriate systemd commands and restart the services.
The disconnect is that {{ service }} is actually an alias that maps to particular application components. One of those aliases is sidekiq, which would have restarted all of the Sidekiq instances including the consumer_daemon, which would have resolved the issue. But because of the confusion surrounding the value of {{ service }}, the operator opted to perform the task manually. In a bit of irony, consumer_daemon is technically not a Sidekiq worker, so it’s also incorrectly classified and could cause further confusion for someone who has a different definition for these workers. The organization needs to work on a standard nomenclature to remove this sort of confusion across disciplines.
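To illustrate the mismatch, here’s a hypothetical sketch of the idea, not Marvin’s actual implementation: the alias layer effectively behaves like a lookup table from a chat-facing name to a set of components.

# Hypothetical alias-to-component mapping (illustrative only)
sidekiq:
  - int_hub
  - bi_workers
  - consumer_daemon   # grouped under the alias even though it's technically not a Sidekiq worker

An operator thinking in systemd unit names and a chatbot thinking in application aliases are working from two different maps of the same territory.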
Developer identifies that the work queues stopped processing at about 2:05am
When developers began to log in, they quickly noticed that the Consumer Daemon hadn’t processed anything since 2:05am. This was identified by the absence of log messages in Kibana. This was missed by the Operator for two reasons:
- As previously stated, the Operator was unaware that Consumer Daemon was the responsible party for processing this queue, so any log searching was focused on Sidekiq.
- The messages that denote processing by the Consumer Daemon have a much more internal representation. The log entries refer to an internal structure in the code, “MappableEntityUpdateConsumer”. The Operator, being unaware of internal structures in the code, would never have correlated that to the behavior being seen. The log message is written for an internal view of the system rather than that of an external operator.
Additional Observations
There were some additional observations that came out as part of the general discussion that didn’t map specifically to the timeline but are noteworthy.
activity:historyrecordconsumer is not a real queue
This queue is actually just something of a reporting mechanism. It aggregates the count of all queues and emits that as a metric. When this queue shows a spike in volume, it’s really just an indicator that some other queue is encountering problems. This is an older implementation that may not have value in the current world any longer. It also means that each of our stacked queue graphs are essentially reporting double their actual size. (Since this queue would aggregate the value of all queues, but then also be reported in the stacked graph) We should probably eliminate this queue entirely, but we’ll need to adjust alerting thresholds appropriately with its removal.
Background job exceptions don’t get caught with Airbrake
Airbrake normally catches exceptions and reports them via Slack and email. But background workers (i.e. not Sidekiq) do not report to Airbrake. There have been instances where a background job is throwing exceptions but no action is being taken to remedy.
Fixing the problem vs resolving the incident
Restarting the Consumer Daemon solved the issue we were having, but there was never any answer as to why every worker node in the fleet suddenly stopped processing from RabbitMQ. The support team was forced to move on to other issues before fully resolving or understanding the nature of the issue.
Action Items
With the observations listed above, we’ve found a few things that will make life easier the next time a similar incident occurs.
- Continue to ensure that our alerting fires only when there is a known/definite problem. There’s more value in getting alerts 5 minutes late but being confident that the alert is valid and actionable. In this case the alerting was correct, but we’ll need to continue to build trust by eliminating noisy alerts.
- Ensure that the On-Call support list has phone numbers listed for each contact. We also need to document the escalation policy for when on-call staff are unavailable. We should also look at automating this, either through expanding PagerDuty or otherwise.
- Marvin chatbot commands need a more thorough help page. Using a man-page-like format, with a link in the help documentation, was suggested.
- Common nomenclature for workers should be evangelized. A simple suggestion is that “workers” accounts for all types of publish/subscribe workers, and when we’re talking about a particular subset of workers we describe the messaging system they interact with: “RabbitMQ Workers” vs “Sidekiq Workers”.
- Support staff need to be afforded the time and energy to study an incident until its cause and prevention are sufficiently understood. We need to augment our processes to allow for this time. This will be a cross-functional effort led by Prod Ops.
-
I Don’t Understand Immutable Infrastructure
We were at the airport getting ready to go through security. A deep baritone voice shouted, “Everybody must take their shoes off and put them in the bin.” Hearing the instruction I told my son and daughter to take their shoes off and put them in the bin. When we got in line for the X-Ray machine, another man looked at my kids and said “Oh they don’t need to take their shoes off.” My wife and I looked at each other puzzled, “But the man over there said everyone take their shoes off.” “Oh, everyone except children under 12” he responded, as if that was the universal definition of “everybody”. I tell this story to highlight the idea that the words we choose to use matter a great deal when trying to convey an idea, thought or concept. Nowhere is this more true than the world of computing.
Immutable Infrastructure is one of those operational concepts that has been very popular, at least in conference talks. The idea isn’t particularly new; I remember building “golden images” in the 90’s. But there’s no doubt that the web, the rate of change and the tooling to support it have put the core concepts en vogue again. But is what we’re doing really immutable? I feel like it’s not. And while it may be a simple argument over words, we use the benefits of immutability in our arguments without any of the consequences that design choice incurs.
I often hear the argument that configuration management is on its way out, now that we’re ready to usher in an era of “immutable” infrastructure. You don’t push out new configurations, you build new images with the new configuration baked in and replace the existing nodes. But how do we define configuration? That answer is simultaneously concrete and malleable. I define configuration as
The applications, libraries, users, data and settings that are necessary to deliver the intended functionality of an application.
That’s a fairly broad definition, but so is configuration! Configuration management is the process (or absence of process) for managing the components in this list. Therefore if any one of these items is modified, that constitutes not just a change to your configuration, but a change to your infrastructure as well.
Since we’ve defined configuration, what do we mean by immutability? (Or what do we as an industry mean by it?) The traditional definition is
Not subject or susceptible to change or variation in form or quality or nature.
In the industry we boil it down to the basic meaning of “once it’s set, it never changes.” A string is often immutable in programming languages. Though we give strings the appearance of mutability, in reality it’s a parlor trick to simplify development. But if you tell a developer that strings are immutable, it conveys a specific set of rules and the consequences for those rules.
What do these definitions mean in practice? Let’s pretend it’s a normal Tuesday. There’s a 60% chance there’s a new OpenSSL package out and you need to update it. Rolling out a new OpenSSL package by creating a new image for your systems seems like a reasonable methodology. Now there’s a known good configuration of our system that we can replicate like-for-like in the environment. If you’re particularly good at it, getting the change rolled out takes you 30 minutes. (Making the change, pushing it, kicking off the image build process and then replacing nodes while dialing down traffic) For the rest of us mere mortals, it’s probably closer to a couple of hours. But regardless of time, immutable infrastructure wins!
Now let’s pretend we’re in our testing environment. This obviously has a different set of nodes it communicates with vs production, so our configuration is different. We don’t want to maintain two separate images, one for production and one for testing, because that would rob us of our feeling of certainty about the images being the same. Of course we solve this with service discovery! Now instead of baking this configuration into the application, our nodes can use tools like Consul and Eureka to find the nodes they need to communicate with. The image remains the same, but the applications configured on the image are neatly updated to reflect their running environment.
But isn’t that a change? And the definition of immutable was that the server doesn’t change. Are we more concerned that OpenSSL stays on the same version than we are about what database server an instance is talking to? I’m sure in the halls of Google, Netflix and LinkedIn, a point release of a library could have catastrophic consequences. But if you asked most of the industry “What frightens you more? Updating to the latest version of OpenSSL or updating worker_threads from 4 to 40?” I imagine most of us would choose the latter with absolutely zero context around what worker_threads is. Let’s wave our magic wand though and say service discovery has also relieved us of this particular concern. Let’s move on to something more basic, like user management.
In testing environments I have widely different access policies than I do for my production systems. I also have a completely different profile of users. In production, operations and a few developers are largely the only people that have access. In testing, development, QA and even product may have a login. How does that get managed? Do I shove that into service discovery as well? I could run LDAP in my environment, but that pushes my issue from “How do I manage users and keys?” to “How do I manage access policy definitions for the LDAP configuration?”
This is all just to say that I’m incredibly confused about the Immutable Infrastructure conversation. In practice it doesn’t solve a whole host of concerns. Instead it pushes them around into a layer of the system that is often ill-suited to the task. Or worse, the ideology simply ignores the failures caused by configuration changes and decides that “Immutable Infrastructure” is actually “Immutable Infrastructure, except for the most dangerous parts of the system”.
This doesn’t even tackle the idea that configuration management is still the best tool for…wait for it…managing configuration, even if you’re using immutable infrastructure. Docker and Packer both transport us back to the early ’90s in their approach to defining configuration. It’d be a shame if the death of configuration management was as imminent as some people claim.
So what am I missing? Is there a piece of the puzzle that I’m not aware of? Am I being too pedantic in my definition of things? Or is there always an unexpressed qualifier when we say “immutable”?
Maybe words don’t matter.
-
Our Salt Journey Part 2
Structuring Our Pillar Data
This is the 2nd part in our Salt Journey story. You can find the previous article here. With our specific goals in mind, we decided that designing our Pillar data was probably the first step in refactoring our Salt codebase.
Before we get into how we structure Pillar data, we should probably explain what we plan to put in it, as our usage may not line up with other users’ expectations. For us, Pillar data is essentially customized configuration data beyond the defaults. Pillar data is less about minion-specific data customizations and more about classes of minions getting specific data.
For example, we have a series of grains (which we’ll talk about in a later post) that carry classification information. One of the grains set is class, which identifies the node as being part of development, staging or production. This governs a variety of things we may or may not configure based on the class. If a node is classified as development, we’ll install metrics collection and checks, but the alerting profile for them will be very different than if the node was classified as staging or production.
With this in mind, we decided to leverage Pillar Environments in order to create a tiered structure of overrides. We define our pillar’s top.sls file in a specific order of base, development, staging and lastly production, like the diagram below.
├── base
├── development
├── staging
└── production
It’s important that we order the files correctly because when the pillar.get() function executes, it will merge values, but on a conflict the last write wins. We need to ensure that the order the files are read in matches the ascending order in which we want values to be overridden. In this example, conflicting values in the production folder will override any previously defined values.
This design alone, however, might have unintended consequences. Take for example the below YAML file.
packages:
  - tcpdump
  - rabbitmq-server
  - redis
If this value is set in the base pillar lookup (assuming you’ve defined base as base: '*'), then a pillar.get('packages') will return the above list. But if you also had the below defined in the production environment:
packages:
  - elasticsearch
then your final list would be
packages:
  - tcpdump
  - rabbitmq-server
  - redis
  - elasticsearch
because pillar.get() will traverse all of the environments by default. Without care, this results in a possible mashup of unexpected values. We protect against this by ensuring that Pillar data is restricted to only the nodes that should have access to it. Each pillar environment is guarded by a match syntax based on the grain. Let’s say our Pillar data looks like the below:
├── base
│   └── apache
│       └── init.sls
├── development
├── staging
│   └── apache
│       └── init.sls
└── production
    └── apache
        └── init.sls
If we’re not careful, we can easily have a mashup of values that results in a very confusing server configuration. So in our top.sls file we have grain matching that helps prevent this.
base:
  '*':
    - apache

production:
  'G@class:production':
    - match: compound
    - apache
This allows us to limit the scope of the nodes that can access the production version of the Apache pillar data and avoids the merge conflict. We repeat this pattern for development and staging as well.
What Gets a Pillar File?
Now that we’ve discussed how Pillar data is structured, the question becomes, what actually gets a pillar file? Our previous Pillar structure had quite a number of entries. (I’m not sure that this denotes a bad config, however, just an observation) The number of config files was largely driven by how our formulas were defined. All configuration specifics came from pillar data, which meant that in order to use any of the formulas, you had to define some sort of Pillar data before they would work.
To correct this we opted to move default configurations into the formula itself using the standard (I believe?) convention of a map.jinja file. If you haven’t seen the map.jinja file before, it’s basically a Jinja-defined dictionary that allows for setting values based on grains and then ultimately merging that with Pillar data. A common pattern we use is below:
A map.jinja for RabbitMQ
{% set rabbitmq = salt['grains.filter_by']({
    'default': {},
    'RedHat': {
        'server_environment': 'dev',
        'vhost': 'local',
        'vm_memory_high_watermark': '0.4',
        'tcp_listeners': '5672',
        'ssl_listeners': '5673',
        'cluster_nodes': '\'rabbit@localhost\'',
        'node_type': 'disc',
        'verify_method': 'verify_none',
        'ssl_versions': ['tlsv1.2', 'tlsv1.1'],
        'fail_if_no_peer_cert': 'false',
        'version': '3.6.6'
    }
}) %}
With this defined, the formula has everything it needs to execute, even if no Pillar data is defined. The only time you would need to define pillar data is if you wanted to override one of these default properties. This is perfect for formulas you intend to make public, because it makes no assumptions about the user’s pillar environment. Everything the formula needs is self-contained.
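For the curious, the “merging that with Pillar data” piece happens via filter_by’s merge argument. Here’s a minimal sketch of that wiring, with the default dictionary trimmed for brevity; it assumes overrides live under the same rabbitmq pillar key used in the example below:
{# Any values found under the 'rabbitmq' pillar key win over the grain-based defaults #}
{% set rabbitmq = salt['grains.filter_by']({
    'default': {},
    'RedHat': {
        'vhost': 'local',
        'tcp_listeners': '5672'
    }
}, merge=salt['pillar.get']('rabbitmq', {})) %}
The empty-dict default on pillar.get keeps the merge a no-op when a node has no overrides defined.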
Each Pillar file is defined first with the key that matches the formula that’s calling it. So an example Pillar file might be
rabbitmq:
  vhost: prod01-server
  tcp_listeners: 5673
The name spacing is a common approach, but it’s important because it gives you flexibility on where you can define overrides. They can live in their own standalone files or in a single pillar definition that spans multiple components. For example, our home-grown applications need to configure multiple pillar data values. Instead of spreading these values out, they’re collapsed with name spacing into a single file.
postgres:
  users:
    test_app:
      ensure: present
      password: 'password'
      createdb: False
      createroles: True
      createuser: True
      inherit: True
      replication: False
  databases:
    test_app:
      owner: 'test_app'
      template: 'template0'
logging:
  - input_type: log
    paths:
      - /var/log/httpd/access_log
      - /var/log/httpd/error_log
      - /var/log/httpd/test_app-access.log
      - /var/log/httpd/test_app-error.log
    document_type: apache
    fields: {
      environment: {{ grains['environment'] }},
      application: test_app
    }
  - input_type: log
We focus on our formulas creating sane defaults specifically for our environment so that we can limit the amount of data that actually needs to go into our Pillar files.
The catch with shoving everything into the map.jinja file is that sometimes you have a module that needs a lot of default values. OpenSSH is a perfect example of this. When this happens you’re stuck with a few choices:
- Create a huge map.jinja file to house all these defaults. This can be unruly.
- Hardcode defaults into the configuration file template that you’ll be generating, skipping the lookup altogether. This is a decent option if you have a bunch of values that you doubt you’ll ever change. Then you can simply turn them into lookups as you encounter scenarios where you need to deviate from your standard.
- Shove all those defaults into a base pillar definition and do the lookups there.
- Place the massive list of defaults into a defaults.yaml file and load that in.
We opted for option #3. I think each choice has its pluses and minuses, so you need to figure out what works best for your org. Our choice was largely driven by the OpenSSH formula and its massive number of options being placed in Pillar data. We figured we’d simply follow suit.
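To illustrate option #3, here’s a minimal sketch of what a base-environment pillar entry of defaults might look like. The key names are borrowed from sshd_config for illustration and won’t match the OpenSSH formula’s actual schema exactly:
openssh:
  sshd_config:
    PermitRootLogin: 'no'
    PasswordAuthentication: 'no'
    UseDNS: 'no'
    ClientAliveInterval: '300'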
This pretty much covers how we’ve structured our Pillar data. Since we started writing this we’ve extended the stack a bit more, which we’ll go into in our next post, but for now this is a pretty good snapshot of how we’re handling things.
Gotchas
Of course no system is perfect, and we’ve already run into a snag with this approach. Nested lookup overrides are problematic for us. Take for example the following in our base.sls file:
apache:
  sites:
    cmm:
      DocumentRoot: /
      RailsEnvironment: development
and then you decide that you want to override it in a production.sls Pillar file below:
apache:
  sites:
    cmm:
      RailsEnvironment: production
When you do a pillar.get('apache') with a node that has access to the production pillar data, you’d expect to get
apache:
  sites:
    cmm:
      DocumentRoot: /
      RailsEnvironment: production
but because Salt won’t handle nested dictionary overrides, you instead end up with
apache:
  sites:
    cmm:
      RailsEnvironment: production
which of course breaks a bunch of things when you don’t have all the necessary pillar data. Our hack for this has been to have a separate key space for overrides when we have nested properties.
apache_overrides:
  sites:
    cmm:
      RailsEnvironment: production
and then in our Jinja templates we do the lookup like:
{% set apache = salt['pillar.get']('apache', {}) %}
{% set overrides = salt['pillar.get']('apache_overrides', {}) %}
{% do apache.update(overrides) %}
This allows us to override at any depth and then rely on Python’s dictionary handling to merge the two into usable Pillar data with all the overrides. In truth we should do this for all lookups just to provide clarity, but because things grew organically we’re definitely not following this practice.
I hope someone out there is finding this useful. We’ll continue to post our wins and losses here, so stay tuned.
-
Thanks Weighted Decision. Great resources there!
-
Our Journey with Salt
These are a few of the major pain points that we are trying to address, but obviously we’re going to do it in stages. The very first thing we decided to tackle was formula assignment.
Assigning via hostname has its problems, so we opted to leverage Grains on the node to indicate what type of server it is.
With the role custom grain, we can identify the type of server the node is and, based on that, what formulas should be applied to it. So our top.sls file might look something like
base:
  'role:platform_webserver':
    - match: grain
    - webserver
Nothing earth-shattering yet, but still a huge upgrade from where we’re at today. The key is getting the grain populated on the server instance prior to the Salt Provisioner bootstrapping the node. We have a few ideas on that, but truth be told, even if we have to manually execute a script to properly populate those fields in the meantime, that’s still a big win for us.
We’ve also decided to add a few more grains to the node to make them useful.
- Environment — This identifies the node as being part of development, staging, production etc. This will be useful to us later when we need to decide what sort of Pillar data to apply to a node.
- Location — This identifies which datacenter the node resides in. It’s easier than trying to infer via an IP address. It also allows for a special case of local for development and testing purposes.
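As a rough sketch of where these land on a node, custom grains like this conventionally live in /etc/salt/grains; the values below are invented for illustration:
# /etc/salt/grains — static custom grains read at minion start
role: platform_webserver
environment: staging
location: local
They can also be set ad hoc with salt-call grains.setval role platform_webserver, which fits our interim plan of running a script until the provisioner handles it.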
With these items decided on, our first task will be to get these grains installed on all of the existing architecture and then re-work our top file. Grains should be the only thing that dictates how a server gets formulas assigned to it. We’re making that an explicit rule mainly so we have a consistent mental model of where particular functions or activities are happening and how changes will ripple throughout.
Move Cautiously, But Keep Moving
Whenever you make changes like this to how you work, there are always going to be questions, doubts or hypotheticals that come up. My advice is to figure out which ones you have to deal with, which ones you need to think about now and which you can punt on till later. Follow the principle of YAGNI as much as possible. Tackle problems as they become problems, but pay no attention to the hypotheticals.
Another point is to be clear about the trade-offs. No system is perfect. You’ll be constantly making design choices that make one thing easier, but another thing harder. Make that choice with eyes wide open, document it and move on.
It’s so easy to get paralyzed at the whiteboard as you come up with a million and one reasons why something won’t work. Don’t give in to that pessimistic impulse. Keep driving forward, keep making decisions and tradeoffs. Keep making progress.
We’ll be back after we decide what in the hell we’re going to do with Pillar data.
-
Being a Fan
It was November 26th, 1989, my first live football game. The Atlanta Falcons were taking on the New York Jets at Giants Stadium. I was 11 years old. A friend of my father’s had a son who played for the Falcons, Jamie Dukes. They invited us down for the game since it was relatively close to my hometown. Before the game we all had breakfast together. Jamie invited a teammate of his, a rookie cornerback named Deion Sanders, to join us. The game was forgettable. The Falcons got pounded, which was par for the course that year. But it didn’t matter, I was hooked.
For the non-sports fan, the level of emotional investment fans have may seem like an elaborate Ponzi scheme. Fans pour money into t-shirts, jerseys, hats, tickets etc. When the dream is realized, when your team lifts that Lombardi trophy and is declared champion, the fan gets… nothing. No endorsement deals. No free trophy replica. No personal phone call from the players. Nothing. We’re not blind to the arrangement. We enter it willingly. To the uninitiated it’s the sort of hero-worshiping you’re supposed to shed when you’re 11 years old. Ironically, this is when initiation is most successful.
Fandom is tribalism. Tribalism is at the epicenter of the human condition. We dress it up with constructs as sweeping as culture and language and as mundane as logos and Greek letters. We strive to belong to something and we reflexively otherize people not of our tribe. Look at race, religion or politics.
But that’s the beauty of sports. The otherization floats on an undercurrent of respect and admiration. That otherization fuels the gameday fire, but extinguishes itself when a player lays motionless on the field. That otherization stirs the passion that leads to pre-game trash talk, but ends in a handshake in the middle of the field. That otherization causes friendly jabs from the guy in a Green Bay jersey in front of you at the store, but ends in a “good luck today” as you part ways.
In today’s political and social climate, sports are not just an escape, but a blueprint for how to handle our most human of urges. Otherization in sports has rules, but those rules end in respect for each other and respect for the game. I read the lovefest between these two teams and think how much it differs from our political discourse. If the rules of behavior for politicians changed, so would the rules for their fans.
Fandom is forged in the furnace of tribalism. As time passes it hardens. Eventually, it won’t bend, it won’t break. A bad season may dull it, but a good season will sharpen it. You don’t choose to become a fan. Through life and circumstance, it just happens. By the time you realize you’re sad on Mondays after a loss, it’s too late. You’re hooked.
Best of luck to the Patriots. Even more luck to the Falcons. Win or lose, I’ll be with the tribe next year…and the year after that…and the year after. I don’t have a choice. I’m a fan. #RiseUp
-
The Myth of the Working Manager
The tech world is full of job descriptions that describe the role of the working manager. The title itself is a condescension, as if management alone doesn’t rise to the challenge of being challenging.
I was discussing DHH’s post on Moonlighting Managers with a colleague when it occurred to me that many people have a fundamental misunderstanding of what a manager should do. We’ve polluted the workforce with so many bad managers that their toxic effects on teams hover like an inescapable fog. The exception has become the rule.
When we talk about management, what we’re often describing are supervisory tasks more than actual management. Coordinating time-off, clearing blockers and scheduling one-on-ones is probably the bare minimum necessary to consider yourself management. There’s an exhaustive list of other activities that management should be responsible for, but because most of us have spent decades being led in a haze of incompetency, our careers have been devoid of these actions. That void eventually gives birth to our expectations, and what follows is our collective standards being silently lowered.
Management goes beyond just people management. A manager is seldom assigned to people or a team. A manager is assigned to some sort of business function. The people come as a by-product of that function. This doesn’t lessen the importance of the staff, but it highlights an additional scope of responsibility for management: the business function. You’re usually promoted to Manager of Production Operations, not Manager of Alpha Team. Even when the latter is true, the former is almost always implied by virtue of Alpha Team’s alignment in the organization.
As the manager of Production Operations, I’m just as responsible for the professional development of my team as I am for the stability of the platform. Stability goes beyond simply having two of everything. Stability requires a strategy and vision on how you build tools, from development environments to production. These strategies don’t come into the world fully formed. They require collaboration, a bit of persuasion, measurement, analysis and most notably, time. It’s the OODA loop on a larger time scale.
Sadly, we use reductive terms like measurement and analysis, which obfuscate the complexity buried within them. How do you measure a given task? What measurement makes something a success or failure? How do you acquire those measurements without being overly meddlesome with things like tickets and classifications? (Hint: You have to sell the vision to your team, which also takes time) When managers cheat themselves of the time needed to meet these goals, they’re technically in dereliction of their responsibilities. The combination of a lack of time with a lack of training leads to a cocktail of failure.
This little exercise only accounts for the standard vanilla items in the job description. It doesn’t include projects, incidents, prioritization etc. Now somewhere inside of this barrage of responsibility, you’re also supposed to spend time as an engineer, creating, reviewing and approving code among other things. Ask most working managers and they’ll tell you that the split between management and contributor is not what was advertised. They also probably feel that they half-ass both halves of their job, which is always a pleasant feeling.
I know that there are exceptions to this rule. But those exceptions are truly exceptional people. To hold them up as the standard is like my wife saying Why can’t you be more like Usher? Let’s not suggest only hiring these exceptional people unless you work for a Facebook or a Google or an Uber. They have the resources and the name recognition to hold out for that unicorn. If you’re a startup in the mid-west trying to become the Uber of knitting supplies, then chances are your list of qualified candidates looks different.
The idea of a working manager is a bit redundant, like an engineering engineer. Management is a full-time job. While the efficacy of the role continues to dwindle, we should not compound the situation by also dwindling our expectations of managers, both as people and as organizations. Truth be told the working manager is often a creative crutch as organizations grapple with the need to offer career advancement for technical people who detest the job of management.
But someone has to evaluate the quality of our work as engineers and by extension, as employees. Since we know the pool of competent managers is small, we settle for the next best thing. An awesome engineer but an abysmal manager serving as an adequate supervisor.
The fix is simple.
- Recognize that management is a different skill set. Being a great engineer doesn’t make you a great manager.
- Training, training, training for those entering management for the first time. Mandatory training, not just offering courses that you know nobody actually has time to take.
- Time. People need time in order to manage effectively. If you’re promoting engineers to management and time is tight, they’ll always gravitate towards the thing they’re strongest at. (Coding)
- Empower management. Make the responsibilities, the tools and the expectations match the role.
Strong management makes strong organizations. It’s worth the effort to make sure management succeeds.