-
ChatOps/ChatBots at Centro
“white robot action toy” by Franck V. on Unsplash
During DevOpsDays PDX I chatted with a number of people who were interested in doing ChatOps in their organizations. It was the motivation I needed to take this half-written blog post and put the finishing touches on it.
Why a ChatBot?
The team at Centro had always flirted with the idea of doing a chatbot, but to be honest we stumbled into it, which accounts for a bunch of the problems we’ve encountered down the road. When we were building out our AWS infrastructure, we had envisioned an Infrastructure Ops site that would allow users to self-service requests. A chatbot seemed like a novelty side project. One day we were spit-balling about what a chatbot would look like. I mentioned a tool I had been eyeing for a while called StackStorm. StackStorm has positioned itself as an “Event-Driven Automation” tool, the idea being that an event in your infrastructure can trigger an automation workflow. (Auto-remediation, anyone?) The idea seemed solid based on the team’s previous experience at other companies. You always find that you have some nagging problem that’s going to take time to get prioritized and fixed. The tool also had a ChatOps component, since when you think about it, a chat message is just another type of event.
To make a long story short, one of our team members did a spike on StackStorm out of curiosity and in very short order had a functioning chatbot ready to accept commands and execute them. We built a few commands for Marvin (our chatbot) with StackStorm and instantly fell in love. Key benefits:
- Slack is a client you can use anywhere. The more automation you put in your chatbot the more freedom you have to truly work anywhere.
- The chatbot serves as a training tool. People can search through history to see how a particular action is done.
- The chatbot (if you let it) can be self-empowering for your developers
- Unifies context. (again, if you let it) The chatbot can be the place where Ops/Devs/DBAs all use the same tool to get work done. There’s a shared pain, a shared set of responsibilities and a shared understanding of how things are operated in the system. A deploy to production looks the same as a deploy to testing.
Once you get a taste for automating workflows, every request goes under the microscope with a simple question: “Why am I doing this, instead of the developer asking a computer to do it?”
Chatbot setup
StackStorm is at the heart of our chatbot deployment. The tool gave us everything we needed to start writing commands. The project ships with Hubot, but unless you run into problems, you don’t need to know anything about Hubot itself. The StackStorm documentation has a ChatOps tutorial that gets into the specifics of how to set it up.
At the core of StackStorm are the workflows you create. It uses the Mistral workflow engine from the OpenStack project, which lets you tie individual steps together into a larger workflow. It can also launch separate branches of the workflow, giving you some parallel execution capabilities. For example, if your workflow depends on seeding data in two separate databases, you could parallelize those tasks and then have the workflow continue (or “join” in StackStorm parlance) after those two separately executing tasks complete. It can be a powerhouse option and a pain in the ass at the same time. But we’ll get into that more later in the post.
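To make that join idea concrete, here is a minimal sketch of a Mistral workflow with two parallel seed tasks. The task names and echo commands are hypothetical placeholders rather than our real workflow; the point is just the structure of two branches joining into a third task.
---
version: '2.0'
platform.create:
  type: direct
  input:
    - environment
  tasks:
    # Both seed tasks have no inbound transitions, so Mistral starts them in parallel.
    seed_app_db:
      action: core.local
      input:
        cmd: "echo 'seeding app db for <% $.environment %>'"
      on-success:
        - finish_setup
    seed_reporting_db:
      action: core.local
      input:
        cmd: "echo 'seeding reporting db for <% $.environment %>'"
      on-success:
        - finish_setup
    finish_setup:
      # join: all makes this task wait until every task that transitions into it has completed.
      join: all
      action: core.local
      input:
        cmd: "echo 'both databases seeded, continuing'"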
The workflows are then connected to StackStorm actions, which allow you to execute them using the command line tool or the chatbot. An action definition is a YAML file that looks like this:
---
name: "create"
pack: platform
runner_type: "mistral-v2"
description: "Creates a Centro Platform environment"
entry_point: "workflows/create.yaml"
enabled: true
parameters:
  environment:
    type: "string"
    required: true
    description: "The name of the environment"
  requested_version:
    type: "string"
    default: "latest"
    description: "The version of the platform to deploy"
Workflows and actions are packaged together in StackStorm via “packs”. Think of it as a package in StackStorm that provides related functionality to a product. For us, we group our packs around applications, along with a few shared libraries for actions we perform from multiple packs. The above action is from the platform pack, which controls management of our primary platform environment. There are a bunch of community supported packs available via the StackStorm Exchange.
Then to finally make this a chat command, we define an alias. The alias identifies what messages in chat will trigger the associated action.
---
name: "create"
action_ref: "platform.create"
description: "Creates a Platform environment"
formats:
  - "create platform environment named {{ environment }}( with dataset {{ dataset }})?( with version {{ requested_version='latest' }})"
ack:
  format: "Creating platform environment {{ execution.parameters.environment }}"
  append_url: false
result:
  format: "Your requested workflow is complete."
The formats section of the alias is a slightly modified regular expression. It can be a bit difficult to parse at times as commands become more complex with more optional parameters. The {{ environment }} notation expresses a parameter that will be passed on to the associated action. You can also set a parameter to a default value via assignment, as in {{ requested_version='latest' }}. This means if a user doesn’t specify a requested_version, “latest” will be passed as the value for that parameter. Between regex and default parameters, you have a lot of control over the parameters a user can specify. You can also have multiple formats that trigger the same action. You can see which action will be invoked from the action_ref line; it’s in pack.action_name format.
StackStorm brings a lot to the table
This might seem like a lot to set up per command, but it’s actually quite nice to have StackStorm as this layer of abstraction. Because StackStorm is really an event-driven automation tool, it exposes the workflows you create in three different ways.
- The chatbot allows you to execute commands via your chat tool. Hubot supports a number of chat tools, which I believe translates to StackStorm support as well.
- The packs and actions you create can be executed manually via the StackStorm command line tool. This is extremely useful when there’s a Slack outage. The command syntax is
st2 run platform.create environment=testing requested_version=4.3
And just like in chat, optional parameters will get default values.
- The StackStorm application also provides API access. This gives you the ability to call workflows from just about any other application. This is great when someone needs to do the exact same thing a user might do themselves via the Chatbot. That whole shared context thing showing up again.
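As a rough sketch, calling that same workflow through the API looks something like this. The hostname and API key are placeholders and authentication options vary by deployment, so treat the details as assumptions:
curl -X POST "https://stackstorm.example.com/api/v1/executions" \
  -H "St2-Api-Key: <your API key>" \
  -H "Content-Type: application/json" \
  -d '{"action": "platform.create", "parameters": {"environment": "testing", "requested_version": "4.3"}}'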
What do you run via Chatbot?
To put it simply, as much as we can. Anytime there’s a request to do something more than once, we start to ask ourselves, “Are we adding value to this process or are we just gatekeepers?” If we’re not adding value, we put it in a chat command. Some examples of things we have automated:
- Create an environment
- Restore an environment
- Take a DB Snapshot of an environment
- Scale nodes in an Autoscaling group
- Execute Jenkins build jobs
- Scale the Jenkins worker instance count
- Run migrations
- Pause Sidekiq
- Restart services
- Deploy code
- Put an environment in maintenance mode
- Turn on a feature toggle
- Get a config value from Consul
- Set a config value in Consul
In all, we have over 100 chat commands in our environment.
But what about security?
Yes, security is a thing. Like most things security-related, you need to take a layered approach. We use SSO to authenticate to Slack, so that’s the first layer. The second layer is provided inside the workflows that we create. You have to roll your own RBAC, but most organizations have some sort of directory service for group management. For Slack in particular, the RBAC implementation can be a bit messy. The chatbot variables you get as part of each message event include the user’s username, which is changeable by the user. So you really need to grab the user’s ID, look up the user’s info via the Slack API to get the email address of the account, and then use that to look up group information in whatever your directory service is.
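A minimal sketch of that lookup, assuming a Slack bot token with the users:read.email scope; matching the resulting email address against groups is left to whatever directory service you use:
# Resolve the Slack user ID from the message event into a stable email address.
curl -s -H "Authorization: Bearer ${SLACK_TOKEN}" \
  "https://slack.com/api/users.info?user=${SLACK_USER_ID}" | jq -r '.user.profile.email'
# The email address is then what gets matched against group membership in the directory service.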
We also ensure that dangerous actions have other out-of-band workflow controls. For example, you can’t just deploy a branch to production. You can only deploy an RPM that’s in the GA YUM repository. In order to get a package to the GA repository, you need to build from a release branch. The artifact of the release branch gets promoted to GA, but only after the promotion confirms that the release branch has a PR that has been approved to go to master. These sorts of out-of-band checks are crucial for some sensitive actions.
Push-based two-factor authentication for some actions is desired too. The push-based option is preferred because you don’t want a two-factor code submitted via chat that is technically still live for another 60–120 seconds. We’re currently working on this, so keep an eye out for another post.
Lastly, there are some things you simply can’t do via the Chatbot. No one can destroy certain resources in Production via Chat. Even OPS has to move to a different tool for those commands. Sometimes the risk is just too great.
Pitfalls
A few pitfalls with chatbots that we ran into.
- We didn’t define a common lexicon for command families. For example, a deploy should have very similar nomenclature everywhere. But because we didn’t define a specific structure, some commands are “create platform environment named demo01” and some are “create api environment demo01”. The simple omission of “named” can trip up people who need to operate in both the platform space and the api space.
- The Mistral workflow engine is a powerful tool, but it can be a bit cumbersome. It also uses a polling mechanism to move between steps. (Step 1 completes, but step 2 doesn’t start until the polling interval occurs and the system detects that step 1 finished.) As a result, during heavy operations a considerable amount of time is wasted on steps that have already completed but are waiting for the next poll before the workflow moves on.
- Share the StackStorm workflow early with all teams. Empower them to create their own commands early on in the process, before the tools become littered with special use cases that make you hesitant to push that work out to other teams.
- Make libraries of common actions early. You can do it by creating custom packs so that you can call those actions from any pack.
- Use the Mistral workflow runner sparingly. It’s just one type of command runner StackStorm offers. I think the preferred method of execution, especially for large workflows, is to have most of that execution in a script, so that the action becomes just executing the script. (A minimal example of that pattern follows this list.) Mistral is nice, but it becomes extremely verbose when you start executing a lot of different steps.
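To illustrate the script-backed pattern, here is a sketch of an action definition that hands everything off to a shell script via StackStorm’s local-shell-script runner. The action name, script path and parameter are hypothetical:
---
name: "restore"
pack: "platform"
runner_type: "local-shell-script"
description: "Restores a platform environment (hypothetical example)"
entry_point: "scripts/restore_environment.sh"
parameters:
  environment:
    type: "string"
    required: true
    position: 0
    description: "The name of the environment to restore"
With this shape, all of the sequencing logic lives in restore_environment.sh, so the StackStorm side stays small and the script can be run and tested on its own.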
Conclusion
We’re pretty happy with our chatbot implementation. It’s not perfect by any means, but it has given us back a lot of time that was previously lost to toil. StackStorm has been a tremendous help. The StackStorm Slack is where a lot of the developers hang out, and they’re amazing. If you’ve got a problem, they’re more than willing to roll up their sleeves and help you out.
While not in-depth, I hope this brief writeup has helped someone out there in their Chatbot journey. Feel free to ping me with any questions or leave comments here.
-
Stories vs Facts in Metrics
You need to measure your processes. It doesn’t matter what type of process, whether it be a human process, a systems process or a manufacturing process, everything needs to be measured. In my experience, you’ll often find people resistant to metrics that measure them. There’s a lot of emotion that gets caught up in collecting metrics on staff because, unlike computers, we intuitively understand nuance. I’ve worked hard to be able to collect metrics on staff performance while at the same time not adding to the team’s anxiety when the measuring tape comes out. A key to that is how we interpret the data we gather.
At Centro, we practice Conscious Leadership, a methodology for approaching leadership and behavior throughout the organization. One of the core tenets of Conscious Leadership is the idea of Facts vs Stories. A fact is something that is completely objective, something that could be revealed by a video camera. For example, “Bob rubbed his forehead, slammed his fist down and left the meeting.” That account is factually accurate. Stories are interpretations of facts. “Bob got really angry about my suggestion and stormed out of the meeting.” That’s a story around the fact that Bob slammed his fist down and left the meeting, but it’s not a fact. Maybe Bob remembered he left his oven on. Maybe he realized at that exact moment the solution to a very large problem and had to go test it out. The point is, the stories we tell ourselves may not be rooted in reality, but may simply be a misinterpretation of the facts.
This perspective is especially pertinent with metrics. There are definitely metrics that are facts. An example is number of on-call pages to an employee. That’s a fact. The problem is when we take that fact and develop a story around it. The story we may tell ourselves about that is we have a lot of incidents in our systems. But the number of pages a person gets may not be directly correlated to the number of actual incidents that have occurred. There is always nuance there. Maybe the person kept snoozing the same alert and it was just re-firing, creating a new page.
There are, however, some metrics that are not facts, but merely stories in a codified form. My favorite one is automatically generated stats around Mean Time to Recovery. This metric is usually generated by measuring the length of an incident or incidents related to an outage. But it is usually a story and not a fact. The fact is the outage incident ticket was opened at noon and closed at 1:30pm. The story around that is it took us 1.5 hours to recover. But maybe the incident wasn’t closed the moment service was restored. Maybe the service interruption started long before the incident ticket was created. Just because our stories can be distilled into a metric doesn’t make them truthful or facts.
Facts versus stories is important in automated systems, but even more so when dealing with human systems and their related workflows. Looking at a report and seeing that Fred closed more tickets than Sarah is a fact. But that doesn’t prove the story that Fred is working harder than Sarah or that Sarah is somehow slacking in her responsibilities. Maybe the size and scope of Fred’s tickets were smaller than Sarah’s. Maybe Sarah had more drive-by conversations than Fred, which reduced her capacity for ticket work. Maybe Sarah spent more time mentoring co-workers in a way that didn’t warrant a ticket. Maybe Fred games the system by creating tickets for anything and everything. There are many stories we could make up around the fact that Fred closed more tickets than Sarah. It’s important as leaders that we don’t let our stories misrepresent the work of a team member.
The fear of the stories that we make out of facts is what drives the angst team members have when leaders start talking about a new performance metric. Be sure to express to your teams the difference between facts and stories. Let them know that your measurements serve as signals more than truths. If Fred is closing a considerably larger number of tickets, it’s a signal to dig into the factors behind the fact. Maybe Fred is doing more than Sarah, but more than likely, the truth is more nuanced. Digging in may reveal corrective action on how work gets done, or it might reveal a change in the way that metric is tracked. (And subsequently, how that fact manifests.) Or it might confirm your original story.
Many people use metrics and dashboards to remove the nuance of evaluating people. They should instead serve as the prompt to reveal the nuance. When you take your issue to your team, make sure you are open about the facts and your story around those facts. Be sure to separate the two and have an open mind as you explore your story. The openness and candor will provide a level of comfort around the data being collected, because the team knows it’s not the end of the conversation.
-
How You Interview is How You Hire
“Of course you can’t use Google, this is an interview!” That was the response a friend got when he attempted to search for the syntax of something he hadn’t used in a while during an interview. After the interview was over, he texted me and told me about the situation. My advice to him was to run as fast as he could and to not think twice about the opportunity, good or bad. I haven’t interviewed at a ton of places as a candidate, so my sample size is statistically insignificant, but it seems insane in today’s world that this would be how you interview a candidate.
As a job candidate, you should pay special attention to how your interview process goes. You spend so much time focused on getting the answers right and impressing the tribunal that you can sometimes fail to evaluate the organization based on the nature of the questions being asked and how they are asked. As an organization, how you interview is how you hire, and how you hire is how you perform. This is an important maxim, because it can give you, the job seeker, a lot of insight into the culture and personalities you might be working with soon.
The interview process I described earlier seems to put more emphasis on rote memorization than actual problem-solving ability. Coming from someone who still regularly screws up the syntax for creating symlinks, I can attest to the idea that your ability to memorize structure has no bearing on your performance as an engineer.
What does an emphasis on memorization tell me about an organization? They may fear change. They may demand the comfort of tools they know extremely well, which on the face of it isn’t a bad thing. Why use the newest whizzbangy thing when the old tried and true works? Well, sometimes the definition of “works” changes. Nagios was fine for me 20 years ago, but it isn’t the best tool for the job with the way my infrastructure looks today, regardless of how well I know Nagios. (On a side note, I think this describes VIM users. We labor to make VIM an IDE because we’ve spent so many years building up arcane knowledge that starting over seems unpalatable. But I digress.)
No one expects to work at a place where Google isn’t used extensively to solve problems. So what exactly are interviewers attempting to accomplish by banning it? Creating a power dynamic? Seeing how you work under pressure? Simulating an environment where Internet access is heavily restricted? These goals very well could be pertinent to the job, but how you evaluate those things is just as important as the results you get from them.
I wish there were a Rosetta Stone mapping interview formats to personality types, but this is just one example of the type of thing I look for when interviewing and try to actively avoid when giving an interview. Things to also look out for:
- Are they looking for a specific solution to a general problem? Maybe you have an answer that works, but you feel them nudging you to a predetermined answer. (e.g. Combine two numbers to get to 6. They might be looking for 4 and 2, but 7 and –1 are also valid)
- Did the interview challenge you at all technically? Will you be the smartest person in the room if you’re hired?
- Are you allowed the tools that you would fully expect to use on the job? (Google, IDE help documentation etc)
- Are they asking questions relevant to the actual role? Preferably in the problem space you’ll be working in.
Paying attention to how a company evaluates talent gives you insight into the type of talent they have. The assumption is always that the people giving the interview have the right answers, but plenty of unqualified people have jobs and those same unqualified people often sit in on interviews.
Remember that the interview is a two-way street. Look to the interview process as a way to glean information about all of the personalities, values and priorities that make the process what it is. And then ask yourself, is it what you’re looking for?
-
Hubris — The Interview Killer
Interviewing engineers is a bit more art than science. Every hiring manager has that trait that they look for in a candidate. As an interviewer, you subconsciously recognize early on if the candidate has that magical quality, whatever it may be for you. It either qualifies or disqualifies a candidate in an instant. The trait that I look for to disqualify a candidate is hubris.
Self-confidence is a wonderful thing, but when self-confidence becomes excessive, it’s toxic and dangerous. That danger is never more prevalent than during the build vs buy discussion. The over-confident engineer doesn’t see complex problems, just a series of poor implementations. The over-confident engineer doesn’t see how problems can be interconnected or how use cases change. Instead they say things like “That project is too heavy. We only need this one small part” or “There aren’t any mature solutions, so we’re going to write our own.”
The cocky engineer to the rescue
Humility in an engineer is not a nicety, it’s a necessity. Respect for the problem space is a required ingredient for future improvements. But as important as respect for the problem is, respect for the solutions can be even more important. Every solution comes with a collection of trade-offs. The cocky engineer doesn’t respect those trade-offs, or doesn’t believe they were necessary in the first place. The cocky engineer lives in a world without constraints, without edge cases and with an environment frozen in time, forever unchanging.
But why does all this matter? It matters because our industry is full of bespoke solutions to already solved problems. Every time you commit code to your homegrown log-shipping tool, an engineer that solves this problem as part of their full-time job dies a little bit on the inside. Every time you have an easy implementation for leader election in a distributed system, a random single character is deleted from the Raft paper.
I’m not suggesting that problems are not worth revisiting. But a good engineer will approach the problem with a reverence for prior work. (Or maybe they’re named Linus) An arrogant engineer will trivialize the effort, over promise, under deliver and saddle the team with an albatross of code that always gets described as “some dark shit” during the on-boarding process to new hires.
If problems were easy, you wouldn’t be debating the best way to solve them, because the answer would be obvious and standard. When you’re evaluating candidates, make sure you ask questions that involve trade-offs. Ask them for the flaws in their own designs. Even if they can’t identify the flaws, how they respond to the question will tell you a lot, so listen closely. If you’re not careful, you’ll end up with a homegrown Javascript framework…or worse.
-
I really like the concept here, but I’m not sure I’m fully getting it. Adaptive Capacity *can* be pretty straightforward from a technology standpoint, especially in a cloud type of environment where the “buffer” capacity doesn’t incur cost until it’s actually needed. When it comes to the people portion, I’m not sure if I’m actually achieving the goal of “adaptive” or not.
My thought is basically building in “buffer” in terms of work capacity, but still allocating that buffer for work and using prioritization to know what to drop when you need to shift. (Much like the buffers/cache of the Linux filesystem) The team is still allocated for 40 hours worth of work, but we have mechanisms in place to re-prioritize work to take on new work. (i.e. You trade this ticket/epic for that ticket/epic or we know that this lower value work is the first to be booted out of the queue)
This sounds like adaptive capacity to me, but I’m not sure if I have the full picture, especially when I think of Dr. Cook’s list of 7 items from Poised to Deploy. The combination of those things is exactly what makes complex systems so difficult to deal with. People understand their portion, but not the system as a whole, so we’re always introducing changes/variance with unintended ripple effects. And I think that’s where it feels like I have a blind spot when it comes to the concept.
I might have jumped the gun on this post, because I still have one of the keynotes you linked in the document to watch as well as a PDF that Allspaw tweeted, but figured I’d just go ahead and get the conversation rolling before it fell off my to-do list. =)