ChatOps/ChatBots at Centro - All Things Dork

ChatOps/ChatBot at Centro

“white robot action toy” by Franck V. on Unsplash

During DevOpsDays PDX I chatted with a number of people who were interested in doing ChatOps in their organizations. It was the motivation I needed to take this half-written blog post and put the finishing touches on it.

Why a ChatBot?

The team at Centro had always flirted with the idea of doing a chatbot, but we stumbled into it to be honest, which will account for a bunch of the problems we’ve encountered down the road. When we were building out our AWS Infrastructure, we had envisioned an Infrastructure OPS site that would allow users to self-service requests. A chatbot seemed like a novelty side project. One day we were spit-balling on what a chatbot would look like. I mentioned a tool I had been eyeing for a while called StackStorm. StackStorm has positioned itself as an “Event-Driven Automation” tool, the idea being that an event in your infrastructure could trigger an automation workflow. (Auto-remediation anyone?) The idea seemed solid based on the team’s previous experience at other companies. You always find that you have some nagging problem that’s going to take time to get prioritized and fixed. The tool also had a ChatOps component, since when you think about it, a chat message is just another type of event.

To make a long story short, one of our team members did a spike on StackStorm out of curiosity and in very short order had a functioning ChatBot ready to accept commands and execute them. We built a few commands for Marvin (our chatbot) with StackStorm and we instantly fell in love. Key benefits.

Slack is a client you can use anywhere. The more automation you put in your chatbot the more freedom you have to truly work anywhere.
The chatbot serves as a training tool. People can search through history to see how a particular action is done.
The chatbot (if you let it) can be self-empowering for your developers
Unifies context. (again if you let it) The chatbot can be where Ops/Devs/DBAs all use the same tool to get work done. There’s a shared pain, a shared set of responsibility and a shared understanding of how things are operated in the system. The deploy to production looks the same way as the deploy to testing.

Once you get a taste for automating workflows, every request will go under the microscope with a simple question; “Why am I doing this, instead of the developer asking a computer to do it”.

Chatbot setup

StackStorm is at the heart of our chatbot deployment. The tool gave us everything we needed to start writing commands. The project ships with Hubot but unless you run into problems, you don’t need to know anything about Hubot itself. The StackStorm setup has a chatops tutorial to get into the specifics of how to set it up.

The StackStorm tool consists of various workflows that you create. It uses the Mistral workflow engine from the OpenStack Project. It allows you to tie together individual steps to create a larger workflow. It has the ability to launch separate branches of the workflow as well, creating some parallel execution capabilities. For example, if your workflow depends on seeding data in two separate databases, you could parallelize those tasks and then have the workflow continue (or “join” in StackStorm parlance) after those two separately executing tasks complete. It can be a powerhouse option and a pain in the ass at the same time. But we’ll get into that more later in the post.

The workflows are then connected to StackStorm actions, which allow you to execute them using the command line tool or the Chatbot. An action definition is a YAML file that looks like

---
  name: "create"
  pack: platform
  runner_type: "mistral-v2"
  description: "Creates a Centro Platform environment"
  entry_point: "workflows/create.yaml"
  enabled: true
  parameters:
    environment:
      type: "string"
      required: true
      description: "The name of the environment"
    requested_version:
      type: "string"
      default: "latest"
      description: "The version of the platform to deploy"

Workflows and actions are packaged together in StackStorm via “packs”. Think of it as a package in StackStorm that provides related functionality to a product. For us, we group our packs around applications, along with a few shared libraries for actions we perform from multiple packs. The above action is from the platform pack, which controls management of our primary platform environment. There are a bunch of community supported packs available via the StackStorm Exchange.

Then to finally make this a chat command, we define an alias. The alias identifies what messages in chat will trigger the associated action.

---
name: "create"
action_ref: "platform.create"
description: "Creates a Platform environment"
formats:
  - "create platform environment named {{ environment }}( with dataset {{ dataset }})?( with version {{ requested_version='latest' }})"
ack:
  format: "Creating platform environment {{ execution.parameters.environment }}"
  append_url: false

result:
  format: "Your requested workflow is complete."

The formats section of the alias is a slightly modified regular expression. It can be a bit difficult to parse at times as commands become more complex with more optional parameters. The {{ environment }} notation expresses a parameter that will be passed on to the associated action. You can also set that parameter to a default value via assignment, as in {{ requested_version=latest }}. This means if a user doesn’t specify a requested_version, “latest” will be passed as the value for that parameter. Between regex and default parameters, you can have a lot of control over parameters that a user can specify. You can also have multiple formats that trigger the same action. You can see what action this will be invoked by the action_ref line. It’s in a pack.action_name format.

StackStorm brings a lot to the table

This might seem like a lot to get setup per command, but it’s actually quite nice to have StackStorm as this layer of abstraction. Because StackStorm is really an event-automation tool, it exposes these workflows you create in 3 different ways.

The chatbot allows you to execute commands via your chat tool. Hubot supports a number of chat tools, which I believes translates to StackStorm support as well.
The packs and actions you create can be executed from the StackStorm run command manually. This is extremely useful when there’s a Slack outage. The command syntax is st2 run platform.create environment=testing requested_version=4.3 And just like in chat, optional parameters will get default values.
The StackStorm application also provides API access. This gives you the ability to call workflows from just about any other application. This is great when someone needs to do the exact same thing a user might do themselves via the Chatbot. That whole shared context thing showing up again.

What do you run via Chatbot?

To put it simply, as much as we can. Anytime there’s a request to do something more than once, we start to ask ourselves, “Are we adding value to this process or are we just gate keepers?” If we’re not adding value, we put it in a chat command. Some examples of things we have our.

Create an environment
Restore an environment
Take a DB Snapshot of an environment
Scale nodes in an Autoscaling group
Execute Jenkins build jobs
Scale the Jenkins worker instance count
Run migrations
Pause Sidekiq
Restart services
Deploy code
Put an environment in maintenance mode
Turn on a feature toggle
Get a config value from Consul
Set a config value in Consul

In all, we have over 100 chat commands in our environment.

But what about Security

Yes, security is a thing. Like most things security related you need to take a layered approach. We use SSO to authenticate to Slack, so that’s the first layer. The second layer is provided inside the workflows that we create. You have to roll your own RBAC, but most organizations have some sort of Directory Service for group management. For Slack in particular the RBAC implementation can be a bit mess.y The chatbot variables you get as part of each message event include the user’s username, which is changeable by the user. So you really need to grab the user’s token, look up the user’s info with the token to get the email address of the account and then use that to look up group information in whatever your directory service is.

We also ensure that dangerous actions have other out-of-band workflow controls. For example, you can’t just deploy a branch to production. You can only deploy an RPM that’s in the GA YUM repository. In order to get a package to the GA repository, you need to build from a release branch. The artifact of the release branch gets promoted to GA, but only after the promotion confirms that the release branch has a PR that has been approved to go to master. These sorts of out-of-band checks are crucial for some sensitive actions.

Push based two-factor authentication for some actions is desired too. The push based option is preferred because you don’t want to have a two-factor code submitted via Chat, that is technically like for another 60–120 seconds. We’re currently working on this, so keep an eye out for another post.

Lastly, there are some things you simply can’t do via the Chatbot. No one can destroy certain resources in Production via Chat. Even OPS has to move to a different tool for those commands. Sometimes the risk is just too great.

Pitfalls

A few pitfalls with chatbots that we ran into.

We didn’t define a common lexicon for command families. For example, a deploy should have very similar nomenclature everywhere. But because we didn’t define a specific structure, some command are create platform environment named demo01 and some are create api environment demo01. The simple omission of name can trip people up who need to operate in both the platform space and the api space.
The Mistral workflow is a powerful tool, but it can be a bit cumbersome. The workflow also uses a polling mechanism to move between steps. (Step 1 completes, but step 2 doesn’t start until the polling interval occurs and the system detects step 1 finished) As a result, during heavy operations you can spend a considerable amount of time wasted with steps completing, but waiting to poll successfully before they move on.
Share the StackStorm workflow early with all teams. Empower them to create their own commands early on in the process, before the tools become littered with special use cases that makes you hesitant to push that work out to other teams.
Make libraries of common actions early. You can do it by creating custom packs so that you can call those actions from any pack.
Use the mistral workflow sparingly. It’s just one type of command runner StackStorm offers. I think the preferred method of execution, especially for large workflows, is to have most of that execution in a script, so that the action becomes just executing the script. The Mistral tool is nice, but becomes extremely verbose when you start executing a lot of different steps.

Conclusion

We’re pretty happy with our Chatbot implementation. It’s not perfect by any means, but it has given us back a lot of time in wasted toil work. StackStorm has been a tremendous help. The StackStorm Slack is where a lot of the developers hangout and they’re amazing. If you’ve got a problem, they’re more than willing to roll up their sleeves and help you out.

While not in-depth, I hope this brief writeup has helped someone out there in their Chatbot journey. Feel free to ping me with any questions or leave comments here.