-
Ask the wrong people and you build the wrong thing
Ask the wrong people and you build the wrong thing
Not long ago my wife and I received an email from our kid’s school. (They attend a CPS school.) The email was a survey of some sort that would be used to make decisions about curriculum in next year’s school program. It always excites me when parents, teachers and administrators get to collaborate on school programs.
You can imagine my frustration as I clicked the link to the survey and was greeted with a cryptic error message from Google Forms. I wish I could remember what the error said, but even as someone with a technical background, the error didn’t point to any specific action that could be taken to resolve it. Thanks to the pandemic, I’m well equipped to handle the idiosyncrasies of CPS’ implementation of Google Apps. I logged out of all my Google accounts, logged back in with my daughter’s CPS email address, and was granted access to the survey.
The question being asked was whether we would prefer more STEM classes or more Arts-related classes for extracurricular offerings next year. I quickly suspected that this was going to lead to a case of selection bias, as the people who figured out how to participate in the survey are probably more technically inclined than those who just gave up. Out of curiosity I asked a few people in my circle. The people I’d consider technical poked around and figured things out, while others just gave up, assuming that something was broken on the site, which was a fair assumption given the generic nature of the error message.
This experience got me thinking about how often we make “educated” decisions based on poor information. CPS could think that they were implementing the wishes of their student community only to find out they were addressing a subset. I’ve fallen victim to this mistake myself when I wasn’t careful about who I was actually asking.
When my team and I were designing our infrastructure platform at Basis we had a tendency to talk to the loudest developers in the room. Those developers had very specific needs and requirements. But we missed a lesson that I’m sure every product manager in the world knows: the loudest people aren’t always representative of the larger user body. This is exactly what we encountered as we built out our chatbot. Using feedback from the noisy developers pushed us towards a model with many different options for building environments and packages. Instead of creating a tight, streamlined process, we created different avenues for people to build and manage their environments. We supported custom datasets that were seldom used. We created different methods of creating environments, so maybe you only needed the database server, or maybe you wanted the database server and the jobs server. This created headaches for the people who didn’t want that functionality, which forced us to create omnibus commands that strung together multiple commands.
I’d really like to be angry at the developers for this but the truth is the mistake was all my own. Developers, like anyone else, have different things that they’re attracted to. Some developers love to understand the stack from top to bottom and want configurability at every level. Others are wholly disinterested in infrastructure and want to just point to a repository and say “make an environment out of this”. Whatever my personal feelings are about how much or how little interest they have, the reality is you’re probably not moving them from their stated position. And even if you do, it will take a very heavy hand, and you’re likely to just alienate them.
The lesson learned is to make sure that you’re talking to the audience that you actually want to talk to. Who are you asking? How are you asking them? Think about how they might self-select out of your surveys or questions and see if you can mitigate that. Engaging with the entire audience might mean you need multiple methods of interviewing people. Developers who respond to surveys might not be the same developers who will respond to 1-on-1 interviews. Don’t make the mistake of optimizing for a subset of your audience or user base. Put care and thought into reaching your target audience.
-
Benefits of Conferences
Benefits of Conferences
I love meeting new people at conferences, especially when they’re first-time conference attendees. One of my favorite questions to ask is “What did you have to do to get approval to attend?” The question reveals a lot about their employer and the person’s direct manager.
In many organizations, conference attendance is seen as a transactional affair, with only specific line items in the transaction providing any sort of intrinsic value. These organizations saddle their employees with requirements that must be met in order to attend: note-taking requirements, presentations to give when they return, required talks to attend while at the conference. These are just a few of the requirements I’ve heard over my years of attending. It can be easy to dress these requirements up as “due diligence,” but in most cases I’ve come across, this level of rigor only seems to apply to conferences. What is more likely happening is that these organizations don’t see the concrete value they expect from attending conferences and therefore discount them. But conferences deliver an impact that can be clearly felt, even if their concrete value is difficult to calculate and put on a ledger.
The Hallway Track
Anyone who has attended a conference will tell you that the hallway track is often the most valuable part of the conference. The hallway track is the part of the conference that is unscheduled and unscripted. As people make their way from one talk to another, they inevitably bump into each other and start a conversation that slowly balloons into something larger. Sometimes the conversation is so interesting that you forgo your next talk in favor of this impromptu conversation in-between sessions.
The magic of these conversations is that they tend to take on a life of their own, bending and weaving with the desires of the participants. Something that starts as a follow-up question on distributed locking techniques can quickly evolve into questions that are deeper and more specific to your particular problem. And despite everyone’s desire to be special, conferences make you realize that most of us are solving similar sets of problems. Even if you don’t get a definitive solution out of these conversations, I assure you that you’ll get a briefing on how not to solve the problem.
The hallway track has been difficult to replicate virtually. Since the onset of the pandemic, many groups have tried and found very inventive ways to imitate it, but there’s nothing quite like the real thing. Equally difficult is putting a dollar value on the track. There’s no time slot you can point at to show your boss why you want to attend. It’s something organic that evolves, but more importantly, that you have some semblance of control over. Yes your mileage may vary but that’s really the case for everything.
Introduction to new thoughts and ideas
Albert Einstein is often credited with the quote:
“We cannot solve our problems with the same thinking we used when we created them.”
This trap is common within engineering groups. They get locked into their standard way of thinking and can’t see how a different approach might work. “That would never work here” is a common retort to new ideas. But continued and expanded exposure to new ideas and their successful implementations makes people question the way they do things. Again, never be surprised by just how many people have the same problems you have. Unless you’re Facebook, Apple, Netflix or Google, your company probably has the same types of problems as everyone else. It’s hard to accept that you’re not a special, magical snowflake, but attending a conference can force that acceptance pretty quickly.
Sometimes these new ideas and approaches to your problem are not packaged in a flashy title that draws your attention. In my experience some of the best tidbits of information come from talks that I would have never attended or watched on my own. But when I’m at a conference, there’s always a block of time that doesn’t have a talk that speaks directly to my problems. When attending in person I’m more compelled to sit in on a random talk in that situation. It’s incredible how often that random talk pays dividends. Would I have spent 45 minutes on that talk if I just came across it on YouTube? Probably not. But broadening the scope of what I hear and attend helps with problems that are not top of mind. Better yet, you realize that some of your underlying problems are related to activities, actions or systems that you hadn’t previously considered. Exposure to people, their problems, ideas and solutions helps to expand your thinking about your own problems.
Getting your company name out in the community
You might work for a small or medium-sized company that just isn’t on the minds of technical professionals. Attending conferences (and even better, speaking at them) helps to get your company name into the tech community. With remote work opportunities continuing to grow, the number of potential prospects skyrockets with conference attendance.
In addition to socializing the company name, you’re also socializing the company’s values by the fact that you have employees in attendance! You’d be surprised how valuable that can be to potential job seekers. I’m always surprised when I’m at a DevOps Days conference and I meet someone working at a bank or a hospital, industries that I associate with old-world thinking and mentalities. But talking to those attendees and hearing that their teams are experimenting with DevOps practices, using modern technologies and work management techniques helps to change my biased view of them.
Energizing your employees
The post-conference buzz is real. Once you’ve gotten all of this new information, you’re eager to see how it can be applied to your day-to-day work. Many people come back to the office with a basket of ideas, some of them completely crazy, but many of them completely practical and achievable. As a team you’ll have to figure out which are which. With the support of management that buzz can be channelled into making real change and providing employees with immense job satisfaction as they do it.
Job satisfaction = Retention
No amount of healthy snacks, ping-pong tables and free soda can replace the joy engineers get when they can effect change.
Virtual attendance
A quick note about virtual attendance. During the pandemic conference organizers tried very hard, with varying degrees of success, to replicate the in-person conference feel virtually. But regardless of how well conference organizers do this, remote conferences can be difficult.
For starters, networking virtually can be hard. It requires a level of intentionality, either on the part of the conference organizers or from you as an attendee. Chat rooms during a conference talk are a common way of trying to generate those networking opportunities, but they can distract you from the speaker. Hanging out in chat rooms after the talk can sometimes be effective, but again, it’s just not quite the same as in person.
Another thing to consider with virtual attendance is how you attend. Many people attend conferences virtually but remain logged into all of their usual modes of communication for work, which effectively means you’re still working. Without a clear separation from your work duties, virtual attendance can give way to the usual pressures of the “office.”
These are just a couple of reasons why I favor in-person conferences to virtual conferences. Are virtual conferences better than nothing? Absolutely. But I caution you to not evaluate the value of conferences based solely on virtual conferences.
Wrap-up
Conferences can be a great resource for your employees to engage in the communities that they’re a part of. Networking is crucial to building relationships and knowledge and that is an activity that can be much easier to do in person.
Conferences help expose people to new ideas and new ways to solve problems other than the standard approach the company may take. When you attend conferences you quickly learn that your problems are not as unique as you thought. You’ll without a doubt run into people that have the same problems as you. You’ll probably even meet people who have tried the same solutions and can save you from a wasted journey.
Conferences also help to energize employees. You come back from a conference and you’re excited to experiment with a lot of the techniques and technologies you learned about.
If your company won’t send you to a conference, here are a few quick tips that might help.
- Some conferences have free tickets, especially for underrepresented groups. If you’re curious about a conference but can’t attend, definitely look into this option. I’ve seen some conferences even cover hotel and air fare.
- Speaking at conferences is another way to get into the event for free. Many conferences have a public Call for Proposals (CFP) process that you can submit to. Don’t think you need some crazy, mind bending thing to give a talk about. Your personal experience, personal communication style and touch can’t be replicated and is something unique to offer. Try it out!
- Try to show the value of the conference to your management. Highlight why you want to attend the conference and some of the soft benefits beyond what watching the YouTube videos after the conference can provide. You can use some of the points highlighted in this article.
- Pay for the conference yourself. Be sure to talk to your manager and let them know that you’re willing to pay for the conference yourself if they can support you with time off and/or some help with the travel expenses. This technique depends heavily on your personal situation and the size of the conference.
- Find a new job. (Seriously) I’m not suggesting you quit right now over it, but you might want to consider adding a question about conference attendance to your list of interview questions.
-
Authoring K8s Manifests
Authoring K8s Manifests
Note: This is an internal blog post that I wrote at our company. When I interact with people in the tech community they’re often curious about how different teams think about these problems more broadly, so I thought I’d include this. The audience was internal Basis employees, so some of the references may not make sense.
There are no right solutions
As humans we’re obsessed with not making the wrong choice. Everything from where you go to school to whether you should order the chicken or the steak is besieged by the weight of making “the wrong” choice. But that framing suggests that right and wrong are absolutes, as if you could plug in all the variables of a given situation and arrive at a conclusive answer. This couldn’t be further from the truth. Not in life and definitely not in engineering.
Choices are about trade-offs. Depending on what you’re optimizing for, one set of trade-offs seems more practical than another. For example, investing your savings is a good idea, but the vehicles you use to invest differ based on your goals. If you need the money soon, a money market account offers flexibility but at the expense of good returns. The stock market might offer higher returns but at the risk of losing some of the principal. Do you need the money in 2 years or 20 years? How much do you need it to grow? How quickly?
The economist Thomas Sowell famously said “There are no solutions, there are only trade-offs; and you try to get the best trade-off you can get, that’s all you can hope for.”
This statement holds true in software engineering as well.
Imperative vs Declarative Manifest Authoring
When it comes to Kubernetes manifests, there really is only one method of applying those manifests and that’s using a declarative model. We tell Kubernetes what it is we want the final state to look like (via the manifests) and we rely on Kubernetes to figure out the best way to get us to that state.
With Kubernetes, all roads lead to a declarative document being applied to the cluster, but how we author those manifests can take on an imperative bent if we want it to, using various template engines like Helm, Jsonnet or the now defunct Ksonnet. But templating languages provide a power and flexibility that allows us to do some things that we probably shouldn’t do, given our past experiences. Templating opens the door to impeding some of the goals we have for the Kubernetes project and the experience we’re specifically optimizing for. I’d prefer to stay away from templating layers as much as possible and be explicit in our manifest declarations.
What are we optimizing for?
In order to really evaluate the tools, we’ve got to discuss what it is we’re optimizing for. These optimizations are in part due to past experiences with infrastructure tools as well as acknowledgements of the new reality we’ll be living in with this shared responsibility model for infrastructure.
Easy to read manifests to increase developer involvement
With the move to Kubernetes we’re looking to get developers more involved with the infrastructure that runs their applications. There won’t be a complete migration of ownership to development teams, but we do anticipate more involvement from more people. The team that works on infrastructure now is only 6 people. The development org is over 40 people. That said, the reality is that many of these developers will only look at the infrastructure side of things 4 or 5 times a year. When they do look at it, we want that code to be optimized for reading rather than writing. The manifests should be clear and easy to follow.
This will require us to violate some principles and practices like code reuse and DRY, but after years of managing infrastructure code we find that, more often than not, each case requires enough customization that the number of parameters and inputs needed to make code actually reusable balloons quickly and becomes unwieldy. Between our goals and the realities of infrastructure code reuse, using clear and plain manifest definitions is a better choice for us. We don’t currently have the organizational discipline to be able to reject certain customizations of an RDS instance. And honestly, rejecting a customization request because we don’t have the time to modify the module/template doesn’t feel like a satisfying path forward.
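To make that concrete, here is a rough sketch (with hypothetical names and values) of the kind of plain, explicit manifest we’re aiming for. Nothing is parameterized; a developer reading it sees exactly what will be applied.

```yaml
# Hypothetical example of an explicit, untemplated Deployment manifest.
# Every value a reader needs is visible in the file itself.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: platform-web
  labels:
    app: platform-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: platform-web
  template:
    metadata:
      labels:
        app: platform-web
    spec:
      containers:
        - name: web
          image: registry.example.com/platform-web:4.3.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
```

It’s more repetitive than a templated chart, but a developer who only touches this file a handful of times a year doesn’t need to chase down where a value comes from.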
A single deployment tool usable outside the cluster
Because of the application awareness our current orchestration code has, we end up with multiple deployment code bases that are all fronted by a common interface (Marvin, the chatbot). Even with Marvin serving as an abstraction layer, you can see chinks in the facade as different deployment commands have slightly different syntax and/or feature support. In the Kubernetes world we want to rely on a single deploy tool that tries to keep things as basic as kubectl apply when possible. Keeping the deploy tool as basic as possible will hopefully allow us to leverage the same tool in local development environments. In order to achieve this goal, we’ll need to standardize on how manifests are provided to the deployment tool.
There is a caveat to this, however. The goal of a single method to apply manifests is separate and distinct from how the manifests are authored. One could theoretically use a template tool like Helm to author the manifests, but then provide the final output to the deploy tool. This would violate another goal of easy-to-read manifests; I just wanted to call out that it could be done. Having some dynamic preprocessor that runs ahead of the deploy tool and commits the final version of the manifest to the application repository could be a feasible solution.
Avoiding lots of runtime parameters
Another issue that we see in today’s infrastructure is that our deploy tool requires quite a bit of runtime information. A lot of this runtime information happens under the hood, so while the user isn’t required to provide it, Marvin infers a lot of information based on what the user does provide. For example, when a user provides the name “staging0x” as the environment, Marvin then recognizes that he needs to switch to the production account vs the preproduction account. He knows there’s a separate set of Consul servers that need to be used. He knows the name of the VPC that it needs to be created in as well as the Class definition of the architecture. (Class definitions are our way to scope the sizing requirements of the environment. So a class of “production” would give you one sizing and count of infrastructure, while a class of “integration” or “demo” will give you another)
This becomes problematic when we’re troubleshooting things in the environment. For example, if you want to manually run terraform apply or even a terraform destroy, many times you have to look at previously run commands to get a sense of what some of the required values are. In some cases, like during a period of Terraform upgrading, you might need to know precisely what was provided at runtime for the environment in order to continue to properly manage the infrastructure. This has definitely complicated the upgrades of components that are long lived, especially areas where state is stored (databases and ElastiCache, for example).
Much of the need for this comes from the technical debt incurred when we attempted to create reusable modules for various components. Each reusable module would create a sort of bubble effect, where input parameters for a module at level 3 in the stack would necessitate that we ask for that value at level 1 so that we could propagate it down. As we added support for a new parameter for a specific use case, it had the potential to impact all of the other pieces that use the module. (Some of this is caused by limitations of the HCL language that Terraform uses.)
Nevertheless, when we use templating tools we open the door for code reuse as well as levels of inference that makes the manifest harder to read. (I acknowledge that putting “code reuse” into a negative context seems odd) This code reuse in particular though tends to be the genesis of parameterization that ultimately bubbles its way up the stack. Perhaps not on day one, but by day 200 it seems almost too tempting to resist.
As an organization, we’re relatively immature as it relates to this shared responsibility model for infrastructure. A lot of the techniques that could mitigate my concerns haven’t been battle tested in the company. After some time running in this environment and getting used to developer and operations interactions my stance may soften, but for day one it is a bit too much to add additional processes to work around the shortcomings.
Easily repeated environment creation
In our internal infrastructure as code (IaC) testing we would often have a situation where coordinating infrastructure changes that needed to be coupled with code changes was a bit of a disaster. Terraform would be versioned in one repository, but SaltStack code would be versioned in a different repository, but the two changes would need to be tested together. This required either a lot of coordination or a ton of manual test environment setup. To deal with the issue more long-term we started to include a branch parameter on all environment creation commands, so that you could specify a custom SaltStack server, a specific Terraform branch and a specific SaltStack branch. The catch was you had to ensure that these parameters were enacted all the way down the pipeline. The complexity that this created is one of the reasons I’ve been leaning towards having the infrastructure code and the application code exist in the same repository.
Having the two together also allows us to hardcode information to ensure that when we deploy a branch, we’re getting a matching set of IaC and application code by setting the image tag in the manifest to match the image built. (There are definite implementation details to work out on this) This avoids the issue of infrastructure code being written for the expectations of version 3.0 of application code, but then suddenly being provided with version 2.0 of application code and things breaking.
We see this when we’re upgrading core components that are defined at the infrastructure layer or when we roll out new application environment requirements, like AuthDB. When AuthDB rolled out, it required new infrastructure, but only for versions of the software that were built off the AuthDB branch. It resulted in us spinning up AuthDB infrastructure whether you needed it or not, prolonging and sometimes breaking the creation process for environments that didn’t need AuthDB.
Assuming we can get over a few implementation hurdles, this is a worthwhile goal. It will create a few headaches for sure. How do we make a small infrastructure change in an emergency (like a replica count) without triggering an entire CI build process? How do we ensure OPS is involved with changes to the /infrastructure directory? All things we’ll need to solve for.
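As a rough sketch of the image-pinning idea above (names and tags are hypothetical), the kustomization.yaml living alongside the application code could carry the image tag that matches the build from the same commit, with CI updating it via something like kustomize edit set image:

```yaml
# Hypothetical kustomization.yaml committed next to the application code.
# The image tag lives in the repo, so the manifests and the built image
# stay a matched set.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
images:
  - name: registry.example.com/platform-web
    newTag: "4.3.0"  # set by CI to the tag of the image built from this commit
```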
Using Kustomize Exclusively
The mixture of goals and philosophies has landed us on using Kustomize exclusively in the environment. Along with that we’d like to adopt many of the Kustomize philosophies around templating, versioning and their approach to manifest management.
While Helm has become a popular method for packaging Kubernetes applications, we’ve avoided authoring helm charts in order to minimize not just the tools, but also the number of philosophies at work in the environment. By using Kustomize exclusively, we acknowledge that some things will be incredibly easy and some things will be incredibly more difficult than they need to be. But that trade-off is part of adhering to an ideology consistently. Some of those tradeoffs are established in the Kubernetes team’s Eschewed Features document. Again, this isn’t to say one approach is right and one is wrong. The folks at Helm are serving the needs of many operators. But the Kustomize approach aligns more closely with the ProdOps worldview of running infrastructure.
We’re looking to leverage Kustomize so that we:
- Don’t require preprocessing of manifests outside of the Kustomize commands
- Are as explicit as possible in manifest definitions, making it easy for people who aren’t in the code base often to read it and get up to speed
- Can easily recreate environments without needing to store or remember runtime parameters that were passed
- Minimize the number of tools used in the deployment pipeline
I’m not saying it’s the right choice. But for ProdOps it’s the preferred choice. Some pain will definitely follow.
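To illustrate what that looks like in practice, here’s a hedged sketch of how an environment could be expressed as a Kustomize overlay, so that everything we used to pass as a runtime parameter lives in the repository instead. The directory layout and patch names are hypothetical.

```yaml
# overlays/staging01/kustomization.yaml (hypothetical layout)
# The environment's identity and sizing live in the repo, not in someone's
# memory of which flags were passed last time.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: staging01
resources:
  - ../../base
patches:
  - path: replica-count.yaml  # e.g. smaller sizing for a non-production class
```

Recreating the environment then becomes a matter of applying the overlay directory rather than remembering the right incantation.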
-
Organizing your todos for better effectiveness
Organizing your todos for better effectiveness
If I’ve learned anything during the pandemic it’s this: time is not my constraining resource. The lockdown has forcibly removed many of the demands on my time that I’ve conveniently used as an excuse. My 35-minute commute each way is gone. My evening social commitments have all evaporated. Time spent shuffling kids between extracurricular activities has now become a Zoom login. What am I doing with all of this extra time?
After a few work days that felt incredibly productive, I decided to deeply examine what made those days more effective than others. I didn’t necessarily accomplish more. I spent most of the time doing a rewrite of some deployment code. At the end of the day I had a bunch of functions and unit tests written, but I didn’t have anything impactful to share just yet. That’s when I realized it wasn’t the deliverable of a task that made me feel productive but the level of purpose with which I worked.
What was it about the other days that made me feel so unproductive? The one thing they all had in common was a heavy sense of interruption. Sometimes the interruptions were driven by the meetings that seem to invade my calendar, spreading like a liquid to fill every available slice of time. Other times it was the demands of my parallel full-time job as a parent/teacher/daycare provider, now that my kids are permanently trapped inside with me. The consistent theme was that when I only had 30 minutes of time, it seemed impractical to work on a task named “Rewrite the deployment pipeline”. My problem consisted of two major issues: the size of the work and how the work was presented to me.
We tend to think of tasks in terms of a deliverable. A large task gets unfairly summarized as a single item, when in fact it’s many smaller items. I learned this quite some time ago, but the issue still shows up in my task list from time to time. The first step was to make sure that my tasks were broken down into chunks that could be accomplished in a maximum of 30 to 60 minutes. “Rewrite the deployment pipeline” could be broken down into tasks like:
- Write unit tests for the metadata retrieval function
- Write the metadata retrieval function
- Move common functions into a standard library
- Update references of the common functions to the new standard library
You get the idea. These are all small tasks that I should be able to tackle in a 60 minute period.
The more pressing issue I would encounter, however, was surfacing work based on the context I’m currently in. If I’ve only got 15 minutes before my next meeting, it takes a lot of energy to start a task that I know I can’t finish in that period. Because I didn’t have time to finish any of the items on my important list, I’d decide to play hero and go looking in Slack channels to see whose questions I could answer. But then at the end of the week when I reviewed my list of tasks, I’d still have all these small tasks that I hadn’t made any progress on.
This is where OmniFocus’s perspectives functionality saves me. Perspectives allow me to look at tasks that meet specific criteria. I have a perspective I use called “Focus” that shows me which tasks I’ve flagged as important and which tasks are “due” soon. (In my system, due means that I’ve made an external commitment to a date or there is some other time-based constraint on the task.)
While this is great for making sure I’m on top of things I’ve made commitments to, it doesn’t do a great job of showing me what I can actually work on given the circumstances. There’s no indication of how much time a task will take. Having a separate category for phone calls is great when I’m in phone-calling mode. But there’s a different level of time commitment between “Call Mom and make sure she got the gift” and “Call your mortgage broker to discuss refinancing options”. I needed a way to also distinguish those tasks from each other.
A while ago I started leveraging an additional context/tag of “Short Dashes” and “Full Focus”. This was just a quick hint of how much energy was required for the task. By using those contexts/tags, I can create a new filter that highlights short-dash items I can do between meetings. And now that OmniFocus supports multiple tags, I can also add a tag based on the tool that I need to complete the task (e.g. Email, Phone, Computer, Research).
Now when I have a short amount of time, I can quickly flip to this perspective of work, which allows me to wrap up a lot of the smaller tasks that I need to do. This helps me to maximize those few minutes that I would normally waste checking Twitter because I didn’t have enough time to complete a larger task.
Another common scenario I’d run into was where my physical presence was tied up, but my mind was free. (Think of waiting for a doctor’s appointment to start. Back when we did those crazy things) I created a mobile perspective specifically for that purpose! It looks at all the tasks that I could complete on a mobile device.
These small changes have helped me to become more effective in those smaller slices of time. Now I know what I can make progress on regardless of my situation and begin to make some of that extra time I’ve got useful.
If you don’t have a to-do management system, I’d highly recommend OmniFocus and reading the book Getting Things Done by David Allen.
-
ChatOps/ChatBots at Centro
ChatOps/ChatBots at Centro
“white robot action toy” by Franck V. on Unsplash
During DevOpsDays PDX I chatted with a number of people who were interested in doing ChatOps in their organizations. It was the motivation I needed to take this half-written blog post and put the finishing touches on it.
Why a ChatBot?
The team at Centro had always flirted with the idea of a chatbot, but to be honest we stumbled into it, which accounts for a bunch of the problems we’ve encountered down the road. When we were building out our AWS infrastructure, we had envisioned an Infrastructure OPS site that would allow users to self-service requests. A chatbot seemed like a novelty side project. One day we were spit-balling on what a chatbot would look like, and I mentioned a tool I had been eyeing for a while called StackStorm. StackStorm has positioned itself as an “Event-Driven Automation” tool, the idea being that an event in your infrastructure could trigger an automation workflow. (Auto-remediation anyone?) The idea seemed solid based on the team’s previous experience at other companies. You always find that you have some nagging problem that’s going to take time to get prioritized and fixed. The tool also had a ChatOps component, since when you think about it, a chat message is just another type of event.
To make a long story short, one of our team members did a spike on StackStorm out of curiosity and in very short order had a functioning ChatBot ready to accept commands and execute them. We built a few commands for Marvin (our chatbot) with StackStorm and we instantly fell in love. Key benefits:
- Slack is a client you can use anywhere. The more automation you put in your chatbot the more freedom you have to truly work anywhere.
- The chatbot serves as a training tool. People can search through history to see how a particular action is done.
- The chatbot (if you let it) can be self-empowering for your developers
- Unifies context. (Again, if you let it.) The chatbot can be where Ops/Devs/DBAs all use the same tool to get work done. There’s a shared pain, a shared set of responsibilities and a shared understanding of how things are operated in the system. The deploy to production looks the same as the deploy to testing.
Once you get a taste for automating workflows, every request will go under the microscope with a simple question: “Why am I doing this, instead of the developer asking a computer to do it?”
Chatbot setup
StackStorm is at the heart of our chatbot deployment. The tool gave us everything we needed to start writing commands. The project ships with Hubot but unless you run into problems, you don’t need to know anything about Hubot itself. The StackStorm setup has a chatops tutorial to get into the specifics of how to set it up.
The StackStorm tool consists of various workflows that you create. It uses the Mistral workflow engine from the OpenStack Project. It allows you to tie together individual steps to create a larger workflow. It has the ability to launch separate branches of the workflow as well, creating some parallel execution capabilities. For example, if your workflow depends on seeding data in two separate databases, you could parallelize those tasks and then have the workflow continue (or “join” in StackStorm parlance) after those two separately executing tasks complete. It can be a powerhouse option and a pain in the ass at the same time. But we’ll get into that more later in the post.
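To give a feel for that fork/join behavior, here’s a heavily simplified, hypothetical workflow sketch in the Mistral v2 style (the workflow name, actions and scripts are made up). The two seed tasks have no inbound transitions so they start in parallel, and the final task joins on both before announcing the environment is ready.

```yaml
version: '2.0'

examples.seed_and_announce:
  description: Hypothetical fork/join workflow sketch.
  type: direct
  tasks:
    seed_primary_db:
      action: core.local
      input:
        cmd: "./seed.sh primary"      # placeholder script
      on-success:
        - announce_ready
    seed_reporting_db:
      action: core.local
      input:
        cmd: "./seed.sh reporting"    # placeholder script
      on-success:
        - announce_ready
    announce_ready:
      join: all                       # waits for both seed tasks to finish
      action: core.local
      input:
        cmd: "echo 'environment ready'"
```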
The workflows are then connected to StackStorm actions, which allow you to execute them using the command line tool or the chatbot. An action definition is a YAML file that looks like this:

---
name: "create"
pack: platform
runner_type: "mistral-v2"
description: "Creates a Centro Platform environment"
entry_point: "workflows/create.yaml"
enabled: true
parameters:
  environment:
    type: "string"
    required: true
    description: "The name of the environment"
  requested_version:
    type: "string"
    default: "latest"
    description: "The version of the platform to deploy"

Workflows and actions are packaged together in StackStorm via “packs”. Think of a pack as a package in StackStorm that provides related functionality for a product. For us, we group our packs around applications, along with a few shared libraries for actions we perform from multiple packs. The above action is from the platform pack, which controls management of our primary platform environment. There are a bunch of community supported packs available via the StackStorm Exchange.
Then to finally make this a chat command, we define an alias. The alias identifies what messages in chat will trigger the associated action.
---
name: "create"
action_ref: "platform.create"
description: "Creates a Platform environment"
formats:
  - "create platform environment named {{ environment }}( with dataset {{ dataset }})?( with version {{ requested_version='latest' }})"
ack:
  format: "Creating platform environment {{ execution.parameters.environment }}"
  append_url: false
result:
  format: "Your requested workflow is complete."

The formats section of the alias is a slightly modified regular expression. It can be a bit difficult to parse at times as commands become more complex with more optional parameters. The {{ environment }} notation expresses a parameter that will be passed on to the associated action. You can also set that parameter to a default value via assignment, as in {{ requested_version='latest' }}. This means if a user doesn’t specify a requested_version, “latest” will be passed as the value for that parameter. Between the regex and default parameters, you have a lot of control over the parameters a user can specify. You can also have multiple formats that trigger the same action. You can see which action will be invoked from the action_ref line, which is in pack.action_name format.
StackStorm brings a lot to the table
This might seem like a lot to get setup per command, but it’s actually quite nice to have StackStorm as this layer of abstraction. Because StackStorm is really an event-automation tool, it exposes these workflows you create in 3 different ways.
- The chatbot allows you to execute commands via your chat tool. Hubot supports a number of chat tools, which I believe translates to StackStorm support as well.
- The packs and actions you create can be executed manually from the StackStorm run command. This is extremely useful when there’s a Slack outage. The command syntax is st2 run platform.create environment=testing requested_version=4.3, and just like in chat, optional parameters will get default values.
- The StackStorm application also provides API access. This gives you the ability to call workflows from just about any other application. This is great when someone needs to do the exact same thing a user might do themselves via the chatbot. That whole shared context thing showing up again.
What do you run via Chatbot?
To put it simply, as much as we can. Anytime there’s a request to do something more than once, we start to ask ourselves, “Are we adding value to this process or are we just gatekeepers?” If we’re not adding value, we put it in a chat command. Some examples of things we have our chatbot do:
- Create an environment
- Restore an environment
- Take a DB Snapshot of an environment
- Scale nodes in an Autoscaling group
- Execute Jenkins build jobs
- Scale the Jenkins worker instance count
- Run migrations
- Pause Sidekiq
- Restart services
- Deploy code
- Put an environment in maintenance mode
- Turn on a feature toggle
- Get a config value from Consul
- Set a config value in Consul
In all, we have over 100 chat commands in our environment.
But what about security?
Yes, security is a thing. Like most things security related, you need to take a layered approach. We use SSO to authenticate to Slack, so that’s the first layer. The second layer is provided inside the workflows that we create. You have to roll your own RBAC, but most organizations have some sort of directory service for group management. For Slack in particular the RBAC implementation can be a bit messy. The chatbot variables you get as part of each message event include the user’s username, which is changeable by the user. So you really need to grab the user’s token, look up the user’s info with the token to get the email address of the account and then use that to look up group information in whatever your directory service is.
We also ensure that dangerous actions have other out-of-band workflow controls. For example, you can’t just deploy a branch to production. You can only deploy an RPM that’s in the GA YUM repository. In order to get a package to the GA repository, you need to build from a release branch. The artifact of the release branch gets promoted to GA, but only after the promotion confirms that the release branch has a PR that has been approved to go to master. These sorts of out-of-band checks are crucial for some sensitive actions.
Push-based two-factor authentication for some actions is desired too. The push-based option is preferred because you don’t want a two-factor code submitted via chat that is technically still live for another 60–120 seconds. We’re currently working on this, so keep an eye out for another post.
Lastly, there are some things you simply can’t do via the Chatbot. No one can destroy certain resources in Production via Chat. Even OPS has to move to a different tool for those commands. Sometimes the risk is just too great.
Pitfalls
A few pitfalls with chatbots that we ran into.
- We didn’t define a common lexicon for command families. For example, a deploy should have very similar nomenclature everywhere. But because we didn’t define a specific structure, some commands are create platform environment named demo01 and some are create api environment demo01. The simple omission of named can trip people up who need to operate in both the platform space and the api space.
- The Mistral workflow engine is a powerful tool, but it can be a bit cumbersome. The workflow also uses a polling mechanism to move between steps. (Step 1 completes, but step 2 doesn’t start until the polling interval occurs and the system detects step 1 finished.) As a result, during heavy operations you can waste a considerable amount of time with steps that have completed but are waiting to be polled before the workflow moves on.
- Share the StackStorm workflow early with all teams. Empower them to create their own commands early on in the process, before the tools become littered with special use cases that makes you hesitant to push that work out to other teams.
- Make libraries of common actions early. You can do it by creating custom packs so that you can call those actions from any pack.
- Use the Mistral workflow sparingly. It’s just one type of command runner StackStorm offers. I think the preferred method of execution, especially for large workflows, is to have most of that execution in a script, so that the action becomes just executing the script. The Mistral tool is nice, but becomes extremely verbose when you start executing a lot of different steps.
Conclusion
We’re pretty happy with our chatbot implementation. It’s not perfect by any means, but it has given us back a lot of time previously wasted on toil. StackStorm has been a tremendous help. The StackStorm Slack is where a lot of the developers hang out and they’re amazing. If you’ve got a problem, they’re more than willing to roll up their sleeves and help you out.
While not in-depth, I hope this brief writeup has helped someone out there in their Chatbot journey. Feel free to ping me with any questions or leave comments here.
-
Stories vs Facts in Metrics
You need to measure your processes. It doesn’t matter what type of process, whether it be a human process, a systems process or a manufacturing process, everything needs to be measured. In my experience, you’ll often find humans resistant to metrics that measure themselves. There’s a lot of emotion that gets caught up in collecting metrics on staff because unlike computers, we intuitively understand nuance. I’ve worked hard to be able to collect metrics on staff performance while at the same time not adding to the team’s anxiety when the measuring tape comes out. A key to that is how we interpret the data we gather.
At Centro, we practice Conscious Leadership, a methodology for approaching leadership and behaviors throughout the organization. One of the core tenets of Conscious Leadership is the idea of facts vs stories. A fact is something that is completely objective, something that could be revealed by a video camera. For example, “Bob rubbed his forehead, slammed his fist down and left the meeting”. That account is factually accurate. Stories are interpretations of facts. “Bob got really angry about my suggestion and stormed out of the meeting.” That’s a story around the fact that Bob slammed his fist down and left the meeting, but it’s not a fact. Maybe Bob remembered he left his oven on. Maybe he realized at that exact moment the solution to a very large problem and he had to test it out. The point is, the stories we tell ourselves may not be rooted in reality, but may simply be a misinterpretation of the facts.
This perspective is especially pertinent with metrics. There are definitely metrics that are facts. An example is the number of on-call pages sent to an employee. That’s a fact. The problem is when we take that fact and develop a story around it. The story we may tell ourselves is that we have a lot of incidents in our systems. But the number of pages a person gets may not be directly correlated to the number of actual incidents that have occurred. There is always nuance there. Maybe the person kept snoozing the same alert and it was just re-firing, creating a new page.
There are, however, some metrics that are not facts, but merely stories in a codified form. My favorite one is automatically generated stats around Mean Time to Recovery. This is usually a metric that’s generated by measuring the length of an incident or incidents related to an outage. But this metric is usually a story and not a fact. The fact is the outage incident ticket was opened at noon and closed at 1:30pm. The story around that is it took us 1.5 hours to recover. But maybe the incident wasn’t closed the moment service was restored. Maybe the service interruption started long before the incident ticket was created. Just because our stories can be distilled into a metric doesn’t make them truthful or facts.
Facts versus stories is important in automated systems, but even more so when dealing with human systems and their related workflows. Looking at a report and seeing that Fred closed more tickets than Sarah is a fact. But that doesn’t prove the story that Fred is working harder than Sarah or that Sarah is somehow slacking in her responsibilities. Maybe the size and scope of Fred’s tickets were smaller than Sarah’s. Maybe Sarah had more drive-by conversations than Fred, which reduced her capacity for ticket work. Maybe Sarah spent more time mentoring co-workers in a way that didn’t warrant a ticket. Maybe Fred games the system by creating tickets for anything and everything. There are many stories we could make up around the fact that Fred closed more tickets than Sarah. It’s important as leaders that we don’t let our stories misrepresent the work of a team member.
The fear of the stories that we make out of facts is what drives the angst team members feel when leaders start talking about a new performance metric. Be sure to express to your teams the difference between facts and stories. Let them know that your measurements serve as signals more than truths. If Fred is closing a considerably larger number of tickets, it’s a signal to dig into the factors behind the fact. Maybe Fred is doing more than Sarah, but more than likely, the truth is more nuanced. Digging in may reveal corrective action on how work gets done or it might reveal a change in the way that metric is tracked. (And subsequently, how that fact manifests.) Or it might confirm your original story.
Many people use metrics and dashboards to remove the nuance of evaluating people. They should instead serve as the prompt to reveal the nuance. When you take your issue to your team, make sure you are open about the fact and your story around that fact. Be sure to separate the two and have an open mind as you explore your story. The openness and candor will provide a level of comfort around the data being collected, because your team knows it’s not the end of the conversation.
-
How You Interview is How You Hire
“Of course you can’t use Google, this is an interview!” That was the response a friend got when he attempted to search for the syntax of something he hadn’t used in a while during an interview. After the interview was over, he texted me and told me about the situation. My advice to him was to run as fast as he could and to not think twice about the opportunity, good or bad. I haven’t interviewed at a ton of places as a candidate, so my sample size is statistically insignificant, but it seems insane in today’s world that this would be how you interview a candidate.
As a job candidate, you should pay special attention to how your interview process goes. You spend so much time focused on getting the answers and impressing the tribunal, that you can sometimes fail to evaluate the organization based on the nature of questions being asked and how they are asked. As an organization, how you interview is how you hire, how you hire is how you perform. This is an important maxim, because it can give you, the job seeker, a lot of insight into the culture and personalities you might be working with soon.
The interview process I described earlier seems to put more emphasis on rote memorization than actual problem-solving ability. Coming from someone who still regularly screws up the syntax for creating symlinks, I can attest that your ability to memorize structure has no bearing on your performance as an engineer.
What does an emphasis on memorization tell me about an organization? They may fear change. They may demand the comfort of tools they know extremely well, which on the face of it isn’t a bad thing. Why use the newest whizzbangy thing when the old tried and true works? Well, sometimes the definition of “works” changes. Nagios was fine for me 20 years ago, but it isn’t the best tool for the job with the way my infrastructure looks today, regardless of how well I know Nagios. (On a side note, I think this describes Vim users. We labor to make Vim an IDE because we’ve spent so many years building up arcane knowledge that starting over seems unpalatable. But I digress.)
No one expects to work at a place where Google isn’t used extensively to solve problems. So what exactly are interviewers attempting to accomplish by banning it? Creating a power dynamic? Seeing how you work under pressure? Checking whether you can work in an environment where Internet access is heavily restricted? These goals very well could be pertinent to the job, but how you evaluate those things is just as important as the results you get from them.
I wish there were a Rosetta Stone mapping interview formats to personality types, but this is just one example of the type of thing I look for when interviewing and try to actively avoid when giving an interview. Things to also look out for:
- Are they looking for a specific solution to a general problem? Maybe you have an answer that works, but you feel them nudging you to a predetermined answer. (e.g. Combine two numbers to get to 6. They might be looking for 4 and 2, but 7 and –1 are also valid)
- Did the interview challenge you at all technically? Will you be the smartest person in the room if you’re hired?
- Are you allowed the tools that you would fully expect to use on the job? (Google, IDE help documentation etc)
- Are they asking questions relevant to the actual role? Preferably in the problem space you’ll be working in.
Paying attention to how a company evaluates talent gives you insight into the type of talent they have. The assumption is always that the people giving the interview have the right answers, but plenty of unqualified people have jobs and those same unqualified people often sit in on interviews.
Remember that the interview is a two-way street. Look to the interview process as a way to glean information about all of the personalities, values and priorities that make the process what it is. And then ask yourself: is it what you’re looking for?
-
Hubris — The Interview Killer
Hubris — The Interview Killer
Interviewing engineers is a bit more art than science. Every hiring manager has that trait that they look for in a candidate. As an interviewer, you subconsciously recognize early on if the candidate has that magical quality, whatever it may be for you. It either qualifies or disqualifies a candidate in an instant. The trait that I look for to disqualify a candidate is hubris.
Self-confidence is a wonderful thing, but when self-confidence becomes excessive, it’s toxic and dangerous. That danger is never more prevalent than during the build vs buy discussion. The over-confident engineer doesn’t see complex problems, just a series of poor implementations. The over-confident engineer doesn’t see how problems can be interconnected or how use cases change. Instead they say things like “That project is too heavy. We only need this one small part” or “There aren’t any mature solutions, so we’re going to write our own.”
The cocky engineer to the rescue
Humility in an engineer is not a nicety, it’s a necessity. Respect for the problem space is a required ingredient for future improvements. But as important as respect for the problem is, respect for the solutions can be even more important. Every solution comes with a collection of trade-offs. The cocky engineer doesn’t respect those trade-offs or doesn’t believe that they were necessary in the first place. The cocky engineer lives in a world without constraints, without edge cases and with an environment frozen in time, forever unchanging.
But why does all this matter? It matters because our industry is full of bespoke solutions to already solved problems. Every time you commit code to your homegrown log-shipping tool, an engineer that solves this problem as part of their full-time job dies a little bit on the inside. Every time you have an easy implementation for leader election in a distributed system, a random single character is deleted from the Raft paper.
I’m not suggesting that problems are not worth revisiting. But a good engineer will approach the problem with a reverence for prior work. (Or maybe they’re named Linus) An arrogant engineer will trivialize the effort, over promise, under deliver and saddle the team with an albatross of code that always gets described as “some dark shit” during the on-boarding process to new hires.
If problems were easy, you wouldn’t be debating the best way to solve them because the answer would be obvious and standard. When you’re evaluating candidates, make sure you ask questions that involve trade-offs. Ask them for the flaws in their own designs. Even if they can’t identify the flaws, how they respond to the question will tell you a lot, so listen closely. If you’re not careful, you’ll end up with a homegrown JavaScript framework… or worse.
-
I really like the concept here, but I’m not sure I’m fully getting it.
I really like the concept here, but I’m not sure I’m fully getting it. Adaptive Capacity *can* be pretty straightforward from a technology standpoint, especially in a cloud type of environment where the “buffer” capacity doesn’t incur cost until it’s actually needed. When it comes to the people portion, I’m not sure if I’m actually achieving the goal of being “adaptive” or not.
My thought is basically building in “buffer” in terms of work capacity, but still allocating that buffer for work and using prioritization to know what to drop when you need to shift. (Much like the buffers/cache of the Linux filesystem) The team is still allocated for 40 hours worth of work, but we have mechanisms in place to re-prioritize work to take on new work. (i.e. You trade this ticket/epic for that ticket/epic or we know that this lower value work is the first to be booted out of the queue)
This sounds like adaptive capacity to me, but I’m not sure if I have the full picture, especially when I think of Dr. Cook’s list of 7 items from Poised to Deploy. The combination of those things is exactly what makes complex systems so difficult to deal with. People understand their portion, but not the system as a whole, so we’re always introducing changes/variance with unintended ripple effects. And I think that’s where it feels like I have a blindspot when it comes to the concept.
I might have jumped the gun on this post, because I still have one of the keynotes you linked in the document to watch as well as a PDF that Allspaw tweeted, but figured I’d just go ahead and get the conversation rolling before it fell off my to-do list. =)
-
A Post-mortem on the Mental Model of a System
On Thursday, December 14th we suffered a small incident that led to various user notifications and actions not being triggered as they would during normal system operations. The issue was caught by alerting, so staff could react before customers were impacted, but the MTTR (approx. 4 hours) was higher than it should have been given the nature of the error and the simplicity of the corrective action that ultimately resolved it. This seemed like an opportune time to evaluate the issue in terms of our mental models of how we think the system operates versus how it actually operates. This post-mortem is much more focused on those components of the failure than on our typical march towards the ever-elusive "root cause".
Below is a timeline of the events that transpired. After that we'll go into the different assumptions made by the participants and how they diverged from the way the system actually behaves.
Timeline
- Datadog alert fires stating that the activity:historyrecordconsumer queue on the RabbitMQ nodes is above thresholds.
- Operator on-call receives the alert, but doesn’t take immediate action
- Second Datadog alert fires at 5:37am for direct::deliveryartifactavailableconsumer-perform
- Operator is paged and begins to diagnose. Checks system stats for any sort of USE related indicators. The system doesn’t appear to be in any duress.
- The operator decides to restart the Sidekiq workers. This doesn’t resolve the issue, so the operator decides to page out to a developer.
- The operator checks the on-call schedule but finds the developer on-call and the backup-developer on-call have no contact information listed. With no clear escalation path on the developer side, the operator escalates to their manager.
- Management creates an incident in JIRA and begins assisting in the investigation.
- Manager requests that the operator restart the Sidekiq workers. This doesn’t resolve the issue.
- Developers begin to log in as the work day begins
- Developer identifies that the work queues stopped processing at about 2:05am
- Developer suggests a restart of the Consumer Daemon
- Operator restarts Consumer Daemon
- Alerts clear and the queue begins processing
As you can see, the remedy for the incident was relatively straightforward. But there were a lot of assumptions, incorrect mental models and poor communication that led to the incident taking so long to resolve. Below is a breakdown of different actions that were taken and the thought process behind them. This leads us to some fascinating insights on ways to make the system better for its operators.
Observations
Some brief context about the observations. The platform involved in the incident was recently migrated from our datacenter hosting provider to AWS. Along with that migration came a retooling of our entire metrics and alerting system, moving away from InfluxDB, Grafana and Sensu to Datadog.
The platform is also not a new application and precedes the effort of having ProdOps and Development working more closely together. As a result, Operations staff do not yet have the in-depth knowledge of the application they might otherwise have.
Operator on-call receives the alert, but doesn’t take immediate action
The operator received the page, but noticed that the values for the queue size were just above the alerting threshold. Considering the recent migration and this being the first time the alert had fired in Production, the operator made the decision to wait, assuming the alert was a spike that would clear itself. You can notice a clear step change in the graph below.
We have many jobs that run on a scheduled basis and these jobs drop a lot of messages in a queue when they start. Those messages usually get consumed relatively quickly, but due to rate limiting by 3rd parties, the processing can slow down. In reality the queue that’s associated with this alert does not exhibit this sort of behavior.
Second Datadog alert fires at 5:37am for direct::deliveryartifactavailableconsumer-perform
After this second alert the Operator knew there must be a problem and began to take action. The operator wasn't sure how to resolve the problem and said a feeling of panic began to set in. Troubleshooting from a point of panic can cloud decision making and evaluation. There were no runbooks for this particular alert, so the impact to the end user was not entirely clear.
The operator decided to restart Sidekiq because the belief was that they were consumers of the queue. The operator had a mental model that resembled Celery, where work was processed by RabbitMQ workers and results were published to a Redis queue for notification. The actual model is the reverse. Workers work off of the Redis queue and publish their results to RabbitMQ. As a result, the Sidekiq workers only publish messages to RabbitMQ but they do not consume messages from RabbitMQ, therefore the Sidekiq restart was fruitless.
The Operator began to troubleshoot using the USE methodology (Utilization, Saturation, Errors) but didn’t find anything alarming. In truth, it turned out that the absence of log messages was the indicator that something was wrong, but the service that should have been logging wasn’t known to the operator. (The Operator assumed it would be Sidekiq workers based on their mental model described above. Sidekiq workers were logging normally)
Operator checks the on-call schedule and notices the Developer on call doesn’t have a phone number listed
The Operator checked the Confluence page, but it didn't have any contact information or any general guidance on who to escalate to if the listed contact didn't respond. This is a solved problem with tools like PagerDuty, where on-call schedules and escalations are handled programmatically. Due to budget concerns, though, we leverage the Confluence page. It could be worthwhile to invest some development effort into automating the on-call process in lieu of adding more users to PagerDuty ($29 per user).
Management creates an incident in JIRA and begins assisting in the investigation.
Management began investigating the issue and requested another restart of Sidekiq. The manager assumed that the operator was using Marvin, the chatbot, to restart services. The operator, however, was unsure of the appropriate names of the services to restart. The help text for the restart service command reads
restart platform service {{ service }} in environment {{ environment }}
This was confusing because the operator assumed that {{ service }} meant a systemd-managed service. We run Sidekiq workers as systemd services, so each worker has a different name, such as int_hub or bi_workers. Because the operator didn't know the different names, it was much easier to SSH into a box, run the appropriate systemd commands and restart the services.
The disconnect is that {{ service }} is actually an alias that maps to particular application components. One of those aliases is sidekiq, which would have restarted all of the Sidekiq instances including the consumer_daemon, which would have resolved the issue. But because of the confusion surrounding the value of {{ service }}, the operator opted to perform the task manually. In a bit of irony, consumer_daemon is technically not a Sidekiq worker, so it’s also incorrectly classified and could cause further confusion for someone who has a different definition for these workers. The organization needs to work on a standard nomenclature to remove this sort of confusion across disciplines.
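To make the disconnect concrete, the mapping behind the command is conceptually something like the following sketch (the alias and unit names are illustrative, not the actual Marvin configuration):
# Hypothetical sketch of the alias -> systemd unit mapping behind
# Marvin's "restart platform service" command.
service_aliases:
  sidekiq:
    - int_hub
    - bi_workers
    - consumer_daemon    # not actually a Sidekiq worker, hence the confusion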
Developer identifies that the work queues stopped processing at about 2:05am
When developers began to log in, they quickly noticed that the Consumer Daemon hadn't processed anything since 2:05am. This was identified by the absence of log messages in Kibana. This was missed by the Operator for two reasons:
- As previously stated, the Operator was unaware that Consumer Daemon was the responsible party for processing this queue, so any log searching was focused on Sidekiq.
- The messages that denote processing by the Consumer Daemon are written from a much more internal point of view. The log entries refer to an internal structure in the code, "MappableEntityUpdateConsumer". The Operator, being unaware of internal structures in the code, would never have correlated that to the behavior being seen. The log message is written for an internal view of the system rather than for an external operator.
Additional Observations
There were some additional observations that came out as part of the general discussion that didn’t map specifically to the timeline but are noteworthy.
activity:historyrecordconsumer is not a real queue
This queue is actually just something of a reporting mechanism. It aggregates the count of all queues and emits that as a metric. When this queue shows a spike in volume, it's really just an indicator that some other queue is encountering problems. This is an older implementation that may no longer have value. It also means that each of our stacked queue graphs is essentially reporting double the actual size, since this queue aggregates the value of all queues but is then also reported in the stacked graph. We should probably eliminate this queue entirely, but we'll need to adjust alerting thresholds appropriately with its removal.
Background job exceptions don’t get caught with Airbrake
Airbrake normally catches exceptions and reports them via Slack and email. But background workers (i.e. not Sidekiq) do not report to Airbrake. There have been instances where a background job is throwing exceptions but no action is being taken to remedy.
Fixing the problem vs resolving the incident
Restarting the Consumer Daemon solved the issue we were having, but there was never any answer as to why every worker node in the fleet suddenly stopped processing from RabbitMQ. The support team was forced to move on to other issues before fully resolving or understanding the nature of the problem.
Action Items
With the observations listed above, we’ve found a few things that will make life easier the next time a similar incident occurs.
- Continue to ensure that our alerting only fires when there is a known/definite problem. There's more value in getting alerts 5 minutes late and being confident that the alert is valid and actionable. In this case the alerting was correct, but we'll need to continue to build trust by eliminating noisy alerts.
- Ensure that the On-Call support list has phone numbers listed for each contact. We also need to document the escalation policy for when on-call staff are unavailable. We should also look at automating this, either through expanding PagerDuty or otherwise.
- Marvin chatbot commands need a more thorough help page. Using a man-page-like format, with a link to it in the help documentation, was suggested.
- A common nomenclature for workers should be evangelized. A simple suggestion is that "workers" covers all types of publish/subscribe workers, and when we're talking about a particular subset of workers we describe the messaging system they interact with: "RabbitMQ Workers" vs. "Sidekiq Workers".
- Support staff need to be afforded the time and energy to study an incident until its cause and prevention are sufficiently understood. We need to augment our processes to allow for this time. This will be a cross-functional effort led by Prod Ops.
-
I Don’t Understand Immutable Infrastructure
We were at the airport getting ready to go through security. A deep baritone voice shouted, “Everybody must take their shoes off and put them in the bin.” Hearing the instruction I told my son and daughter to take their shoes off and put them in the bin. When we got in line for the X-Ray machine, another man looked at my kids and said “Oh they don’t need to take their shoes off.” My wife and I looked at each other puzzled, “But the man over there said everyone take their shoes off.” “Oh, everyone except children under 12” he responded, as if that was the universal definition of “everybody”. I tell this story to highlight the idea that the words we choose to use matter a great deal when trying to convey an idea, thought or concept. Nowhere is this more true than the world of computing.
Immutable Infrastructure is one of those operational concepts that has been very popular, at least in conference talks. The idea isn't particularly new; I remember building "golden images" in the 90's. But there's no doubt that the web, the rate of change and the tooling to support it have put the core concepts in vogue again. But is what we're doing really immutable? I feel like it's not. And while it may be a simple argument over words, we use the benefits of immutability in our arguments without accepting any of the consequences that design choice incurs.
I often hear the argument that configuration management is on its way out, now that we're ready to usher in an era of "immutable" infrastructure. You don't push out new configurations, you build new images with the new configuration baked in and replace the existing nodes. But how do we define configuration? The answer is simultaneously concrete and malleable. I define configuration as
The applications, libraries, users, data and settings that are necessary to deliver the intended functionality of an application.
That's a fairly broad definition, but so is configuration! Configuration management is the process (or absence of process) for managing the components in this list. Therefore if any one of these items is modified, that constitutes not just a change to your configuration, but a change to your infrastructure as well.
Since we’ve defined configuration, what do we mean by immutability? (Or what do we as an industry mean by it) The traditional definition is
Not subject or susceptible to change or variation in form or quality or nature.
In the industry we boil it down to the basic meaning of "once it's set, it never changes." A string is often immutable in programming languages. Though we give strings the appearance of mutability, in reality it's a parlor trick to simplify development. But if you tell a developer that strings are immutable, it conveys a specific set of rules and the consequences for breaking them.
What do these definitions mean in practice? Let's pretend it's a normal Tuesday. There's a 60% chance there's a new OpenSSL package out and you need to update it. Rolling out the new OpenSSL package by creating a new image for your systems seems like a reasonable methodology. Now there's a known-good configuration of our system that we can replicate like-for-like in the environment. If you're particularly good at it, getting the change rolled out takes you 30 minutes. (Making the change, pushing it, kicking off the image build process and then replacing nodes while dialing down traffic.) For the rest of us mere mortals, it's probably closer to a couple of hours. But regardless of time, immutable infrastructure wins!
Now let's pretend we're in our testing environment. This obviously has a different set of nodes it communicates with vs. production, so our configuration is different. We don't want to maintain two separate images, one for production and one for testing, because that would rob us of our feeling of certainty about the images being the same. Of course we solve this with service discovery! Now instead of baking this configuration into the application, our nodes can use tools like Consul and Eureka to find the nodes they need to communicate with. The image remains the same, but the applications configured on the image are neatly updated to reflect their running environment.
But isn’t that a change? And the definition of immutable was that the server doesn’t change. Are we more concerned that OpenSSL stays on the same version than we are about what database server an instance is talking to? I’m sure in the halls of Google, Netflix and LinkedIn, a point release of a library could have catastrophic consequences. But if you asked most of the industry “What frightens you more? Updating to the latest version of OpenSSL or updating worker_threads from 4 to 40?” I imagine most of us would choose the latter with absolutely zero context around what worker_threads is. Let’s wave our magic wand though and say service discovery has also relieved us of this particular concern. Let’s move on to something more basic, like user management.
In testing environments I have widely different access policies than I do for my production systems. I also have a completely different profile of users. In production, operations and a few developers are largely the only people that have access. In testing, development, QA and even product may have a login. How does that get managed? Do I shove that into service discovery as well? I could run LDAP in my environment, but that pushes my issue from “How do I manage users and keys?” to, “How do I manage access policy definitions for the LDAP configuration?”
This is all just to say that I'm incredibly confused about the Immutable Infrastructure conversation. In practice it doesn't solve a whole host of concerns. Instead it pushes them around into a layer of the system that is often ill-suited to the task. Or worse, the ideology simply ignores the failures caused by configuration changes and decides that "Immutable Infrastructure" is actually "Immutable Infrastructure, except for the most dangerous parts of the system".
This doesn't even tackle the idea that configuration management is still the best tool for…wait for it…managing configuration, even if you're using immutable infrastructure. Docker and Packer both transport us back to the early 90's in their approach to defining configuration. It would be a shame if the death of configuration management were as imminent as some people claim.
So what am I missing? Is there a piece of the puzzle that I’m not aware of? Am I being too pedantic in my definition of things? Or is there always an unexpressed qualifier when we say “immutable”?
Maybe words don’t matter.
-
Our Salt Journey Part 2
Our Salt Journey Part 2
Structuring Our Pillar Data
This is the 2nd part in our Salt Journey story. You can find the previous article here. With our specific goals in mind we decided that designing our pillar data was probably the first step in refactoring our Salt codebase.
Before we get into how we structure Pillar data, we should probably explain what we plan to put in it, as our usage may not line up with other users' expectations. For us, Pillar data is essentially customized configuration data beyond the defaults. Pillar data is less about minion-specific customizations and more about classes of minions getting specific data.
For example, we have a series of grains (which we'll talk about in a later post) that carry classification information. One of the grains we set is class, which identifies the node as being part of development, staging or production. This governs a variety of things we may or may not configure based on the class. If a node is classified as development, we'll install metrics collection and checks, but the alerting profile for them will be very different than if the node were classified as staging or production.
With this in mind, we decided to leverage Pillar Environments in order to create a tiered structure of overrides. We define our pillar's top.sls file in a specific order of base, development, staging and lastly production, like the diagram below.
├── base
├── development
├── staging
├── production
It's important that we order the environments correctly, because when the pillar.get() function executes it will merge values, but on a conflict the last write wins. We need to ensure that the order the environments are read in matches the ascending order in which we want values to be overridden. In this example, conflicting values in the production folder will override any previously defined values.
This design alone, however, might have unintended consequences. Take for example the below YAML file.
packages:
  - tcpdump
  - rabbitmq-server
  - redis
If this value is set in the base pillar (assuming you've defined base as base: '*'), then a pillar.get('packages') will return the above list. But if you also had the below defined in the production environment:
packages:
  - elasticsearch
then your final list would be
packages:
  - tcpdump
  - rabbitmq-server
  - redis
  - elasticsearch
because pillar.get() traverses all of the environments by default. Without some care, this can result in a mashup of values you never expected. We protect against this by ensuring that Pillar data is restricted to only the nodes that should have access to it. Each pillar environment is guarded by a match based on a grain. Let's say our Pillar data looks like the below:
├── base
│   └── apache
│       └── init.sls
├── development
├── staging
│   └── apache
│       └── init.sls
├── production
│   └── apache
│       └── init.sls
If we're not careful, we can easily end up with a mashup of values that results in a very confusing server configuration. So in our top.sls file we have grain matching that helps prevent this.
base:
  '*':
    - apache

production:
  'G@class:production':
    - apache
This allows us to limit the scope of the nodes that can access the production version of the Apache pillar data and avoids the merge conflict. We repeat this pattern for development and staging as well.
What Gets a Pillar File?
Now that we've discussed how Pillar data is structured, the question becomes, what actually gets a pillar file? Our previous Pillar structure had quite a number of entries. (I'm not sure that this denotes a bad config however, just an observation.) The number of config files was largely driven by how our formulas were defined. All configuration specifics came from pillar data, which meant that in order to use any of the formulas, you had to define some sort of Pillar data before the formula would work.
To correct this we opted to move default configurations into the formula itself using the standard (I believe?) convention of a map.jinja file. If you haven't seen a map.jinja file before, it's basically a Jinja-defined dictionary that allows for setting values based on grains and then ultimately merging that with Pillar data. A common pattern we use is below.
A map.jinja for RabbitMQ
{% set rabbitmq = salt['grains.filter_by']({
    'default': {
    },
    'RedHat': {
        'server_environment': 'dev',
        'vhost': 'local',
        'vm_memory_high_watermark': '0.4',
        'tcp_listeners': '5672',
        'ssl_listeners': '5673',
        'cluster_nodes': '\'rabbit@localhost\'',
        'node_type': 'disc',
        'verify_method': 'verify_none',
        'ssl_versions': ['tlsv1.2', 'tlsv1.1'],
        'fail_if_no_peer_cert': 'false',
        'version': '3.6.6'
    }
}) %}
With this defined, the formula has everything it needs to execute, even if no Pillar data is defined. The only time you would need to define Pillar data is if you wanted to override one of these default properties. This is perfect for formulas you intend to make public, because it makes no assumptions about the user's pillar environment. Everything the formula needs is self-contained.
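For completeness, the merge with Pillar data mentioned above happens in the same filter_by call. A trimmed-down sketch of what that wiring might look like, assuming the pillar key is simply rabbitmq:
{# Sketch only: layer any pillar overrides on top of the grain-based defaults. #}
{% set rabbitmq = salt['grains.filter_by']({
    'default': {
        'vhost': 'local',
        'tcp_listeners': '5672'
    }
}, merge=salt['pillar.get']('rabbitmq', {})) %}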
Each Pillar file is keyed first with the name that matches the formula that's calling it. So an example Pillar file might be
rabbitmq:
  vhost: prod01-server
  tcp_listeners: 5673
The name spacing is a common approach, but it's important because it gives you flexibility on where you can define overrides. They can live in their own standalone files, or they can sit in a pillar definition that covers multiple components. For example, our homegrown applications need to configure multiple pillar data values. Instead of spreading these values out, they're collapsed with name spacing into a single file.
postgres:
  users:
    test_app:
      ensure: present
      password: 'password'
      createdb: False
      createroles: True
      createuser: True
      inherit: True
      replication: False
  databases:
    test_app:
      owner: 'test_app'
      template: 'template0'
logging:
  - input_type: log
    paths:
      - /var/log/httpd/access_log
      - /var/log/httpd/error_log
      - /var/log/httpd/test_app-access.log
      - /var/log/httpd/test_app-error.log
    document_type: apache
    fields: { environment: {{ grains['environment'] }},
              application: test_app }
We focus on our formulas providing sane defaults specifically for our environment so that we can limit the amount of data that actually needs to go into our Pillar files.
The catch with shoving everything into the map.jinja file is that sometimes you have a module that needs a lot of default values. OpenSSH is a perfect example of this. When this happens you're stuck with a few choices:
- Create a huge map.jinja file to house all these defaults. This can get unwieldy.
- Hardcode defaults into the configuration file template that you'll be generating, skipping the lookup altogether. This is a decent option if you have a bunch of values that you doubt you'll ever change. You can then turn them into lookups as you encounter scenarios where you need to deviate from your standard.
- Shove all those defaults into a base pillar definition and do the lookups there.
- Place the massive list of defaults into a defaults.yaml file and load that in.
We opted for option #3, sketched below. I think each choice has its pluses and minuses, so you need to figure out what works best for your org. Our choice was largely driven by the OpenSSH formula and its massive number of options being placed in Pillar data. We figured we'd simply follow suit.
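To make option #3 concrete, the base pillar ends up holding a blob of defaults along these lines (the keys and values below are purely illustrative, not the OpenSSH formula's actual schema):
# Illustrative only: a slice of OpenSSH defaults living in the base pillar environment.
openssh:
  sshd_config:
    PermitRootLogin: 'no'
    PasswordAuthentication: 'no'
    X11Forwarding: 'no'
    UsePAM: 'yes'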
This pretty much covers how we’ve structured our Pillar data. Since we started writing this we’ve extended the stack a bit more which we’ll go into in our next post, but for now this is a pretty good snapshot of how we’re handling things.
Gotchas
Of course no system is perfect and we've already run into a snag with this approach. Nested lookup overrides are problematic for us. Take for example the following in our base.sls file:
apache:
  sites:
    cmm:
      DocumentRoot: /
      RailsEnvironment: development
and then you decide that you want to override it in a production.sls Pillar file like below:
apache:
  sites:
    cmm:
      RailsEnvironment: production
When you do a pillar.get('apache') with a node that has access to the production pillar data, you'd expect to get
apache:
  sites:
    cmm:
      DocumentRoot: /
      RailsEnvironment: production
but because Salt won't handle nested dictionary overrides, you instead end up with
apache:
  sites:
    cmm:
      RailsEnvironment: production
which of course breaks a bunch of things when you don't have all the necessary pillar data. Our hack for this has been to use a separate key space for overrides when we have nested properties.
apache_overrides:
  sites:
    cmm:
      RailsEnvironment: production
and then in our Jinja templates we do the lookup like:
{% set apache = salt['pillar.get']('apache') %}
{% set overrides = salt['pillar.get']('apache_overrides') %}
{% do apache.update(overrides) %}
This allows us to override at any depth and then rely on Python's dictionary handling to merge the two into usable Pillar data with all the overrides. In truth we should do this for all lookups just to provide clarity, but because things grew organically we're definitely not following this practice.
I hope someone out there is finding this useful. We’ll continue to post our wins and losses here, so stay tuned.
-
Thanks Weighted Decision. Great resources there!
Thanks Weighted Decision. Great resources there!
-
Our Journey with Salt
These are a few of the major pain points that we are trying to address, but obviously we’re going to do it in stages. The very first thing we decided to tackle was formula assignment.
Assigning via hostname has its problems. So we opted to go with leveraging Grains on the node to indicate what type of server it was.
With the role custom grain, we can identify the type of server the node is and based on that, what formulas should be applied to it. So our top.sls file might look something like
base:
  'role:platform_webserver':
    - match: grain
    - webserver
Nothing earth shattering yet, but still a huge upgrade from where we're at today. The key is getting the grain populated on the server instance prior to the Salt Provisioner bootstrapping the node. We have a few ideas on that, but truth be told, even if we have to manually execute a script to properly populate those fields in the meantime, that's still a big win for us.
We’ve also decided to add a few more grains to the node to make them useful.
- Environment — This identifies the node as being part of development, staging, production etc. This will be useful to us later when we need to decide what sort of Pillar data to apply to a node.
- Location — This identifies which datacenter the node resides in. It’s easier than trying to infer via an IP address. It also allows us a special case of local for development and testing purposes
With these items decided on, our first task will be to get these grains installed on all of the existing architecture and then re-work our top file. Grains should be the only thing that dictates how a server gets formulas assigned to it. We’re making that explicit rule mainly so we have a consistent mental model of where particular functions or activities are happening and how changes will ripple throughout.
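As a rough sketch of what that might look like on a node, static custom grains can live in /etc/salt/grains as plain YAML (the values below are hypothetical):
# /etc/salt/grains -- hypothetical values for a single web node
role: platform_webserver
environment: staging
location: local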
Move Cautiously, But Keep Moving
Whenever you make changes like this to how you work, there are always going to be questions, doubts or hypotheticals that come up. My advice is to figure out which ones you have to deal with now, which ones you need to think about soon and which ones you can punt on till later. Follow the principle of YAGNI as much as possible. Tackle problems as they become problems, and pay no attention to the hypotheticals.
Another point is to be clear about the trade-offs. No system is perfect. You'll constantly be making design choices that make one thing easier but another thing harder. Make that choice with eyes wide open, document it and move on.
It’s so easy to get paralyzed at the whiteboard as you come up with a million and one reasons why something won’t work. Don’t give in to that pessimistic impulse. Keep driving forward, keep making decisions and tradeoffs. Keep making progress.
We’ll be back after we decide what in the hell we’re going to do with Pillar data.
-
Being a Fan
Being a Fan
It was November 26th 1989, my first live football game. The Atlanta Falcons were taking on the New York Jets at Giant’s stadium. I was 11 years old. A friend of my father’s had a son that played for the Falcons, Jamie Dukes. They invited us down for the game since it was relatively close to my hometown. Before the game we all had breakfast together. Jamie invited a teammate of his, a rookie cornerback named Deion Sanders, to join us. The game was forgettable. The Falcons got pounded, which was par for the course that year. But it didn’t matter, I was hooked.
For the non-sports fan, the level of emotional investment fans have may seem like an elaborate Ponzi scheme. Fans pour money into t-shirts, jerseys, hats, tickets etc. When the dream is realized, when your team lifts that Lombardi trophy and is declared champion, the fan gets……nothing. No endorsement deals. No free trophy replica. No personal phone call from the players. Nothing. We're not blind to the arrangement. We enter it willingly. To the uninitiated it's the sort of hero-worshiping you're supposed to shed when you're 11 years old. Ironically, that's when initiation is most successful.
Fandom is tribalism. Tribalism is at the epicenter of the human condition. We dress it up with constructs as sweeping as culture and language and as mundane as logos and Greek letters. We strive to belong to something and we reflexively otherize people not of our tribe. Look at race, religion or politics.
But that’s the beauty of sports. The otherization floats on an undercurrent of respect and admiration. That otherization fuels the gameday fire, but extinguishes itself when a player lays motionless on the field. That otherization stirs the passion that leads to pre-game trash talk, but ends in a handshake in the middle of the field. That otherization causes friendly jabs from the guy in a Green Bay jersey in front of you at the store, but ends in a “good luck today” as you part ways.
In today's political and social climate, sports are not just an escape, but a blueprint for how to handle our most human of urges. Otherization in sports has rules, and those rules end in respect for each other and respect for the game. I read the lovefest between these two teams and think how much it differs from our political discourse. If the rules of behavior for politicians changed, so would the rules for their fans.
Fandom is forged in the furnace of tribalism. As time passes it hardens. Eventually, it won’t bend, it won’t break. A bad season may dull it, but a good season will sharpen it. You don’t choose to become a fan. Through life and circumstance, it just happens. By the time you realize you’re sad on Mondays after a loss, it’s too late. You’re hooked.
Best of luck to the Patriots. Even more luck to the Falcons. Win or lose, I’ll be with the tribe next year…and the year after that…and the year after. I don’t have a choice. I’m a fan. #RiseUp
-
The Myth of the Working Manager
The Myth of the Working Manager
The tech world is full of job descriptions that describe the role of the working manager. The title itself is a condescension, as if management alone doesn't rise to the challenge of being challenging.
I was discussing DHH's post on Moonlighting Managers with a colleague when it occurred to me that many people have a fundamental misunderstanding of what a manager should do. We've polluted the workforce with so many bad managers that their toxic effect on teams hovers like an inescapable fog. The exception has become the rule.
When we talk about management, what we're often describing are supervisory tasks rather than actual management. Coordinating time off, clearing blockers and scheduling one-on-ones is probably the bare minimum necessary to consider yourself management. There's an exhaustive list of other activities that management should be responsible for, but because most of us have spent decades being led in a haze of incompetency, our careers have been devoid of these actions. That void eventually shapes our expectations, and what follows is our collective standards being silently lowered.
Management goes beyond just people management. A manager is seldom assigned to people or a team. A manager is assigned to some sort of business function. The people come as a by-product of that function. This doesn’t lessen the importance of the staff, but it highlights an additional scope of responsibility for management, the business function. You’re usually promoted to Manager of Production Operations not Manager of Alpha Team. Even when the latter is true, the former is almost always implied by virtue of Alpha Team’s alignment in the organization.
As the manager of Production Operations, I’m just as responsible for the professional development of my team as I am for the stability of the platform. Stability goes beyond simply having two of everything. Stability requires a strategy and vision on how you build tools, from development environments to production. These strategies don’t come into the world fully formed. They require collaboration, a bit of persuasion, measurement, analysis and most notably, time. It’s the OODA loop on a larger time scale.
Sadly, we use reductive terms like measurement and analysis, which obfuscate the complexity buried within them. How do you measure a given task? What measurement makes something a success or failure? How do you acquire those measurements without being overly meddlesome with things like tickets and classifications? (Hint: you have to sell the vision to your team, which also takes time.) When managers cheat themselves of the time needed to meet these goals, they're technically in dereliction of their responsibilities. The combination of a lack of time with a lack of training leads to a cocktail of failure.
This little exercise only accounts for the standard vanilla items in the job description. It doesn’t include projects, incidents, prioritization etc. Now somewhere inside of this barrage of responsibility, you’re also supposed to spend time as an engineer, creating, reviewing and approving code among other things. Ask most working managers and they’ll tell you that the split between management and contributor is not what was advertised. They also probably feel that they half-ass both halves of their job, which is always a pleasant feeling.
I know that there are exceptions to this rule. But those exceptions are truly exceptional people. To hold them up as the standard is like my wife saying "Why can't you be more like Usher?" Let's not suggest only hiring these exceptional people unless you work for a Facebook or Google or an Uber. They have the resources and the name recognition to hold out for that unicorn. If you're a startup in the mid-west trying to become the Uber of knitting supplies, then chances are your list of qualified candidates looks different.
The idea of a working manager is a bit redundant, like an engineering engineer. Management is a full-time job. While the efficacy of the role continues to dwindle, we should not compound the situation by also lowering our expectations of managers, both as people and as organizations. Truth be told, the working manager is often a creative crutch as organizations grapple with the need to offer career advancement for technical people who detest the job of management.
But someone has to evaluate the quality of our work as engineers and by extension, as employees. Since we know the pool of competent managers is small, we settle for the next best thing. An awesome engineer but an abysmal manager serving as an adequate supervisor.
The fix is simple.
- Recognize that management is a different skill set. Being a great engineer doesn’t make you a great manager.
- Training, training, training for those entering management for the first time. Mandatory training, not just offering courses that you know nobody actually has time to take.
- Time. People need time in order to manage effectively. If you’re promoting engineers to management and time is tight, they’ll always gravitate towards the thing they’re strongest at. (Coding)
- Empower management. Make the responsibilities, the tools and the expectations match the role.
Strong management makes strong organizations. It's worth the effort to make sure management succeeds.
-
When You Think You’re a Fraud
When You Think You’re a Fraud
Imposter Syndrome is a lot like alcoholism or gout. It comes in waves, but even when you’re not having an episode, it sits there dormant, waiting for the right mix of circumstances to trigger a flare up.
Hi, my name is Jeff and I have Imposter Syndrome. I've said this out loud a number of times and it always makes me feel better to know that others share my somewhat irrational fears. It's one of the most ironic sets of emotions I think I've ever experienced. The feelings are a downward spiral where you even question your ability to self-diagnose. (Who knows? Maybe I just suck at my job.)
I can't help but compare myself to others in the field, but I'm pretty comfortable recognizing someone else's skill level, even when it's better than my own. My triggers are more about what others expect my knowledge level to be, regardless of how absurd those expectations are. For example, I've never actually worked with MongoDB. Sure, I've read about it, I'm aware of its capabilities and maybe even a few of its hurdles. But I'm far from an expert on it, despite its popularity. This isn't a failing of my own, but merely a happenstance of my career path. I just never had an opportunity to use or implement it.
For those of us with my strand of imposter syndrome, this opportunity doesn’t always trigger the excitement of learning something new, but the colossal self-loathing for not having already known it. Being asked a question I don’t have an answer to is my recurring stress dream. But this feeling isn’t entirely internal. It’s also environmental.
Google did a study recently that said the greatest indicator for high performing teams is being nice. That’s the sort of touchy, feely response that engineers tend to shy away from, but it’s exactly the sort of thing that eases imposter syndrome. Feeling comfortable enough to acknowledge your ignorance is worth its weight in gold. There’s no compensation plan in the world that can compete with a team you trust. There’s no compensation plan in the world that can make up for a team you don’t. When you fear not knowing or asking for help out of embarrassment, you know things have gone bad. That’s not always on you.
The needs of an employee and the working environment aren't always a match, the same way some alcoholics can function at alcohol-related events and others can't. There's no right or wrong; you just need to be aware of the environment you need and try to find it.
Imposter syndrome has some other effects on me. I tend to combat it by throwing myself into my work. If I read one more blog post, launch one more app or go to one more meetup, then I’ll be cured. At least until the next time something new (or new to me) pops up. While I love reading about technology, there’s an opportunity cost that tends to get really expensive. (Family time, hobby time or just plain old down time)
If you’re a lead in your org, you can help folks like me with a few simple things
- Be vulnerable. Things are easier when you’re not alone
- Make failure less risky. It might be through tools, coaching, automation etc. But make failure as safe as you can.
- When challenging ideas propose problems to solve instead of problems that halt. “It’s not stateless” sounds a lot worse than “We need to figure out if we can make it stateless”
Writing this post has been difficult, mainly because 1) I don’t have any answers and 2) It’s hard not to feel like I’m just whining. But I thought I’d put it out there to jump start a conversation amongst people I know. I’ve jokingly suggested an Imposter Syndrome support group, but the more I think about it, the more it sounds like a good idea.
-
People, Process, Tools, in that Order
People, Process, Tools, in that Order
Imagine you’re at a car dealership. You see a brand new, state of the art electric car on sale for a killer price. It’s an efficiency car that seats two, but it comes with a free bike rack and a roof attachment for transporting larger items if needed.
You start thinking of all the money you’d save on gas. You could spend more time riding the bike trails with that handy rack on the back. And the car is still an option for your hiking trips with that sweet ass roof rack. You’re sold. You buy it at a killer deal. As you drive your new toy home, you pull into the driveway and spontaneously realize a few things.
- You don’t hike
- You live close enough to bike to the trail without needing the car
- You don’t have an electrical outlet anywhere near the driveway
- You have no way of fitting your wife and your 3 kids into this awesome 2 seater.
This might sound like the beginning of a sitcom episode, where the car buyer vehemently defends their choice for 22 minutes before ultimately returning the car. But this is also real life in countless technology organizations. In most companies, the experiment lasts a lot longer than an episode of Seinfeld.
In technology circles, we tend to solve our problems in a backwards fashion. We pick the technology, retrofit our process to fit with that technology, then we get people onboard with our newfound wizardry. But that’s exactly why so many technology projects fail to deliver on the business value they’re purported to provide. We simply don’t know what problem we’re solving for.
The technology first model is broken. I think most of us would agree on that, despite how difficult it is to avoid. A better order would be
People -> Process -> Tools
That's the order we should be thinking in when we apply technology to solve business problems, as opposed to adopting technology for its own sake.
People
Decisions in a vacuum are flawed. Decisions by committee are non-existent. Success is in the middle, but it's a delicate balance to maintain. The goal is to identify who the key stakeholders are for any given problem that's being solved. Get the team engaged early and make sure that everyone is in agreement on the problem that's being solved. This eliminates a lot of unnecessary toil on items that ultimately don't matter. Once the team is onboard you can move on to…
Process
Back in the old days, companies would design their own processes instead of conforming to the needs of a tool. If you define your process up front, you have assurances that you've addressed your business need, as well as identified what is a "must have" in your tool choice. If you've got multiple requirements, it might be worthwhile to go through a weighting exercise so you can codify exactly which requirements have priority. (Not all requirements are equal.)
Tools
Armed with the right people and the process you need to conform to, choosing a tool becomes a lot easier. You can probably eliminate entire classes of solutions in some instances. Your weighted requirements are also a voice in the process. Yes, Docker is awesome, but if you can meet your needs using VMs and existing config management tools, that sub-second boot time is suddenly truffle salt. (Nice to have, but expensive as HELL.)
Following this order of operations isn't guaranteed to solve all your problems, but it will definitely eliminate a lot of them. Before you decide to Dockerize your MySQL instance, take a breath and ask yourself, "Why am I starting with a solution?"
-
Metrics Driven Development?
Metrics Driven Development?
At the Minneapolis DevOps Days during the open space portion of the program, Markus Silpala proposed the idea of Metrics Driven Development. The hope was that it could bring the same sort of value to monitoring and alerting that TDD brought to the testing world.
Admittedly I was a bit skeptical of the idea, but was intrigued enough to attend the space and holy shit-balls am I glad I did.
The premise is simple. A series of automated tests that could confirm that a service was emitting the right kind of metrics. The more I thought about it, the more I considered its power.
Imagine a world where Operations has a codified series of requirements (tests) on what your application should be providing feedback on. With a completely new project you could run the test harness and see results like
expected application to emit http.status.200 metric
expected application to emit application.health.ok metric
expected application to emit http.response.time metric
These are fairly straightforward examples, but they could quickly become more elaborate with a little codification by the organization.
Potential benefits:
- Operations could own the creation of the tests, serving as an easy way to tell developers the types of metrics that should be reported.
- It helps to codify how metrics should be named. A change in metric names (or a typo in the code) would be caught.
Potential hurdles:
- The team will need to provide some sort of bootstrap environment for the testing. Perhaps a docker container for hosting a local Graphite instance for the SUT.
- You’ll need a naming convention/standard of some sort to be able to identify business level metrics that don’t fall under standard naming conventions.
I’m sure there are more, but I’m just trying to write down these thoughts while they’re relatively fresh in my mind. I’m thinking the test harness would be an extension of existing framework DSLs. For an RSpec type example:
describe "http call" do
  context "valid response" do
    subject { metrics.retrieve_all }

    before(:each) do
      get :index
    end

    it { expect(subject).to have_metric("http.status.200") }
    it { expect(subject).to have_metric("http.response.time") }
  end
end
This is just a rough sketch using RSpec, but I think it gets the idea across. You’d also have to configure the test to launch the docker container, but I left that part out of the example.
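One other gap: have_metric isn't a built-in RSpec matcher, so it would need to be defined. A minimal sketch, assuming metrics.retrieve_all returns a flat array of metric-name strings:
# Hypothetical custom matcher; assumes the metrics helper returns an
# array of metric-name strings scraped from the local Graphite/statsd sink.
RSpec::Matchers.define :have_metric do |expected|
  match do |emitted_metrics|
    emitted_metrics.include?(expected)
  end

  failure_message do |emitted_metrics|
    "expected application to emit #{expected} metric, but saw: #{emitted_metrics.join(', ')}"
  end
end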
Leaving the open space I was extremely curious about this as an idea and an approach, so I thought I'd ask the world. Does this make sense? Has Markus landed on something? What are the HUGE hurdles I'm missing? And most importantly, do people see potential value in this? Shout it out in the comments!