Our Salt Journey Part 2
Structuring Our Pillar Data
This is the second part in our Salt Journey story. You can find the previous article here. With our specific goals in mind, we decided that designing our pillar data was probably the first step in refactoring our Salt codebase.
Before we get into how we structure Pillar data, we should probably explain what we plan to put in it, as our usage may not line up with other users' expectations. For us, Pillar data is essentially customized configuration data beyond the defaults. Pillar data is less about minion-specific data customizations and more about classes of minions getting specific data.
For example, we have a series of grains (which we'll talk about in a later post) that carry classification information. One of the grains we set is class, which identifies the node as being part of development, staging, or production. This governs a variety of things we may or may not configure based on the class. If a node is classified as development, we'll install metrics collection and checks, but the alerting profile for them will be very different than if the node were classified as staging or production.
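For illustration, this kind of classification grain can be set statically on the minion. A minimal sketch, assuming the stock /etc/salt/grains file (the value is just an example):

# /etc/salt/grains
class: development

With that in place, nodes can be targeted by class using a compound matcher such as G@class:development.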
With this in mind, we decided to leverage Pillar Environments in order to create a tiered structure of overrides. We define our pillar's top.sls file in a specific order of base, development, staging, and lastly production, like the diagram below.
├── base
├── development
├── staging
└── production
It's important that we order the files correctly because when the pillar.get() function executes, it merges values, but on a conflict the last write wins. We need to ensure that the order the files are read in matches the ascending order in which we want values to be overridden. In this example, conflicting values in the production folder will override any previously defined values.
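For reference, this ordering falls out of how the pillar environments are declared on the master. A minimal sketch of the relevant master config; the /srv/pillar paths are assumptions, not necessarily our layout:

# /etc/salt/master
pillar_roots:
  base:
    - /srv/pillar/base
  development:
    - /srv/pillar/development
  staging:
    - /srv/pillar/staging
  production:
    - /srv/pillar/production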
This design alone, however, might have unintended consequences. Take, for example, the below YAML file.
packages:
  - tcpdump
  - rabbitmq-server
  - redis
If this value is set in the base pillar lookup (assuming you've defined base as base: '*'), then a pillar.get('packages') will return the above list. But if you also had the below defined in the production environment:
packages:
  - elasticsearch
then your final list would be
packages:
  - tcpdump
  - rabbitmq-server
  - redis
  - elasticsearch
This is because pillar.get() will traverse all of the environments by default, which can result in a mashup of values you never intended if you aren't careful. We protect against this by ensuring that Pillar data is restricted to only the nodes that should have access to it. Each pillar environment is guarded by a match expression based on the grain. Let's say our Pillar data looks like the below:
├── base
│   └── apache
│       └── init.sls
├── development
├── staging
│   └── apache
│       └── init.sls
└── production
    └── apache
        └── init.sls
If we're not careful, we can easily end up with a mashup of values that results in a very confusing server configuration. So in our top.sls file we have grain matching that helps prevent this.
base:
  '*':
    - apache

production:
  'G@class:production':
    - match: compound
    - apache
This allows us to limit the scope of the nodes that can access the production version of the Apache pillar data and avoids the merge conflict. We repeat this pattern for development and staging as well.
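Putting it all together, the full top.sls looks roughly like this (a sketch of the pattern rather than our exact file):

base:
  '*':
    - apache

development:
  'G@class:development':
    - match: compound
    - apache

staging:
  'G@class:staging':
    - match: compound
    - apache

production:
  'G@class:production':
    - match: compound
    - apache

From the master, running salt 'some-minion' pillar.get apache is a quick way to sanity-check which version of the data a given node actually receives.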
What Gets a Pillar File?
Now that we've discussed how Pillar data is structured, the question becomes: what actually gets a pillar file? Our previous Pillar structure had quite a number of entries. (I'm not sure that denotes a bad config, however; just an observation.) The number of config files was largely driven by how our formulas were defined. All configuration specifics came from pillar data, which meant that using any of the formulas required some sort of Pillar data before they would work.
To correct this we opted to move default configurations into the formula itself using the standard (I believe?) convention of a map.jinja file. If you haven't seen a map.jinja file before, it's basically a Jinja-defined dictionary that allows for setting values based on grains and then ultimately merging those with Pillar data. A common pattern we use is below:
A map.jinja for RabbitMQ
{# merge= lays any rabbitmq pillar data over the grain-matched defaults #}
{% set rabbitmq = salt['grains.filter_by']({
    'default': {},
    'RedHat': {
        'server_environment': 'dev',
        'vhost': 'local',
        'vm_memory_high_watermark': '0.4',
        'tcp_listeners': '5672',
        'ssl_listeners': '5673',
        'cluster_nodes': '\'rabbit@localhost\'',
        'node_type': 'disc',
        'verify_method': 'verify_none',
        'ssl_versions': ['tlsv1.2', 'tlsv1.1'],
        'fail_if_no_peer_cert': 'false',
        'version': '3.6.6'
    }
}, merge=salt['pillar.get']('rabbitmq')) %}
With this defined, the formula has everything it needs to execute, even if no Pillar data is defined. The only time you would need to define pillar data is if you wanted to override one of these default properties. This is perfect for formulas you intend to make public, because it makes no assumptions about the user’s pillar environment. Everything the formula needs is self-contained.
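For context, here is roughly how a state inside the formula consumes the map; a minimal sketch where the file and state names are illustrative, not our actual formula:

{# rabbitmq/init.sls: import the grain-filtered, pillar-merged map #}
{% from "rabbitmq/map.jinja" import rabbitmq with context %}

rabbitmq-server:
  pkg.installed:
    - name: rabbitmq-server
    - version: {{ rabbitmq.version }}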
Each Pillar file is defined first with the key that matches the formula that’s calling it. So an example Pillar file might be
rabbitmq:
  vhost: prod01-server
  tcp_listeners: 5673
The name spacing is a common approach, but it's important because it gives you flexibility on where you can define overrides. They can live in their own standalone files, or they can sit in a pillar definition covering multiple components. For example, our homegrown applications need to configure multiple pillar data values. Instead of spreading these values out, they're collapsed with name spacing into a single file.
postgres:
  users:
    test_app:
      ensure: present
      password: 'password'
      createdb: False
      createroles: True
      createuser: True
      inherit: True
      replication: False
  databases:
    test_app:
      owner: 'test_app'
      template: 'template0'
logging:
  - input_type: log
    paths:
      - /var/log/httpd/access_log
      - /var/log/httpd/error_log
      - /var/log/httpd/test_app-access.log
      - /var/log/httpd/test_app-error.log
    document_type: apache
    fields:
      environment: {{ grains['environment'] }}
      application: test_app
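Each component's template then reads only its own key. A sketch of how a log shipper template might walk the logging entries above (the structure is assumed to mirror a filebeat-style prospector list):

{# render each prospector from the namespaced 'logging' pillar key #}
{% for prospector in salt['pillar.get']('logging', []) %}
- input_type: {{ prospector.input_type }}
  paths:
{% for path in prospector.paths %}
    - {{ path }}
{% endfor %}
  document_type: {{ prospector.document_type }}
{% endfor %}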
We focus on our formulas creating sane defaults specifically for our environment so that we can limit the amount of data that actually needs to go into our Pillar files.
The catch with shoving everything into the map.jinja file is that sometimes you have a module that needs a lot of default values. OpenSSH is a perfect example of this. When this happens you're stuck with a few choices:
1. Create a huge map.jinja file to house all these defaults. This can get unwieldy.
2. Hardcode defaults into the configuration file template that you'll be generating, skipping the lookup altogether. This is a decent option if you have a bunch of values that you doubt you'll ever change. Then you can simply turn them into lookups as you encounter scenarios where you need to deviate from your standard.
3. Shove all those defaults into a base pillar definition and do the lookups there.
4. Place the massive list of defaults into a defaults.yaml file and load that in.
We opted for option #3. I think each choice has its pluses and minuses, so you need to figure out what works best for your org. Our choice was largely driven by the OpenSSH formula and its massive number of options being placed in Pillar data. We figured we’d simply follow suit.
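Concretely, that means the bulk of the defaults sit in a base pillar file and the template lookups point at them. A trimmed-down sketch; the key layout and option values here are illustrative, not the openssh-formula's exact schema:

# base/openssh/init.sls
openssh:
  sshd_config:
    PermitRootLogin: 'no'
    PasswordAuthentication: 'no'
    X11Forwarding: 'no'
    UsePAM: 'yes'

Higher environments then only need to override the handful of values that deviate from these defaults.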
This pretty much covers how we've structured our Pillar data. Since we started writing this we've extended the stack a bit more, which we'll go into in our next post, but for now this is a pretty good snapshot of how we're handling things.
Gotchas
Of course no system is perfect, and we've already run into a snag with this approach: nested lookup overrides are problematic for us. Take, for example, the following in our base.sls file:
apache:
  sites:
    cmm:
      DocumentRoot: /
      RailsEnvironment: development
and then you decide that you want to override it in a production.sls Pillar file, like below:
apache:
  sites:
    cmm:
      RailsEnvironment: production
When you do a pillar.get('apache') on a node that has access to the production pillar data, you'd expect to get
apache:
  sites:
    cmm:
      DocumentRoot: /
      RailsEnvironment: production
but because Salt won't handle nested dictionary overrides, you instead end up with
apache:
  sites:
    cmm:
      RailsEnvironment: production
which of course breaks a bunch of things when you don’t have all the necessary pillar data. Our hack for this has been to have a separate key space for overrides when we have nested properties.
apache_overrides:
  sites:
    cmm:
      RailsEnvironment: production
and then in our Jinja templates we do the lookup like:
{% set apache = salt['pillar.get']('apache', {}) %}
{% set overrides = salt['pillar.get']('apache_overrides', {}) %}
{% do apache.update(overrides) %}
This allows us to override at any depth and then rely on Python's dictionary handling to merge the two into usable Pillar data with all the overrides. In truth we should do this for all lookups just to provide clarity, but because things grew organically we're definitely not following this practice.
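Worth noting: Python's dict.update() replaces top-level keys wholesale, so if your Salt version ships the slsutil module, the same lookup can be done with a true recursive merge instead. A sketch we haven't adopted ourselves:

{% set apache = salt['pillar.get']('apache', {}) %}
{% set overrides = salt['pillar.get']('apache_overrides', {}) %}
{# strategy='recurse' merges nested keys instead of replacing whole sub-dicts #}
{% set apache = salt['slsutil.merge'](apache, overrides, strategy='recurse') %}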
I hope someone out there is finding this useful. We’ll continue to post our wins and losses here, so stay tuned.