Two Simple Ways Automation Can Save You Money on Your AWS Bill


Red Hat Ansible Automation Platform is an excellent automation and orchestration tool for public clouds. In this post, I am going to walk through two common scenarios where Ansible Automation Platform can help. Rather than the common public cloud use-case of provisioning and deprovisioning resources, I want to focus on automating common operational tasks.

[Diagram: cloud engineer]

What is an operational task? It is simply anything that an administrator has to do outside of creating and deleting cloud resources (e.g. instances, networks, keys, etc.) to help maintain their company's public cloud account. One of the problems I've encountered is instances being left on, running up our public cloud bill in the background while we were focusing our attention elsewhere. The more users you have, the more likely problems are to occur; automation can help address these issues and maintain control of your account. There are two common scenarios I want to address here:

  1. Bespoke AWS instances were manually created for a one-off initiative, usually to test something, then instances were forgotten about and left running.
  2. Continuous Integration (CI) instances were spun up to test changes programmatically every time a Pull Request (PR) went into our project, and would sometimes hit a corner case where not everything was deprovisioned correctly (turned off).

In both cases, orphaned instances can be left on for a long time. Imagine you spun up a couple dozen instances to test something on a public cloud, then you got busy, lost track of time and forgot to terminate the instances before stopping work for the day. That might be 16 hours (at minimum) of time when you were charged and received no value out of the public cloud that your company was financing. Now multiply this by dozens of users and that bill can end up in tens of thousands of dollars really quickly.

Use-case one: dealing with bespoke orphaned instances

So let's tackle each of these issues and use Ansible Automation Platform to automate a solution, starting with the first scenario above, where instances are spun up outside any automation guard rails (i.e. without using any automation tooling, including Ansible, to provision cloud resources). We require everyone on my team who has access to the public cloud account to tag their instances. They must create a key/value tag pair that says: owner: person

[Screenshot: inventory tags]

This creates a really easy way to audit and see who (which person, organization or team) is accountable for billing, which is half the battle. I am going to write a very simple Ansible Playbook that enforces this policy, and I will use the fully supported amazon.aws collection to demonstrate it.
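To show what that policy looks like in practice, here is a minimal sketch that applies the owner tag with the amazon.aws.ec2_tag module (the region, instance ID and owner value below are hypothetical):

- name: Apply the required owner tag to an instance
  amazon.aws.ec2_tag:
    region: us-east-1                   # hypothetical region
    resource: i-0123456789abcdef0       # hypothetical instance ID
    state: present
    tags:
      owner: person                     # the key/value pair we require on every instance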

The primary difference between the community and supported Collections here is support with your Red Hat subscription. There is also significant integration testing, code auditing and Python 3 / boto3 support with the fully supported amazon.aws collection that is included as part of your Red Hat subscription.

Dealing with untagged instances

In my first Ansible Playbook, I want to get a list of all instances that have no tags. First, let's retrieve all instances in a particular region that are running:

- name: grab info for un-tagged instances
  amazon.aws.ec2_instance_info:
    region: "{{ ec2_region }}"
    filters:
      instance-state-name: running
  register: ec2_node_info

I am using the ec2_instance_info module found in the AWS Collection, part of the amazon namespace. This task retrieves all running instances, regardless of tags. I found the easiest way was to grab everything and then filter for empty tags:

- name: set the untagged to a var
  set_fact:
    untagged_instances: "{{ ec2_node_info.instances | selectattr('tags', 'equalto', {}) | map(attribute='instance_id') | list }}"

The selectattr filter is simply matching any instance whose tags attribute is an empty dictionary, using the arguments ('tags', 'equalto', {}).

I can then simply terminate these, since my colleague didn't follow my well-established guidelines:

- name: Terminate every un-tagged running instance in a region.
  amazon.aws.ec2:
    region: "{{ ec2_region }}"
    state: absent
    instance_ids: "{{ untagged_instances }}"
  when: untagged_instances | length > 0

However, if you are more forgiving than me, you could use state: stopped instead of state: absent, which stops the instances rather than terminating them.
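As a minimal sketch, that gentler variant of the task above would look like this (same variables assumed):

- name: Stop (rather than terminate) every un-tagged running instance in a region.
  amazon.aws.ec2:
    region: "{{ ec2_region }}"
    state: stopped                      # stop instead of terminate
    instance_ids: "{{ untagged_instances }}"
  when: untagged_instances | length > 0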

Retrieving any instances with missing tags

To expand on the above, we don't just care about instances with no tags at all; we are specifically looking for the owner tag. I now want to retrieve any instance that is missing the owner tag. I can use the exact same logic as above, but this time point the selectattr filter at tags.owner and test for undefined.

- name: set the missing tag to a var
  set_fact:
    missing_tag: "{{ ec2_node_info.instances | selectattr('tags.owner', 'undefined') | map(attribute='instance_id') | list }}"

I wanted to show both examples above to give a path to operationalizing this. By implementing the above with Ansible Automation Platform, your organization now knows that it needs to use tags, or instances will be turned off (or worse!). Going further, the organization could use automation to enforce a particular tag that assigns ownership, with action taken on any instance that lacks it. You could use one or both of these examples.
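For instance, a simple audit-only sketch could just report the offenders before any action is taken, assuming the missing_tag variable set above:

- name: Report instances that are missing the owner tag
  ansible.builtin.debug:
    msg: "Instance {{ item }} has no owner tag and is a candidate for shutdown"
  loop: "{{ missing_tag }}"
  when: missing_tag | length > 0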

Use-case two: Dealing with automated instances

For my particular use-case, I have a code repository that is tested automatically: nightly, and every time a Pull Request (PR) is opened against the repository. The CI testing provisions instances on AWS, configures them, runs the automated tests, then deprovisions them. Sometimes the deprovisioning step does not complete successfully, leaving orphaned hosts. One common failure mode I have noticed is instances left partially torn down: their tags have been removed entirely, but the instances themselves are not actually off, so we are still being billed. The Ansible Playbook in the previous example can catch that case.

However, another great check is the relatively new uptime parameter.

- name: grab info
  amazon.aws.ec2_instance_info:
    region: "{{ ec2_region }}"
    uptime: 121
    filters:
      instance-state-name: [ "running" ]
      "tag:ansible-workshops": true
  register: ec2_node_info

In this task there are two parameters I want to call out. The first is the uptime parameter (added in version 1.4.0 of amazon.aws), which returns only instances that have been running for more than the specified number of minutes. In this example, an instance has to have been running for more than 121 minutes, or just over two hours. I know that my CI testing should never take more than two hours, so either the instance is stuck, my automated testing broke, or deprovisioning didn't happen successfully.

The other parameter to call out is the tag filter, which ensures I only return instances that are part of my automated testing (versus other initiatives); in this example, an instance has to be part of a workshop. Now it should click why I needed the "no tags" example at the beginning! This entire operational task will fail if an instance has no tags at all, so the untagged use-case overlaps with every other use-case because of how important tags are in public cloud infrastructure.
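From there, the follow-up action is the same as in the first use-case. A minimal sketch, assuming the ec2_node_info variable registered above, might terminate anything the query returned:

- name: Terminate CI instances that have been running longer than two hours
  amazon.aws.ec2:
    region: "{{ ec2_region }}"
    state: absent
    instance_ids: "{{ ec2_node_info.instances | map(attribute='instance_id') | list }}"
  when: ec2_node_info.instances | length > 0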

Automating the automation

This automation is great and all, but manually running playbooks only saves you so much time. I went ahead and used the Ansible workflows feature to hit multiple regions at once, and then scheduled the workflow so that my automation jobs run every hour.

[Diagram: Ansible workflow]

Each rectangle on the right represents an automation job. Each job in the same column is run in parallel on my Ansible Automation Platform cluster. Each job template is set to a different region. I also used the survey feature to make this easy to configure from the Web UI.


In my particular scenario I was running automated testing in four AWS regions (us-east-1, ap-northeast-1, ap-southeast-1 and eu-central-1). Now that my workflow is complete, it is trivial to schedule my workflow to run every hour.


Voila! Now I have automated testing running behind the scenes to make sure that no orphaned instances are left running. For my particular use-case this saves a lot of money, and it helps build a culture of accountability around public cloud use so that costs are clear and transparent between team members.







How to Migrate your Ansible Playbooks to Support AWS boto3


Red Hat Ansible Automation Platform is known for automating Linux, Windows and networking infrastructure. While both the community version of Ansible and our enterprise offering, Red Hat Ansible Automation Platform, are prominently known for configuration management, this is just a small piece of what you can really achieve with Ansible's automation. There are many other use-cases that Ansible Automation Platform is great at automating, such as your AWS, Azure or Google public cloud environments.

[Diagram: Ansible on public clouds]

Ansible Automation Platform can automate deployments, migrations and operational tasks for your public cloud. This is extremely powerful because you can orchestrate your entire infrastructure workflow, from cloud deployment, to instance configuration, to retirement, rather than requiring a point tool for each separate use-case. This also allows IT administrators to concentrate on automating business outcomes rather than individual technology silos.

Specifically for this blog, I wanted to cover converting your Ansible Playbooks for provisioning an instance on AWS from the unsupported ec2 module to the fully supported ec2_instance module. Amazon has deprecated their Software Development Kit (SDK) Boto in favor of the newer fully supported SDK Boto3. Alina Buzachis announced "What's New: The Ansible AWS Collection 2.0 Release" back in October 2021, which includes full support in our Red Hat Ansible Certified Content Collection for the amazon.aws.ec2_instance module, which uses Python 3 and Boto3.

The supported ec2_instance module has existed for some time, but I had not adopted it for my use-case yet because we needed one last feature for parity with the older ec2 module.  Specifically, for demos and workshops, I required the exact_count parameter. This allows me to boot as many identical instances as I specify. For example, if I specify exact_count: 50, it will spin up 50 identical Red Hat Enterprise Linux 8 instances.  

Using exact_count can save hours of time versus using a loop, and I don't need a massive declarative file to represent my 50 servers; it's just a tweak of a single parameter to make identical copies. Luckily, we now have that parameter, so I started converting all of the workshops and demos that the technical marketing team uses over to Boto3.

Let's look at an older version of a task file from our technical workshops so I can show you how to convert from ec2 to ec2_instance:

---
- name: Create EC2 instances for RHEL8
  ec2:
    assign_public_ip: true
    key_name: "{{ ec2_name_prefix }}-key"
    group: "{{ ec2_security_group }}"
    instance_type: "{{ ec2_info[rhel8].size }}"
    image: "{{ node_ami_rhel.image_id }}"
    region: "{{ ec2_region }}"
    exact_count: "{{ student_total }}"
    count_tag:
      Workshop_node1": "{{ ec2_name_prefix }}-node1"
    instance_tags:
      Workshop_node1": "{{ ec2_name_prefix }}-node1"
      Workshop: "{{ ec2_name_prefix }}"
      Workshop_type: "{{ workshop_type }}"
    wait: "{{ ec2_wait }}"
    vpc_subnet_id: "{{ ec2_vpc_subnet_id }}"
    volumes:
      - device_name: /dev/sda1
        volume_type: gp2
        volume_size: "{{ ec2_info[control_type].disk_space }}"
        delete_on_termination: true
  register: control_output

For booting an instance into AWS, there are only six required parameters. You need to specify a key (i.e. the SSH key used to access the image), a security group (the virtual firewall for your EC2 instances), an instance_type (e.g. t2.medium), a region (e.g. us-east-1), an image (e.g. an AMI for RHEL 8) and a network interface or VPC subnet ID (vpc_subnet_id).
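As a minimal sketch (the key pair, security group, AMI and subnet values below are hypothetical), a bare-bones ec2 task needs nothing more than those six:

- name: Boot a single instance with only the six required parameters
  ec2:
    key_name: my-keypair                      # hypothetical SSH key pair name
    group: default                            # hypothetical security group
    instance_type: t2.medium
    region: us-east-1
    image: ami-0123456789abcdef0              # hypothetical RHEL 8 AMI ID
    vpc_subnet_id: subnet-0123456789abcdef0   # hypothetical VPC subnet ID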

The rest of the parameters in my original ec2 task are for:

  • tweaking the instance
    • adding a public IP address, increasing storage
  • changing the module behavior
    • wait refers to waiting for the instance to reach the running state
    • exact_count refers to provisioning multiple instances in parallel
  • tagging, which is just adding key/value tags to the instance so we can filter on them in subsequent automation, or sort easily in the AWS web console.

To convert this to ec2_instance, there are only a few small tweaks you need to make!

  • The module name changes from ec2: to ec2_instance:

  • assign_public_ip: true moves under a network: key:

    network:
      assign_public_ip: true

  • group: "{{ ec2_security_group }}" becomes security_group: "{{ ec2_security_group }}"

  • image: "{{ node_ami_rhel.image_id }}" becomes image_id: "{{ node_ami_rhel.image_id }}"

  • count_tag: with Workshop_node1: "{{ ec2_name_prefix }}-node1" becomes filters: with "tag:Workshop_node1": "{{ ec2_name_prefix }}-node1"

  • instance_tags: becomes tags:

  • volumes: keeps the same device_name entry, but volume_type, volume_size and delete_on_termination move under a nested ebs: key:

    volumes:
      - device_name: /dev/sda1
        ebs:
          volume_type: gp2
          volume_size: "{{ ec2_info[rhel].disk_space }}"
          delete_on_termination: true

The entire modified task looks like the following:

- name: Create EC2 instances for node1
  ec2_instance:
    key_name: "{{ ec2_name_prefix }}-key"
    security_group: "{{ ec2_security_group }}"
    instance_type: "{{ ec2_info[rhel].size }}"
    image_id: "{{ node_ami_rhel.image_id }}"
    region: "{{ ec2_region }}"
    exact_count: "{{ student_total }}"
    network:
      assign_public_ip: true
    filters:
      "tag:Workshop_node1": "{{ ec2_name_prefix }}-node1"
    tags:
      Workshop_node1: "{{ ec2_name_prefix }}-node1"
      Workshop: "{{ ec2_name_prefix }}"
      uuid: "{{ ec2_name_prefix }}"
      guid: "{{ ec2_name_prefix }}"
      Workshop_type: "{{ workshop_type }}"
    wait: "{{ ec2_wait }}"
    vpc_subnet_id: "{{ ec2_vpc_subnet_id }}"
    volumes:
      - device_name: /dev/sda1
        ebs:
          volume_type: gp2
          volume_size: "{{ ec2_info[rhel].disk_space }}"
          delete_on_termination: true

While the task may look long, realize that the optional tags take up seven lines... which is OK, and many of the values shown are simply defaults. Remember that there is no additional cost to adding tags to cloud resources, and they help with subsequent automation and filtering. I once heard a colleague exclaim that you can never have too many tags!

Looking at the task above, you will see that anything with the tag Workshop_node1: "{{ ec2_name_prefix }}-node1" will be used to verify whether existing instances match. The module will make sure that exact_count instances exist with the Workshop_node1 tag. The same tag can also be used in subsequent automation to filter and retrieve just the instances you want.

- name: grab instance ids to tag node1
  ec2_instance_info:
    region: "{{ ec2_region }}"
    filters:
      "tag:Workshop_node1": "{{ ec2_name_prefix }}-node1"
  register: node1_output

This will retrieve all instances with their common tag. You will also probably require unique tags for each instance. In that case, I recommend the ec2_tag module, where looping is less time intensive (versus looping with the ec2_instance module):

- name: Ensure tags are present for node1
  ec2_tag:
    region: "{{ ec2_region }}"
    resource: "{{ item.1.instance_id }}"
    state: present
    tags:
      Name: "{{ ec2_name_prefix }}-student{{ item.0 + 1 }}-node1"
      Index: "{{ item.0 }}"
      Student: "student{{ item.0 + 1 }}"
      launch_time: "{{ item.1.launch_time }}"
  with_indexed_items:
    - "{{ node1_output.instances }}"
  when: node1_output.instances|length > 0

The ec2_tag module is great for when you need unique tags for a particular cloud resource. In the example above, the name, index, student identifier and launch time are unique to that resource. Again, there is no time penalty or cost for additional tags, so tag as much as you want. The workflow for provisioning a batch of instances on AWS then looks like the following (a condensed sketch follows the list):

  1. Provision exact_count instances in bulk with the ec2_instance module.
  2. Register the output to a variable with either ec2_instance or ec2_instance_info.
  3. For unique tags, loop over the instances with the ec2_tag module.
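Pulled together, a condensed sketch of that workflow, reusing the variables from the tasks above, might look like this:

- name: Provision exact_count instances in bulk
  ec2_instance:
    region: "{{ ec2_region }}"
    exact_count: "{{ student_total }}"
    image_id: "{{ node_ami_rhel.image_id }}"
    instance_type: "{{ ec2_info[rhel].size }}"
    key_name: "{{ ec2_name_prefix }}-key"
    security_group: "{{ ec2_security_group }}"
    vpc_subnet_id: "{{ ec2_vpc_subnet_id }}"
    filters:
      "tag:Workshop_node1": "{{ ec2_name_prefix }}-node1"
    tags:
      Workshop_node1: "{{ ec2_name_prefix }}-node1"

- name: Register the output to a variable
  ec2_instance_info:
    region: "{{ ec2_region }}"
    filters:
      "tag:Workshop_node1": "{{ ec2_name_prefix }}-node1"
  register: node1_output

- name: Loop over the instances to apply unique per-instance tags
  ec2_tag:
    region: "{{ ec2_region }}"
    resource: "{{ item.1.instance_id }}"
    state: present
    tags:
      Name: "{{ ec2_name_prefix }}-student{{ item.0 + 1 }}-node1"
  with_indexed_items:
    - "{{ node1_output.instances }}"
  when: node1_output.instances | length > 0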

Thank you for reading through my blog and I hope this helped you on your Ansible cloud automation journey.