Infrastructure As Code – a brief synopsis

Introduction

Infrastructure as code, also known as programmable infrastructure, is one of the key practices used in implementing data projects in the cloud.

In this article I explain what is meant by infrastructure as code, and discuss both the motivations for adopting it and the mechanics of implementing it.

Motivation

The advent of virtualized computing, also known as virtualization, heralded a new dawn in computing and in the provisioning of hardware resources. Prior to virtualization, provisioning hardware could take weeks, if not months, for projects that required significant computing power. In my own career, for example, I remember waiting 6 months for the hardware needed for a key data warehouse to be provisioned!

Virtualization has helped to change all that. Suddenly the balance of power shifted from the system administrators who provide these resources to the developers who request these resources for their projects. With virtualized server farms, developers could provision and configure high-end computing nodes for a project in minutes rather than days or weeks.

This trend has only accelerated with the move to cloud computing and the public cloud in the past 5 years.
In public cloud computing, scalable information resources are provided as a service to multiple external customers via the Internet.
In fact, the cloud computing era has only been made possible by virtualization, so much so that it is often dubbed Virtualization 2.0.

Evolution to the cloud

The top 3 public cloud platforms are Google Cloud Platform, Amazon Web Services (AWS) and Microsoft Azure.

Each of them provides a range of services, including but not limited to:

  • Virtual Servers & Scalable Computing on Demand
  • Durable Storage
  • Relational Databases
  • Analytics & Big Data processing

and so on. A more detailed view of the services offered by the various cloud providers is available in the Appendix.

Each cloud customer can request services through a web service API that is accessible via their account.

In the pre-virtualization and pre-cloud era, computing resources were provisioned and managed in a largely manual fashion. Once the hardware was provisioned, a system administrator would install an operating system and the needed software, and set up networking and databases based on requirements from the development teams. This was appropriate and feasible given the length of time it took to get resources provisioned.

In the cloud era, however, the situation is very different. Developers can request and provision services at will and, in the case of distributed computing, in high volume. Sticking to the old manual approach becomes infeasible and error-prone.

For example, imagine a large project involving Hadoop MapReduce, with a cluster of 5 nodes for development and 50 nodes for scale testing and QA. To keep costs in check, the development team may wish to provision machines repeatedly and shut them down after use.

While such a request can be fulfilled by hand via the Google Cloud console each time, doing so is extremely error-prone and leads to inconsistent results. This speaks to the need for an automated approach, and this is where Infrastructure as Code comes in.

Description

Infrastructure-as-code (IAC) is an approach to the automation of infrastructure based on software development practices.

With the IAC approach, service provisioning and management, along with configuration changes, are expressed through code and carried out in an automated fashion. This ensures that changes are made in a consistent manner and makes them easier to validate.
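
To make the idea concrete, here is a minimal Python sketch of the declarative approach: the desired infrastructure is written down as data, and a small function works out which changes are needed to reach it. The function names, node names and machine types are hypothetical and not tied to any particular tool; real IAC tools do considerably more than this.

```python
def plan(desired, current):
    """Return the changes needed to turn `current` into `desired`.

    Both arguments map a resource name to its settings (a dict).
    """
    changes = []
    for name in sorted(desired):
        if name not in current:
            changes.append(("create", name))
        elif current[name] != desired[name]:
            changes.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            changes.append(("delete", name))
    return changes


def apply_changes(changes, desired, current):
    """Apply the planned changes to a copy of `current` and return it."""
    result = dict(current)
    for action, name in changes:
        if action == "delete":
            del result[name]
        else:  # "create" or "update"
            result[name] = desired[name]
    return result


# Desired state, as it would be kept under version control (hypothetical values).
desired = {
    "dev-1": {"machine_type": "n1-standard-1"},
    "dev-2": {"machine_type": "n1-standard-1"},
}
# What currently exists: one node has the wrong size, one is left over.
current = {
    "dev-1": {"machine_type": "n1-standard-2"},
    "old-node": {"machine_type": "n1-standard-1"},
}

changes = plan(desired, current)
print(changes)
new_state = apply_changes(changes, desired, current)
print(plan(desired, new_state))
```

Planning a second time against the new state yields no further changes, which is what makes this style of automation safe to run repeatedly.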

For example, using a tool such as Ansible, developers can provision an entire set of virtual servers running CentOS, install and configure Spark on them to form a cluster for distributed computing, run an entire ETL pipeline as a batch process, and then terminate the cluster.

The idea is that modern tools can manage infrastructure like software and data. Infrastructure can be managed via version control systems, automated testing libraries and deployment orchestration tools. Software development practices such as continuous integration and delivery (CI/CD) and test-driven development (TDD) can be applied to the management of infrastructure.

Thus IAC is a way to represent your environment in software and configuration files so that it can be replicated as many times as needed.

It consists of 2 main components:

  • a declarative template that enables us to provision resources from our cloud
    provider. Such resources could be load balancers, auto-scaling groups,
    VM instances, RDBMS etc.
  • a configuration management component: code that enables us to configure
    and deploy software on the resources we have provisioned via our declarative
    template.

Benefits

  1. In configuration management we describe every aspect of our system that
    is configurable. The intent is to eliminate the need for emergency changes
    and to prevent application configuration drift. If manual changes are made
    to the configuration, its intended state has already been codified in the
    configuration management scripts, so it can easily be restored by
    re-running those scripts.
  2. IAC eases friction between app developers and operations by requiring
    operations engineers to adhere more closely to traditional software
    development practices such as CI/CD, automated testing and source code
    version control. This has given rise to what is known as DevOps, with
    operations engineers implementing a workflow akin to a traditional
    software development life cycle.

Figure: DevOps life cycle
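
The drift scenario described in the first benefit can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the configuration is a flat dictionary, and the setting names (`max_connections`, `log_level`) are made up for the example.

```python
def find_drift(codified, live):
    """Return {setting: (codified_value, live_value)} for settings that differ."""
    drift = {}
    for key in sorted(set(codified) | set(live)):
        if codified.get(key) != live.get(key):
            drift[key] = (codified.get(key), live.get(key))
    return drift


def restore(codified):
    """Re-running the codified configuration simply reinstates it."""
    return dict(codified)


# Configuration as codified in version control (hypothetical settings).
codified = {"max_connections": 100, "log_level": "INFO"}
# Live configuration after someone made a manual emergency change.
live = {"max_connections": 500, "log_level": "INFO"}

drift = find_drift(codified, live)
print(drift)

# Restoring from the codified state removes the drift.
live = restore(codified)
print(find_drift(codified, live))
```

Because the intended state lives in code, detecting and undoing a manual change comes down to a comparison and a re-run rather than an investigation.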

Tools for Implementation

There are multiple tools used to implement the IAC process.
Some of these tools are Ansible, SaltStack, Puppet and Chef.

IAC tools are often divided into 2 groups based on their functionality:

  1. Provisioning tools
    These tools focus solely on provisioning virtual servers on premises or within a cloud environment. Examples include Terraform, AWS CloudFormation, Google Deployment Manager and Azure Resource Manager.
  2. Configuration Management tools
    These tools install and manage software on existing servers.
    Examples are: Ansible, Chef, Puppet, Salt.

Most of the more popular configuration management tools, such as Ansible, increasingly offer provisioning capabilities as well, blurring the distinction between the 2 groups.

This has led to a robust debate about their capabilities, with some commentators emphasizing the distinction. My opinion is that for more complex infrastructure requirements such a distinction may have merit and may necessitate the use of a different tool for each capability. I feel, however, that the distinction will not last long, as the vendors of these configuration management tools will increasingly add features to make their tools just as capable when it comes to provisioning and orchestration.

Tools such as Terraform, AWS CloudFormation, Google Deployment Manager and Azure Resource Manager, which are solely resource provisioning tools, need to be paired with configuration management tools such as Chef, Puppet or Ansible in order to form the full IAC stack.
A brief synopsis/comparison of each can be found in the Appendix.
For our relatively small project, we will focus on using Ansible as a full-stack IAC tool.

Code Examples

The intention is not to dive in depth into any one tool, but to give the reader an idea of what implementing infrastructure-as-code looks like.

Ansible

Here is a simple playbook that illustrates how one can provision a virtual server on Google Cloud Platform. Note that gce is a legacy module; in recent Ansible releases its role is taken by gcp_compute_instance.

- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Launch instances
      gce:
        instance_names: dev
        machine_type: n1-standard-1
        image: debian-9
        service_account_email: myself@myself.com
        credentials_file: mycredentials.json
        project_id: Test Project  # placeholder: use your own project ID

Assuming one has a Google Cloud account with the necessary credentials, we can save the above playbook to a file (provision_gce_instance.yml) and run

ansible-playbook provision_gce_instance.yml

to create a new virtual server instance on Google Cloud Platform.
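
When a playbook like this is run repeatedly, for instance to bring a cluster up and tear it down again, it can help to wrap the command in a small script. The Python sketch below only builds the command line. `--check` (dry run) and `--extra-vars` are standard ansible-playbook options, while the helper name and the variable passed in the usage example are hypothetical; the playbook shown above does not read any variables, so the extra variable is purely illustrative.

```python
import shlex


def playbook_command(playbook, check=False, extra_vars=None):
    """Build the argument list for an ansible-playbook run."""
    cmd = ["ansible-playbook", playbook]
    if check:
        cmd.append("--check")  # dry run: report changes without making them
    for key, value in sorted((extra_vars or {}).items()):
        cmd += ["--extra-vars", f"{key}={value}"]
    return cmd


cmd = playbook_command(
    "provision_gce_instance.yml",
    check=True,
    extra_vars={"machine_type": "n1-standard-1"},  # hypothetical variable
)
print(shlex.join(cmd))

# To execute it for real (requires Ansible to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the command as a list of arguments, rather than one long string, avoids quoting problems when it is later handed to `subprocess.run`.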

Appendix


Public Cloud Provider Services

Service Type                                    Service Name
----------------------------------------------  --------------------------------
Virtual servers, scalable computing on demand   Amazon EC2
                                                Google Compute Engine
                                                Azure Virtual Machines
Durable Storage                                 Google Cloud Storage
                                                Amazon S3
                                                Azure Storage
Relational Database                             Google CloudSQL
                                                Amazon RDS
                                                Azure SQL Database
Analytics & Big Data processing                 Google DataProc & DataFlow
                                                Amazon EMR
                                                Azure HDInsight
Data Warehouse                                  Google BigQuery
                                                Amazon Redshift
                                                Azure SQL Data Warehouse
Networking - DNS                                Google Cloud DNS
                                                Amazon Route 53
                                                Microsoft Azure DNS
Networking - Virtual Private Cloud              Google Cloud VPC
                                                Amazon VPC
                                                Azure Virtual Network
NoSQL Database                                  Google Cloud Datastore & Bigtable
                                                Amazon DynamoDB
                                                Azure Cosmos DB
Messaging                                       Google Cloud Pub/Sub
                                                Amazon SNS
                                                Azure Notification Hubs
Deployment/Provisioning                         Google Cloud Deployment Manager
                                                AWS CloudFormation
                                                Azure Resource Manager


Cloud Provisioning Tools

Tool                              Main Features                           Language / DSL
--------------------------------  --------------------------------------  --------------
Ansible                           Workflow orchestration                  Python, YAML
                                  Configuration management
                                  Provisioning
                                  App deployment
                                  CI/CD
SaltStack                         Cloud orchestration and automation      Python, YAML
                                  CI/CD
                                  Configuration management
                                  DevOps toolchain workflow automation
Puppet                            Configuration management                Ruby
                                  Provisioning
Chef                              Configuration management                Ruby
                                  CI/CD
                                  Provisioning
Terraform                         Provisioning                            Go, HCL
AWS CloudFormation                Provisioning                            JSON/YAML
Google Cloud Deployment Manager   Provisioning                            JSON/YAML
Azure Resource Manager            Provisioning                            JSON/YAML

 

Upload a file to Google Drive using gdrive

Steps

Quick tip on using gdrive to upload to Google Drive:

gdrive upload <path-to-local-file>

e.g.

gdrive upload mydir/myfile.txt

This uploads the file to the home directory on Google Drive, which is My Drive.

To upload to a specific directory, do the following:

List the directories on Google Drive showing directory ids:

gdrive list

Obtain the directory id for the directory you wish to upload to.

Then do

gdrive upload --parent <id> mydir/myfile.txt

to upload the file to the directory in question.

You can also search for a specific folder in Google Drive by doing:

gdrive list -q "name contains 'BigData'"
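
If you upload files regularly, the steps above can be wrapped in a small script. The following Python sketch only assembles the gdrive commands shown in this post; the helper names and the `FOLDER_ID` placeholder are hypothetical, and the directory id still has to be read from the `gdrive list` output by hand.

```python
import shlex


def upload_command(local_path, parent_id=None):
    """Build the gdrive command that uploads one file."""
    cmd = ["gdrive", "upload"]
    if parent_id is not None:
        cmd += ["--parent", parent_id]  # upload into a specific directory
    cmd.append(local_path)
    return cmd


def search_command(name_fragment):
    """Build the gdrive command that lists items whose name contains a string.

    Note: a fragment containing a single quote would break the query.
    """
    return ["gdrive", "list", "-q", f"name contains '{name_fragment}'"]


# Step 1: find the directory id. Step 2: upload into that directory.
print(shlex.join(search_command("BigData")))
print(shlex.join(upload_command("mydir/myfile.txt", parent_id="FOLDER_ID")))

# To execute a command for real (requires gdrive to be installed):
#   import subprocess
#   subprocess.run(upload_command("mydir/myfile.txt"), check=True)
```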

References

Google Drive CLI Client