Introduction
Infrastructure as code, otherwise known as programmable infrastructure, is one of the key practices utilized in implementing data projects in the cloud.
In this article I explain what is meant by infrastructure as code, and discuss both the motivations for implementing it and the mechanics.
Motivation
Virtualization has helped to change how computing resources are requested and provided. Suddenly the balance of power shifted from the system administrators who provide these resources to the developers who request them for their projects. With virtualized server farms, developers could provision and configure high-end computing nodes for a project in minutes rather than days or weeks.
This trend has only accelerated with the move to cloud computing and the public cloud in the past 5 years.
In public cloud computing, scalable information resources are provided as a service to multiple external customers via the Internet.
In fact, the cloud computing era has only been made possible by virtualization, so much so that it is often dubbed Virtualization 2.0.

The top 3 public cloud platforms are: Google Cloud Platform, Amazon AWS and Microsoft Azure.
Each of them provides a range of services including but not limited to:
- Virtual Servers & Scalable Computing on Demand
- Durable Storage
- Relational Databases
- Analytics & Big Data processing
and so on. A more detailed view of the services offered by the various cloud providers is available in the Appendix.
While such a request can be fulfilled via the Google Cloud console each time, doing so manually can be extremely error-prone and inconsistent. This speaks to the need for some kind of automated approach, and this is where Infrastructure as Code comes in.
Description
Infrastructure-as-code (IAC) is an approach to the automation of infrastructure based on software development practices.
With the IAC approach, service provisioning and management along with configuration changes are expressed in an automated fashion through code. This ensures that changes are made in a consistent manner and enables easier validation of these changes.
For example, using a tool such as Ansible, developers can provision an entire set of virtual servers running CentOS, install and configure Spark on them to form a cluster for distributed computing, run an entire ETL pipeline as a batch process, and then terminate the cluster.
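The lifecycle in this example (provision, configure, run, terminate) can be sketched in plain Python. This is not Ansible code: `FakeCloud`, `run_ephemeral_job` and the node names are hypothetical, in-memory stand-ins, meant only to show why expressing the whole lifecycle in code makes the teardown step reliable.

```python
from typing import Callable, Dict, List


class FakeCloud:
    """In-memory stand-in for a cloud API (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, List[str]] = {}  # node name -> installed software

    def provision(self, count: int, image: str) -> List[str]:
        names = [f"{image}-node-{i}" for i in range(count)]
        for name in names:
            self.nodes[name] = []
        return names

    def install(self, node: str, package: str) -> None:
        if package not in self.nodes[node]:
            self.nodes[node].append(package)

    def terminate(self, names: List[str]) -> None:
        for name in names:
            self.nodes.pop(name, None)


def run_ephemeral_job(cloud: FakeCloud, count: int,
                      job: Callable[[List[str]], str]) -> str:
    """Provision a cluster, configure it, run the job, and always tear it down."""
    names = cloud.provision(count, image="centos")
    try:
        for name in names:
            cloud.install(name, "spark")
        return job(names)
    finally:
        # Runs even if the job raises, so no nodes are left running.
        cloud.terminate(names)


cloud = FakeCloud()
result = run_ephemeral_job(cloud, 3, lambda nodes: f"ETL ran on {len(nodes)} nodes")
print(result)
print(cloud.nodes)
```

Because the teardown lives in a `finally` block, a failed batch job still releases the cluster, which is hard to guarantee when the same steps are performed by hand.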
The idea is that modern tools can manage infrastructure like software and data.
Infrastructure can be managed via version control systems, automated testing libraries and deployment orchestration tools. Software development practices such as continuous integration and delivery (CICD) and test-driven development (TDD) can be applied to the management of infrastructure.
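As a rough illustration of applying automated testing to infrastructure, the sketch below validates an infrastructure definition before anything is provisioned. The definition format, the resource names and the policy (allowed machine sizes, forbidden ports) are all assumptions made up for this example rather than the format of any particular tool.

```python
from typing import Dict, List

# A hypothetical infrastructure definition, kept in version control like any other code.
SPEC: Dict[str, dict] = {
    "web-1": {"type": "vm", "machine": "small", "open_ports": [80, 443]},
    "db-1": {"type": "vm", "machine": "large", "open_ports": [5432, 22]},
}

ALLOWED_MACHINES = {"small", "medium", "large"}
FORBIDDEN_PORTS = {22, 3389}  # illustrative policy: no remote-admin ports exposed


def validate(spec: Dict[str, dict]) -> List[str]:
    """Return a list of human-readable policy violations (empty if the spec passes)."""
    problems: List[str] = []
    for name, resource in sorted(spec.items()):
        if resource.get("machine") not in ALLOWED_MACHINES:
            problems.append(f"{name}: unknown machine size {resource.get('machine')!r}")
        exposed = FORBIDDEN_PORTS.intersection(resource.get("open_ports", []))
        for port in sorted(exposed):
            problems.append(f"{name}: port {port} must not be exposed")
    return problems


print(validate(SPEC))
```

A check like this can run in a continuous integration pipeline on every commit, so a risky change is rejected before it reaches a real environment.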
Thus IAC is a way to represent your environment using software/config files so that it can be replicated multiple times.
It consists of 2 main components:
- a declarative template that enables us to provision resources from our cloud provider. Such resources could be load balancers, auto-scaling groups, VM instances, RDBMS etc.
- a configuration management component: code that enables us to configure and deploy software on the resources we have provisioned via our declarative template.
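The two components can be sketched together in a few lines of Python. This is a toy model under stated assumptions: `TEMPLATE`, `plan`, `provision` and `configure` are hypothetical names, and the "cloud" is just a dictionary, but the split between declaring resources and then configuring them mirrors the description above.

```python
import copy
from typing import Dict, List, Tuple

# Component 1: a declarative template describing the resources we want.
TEMPLATE: Dict[str, dict] = {
    "lb-1": {"kind": "load_balancer"},
    "vm-1": {"kind": "vm", "packages": ["nginx"]},
    "vm-2": {"kind": "vm", "packages": ["nginx"]},
}


def plan(desired: Dict[str, dict], existing: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Compare desired and existing resources; return (action, name) steps."""
    steps = [("create", name) for name in sorted(desired) if name not in existing]
    steps += [("delete", name) for name in sorted(existing) if name not in desired]
    return steps


def provision(desired: Dict[str, dict], existing: Dict[str, dict]) -> Dict[str, dict]:
    """Apply the plan to an in-memory inventory (a stand-in for a cloud API)."""
    inventory = copy.deepcopy(existing)
    for action, name in plan(desired, existing):
        if action == "create":
            inventory[name] = {"kind": desired[name]["kind"], "installed": []}
        else:
            del inventory[name]
    return inventory


# Component 2: configuration management, run against the provisioned resources.
def configure(inventory: Dict[str, dict], desired: Dict[str, dict]) -> None:
    for name, resource in inventory.items():
        installed = resource.setdefault("installed", [])
        for package in desired[name].get("packages", []):
            if package not in installed:
                installed.append(package)


existing = {
    "vm-1": {"kind": "vm", "installed": []},
    "old-vm": {"kind": "vm", "installed": []},
}
steps = plan(TEMPLATE, existing)
inventory = provision(TEMPLATE, existing)
configure(inventory, TEMPLATE)
print(steps)
print(inventory)
```

Running `plan` again against the resulting inventory returns an empty list, which is the property that lets the same template be applied repeatedly to replicate an environment.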
Benefits
- In configuration management we’re describing every aspect of our system that is configurable. The intent is to eliminate the need to make emergency changes and prevent application configuration drift. This is because if manual changes are made to the configuration, its original state has been codified via configuration management scripts and thus can easily be restored by executing those scripts.
- IAC eases friction between app developers and operations by requiring operations engineers to adhere more closely to traditional software development practices such as CICD, automated testing and source code version control. This has given rise to what is known as DevOps, with operations engineers implementing a workflow akin to a traditional software development life cycle.
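The drift argument in the first benefit can be made concrete with a small sketch. The settings and the function names `detect_drift` and `restore` are hypothetical; the point is only that once the desired state is codified, finding and undoing a manual change becomes a mechanical comparison.

```python
from typing import Dict, Optional, Tuple

# Desired configuration as codified in version control (values are illustrative).
DESIRED: Dict[str, str] = {"max_connections": "200", "log_level": "info", "tls": "on"}

Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def detect_drift(desired: Dict[str, str], actual: Dict[str, str]) -> Drift:
    """Map each drifted key to (desired value, actual value); None means absent."""
    drift: Drift = {}
    for key in sorted(set(desired) | set(actual)):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


def restore(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `actual` with every drifted setting reset to the codified state."""
    restored = dict(actual)
    for key, (want, _have) in detect_drift(desired, actual).items():
        if want is None:
            del restored[key]      # setting is not in the codified state: remove it
        else:
            restored[key] = want   # reset to the codified value
    return restored


# Simulate an emergency manual change made directly on a server.
actual = dict(DESIRED)
actual["log_level"] = "debug"
actual["debug_port"] = "9999"

drift = detect_drift(DESIRED, actual)
restored = restore(DESIRED, actual)
print(drift)
print(restored == DESIRED)
```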
Tools for Implementation
IAC tools are often divided into 2 groups based on their functionality:
- Provisioning tools
These tools focus solely on provisioning virtual servers on premises or within a cloud environment. Examples include Terraform, AWS CloudFormation, Google Cloud Deployment Manager and Azure Resource Manager.
- Configuration Management tools
These tools install and manage software on existing servers. Examples are: Ansible, Chef, Puppet, Salt.
Most of the more popular configuration management tools such as Ansible increasingly offer provisioning capabilities, blurring the distinction between the 2 groups.
This has led to a robust debate about their capabilities, with some commentators emphasizing the distinction. My opinion is that for more complex infrastructural requirements such a distinction may have merit and may necessitate using a different tool for each capability. However, I feel the distinction will not last for long, as the vendors of these configuration management tools will increasingly add features to make their tools just as capable when it comes to provisioning and orchestration.
Code Examples
The intention is not to dive in depth into any one tool, but to give the reader an idea of what implementing infrastructure-as-code looks like.
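As a tool-agnostic illustration, the sketch below mimics the way configuration management tools describe desired state rather than commands. It is plain Python, not the syntax of Ansible or any other tool; the task format, the package names and `apply_tasks` are assumptions made for this example.

```python
from typing import Dict, List, Set

# A playbook-like task list: each task declares a desired state, not a command.
TASKS: List[Dict[str, str]] = [
    {"package": "openjdk", "state": "present"},
    {"package": "spark", "state": "present"},
    {"package": "telnet", "state": "absent"},
]


def apply_tasks(installed: Set[str], tasks: List[Dict[str, str]]) -> int:
    """Bring `installed` in line with `tasks`; return how many tasks changed something."""
    changed = 0
    for task in tasks:
        name, state = task["package"], task["state"]
        if state == "present" and name not in installed:
            installed.add(name)
            changed += 1
        elif state == "absent" and name in installed:
            installed.remove(name)
            changed += 1
    return changed


host = {"telnet", "vim"}          # packages on a hypothetical server
first = apply_tasks(host, TASKS)  # first run makes the necessary changes
second = apply_tasks(host, TASKS) # second run finds nothing left to do
print(first, second, sorted(host))
```

The second run reporting zero changes is the idempotency that makes it safe to re-run the same definition against a server at any time.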
Appendix
Public Cloud Provider Services
Service Type | Google Cloud Platform | Amazon AWS | Microsoft Azure |
---|---|---|---|
Virtual servers, scalable computing on demand | Google Compute Engine | Amazon EC2 | Azure Virtual Machines |
Durable Storage | Google Cloud Storage | Amazon S3 | Azure Storage |
Relational Database | Google Cloud SQL | Amazon RDS | Azure SQL Database |
Analytics & Big Data processing | Google Dataproc & Dataflow | Amazon EMR | Azure HDInsight |
Data Warehouse | Google BigQuery | Amazon Redshift | Azure SQL Data Warehouse |
Networking - DNS | Google Cloud DNS | Amazon Route 53 | Microsoft Azure DNS |
Networking - Virtual Private Cloud | Google Cloud VPC | Amazon VPC | Azure Virtual Network |
NoSQL Database | Google Cloud Datastore & Bigtable | Amazon DynamoDB | Azure Cosmos DB |
Messaging | Google Cloud Pub/Sub | Amazon SNS | Azure Notification Hubs |
Deployment/Provisioning | Google Cloud Deployment Manager | AWS CloudFormation | Azure Resource Manager |
Cloud Provisioning Tools
Tool | Main Features | Domain-Specific Language (DSL) |
---|---|---|
Ansible | Workflow orchestration, configuration management, provisioning, app deployment, CICD | YAML (tool written in Python) |
SaltStack | Cloud orchestration and automation, CICD, configuration management, DevOps toolchain workflow automation | YAML (tool written in Python) |
Puppet | Configuration management, provisioning | Puppet DSL (Ruby-based) |
Chef | Configuration management, CICD, provisioning | Ruby DSL |
Terraform | Provisioning | HCL (tool written in Go) |
AWS CloudFormation | Provisioning | JSON/YAML |
Google Cloud Deployment Manager | Provisioning | YAML (with Jinja2 or Python templates) |
Azure Resource Manager | Provisioning | JSON |