Terraform Journey in a Startup

Pros and Cons | Best Practices | Retrospection

Feb 14, 2024

Introduction

Thank you for clicking through to my article. I've been a DevOps engineer for 2 years on a dev team of 7 engineers.

My name is MINSEOK, LEE, but I use Unchaptered as an alias on the internet. You can call me either "MINSEOK, LEE" or "Unchaptered" if you want to ask me something.

Topics

This post is a retrospective of an early startup's Terraform Journey.

Pros and Cons
Solutions to the Cons
Some Terraform example code

Target Engineer

As a DevOps engineer, I had the following concerns, so I hope this article helps startup engineers with similar ones.

Want to follow Best Practices and Security Principles.
Want to increase the reliability of infrastructure provisioning and operations.
Want to reduce "human error" in repetitive tasks.

Target Team, Account, Product

10 engineers or fewer.
1 or 2 DevOps engineers.
Have complex infrastructure workloads
1. Have workloads based on a scheduler.
2. Have workloads based on messaging (SQS).
Have several AWS accounts for several products

Connections

GitHub : github.com/unchaptered

Requisites

This section answers two questions: "What are IaC Tools?" and "Why did you choose Terraform, not CloudFormation, AWS CDK and so on?"

What are IaC Tools?

IaC is short for Infrastructure as Code.
It literally means "replacing the Console/GUI with code for provisioning infrastructure".

If you want more information,
- Read "Infrastructure as Code (IaC) — What is it?"
- Watch "Infrastructure as code: What is it? Why is it important?"

What are the expected Pros?

First, when I learned about IaC Tools in Oct 2022, I expected these pros.

Versioning of infra
Increase reusability
Reduce documentation
"Can" power up automation pipeline or CI/CD Platform

And in Dec 2023, after using IaC Tools for a while, I came to expect one more pro.

"Can" power up testability of infrastructure.

What's the meaning of "Can"?
I think Terraform is optimized for provisioning infrastructure resources.
With Terraform alone, you can't power up automation and testability.

Rather, when you use Terraform modules, errors propagate from parent to child modules. So I think Terraform sometimes reduces testability.

So DevOps engineers integrate additional tools, such as Ansible, Terragrunt, and Terratest, to power up automation and increase testability.

What are the expected Cons?

From Oct 2022 to Feb 2024, I thought IaC Tools would reduce productivity temporarily.

But with a good module system and enough technical proficiency, I think productivity will be similar to or better than the Console (manual process).

Also, backend engineers have trouble working with Terraform because of the learning curve.

Why did I choose Terraform?

Alternatives to Terraform include AWS CloudFormation, AWS CDK, and Pulumi.

I had the following requirements.

Doesn't require any programming skills
I didn't assume every DevOps engineer is good at dealing with multiple programming languages. I think writing IaC in Java or TypeScript has its own pros and cons.
Isn't locked into a specific CSP (AWS, Azure)
I needed a solution that would scale to Hybrid Cloud or Multi-CSP.
Must have an engineer community.

Because of the first requirement, I couldn't choose CDK or Pulumi.
Because of the second requirement, I didn't choose CloudFormation.
Of the many other IaC Tools out there, Terraform had the largest community.

So I chose Terraform as our IaC tool.
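As a rough illustration of the multi-CSP requirement, a single Terraform configuration can declare more than one provider. The sketch below is not taken from our codebase: the file name, version constraints, regions, and project ID are all assumptions chosen for the example.

```hcl
# versions.tf (hypothetical file name)
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed version constraint
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0" # assumed version constraint
    }
  }
}

# AWS hosts the main infrastructure in this sketch.
provider "aws" {
  region = "ap-northeast-2" # assumed region
}

# GCP is configured alongside AWS in the same configuration.
provider "google" {
  project = "example-project" # hypothetical project ID
  region  = "asia-northeast3" # assumed region
}
```

Resources from both providers can then live in the same configuration, which is the kind of flexibility CloudFormation alone would not give.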

Design of Structure

Before starting work, I thought about the structure of the task in two steps.

Define the complexity of the company's infrastructure.
Design a folder structure to match.

Determine Structure Type with Best Practices

I think changing the structure of a Terraform project later is very difficult.
Therefore, I wanted to make our structure as extensible as possible.
First, I read the article "Terraform Best Practices - Code structure examples", written by Anton Babenko. The categorization made sense for us.

However, due to the fast-paced nature of an early startup, I thought about it in terms of time: past, present and future.

Terms of Time	Key Requirements	Type
Past	Use AWS to provision infrastructure; use GCP for SDK (geolocation API)	large
Present	Use multiple AWS accounts; deploy the entire service in perfect `N pairs`	very-large
Future	Currently considering Oracle, NCP, GCP etc. for each business	very-large

Given the decisions made for each business/product, the complexity had already become too high.

In Dec 2023, our first-party service launched.
In Jan 2024, we prepared to modify the first-party service to integrate with an external service.
Also in Jan 2024, we prepared to launch a new MVP service for user demand research.

Several SaaS AI offerings were also expected.

So I determined that our service type is very-large.

Design Folder Structure with Best Practices

Around June 2023, I was using Ansible with a poor file structure that I had designed from my own ideas. It took me 2 weeks to change it to a good structure. So this time, I researched many Best Practices references.

In particular, I looked up "MarketCurly, DevOps' Terraform Journey".
MarketCurly is a same-day grocery delivery service in the Republic of Korea.

MarketCurly introduces this folder structure.

├── README.md
├── env                             // Environment Files
│   ├── dev
│   └── stg
│       ├── main.tf
│       ├── terraform.tfvars
│       ├── variables.tf
│       └── version.tf
└── modules                         // AWS modules Codes
    ├── acm
    └── compute
        └── alb
            ├── main.tf
            ├── output.tf
            ├── variables.tf
            └── versions.tf

By default, it separates environment files from module files.
However, the article only covered some troubleshooting and didn't give me the overall project structure.

Depending on their purpose, I divided resources into two categories.

Company Resources
These are resources used by multiple products.
Product Resources
These are resources used by a single product.

└── services
    ├── <COMPANY>
    ├── <PRODUCT_A>
    └── <PRODUCT_B>

I further categorized the modules based on their purpose.
For example, it would be dangerous to manage storage (s3), database (rds), compute (ec2), and serverless (lambda) in a single folder.

Why did I think a single folder is dangerous?
Terraform actions basically consist of create, update and destroy.
Because some resources or arguments don't support in-place update, Terraform replaces the resource, that is, destroys it and creates a new one. By default the old resource is destroyed before the new one is created, so the resource is briefly missing. With create_before_destroy the order is reversed, so the old and new resources exist at the same time, which can cause a fatal problem if a particular value (such as a name) must be unique.

Also, some syntax such as for_each can cause resources in a collection to be destroyed and recreated, for example when their keys change.

Therefore, separating the modules used in the product according to their purpose is safer than a single folder.
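To make the destroy-and-create risk more concrete, here is a minimal sketch of two guards Terraform offers. The variable name, bucket names, and map keys are hypothetical. Keying for_each by a map means that removing one entry only affects that entry, and prevent_destroy makes Terraform reject any plan that would destroy the resource.

```hcl
# Hypothetical input: stable keys mapped to bucket names.
variable "buckets" {
  type = map(string)
  default = {
    logs   = "example-dev-logs"
    assets = "example-dev-assets"
  }
}

resource "aws_s3_bucket" "this" {
  # Each instance is addressed by its map key (for example
  # aws_s3_bucket.this["logs"]), so removing "assets" from the map
  # does not touch the "logs" bucket.
  for_each = var.buckets
  bucket   = each.value

  lifecycle {
    # Terraform returns an error for any plan that would destroy
    # this resource, including a destroy-and-create replacement.
    prevent_destroy = true
  }
}
```

This does not remove the need to separate folders by purpose, but it limits how far a single mistake can spread.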

└── services
    ├── <COMPANY>
    │   └── domain
    ├── <PRODUCT_A>
    └── <PRODUCT_B>
        ├── compute
        ├── storage
        ├── database
        └── serverless

Example Codes

Here's a simple code example to help your understanding.
After some feedback, I realized the problems with this structure.
Therefore, I recommend that you use this code "for understanding" only.

Defining S3 Module Codes

I've written code to provision an AWS S3 Bucket using Terraform.
Following the designed folder structure, I separate main.tf, variables.tf, and outputs.tf.

main.tf : Define the module's resources
variables.tf : Define the module's arguments
outputs.tf : Define the module's attributes

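# modules/s3/bucket/main.tf
# Note: in AWS provider v4 and later, the `acl` argument is deprecated
# in favor of the separate aws_s3_bucket_acl resource.
resource "aws_s3_bucket" "aws_s3_bucket_module" {
  bucket = var.bucket_name
  acl    = var.bucket_acl
}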

# modules/s3/bucket/variables.tf
variable "bucket_name" {
  type    = string
}
variable "bucket_acl" {
  type    = string
}

# modules/s3/bucket/outputs.tf
output "bucket_domain_name" {
  value    = aws_s3_bucket.aws_s3_bucket_module.bucket_domain_name
}

Why declare output blocks?
Inside the module, you can access aws_s3_bucket.aws_s3_bucket_module.bucket_domain_name directly.
However, in order for the calling module to access an attribute of an inner module, it must be declared as an output in that inner module.
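As a small sketch of why the output is needed, the calling side can only read the attribute through the module's declared output. The literal bucket name and the output name sample_bucket_domain_name are hypothetical; the module path and the bucket_domain_name output match the module defined above.

```hcl
# Calling configuration (for example under services/product/storages/).
module "sample_s3_bucket" {
  source      = "../../../modules/s3/bucket"
  bucket_name = "example-dev-s3-bucket-sample" # hypothetical name
  bucket_acl  = "private"
}

# Works only because the module declares `output "bucket_domain_name"`.
# The inner aws_s3_bucket resource is not addressable from here.
output "sample_bucket_domain_name" {
  value = module.sample_s3_bucket.bucket_domain_name
}
```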

Use S3 Module Codes

Let's provision our infrastructure by sourcing (importing) the s3 module from the services folder. If you want to deploy your product to dev, prod, stage, qa and so on, you'll need to use variables again.

# services/product/storages/sample_s3_bucket.tf
module "sample_s3_bucket" {
  source      = "../../../modules/s3/bucket"
  bucket_name = "${var.service}-${var.stage}-s3-bucket-sample"
  bucket_acl  = "private" # must be a valid canned ACL such as "private"
}

# services/product/storages/_.variables.tf
variable "service" { type = string }
variable "stage" { type = string }
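Since the stage value ends up inside resource names, one optional hardening step is a validation block, so that a typo in a tfvars file fails at plan time. This is only a sketch: the list of allowed stages is an assumption, and this block would replace the one-line declaration of "stage" shown above.

```hcl
variable "stage" {
  type        = string
  description = "Deployment stage, used as part of resource names."

  validation {
    # The allowed values are an assumption for this example.
    condition     = contains(["dev", "stg", "qa", "prod"], var.stage)
    error_message = "The stage must be one of: dev, stg, qa, prod."
  }
}
```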

Create tfvars file

Next, create a tfvars file for each environment.

# env/dev/sample_s3_bucket.tfvars

service = "example"
stage = "dev"

Create S3 Bucket

Now you can provision the infrastructure.

cd services/product/storages/

terraform init
terraform apply -var-file=../../../env/dev/sample_s3_bucket.tfvars

Can DevOps and Backend Engineers use Terraform together in this system? "No!"

Just before the product launched, I did a self code review first.
Then I asked "Server Engineer Lay" a question: "Do you think you could make simple modifications using Terraform?" He said, "No, I'm a little worried about that."

Too many files, too much code

S3 buckets have different settings depending on their purpose.
When you use the AWS Console/GUI, some options are assigned automatically. But in Terraform, you need to set these options manually.

In the s3 examples, we use the following resources together.

acl
bucket
bucket_notification
bucket_policy
cors_configuration
ownership_controls
public_access_block

So a properly secured CSP resource takes tens to hundreds of lines in a single any.tf file. If your service is more complex, you'll have anywhere from a few to dozens of such .tf files in one folder.

This means you'll end up with thousands of lines of terraform code in one folder.

It makes the code hard to read and unwieldy to work with.
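To give a feel for how quickly this grows, here is a minimal sketch of one bucket with only two of the companion resources from the list above. The resource label and bucket name are hypothetical, and a real bucket would usually also need a policy, notifications, and CORS rules.

```hcl
resource "aws_s3_bucket" "sample" {
  bucket = "example-dev-s3-bucket-sample" # hypothetical name
}

# Disable ACLs so the bucket owner owns every object.
resource "aws_s3_bucket_ownership_controls" "sample" {
  bucket = aws_s3_bucket.sample.id

  rule {
    object_ownership = "BucketOwnerEnforced"
  }
}

# Block every form of public access to the bucket.
resource "aws_s3_bucket_public_access_block" "sample" {
  bucket = aws_s3_bucket.sample.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Even this reduced version is already about twenty lines for a single bucket.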

Backend Developer's Answer

As I mentioned earlier, one folder has thousands of lines.
And each Terraform module has many references to other modules, so it seems the developer was afraid of side effects.

So I realized there was a fatal problem with this approach.
Backend engineers can't safely modify any module. If the organization wants backend engineers to modify Terraform code, the DevOps engineer must share knowledge of the entire Terraform codebase.

That doesn't look smart to me.

Were there any other critical issues?

If you use the Terraform module system for reusability, you'll encounter error propagation. When you modify a high-level module, a low-level module can sometimes produce errors.

So you must write test code to reduce side effects.
Nowadays I use Terratest to test Terraform.

inblog.ai/unchaptered/How can I test Terraform?
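The linked article covers Terratest, which is written in Go. To stay in HCL here, the sketch below swaps in a different tool: Terraform's built-in test framework (terraform test, available from Terraform 1.6). It is not the Terratest setup I actually use. The file path and test name are hypothetical, and it assumes AWS credentials and a region are available in the environment so that the plan can run.

```hcl
# modules/s3/bucket/tests/bucket.tftest.hcl (hypothetical path)

# Input values passed to the module under test.
variables {
  bucket_name = "example-dev-s3-bucket-sample"
  bucket_acl  = "private"
}

run "bucket_name_is_passed_through" {
  # Only plan, so no real bucket is created by this test.
  command = plan

  assert {
    condition     = aws_s3_bucket.aws_s3_bucket_module.bucket == var.bucket_name
    error_message = "The bucket name should equal the bucket_name input variable."
  }
}
```

Running terraform init and then terraform test inside modules/s3/bucket would evaluate the assertion. A check like this can catch a module change that silently breaks its callers.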

Conclusion

I've been working with Terraform for 3 months, 7 days a week, and it's hard to change our production code now. So that's where I'd like to conclude for now.

Pros

[IDK] Versioning of Infra
- Expected : Good
- Reality : I don't feel any advantages yet.
[GOOD] Increase reusability
- Expected : Good
- Reality : Reusability clearly increased across all modules.
[GOOD] Reduce documentation
- Expected/Reality : Good
[GOOD] "Can" power up automation pipeline or CI/CD Platform
- Expected : Maybe good?
- Reality : It's incredibly useful for DevOps. With cloud secret storage (Vault, AWS Secrets Manager), you can manage secrets or IDs in one central place.
[BAD] "Can" power up testability of infrastructure
- Expected : Maybe Good?
- Reality : I think the Terraform module system reduces testability.
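As a sketch of the secret storage point in the automation item above, Terraform can read a value from AWS Secrets Manager through a data source instead of hard-coding it. The secret name and the "username" key are hypothetical, and the sketch assumes the secret is stored as a JSON string.

```hcl
# Read the current version of an existing secret.
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "example/dev/db" # hypothetical secret name
}

locals {
  # Assumes the secret string is JSON, e.g. {"username": "...", "password": "..."}.
  db_credentials = jsondecode(data.aws_secretsmanager_secret_version.db.secret_string)
}

output "db_username" {
  value     = local.db_credentials["username"]
  sensitive = true # hide the value in CLI output
}
```

Note that values read this way are still written to the Terraform state, so the state backend itself needs to be protected.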

Cons

[GOOD] reduce productivity
- Expected : reduce productivity
- Reality : If you use only VPC and EC2, Terraform reduces productivity. But if you use many cloud resources, I think Terraform increases productivity.
[BAD] backend engineer's learning curve
- Expected : backend engineers can't modify infra using Terraform.
- Reality : I think it's true.

How am I improving my use of Terraform?

Of all the expected Pros and Cons, one item from each list turned out to be a con.

[Pros to Cons] "Can" power up testability of infrastructure
- [Solution] Use test code with terratest
  - Article : How can I test Terraform?
  - GitHub : github.com/unchaptered/iac-storage
[Cons to Cons] backend engineer's learning curve
- [Solution] Use a separated module layer
  - Article : Production-level Guide to Terraform
  - GitHub : github.com/unchaptered/iac-storage