How we saved millions on AWS - Part 1

Part 1: Building a Cloud Strategy

In early 2022, as the world was emerging from the Covid-19 pandemic, inflation surged to multi-decade highs, prompting central banks to raise interest rates. The stock market declined, and tech companies around the world began to shift gears: after the high-growth mindset of 2021, efficiency and profitability became the focus.

At this point, Forter was an eight-year-old start-up. We had grown rapidly in that time, both in revenue and customers. It was time to optimize our unit economics and focus more of our attention on our cloud spend.

Join us on a multi-year journey, in which we transitioned cloud cost budget ownership from Finance to Engineering, generated millions of dollars in savings, improved the company’s gross margins and valuation, and created a cultural shift that saw us adopt cost best practices across the organization.

This is Part 1 of a 3-part series. Read Parts 2 and 3 here:

  1. Part 2: Learn from dozens of real world examples that allowed us to save millions of $ in cloud costs, going in depth on S3, EC2, Elasticsearch, and more.
  2. Part 3: Create a cultural shift in your company that will allow you to sustainably and consistently ensure costs best practices are followed throughout the company.

We understand that we want to optimize our costs and increase profits. Now we needed to understand what our baseline was, set a target we wanted to reach, and create a plan to achieve this target.

Read on to learn how to create a cloud strategy from scratch, helping you to lead your company on a journey of cost optimization.


Begin with a plan

As a Principal Engineer at Forter, I was put in charge of our Cloud Cost strategy. Our SVP of Engineering presented me with these requirements:

  1. Reduce our AWS spend to align with the 2022 budget.
  2. Define a KPI for measuring our spend, and see where we stand vs. similar companies.
  3. Enable easier AWS budget planning, led by Engineering. We will provide a framework (spreadsheet) to assess cloud spend.
  4. Create visibility for Engineering and Finance teams, letting us review actual vs. plan.
  5. Change our Engineering culture to be cost-aware.

 

* me after getting the task

The strategy ended up consisting of four distinct focus areas that needed to be addressed:

  1. Calibration: understanding where we were at this point, and setting a healthy target to aspire to
  2. Planning and measurement: creating plans for our cloud spend ahead of time, and measuring ourselves against them
  3. Cost optimization: optimizing and reducing our spend to improve the company’s margins (the bulk of the work)
  4. Culture shift: getting engineers to think about cost and gross margins when planning, developing and operating their systems and products

Calibration

First, we needed to understand who in the company was responsible for the cloud budget, as that department would be the one accountable for reducing it.

Ownership of the budget

Back in 2022, the process for deciding on the cloud budget for the next year looked like this:

FP&A: Last year your AWS spend was $XM. We expect Y% growth in revenue, so the budget should be proportional to that growth. OK?

CTO: Sounds good.

There was a lot wrong with this approach.

First of all, the budget was suggested by FP&A and not Engineering. Finance had, understandably, little knowledge of how the cloud budget was spent. So their ability to predict next year’s spend was very limited.

Engineering could do much better when anticipating cloud spend. For example, we could take new product launches into account, or be aware of a new Analytics platform which would result in R&D costs growing. Similarly, cost optimization projects for the next year could be planned in advance and taken into account when deciding on a budget.

Moreover, we wanted Engineering to be accountable for the cloud spend. It needed to be our problem if we couldn’t hit cost targets. By having Engineering set a yearly target they could get behind, we put the responsibility on our shoulders. Which made sense, as ultimately we would be the ones doing most of the saving.

The only conclusion here is this: it should be Engineering’s job to say what the Cloud budget should be. Only we know the real world details, and only we can be accountable for meeting the budget.

Choosing the measurement

As we said, we needed to reduce spend and increase profit. But that’s a simplistic way to look at it. In a SaaS company, cloud spend is likely a major part of COGS (costs of goods sold). Becoming more efficient means improving the company’s gross margins.

Let’s quickly define the terms here:

Revenue: how much $$ your company makes

COGS: how much $$ in direct costs it takes to produce that revenue

Gross margin: the difference between revenue and cost of goods sold (COGS), divided by revenue

Great SaaS companies, like DataDog and Splunk, have a gross margin of >75%. The gross margin impacts how investors view and evaluate a company.

A high gross margin means a significantly higher valuation multiple for your company.
For those of us holding equity, that directly impacts the value of our stock.

If cloud spend is 90% of COGS (as is typical for SaaS), reducing the cloud spend by just 10% can have a very significant effect on your margins (e.g. 70% -> 73%). Therefore, for an engineer in a SaaS company, cloud cost optimization can be an extremely worthwhile venture, as it directly affects the company’s valuation.
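To make that arithmetic concrete, here is a quick sketch of the calculation behind the 70% -> 73% example (the numbers are illustrative, not Forter’s):

```python
# Illustrative numbers: a SaaS company with a 70% gross margin,
# where cloud spend makes up 90% of COGS.
revenue = 100.0            # normalize revenue to 100
cogs = 30.0                # 70% gross margin => COGS is 30% of revenue
cloud = 0.9 * cogs         # cloud is 90% of COGS => 27

cloud_after = cloud * 0.9  # cut cloud spend by 10%
cogs_after = cogs - (cloud - cloud_after)

margin_after = (revenue - cogs_after) / revenue
print(f"{margin_after:.1%}")  # 72.7% -- roughly the 70% -> 73% jump
```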

Setting a target

Next, we needed to know whether our cloud spend was high, low or downright embarrassing.

A simple metric to use is cloud spend as % of revenue. If a company earns $500M/year, and spends $50M on cloud expenses, their metric would be 10%, where lower means more efficient.

If I could compare Forter to similarly sized SaaS companies, I could tell where we were based on this benchmark. So, I went to do some research. Reading quarterly reports of public companies and asking around on Twitter about similarly sized companies helped me arrive at this table:

| Data intensive? | >$1B revenue companies, average cloud cost as % of revenue | <$1B revenue companies, average cloud cost as % of revenue |
|---|---|---|
| Very data intensive, data is core to the business | 4% | 8% |
| Not very data intensive | 2% | 4% |

I decided to organize the companies by two dimensions:

● Company size: bigger companies tend to scale better in terms of efficiency. They can also get much better prices from cloud vendors.

● Data intensity: for some companies (e.g. DataDog), data is core to the business; in a sense, it is the business. For those companies, it makes sense that cloud would be a significant expense (maybe even the main expense of the company).

Now, we were ready to compare ourselves. Forter is:

● A <$1B revenue company (for now!)

● Very data intensive

Each company needs to run its own analysis and decide whether it’s satisfied with the results. More important is understanding the target the company should aim for in the next few years. We saw a lot of potential for improvement, with significant upside.

Understanding the spend

Production Costs vs R&D Costs

If you remember, one of the main purposes of our entire venture here is to increase our company’s Gross Margin. As a reminder, it is calculated like this:

Gross Margin = (Revenue - COGS) / Revenue

In Forter’s case, the COGS in the formula is the cost of producing an accurate decision for a customer.

You might think that all AWS spend would be counted as COGS, but that’s actually not the case. A lot of our AWS spend is not about running our production decisions, but is rather about R&D for new features and improvements to the system. Since R&D is an operational expense (OPEX), it does not count towards COGS.

It is therefore crucial for us to have an accurate measure of our R&D spend within our AWS spend so we can subtract R&D from our COGS.

We did this with tagging. If we could add an RND=true tag to all of our AWS resources, we could easily filter on this tag in the Cost Explorer and get our R&D Costs. Alas, adding such a tag to all of our existing resources was difficult, and we opted for a simple script that runs monthly and queries the cost explorer API to find all the R&D resources and their costs.
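For reference, here is a minimal sketch of what such a script can look like with boto3 (the RND=true tag is our convention; treat the details as a starting point rather than our exact implementation):

```python
# Minimal sketch: pull last month's R&D spend from Cost Explorer,
# assuming R&D resources are tagged RND=true (our tagging convention).
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

end = date.today().replace(day=1)                 # first day of this month
start = (end - timedelta(days=1)).replace(day=1)  # first day of last month

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["NetAmortizedCost"],
    Filter={"Tags": {"Key": "RND", "Values": ["true"]}},
)

for period in resp["ResultsByTime"]:
    amount = float(period["Total"]["NetAmortizedCost"]["Amount"])
    print(f"{period['TimePeriod']['Start']}: R&D spend ${amount:,.2f}")
```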

Unit Economics

At Forter, our main business unit is a decision. For a given transaction, login, or any event that requires trust, we need to decide whether to approve or decline it. We needed to understand the cloud cost of a single decision, in order to:

● Enable complex pricing schemes—we wanted to know how much to charge a customer for 1M monthly decisions

● Understand our gross margin per customer—perhaps we were losing money on some customers and needed to renew their contract under different terms

Getting Unit Economics right is one of the best things Engineering can do for GTM and Finance departments

We called this new metric CPI - Cost per Interaction.

Since we already know our Production Costs, as explained above, the only thing missing is the denominator—the number of units. Or, in our case, the number of decisions. This gives us the following formula:

CPI = Production Costs / Number of interactions
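As a sketch, once you have the two inputs, the computation itself is a one-liner; the real work is producing reliable inputs every month (the numbers below are made up):

```python
def cost_per_interaction(total_cloud_cost: float,
                         rnd_cost: float,
                         interactions: int) -> float:
    """CPI = (total cloud cost - R&D cost) / number of interactions.

    Production cost is what remains after subtracting R&D spend
    (see the RND tagging script above).
    """
    production_cost = total_cloud_cost - rnd_cost
    return production_cost / interactions

# Made-up numbers: $1M monthly spend, $200K of it R&D, 40M decisions.
print(cost_per_interaction(1_000_000, 200_000, 40_000_000))  # 0.02 => 2 cents/decision
```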

We now track this metric on a monthly, quarterly and yearly basis, and we set yearly targets to substantially improve it over time. Thanks to the cost optimization process we went through, we were able to improve our CPI by over 70%!

Building a cloud cost model

If we want to answer the question “how much will our company spend on AWS next year”, we need a predictive model. The model needs to answer this question:

“Given our cloud spend in the previous N months, and an expected growth of X%, what will be the spend in month N+1?”

The naive formula would be something like:

spend(current_month) = spend(previous_month)*expected_revenue_growth

Applying a naive calculation, if we spent $1M on AWS in January, and the company expects 2% revenue growth in February, the February spend would be $1.02M. This naive approach was how we worked initially. It was also very wrong, for multiple reasons:

● Relying on revenue growth is wrong, as it doesn’t directly correlate with increase in cloud spend

● Storage costs tend to be cumulative, so we needed a different formula for them

● We needed to take cost optimizations and cost increases driven by new projects into account

We ended up setting up a Google Sheet, which solved each of those problems:

| Problem | Solution |
|---|---|
| Revenue as a growth metric is bad | Traffic growth is a much better metric than revenue growth. For us, traffic is the number of decisions sent to Forter. It turned out to be a much more accurate measure of how much computing power we needed. |
| Storage costs don’t grow and shrink with traffic growth | We used this formula for storage increases: `storage_cost(current_month) = storage_cost(previous_month) + storage_cost_added_by_current_month_traffic`. This requires figuring out how much storage we add to the system for each decision sent to Forter, which was hard to do accurately. But even an approximation here was much better than what we had before. |
| Cost increases and optimizations not taken into account | We created a dedicated sheet for optimizations/increases which feeds into the estimation. All teams are required to fill in their projections for the year. |
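Putting these pieces together, here is a hedged sketch of the kind of month-over-month model the spreadsheet implements (the compute/storage split and all parameter names are a simplification for illustration):

```python
# Sketch of the forecast behind the spreadsheet: compute scales with
# traffic, storage accumulates, and planned changes are added on top.
def forecast_next_month(compute_spend: float,
                        storage_spend: float,
                        traffic_growth: float,
                        storage_cost_per_decision: float,
                        decisions_next_month: int,
                        planned_changes: float = 0.0) -> float:
    """Return estimated total spend for month N+1.

    traffic_growth: expected growth in decisions, e.g. 0.02 for +2%
    planned_changes: net $ impact of known optimizations (negative)
                     and new projects (positive) for the month
    """
    compute_next = compute_spend * (1 + traffic_growth)
    storage_next = storage_spend + storage_cost_per_decision * decisions_next_month
    return compute_next + storage_next + planned_changes

# Made-up numbers: $800K compute, $200K storage, +2% traffic, ~$0.001 of
# new storage per decision, 40M decisions, and a $30K/month optimization
# landing this month.
print(forecast_next_month(800_000, 200_000, 0.02, 0.001, 40_000_000, -30_000))
```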

Understanding the AWS contract

Before looking at specific optimizations, I needed to understand our AWS Contract Terms. If you are a big customer (>$1M/year of cloud spend), you likely have an EDP (Enterprise Discount Plan). The EDP allows you to get lower prices from AWS, in return for a commitment:

● A general spend commitment for $X millions, in return for a Y% discount across all AWS services

● A service specific commitment. For example, using N petabytes of S3 storage, in return for a lower storage price per TB

In both cases, not meeting the commitment means paying the shortfall to AWS, so you don’t want to overcommit. However, under-committing means leaving money on the table (e.g. getting a 20% discount instead of 25%).
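The exact mechanics vary by contract, but a toy model is useful for sanity-checking a commitment level (illustrative numbers, not our contract terms):

```python
# Toy EDP math: what you actually pay under a spend commitment,
# including the shortfall if you under-consume.
def edp_annual_bill(on_demand_usage: float,
                    commitment: float,
                    discount: float) -> float:
    """on_demand_usage: what the year's usage would cost at list price
    commitment: the committed annual spend
    discount: e.g. 0.20 for a 20% EDP discount"""
    discounted = on_demand_usage * (1 - discount)
    # If discounted usage falls short of the commitment,
    # you pay the commitment anyway.
    return max(discounted, commitment)

# Commit $10M/year at 20% off: fine while usage stays high...
print(edp_annual_bill(14_000_000, 10_000_000, 0.20))  # 11.2M, no shortfall
# ...but if usage drops to $11M at list price, 11M * 0.8 = 8.8M,
# and you pay the $10M commitment anyway: a $1.2M shortfall.
print(edp_annual_bill(11_000_000, 10_000_000, 0.20))  # 10M
```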

We needed to develop the muscle of predicting AWS spend years in advance, as the EDP tends to be a multi-year contract. We also needed someone to be in charge of making sure we meet all the commitments in the contract. These kinds of things tend to fall between the cracks, only to be noticed when the unexpected shortfall bill comes from AWS.

Once you understand the contract, I suggest publishing the main discounts in your Engineering group. Engineers should know the real price they are paying for S3 storage, as knowing how much stuff really costs will affect architecture decisions.

Get granular with your spend

When going on an optimization binge, Cost Explorer is your best friend. It’s not very pretty, but it gets the job done.

** Cost Explorer dashboard from a dummy account

Before you do anything with Cost Explorer, consider filtering out Credit, Refund, Tax and Marketplace purchases, as they skew the numbers.

In addition, it’s important to use net amortized costs (instead of the default, unblended costs).

Net amortized costs account for all the discounts, and spread the cost of commitments paid upfront over time, based on actual usage [1].

Cost Explorer allows grouping by different kinds of dimensions. Here are the most important ones:

Service: the default. Groups by AWS Services.

Tag: group by application, system or other tags. Tags are custom strings attached to AWS resources, and are a must for understanding where the $$ goes. Read more here.

Usage Type: within an AWS service, what % of the cost went on which type of usage. For example, in S3, did we pay more for storage or API calls?

Resource: specific resources, like an S3 bucket or an EC2 instance. These are very granular and usually only useful after filtering with tags first. Note that grouping by Resource is an opt-in feature in Cost Explorer, so make sure to enable it.

Region: AWS region the resource resides in.

Grouping by Service will tell you how much you are paying for EC2 or S3. Grouping by application tag will tell you which of your internal systems are the most costly. Combining grouping dimensions with filters allows you to drill down and identify that expensive S3 bucket, surprisingly high CloudFront distribution costs or unusually high EC2 cross-AZ data transfer charges.
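As a sketch, the same Cost Explorer API used earlier can do this grouping and filtering programmatically (the application tag key is hypothetical; adjust it to your own tagging scheme):

```python
# Sketch: last month's net amortized cost per application tag,
# excluding the record types that skew the numbers.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")
end = date.today().replace(day=1)
start = (end - timedelta(days=1)).replace(day=1)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["NetAmortizedCost"],
    Filter={"Not": {"Dimensions": {"Key": "RECORD_TYPE",
                                   "Values": ["Credit", "Refund", "Tax"]}}},
    GroupBy=[{"Type": "TAG", "Key": "application"}],  # hypothetical tag key
)

for group in resp["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]  # e.g. "application$decision-service"
    amount = float(group["Metrics"]["NetAmortizedCost"]["Amount"])
    print(f"{tag_value}: ${amount:,.2f}")
```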

System Cost Targets

How much should a system cost? Having one target for the entire cloud spend of an engineering organization is far too coarse. By helping teams set cost targets for their own systems, we:

● Better focus the optimization efforts on the most wasteful systems

● Ensure teams can manage and own their budgets, and check themselves against their target (see also the cultural learnings we’ve adopted)

To build such targets, we encourage you to build a table like this:

| System | How much it currently costs (%) | How much it should cost (%) | How much it should cost ($/month) |
|---|---|---|---|
| Decision Service | 40% | 60% | $6,000 |
| Observability ETL | 20% | 10% | $1,000 |
| Customer Front End | 20% | 10% | $1,000 |
| Streaming Engine | 20% | 20% | $2,000 |

Determining how much a system should cost is not easy and requires strong intuition about the value of each system.

Once we had this table, it became very clear which systems were in the most dire need of optimization. We also had clear targets for teams to manage and measure themselves against.


Creating a plan, calibrating and setting a target set us on a path. Continue reading here to see how the journey proceeded, and which optimization strategies we ended up implementing based on those findings.



[1] Suppose you bought an upfront Savings Plan with a 1-year term. You paid in advance, so all the $$ was spent in January; however, you use the Savings Plan during the entire year. With “Unblended cost”, January would appear to be a really expensive month and the following months would look “free”, which is not how most people think about the cost of services and systems. Amortizing the cost of the Savings Plan across the year makes more sense here, and will let you understand the real cost of things. It will also align you better with your accounting team, which is probably amortizing pre-paid expenses.
