How we saved millions on AWS - Part 2

Part 2: Cost Saving Initiatives

In early 2022, as the world was emerging from the Covid-19 pandemic, inflation surged to multi-decade highs, prompting central banks to raise interest rates. The stock market declined, and tech companies around the world began to shift gears: after the high-growth mindset of 2021, efficiency and profitability became the focus.

At this point, Forter was an 8-year-old start-up. In our early days, and especially during the Covid years, we grew rapidly, both in revenue and in customers. It was time to optimize our unit economics and focus more of our attention on our cloud spend.

Join us on a multi-year journey in which we transitioned cloud cost budget ownership from Finance to Engineering, generated millions of dollars in savings, improved the company’s gross margins and valuation, and created a cultural shift that saw us adopt cost best practices across the organization.

This is Part 2 of a 3-part series. Read parts 1 and 3 here:

  1. Part 1: Understand how to launch your cost optimization journey, setting targets, understanding Gross Margins, and more.
  2. Part 3: Create a cultural shift in your company that will allow you to sustainably and consistently ensure costs best practices are followed throughout the company.

After creating a cloud cost strategy, understanding our spend and setting targets, we were ready to begin saving. Below, we outline dozens of best practices, tools and workflows we’ve used or implemented to drastically reduce our costs and increase our gross margins.

Following the steps outlined below, we were able to generate a >80% improvement in our cloud costs, save millions of dollars, and meaningfully improve Forter’s gross margin and valuation. Now you can do so as well!


Tagging your resources

Why tagging is crucial

Once you understand what your spend targets are, the first thing you have to do is to attribute all of your spend to the relevant systems and teams. The way to do so in AWS is by tagging literally everything under the sun.

The most important, though by no means only, tagging targets for us were EC2 machines (later joined by EKS namespaces as we began the transition to K8s) and S3 buckets, as Compute and Storage accounted for the vast majority of our cloud spend.

Tagging every EC2 instance and S3 bucket allowed us to answer questions like:

● How much does service X cost us?

● Which team in Forter uses the most S3 storage?

These answers then allowed us to ask the next-level questions, like “do we want to be paying $Y for service X?” and “where should we be investing first to save the most costs?”

How to tag effectively

We decided to add the following tags to every AWS resource:

● application: the name of the application or service that uses the resource

● system: the high level component the application belongs to

● team: the Forter Engineering team that owns the resource

● tech: the technology that’s running on the resource (like Elasticsearch or in-house)

At this point, Forter had many thousands of AWS resources. Tagging everything manually using AWS Tag Editor was not an option, and we chose to utilize Infrastructure as Code.

At Forter, we use a mixture of automatically generated CloudFormation templates and Terraform (you can read more about our system here). Both of these tools support automatic AWS resource tagging. Here’s how to do so with Terraform.

Even with automation, tagging everything was a major challenge. Understanding who owns which resource was a daunting task. And we were never able to reach 100% coverage (though we got pretty close to it), because some resources, like Kafka Connect, don’t support custom tags in AWS.

But the crucial point here remains:

Ensuring comprehensive tag coverage on your resources is essential for optimizing costs effectively
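To keep that coverage honest, it helps to periodically audit which resources are missing the required tags. Here is a minimal sketch of such an audit for EC2 instances using boto3; the tag keys match the four we listed above, but the script itself is illustrative rather than the exact tooling we run internally.

```python
import boto3

# The four tag keys described above
REQUIRED_TAGS = {"application", "system", "team", "tech"}

def find_undertagged_instances(region="us-east-1"):
    """Yield (instance id, missing tag keys) for every EC2 instance
    that lacks one or more of the required tags."""
    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {tag["Key"] for tag in instance.get("Tags", [])}
                missing = REQUIRED_TAGS - tags
                if missing:
                    yield instance["InstanceId"], sorted(missing)

if __name__ == "__main__":
    for instance_id, missing in find_undertagged_instances():
        print(f"{instance_id} is missing tags: {', '.join(missing)}")
```

The same idea extends to S3 buckets, EKS namespaces and the rest of your tagging targets.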

Choosing cost savings initiatives

We explain here how to use the AWS Cost Explorer to understand where to most effectively spend the Engineering effort on cost optimization.

After diving into the details for a long while, we came up with the following list of potential optimization targets, with some initiatives for each one:

S3: high data storage cost

  1. Ensure we compress all the data sent to S3
  2. Utilize cheaper S3 storage tiers
  3. Delete unnecessary data

S3: high API cost

  1. Identify why the S3 API calls cost is so high and reduce it

EC2: high on-demand usage

  1. Increase Reserved Instances and Savings Plan coverage
  2. Buy additional commitments, and do so with optimal savings

AWS: high cost of secondary region

  1. Given that we have a main and a secondary region which runs only a small portion of the traffic, reduce the cost of the secondary region to be more proportional to the traffic it handles

Elasticsearch: high EC2 cost

  1. Given that we self-manage Elasticsearch on EC2 VMs, and that they account for 50% of our entire EC2 spend, optimize our ES Compute costs

NAT Gateway: high cost relative to budget

  1. Reduce our NAT Gateway costs, as it shouldn’t account for 2% of our entire cloud bill

We took those targets and initiatives and launched dozens of projects.

Each and every one of the initiatives listed here has made a real, meaningful difference in our spend.

We hope you can use our 2 years of learnings to make a difference in your cloud budget as well.

S3

Reducing S3 storage costs

Forter stores A LOT of data in S3 - double digit petabytes. This is pricey! As we were starting out, the general (correct!) sentiment about S3 storage cost was that it was “cheap enough to ignore”. Eventually, our success as a company meant that this mindset had to change, and fast, as over 30% of our cloud bill was spent on S3.

To combat climbing S3 costs, we adopted the following principles which we applied to all of our S3 buckets:

  1. Delete data we don’t need
  2. Compress everything
  3. Use the correct Storage Class for each use case

Let’s dive deeper into each one.

Delete data we don’t need, and apply continuous retention

To delete unnecessary data we first needed to understand:

● What is stored in each of our S3 buckets?

● Who owns the data?

● Is the data in use?

The main tool we used for this research was the S3 Storage Lens dashboards. The first thing to check is how much data you have in each bucket.

Lifecycle Policies and Advanced Metrics

Once you’ve identified the largest buckets, you can drill down to each one and find out:

● How much data is there in non-current versions of objects? (%noncurrent version bytes is the metric to look for)

○ In our case, we saw that many buckets had versioning enabled, but didn’t have lifecycle policies which would delete old versions. This meant that data slated for deletion remained in the bucket forever.

○ Setting up a Lifecycle Policy to delete noncurrent versions was a breeze and saved us >80% of the cost of some buckets (a minimal sketch of such a rule follows this list).

● Which prefix is using the most storage?

○ Some of our buckets were being used by multiple teams, so in order to know which prefixes store the most data, we enabled Storage Lens Advanced Metrics.

○ This is a paid feature, so you should only turn it on for a few days at a time.

○ Even a few days were enough to point out, for example, that a lot of data was going into a /tmp prefix, which wasn’t being deleted.
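As mentioned above, setting up such a rule is short enough to show in full. Here is a minimal boto3 sketch that expires noncurrent object versions 30 days after they are superseded; the bucket name and retention period are placeholders, and in practice we manage these rules through Infrastructure as Code rather than one-off scripts.

```python
import boto3

s3 = boto3.client("s3")

# Expire noncurrent (old) versions 30 days after they are superseded,
# and clean up leftovers from incomplete multipart uploads as well.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-versioned-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```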

Identifying data that can be deleted

After achieving quick wins by deleting old versions and temporary data, we needed to: 

● Identify data that isn’t being used and delete it (safely!) 

● Apply proper retention policies to the remaining data

We prepared a good old Google Sheets spreadsheet which listed all our buckets, and asked all the teams to review it. Everyone had to write:

● Which of the buckets they are using

● How long they need the data for

● Whether they are the owners of the data (i.e., the ones producing it)

Getting everyone to review the spreadsheet was not easy (people do hate spreadsheets), but the exorbitantly high spend, coupled with persistent nagging, provided the necessary motivation.

After going over the spreadsheet, we discovered many buckets that no one was using (they were created a long time ago and forgotten) which we simply deleted.

Additionally, we saw that most of our data was being stored for extended periods (i.e., forever), even though no one actually needed the older data.

Of course, we couldn’t just go ahead and delete a bucket just because someone wrote “OK to delete!” on a spreadsheet.

Don't be this guy.

Safely deleting S3 data

We used a few methods to identify safe to delete data:

● S3 Access Logs: with a few clicks in the S3 and Athena consoles, we set up a table that contained usage data for our S3 objects. It was then easy to write SQL queries to find, for example, the IAM users that were accessing old data.

○ We could then see when someone said they only needed 1 year of data, but were actually using 3 year old data, and ask them about it.

● Remove permissions before deleting (AKA the Scream Test): deletion is final and irreversible, so we first removed access permissions to buckets and waited a while to see if someone complained. If no one complained after a week (or a month, depending on your risk tolerance), we went ahead and deleted the data with an S3 Lifecycle Policy.

After completing the above steps, we were in a much better place, but we recognized that data retention is an ongoing challenge. We saw that we needed to shift our mindset. Defining retention periods should be the default for all new data, and retaining data long-term must be justified by clear business reasons. As a result,

Data retention is now a required topic in any Design Review at Forter that involves adding new data.

Compress everything

Let’s start with a bang. Given what we’ve learned, we firmly believe that

There’s no good reason to write uncompressed data to S3

Modern compression algorithms, like zstd, are incredible. They can reduce your storage costs by a factor of 5 or even 10. The effect on CPU utilization for compression and decompression is usually negligible, and most data frameworks natively support it. Seriously, compress everything by default.

Unfortunately, past “us” didn’t heed this warning, so we ended up paying for our mistakes. One of our biggest buckets had multiple petabytes of uncompressed data, across billions of objects. It was also, sadly, used by many different readers. This meant that not only did we need a heavy-duty ETL to compress all our data, we first needed to modify all the readers to be able to work with compressed data. Don’t be past us.

The full story of how we went about this project can probably fill a blog post of its own, but here’s the main gist of it:

  1. We mapped out all readers of a given bucket. And yes, that meant another spreadsheet, and another rummage through S3 Access Logs with Athena.
  2. We built a cloud storage abstraction library, implemented in Java, Python and Javascript (the main languages used at Forter), to which we added compression support. The library sets the Content-Encoding header on each S3 object it writes to denote the compression used (we only needed to support zstd and gzip). Upon reading an object, if the content encoding indicates it is compressed, the library decompresses the object before handing it back to the application. A minimal sketch of this read/write path appears below.
  3. We migrated all readers to use the new library. This was a long and difficult process. Not because the code changes were difficult, they were quite simple, but having to modify hundreds of applications and offline processes, owned by many different teams proved to be extremely arduous.
  4. We validated readers by writing a few compressed objects, and making sure all readers could properly read them, and had no errors.
  5. At this point we began writing all the new data with compression.
  6. Finally, we built an ETL to compress all historical data. This process used S3 Inventory to produce batches of objects to be compressed, which were pushed to an SQS queue. A dedicated service read these batches from SQS and applied compression to the data. Note that we couldn’t parallelize as much as we wanted, since S3 has a rate limit we had to carefully adhere to, or we would risk hurting production reads & writes.

Batch Compression ETL diagram
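To make the library idea from step 2 concrete, here is a heavily simplified sketch of the write/read path using boto3 and the zstandard package. Our real library also handles gzip, streaming, retries and three languages; the function names here are illustrative only.

```python
import boto3
import zstandard as zstd

s3 = boto3.client("s3")

def put_compressed(bucket: str, key: str, data: bytes) -> None:
    """Compress with zstd and record the codec in the Content-Encoding header."""
    compressed = zstd.ZstdCompressor().compress(data)
    s3.put_object(Bucket=bucket, Key=key, Body=compressed, ContentEncoding="zstd")

def get_maybe_compressed(bucket: str, key: str) -> bytes:
    """Read an object, transparently decompressing based on Content-Encoding.
    Uncompressed legacy objects are returned as-is."""
    response = s3.get_object(Bucket=bucket, Key=key)
    body = response["Body"].read()
    if response.get("ContentEncoding") == "zstd":
        return zstd.ZstdDecompressor().decompress(body)
    return body
```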

Utilize S3 Storage Classes for Tiering

At this point we had only the data we needed in S3, and we were keeping it nice and compressed. But we were still using the most expensive S3 storage class—S3 Standard—for everything.

For each object you write to S3 you can choose a Storage Class. S3 Standard is the default, which is why everybody, including us, was using it. But there are other, cheaper options. The gist of it is this:

You should be paying less for data that is rarely used (usually old data)

In most systems we've encountered, only 1% of the data is accessed frequently. Typically, this is the most recent data. Older data tends to remain dormant, incurring unnecessary costs. Therefore, we needed to design our data storage so that data with lower business value incurs lower costs.

Here’s a cheat sheet for how to utilize S3’s Storage Classes:

Standard

● Storage cost: highest. API cost: low. Retrieval cost: none.

● When should you use it? Only when you have very small files (<128 KB), as the other storage classes don’t support those.

Intelligent Tiering

● Storage cost: varies. Newly “touched” data costs the same as Standard, while data you don’t touch becomes cheaper over time. API cost: low. Retrieval cost: none.

● When should you use it? Make this your default. In most cases, there is very little reason not to use Intelligent Tiering by default. The management cost is trivial, unless you have billions of objects. Intelligent Tiering is an excellent option because it automatically optimizes for “paying less for old data”, and has no surprises in API or retrieval costs.

Standard-IA

● Storage cost: medium. API cost: medium. Retrieval cost: medium.

● When should you use it? Only when Intelligent Tiering isn’t working out well for you. This can happen when objects are read infrequently, but still often enough to keep most of them in the “frequent access” tier, which costs the same as Standard. Unlike Intelligent Tiering, IA has a retrieval cost per GB, so beware of using it with buckets read by Spark or Athena: they can download massive amounts of data that end up costing thousands of dollars in retrieval fees.

Glacier Instant Retrieval

● Storage cost: low. API cost: high. Retrieval cost: high.

● When should you use it? Glacier Instant Retrieval is like IA, only with cheaper storage and higher API & retrieval costs. Here as well, use it only if Intelligent Tiering is not working out. The same caveats about surprising retrieval costs apply.

Amazon S3 Glacier Deep Archive

● Storage cost: very low. API cost: high. Retrieval cost: medium.

● When should you use it? Mainly for backups. Retrievals are expensive and slow, but the storage is extremely cheap. It can be handy for data you are considering deleting, but would like to wait six months or a year before feeling safe enough to do so. Note that with this storage class you can’t just call GetObject and retrieve objects; you first need to initiate a “restore” operation. This means that, unlike the other classes, your applications might fail to work with it without a manual restoration step.

Intelligent Tiering

The main takeaway from the above cheat sheet for you should be:

Intelligent Tiering is great. Study it and use it often.


The three Intelligent Tiering tiers are equivalent to similarly named S3 Storage classes:

● Frequent Access costs like Standard

● Infrequent Access costs like Standard-IA

● Archive Instant Access costs like Glacier Instant Retrieval

The great thing about Intelligent Tiering is that you don’t have to fully understand the access patterns of your data in order to achieve a meaningful storage cost reduction. You simply flip a lifecycle policy, and it’s on. You can even enable it on buckets you don’t own without risking the wrath of their users: API costs remain the same, and there are no surprising retrieval costs.
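Flipping that switch can be as simple as a single lifecycle rule that transitions objects into Intelligent-Tiering. A minimal boto3 sketch (the bucket name is a placeholder; note that lifecycle transitions skip very small objects):

```python
import boto3

s3 = boto3.client("s3")

# Transition every object to Intelligent-Tiering as soon as possible (Days=0).
# Existing objects are transitioned too, not just newly written ones.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "transition-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```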

That being said, it does have limitations. Without understanding your access patterns, you might find that data seems to return to the Frequent Access storage tier all the time. For your most costly buckets, understanding the access patterns to the data is worth it.

Here are a couple of approaches we’ve used successfully:

● Enabling storage class analysis will show you how often old data is accessed, though it won’t show you who/what is using it.

● To drill down deeper, enable S3 access logs and query them using Athena [1]. From there you’ll be able to identify the IAM user that is accessing the old data, like an old ETL left running unused for years (just an example, *cough*).

Reducing S3 API costs

S3 API costs are often neglected, as people focus on storage costs. For us, though, API costs ran exceedingly high, and digging into Cost Explorer gave us a clear understanding of the problem.

The PutObject S3 API Operation was our biggest cost driver (we grouped on API Operation in the Cost Explorer to find this out). In some places we were using S3 as a KV store—writing and reading small objects at a high rate, without paying attention to the API costs.
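If you prefer code over the console, the same breakdown is available through the Cost Explorer API. Here’s a rough boto3 sketch of grouping one month of S3 spend by API operation; the dates are placeholders.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Break down one month of S3 spend by API operation (PutObject, GetObject, ...)
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-10-01", "End": "2024-11-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "SERVICE",
            "Values": ["Amazon Simple Storage Service"],
        }
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "OPERATION"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    operation = group["Keys"][0]
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{operation}: ${cost:,.2f}")
```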

Unfortunately, there was no one-size-fits-all solution for this problem, and we had to go case by case to find solutions that avoided high API costs.

Here are some of the solutions we implemented that worked well for us:

● Use a different data store, such as a classic KV store (e.g. Aerospike), or even RDS in some cases.

  When is it applicable? Where we had low storage costs (very small objects or a very short retention period) but high ingestion cost. S3 has by far the cheapest storage, but if you’re paying much more for ingestion (API calls) than for storage, consider a different data store.

● Use file formats like Parquet that batch multiple objects in the same file.

  When is it applicable? Parquet is great when you use tools like Spark or Athena to run analytical queries over your data. It can also be OK for fetching specific rows by ID if you can tolerate slow query times. If you need O(1 second) fetch-by-ID performance, though, you’ll need to look elsewhere.

● Custom compaction: hold multiple objects in the same file and fetch specific ranges via a byte-range fetch.

  When is it applicable? This is a more complicated option, but it reduces ingestion cost while keeping fast get-by-ID performance. In our case, we store metadata on every object in a dedicated KV store (Aerospike). To fetch a specific object from S3, we first look up the relevant file key and byte range in Aerospike, and then fetch just that byte range from the file in S3. A Data Access Layer is provided via a REST API to abstract these details from the clients. Obviously, this makes updates more difficult: on update you might need to rewrite a much bigger file just to change a part of it. Alternatively, you can write the object to a new file, change the key+offset in the KV store, and leave the old file in the bucket; later, a separate garbage collection process removes old versions of the data. A sketch of the byte-range read path follows this list.
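To illustrate the byte-range fetch in that last option, here is a minimal sketch of the read path. The metadata lookup (Aerospike in our case) is stubbed out, and all names are illustrative rather than our actual Data Access Layer.

```python
import boto3

s3 = boto3.client("s3")

def lookup_location(object_id: str) -> tuple[str, str, int, int]:
    """Stub for the metadata lookup (a KV store such as Aerospike).
    Returns (bucket, key of the packed file, start offset, end offset)."""
    raise NotImplementedError

def fetch_object(object_id: str) -> bytes:
    """Fetch a single logical object out of a larger packed file
    using an S3 byte-range GET."""
    bucket, key, start, end = lookup_location(object_id)
    response = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={start}-{end}",  # inclusive byte range, per HTTP semantics
    )
    return response["Body"].read()
```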

Your mileage might vary, but

Ignore S3 API costs at your own peril

And with that we’re done with S3! Let’s move on to EC2 costs.

EC2

Whether your applications run on Kubernetes or directly on VMs, EC2 is likely to dominate your AWS bill. It was certainly the case for us.

To battle this, we adopted the following guiding principles:

  1. Delete unused VMs, and scale-in/down underutilized systems
  2. Reach >90% coverage with Savings Plans/RIs. Aim for 10%, at most, of Compute running On Demand.
  3. Use Spot instances where possible
  4. Migrate to Graviton instances

Delete and scale-in

Forter has existed since 2013, and has a sizable engineering department. This means that:

● We almost certainly had workloads that were launched long ago and were no longer being used. Developers are only human, and sometimes we forget to delete things.

● We create new products and add new workloads all the time, but we don’t constantly adjust the instance sizes and counts to fit the actual traffic. It is very common for systems to be over-provisioned.

A good way to find low-utilization workloads is built into AWS: in Trusted Advisor you can easily find low-utilization EC2 instances. In our case, for example, it surfaced nodes of an Elasticsearch cluster that appeared underutilized.

A deeper look with the application team showed that the cluster could indeed be scaled in, with significant cost savings.

In other cases we found temporary workloads that were launched and then forgotten. We found multiple deployments of the same system, left around by accident. We found instances that could be downsized from 8xlarge to xlarge, and on and on, until we got most of the unused instances and scaled most of the workloads to fit their actual usage.

EC2 Commitment Strategy

Just like with S3 Storage Classes, EC2 instances also have different purchasing options that affect their costs.

Unless yours is an early-stage startup, you should NOT be paying On Demand prices for Compute

Here’s a handy breakdown of each purchase option and when to use it:

On Demand

● What it means: the default and the priciest option. If you don’t buy any commitment in advance, this is what you’ll be using.

● Flexibility: not applicable, as there is no commitment to manage.

● When should I use it? As little as possible. Keeping at most 10% of your Compute on On Demand will give you some flexibility while not breaking the bank.

Compute Savings Plan (CSP)

● What it means: a commitment to spend $X per hour on EC2, for 1 or 3 years, paid upfront or in monthly payments. In return for the commitment you get a significant discount. For example, if you buy a $10/hour 3-year no-upfront CSP, you will get an average discount of around 50% compared to On Demand (depending on the instance type and region). Effectively this means that you can use $20 worth of EC2 instances, but pay only $10. Since this is a commitment, if you fail to use the full $10, you pay the shortfall to AWS: if you only used $8 worth of compute in a given hour, you will still pay the full $10.

● Flexibility: instance size, family and region. Can I get rid of it? No.

● When should I use it? CSPs are great if your EC2 compute usage is fairly stable. It’s a commitment you can’t back out of, so it might not be a great fit for early-stage startups. The average company, though, should be able to commit to some base EC2 usage. In our case, we are aiming for 60-70% CSP coverage, as Savings Plans are our favorite type of commitment. We like CSPs because they are the most flexible option: you only commit to a dollar spend, as opposed to specific instance families or regions. This way, we never block our engineers from switching from Intel to ARM, from EBS storage to local NVMe, or even from moving workloads to a different region.

Standard Reserved Instances (SRIs)

● What it means: a commitment to use N instances of a certain instance type. For example, you commit to using 4 m6g.4xlarge instances for 1 or 3 years. There is size flexibility, so you can use 8 m6g.2xlarge or 2 m6g.8xlarge and still consume the commitment; however, this is the only flexibility you get. SRIs can be purchased directly from AWS, or via the RI Marketplace, where you can sometimes find RIs with a shorter term.

● Flexibility: instance size only. Can I get rid of it? Probably not: EDP (Enterprise Discount Program) customers cannot sell SRIs on the marketplace, so if you are a big AWS customer, you can’t.

● When should I use it? You probably shouldn’t. While SRIs give the highest discounts, the fact that they don’t allow you to switch instance families makes them a bad option: you can find yourself “stuck” with m6g when you really want to move to r7g. For this reason, only non-EDP customers should consider SRIs, as they might be able to sell unused RIs on the marketplace. Still, if no one buys them for a long time, you lose money.

Convertible Reserved Instances (cRIs)

● What it means: a commitment to use N instances of a certain instance type. For example, you commit to using 5 m6g.4xlarge instances for 1 or 3 years. Unlike SRIs, cRIs allow you to change instance families. This requires a manual conversion operation in the AWS console (or via the EC2 API), which makes it a more complicated option compared to CSPs. In addition, unlike CSPs, cRIs can’t move between AWS regions.

● Flexibility: instance size and family. Can I get rid of it? No, but you can “stretch” it.

● When should I use it? In most senses, cRIs are a worse version of CSPs: they offer similar discounts, with no regional flexibility and the need to work with the AWS API. However, cRIs do have one important trick. By merging a short-term cRI with a long-term cRI, you can lengthen the term of your commitment (for example from 1 year to 3 years), and thereby reduce your overall monthly commitment. This can give you some flexibility in case a workload shrinks.

EC2 Instance Savings Plan

● What it means: like a CSP but with less flexibility, as it is locked to an instance family and region. Or, in other words, it’s like an SRI, but instead of committing to a number of instances, you are committing to a $/hour spend. Unlike a standard RI, you can’t sell this commitment even if you’re not an EDP customer.

● Flexibility: instance size only. Can I get rid of it? No.

● When should I use it? In our opinion, never. Yes, they offer a higher discount than regular CSPs, but the reduced flexibility is not worth it. Not being able to change instance families for a 3-year period is not healthy. And, unlike SRIs, the option to sell excess capacity on the marketplace is not available even if you are a small AWS user without an EDP.

Given all of the above, here’s what’s working well for us:

  1. Compute Savings Plans: 70% of our EC2 workload. This gives us the most flexibility, including across AWS regions.
  2. Convertible RIs: 20% of our EC2 workload. Convertible RIs are useful because of the ability to “stretch” the commitment. However, you will either need to build automations to do the instance conversion, or use a vendor that does it for you.
  3. On demand + spot instances: 10% of our EC2 workload, as there’s always bound to be some volatility in the EC2 spend. We’ll discuss it next.

Spot Instances

Traffic patterns mean that some hours require more servers than others. Some workloads are periodical by nature, and only run once a day or a week. This means that it’s very hard to cover all of your EC2 workloads with commitments such as CSPs or cRIs.

When to use Spot Instances

The AWS Savings Plans recommendation page (you should use it; it’s very handy) charts your On Demand usage over time. Its spiky nature makes it hard to cover with commitments. However, this doesn’t mean we have to pay the full price for those spiky hours. This is where Spot Instances shine.

Spot Instances + Spiky workloads = 💗

Spot Instances are spare EC2 capacity that can be bought for a much lower price. In the past, average spot prices were 80% cheaper than on demand. These days, expect 50-60% discounts, which makes spot prices only slightly cheaper than CSPs. Also note that Spot Instances can be grabbed from under your feet at any given moment, when someone else is launching on-demand instances and AWS needs the capacity back.

Therefore, we recommend using Spot Instances when:

● Your application can deal with interruptions. Fault tolerant services running behind a queue and offline Spark jobs are good examples.

● You are currently paying On Demand prices. Since the discounts on Spot Instances are close to the discounts you get on 3-year commitments, we strongly recommend using CSPs/cRIs over Spot Instances for stable, non-spiky traffic.

Choosing Instance Types for Spot Instances

So our target for Spot Instances are spiky workloads that can handle interruptions well. Now, let’s talk about selecting instance types. Familiarize yourself with the Spot Instance Advisor.

In it you can search for an instance type in a region, and find two important pieces of information:

● The frequency of interruption—you want to choose Spot instances with fewer interruptions.

● The average savings—obviously you want the highest savings, but take these with a grain of salt, as the price also varies by AZ (Availability Zone).

A better place to check out the Spot prices is the Spot Instance Pricing History in the EC2 console.

Here you can see that the discounts vary greatly between AZs, and only the cheapest one seems to correspond to the Spot Instance Advisor numbers. Your company might not be using all of the AZs, so make sure to check the prices in the AZs you use to make decisions.

The above tools help us plan projects and understand if Spot Instances are a good fit for them. Once we decide to use Spot Instances, we always use Auto Scaling Groups (ASGs) with Mixed Instances Policy. This allows us to specify multiple instance types in the ASG, as well as a spot allocation strategy.

For example, if your workload is using r6i.xlarge, you can use all of the following instance types in your ASG: r6i.xlarge, r5.xlarge, r7i.xlarge, r5a.xlarge, r6a.xlarge, r7a.xlarge. Together with the price-capacity-optimized allocation strategy, AWS will pick the instance types with the fewest interruptions, sorted by the lowest price.
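Here is roughly what such a group looks like with boto3; the launch template, subnets and group name are placeholders, and this is a sketch rather than our production configuration.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Spiky, interruption-tolerant workload: several interchangeable instance
# types, all capacity on Spot, allocated by price and available capacity.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="spiky-workers",       # placeholder name
    MinSize=0,
    MaxSize=50,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # placeholder subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "spiky-workers-lt",  # placeholder template
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": t}
                for t in ["r6i.xlarge", "r5.xlarge", "r7i.xlarge",
                          "r5a.xlarge", "r6a.xlarge", "r7a.xlarge"]
            ],
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,  # everything above base on Spot
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    },
)
```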

If you are using Kubernetes, Karpenter seems like a great option to balance between Spot and On Demand instances. It gives you the ability to fall back to On Demand when Spot capacity is not available. We haven’t used it yet in Forter, but are strongly considering it.

Another advantage for Kubernetes is the ability to mix instance types of different sizes in an ASG. For example r6i.xlarge and r6i.2xlarge. Unlike in VMs, in Kubernetes your application doesn’t need to be aware of which instance size it is running on to utilize the resources. Extending the Spot Instance pool with more instance type sizes allows you to further reduce the risk of interruption.

Graviton (ARM)

ARM is taking the processor world by storm, and not only in our Mac laptops. AWS’s homegrown Graviton processors can be both cheaper and more performant for most use cases.

In 2025, you should default to ARM for Compute

There were a few use cases where we saw Intel performing better—an especially latency sensitive Elasticsearch cluster was one such example. But in the majority of cases our costs were reduced, and performance was either identical or slightly better than on Intel machines.

Migrating from x86 to ARM

Depending on your CI & deployment stack, this can be either difficult or really difficult. It definitely won’t be easy for any organization with a substantial computing workload.

In our case this meant:

● Adding support for ARM in our production deployment pipelines (Chef for VMs, EKS for Kubernetes)

● Adding support for building ARM images in CI

● Migrating each repo to build multi-arch docker images

● Adjusting installed libraries/packages to allow running on ARM

● Deploying and testing each service on Graviton instances

This has been a long journey, and we’re not done yet. If you want to learn more about our process, you can listen to this talk (Hebrew, sorry) by our own Tomer Gabel, who led this migration at Forter.

If you are a new business, we suggest supporting only Graviton processors from the start. It’s the best strategy these days, and migrating later is not fun.

Choosing a Graviton Instance type

This is not as easy as just choosing the latest generation, as the Graviton-versus-Intel discount shrinks as you move from Graviton 2 to Graviton 4:

● m6i.2xlarge (Intel): $276 per month (baseline)

● m6g.2xlarge (Graviton 2): $221 per month, 20% cheaper than Intel

● m7g.2xlarge (Graviton 3): $235 per month, 15% cheaper than Intel

● m8g.2xlarge (Graviton 4): $258 per month, 6.5% cheaper than Intel

*On Demand costs, based on us-east-1 prices

AWS does claim that the newer generation CPUs have better performance, meaning you need fewer instances for the same workloads, ensuring your total costs are still reduced. While there’s some truth to that, it’s more complex than that, as in some services the bottleneck is not the CPU, but rather the memory, network or disk IO.

All this makes the equation harder. We can’t just tell engineering teams to use the latest Graviton when migrating away from Intel. The generation should be chosen on a case by case basis, taking into account the following:

● Is CPU the bottleneck? If not, you might be better off with the cheaper Graviton 2 instance.

● Do you have capacity in your team to run some experiments and choose the best instance type (and count) for your use case?

○ If so, just try all the instance types and see what works best for Cost/Performance.

○ If not, start with the cheapest Graviton 2 instance.

As an example, we’ve had some success in moving from the Intel m family into the Graviton 4 r8g family, even though they have 50% fewer CPU cores (and, of course, cost much less).

Unfortunately, the days of always switching to the newest gen are over. You’ll have to spend time benchmarking to understand what works best for you.

Cost Effective Multi Region

When we started our cost optimization journey, we were already running the Forter platform in multiple AWS regions. The architecture we had chosen involved running our entire data pipeline in every region, on all incoming data. This meant we were paying N times for a big chunk of our data processing.

We had to make massive changes to how we manage multiple regions at Forter. You can watch Omri Keefe’s (Hebrew) session about this for more details, but here’s the gist of how things looked before the change:

Basically, we had a naive solution: we relied on S3 replication to replicate raw data, which we then processed and aggregated in all regions. We also stored all the real-time data in all regions, leading to high data processing and storage costs.

In the new iteration we moved to a hub and spoke architecture:

With this somewhat more complex architecture, the Spoke regions are only processing and storing data needed for their own region. The data is replicated with Kafka, and flows to the Hub Region’s data lake for analytical processing and ML training.

Elasticsearch

We've been using ES (Elasticsearch) since the early days at Forter, and it has become an integral part of our operations. We think it’s a great database for use cases with complex querying and analysis needs, and we use it to run giant workloads. Some of our Elasticsearch clusters had well over 100 nodes, costing us a lot of money.

Breaking down ES costs using Cost Explorer gave us the following insights:

● Our EBS storage cost in Elasticsearch clusters was skyrocketing

● The EC2 instances we used for Elasticsearch were by far the biggest cost driver

● Cross-Availability-Zone data transfer between Elasticsearch nodes within the same region was a very big cost driver

Moving from EBS to local NVMe storage

AWS offers EC2 instances with very fast local NVMe storage. That storage is ephemeral and doesn’t offer EBS capabilities such as detaching and reattaching to a different instance, but with ES you aren’t likely to be using those features anyway.

Let’s look at some numbers:

● EBS (gp3): $80 per TB per month

● EC2 i3en.3xlarge (On Demand): $130 per TB per month

● EC2 i3en.3xlarge (3-year no-upfront CSP): $65 per TB per month

● EC2 is4gen.2xlarge (On Demand): $110 per TB per month

● EC2 is4gen.2xlarge (3-year no-upfront CSP): $55 per TB per month

*Prices based on the us-east-1 region

As you can see in the comparison above, using local EC2 storage allows us to:

● Enjoy the 50% Compute Savings Plan discount on our storage as well. No such plan exists for EBS

● Get CPU and memory in addition to the storage

● Utilize really fast disks with great performance running Elasticsearch

This has worked so well for us, that we moved almost all of our ES clusters to local NVMe EC2 instances.

Optimizing Elasticsearch Storage Cost

Don’t index what you don’t need! Tuning your Elasticsearch mappings is critical if you want efficient indexes.

Do NOT use the default dynamic mappings provided by Elasticsearch for anything in Production

By default, Elasticsearch indexes everything, including nested fields. This can create a field explosion, which is wasteful in storage, as every string is indexed as both text and keyword.

Instead, understand the access patterns of each field in the index, and create a custom mapping that matches that.

Here are some of the things we do:

● For each string field we use either text or keyword, but not both. Full text search fields should be mapped as text, but in our case the majority of string fields were used only in exact matches and aggregations, so keyword was used.

● For keyword fields that weren’t used in aggregations or sorting, we disabled the doc-values feature to save a significant amount of storage. It turned out that some of our indexes weren’t doing aggregations or sorting at all, so we were able to disable doc-values on most fields.

● We used the Index Disk Usage API to find which fields were taking the most storage, and further optimized our mappings. We were able to completely disable indexing on some of our biggest fields, or reduce the number of terms we index for those fields. For example, we saw that some fields were holding the same data as sibling fields, and simple application-level changes enabled us to remove those fields.

Of course, changing the Elasticsearch mapping of an existing field requires reindexing the data. The Reindex API is your friend here, so get comfortable using it often.
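To make these points concrete, here is a sketch of an explicit mapping along the lines described above. The index and field names are made up, and the exact client syntax varies between elasticsearch-py versions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Explicit mapping: no dynamic text+keyword multi-fields, doc_values disabled
# for keywords we never aggregate or sort on, and indexing disabled where we
# only need the field returned in results.
es.indices.create(
    index="orders-v2",  # placeholder index name
    body={
        "mappings": {
            "dynamic": "strict",  # fail loudly on unmapped fields
            "properties": {
                "order_id": {"type": "keyword"},
                "status": {"type": "keyword", "doc_values": False},
                "description": {"type": "text"},  # full-text search only
                "raw_payload": {"type": "keyword", "index": False, "doc_values": False},
                "created_at": {"type": "date"},
            },
        }
    },
)
```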

Time-based Indexes and Data Tiering

The logic of paying less for old data in S3 applies for Elasticsearch as well. Back when we started our optimization project, we had an ES cluster with one BIG index, which held all data since the beginning of time.

Having one big index for time-based data is a bad idea, because:

● Retention becomes difficult, as you can’t delete entire indexes at a time. Instead, you’re forced to run heavy delete_by_query calls. This leaves your index with a lot of deleted documents, impacting performance.

● Query performance is bad, as shards contain both new and old data, and you have to hit all shards to query very recent data (which is usually the most common operation).

● Data tiering is impossible, as you can’t push older indexes to cheaper data tiers.

We chose to opt for daily indexes in most cases.

This change opened many avenues for improvements:

● Queries hit only the shards they need to (within the relevant time frame), providing much better latencies in our systems.

● We now use only the cheapest SSD Graviton instances (e.g. the is4gen instance family) for storing older data that is queried much less often, which helped us slash one of our clusters’ costs by almost 50%.

Obviously, these are difficult changes to perform on a live production environment. The good news is that with ILM (Index Lifecycle Management) all these features are baked into ES. So while your application will probably need to change, and touching production is always a careful dance, the implementation of this system is fairly straightforward.
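As a rough sketch of what this looks like, here is an ILM policy that rolls indexes over daily, moves them to cheaper “warm” nodes after a few days, and deletes them at the end of the retention period. The phase timings, node attribute and policy name are placeholders, and we use a plain HTTP call to avoid tying the example to a specific client version.

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder endpoint

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over to a new index daily (or earlier if it grows large)
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "warm": {
                "min_age": "3d",
                "actions": {
                    # Move shards to the cheaper storage nodes after 3 days
                    "allocate": {"require": {"data": "warm"}}
                },
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},  # retention: drop the whole index
            },
        }
    }
}

response = requests.put(f"{ES_URL}/_ilm/policy/daily-events-policy", json=policy)
response.raise_for_status()
```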

Cross-AZ Data Transfer

Highly available Elasticsearch clusters need to run in multiple Availability Zones. Moving data between AZs costs $0.01 per GB, and for a large ES cluster these costs can accumulate rapidly. We have a full blog post dedicated to this issue alone, where we debugged why one of our clusters was moving 3PB (!) of data every month between AZs. Remember,

Cross AZ Data Transfer is an AWS Revenue Generator

Don’t let them just print money, OK? Start by enabling transport compression (the transport.compress setting), which compresses the data that passes between ES nodes.

Next, add AZ awareness to your stack. It is likely that both your application and your Elasticsearch cluster run on multiple AZs. You want to ensure that application instances that run in an AZ, connect to the ES nodes running in the same AZ.


While seemingly straightforward, it is not the default. This will require carefully configuring your Elasticsearch client, so that it’s aware of which AZ the client is running on, and prefers data nodes in the same AZ. In Python, for example, this is done by configuring a node selection class in the client.
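There is more than one way to wire this up. One simple approach, sketched below, is to discover the client’s AZ from EC2 instance metadata and hand the Elasticsearch client only the data nodes in that AZ, falling back to the full node list when the AZ is unknown. This illustrates the idea rather than our exact implementation, and the hostnames are placeholders.

```python
import requests
from elasticsearch import Elasticsearch

# ES data nodes per Availability Zone (placeholder hostnames)
ES_NODES_BY_AZ = {
    "us-east-1a": ["http://es-data-1a-1:9200", "http://es-data-1a-2:9200"],
    "us-east-1b": ["http://es-data-1b-1:9200", "http://es-data-1b-2:9200"],
}

def current_availability_zone() -> str:
    """Ask EC2 instance metadata (IMDSv2) which AZ this client runs in."""
    token = requests.put(
        "http://169.254.169.254/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
        timeout=1,
    ).text
    return requests.get(
        "http://169.254.169.254/latest/meta-data/placement/availability-zone",
        headers={"X-aws-ec2-metadata-token": token},
        timeout=1,
    ).text

def build_az_aware_client() -> Elasticsearch:
    az = current_availability_zone()
    all_nodes = [node for nodes in ES_NODES_BY_AZ.values() for node in nodes]
    # Prefer same-AZ data nodes; fall back to everything if the AZ is unknown.
    hosts = ES_NODES_BY_AZ.get(az, all_nodes)
    return Elasticsearch(hosts)
```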

AWS NAT Gateway

Using AWS NAT Gateway is a common way to allow your applications to talk to the internet in a safe way. The price, though, is prohibitively expensive at $45 per TB. You might be thinking you’re not talking to the internet that much anyway, but think again, because

By default, sending and receiving data from AWS managed services incur NAT Gateway costs

Are you using S3, Kinesis, SQS or ECR? If so, then by default, traffic to these services leaves your VPC through the NAT Gateway, and you pay the $45/TB charge for it. This can add up quickly; for us it reached over 2% of our cloud bill!

The solution is fairly simple—set up a VPC Endpoint for each of your heavily used AWS services. The price for using those is $4-$10/TB, so at least 4.5x cheaper than the NAT Gateway. VPC Endpoints are available for the most popular AWS services, like S3, SQS, ECR, Kinesis, and others.
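Creating an endpoint is a one-liner in most IaC tools; here is the boto3 equivalent, with placeholder VPC, route table and subnet IDs. Gateway endpoints (S3, DynamoDB) carry no extra charge, while interface endpoints for other services carry the per-GB pricing mentioned above.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint: S3 traffic from this VPC stops traversing the NAT Gateway.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder route table
)

# Interface endpoint example for SQS (billed hourly plus per GB processed).
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.sqs",
    VpcEndpointType="Interface",
    SubnetIds=["subnet-0123456789abcdef0"],   # placeholder subnet
    PrivateDnsEnabled=True,
)
```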

If you’re still seeing very high NAT Gateway costs, it’s time to roll up your sleeves and enable VPC Flow Logs. You can then query the flow logs with Athena [2], and discover the IPs that transfer the most data outside of your VPC.

This kind of investigation allowed us to track down IPs that were sending or receiving huge amounts of data to/from the internet, and either make them stop or create a VPC endpoint where applicable.

And with that, finally, we’re done!

Conclusion

All of the changes discussed above require consistently doing the right thing. This meant we had to make a deep cultural change, which we discuss here. Do read this part to understand how to make the changes stick.

All in all, this was a long and difficult journey, but the results we got were incredible:

We saw an 80% improvement to our cloud costs, saving millions of dollars, while drastically improving the company’s gross margin and valuation.

If you’d like to have an exceptional return on your investment, generating business-level impact in your company, consider following our advice. Continue reading the final part of this series, to learn how to create a deep cultural shift in your organization, and make these changes stick.

And hey, if you’d like to work with us as we continue this journey, do drop us your CV at our careers page 🙂


[1] Here is an example query that produces a usage report for a specific bucket:

 
SELECT requestdatetime, requester
FROM my_bucket
WHERE operation = 'REST.GET.OBJECT' AND key LIKE '2022-03%'
ORDER BY requestdatetime DESC
LIMIT 500
 

Using a query like this we discovered that a security scanner was reading old files in the bucket, moving them from the colder tiers to the most expensive one!

[2] Here’s an example query I ran -


SELECT destinationaddress, SUM(numbytes) AS totalbytes
FROM "default".vpc_flow_logs2
WHERE date = date('2024-11-18')
GROUP BY destinationaddress
ORDER BY totalbytes DESC
 
