🚀 Executive Summary
TL;DR: AWS EBS, while a convenient default, can cause performance bottlenecks and budget issues due to its network-attached nature and `gp2` volume limitations. Optimizing involves migrating to `gp3`, utilizing high-performance ephemeral Instance Store for specific use cases, or exploring third-party SANs for extreme demands.
🎯 Key Takeaways
- EBS is network-attached, introducing network latency and potential throttling based on the EC2 instance’s maximum bandwidth, unlike physically attached local storage.
- Migrating from `gp2` to `gp3` EBS volumes is crucial as `gp3` decouples IOPS from storage size, offering better performance and cost efficiency with a baseline of 3,000 IOPS for free.
- Instance Store (NVMe) provides extremely low-latency, high-IOPS performance by being physically attached to the host, but its ephemeral nature means all data is lost on instance stop/termination.
- For workloads exceeding EBS throughput limits or requiring advanced SAN features, third-party solutions like NetApp Cloud Volumes ONTAP or Silk can aggregate underlying cloud storage to break past single-volume AWS ceilings.
Quick Summary: AWS EBS is the default for a reason, but blindly sticking to it can throttle your database and drain your budget—here’s when to optimize, when to switch to local NVMe, and when to look outside the Amazon ecosystem entirely.
Is EBS the Best Block Storage? Or Are We Just Lazy?
I remember the specific moment I stopped trusting “defaults” in the cloud. It was 2018, Black Friday weekend. I was staring at a Datadog dashboard for prod-mongo-primary-01, watching the disk queue length climb vertically while the IOPS flatlined.
We were running on standard gp2 volumes. We thought we were safe because we had allocated 500GB, which gave us a decent IOPS baseline. But we hit the burst balance limit. The database crawled to a halt, the application started timing out, and I was sweating through my hoodie trying to provision a larger volume just to raise the IOPS baseline (gp2 allocates 3 IOPS per GB), then waiting for the AWS optimization phase to complete. It took six hours to stabilize.
That incident taught me a painful lesson: EBS is convenient, but it is not magic. It’s a network-attached drive sitting down the hall from your server, and if you treat it like a local SSD, you’re going to get burned.
The “Why”: Physics and The Network Tax
Here is the root of the problem that juniors often miss: EBS is not physically attached to your EC2 instance.
When you write a block of data to /dev/xvdf, that data traverses the AWS network fabric to land on a storage server rack somewhere else in the Availability Zone. This introduces two unavoidable bottlenecks:
- Network Latency: It will never be as fast as a drive connected to the PCIe bus.
- The “Noisy Neighbor” & Throttling: You are limited not just by the drive’s specs, but by the EC2 instance’s maximum EBS bandwidth. I’ve seen massive `io2` volumes throttled because someone attached them to a t3.medium.
We use EBS because it persists after a reboot and it’s easy to snapshot. But is it the best? Not always. Here are three ways I handle block storage when “default” isn’t cutting it.
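Before blaming the volume, check the instance-side ceiling. The sketch below is a hypothetical helper (name and output shape are mine, not from the article) that assumes the AWS CLI v2 is installed and credentials are configured; it queries the published EBS-optimized limits for an instance type.

```shell
# Hypothetical helper: show the EBS throughput/IOPS ceiling for an instance type.
# Assumes AWS CLI v2 is installed and credentials are configured.
ebs_limits() {
  aws ec2 describe-instance-types \
    --instance-types "$1" \
    --query "InstanceTypes[0].EbsInfo.EbsOptimizedInfo.{MaxMiBps:MaximumThroughputInMBps,MaxIOPS:MaximumIops}" \
    --output table
}

# Usage: ebs_limits t3.medium
# If the instance's maximum is lower than the volume's provisioned specs,
# the volume is not the bottleneck -- the instance is.
```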
Solution 1: The Quick Fix (Stop Using gp2)
If you are still launching EC2 instances with gp2 volumes because that’s what your Terraform modules were written with three years ago, stop. Right now.
gp3 is the modern standard. It decouples storage size from performance. On gp2, if you needed more IOPS, you had to pay for more storage capacity you didn’t need. On gp3, you get a baseline of 3,000 IOPS for free, and it is roughly 20% cheaper per GB. Migrating is usually non-disruptive, but check your limits first.
Pro Tip: Don’t trust the console’s “Modify Volume” blindly on critical production DBs. While it says “no downtime,” heavy I/O loads during the optimization phase can increase latency enough to trigger application timeouts.
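If you do modify a volume on a busy database, you can at least watch the optimization phase instead of guessing. This is a hypothetical wrapper (function name is mine) around the real `describe-volumes-modifications` API, assuming the AWS CLI is configured:

```shell
# Hypothetical helper: check the state of an in-flight volume modification.
# Pass a volume ID, e.g. watch_modification vol-0abc123...
watch_modification() {
  aws ec2 describe-volumes-modifications \
    --volume-ids "$1" \
    --query "VolumesModifications[0].{State:ModificationState,Progress:Progress}" \
    --output table
}
```

Schedule the change for a low-traffic window and hold off on further changes until the state reports `completed`.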
Here is a quick CLI check to find your dinosaur volumes:
```shell
# Find all gp2 volumes that should probably be gp3
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query "Volumes[*].{ID:VolumeId,Size:Size,Type:VolumeType}" \
  --output table
```
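Once you have the list, the actual conversion is a single in-place call. A minimal sketch (the function name is mine; the volume ID is a placeholder you supply), assuming the AWS CLI is configured:

```shell
# Hypothetical one-liner: convert a volume to gp3 in place.
# Size is unchanged; you inherit the 3,000 IOPS / 125 MiB/s gp3 baseline.
migrate_to_gp3() {
  aws ec2 modify-volume \
    --volume-id "$1" \
    --volume-type gp3
}

# Usage: migrate_to_gp3 vol-0abc123def456
```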
Solution 2: The High-Performance Fix (Instance Store)
This is for when you need raw speed and you aren’t afraid of a little danger. This is the “War Story” fix I wish I had used for that MongoDB cluster.
Instance Store (NVMe) is physically attached to the host server. It does not go over the network. The latency is practically zero. The IOPS are insane (millions, in some cases).
The Catch: It is ephemeral. If you stop/start the instance, or if the underlying hardware fails, the data is gone forever.
I use this for:
- Worker nodes (Kubernetes)
- Redis/Memcached clusters
- NoSQL databases IF AND ONLY IF the replication strategy handles node loss gracefully (e.g., Cassandra or Mongo with ample replicas).
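If you want to see the latency gap for yourself before committing a workload, a quick synthetic random-read test works. This is a rough sketch using `fio` (not from the original article; the mount path is whatever you pass in), to be run once against an EBS mount and once against an instance-store mount:

```shell
# Hypothetical benchmark: 4k random reads against a given directory.
# Run against /mnt/ebs_disk and /mnt/fast_disk and compare the latency lines.
disk_latency_test() {
  command -v fio >/dev/null || { echo "fio is not installed" >&2; return 1; }
  fio --name=randread --directory="$1" --rw=randread --bs=4k \
      --size=256m --runtime=30 --time_based --ioengine=libaio \
      --direct=1 --group_reporting
}
```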
Mounting these requires a bit of initialization script magic because they present as raw devices on boot:
```shell
# A crude but effective snippet for user-data.
# WARNING: mkfs wipes the disk. Check for a filesystem, not the mount point:
# the directory lives on the root volume and survives a stop/start,
# but the instance-store data does not.
if ! blkid /dev/nvme1n1 >/dev/null 2>&1; then
  mkfs.xfs /dev/nvme1n1
fi
mkdir -p /mnt/fast_disk
mountpoint -q /mnt/fast_disk || mount /dev/nvme1n1 /mnt/fast_disk
grep -q "/mnt/fast_disk" /etc/fstab || \
  echo "/dev/nvme1n1 /mnt/fast_disk xfs defaults,nofail 0 0" >> /etc/fstab
```
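One more wrinkle: the device name is not guaranteed to be `/dev/nvme1n1` on every instance type. A safer sketch (assumption: on EC2, instance-store NVMe devices report a model string containing "Instance Storage", which EBS volumes do not) discovers them by model instead of hard-coding the path:

```shell
# Hypothetical discovery helper: list instance-store NVMe devices by model string
# instead of assuming /dev/nvme1n1.
find_instance_store() {
  lsblk -dno NAME,MODEL | grep -i "Instance Storage" | awk '{print "/dev/" $1}'
}
```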
Solution 3: The “Nuclear” Option (Third-Party SANs)
Sometimes, even io2 Block Express isn’t enough, or the price tag ($0.125/GB + IOPS costs) makes your CFO scream at you. AWS EBS has strict ceilings on throughput per volume.
If you are lifting and shifting a massive legacy Oracle or SQL Server beast, or running high-performance computing (HPC) workloads, you might need to look at third-party solutions available on the Marketplace, like NetApp Cloud Volumes ONTAP or Silk.
These solutions aggregate underlying cloud storage but provide a virtualization layer that includes compression, deduplication, and—crucially—the ability to stripe across many backend volumes to break past single-volume AWS limits.
Here is a quick breakdown of when I reach for which tool:
| Storage Type | Best Use Case | The “Gotcha” |
|---|---|---|
| EBS gp3 | General purpose, boot volumes, standard DBs. | Throughput caps at 1,000 MiB/s per volume. |
| Instance Store | Caches, massive ingestion queues, scratch space. | Data vanishes on stop/start or termination. |
| NetApp/Silk/Pure | Enterprise legacy apps requiring SAN features. | Complexity & Licensing costs. |
EBS is great, but don’t let it be the default just because it’s there. Your architecture (and your on-call sleep schedule) deserves better.
🤖 Frequently Asked Questions
❓ Why might EBS not be the optimal block storage choice for all workloads?
EBS is network-attached, introducing inherent network latency and potential throttling based on the EC2 instance’s bandwidth, which can limit performance for I/O-intensive applications compared to local storage.
❓ How do EBS, Instance Store, and third-party SANs compare in terms of performance and use cases?
EBS (`gp3`) is general-purpose with throughput caps. Instance Store offers superior, near-zero latency and high IOPS but is ephemeral, suitable for caches or replicated databases. Third-party SANs like NetApp or Silk break AWS single-volume limits for HPC and legacy enterprise applications, albeit with added complexity and cost.
❓ What is a common implementation pitfall with EBS and how can it be avoided?
A common pitfall is continuing to use `gp2` volumes, which couple IOPS to storage size. This can be avoided by migrating to `gp3` volumes, which decouple performance from size, offer a baseline of 3,000 IOPS for free, and are generally more cost-effective.