🚀 Executive Summary

TL;DR: Training computer vision models on large datasets like nuScenes from S3 often leads to GPU underutilization due to S3’s high latency for many small files. This guide offers solutions like local NVMe with s5cmd, FSx for Lustre, or WebDataset streaming to optimize data access and improve training throughput.

🎯 Key Takeaways

S3’s high latency for numerous small files, typical in large CV datasets like nuScenes, creates an IOPS bottleneck and GPU underutilization during model training.
Utilizing EC2 instance store (NVMe SSDs) with `s5cmd` for local data caching is a quick, high-IOPS solution for proof-of-concepts, despite data ephemerality.
Amazon FSx for Lustre provides a production-ready, POSIX-compliant, S3-backed file system with sub-millisecond latencies, ideal for shared, persistent access to large CV datasets.

Stop burning GPU credits while your model waits for S3 downloads; here is the definitive guide to architecting high-throughput storage for massive computer vision datasets like nuScenes on AWS.

Taming the Beast: Efficiently Processing Large CV Datasets on AWS

I still remember the first time I tried to train a simple object detection model on the full nuScenes dataset. I spun up a p3.2xlarge, feeling like a king, and pointed my PyTorch dataloader directly at an S3 bucket using boto3. I went to grab coffee. When I came back, the epoch hadn’t even finished 1%. My expensive GPU was sitting at 0% utilization, basically burning a hole in the company credit card while it waited for individual JPEGs to trickle down the network pipe.

I’ve been there. You’re staring at prod-training-cluster-01, watching your costs skyrocket while your throughput flatlines. It’s frustrating, and honestly, it’s a rite of passage for any Cloud Architect working with Computer Vision.

The “Why”: It’s Not Size, It’s Latency

Here is the hard truth: S3 is an object store, not a file system. It is incredible for throughput (downloading one 50GB file), but it is terrible for latency (downloading fifty thousand 50KB files).

Datasets like nuScenes (approx. 550GB) aren’t just “large” in volume; they are complex. You are dealing with hundreds of thousands of small sensor data files, JSON metadata, and lidar point clouds. When you read these one by one during a training loop, every file costs a full HTTP GET round trip, and that fixed per-request overhead dwarfs the time spent actually transferring bytes. You aren’t bottlenecked by bandwidth; you are dying the death of a thousand cuts by IOPS (Input/Output Operations Per Second).
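To make the latency argument concrete, here is a back-of-the-envelope sketch. The 30 ms per-request latency and the 10 Gbit/s link are assumed, illustrative numbers rather than measurements, and `sequential_fetch_seconds` is a hypothetical helper that models a naive one-file-at-a-time reader.

```python
def sequential_fetch_seconds(num_files, file_size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate wall-clock time to fetch files one at a time.

    Each request pays a fixed latency plus the transfer time for its payload.
    """
    per_file = latency_s + file_size_bytes / bandwidth_bytes_per_s
    return num_files * per_file


# Assumed, illustrative numbers (not measurements):
LATENCY_S = 0.030   # 30 ms per GET round trip
BANDWIDTH = 1.25e9  # 10 Gbit/s expressed in bytes per second

# Same total payload (2.5 GB) in both cases.
many_small = sequential_fetch_seconds(50_000, 50_000, LATENCY_S, BANDWIDTH)
one_large = sequential_fetch_seconds(1, 50_000 * 50_000, LATENCY_S, BANDWIDTH)

print(f"50,000 x 50 KB files: {many_small:.1f} s")
print(f"1 x 2.5 GB file:      {one_large:.1f} s")
```

Under these assumptions the same 2.5 GB takes roughly 1,500 seconds as small files and about 2 seconds as a single object. Parallel requests narrow the gap in practice, but the fixed per-request cost is what you are fighting.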

If you are stuck, here are three ways to fix this, ranging from “I need this running by lunch” to “We are building the next Tesla Autopilot.”

Solution 1: The Quick Fix (Local NVMe + s5cmd)

If you are a solo engineer or working on a proof-of-concept, keep it simple. Don’t over-engineer a distributed file system if you just need to iterate on a model.

The trick here is choosing an EC2 instance family that has Instance Store (ephemeral NVMe SSDs). Look at the g4dn or g5 series. These drives are physically attached to the host server. They provide insane IOPS.

The Strategy:

On boot, copy the entire dataset from S3 to the local NVMe drive. Do not use the standard AWS CLI (aws s3 cp) because it is too slow for small files. Use s5cmd, a tool written in Go that is shockingly fast.

Pro Tip: This data is ephemeral. If you stop the instance, the data vanishes. Only use this for cache/scratch space.

# 1. Format and mount the NVMe drive (usually /dev/nvme1n1; confirm with lsblk)
sudo mkfs.xfs /dev/nvme1n1
sudo mkdir -p /data
sudo mount /dev/nvme1n1 /data
sudo chown "$USER" /data

# 2. Install s5cmd (it's faster than aws cli); requires a Go toolchain
go install github.com/peak/s5cmd/v2@master

# 3. Hydrate the disk (This will saturate your network bandwidth)
s5cmd cp "s3://my-org-datalake/nuscenes-v1.0/*" /data/

This is “hacky” because every startup spends 10-15 minutes copying data, but once it’s there, your training speed will fly.
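If you would rather drive the hydration step from your training script, a minimal sketch could look like the following. The `.hydrated` marker file and both function names are my own illustration, not part of s5cmd; the only s5cmd call assumed is the same `cp` with a wildcard source used in the shell steps.

```python
import shutil
import subprocess
from pathlib import Path


def build_hydrate_command(s3_prefix, local_dir):
    """Return the s5cmd argv that copies everything under s3_prefix to local_dir."""
    source = s3_prefix.rstrip("/") + "/*"
    dest = str(local_dir).rstrip("/") + "/"
    return ["s5cmd", "cp", source, dest]


def hydrate_once(s3_prefix, local_dir, marker_name=".hydrated"):
    """Copy the dataset unless a marker file says a previous copy completed.

    Returns True if a copy ran, False if it was skipped.
    """
    local_dir = Path(local_dir)
    marker = local_dir / marker_name  # hypothetical marker, not an s5cmd feature
    if marker.exists():
        return False
    if shutil.which("s5cmd") is None:
        raise RuntimeError("s5cmd not found on PATH")
    local_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_hydrate_command(s3_prefix, local_dir), check=True)
    marker.touch()  # only reached when s5cmd exited successfully
    return True


# Example call at the top of a training script (paths are placeholders):
# hydrate_once("s3://my-org-datalake/nuscenes-v1.0", "/data")
```

The marker is written only after s5cmd exits cleanly, so a rebooted instance that still has its disk skips the copy, while a fresh instance with an empty instance store re-hydrates.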

Solution 2: The Permanent Fix (FSx for Lustre)

If you are running this job repeatedly, or if you have a team of data scientists accessing the same data, copying 550GB every morning is a waste of time. You need a POSIX-compliant file system that looks like a local drive but is backed by S3.

Enter Amazon FSx for Lustre. This is my go-to for production CV pipelines.

Lustre creates a high-performance file system that “lazy loads” files from S3. The first time you read a file, it pulls from S3 with high throughput. Subsequent reads come from the FSx cache. It presents a standard drive mount to your OS, so your code doesn’t need to change.
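Because the first read of each file is the slow one, it can help to touch every file once before the first epoch. The sketch below does that with plain file reads from a thread pool; the function names and the `/fsx/nuscenes` path are hypothetical, and FSx for Lustre also documents its own preload mechanism, which is worth checking before relying on this.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_fully(path, chunk_size=1 << 20):
    """Read a file to the end, discarding the bytes; returns the byte count."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def warm_cache(root, max_workers=32):
    """Read every file under root in parallel; returns (file_count, total_bytes)."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sizes = list(pool.map(read_fully, paths))
    return len(paths), sum(sizes)


# Example (mount point is a placeholder):
# count, nbytes = warm_cache("/fsx/nuscenes")
```

Threads are a reasonable fit here because the work is I/O-bound. Nothing in the sketch is Lustre-specific, so the same function works against any mounted path.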

Pros:
– Sub-millisecond latencies.
– Seamless S3 integration.
– Shared across multiple instances.

Cons:
– Minimum size deployment costs money (starts around ~$15/day).
– Overkill for datasets < 50GB.

I recently switched team-cv-prod to this setup, and we saw a 4x reduction in epoch times. It just works.

Solution 3: The ‘Nuclear’ Option (WebDataset / Streaming)

Sometimes, the dataset is too big for local disk (petabyte scale), or FSx is too expensive for your budget. This is where you have to change how you think about data.

Instead of reading a million small files, you tar them up into larger shards (e.g., 100MB – 1GB tar files) and stream them linearly. In the PyTorch world, this often means using the WebDataset library.

By grouping small images into large tarballs, S3 can stream the data efficiently. You never download the whole dataset; you stream it into RAM as needed.
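The sharding step itself needs nothing beyond the standard library. Here is a minimal sketch, assuming a flat directory where each sample's files share a basename (for example `0001.jpg` and `0001.json`); `write_shards` and its size threshold are illustrative names and defaults, not part of WebDataset.

```python
import os
import tarfile


def write_shards(sample_dir, out_dir, max_shard_bytes=500 * 1024 * 1024):
    """Pack files from sample_dir into numbered tar shards.

    Files are added in sorted order so files sharing a basename sit next to
    each other, and a shard is only closed between basenames so a sample
    never straddles two shards. Returns the list of shard paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    names = sorted(
        n for n in os.listdir(sample_dir)
        if os.path.isfile(os.path.join(sample_dir, n))
    )
    shards, tar, shard_bytes, prev_key = [], None, 0, None
    for name in names:
        key = name.split(".", 1)[0]  # sample key: everything before the first dot
        if tar is None or (key != prev_key and shard_bytes >= max_shard_bytes):
            if tar is not None:
                tar.close()
            path = os.path.join(out_dir, f"shard-{len(shards):06d}.tar")
            tar = tarfile.open(path, "w")
            shards.append(path)
            shard_bytes = 0
        full = os.path.join(sample_dir, name)
        tar.add(full, arcname=name)
        shard_bytes += os.path.getsize(full)
        prev_key = key
    if tar is not None:
        tar.close()
    return shards
```

The shards are written uncompressed because JPEGs are already compressed. After writing them, upload the `.tar` files to S3 and point your loader at the shard pattern.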

import webdataset as wds

# Pointing directly to S3 via pipe
url = "pipe:aws s3 cp s3://my-bucket/nuscenes-shards/shard-{000000..000150}.tar -"

dataset = (
    wds.WebDataset(url)
    .shuffle(1000)
    .decode("rgb")
    .to_tuple("jpg", "json")
)

# Now your GPU stays fed without a local disk requirement
dataloader = wds.WebLoader(dataset, batch_size=64, num_workers=4)

This is the “Nuclear” option because it requires re-formatting your entire dataset and refactoring your training code. It’s a heavy lift, but it keeps working when the dataset grows far beyond what any local disk can hold.

Final Thoughts

If you are just starting out with nuScenes on AWS, do yourself a favor: grab a g4dn.12xlarge, format that local NVMe drive, and use s5cmd. It’s the path of least resistance. But when your CFO Slack messages you about the S3 API costs, be ready to migrate to FSx for Lustre.

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.

🤖 Frequently Asked Questions

❓ Why do my AWS GPUs sit idle when training with large computer vision datasets from S3?

GPUs become underutilized because S3, an object store, has high latency for individual GET requests of the many small files (e.g., images, metadata) common in CV datasets, creating an IOPS bottleneck rather than a bandwidth one.

❓ How does FSx for Lustre compare to using local NVMe instance storage for CV training?

Local NVMe is a quick, ephemeral cache for single instances, offering high IOPS but requiring data copy on boot. FSx for Lustre is a persistent, shared, POSIX-compliant file system that lazy loads from S3, providing sub-millisecond latencies suitable for production and team collaboration, albeit with higher continuous cost.

❓ What is a common implementation pitfall when accessing large CV datasets directly from S3 during training?

A common pitfall is using standard tools like `aws s3 cp` or `boto3` to read many small files directly from S3, which incurs significant HTTP overhead and latency, leading to severely underutilized GPUs. The solution involves optimizing data access via local NVMe with `s5cmd`, FSx for Lustre, or data streaming with WebDataset.
