Cloud Cost Optimization — Basics

Most “cost reductions” are #3 first, then #1, then #2.

Idle / oversized EC2 — instances ~10% utilized.
Old snapshots & unattached EBS volumes.
Idle ELBs / NAT Gateways (NAT GW is hourly + per-GB; surprisingly costly).
Untagged resources — no one knows who owns them.
Cross-AZ traffic is billed per GB between AZs, which often comes as a surprise.
NAT egress for AWS service calls — solve with VPC endpoints.
Public egress is often the biggest line item. Cache via CloudFront, compress, batch.
CloudWatch log retention defaults to "never expire": logs are kept forever unless a retention period (e.g. 30d) is set.
Pre-prod environments running 24/7 — schedule shut-down.
Lambda provisioned concurrency (bought to avoid cold starts) left enabled at low traffic; it is billed whether or not it serves requests.
Old machine images / ECR images.
Unused RDS read replicas.
DynamoDB in provisioned mode instead of on-demand for low or spiky usage.
Noncurrent S3 object versions and incomplete multipart uploads (expire both with lifecycle rules).
GPU sitting idle.
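As a rough illustration of hunting for one of the items above (unattached EBS volumes), here is a minimal sketch. It assumes the input is a list of dicts shaped like the `Volumes` entries returned by the EC2 `describe_volumes` call, where a `State` of `available` means the volume is not attached to any instance. The function name, the 14-day threshold, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_unattached_volumes(volumes, min_age_days=14, now=None):
    """Return volumes that are unattached and older than min_age_days.

    `volumes` is assumed to look like the "Volumes" list from EC2
    describe_volumes: dicts with "VolumeId", "State", "Size" (GiB)
    and a timezone-aware "CreateTime".
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    return [
        v for v in volumes
        if v["State"] == "available"   # "available" = not attached
        and v["CreateTime"] < cutoff   # skip volumes created recently
    ]


# Hypothetical sample data, for illustration only.
sample = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 100,
     "CreateTime": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 500,
     "CreateTime": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 50,
     "CreateTime": datetime(2024, 5, 30, tzinfo=timezone.utc)},
]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
stale = find_unattached_volumes(sample, min_age_days=14, now=now)
stale_ids = [v["VolumeId"] for v in stale]
stale_gib = sum(v["Size"] for v in stale)
print(stale_ids, stale_gib)
```

The age threshold is there so that a volume detached a few minutes ago during a deploy is not flagged; the same filter-then-review pattern applies to old snapshots and machine images.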

On-demand — full price.
Spot — 60-90% off, can be interrupted (good for batch, ML training, stateless workers).
Savings Plans / Reserved Instances give roughly 30-60% off in exchange for a 1- or 3-year commitment.
Committed-use discounts (GCP) work similarly.
Volume discounts by tier.
Egress is where cloud providers make money — avoid where possible.
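The trade-off between on-demand and a commitment comes down to utilization: a commitment is billed for every hour whether or not the capacity is used. A minimal sketch of that arithmetic follows, assuming a hypothetical hourly rate and discount and a 730-hour month; real Savings Plans and Reserved Instance pricing has more dimensions (payment option, instance family, region) than this models.

```python
HOURS_PER_MONTH = 730  # common approximation of an average month


def monthly_cost(hourly_rate, hours_used, discount=0.0, committed=False):
    """Monthly cost of one instance.

    On-demand is billed only for hours actually used; a commitment is
    billed for every hour of the month at the discounted rate.
    """
    billed_hours = HOURS_PER_MONTH if committed else hours_used
    return hourly_rate * (1 - discount) * billed_hours


def break_even_utilization(discount):
    """Fraction of the month an instance must run for a commitment to pay off."""
    return 1 - discount


# Hypothetical numbers: $0.10/hour on-demand, 40% commitment discount,
# instance running half of the month.
on_demand = monthly_cost(0.10, hours_used=365)
committed = monthly_cost(0.10, hours_used=365, discount=0.40, committed=True)
print(f"on-demand: ${on_demand:.2f}, committed: ${committed:.2f}")
print(f"break-even utilization: {break_even_utilization(0.40):.0%}")
```

With these assumed numbers the commitment only wins once the instance runs more than 60% of the time, which is why commitments suit steady baseline load and on-demand or spot suits the variable remainder.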

Without tags, you can’t allocate cost or accountability.

Enforce via SCP / Organization Policy: deny resource creation without required tags.
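A minimal sketch of such a policy, built as a Python dict and printed as JSON. It follows the commonly documented SCP pattern of denying `ec2:RunInstances` when a request tag is absent (the `Null` condition on `aws:RequestTag/<key>`). The tag keys `owner` and `cost-center` and the function name are hypothetical, and this covers only EC2 instances; other resource types need their own statements.

```python
import json

REQUIRED_TAGS = ["owner", "cost-center"]  # hypothetical tag keys


def build_require_tags_scp(required_tags):
    """Build an SCP that denies launching EC2 instances missing any required tag.

    One statement per tag: multiple keys inside a single condition are
    ANDed, which would deny only when *all* tags are missing.
    """
    statements = []
    for i, tag in enumerate(required_tags):
        statements.append({
            "Sid": f"DenyRunInstancesMissingTag{i}",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # "Null": "true" matches when the tag is absent from the request.
            "Condition": {"Null": {f"aws:RequestTag/{tag}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_require_tags_scp(REQUIRED_TAGS)
print(json.dumps(policy, indent=2))
```

Generating the policy from a list keeps the required tag set in one place; the resulting JSON would still need to be attached to an organizational unit and tested against a sandbox account before being rolled out broadly.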

Bill went up 30% this month — investigate. Cost Explorer by service; correlate with deploys / traffic; identify top contributors.
Reduce Lambda cost. Right-size memory, ARM, batch invocations, reduce concurrency, async path where possible.
DynamoDB cost spike. Check capacity mode, hot partitions, scans where a query would do, and whether TTL is configured.
Cross-AZ traffic high. Co-locate components, single-AZ for non-critical, gateway endpoints.
Should we use spot? For batch, CI runners, dev clusters, ML training, stateless workers — yes. For stateful single-replica DBs, no.
NAT Gateway cost. Use VPC endpoints, reduce egress, consider single shared NAT vs per-AZ depending on resilience need.
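Two of the scenarios above (Lambda right-sizing and NAT Gateway vs. VPC endpoints) are easier to reason about with the arithmetic written out. The sketch below is a back-of-the-envelope model only: every price constant is an assumption (approximate us-east-1 list prices at one point in time), the traffic figures are hypothetical, and free tiers and cross-AZ charges are ignored.

```python
# All prices are assumptions; check the current pricing pages before
# relying on them.
HOURS_PER_MONTH = 730
LAMBDA_PRICE_PER_GB_SECOND = {"x86_64": 0.0000166667, "arm64": 0.0000133334}
LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.20
NAT_PRICE_PER_HOUR = 0.045
NAT_PRICE_PER_GB = 0.045


def lambda_monthly_cost(invocations, avg_duration_ms, memory_mb, arch="x86_64"):
    """Compute charge (GB-seconds) plus request charge, ignoring the free tier."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute = gb_seconds * LAMBDA_PRICE_PER_GB_SECOND[arch]
    requests = invocations / 1_000_000 * LAMBDA_PRICE_PER_MILLION_REQUESTS
    return compute + requests


def nat_monthly_cost(gateways, gb_processed):
    """Hourly charge per gateway plus per-GB data processing charge."""
    return gateways * NAT_PRICE_PER_HOUR * HOURS_PER_MONTH + gb_processed * NAT_PRICE_PER_GB


# Lambda: 50M invocations/month at 120 ms. Assumes duration stays the same
# after halving memory, which is not guaranteed (less memory means less CPU).
lambda_before = lambda_monthly_cost(50_000_000, 120, 1024, "x86_64")
lambda_after = lambda_monthly_cost(50_000_000, 120, 512, "arm64")
print(f"Lambda: ${lambda_before:,.2f} -> ${lambda_after:,.2f} per month")

# NAT: one gateway per AZ (3 AZs), 20,000 GB/month, of which 15,000 GB is
# S3 traffic that a gateway endpoint would take off the NAT path.
nat_before = nat_monthly_cost(gateways=3, gb_processed=20_000)
nat_after = nat_monthly_cost(gateways=3, gb_processed=5_000)
print(f"NAT: ${nat_before:,.2f} -> ${nat_after:,.2f} per month")
```

Under these assumptions the per-GB processing charge dominates the NAT bill, so moving AWS service traffic to endpoints saves far more than consolidating gateways, and consolidating also trades away AZ resilience.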