Cost Optimization — Basics

  1. Use less (efficiency).
  2. Pay less per unit (commitments / spot).
  3. Stop using what you don’t need (cleanup).

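Most cost reductions come from #3 first, then #1, then #2, because commitments made before cleanup and right-sizing lock in spend on capacity you do not need.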

Common waste sources (AWS but generally applicable)

  • Idle or oversized EC2 instances, often running at around 10% utilization.
  • Old snapshots & unattached EBS volumes.
  • Idle ELBs / NAT Gateways (NAT GW is hourly + per-GB; surprisingly costly).
  • Untagged resources — no one knows who owns them.
  • Cross-AZ traffic: data transfer between AZs is billed per GB, which often comes as a surprise.
  • NAT egress for AWS service calls — solve with VPC endpoints.
  • Public egress: often the biggest line item. Cache via CloudFront, compress, batch.
  • CloudWatch log retention: log groups never expire by default, so logs are kept forever unless you set a retention period (for example 30 days).
  • Pre-prod environments running 24/7 — schedule shut-down.
  • Lambda provisioned concurrency (bought to avoid cold starts) left enabled at low traffic.
  • Old machine images / ECR images.
  • Unused RDS read replicas.
  • DynamoDB tables in provisioned mode rather than on-demand when usage is low or spiky.
  • S3 versions and incomplete multipart uploads.
  • GPU sitting idle.

Pricing models

  • On-demand: full price.
  • Spot — 60-90% off, can be interrupted (good for batch, ML training, stateless workers).
  • Savings Plans / Reserved Instances: roughly 30-60% off for a 1- or 3-year commitment.
  • Committed-use discounts (GCP) work similarly.
  • Volume discounts by tier.
  • Egress is where cloud providers make money — avoid where possible.
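
To make the discount ranges above concrete, here is a minimal sketch that compares the monthly cost of one instance under each pricing model. The hourly price of $0.10 is a hypothetical placeholder, and the 45% and 75% discounts are simply midpoints of the ranges quoted above, not real quotes.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_cost(hourly_price, discount=0.0, hours=HOURS_PER_MONTH):
    """Monthly cost of one instance after applying a fractional discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return hourly_price * (1.0 - discount) * hours


# Hypothetical on-demand price; discounts are midpoints of the ranges above.
ON_DEMAND_HOURLY = 0.10
options = {
    "on-demand": monthly_cost(ON_DEMAND_HOURLY),
    "savings plan (45% off)": monthly_cost(ON_DEMAND_HOURLY, 0.45),
    "spot (75% off)": monthly_cost(ON_DEMAND_HOURLY, 0.75),
}

for name, cost in options.items():
    print(f"{name:>24}: ${cost:7.2f}/month")
```

The comparison ignores spot interruptions and unused commitment hours, both of which reduce the real saving.
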

Right-sizing

  • Look at p95 CPU / memory utilization.
  • Drop to the next smaller instance size if utilization stays below 50% at peak.
  • ARM (Graviton on AWS, Ampere on GCP) is ~20% cheaper for many workloads.
  • Use Compute Optimizer / Recommender.
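
The right-sizing rule above can be sketched as a small check. This is an illustration only: it uses the nearest-rank p95 as a stand-in for "peak", the function names are hypothetical, and the sample values are made up rather than taken from any monitoring system.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_downsize(cpu_samples, mem_samples, threshold=50.0):
    """True when both CPU and memory p95 stay below the threshold (percent)."""
    return p95(cpu_samples) < threshold and p95(mem_samples) < threshold


# Made-up utilization samples, in percent.
cpu = [12, 15, 9, 30, 22, 18, 41, 14, 11, 16]
mem = [35, 36, 38, 40, 37, 39, 42, 36, 35, 38]

print(p95(cpu), p95(mem), should_downsize(cpu, mem))
```

Checking memory as well as CPU matters because a smaller instance size usually halves both.
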

Storage

  • S3 lifecycle: Standard → IA → Glacier → Delete.
  • Block storage: gp3 over gp2 (cheaper per GB, with baseline IOPS and throughput that do not depend on volume size).
  • Cold data → Glacier Deep Archive (~$1/TB/mo).
  • Delete old snapshots, AMIs.
  • Compress logs (gzip / zstd).
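
As a sketch of the Standard → IA → Glacier → Delete lifecycle, the snippet below builds a rule as a plain dictionary in the shape the S3 lifecycle configuration API is understood to accept (for example via boto3's `put_bucket_lifecycle_configuration`). The day thresholds, the rule ID, and the `logs/` prefix are assumptions chosen for illustration; check the key names against the S3 documentation before applying anything.

```python
import json


def lifecycle_rule(prefix="", ia_after=30, glacier_after=90, delete_after=365,
                   noncurrent_days=30, abort_multipart_days=7):
    """Build one lifecycle rule: Standard -> IA -> Glacier -> delete."""
    if not ia_after < glacier_after < delete_after:
        raise ValueError("transitions must happen in order, before expiration")
    return {
        "ID": "tiering-and-cleanup",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_after, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": delete_after},
        # Clean up old object versions and abandoned multipart uploads too.
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
        "AbortIncompleteMultipartUpload": {
            "DaysAfterInitiation": abort_multipart_days
        },
    }


config = {"Rules": [lifecycle_rule(prefix="logs/")]}
print(json.dumps(config, indent=2))
```

The same rule also covers two waste sources listed earlier: old S3 versions and incomplete multipart uploads.
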
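
Network

  • VPC endpoints for S3, DynamoDB, SSM, ECR to avoid NAT egress fees (gateway endpoints for S3 and DynamoDB carry no extra charge; interface endpoints for SSM and ECR are billed hourly and per GB).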
  • CloudFront / CDN for content — cheaper outbound and faster.
  • Co-locate services in the same AZ where possible.
  • Use PrivateLink / VPC peering for cross-account or cross-VPC traffic.
  • Enable compression (gzip / Brotli) at the load balancer.
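
To illustrate why NAT Gateways show up on the bill, here is a rough estimate of the hourly plus per-GB charge, before and after moving S3 traffic to a gateway endpoint. The $0.045 per hour and $0.045 per GB defaults are illustrative assumptions, not quoted prices; substitute your region's rates. The traffic volumes are made up.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def nat_monthly_cost(gb_processed, gateways=1, hourly=0.045, per_gb=0.045):
    """Estimated monthly NAT Gateway cost: fixed hourly charge plus per-GB processing."""
    return gateways * hourly * HOURS_PER_MONTH + gb_processed * per_gb


# One NAT Gateway per AZ across three AZs, 10 TB/month through NAT.
before = nat_monthly_cost(gb_processed=10_000, gateways=3)

# Assume 8 TB of that was S3 traffic, now routed through a gateway endpoint.
after = nat_monthly_cost(gb_processed=2_000, gateways=3)

print(f"before: ${before:.2f}  after: ${after:.2f}  saved: ${before - after:.2f}")
```

The fixed hourly part does not change here; only consolidating gateways reduces it, at the cost of resilience.
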

Kubernetes

  • Cluster Autoscaler / Karpenter: provision nodes only when needed.
  • HPA for pods.
  • Right-size requests — over-requesting wastes capacity.
  • Spot node pools for batch and fault-tolerant workloads.
  • Bin-packing: small workloads share nodes.
  • Use PDBs and anti-affinity carefully: over-spreading wastes capacity.
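
The effect of over-requesting on node count can be shown with a toy first-fit-decreasing packer. This is a simplification and not how the Kubernetes scheduler works: it considers CPU requests only, and the pod sizes and the 4000m node capacity are hypothetical.

```python
def first_fit_decreasing(requests_m, node_capacity_m):
    """Pack CPU requests (millicores) onto as few equal-size nodes as this heuristic finds."""
    nodes = []  # each node is the list of requests placed on it
    for req in sorted(requests_m, reverse=True):
        if req > node_capacity_m:
            raise ValueError(f"request {req}m does not fit on any node")
        for node in nodes:
            if sum(node) + req <= node_capacity_m:
                node.append(req)
                break
        else:
            # No existing node had room, so start a new one.
            nodes.append([req])
    return nodes


right_sized = [500, 250, 250, 1000, 750, 250, 500, 500]
over_requested = [r * 2 for r in right_sized]  # every pod asks for double

print(len(first_fit_decreasing(right_sized, 4000)), "node(s) when right-sized")
print(len(first_fit_decreasing(over_requested, 4000)), "node(s) when over-requested")
```

Doubling every request doubles the node count in this example even though actual usage is unchanged.
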
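
Serverless

  • ARM (Graviton) for Lambda: cheaper per GB-second.
  • Reduce package size: faster cold starts and faster invocations cost less.
  • Memory tuning: more memory also brings more CPU, so a function sometimes runs enough faster to be cheaper overall.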
  • Provisioned concurrency only for hot paths.
  • Step Functions for orchestration vs chained Lambdas.
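
The memory-tuning point can be checked with simple arithmetic, since Lambda compute is billed in GB-seconds. In this sketch the price per GB-second is an assumed placeholder, the durations are invented, and per-request charges are left out.

```python
def lambda_compute_cost(invocations, avg_duration_ms, memory_mb,
                        price_per_gb_second=0.0000166667):
    """Compute charge only: invocations x duration (s) x memory (GB) x price."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second


# Hypothetical measurements: doubling memory cuts duration from 800 ms to 300 ms.
small = lambda_compute_cost(10_000_000, avg_duration_ms=800, memory_mb=512)
large = lambda_compute_cost(10_000_000, avg_duration_ms=300, memory_mb=1024)

print(f"512 MB:  ${small:.2f}")
print(f"1024 MB: ${large:.2f}")
```

More memory only wins when the duration drops by more than the memory increase; measure real durations at each setting before changing it.
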

Databases

  • Aurora Serverless v2 for spiky low-volume workloads.
  • DynamoDB on-demand for unpredictable traffic; provisioned + auto-scaling for steady traffic.
  • Read replicas only if read load needs them.
  • Review backup retention.
  • Reserved instances on RDS for steady workloads.
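
One way to reason about DynamoDB on-demand versus provisioned is a break-even utilization: the fraction of provisioned capacity you must actually use before provisioned becomes cheaper. The sketch below assumes one capacity unit serves one request per second, and both prices are placeholders rather than current list prices.

```python
SECONDS_PER_HOUR = 3600


def break_even_utilization(provisioned_price_per_unit_hour,
                           on_demand_price_per_request):
    """Utilization above which provisioned capacity beats on-demand."""
    # Cost of serving a fully used capacity unit for one hour on on-demand.
    full_use_on_demand = SECONDS_PER_HOUR * on_demand_price_per_request
    return provisioned_price_per_unit_hour / full_use_on_demand


def cheaper_mode(avg_utilization, break_even):
    """Pick a capacity mode from the average utilization of provisioned capacity."""
    return "provisioned" if avg_utilization > break_even else "on-demand"


# Placeholder prices for illustration only.
threshold = break_even_utilization(0.00065, 1.25e-6)
print(f"break-even utilization: {threshold:.1%}")
print(cheaper_mode(0.05, threshold), cheaper_mode(0.60, threshold))
```

Spiky tables tend to sit below the threshold on average, which is why on-demand suits them.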

Tagging

Without tags, you can’t allocate cost or assign accountability. A minimum set of required tags:

  • Environment (prod / staging / dev)
  • Owner (team or email)
  • Project / Service
  • CostCenter

Enforce via SCP / Organization Policy: deny resource creation without required tags.
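
A minimal sketch of such a policy, generated as JSON for EC2 instance launches only. It assumes the tag keys listed above (with `Project` standing in for "Project / Service") and uses the `aws:RequestTag` condition key with the `Null` operator. One statement per tag is emitted because condition keys inside a single statement are combined with AND, which would deny only when every tag is missing. Treat it as a starting point to validate, not a finished policy.

```python
import json

REQUIRED_TAGS = ["Environment", "Owner", "Project", "CostCenter"]


def require_tags_scp(tags, action="ec2:RunInstances",
                     resource="arn:aws:ec2:*:*:instance/*"):
    """Build an SCP that denies the action when any required tag is absent."""
    statements = [
        {
            "Sid": f"DenyWithout{tag}",
            "Effect": "Deny",
            "Action": action,
            "Resource": resource,
            # "Null": "true" matches requests that do not carry this tag.
            "Condition": {"Null": {f"aws:RequestTag/{tag}": "true"}},
        }
        for tag in tags
    ]
    return {"Version": "2012-10-17", "Statement": statements}


policy = require_tags_scp(REQUIRED_TAGS)
print(json.dumps(policy, indent=2))
```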

Tools

  • AWS: Cost Explorer, Cost & Usage Reports → Athena, Cost Anomaly Detection.
  • GCP: Billing reports, Recommender, BigQuery billing export.
  • Cross-cloud: Vantage, CloudHealth, Apptio Cloudability, Infracost.
  • Per-PR: Infracost estimates the cost impact of a Terraform diff.

Process

  • Show team / service-level cost dashboards.
  • Monthly cost reviews per service.
  • Per-feature unit economics ($/request, $/customer).
  • Budget alerts at 50/80/100%.
  • Anomaly alerts (sudden spikes).
  • Quarterly right-sizing review.
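
The budget alert thresholds and unit economics above reduce to a few lines of arithmetic. This sketch uses hypothetical function names and made-up spend figures; a real setup would rely on the provider's budgeting service rather than hand-rolled checks.

```python
def crossed_thresholds(spend, budget, thresholds=(50, 80, 100)):
    """Return the alert thresholds (in percent of budget) that spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # Compare spend * 100 with t * budget to avoid float division rounding.
    return [t for t in thresholds if spend * 100 >= t * budget]


def unit_cost(total_cost, units):
    """Cost per unit, for example dollars per request or per customer."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


print(crossed_thresholds(spend=8_200, budget=10_000))
print(f"${unit_cost(8_200, 41_000_000):.6f} per request")
```
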

Common scenarios

  1. Bill went up 30% this month: investigate. Break down Cost Explorer by service, correlate with deploys and traffic, and identify the top contributors.
  2. Reduce Lambda cost. Right-size memory, move to ARM, batch invocations, reduce concurrency (especially provisioned concurrency), and use async paths where possible.
  3. DynamoDB cost spike. Check capacity mode, hot partitions, scans that should be queries, and whether TTL is configured.
  4. Cross-AZ traffic high. Co-locate components, run non-critical workloads in a single AZ, and use gateway endpoints.
  5. Should we use spot? Yes for batch, CI runners, dev clusters, ML training, and stateless workers. No for stateful single-replica databases.
  6. NAT Gateway cost. Use VPC endpoints, reduce egress, and consider a single shared NAT versus one per AZ depending on resilience needs.
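
For the first scenario, the "by service" breakdown amounts to ranking month-over-month deltas. The sketch below works on two plain dictionaries of per-service cost, such as you might export from a billing report. The service names and figures are invented and chosen so the total rises by 30%.

```python
def top_contributors(previous, current, limit=3):
    """Rank services by absolute month-over-month cost increase."""
    services = set(previous) | set(current)
    deltas = {s: current.get(s, 0.0) - previous.get(s, 0.0) for s in services}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Invented figures for illustration.
last_month = {"EC2": 4000.0, "S3": 800.0, "NAT Gateway": 300.0, "Lambda": 150.0}
this_month = {"EC2": 4200.0, "S3": 850.0, "NAT Gateway": 1545.0,
              "Lambda": 140.0, "CloudWatch": 90.0}

growth = sum(this_month.values()) / sum(last_month.values()) - 1
print(f"total change: {growth:+.0%}")
for service, delta in top_contributors(last_month, this_month):
    print(f"{service:>12}: {delta:+.2f}")
```

Services that are new this month count from zero, so a freshly enabled service appears in the ranking instead of being skipped.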