Cloud Cost Optimization: Basics
The 3 levers
- Use less (efficiency).
- Pay less per unit (commitments / spot).
- Stop using what you don’t need (cleanup).
Most “cost reductions” come from #3 (cleanup) first, then #1 (efficiency), then #2 (pricing).
Common waste sources (AWS but generally applicable)
- Idle / oversized EC2: instances ~10% utilized.
- Old snapshots & unattached EBS volumes.
- Idle ELBs / NAT Gateways (NAT GW is hourly + per-GB; surprisingly costly).
- Untagged resources — no one knows who owns them.
- Cross-AZ traffic: data transfer between AZs is billed in each direction (often surprising).
- NAT egress for AWS service calls — solve with VPC endpoints.
- Public egress — biggest line item often. Cache via CloudFront, compress, batch.
- CloudWatch log retention: log groups default to "never expire", so logs are kept forever unless a retention period (e.g. 30d) is set.
- Pre-prod environments running 24/7 — schedule shut-down.
- Lambda provisioned concurrency (bought to avoid cold starts) left enabled on low-traffic functions.
- Old machine images / ECR images.
- Unused RDS read replicas.
- DynamoDB tables in provisioned mode instead of on-demand at low or spiky usage.
- S3 versions and incomplete multipart uploads.
- GPU sitting idle.
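As a minimal sketch of the cleanup pass for one item above (unattached EBS volumes), the function below filters volume records whose `State` is `available`, which is how EC2 reports a volume that is not attached to any instance. The sample records are made up and only mimic the shape of `describe_volumes()["Volumes"]`; the helper name `find_unattached_volumes` is hypothetical.

```python
def find_unattached_volumes(volumes):
    """Return (VolumeId, size in GiB) for volumes not attached to an instance."""
    return [
        (v["VolumeId"], v["Size"])
        for v in volumes
        if v.get("State") == "available"  # 'in-use' means attached
    ]


# Made-up sample shaped like ec2.describe_volumes()["Volumes"].
sample = [
    {"VolumeId": "vol-aaa", "Size": 100, "State": "in-use"},
    {"VolumeId": "vol-bbb", "Size": 500, "State": "available"},
    {"VolumeId": "vol-ccc", "Size": 20, "State": "available"},
]

orphans = find_unattached_volumes(sample)
orphan_gib = sum(size for _, size in orphans)
print(orphans, orphan_gib)
```

Against a real account, the same filter can be pushed to the API with `describe_volumes(Filters=[{"Name": "status", "Values": ["available"]}])`; snapshot before deleting if there is any doubt about ownership.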
Pricing patterns
- On-demand: full price.
- Spot — 60-90% off, can be interrupted (good for batch, ML training, stateless workers).
- Savings Plans / Reserved: 30-60% off for a 1- or 3-year commitment.
- Committed-use discounts (GCP) work similarly.
- Volume discounts by tier.
- Egress is where cloud providers make money — avoid where possible.
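The trade-off between on-demand and a commitment can be sketched with simple arithmetic: a commitment is paid for every hour whether or not the capacity is used, so it only wins when average utilization exceeds `1 - discount`. The hourly rate and discount below are illustrative inputs, not quoted prices, and the function names are hypothetical.

```python
def commitment_break_even(discount):
    """Utilization above which a commitment beats on-demand (0.0 to 1.0)."""
    return 1.0 - discount


def cheaper_option(on_demand_hourly, discount, utilization, hours=730):
    """Compare one month of on-demand (pay for used hours only) against a
    commitment (pay the discounted rate for every hour)."""
    on_demand = on_demand_hourly * hours * utilization
    committed = on_demand_hourly * (1.0 - discount) * hours
    choice = "commit" if committed < on_demand else "on-demand"
    return choice, on_demand, committed


# Illustrative: $0.10/h instance, 40% commitment discount.
print(commitment_break_even(0.40))       # break-even utilization
print(cheaper_option(0.10, 0.40, 0.50))  # half-idle workload
print(cheaper_option(0.10, 0.40, 0.90))  # steady workload
```

This ignores upfront-payment options and the flexibility difference between Savings Plans and Reserved Instances; it is only meant to show why commitments suit the steady baseline and not the spiky remainder.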
Right-sizing
- Look at p95 CPU / memory utilization.
- Drop to the next smaller instance size if utilization stays below 50% at peak.
- ARM (Graviton on AWS, Ampere on GCP) is ~20% cheaper for many workloads.
- Use Compute Optimizer / Recommender.
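The p95 rule above can be sketched as follows. This assumes utilization samples have already been exported as percentages (for example from CloudWatch); the nearest-rank percentile and the 50% threshold are one reasonable reading of the rule, not the method Compute Optimizer itself uses.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_downsize(cpu_samples, threshold=50.0):
    """True when p95 utilization is below the threshold."""
    return percentile(cpu_samples, 95) < threshold


# Made-up samples: mostly 20% CPU with occasional bursts to 80%.
rare_bursts = [20] * 98 + [80] * 2
frequent_bursts = [20] * 90 + [80] * 10

print(should_downsize(rare_bursts))      # bursts fall above p95
print(should_downsize(frequent_bursts))  # bursts are inside p95
```

Running the same check on memory before resizing avoids trading a CPU saving for an out-of-memory incident.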
Storage optimization
- S3 lifecycle: Standard → IA → Glacier → Delete.
- Block storage: gp3 over gp2 (cheaper, faster).
- Cold data → Glacier Deep Archive (~$1/TB/mo).
- Delete old snapshots, AMIs.
- Compress logs (gzip / zstd).
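The Standard → IA → Glacier → Delete path can be expressed as an S3 lifecycle configuration. The sketch below builds the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts; the prefix, day counts, and rule ID are assumptions chosen for illustration, and `apply_lifecycle` is a hypothetical wrapper that requires boto3 and credentials.

```python
def build_lifecycle_rules(prefix="logs/"):
    """Lifecycle config: tier down over time, then expire."""
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
                # Also clean up old versions and abandoned multipart uploads.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Apply the config to a bucket (not called here; needs AWS access)."""
    import boto3  # assumed to be installed and credentialed

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = build_lifecycle_rules()
print(config["Rules"][0]["Transitions"])
```

Note that this call replaces the bucket's whole lifecycle configuration, so existing rules need to be merged into the dictionary first.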
Network optimization
- VPC Endpoints for S3, DynamoDB, SSM, ECR: avoid NAT data-processing fees (gateway endpoints for S3 and DynamoDB are free; interface endpoints bill hourly and per GB).
- CloudFront / CDN for content — cheaper outbound and faster.
- Co-locate services in same AZ where possible.
- Use PrivateLink / VPC peering for cross-account/VPC traffic.
- Compression (gzip/br) at LB.
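To see why VPC endpoints matter, here is a rough model of NAT Gateway billing (hourly charge plus a per-GB processing charge). The rates used are illustrative placeholders and vary by region, and the traffic volumes are invented; the function names are hypothetical.

```python
def nat_monthly_cost(gb_processed, hourly_rate, per_gb_rate, hours=730, gateways=1):
    """Monthly NAT cost: fixed hourly charge per gateway plus per-GB processing."""
    return gateways * hourly_rate * hours + gb_processed * per_gb_rate


def endpoint_savings(s3_gb, per_gb_rate):
    """NAT processing fees avoided by routing S3 traffic through a gateway
    endpoint, which carries no per-GB charge."""
    return s3_gb * per_gb_rate


# Illustrative rates: $0.045/h and $0.045/GB; 10 TB/month total, 8 TB of it to S3.
before = nat_monthly_cost(10_000, 0.045, 0.045)
saved = endpoint_savings(8_000, 0.045)
print(round(before, 2), round(saved, 2), round(before - saved, 2))
```

Interface endpoints have their own hourly and per-GB charges, so the same comparison for services like SSM or ECR needs those costs subtracted from the savings.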
Containers / K8s
- Cluster autoscaler / Karpenter: provision nodes only when needed.
- HPA for pods.
- Right-size requests — over-requesting wastes capacity.
- Spot node pools for batch and fault-tolerant workloads.
- Bin-packing — small workloads share nodes.
- Use PDBs and anti-affinity carefully: over-spreading wastes capacity.
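The effect of over-requesting on node count can be illustrated with a toy bin-packing model. This is a first-fit-decreasing sketch over CPU requests only (in millicores); the real Kubernetes scheduler also weighs memory, affinity, and other constraints, so treat the numbers as directional.

```python
def nodes_needed(pod_cpu_requests, node_cpu):
    """First-fit decreasing: place each pod on the first node with room."""
    free = []  # remaining millicores per node
    for req in sorted(pod_cpu_requests, reverse=True):
        if req > node_cpu:
            raise ValueError(f"pod requesting {req}m cannot fit a {node_cpu}m node")
        for i, remaining in enumerate(free):
            if req <= remaining:
                free[i] = remaining - req
                break
        else:
            free.append(node_cpu - req)  # no node had room: add one
    return len(free)


# Eight pods that actually need ~500m each, on 4000m nodes.
right_sized = nodes_needed([500] * 8, node_cpu=4000)
over_requested = nodes_needed([1500] * 8, node_cpu=4000)
print(right_sized, over_requested)
```

With 1500m requests only two pods fit per node and 1000m per node is stranded, which is the waste the right-sizing bullet refers to.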
Lambda / serverless
- ARM (cheaper).
- Reduce package size (faster cold starts, faster invocation = cheaper).
- Memory tuning — sometimes more memory = faster = cheaper net.
- Provisioned concurrency only for hot paths.
- Step Functions for orchestration vs chained Lambdas.
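The "more memory can be cheaper" point follows from how Lambda bills: duration times configured memory (GB-seconds) plus a per-request fee. The sketch below compares two configurations; the durations are hypothetical measurements and the rates are illustrative placeholders, so substitute current prices for your region and architecture.

```python
def lambda_cost(invocations, avg_ms, memory_mb, price_per_gb_s, price_per_million_requests):
    """Monthly Lambda cost: GB-seconds of compute plus the request charge."""
    gb_seconds = invocations * (avg_ms / 1000) * (memory_mb / 1024)
    request_fee = invocations / 1_000_000 * price_per_million_requests
    return gb_seconds * price_per_gb_s + request_fee


PRICE_GB_S = 0.0000166667   # illustrative rate, not a quote
PRICE_PER_M_REQ = 0.20      # illustrative rate, not a quote

# Hypothetical measurements: doubling memory cuts duration from 900 ms to 400 ms.
cost_512 = lambda_cost(10_000_000, 900, 512, PRICE_GB_S, PRICE_PER_M_REQ)
cost_1024 = lambda_cost(10_000_000, 400, 1024, PRICE_GB_S, PRICE_PER_M_REQ)
print(round(cost_512, 2), round(cost_1024, 2))
```

The larger configuration only wins when duration drops by more than half; for I/O-bound functions it usually does not, so measure before changing memory.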
Database
- Aurora Serverless v2 for spiky low-volume workloads.
- DynamoDB on-demand for unpredictable traffic; provisioned + auto-scaling for steady traffic.
- Read replicas only if read load needs them.
- Review backup retention.
- Reserved instances on RDS for steady workloads.
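The DynamoDB capacity-mode choice can be framed as a break-even utilization. One provisioned write capacity unit supports one write per second (for items up to 1 KB), so a fully used WCU handles 3600 writes per hour; dividing its effective cost per million writes by the on-demand price gives the average utilization above which provisioned is cheaper. Both prices in the example are hypothetical inputs, not quoted rates.

```python
def break_even_utilization(price_per_wcu_hour, on_demand_price_per_million_writes):
    """Average provisioned-capacity utilization above which provisioned mode
    is cheaper than on-demand, for writes of items up to 1 KB."""
    writes_per_wcu_hour = 3600  # 1 write/second at full utilization
    full_util_cost_per_million = price_per_wcu_hour / writes_per_wcu_hour * 1_000_000
    return full_util_cost_per_million / on_demand_price_per_million_writes


# Hypothetical prices: $0.00065 per WCU-hour, $0.625 per million on-demand writes.
threshold = break_even_utilization(0.00065, 0.625)
print(f"provisioned wins above ~{threshold:.0%} average utilization")
```

Auto-scaling rarely keeps utilization near 100%, so a table that averages well below the threshold (or sits idle most of the day) is a candidate for on-demand.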
Tagging discipline
Without tags, you can’t allocate cost or accountability.
- Environment (prod / staging / dev)
- Owner (team or email)
- Project / Service
- CostCenter
Enforce via SCP / Organization Policy: deny resource creation without required tags.
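Before a deny policy is switched on, it helps to audit what would fail it. A minimal sketch of such a check follows, assuming each resource's tags are available as a plain dictionary; the exact key names (`Project` in particular) are an assumption about how the tag set above is spelled in your organization.

```python
REQUIRED_TAGS = ("Environment", "Owner", "Project", "CostCenter")


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return required tag keys that are absent or blank on a resource."""
    return [key for key in required if not str(tags.get(key, "")).strip()]


# Made-up resources for illustration.
resources = {
    "i-0abc": {"Environment": "prod", "Owner": "payments", "Project": "checkout", "CostCenter": "1234"},
    "i-0def": {"Environment": "dev", "Owner": " "},
}

report = {rid: missing_tags(tags) for rid, tags in resources.items()}
print(report)
```

Resources with a non-empty list are the ones a tag-enforcing policy would have blocked at creation time.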
Visibility
- AWS: Cost Explorer, Cost & Usage Reports → Athena, Cost Anomaly Detection.
- GCP: Billing reports, Recommender, BigQuery billing export.
- Cross-cloud: Vantage, CloudHealth, Apptio Cloudability, Infracost.
- Per-PR: Infracost estimates the cost impact of a Terraform diff.
FinOps practices
- Show team / service-level cost dashboards.
- Monthly cost reviews per service.
- Per-feature unit economics ($/request, $/customer).
- Budget alerts at 50/80/100%.
- Anomaly alerts (sudden spikes).
- Quarterly right-sizing review.
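Two of the practices above reduce to small calculations: which budget thresholds have been crossed, and what a unit of work costs. The sketch below shows both; the spend, budget, and request counts are invented, and the function names are hypothetical.

```python
def crossed_thresholds(spend, budget, thresholds=(50, 80, 100)):
    """Return the alert thresholds (in %) that current spend has reached."""
    # Cross-multiplied comparison avoids float rounding at exact boundaries.
    return [t for t in thresholds if spend * 100 >= t * budget]


def cost_per_unit(monthly_cost, units):
    """Unit economics, e.g. $/request or $/customer."""
    if units <= 0:
        raise ValueError("units must be positive")
    return monthly_cost / units


alerts = crossed_thresholds(spend=850, budget=1000)
per_request = cost_per_unit(monthly_cost=12_000, units=48_000_000)
print(alerts, per_request)
```

Tracking the per-unit number alongside the total makes it easier to tell growth-driven spend from waste: the bill can rise while $/request falls.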
Common interview Qs
- Bill went up 30% this month: investigate. Cost Explorer by service; correlate with deploys / traffic; identify top contributors.
- Reduce Lambda cost. Right-size memory, ARM, batch invocations, trim provisioned concurrency, async path where possible.
- DynamoDB cost spike. Check capacity mode, hot partitions, scan vs query, TTL configured.
- Cross-AZ traffic high. Co-locate components, single-AZ for non-critical, gateway endpoints.
- Should we use spot? For batch, CI runners, dev clusters, ML training, stateless workers — yes. For stateful single-replica DBs, no.
- NAT Gateway cost. Use VPC endpoints, reduce egress, consider single shared NAT vs per-AZ depending on resilience need.
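For the "bill went up 30%" question, the first step (find the top contributors) can be sketched as a month-over-month diff by service. The input is assumed to be per-service totals already exported from Cost Explorer or a billing export; the figures below are invented.

```python
def top_cost_increases(previous, current, n=3):
    """Rank services by absolute month-over-month cost increase."""
    services = set(previous) | set(current)
    deltas = {s: current.get(s, 0) - previous.get(s, 0) for s in services}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:n]


# Invented monthly totals in dollars.
last_month = {"EC2": 4000, "S3": 800, "NAT Gateway": 300, "RDS": 1500}
this_month = {"EC2": 4200, "S3": 820, "NAT Gateway": 1900, "RDS": 1500, "CloudWatch": 250}

movers = top_cost_increases(last_month, this_month)
print(movers)
```

Ranking by absolute change rather than percentage keeps small services with large relative swings from hiding the line item that actually moved the bill; from there, correlate the top mover with deploys and traffic.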