Cost Optimization — Theory

Cost is a non-functional requirement. As with reliability or latency, set explicit targets and design for them.

Useful framings:

  • Unit economics: $/request, $/customer, $/transaction. If unit cost > unit revenue, every unit of growth adds to the loss.
  • TCO: total cost of ownership, including ops time. Sometimes paying more per resource saves more in dev/ops time.
  • Build vs buy: a managed service that’s 2× more expensive may save 5× in engineer time.
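
The framings above reduce to simple arithmetic. The sketch below is one way to put numbers on them; every figure in it (infra bills, engineer hours, the hourly rate, the request volume) is a hypothetical placeholder, not data from any real system.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    monthly_infra_usd: float  # cloud bill for this option
    monthly_eng_hours: float  # dev/ops hours spent keeping it running


def monthly_tco(option: Option, eng_hourly_usd: float) -> float:
    """Infra bill plus the cost of the engineer time the option consumes."""
    return option.monthly_infra_usd + option.monthly_eng_hours * eng_hourly_usd


def unit_cost(total_monthly_usd: float, monthly_requests: int) -> float:
    """Cost per request; compare this against revenue per request."""
    return total_monthly_usd / monthly_requests


# Hypothetical: the managed service costs 2x the infra but needs 1/5 the ops time.
self_hosted = Option("self-hosted", monthly_infra_usd=2_000, monthly_eng_hours=40)
managed = Option("managed", monthly_infra_usd=4_000, monthly_eng_hours=8)
ENG_HOURLY_USD = 100.0        # assumed fully loaded engineer cost
MONTHLY_REQUESTS = 10_000_000  # assumed traffic

for opt in (self_hosted, managed):
    tco = monthly_tco(opt, ENG_HOURLY_USD)
    per_request = unit_cost(tco, MONTHLY_REQUESTS)
    print(f"{opt.name}: TCO ${tco:,.0f}/mo, ${per_request:.6f}/request")
```

With these placeholder numbers the managed option has the higher infra bill but the lower TCO, which is the build-vs-buy point in miniature.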

The “big three”:

  1. Compute (EC2, GKE, Lambda).
  2. Storage (S3, EBS, snapshots).
  3. Network egress (often the silent giant).

Data transfer pricing is asymmetric: ingress is generally free, egress is charged. Cross-region and cross-AZ traffic is charged too. A CDN reduces egress from the origin.

Spot (preemptible) instances: cloud providers sell spare capacity at a 60-90% discount. Tradeoff: an instance can be reclaimed on short notice (about 2 minutes on AWS).

Good for: batch, ML training, stateless web tier (with redundancy), CI runners, data processing.

Bad for: single-replica stateful DBs, primary user-facing services without a fallback.
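
A workload is only "good for spot" if it reacts to the reclaim notice. On AWS the notice is served from instance metadata at `/latest/meta-data/spot/instance-action`, which returns 404 until a reclaim is scheduled. The sketch below leaves fetching out and only interprets the response body; the function names, the returned labels, and the checkpoint policy are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def seconds_until_reclaim(notice_body: Optional[str], now: datetime) -> Optional[float]:
    """Seconds left before reclaim, or None when no notice is pending."""
    if not notice_body:
        return None
    notice = json.loads(notice_body)
    # The notice carries a UTC timestamp such as "2017-09-18T08:22:00Z".
    reclaim_at = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    reclaim_at = reclaim_at.replace(tzinfo=timezone.utc)
    return (reclaim_at - now).total_seconds()


def next_step(notice_body: Optional[str], now: datetime, checkpoint_seconds: float) -> str:
    """Hypothetical policy: checkpoint only if there is enough time left."""
    remaining = seconds_until_reclaim(notice_body, now)
    if remaining is None:
        return "continue"
    if remaining >= checkpoint_seconds:
        return "checkpoint-and-drain"
    return "drain-only"


body = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
print(next_step(body, now, checkpoint_seconds=60))
```

Jobs whose checkpoint takes longer than the notice window fall into the "bad for" category unless they checkpoint continuously.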

Spot patterns:

  • Mixed instance pool (on-demand + spot) for resilience.
  • Multiple instance types/AZs to reduce simultaneous reclaim.
  • Karpenter (AWS) handles spot diversification automatically.

Commitment discounts (Reserved Instances, Savings Plans):

  • Commit for 1 or 3 years in exchange for a roughly 30-60% discount.
  • Compute Savings Plans (AWS) cover Lambda, Fargate, and EC2 across instance types and regions, which makes them the most flexible option.
  • Reserve only baseline load you’re confident in; spot/on-demand for the rest.
  • Watch for “regret risk”: locking in tech that you later migrate off of.

Scaling and scheduling:

  • Scale down aggressively at off-peak.
  • Shut down dev/staging on a schedule (nights and weekends).
  • Burst with spot for spikes.
  • Keep a cold pool for failover only.
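
The advice to commit only to baseline load can be checked numerically. The sketch below brute-forces the commitment level that minimizes cost for a demand profile; the 24-hour profile and both hourly rates are made-up assumptions, and overflow is priced at the on-demand rate (pricing it at a spot rate would shift the optimum toward a smaller commitment).

```python
def fleet_cost(demand, committed, committed_rate, on_demand_rate):
    """Total cost for a demand profile (instances needed per hour).

    Committed capacity is paid for every hour, used or not; anything
    above it is bought at the on-demand rate.
    """
    overflow_hours = sum(max(0, d - committed) for d in demand)
    return committed * committed_rate * len(demand) + overflow_hours * on_demand_rate


def best_commitment(demand, committed_rate, on_demand_rate):
    """Try every commitment level from 0 to peak and keep the cheapest."""
    return min(
        range(max(demand) + 1),
        key=lambda c: fleet_cost(demand, c, committed_rate, on_demand_rate),
    )


# Hypothetical 24-hour profile: 4 instances overnight, 10 at peak, 6 in the evening.
demand = [4] * 8 + [10] * 8 + [6] * 8
ON_DEMAND = 0.10  # assumed $/instance-hour
COMMITTED = 0.06  # assumed 40% discount for a commitment

level = best_commitment(demand, COMMITTED, ON_DEMAND)
print("commit to", level, "instances")
print("cost/day:", round(fleet_cost(demand, level, COMMITTED, ON_DEMAND), 2))
print("all on-demand:", round(fleet_cost(demand, 0, COMMITTED, ON_DEMAND), 2))
```

Under this model an instance is worth committing to only if it is busy for a larger fraction of hours than the ratio of committed to on-demand price (here 60%), which is why the peak-only instances stay uncommitted.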

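Most analytics data is rarely read again after it is written (a common rule of thumb: ~90% is never re-read). Tier by age:

  • Hot: < 30d → Standard.
  • Warm: 30-90d → IA (Infrequent Access).
  • Cold: > 90d → Glacier / Deep Archive.

Lifecycle rules automate the transitions. Watch retrieval costs: Glacier Deep Archive retrievals take hours and are billed per GB retrieved.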
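
As an illustration, the tiers above map onto an S3 lifecycle configuration roughly as follows. This is a sketch: the rule ID, prefix, and bucket name are hypothetical, and the function only builds the configuration dictionary. Applying it requires boto3 and credentials, so that call is shown as a comment.

```python
def tiering_lifecycle(prefix: str = "") -> dict:
    """Build an S3 lifecycle configuration that tiers objects by age."""
    return {
        "Rules": [
            {
                "ID": "tier-by-age",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold
                ],
            }
        ]
    }


config = tiering_lifecycle("events/")
print(config["Rules"][0]["Transitions"])

# To apply (not run here):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-analytics-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```

Swapping `GLACIER` for `DEEP_ARCHIVE` lowers storage cost further at the price of slower, costlier retrieval.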

Common surprises:

  • Cross-AZ traffic between K8s pods.
  • Database in one AZ, app in another.
  • Cross-region replication.
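  • VPC peering vs PrivateLink: peering bills only data transfer, while PrivateLink adds an hourly endpoint charge plus per-GB processing.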
  • Data transfer to internet from S3.

Solutions:

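  • VPC endpoints for AWS services, so traffic to S3, DynamoDB, etc. bypasses NAT gateway processing charges.
  • Co-locate components in same AZ (with HA caveats).
  • CloudFront in front of outbound traffic (CDN egress is typically cheaper than direct egress, and cached responses never reach the origin).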
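
To see why cross-AZ traffic surprises people, a back-of-the-envelope estimate helps. The sketch below assumes traffic is spread evenly over pods in every AZ (so a fraction (N-1)/N of calls cross an AZ boundary) and a rate of $0.01 per GB charged in each direction; both are assumptions to replace with your own topology and current pricing.

```python
def cross_az_fraction(n_azs: int) -> float:
    """Share of calls that leave the caller's AZ under uniform routing."""
    return (n_azs - 1) / n_azs


def monthly_cross_az_cost(
    service_gb_per_day: float,
    n_azs: int,
    usd_per_gb_each_way: float = 0.01,  # assumed rate, billed on both sides
    days: int = 30,
) -> float:
    crossing_gb = service_gb_per_day * cross_az_fraction(n_azs) * days
    return crossing_gb * usd_per_gb_each_way * 2


# Hypothetical: 1 TB/day of service-to-service traffic across 3 AZs.
print(round(monthly_cross_az_cost(1000, n_azs=3), 2))
# Same traffic kept inside one AZ.
print(round(monthly_cross_az_cost(1000, n_azs=1), 2))
```

Zone-aware routing trades some of that saving against the HA caveat noted above, since pinning traffic to one AZ concentrates the blast radius.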

You can’t optimize what you can’t see. Need:

  • Per-service / team cost (via tagging).
  • Per-feature cost (via instrumented dimensions).
  • Time-series (daily) of each component.
  • Anomaly alerting.

Common tools: Cost Explorer / GCP Billing reports / Vantage / CloudHealth.
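
The first and last items on that list can be prototyped in a few lines before adopting a tool. The sketch below groups billing line items by a tag and flags days whose cost exceeds the trailing mean by a multiple of the standard deviation. The record layout (`cost_usd`, `tags`), the `UNTAGGED` bucket, and the window and threshold values are assumptions, not any vendor's export format.

```python
from collections import defaultdict
from statistics import mean, stdev


def cost_by_tag(line_items, tag_key):
    """Sum cost per tag value; resources without the tag land in UNTAGGED."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "UNTAGGED")
        totals[owner] += item["cost_usd"]
    return dict(totals)


def anomalies(daily_costs, window=7, threshold=3.0):
    """Indexes of days whose cost exceeds trailing mean + threshold * stdev."""
    flagged = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        limit = mean(history) + threshold * stdev(history)
        if daily_costs[i] > limit:
            flagged.append(i)
    return flagged


items = [
    {"cost_usd": 10.0, "tags": {"team": "search"}},
    {"cost_usd": 5.0, "tags": {"team": "search"}},
    {"cost_usd": 7.0},  # untagged resource
]
print(cost_by_tag(items, "team"))

daily = [100, 102, 98, 101, 99, 100, 103, 180, 101]
print(anomalies(daily))
```

A large UNTAGGED total is itself a finding: it measures how much of the bill cannot yet be attributed to anyone.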

Practice questions, with short answers:

  1. Bill grew 30% MoM: what are the first five things to check? Top services in Cost Explorer; new resources; data egress; non-prod left running; CloudWatch Logs volume.
  2. EC2 fleet is 30% utilized: how do you fix it? Right-size, autoscale, mix in spot, consolidate, and buy Savings Plans for the baseline.
  3. DynamoDB: when is provisioned cheaper than on-demand? With steady, predictable load. On-demand costs more per request but costs nothing when idle.
  4. Spot strategy for K8s? Mixed node pools, multiple instance types, Karpenter, PodDisruptionBudgets; the app must tolerate restarts.
  5. Lambda is killing the budget. Tune memory (more memory ≠ more cost if the function finishes faster), move to ARM, shrink the deployment package, batch invocations, reduce log volume.
  6. K8s costs more than expected. Check for over-requested CPU/memory, idle node pools, control plane fees, cross-AZ egress, and orphaned PVs.
  7. Cost vs reliability tradeoff for HA? Multi-AZ for prod (mandatory); single-AZ is acceptable for some non-critical workloads with fast restore.
  8. Per-feature cost attribution? Tag dimensions on metrics; correlate with the billing usage report by tag.
  9. Build vs buy for X? Compare TCO including ops, vendor lock-in, and team capacity.
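
The Lambda memory point in question 5 is easiest to see with numbers. Lambda bills compute in GB-seconds (memory × duration) plus a per-request fee, so more memory is cheaper whenever the duration drops by a larger factor than the memory grows. The prices below are assumed list prices and the durations are invented measurements; the free tier is ignored.

```python
PRICE_PER_GB_SECOND = 0.0000166667   # assumed x86 rate; check current pricing
PRICE_PER_REQUEST = 0.20 / 1_000_000  # assumed request fee


def lambda_monthly_cost(invocations: int, memory_mb: int, avg_duration_ms: float) -> float:
    """Compute charge (GB-seconds) plus request charge for one month."""
    gb_seconds = invocations * (memory_mb / 1024) * (avg_duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND + invocations * PRICE_PER_REQUEST


# Hypothetical measurements of the same function at three memory sizes.
measurements = {512: 800, 1024: 380, 2048: 300}  # memory_mb -> avg duration in ms

for memory_mb, duration_ms in measurements.items():
    cost = lambda_monthly_cost(10_000_000, memory_mb, duration_ms)
    print(f"{memory_mb} MB @ {duration_ms} ms: ${cost:,.2f}/month")
```

In this made-up case 1024 MB is the cheapest setting: the jump from 512 MB more than halves the duration, while the jump to 2048 MB barely helps and doubles the rate.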

Common anti-patterns:

  • “Just spin up bigger” without measuring.
  • Untagged resources.
  • Retaining all logs for a year regardless of their value.
  • Public S3 with no CloudFront in front → expensive egress.
  • More NAT gateways than needed (each one has an hourly charge plus per-GB processing).
  • Multi-region by default for non-critical workloads.
  • Provisioned concurrency on rarely invoked endpoints.
  • Reserved instances bought before usage stabilized.
  • DBs over-provisioned to “be safe”.
  • Snapshot pile-up.