Cost Optimization — Theory
Frame the problem
Cost is a non-functional requirement. As with reliability or latency, set targets and design for them.
Useful framings:
- Unit economics: $/request, $/customer, $/transaction. If unit cost exceeds unit revenue, every additional unit of growth loses money.
- TCO: total cost of ownership, including ops time. Sometimes paying more per resource saves more in dev/ops time.
- Build vs buy: a managed service that costs 2× as much may still save 5× in engineer time.
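The framings above reduce to simple arithmetic. A minimal sketch, assuming made-up monthly figures (the `Option` class, the dollar amounts, the hourly rate and the request count are all hypothetical, chosen so the managed option costs 2× in infrastructure but needs 5× less engineer time):

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One way of running a capability; all figures are monthly and hypothetical."""
    name: str
    infra_cost: float       # provider bill, in dollars
    engineer_hours: float   # ops/dev time spent keeping it running
    hourly_rate: float      # loaded cost of one engineer hour

    def tco(self) -> float:
        # Total cost of ownership = provider bill + people time.
        return self.infra_cost + self.engineer_hours * self.hourly_rate


def unit_cost(total_cost: float, units: int) -> float:
    """Cost per request/customer/transaction for the period."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


self_hosted = Option("self-hosted", infra_cost=2_000, engineer_hours=40, hourly_rate=100)
managed = Option("managed", infra_cost=4_000, engineer_hours=8, hourly_rate=100)

for opt in (self_hosted, managed):
    per_request = unit_cost(opt.tco(), units=10_000_000)
    print(f"{opt.name}: TCO ${opt.tco():,.0f}/month, ${per_request:.6f}/request")
```

With these assumed numbers the managed option has the higher bill but the lower TCO; with a different hourly rate or hour count the conclusion can flip, which is the point of doing the sum.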
Cost drivers in cloud
The “big three”:
- Compute (EC2, GKE, Lambda).
- Storage (S3, EBS, snapshots).
- Network egress (often the silent giant).
Data transfer pricing is asymmetric: ingress is generally free, egress is charged. Cross-region and cross-AZ traffic is charged as well. A CDN can reduce egress from the origin.
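To see why egress can dominate quietly, it helps to price each traffic path separately. A rough sketch, assuming hypothetical per-GB rates (the numbers in `RATES_PER_GB` and the traffic volumes are placeholders, not any provider's price list):

```python
# Hypothetical $/GB rates for illustration only; check your provider's pricing.
RATES_PER_GB = {
    "ingress": 0.00,
    "cross_az": 0.01,
    "cross_region": 0.02,
    "internet_egress": 0.09,
}


def monthly_transfer_cost(gb_by_path: dict[str, float]) -> dict[str, float]:
    """Return the cost of each traffic path, given GB transferred per month."""
    costs = {}
    for path, gb in gb_by_path.items():
        if path not in RATES_PER_GB:
            raise KeyError(f"no rate configured for path {path!r}")
        costs[path] = gb * RATES_PER_GB[path]
    return costs


traffic = {"ingress": 50_000, "cross_az": 20_000, "internet_egress": 10_000}
costs = monthly_transfer_cost(traffic)
for path, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{path:16s} ${cost:>8,.2f}")
print(f"{'total':16s} ${sum(costs.values()):>8,.2f}")
```

Under these assumed rates the largest volume (ingress) costs nothing, while the smallest volume (internet egress) is the largest line item.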
Spot / preemptible
Cloud providers sell spare capacity at a 60-90% discount. Tradeoff: it can be reclaimed on short notice (about 2 minutes on AWS; roughly 30 seconds on GCP and Azure).
Good for: batch, ML training, stateless web tier (with redundancy), CI runners, data processing.
Bad for: single-replica stateful DBs, primary user-facing without fallback.
Patterns:
- Mixed instance pool (on-demand + spot) for resilience.
- Multiple instance types/AZs to reduce simultaneous reclaim.
- Karpenter (AWS) handles spot diversification automatically.
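The mixed-pool and diversification patterns can be reasoned about with two small calculations: what the blend costs, and how much capacity survives if one spot pool is reclaimed. A sketch under assumed prices and pool names (the function names, the $0.10/hour price, the 70% discount and the pool labels are all hypothetical):

```python
def blended_hourly_cost(nodes: int, on_demand_price: float,
                        spot_discount: float, on_demand_fraction: float) -> float:
    """Hourly cost of a pool where a fraction of nodes is on-demand and the rest spot."""
    if not 0 <= spot_discount < 1:
        raise ValueError("spot_discount must be in [0, 1)")
    if not 0 <= on_demand_fraction <= 1:
        raise ValueError("on_demand_fraction must be in [0, 1]")
    on_demand_nodes = nodes * on_demand_fraction
    spot_nodes = nodes - on_demand_nodes
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


def capacity_after_reclaim(spot_nodes_by_pool: dict[str, int],
                           on_demand_nodes: int, reclaimed_pool: str) -> float:
    """Fraction of total capacity left if every node in one spot pool is reclaimed."""
    total = on_demand_nodes + sum(spot_nodes_by_pool.values())
    lost = spot_nodes_by_pool.get(reclaimed_pool, 0)
    return (total - lost) / total


all_on_demand = blended_hourly_cost(20, 0.10, spot_discount=0.7, on_demand_fraction=1.0)
mixed = blended_hourly_cost(20, 0.10, spot_discount=0.7, on_demand_fraction=0.3)
print(f"all on-demand: ${all_on_demand:.2f}/h, mixed: ${mixed:.2f}/h")

# Spot nodes spread over three (instance type, AZ) pools instead of one.
pools = {"type-a/az-1": 5, "type-b/az-2": 5, "type-c/az-3": 4}
print(f"capacity left after one pool is reclaimed: "
      f"{capacity_after_reclaim(pools, 6, 'type-a/az-1'):.0%}")
```

Spreading the same 14 spot nodes across several pools means a single reclaim event removes a quarter of capacity here rather than 70% of it.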
Commitments (Reserved / Savings Plans)
- 1- or 3-year commitment, typically a 30-60% discount.
- Compute Savings Plans (AWS) cover Lambda, Fargate, EC2 across instance types/regions — most flexible.
- Reserve only baseline load you’re confident in; spot/on-demand for the rest.
- Watch for “regret risk”: committing to capacity for technology you later migrate away from.
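The "reserve only the baseline" advice follows from a break-even rule: a commitment is paid for every hour whether used or not, so it only wins if the capacity is actually in use for more than (1 - discount) of the time. A minimal sketch, with a hypothetical on-demand price and a 730-hour month as assumptions:

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of hours the capacity must be in use for a commitment to pay off."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def monthly_costs(on_demand_hourly: float, discount: float,
                  utilization: float, hours: int = 730) -> tuple[float, float]:
    """Return (committed_cost, on_demand_cost) for one month.

    The commitment is billed for all hours at the discounted rate;
    on-demand is billed only for the hours actually used.
    """
    committed = on_demand_hourly * (1 - discount) * hours
    on_demand = on_demand_hourly * utilization * hours
    return committed, on_demand


print(f"break-even at 40% discount: {break_even_utilization(0.40):.0%} utilization")
for utilization in (0.5, 0.9):
    committed, on_demand = monthly_costs(0.10, discount=0.40, utilization=utilization)
    better = "commit" if committed < on_demand else "on-demand"
    print(f"utilization {utilization:.0%}: committed ${committed:.2f}, "
          f"on-demand ${on_demand:.2f} -> {better}")
```

Load that runs only part of the day sits below the break-even line, which is why it belongs on spot or on-demand rather than under a commitment.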
Capacity vs demand alignment
- Scale down aggressively at off-peak.
- Schedule dev/staging shutdowns for nights and weekends.
- Burst with spot for spikes.
- Keep a cold pool for failover only.
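The scheduled-shutdown item is easy to quantify. A small sketch, assuming a dev environment that only needs to run 12 hours a day on weekdays (the schedule is an illustrative assumption):

```python
HOURS_PER_WEEK = 24 * 7


def scheduled_savings(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute cost saved by running only on a schedule."""
    on_hours = hours_per_day * days_per_week
    if not 0 <= on_hours <= HOURS_PER_WEEK:
        raise ValueError("schedule exceeds the hours in a week")
    return 1 - on_hours / HOURS_PER_WEEK


saving = scheduled_savings(hours_per_day=12, days_per_week=5)
print(f"running 12h x 5d instead of 24x7 saves about {saving:.0%} of compute hours")
```

This covers compute hours only; attached storage and other always-billed resources keep costing money while instances are stopped.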
Storage tiering
Most analytics data is never re-read (roughly 90%, as a rule of thumb). Tier by age:
- Hot: < 30d → Standard.
- Warm: 30-90d → IA (Infrequent Access).
- Cold: > 90d → Glacier / Deep Archive.
Lifecycle rules automate the transitions. Watch retrieval costs: Glacier Deep Archive retrievals take hours and are billed per GB.
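The age-based tiering above can be written down as a function and as a lifecycle rule. A sketch only: the thresholds mirror the list above (with exactly 90 days treated as cold, an assumption), the rule ID is hypothetical, and the dictionary follows the shape of an S3 lifecycle configuration but is merely printed here, not applied to any bucket:

```python
import json


def tier_for_age(age_days: int, warm_after: int = 30, cold_after: int = 90) -> str:
    """Map object age to an S3 storage class using the hot/warm/cold thresholds."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < warm_after:
        return "STANDARD"
    if age_days < cold_after:
        return "STANDARD_IA"
    return "GLACIER"


def lifecycle_rule(prefix: str, warm_after: int = 30, cold_after: int = 90) -> dict:
    """Build a lifecycle configuration that applies the same thresholds automatically."""
    return {
        "Rules": [
            {
                "ID": "tier-by-age",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": warm_after, "StorageClass": "STANDARD_IA"},
                    {"Days": cold_after, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


for age in (5, 45, 400):
    print(age, "days ->", tier_for_age(age))
print(json.dumps(lifecycle_rule("analytics/"), indent=2))
```

Before enabling a rule like this, it is worth estimating how often cold objects are actually retrieved, since retrieval fees and delays can outweigh the storage saving.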
Egress traps
Common surprises:
- Cross-AZ traffic between K8s pods.
- Database in one AZ, app in another.
- Cross-region replication.
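- VPC peering vs PrivateLink: the two are priced differently (PrivateLink adds hourly endpoint and per-GB processing charges).
- Data transfer from S3 to the internet.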
Solutions:
- VPC endpoints for AWS services.
- Co-locate components in same AZ (with HA caveats).
- CloudFront in front of outbound traffic.
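Finding the traps listed above usually starts with knowing which service pairs talk across AZ boundaries. A minimal sketch over flow records, assuming you already have per-flow source/destination AZ and volume (the `Flow` record, service names and AZ labels are hypothetical; real data would come from flow logs or a service mesh):

```python
from collections import defaultdict
from typing import NamedTuple


class Flow(NamedTuple):
    src: str      # source service
    dst: str      # destination service
    src_az: str
    dst_az: str
    gb: float     # volume transferred in the period


def cross_az_gb(flows: list[Flow]) -> dict[tuple[str, str], float]:
    """Sum GB per (src, dst) pair for flows that cross an AZ boundary, largest first."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for flow in flows:
        if flow.src_az != flow.dst_az:
            totals[(flow.src, flow.dst)] += flow.gb
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


flows = [
    Flow("api", "db", "az-1", "az-2", 800.0),   # app and database in different AZs
    Flow("api", "cache", "az-1", "az-1", 500.0),
    Flow("worker", "db", "az-3", "az-2", 300.0),
    Flow("api", "db", "az-2", "az-2", 200.0),
]
for (src, dst), gb in cross_az_gb(flows).items():
    print(f"{src} -> {dst}: {gb:,.0f} GB cross-AZ")
```

The top entries are the candidates for co-location or topology-aware routing, subject to the HA caveats noted above.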
Observability of cost
You can’t optimize what you can’t see. You need:
- Per-service / team cost (via tagging).
- Per-feature cost (via instrumented dimensions).
- Time-series (daily) of each component.
- Anomaly alerting.
Common tools: Cost Explorer / GCP Billing reports / Vantage / CloudHealth.
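Anomaly alerting on a daily cost series does not need much machinery to start with. A minimal sketch of one possible approach, a trailing-window threshold (the window size, the 3-sigma threshold, the 1% noise floor and the sample series are all assumptions; the tools listed above ship their own detectors):

```python
from statistics import mean, stdev


def find_anomalies(daily_costs: list[float], window: int = 7,
                   threshold: float = 3.0) -> list[int]:
    """Return indexes of days whose cost exceeds the trailing mean by `threshold` sigmas."""
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # Floor sigma at 1% of the mean so a perfectly flat series
        # does not flag tiny fluctuations.
        limit = mu + threshold * max(sigma, 0.01 * mu)
        if daily_costs[i] > limit:
            anomalies.append(i)
    return anomalies


costs = [100, 102, 98, 101, 99, 100, 103, 100, 180, 101]
for day in find_anomalies(costs):
    print(f"day {day}: ${costs[day]} looks anomalous")
```

Running the same check per service or per team tag, rather than on the total bill, makes the alert actionable, because it names an owner.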
Common interview Qs
- Bill grew 30% MoM: first 5 things to check? Top services in Cost Explorer; newly created resources; data egress; non-prod left running; CloudWatch Logs volume.
- EC2 fleet is 30% utilized: how to fix? Right-size, autoscale, mix in spot, consolidate, cover the baseline with Savings Plans.
- DynamoDB: when is provisioned cheaper than on-demand? Under steady, predictable load. On-demand costs more per request but charges nothing for idle capacity.
- Spot strategy for K8s? Mixed node pools, multiple instance types, Karpenter, PodDisruptionBudgets; the app must tolerate restarts.
- Lambda is killing the budget. Tune memory (more memory ≠ more cost if the function finishes faster), use ARM, shrink the package, batch invocations, reduce log volume.
- K8s costs more than expected. Over-requested CPU/memory; idle node pools; control plane fees; cross-AZ egress; orphaned PVs.
- Cost vs reliability tradeoff for HA? Multi-AZ for prod (treat as mandatory); single-AZ for some non-critical workloads that can be restored quickly.
- Per-feature cost attribution? Tag dimensions on metrics; correlate with the billing usage report by tag.
- Build vs buy for X? TCO including ops; vendor lock-in; team capacity.
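The Lambda memory-tuning answer rests on how compute is billed: by memory multiplied by duration (GB-seconds), plus a per-request charge. A sketch with assumed prices and assumed measurements (both price constants and the two memory/duration pairs are illustrative; real durations have to be measured for your function):

```python
PRICE_PER_GB_SECOND = 0.0000166667   # assumed rate, for illustration
PRICE_PER_MILLION_REQUESTS = 0.20    # assumed rate, for illustration


def lambda_monthly_cost(invocations: int, memory_mb: int, avg_duration_ms: float) -> float:
    """Estimate monthly cost from invocation count, configured memory and mean duration."""
    gb_seconds = invocations * (memory_mb / 1024) * (avg_duration_ms / 1000)
    compute = gb_seconds * PRICE_PER_GB_SECOND
    requests = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return compute + requests


# Hypothetical measurements: doubling memory more than halves the duration,
# because the function gets proportionally more CPU.
small = lambda_monthly_cost(10_000_000, memory_mb=512, avg_duration_ms=800)
large = lambda_monthly_cost(10_000_000, memory_mb=1024, avg_duration_ms=350)
print(f"512 MB @ 800 ms:  ${small:.2f}/month")
print(f"1024 MB @ 350 ms: ${large:.2f}/month")
```

If the duration had only dropped to 450 ms, the larger setting would cost more, so the tuning has to be driven by measurement rather than by a rule.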
Anti-patterns
- “Just spin up bigger” without measuring.
- Untagged resources.
- 1-year retention of all logs.
- Public S3 + no CloudFront → expensive egress.
- Unnecessarily many NAT gateways (each one bills hourly plus per-GB processing).
- Multi-region by default for non-critical workloads.
- Provisioned concurrency on rarely invoked endpoints.
- Reserved instances bought before usage stabilized.
- DBs over-provisioned to “be safe”.
- Snapshot pile-up.
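Two of the anti-patterns above, untagged resources and snapshot pile-up, are straightforward to audit once you have an inventory export. A sketch over plain dictionaries (the required tag keys, the 90-day cutoff and the record layout are assumptions; a real inventory would come from your provider's API or billing export):

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = {"team", "service"}  # hypothetical tagging policy


def untagged(resources: list[dict]) -> list[str]:
    """IDs of resources missing at least one required tag key."""
    return [r["id"] for r in resources
            if not REQUIRED_TAGS <= set(r.get("tags", {}))]


def stale_snapshots(snapshots: list[dict], now: datetime,
                    max_age_days: int = 90) -> list[str]:
    """IDs of snapshots older than the cutoff, oldest first."""
    cutoff = now - timedelta(days=max_age_days)
    old = [s for s in snapshots if s["created"] < cutoff]
    return [s["id"] for s in sorted(old, key=lambda s: s["created"])]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
resources = [
    {"id": "vm-1", "tags": {"team": "search", "service": "api"}},
    {"id": "vm-2", "tags": {"team": "search"}},
    {"id": "bucket-1"},
]
snapshots = [
    {"id": "snap-a", "created": now - timedelta(days=400)},
    {"id": "snap-b", "created": now - timedelta(days=10)},
    {"id": "snap-c", "created": now - timedelta(days=120)},
]
print("untagged:", untagged(resources))
print("stale snapshots:", stale_snapshots(snapshots, now))
```

A report like this is a review list, not a delete list: some old snapshots are retained deliberately for compliance or recovery.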