Cost Optimization — Theory

Cost Optimization — Theory (concise)

Cost is a non-functional requirement. Like reliability or latency, set targets and design for them.

Useful framings:

Unit economics: $/request, $/customer, $/transaction. If unit cost > unit revenue, growth is bleeding.
TCO: total cost of ownership, including ops time. Sometimes paying more per resource saves more in dev/ops time.
Build vs buy: managed service that’s 2× more expensive may save 5× in engineer time.

The “big three”:

Egress is asymmetric: ingress free, egress charged. Cross-region, cross-AZ also charged. CDN reduces egress.

Cloud sells leftover capacity at 60-90% discount. Tradeoff: can be reclaimed in ~2 minutes notice.

Good for: batch, ML training, stateless web tier (with redundancy), CI runners, data processing.

Bad for: single-replica stateful DBs, primary user-facing without fallback.

Patterns:

1-3y commit, 30-60% discount.
Compute Savings Plans (AWS) cover Lambda, Fargate, EC2 across instance types/regions — most flexible.
Reserve only baseline load you’re confident in; spot/on-demand for the rest.
Watch for “regret risk”: locking in tech that you migrate off of.

Most analytics data: 90% never re-read. Tier:

Lifecycle rules automate. Watch retrieval costs (Glacier deep retrieval is hours + $/GB).

Common surprises:

Solutions:

You can’t optimize what you can’t see. Need:

Common tools: Cost Explorer / GCP Billing reports / Vantage / CloudHealth.

Bill grew 30% MoM — first 5 things to check. Top services in Cost Explorer; new resources; data egress; non-prod left running; CloudWatch logs.
EC2 fleet is 30% utilized — fix. Right-size, autoscale, mix in spot, consolidate, savings plans on baseline.
DynamoDB — when is provisioned cheaper than on-demand? Steady predictable load. On-demand is expensive per req but pays nothing idle.
Spot strategy for K8s? Mixed node pools, multiple types, Karpenter, PodDisruptionBudgets, app must tolerate restarts.
Lambda is killing budget. Memory tuning (more memory ≠ more cost if it finishes faster), ARM, package smaller, batch invocations, reduce log volume.
K8s costs more than expected. Over-requesting CPU/memory; idle node pools; control plane fees; cross-AZ egress; orphan PVs.
Cost vs reliability tradeoff for HA? Multi-AZ for prod (mandatory), single-AZ for some non-critical with fast restore.
Per-feature cost attribution? Tag dimensions on metrics; correlate with billing usage report by tag.
Build vs buy for X? TCO including ops; vendor lock-in; team capacity.