Cost Optimization — Practical patterns
Quick wins audit (AWS)
    # unattached EBS volumes
    aws ec2 describe-volumes --filters Name=status,Values=available \
      --query 'Volumes[].[VolumeId,Size,CreateTime]' --output table

    # old snapshots
    aws ec2 describe-snapshots --owner-ids self \
      --query 'Snapshots[?StartTime<`2025-01-01`].[SnapshotId,StartTime,VolumeSize]' --output table

    # idle ALBs (no traffic in the last 7 days; uses GNU date syntax)
    # text output is tab-separated on one line, so split it before reading
    aws elbv2 describe-load-balancers \
      --query 'LoadBalancers[?Type==`application`].LoadBalancerArn' --output text |
      tr '\t' '\n' | while read -r arn; do
        bytes=$(aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB \
          --metric-name ProcessedBytes --statistics Sum --period 604800 \
          --start-time "$(date -u -d '7 days ago' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)" \
          --dimensions Name=LoadBalancer,Value="${arn#*loadbalancer/}" \
          --query 'Datapoints[0].Sum' --output text)
        [[ "$bytes" == "None" || -z "$bytes" ]] && echo "Idle: $arn"
      done

    # EC2 instances with no Owner tag
    aws ec2 describe-instances \
      --query 'Reservations[].Instances[?!not_null(Tags[?Key==`Owner`])].InstanceId' --output text

    # old AMIs
    aws ec2 describe-images --owners self \
      --query 'Images[?CreationDate<`2024-01-01`].[ImageId,CreationDate,Name]' --output table

S3 cleanup
    # incomplete multipart uploads: list them, then abort individually
    aws s3api list-multipart-uploads --bucket my-bucket
    aws s3api abort-multipart-upload --bucket my-bucket --key KEY --upload-id ID

    # enable lifecycle rules to clean up automatically
    # (this call replaces the bucket's entire existing lifecycle configuration)
    aws s3api put-bucket-lifecycle-configuration --bucket my-bucket \
      --lifecycle-configuration file://lc.json

lc.json (every rule needs a Filter; an empty prefix matches the whole bucket):

    {
      "Rules": [
        {
          "ID": "abort-multipart",
          "Status": "Enabled",
          "Filter": { "Prefix": "" },
          "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
        },
        {
          "ID": "expire-old-versions",
          "Status": "Enabled",
          "Filter": { "Prefix": "" },
          "NoncurrentVersionExpiration": { "NoncurrentDays": 30 }
        },
        {
          "ID": "tier",
          "Status": "Enabled",
          "Filter": { "Prefix": "logs/" },
          "Transitions": [
            { "Days": 30, "StorageClass": "STANDARD_IA" },
            { "Days": 90, "StorageClass": "GLACIER" }
          ],
          "Expiration": { "Days": 365 }
        }
      ]
    }

CloudWatch logs retention
    # default retention is "never expire" (expensive)
    aws logs put-retention-policy --log-group-name /aws/lambda/fn --retention-in-days 30
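Log groups that never had a policy set are the ones that keep data forever; in a DescribeLogGroups response they simply omit the `retentionInDays` key. The following is a minimal sketch that selects only those groups, so that longer retention chosen deliberately (for example for audit logs) is left alone. The helper name `groups_without_retention` and the sample log group names are hypothetical; in practice the input would come from boto3's `describe_log_groups` paginator.

```python
def groups_without_retention(log_groups):
    """Return names of log groups that have no retention policy.

    `log_groups` is a list of dicts shaped like the `logGroups` entries
    in a CloudWatch Logs DescribeLogGroups response. Groups that never
    expire omit the `retentionInDays` key entirely.
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


# Sample data standing in for an API response (names are made up).
sample = [
    {"logGroupName": "/aws/lambda/fn"},
    {"logGroupName": "/audit/trail", "retentionInDays": 3653},
    {"logGroupName": "/ecs/api", "retentionInDays": 14},
]
print(groups_without_retention(sample))
```

Each returned name can then be passed to `put-retention-policy` as shown in this section.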

    # all log groups (overwrites any existing retention, including longer ones)
    aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text |
      tr '\t' '\n' | while read -r g; do
        aws logs put-retention-policy --log-group-name "$g" --retention-in-days 30
      done

VPC endpoints (avoid NAT egress)
    resource "aws_vpc_endpoint" "s3" {
      vpc_id            = var.vpc_id
      service_name      = "com.amazonaws.eu-west-1.s3"
      vpc_endpoint_type = "Gateway"
      route_table_ids   = var.private_route_table_ids
    }

    resource "aws_vpc_endpoint" "ssm" {
      vpc_id              = var.vpc_id
      service_name        = "com.amazonaws.eu-west-1.ssm"
      vpc_endpoint_type   = "Interface"
      subnet_ids          = var.private_subnet_ids
      private_dns_enabled = true
      security_group_ids  = [aws_security_group.endpoint.id]
    }

Other useful endpoints: dynamodb, ssm, secretsmanager, ecr.api, ecr.dkr, logs, sts, kms. S3 and DynamoDB are Gateway endpoints with no hourly charge; the rest are Interface endpoints billed per hour and per GB, so add them only where the NAT traffic they replace justifies it.
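Whether an Interface endpoint pays for itself depends on how much traffic it takes off the NAT gateway. The sketch below works out the monthly break-even volume. All prices are placeholder assumptions passed in as parameters, not figures from this page; check current pricing for your region before relying on the result. The function name `break_even_gb` is hypothetical.

```python
def break_even_gb(az_count, endpoint_hourly_price, endpoint_price_per_gb,
                  nat_price_per_gb, hours_per_month=730):
    """Monthly GB above which an Interface endpoint is cheaper than NAT.

    Compares the endpoint's fixed hourly charge (one ENI per AZ) plus its
    per-GB charge against NAT gateway data-processing charges for the same
    traffic. Returns None when the endpoint never breaks even.
    """
    fixed_monthly = az_count * endpoint_hourly_price * hours_per_month
    saving_per_gb = nat_price_per_gb - endpoint_price_per_gb
    if saving_per_gb <= 0:
        return None
    return fixed_monthly / saving_per_gb


# Assumed example prices in USD (illustrative only).
threshold = break_even_gb(
    az_count=3,
    endpoint_hourly_price=0.01,
    endpoint_price_per_gb=0.01,
    nat_price_per_gb=0.045,
)
print(f"Interface endpoint pays off above ~{threshold:.0f} GB/month")
```

Gateway endpoints have no fixed charge in this model, which is why S3 and DynamoDB endpoints are usually worth adding regardless of volume.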
Schedule dev/staging shutdown
    # AWS Lambda invoked by an EventBridge cron at 19:00 on weekdays
    import boto3

    ec2 = boto3.client('ec2')

    def lambda_handler(event, ctx):
        # Only running instances tagged Env=dev or Env=staging
        filters = [
            {'Name': 'tag:Env', 'Values': ['dev', 'staging']},
            {'Name': 'instance-state-name', 'Values': ['running']},
        ]
        # Paginate so large accounts are not truncated to the first page
        ids = [
            i['InstanceId']
            for page in ec2.get_paginator('describe_instances').paginate(Filters=filters)
            for r in page['Reservations']
            for i in r['Instances']
        ]
        if ids:
            ec2.stop_instances(InstanceIds=ids)

Start the instances again at 08:00 with a second scheduled rule.
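The morning counterpart can mirror the shutdown function. This is a sketch under the same tagging assumption (`Env` set to `dev` or `staging`); the helper name `stopped_instance_ids` is hypothetical, and the schedule would be something like `cron(0 8 ? * MON-FRI *)`, which EventBridge rules evaluate in UTC. Note that it starts every stopped instance carrying those tags, including any that were stopped on purpose.

```python
def stopped_instance_ids(ec2, envs=("dev", "staging")):
    """Collect IDs of stopped instances whose Env tag is in `envs`.

    `ec2` is a boto3 EC2 client (or any object with the same
    `get_paginator` interface, which keeps this helper easy to test).
    """
    filters = [
        {"Name": "tag:Env", "Values": list(envs)},
        {"Name": "instance-state-name", "Values": ["stopped"]},
    ]
    ids = []
    for page in ec2.get_paginator("describe_instances").paginate(Filters=filters):
        for reservation in page["Reservations"]:
            ids.extend(i["InstanceId"] for i in reservation["Instances"])
    return ids


def lambda_handler(event, ctx):
    # Imported here so the helper above can be exercised without AWS access.
    import boto3

    ec2 = boto3.client("ec2")
    ids = stopped_instance_ids(ec2)
    if ids:
        ec2.start_instances(InstanceIds=ids)
    return {"started": ids}
```

Keeping the lookup in its own function makes it possible to check the filter logic with a stub client before wiring the handler to a schedule.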
Karpenter (Kubernetes auto-provisioning, AWS)
    # karpenter.sh/v1 schema: expireAfter sits under template.spec, and the
    # consolidation policy formerly called WhenUnderutilized is WhenEmptyOrUnderutilized
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata: { name: default }
    spec:
      template:
        spec:
          requirements:
            - key: kubernetes.io/arch
              operator: In
              values: [amd64, arm64]
            - key: karpenter.sh/capacity-type
              operator: In
              values: [spot, on-demand]
            - key: node.kubernetes.io/instance-type
              operator: In
              values: [m6a.large, m6g.large, m7a.large, m7g.large]
          nodeClassRef: { group: karpenter.k8s.aws, kind: EC2NodeClass, name: default }
          expireAfter: 720h
      disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized
      limits:
        cpu: 1000

DynamoDB capacity mode switch
    # go on-demand for spiky / unknown load
    aws dynamodb update-table --table-name x --billing-mode PAY_PER_REQUEST

    # switch back to provisioned once load is steady
    # (auto scaling is configured separately through Application Auto Scaling)
    aws dynamodb update-table --table-name x \
      --billing-mode PROVISIONED \
      --provisioned-throughput ReadCapacityUnits=10,WriteCapacityUnits=10

Lambda right-sizing
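Use AWS Compute Optimizer or the lambda-power-tuning Step Functions state machine to pick a memory size. The typical sweet spot is 512 MB to 1024 MB for I/O-bound functions; CPU-bound functions often do better with more, because Lambda allocates CPU in proportion to memory.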
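More memory is not automatically more expensive: compute is billed in GB-seconds, so a larger setting that finishes faster can cost less. The sketch below compares measured durations at several memory sizes. The durations and the per-GB-second price are illustrative assumptions, not measurements or quoted prices, and the per-request charge is left out because it does not depend on memory. The function names are hypothetical.

```python
def invocation_cost(memory_mb, duration_ms, price_per_gb_second):
    """Compute cost of one invocation from memory size and billed duration."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price_per_gb_second


def cheapest_memory(durations_by_memory, price_per_gb_second):
    """Return the memory size (MB) with the lowest cost per invocation.

    `durations_by_memory` maps memory size in MB to the average billed
    duration in milliseconds observed at that size.
    """
    return min(
        durations_by_memory,
        key=lambda mb: invocation_cost(mb, durations_by_memory[mb], price_per_gb_second),
    )


# Hypothetical measurements, e.g. from a power-tuning run.
measured = {512: 800, 1024: 380, 2048: 300}
ASSUMED_PRICE = 0.0000166667  # USD per GB-second; verify for your region and architecture

for mb, ms in measured.items():
    print(mb, "MB:", invocation_cost(mb, ms, ASSUMED_PRICE))
print("cheapest:", cheapest_memory(measured, ASSUMED_PRICE), "MB")
```

With these sample numbers, doubling memory from 512 MB to 1024 MB more than halves the duration, so the larger setting is both faster and cheaper; going to 2048 MB buys little extra speed and costs more.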
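
    # arm64 (Graviton) is ~20% cheaper; the same code runs unchanged unless it
    # ships native binaries. The architecture is set when code is deployed
    # (fn.zip is a placeholder for your deployment package).
    aws lambda update-function-code --function-name fn \
      --zip-file fileb://fn.zip --architectures arm64

Infracost (per-PR estimates)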
    - uses: infracost/actions/setup@v3
      with:
        api-key: ${{ secrets.INFRACOST_API_KEY }}
    - run: |
        infracost breakdown --path . --format json --out-file breakdown.json
        infracost output --path breakdown.json --format github-comment > comment.md
    - run: gh pr comment ${{ github.event.pull_request.number }} -F comment.md
      env:
        GH_TOKEN: ${{ github.token }}  # gh needs a token inside Actions

Tagging enforcement (SCP)
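    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Deny",
          "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
          "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*"],
          "Condition": { "Null": { "aws:RequestTag/Owner": "true" } }
        }
      ]
    }

Scope the Deny to the resource types being created. With "Resource": "*", ec2:RunInstances is also evaluated against the image, subnet, and other resources in the request, which never carry request tags, so every launch would be denied.

Track per-feature cost via custom metrics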
Emit metrics with dimensions that identify the feature, then cross-reference them with billing data.

    metrics.add('FeatureRequest', 1, { feature: 'imageProcessing', tier: 'premium' });

Then attribute infrastructure cost roughly in proportion to each dimension value's share of the metric volume.
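The proportional split can be sketched in a few lines. This assumes request count is a reasonable proxy for resource consumption, which will not hold if one feature is far heavier per request. Apart from `imageProcessing`, the feature names and all numbers are made up, and `attribute_cost` is a hypothetical helper name.

```python
def attribute_cost(total_cost, volume_by_feature):
    """Split `total_cost` across features in proportion to metric volume.

    `volume_by_feature` maps a feature name to its metric count over the
    billing period. Returns a dict of feature name to attributed cost.
    """
    total_volume = sum(volume_by_feature.values())
    if total_volume == 0:
        # No traffic recorded: nothing to attribute.
        return {feature: 0.0 for feature in volume_by_feature}
    return {
        feature: total_cost * volume / total_volume
        for feature, volume in volume_by_feature.items()
    }


# Hypothetical monthly figures: a 1200 USD service bill and request counts.
volumes = {"imageProcessing": 600_000, "search": 300_000, "export": 100_000}
shares = attribute_cost(1200.0, volumes)
for feature, cost in shares.items():
    print(f"{feature}: {cost:.2f}")
```

If per-request weight differs a lot between features, the same function works with a weighted volume (for example, summed duration instead of request count).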
Useful tools
- AWS Cost Explorer / Compute Optimizer / Trusted Advisor.
- Vantage, Cloudability, CloudHealth, Datadog Cost — multi-cloud.
- Infracost — per-PR Terraform cost.
- Kubecost / OpenCost — K8s cost attribution.
- Karpenter — autoscaling on AWS.
- CloudCustodian — policy-driven cleanup.
Canonical “30-day cost cleanup” punch-list
- Tag all resources by Owner / Env / Service.
- CloudWatch Logs retention ≤ 30d for non-compliance logs.
- S3 lifecycle: tier + expire + abort multipart.
- Delete unattached EBS, unused snapshots, old AMIs.
- Stop dev/staging out-of-hours.
- VPC endpoints for S3 + DynamoDB.
- Move stateless workers to spot.
- Move ARM-friendly workloads to Graviton.
- Buy Savings Plans for baseline compute.
- Set per-team budget alerts.
- Review NAT GW usage, consolidate if HA isn’t critical.
- Cost dashboards per service.