
Cost Optimization — Practical

# unattached EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available \
  --query 'Volumes[].[VolumeId,Size,CreateTime]' --output table
# old snapshots
aws ec2 describe-snapshots --owner-ids self \
  --query 'Snapshots[?StartTime<`2025-01-01`].[SnapshotId,StartTime,VolumeSize]' --output table
# idle ALBs (no traffic in the last 7 days; uses GNU date)
aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?Type=='application'].LoadBalancerArn" --output text | tr '\t' '\n' |
while read -r arn; do
  [ -n "$arn" ] || continue
  bytes=$(aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB \
    --metric-name ProcessedBytes --statistics Sum --period 604800 \
    --start-time "$(date -u -d '7 days ago' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)" \
    --dimensions Name=LoadBalancer,Value="${arn#*:loadbalancer/}" \
    --query 'Datapoints[0].Sum' --output text)
  [[ "$bytes" == "None" || -z "$bytes" ]] && echo "Idle: $arn"
done
# EC2 instances without an Owner tag
aws ec2 describe-instances \
  --query 'Reservations[].Instances[] | [?!not_null(Tags[?Key==`Owner`])].InstanceId' --output text
# old AMIs
aws ec2 describe-images --owners self \
  --query 'Images[?CreationDate<`2024-01-01`].[ImageId,CreationDate,Name]' --output table
# incomplete multipart uploads
aws s3api list-multipart-uploads --bucket my-bucket
aws s3api abort-multipart-upload --bucket b --key k --upload-id ID
# enable lifecycle to auto-clean
aws s3api put-bucket-lifecycle-configuration --bucket b --lifecycle-configuration file://lc.json
{
  "Rules": [
    {
      "ID": "abort-multipart",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    },
    {
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "NoncurrentVersionExpiration": { "NoncurrentDays": 30 }
    },
    {
      "ID": "tier",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
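
To sanity-check the "tier" rule before applying it, the schedule can be modelled in a few lines of Python. This is only a sketch: `storage_class_at` is a hypothetical helper, not part of any AWS SDK, and it assumes objects start in STANDARD and that transitions take effect exactly on the configured day, whereas S3 applies lifecycle actions asynchronously.

```python
from typing import Dict, List, Optional


def storage_class_at(
    age_days: int,
    transitions: List[Dict],
    expiration_days: Optional[int] = None,
    initial: str = "STANDARD",
) -> Optional[str]:
    """Return the storage class at a given object age, or None once expired."""
    if expiration_days is not None and age_days >= expiration_days:
        return None
    current = initial
    # Walk the transitions in ascending day order; the last one reached wins.
    for step in sorted(transitions, key=lambda t: t["Days"]):
        if age_days >= step["Days"]:
            current = step["StorageClass"]
    return current


# Values copied from the "tier" rule in the lifecycle configuration.
TIER_TRANSITIONS = [
    {"Days": 30, "StorageClass": "STANDARD_IA"},
    {"Days": 90, "StorageClass": "GLACIER"},
]

for age in (10, 45, 120, 400):
    print(age, storage_class_at(age, TIER_TRANSITIONS, expiration_days=365))
```

Running it over a few ages shows at a glance how long data sits in each tier before the rule deletes it.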
# default retention is "never expire", so log storage cost grows without bound
aws logs put-retention-policy --log-group-name /aws/lambda/fn --retention-in-days 30
# every log group that has no retention policy yet (existing policies are left alone)
aws logs describe-log-groups --query 'logGroups[?!retentionInDays].logGroupName' --output text | tr '\t' '\n' |
while read -r g; do [ -n "$g" ] && aws logs put-retention-policy --log-group-name "$g" --retention-in-days 30; done
# Gateway endpoint: S3 traffic from private subnets bypasses the NAT gateway
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.eu-west-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# Interface endpoint: one ENI per subnet, resolved through private DNS
resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.eu-west-1.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  private_dns_enabled = true
  security_group_ids  = [aws_security_group.endpoint.id]
}

Other useful endpoints: dynamodb (Gateway type, like S3); secretsmanager, ecr.api, ecr.dkr, logs, sts, kms (Interface type, like ssm).
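
Gateway endpoints carry no hourly charge, but interface endpoints are billed per hour per AZ plus per GB, so each one only pays off above a certain traffic volume. The sketch below estimates that break-even point. The function name is hypothetical and the prices in the example are placeholders, not quoted AWS rates; substitute the current figures for your region.

```python
HOURS_PER_MONTH = 730


def interface_endpoint_breakeven_gb(
    az_count: int,
    endpoint_hourly: float,
    endpoint_per_gb: float,
    nat_per_gb: float,
) -> float:
    """Monthly GB above which an interface endpoint beats NAT data processing."""
    saving_per_gb = nat_per_gb - endpoint_per_gb
    if saving_per_gb <= 0:
        # The endpoint never pays for itself if it is not cheaper per GB.
        return float("inf")
    fixed_monthly = az_count * endpoint_hourly * HOURS_PER_MONTH
    return fixed_monthly / saving_per_gb


# ASSUMPTION: placeholder prices in USD, for illustration only.
breakeven = interface_endpoint_breakeven_gb(
    az_count=3, endpoint_hourly=0.01, endpoint_per_gb=0.01, nat_per_gb=0.045
)
print(f"break-even: {breakeven:.0f} GB/month")
```

If a service moves less data than the break-even figure through the NAT gateway each month, leaving it on NAT is the cheaper option.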

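# AWS Lambda invoked by an EventBridge cron rule at 19:00 on weekdays
import boto3

ec2 = boto3.client('ec2')

def lambda_handler(event, ctx):
    # Paginate so large accounts are fully covered; filter on tag and state server-side.
    pages = ec2.get_paginator('describe_instances').paginate(Filters=[
        {'Name': 'tag:Env', 'Values': ['dev', 'staging']},
        {'Name': 'instance-state-name', 'Values': ['running']},
    ])
    ids = [i['InstanceId'] for p in pages for r in p['Reservations'] for i in r['Instances']]
    if ids:
        ec2.stop_instances(InstanceIds=ids)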

Start the instances again at 08:00 with a second cron rule and a handler that calls start_instances.
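
A minimal sketch of the morning counterpart, assuming the same `Env` tag values as the stop handler and a second EventBridge rule that you define yourself. The helper name `stopped_instance_ids` is hypothetical; the boto3 calls (`get_paginator`, `describe_instances`, `start_instances`) are the standard EC2 client methods.

```python
def stopped_instance_ids(reservations):
    """Collect the IDs of stopped instances from describe_instances reservations."""
    return [
        inst["InstanceId"]
        for res in reservations
        for inst in res["Instances"]
        if inst["State"]["Name"] == "stopped"
    ]


def lambda_handler(event, ctx):
    # Imported here so the helper above can be exercised without AWS access.
    import boto3

    ec2 = boto3.client("ec2")
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "tag:Env", "Values": ["dev", "staging"]}]
    )
    ids = [i for page in pages for i in stopped_instance_ids(page["Reservations"])]
    if ids:
        ec2.start_instances(InstanceIds=ids)
    return {"started": ids}
```

Note that this starts every stopped instance carrying the tag, including ones someone stopped on purpose; an extra opt-out tag would be one way to exclude those.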

Karpenter (Kubernetes auto-provisioning, AWS)

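apiVersion: karpenter.sh/v1
kind: NodePool
metadata: { name: default }
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64, arm64]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [m6a.large, m6g.large, m7a.large, m7g.large]
      # v1 requires group, kind and name on nodeClassRef
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      # in v1, expireAfter lives on the node template rather than under disruption
      expireAfter: 720h
  disruption:
    # v1 name for the v1beta1 policy WhenUnderutilized
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  limits:
    cpu: 1000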
# go on-demand for spiky / unknown load
aws dynamodb update-table --table-name x --billing-mode PAY_PER_REQUEST
# switch back to provisioned w/ auto-scaling once steady
aws dynamodb update-table --table-name x \
  --billing-mode PROVISIONED \
  --provisioned-throughput ReadCapacityUnits=10,WriteCapacityUnits=10
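
Whether a table has become "steady" enough for provisioned mode is ultimately a cost comparison. The sketch below puts the two billing modes side by side for one month. All function names are hypothetical and the prices in the example are placeholders, not quoted AWS rates; it also assumes the provisioned capacity is large enough to serve the traffic without throttling.

```python
HOURS_PER_MONTH = 730


def provisioned_monthly(rcu: int, wcu: int, rcu_hour_price: float, wcu_hour_price: float) -> float:
    """Cost of keeping rcu/wcu provisioned for a whole month, used or not."""
    return HOURS_PER_MONTH * (rcu * rcu_hour_price + wcu * wcu_hour_price)


def on_demand_monthly(reads: int, writes: int, per_million_reads: float, per_million_writes: float) -> float:
    """Cost of the same month when every request is billed individually."""
    return reads / 1e6 * per_million_reads + writes / 1e6 * per_million_writes


def cheaper_mode(provisioned_cost: float, on_demand_cost: float) -> str:
    """Return the --billing-mode value with the lower monthly cost."""
    return "PROVISIONED" if provisioned_cost < on_demand_cost else "PAY_PER_REQUEST"


# ASSUMPTION: placeholder prices in USD and a hypothetical low-traffic month.
prov = provisioned_monthly(rcu=10, wcu=10, rcu_hour_price=0.00013, wcu_hour_price=0.00065)
ondemand = on_demand_monthly(reads=5_000_000, writes=1_000_000,
                             per_million_reads=0.125, per_million_writes=0.625)
mode = cheaper_mode(prov, ondemand)
print(f"provisioned={prov:.2f} on_demand={ondemand:.2f} -> {mode}")
```

Feeding in request counts from CloudWatch for a few recent months gives a rough signal for when to switch.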

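Right-size Lambda memory with AWS Compute Optimizer or the AWS Lambda Power Tuning state machine (Step Functions). A typical sweet spot is 512 MB to 1024 MB for I/O-bound functions; CPU-bound functions often do better with more, because CPU allocation scales with memory.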
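
The trade-off behind that sweet spot is that compute cost is memory multiplied by billed duration, so more memory is cheaper only if it shortens the run enough. A minimal sketch of the comparison follows. The function names are hypothetical, the per-GB-second price is an illustrative default rather than a quoted rate, the durations are made-up measurements, and the flat per-request fee is ignored because it does not depend on memory.

```python
from typing import Dict

# ASSUMPTION: illustrative price; check current pricing for your region and architecture.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(memory_mb: int, duration_ms: float,
                    price_per_gb_second: float = PRICE_PER_GB_SECOND) -> float:
    """Compute cost of one invocation: GB allocated x seconds billed x unit price."""
    return (memory_mb / 1024) * (duration_ms / 1000) * price_per_gb_second


def cheapest_memory(measured_ms: Dict[int, float]) -> int:
    """Return the memory size (MB) with the lowest per-invocation compute cost."""
    return min(measured_ms, key=lambda mb: invocation_cost(mb, measured_ms[mb]))


# Hypothetical measurements: average duration in ms at each memory setting.
measured = {256: 1800.0, 512: 800.0, 1024: 420.0, 2048: 400.0}
best = cheapest_memory(measured)
print(f"cheapest setting: {best} MB")
```

In this made-up data set the duration stops improving past 1024 MB, so the largest setting costs the most per invocation.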

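# arm64 (Graviton) is ~20% cheaper; same code unless it ships native binaries, which need an arm64 build
# the architecture is set with the code package (fn.zip is a placeholder), not with update-function-configuration
aws lambda update-function-code --function-name fn --architectures arm64 --zip-file fileb://fn.zip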
# GitHub Actions steps: comment the Infracost estimate on each pull request
- uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}
- run: |
    infracost breakdown --path . --format json --out-file breakdown.json
    infracost output --path breakdown.json --format github-comment > comment.md
- run: gh pr comment ${{ github.event.pull_request.number }} -F comment.md
  env:
    GH_TOKEN: ${{ github.token }}
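{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedInstances",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": { "Null": { "aws:RequestTag/Owner": "true" } }
    },
    {
      "Sid": "DenyUntaggedDatabases",
      "Effect": "Deny",
      "Action": "rds:CreateDBInstance",
      "Resource": "*",
      "Condition": { "Null": { "aws:RequestTag/Owner": "true" } }
    }
  ]
}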

Emit metrics with a dimension that names the feature being served, then cross-reference metric volume against the bill.

metrics.add('FeatureRequest', 1, { feature: 'imageProcessing', tier: 'premium' });

Then attribute infrastructure cost to each feature roughly in proportion to its share of the metric volume.
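
As a rough illustration of that proportional split, the sketch below divides one service's monthly bill across features by request count. The function name, the feature names other than `imageProcessing`, and all figures are hypothetical.

```python
from typing import Dict


def attribute_cost(total_cost: float, volume_by_feature: Dict[str, float]) -> Dict[str, float]:
    """Split total_cost across features in proportion to their metric volume."""
    total_volume = sum(volume_by_feature.values())
    if total_volume <= 0:
        raise ValueError("no metric volume to attribute cost against")
    return {
        feature: total_cost * volume / total_volume
        for feature, volume in volume_by_feature.items()
    }


# Hypothetical month: a $1,200 shared service and request counts per feature.
shares = attribute_cost(1200.0, {"imageProcessing": 750_000, "search": 250_000})
print(shares)  # {'imageProcessing': 900.0, 'search': 300.0}
```

Request count is a coarse proxy; weighting by a resource-aware metric such as processing time would be a natural refinement where features differ widely in cost per request.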

  • AWS Cost Explorer / Compute Optimizer / Trusted Advisor.
  • Vantage, Cloudability, CloudHealth, Datadog Cost — multi-cloud.
  • Infracost — per-PR Terraform cost.
  • Kubecost / OpenCost — K8s cost attribution.
  • Karpenter — autoscaling on AWS.
  • CloudCustodian — policy-driven cleanup.

Canonical “30-day cost cleanup” punch-list

  • Tag all resources by Owner / Env / Service.
  • CloudWatch Logs retention ≤ 30d for non-compliance logs.
  • S3 lifecycle: tier + expire + abort multipart.
  • Delete unattached EBS, unused snapshots, old AMIs.
  • Stop dev/staging out-of-hours.
  • VPC endpoints for S3 + DynamoDB.
  • Move stateless workers to spot.
  • Move ARM-friendly workloads to Graviton.
  • Buy Savings Plans for baseline compute.
  • Set per-team budget alerts.
  • Review NAT GW usage, consolidate if HA isn’t critical.
  • Cost dashboards per service.