Cost Optimization — Practical patterns

Quick wins audit (AWS)

# unattached EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available \
  --query 'Volumes[].[VolumeId,Size,CreateTime]' --output table

# old snapshots
aws ec2 describe-snapshots --owner-ids self \
  --query 'Snapshots[?StartTime<`2025-01-01`].[SnapshotId,StartTime,VolumeSize]' --output table

# idle ALBs (no traffic in the last 7 days); NLBs report under AWS/NetworkELB instead
# `date -d` is GNU date syntax
aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[?Type==`application`].LoadBalancerArn' --output text | tr '\t' '\n' |
  while read -r arn; do
    bytes=$(aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB \
      --metric-name ProcessedBytes --statistics Sum --period 604800 \
      --start-time "$(date -u -d '7 days ago' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)" \
      --dimensions Name=LoadBalancer,Value="${arn#*:loadbalancer/}" \
      --query 'Datapoints[0].Sum' --output text)
    [[ "$bytes" == "None" || -z "$bytes" ]] && echo "Idle: $arn"
  done

# EC2 instances without an Owner tag
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?!not_null(Tags[?Key==`Owner`])][].InstanceId' --output text

# old AMIs
aws ec2 describe-images --owners self \
  --query 'Images[?CreationDate<`2024-01-01`].[ImageId,CreationDate,Name]'

S3 cleanup

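# list incomplete multipart uploads, then abort one by key and upload ID
aws s3api list-multipart-uploads --bucket my-bucket
aws s3api abort-multipart-upload --bucket my-bucket --key k --upload-id ID

# add lifecycle rules to clean up automatically (this replaces any existing lifecycle configuration on the bucket)
aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration file://lc.json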

{
  "Rules": [
    {
      "ID": "abort-multipart",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    },
    {
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "NoncurrentVersionExpiration": { "NoncurrentDays": 30 }
    },
    {
      "ID": "tier",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30,  "StorageClass": "STANDARD_IA" },
        { "Days": 90,  "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
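
To make the effect of the "tier" rule concrete, here is a minimal Python sketch that reports which storage class an object would be in at a given age. The function name `storage_class_at` is hypothetical, the sketch assumes objects are uploaded as STANDARD, and it ignores the fact that S3 applies lifecycle actions asynchronously, so real transitions can lag the configured day.

```python
def storage_class_at(rule: dict, age_days: int) -> str:
    """Return the storage class an object of the given age would have under one rule."""
    expiration = rule.get("Expiration", {}).get("Days")
    if expiration is not None and age_days >= expiration:
        return "EXPIRED"
    current = "STANDARD"  # assumption: objects start in STANDARD
    # Apply transitions in ascending day order; the last one reached wins.
    for transition in sorted(rule.get("Transitions", []), key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


# The "tier" rule from lc.json above.
tier_rule = {
    "ID": "tier",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"},
    ],
    "Expiration": {"Days": 365},
}

for age in (10, 30, 90, 365):
    print(age, storage_class_at(tier_rule, age))
```

Under that rule a log object spends 30 days in STANDARD, 60 days in STANDARD_IA, and 275 days in GLACIER before it expires.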

CloudWatch logs retention

# default retention is "Never expire", so stored log volume (and cost) keeps growing
aws logs put-retention-policy --log-group-name /aws/lambda/fn --retention-in-days 30

# every log group that has no retention set yet (leaves existing policies untouched)
aws logs describe-log-groups --query 'logGroups[?!retentionInDays].logGroupName' --output text | tr '\t' '\n' |
  while read -r g; do aws logs put-retention-policy --log-group-name "$g" --retention-in-days 30; done

VPC endpoints (avoid NAT egress)

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.eu-west-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.eu-west-1.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  private_dns_enabled = true
  security_group_ids  = [aws_security_group.endpoint.id]
}

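Other useful endpoints: dynamodb (Gateway type, like S3, with no endpoint charge), and the Interface endpoints secretsmanager, ecr.api, ecr.dkr, logs, sts, kms. Interface endpoints are billed per hour per AZ plus per GB, so add one only where it displaces enough NAT traffic to pay for itself.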
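
Whether an Interface endpoint pays for itself depends on how much traffic it takes off the NAT gateway. The sketch below estimates the monthly break-even volume. The function name and all prices are placeholder assumptions rather than quoted rates, so substitute the current figures for your region. It also assumes the NAT gateway stays in place for other traffic, so its hourly charge is unchanged.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def endpoint_break_even_gb(
    nat_per_gb: float,
    endpoint_per_gb: float,
    endpoint_per_az_hour: float,
    az_count: int,
) -> float:
    """Monthly GB above which an interface endpoint costs less than NAT data processing."""
    if nat_per_gb <= endpoint_per_gb:
        raise ValueError("endpoint per-GB charge must be below the NAT per-GB charge")
    fixed_monthly = endpoint_per_az_hour * az_count * HOURS_PER_MONTH
    return fixed_monthly / (nat_per_gb - endpoint_per_gb)


# Placeholder prices (assumptions): $0.045/GB NAT processing,
# $0.01/GB and $0.01 per AZ-hour for the endpoint, deployed in 3 AZs.
break_even = endpoint_break_even_gb(0.045, 0.01, 0.01, 3)
print(f"break-even: {break_even:.0f} GB/month")
```

With those placeholder numbers the endpoint wins once the service carries roughly 626 GB per month. Gateway endpoints have no such threshold because they carry no endpoint charge.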

Schedule dev/staging shutdown

# AWS Lambda invoked by an EventBridge cron rule at 19:00 on weekdays
import boto3

ec2 = boto3.client('ec2')

def lambda_handler(event, ctx):
    # Filter server-side for running dev/staging instances and read every result page
    pages = ec2.get_paginator('describe_instances').paginate(Filters=[
        {'Name': 'tag:Env', 'Values': ['dev', 'staging']},
        {'Name': 'instance-state-name', 'Values': ['running']}])
    ids = [i['InstanceId'] for p in pages for r in p['Reservations'] for i in r['Instances']]
    if ids:
        ec2.stop_instances(InstanceIds=ids)

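Start the instances again at 08:00 with a second cron rule whose handler calls ec2.start_instances. EventBridge rule cron expressions are evaluated in UTC, so shift both times to match the team's time zone.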
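
A quick way to size the benefit of this schedule is to compare running hours against the 168 hours in a week. The sketch below does that arithmetic. The function names are hypothetical, and it counts instance-hours only: attached EBS volumes and similar resources are still billed while instances are stopped.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_running_hours(start_hour: int, stop_hour: int, days_per_week: int = 5) -> int:
    """Hours per week an instance runs when started and stopped on a fixed daily schedule."""
    if not 0 <= start_hour < stop_hour <= 24:
        raise ValueError("expected 0 <= start_hour < stop_hour <= 24")
    return (stop_hour - start_hour) * days_per_week


def compute_savings_fraction(start_hour: int, stop_hour: int, days_per_week: int = 5) -> float:
    """Fraction of always-on instance-hours avoided by the schedule."""
    running = weekly_running_hours(start_hour, stop_hour, days_per_week)
    return 1 - running / HOURS_PER_WEEK


print(weekly_running_hours(8, 19))                # 55
print(f"{compute_savings_fraction(8, 19):.1%}")   # 67.3%
```

Running 08:00 to 19:00 on weekdays means 55 of 168 hours, so about two thirds of the on-demand instance-hours go away.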

Karpenter (Kubernetes auto-provisioning, AWS)

apiVersion: karpenter.sh/v1
kind: NodePool
metadata: { name: default }
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64, arm64]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [m6a.large, m6g.large, m7a.large, m7g.large]
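      # karpenter.sh/v1 requires group and kind on nodeClassRef, and expireAfter sits on the node template
      nodeClassRef: { group: karpenter.k8s.aws, kind: EC2NodeClass, name: default }
      expireAfter: 720h
  disruption:
    # v1 renamed WhenUnderutilized to WhenEmptyOrUnderutilized
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m   # example value, tune as needed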
  limits:
    cpu: 1000

DynamoDB capacity mode switch

# go on-demand for spiky / unknown load
aws dynamodb update-table --table-name x --billing-mode PAY_PER_REQUEST

# switch back to provisioned once load is steady (auto scaling is configured separately, via Application Auto Scaling)
aws dynamodb update-table --table-name x \
  --billing-mode PROVISIONED \
  --provisioned-throughput ReadCapacityUnits=10,WriteCapacityUnits=10
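
The choice between the two modes comes down to how much of the provisioned capacity is actually used. A rough sketch of the break-even utilization follows, assuming standard writes of up to 1 KB so that one WCU sustains one write per second. The function name and both prices are placeholder assumptions, so plug in the current rates for your region and table class.

```python
SECONDS_PER_HOUR = 3600


def break_even_utilization(
    provisioned_per_unit_hour: float,
    on_demand_per_million_requests: float,
) -> float:
    """Fraction of provisioned capacity that must be consumed for provisioned to be cheaper."""
    on_demand_per_request = on_demand_per_million_requests / 1_000_000
    # Cost of serving one full unit-hour of traffic (1 request/second) on-demand.
    full_hour_on_demand = SECONDS_PER_HOUR * on_demand_per_request
    return provisioned_per_unit_hour / full_hour_on_demand


# Placeholder prices (assumptions): $0.00065 per WCU-hour, $1.25 per million on-demand writes.
write_break_even = break_even_utilization(0.00065, 1.25)
print(f"{write_break_even:.1%}")
```

If average utilization of provisioned write capacity stays above the printed fraction, provisioned mode is the cheaper option at those prices. Below it, on-demand wins.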

Lambda right-sizing

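Use AWS Compute Optimizer or the AWS Lambda Power Tuning state machine (Step Functions) to measure duration and cost at several memory sizes. A typical sweet spot is 512 to 1024 MB for I/O-bound functions; CPU-bound functions often do better with more, because CPU allocation scales with configured memory.

# arm64 (Graviton) is billed about 20% less per GB-second; code with native binaries must be rebuilt for arm64
# the architecture is set when code is deployed, so use update-function-code (function.zip is a placeholder)
aws lambda update-function-code --function-name fn --architectures arm64 --zip-file fileb://function.zip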
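
Right-sizing works because Lambda bills memory multiplied by duration, so a larger setting can cost less if it shortens the run enough. The sketch below compares cost per million invocations across memory sizes. The durations are made-up measurements and the prices are assumptions, so take real figures from a power-tuning run and the current price list.

```python
def cost_per_million(
    memory_mb: int,
    avg_duration_ms: float,
    price_per_gb_second: float,
    price_per_million_requests: float = 0.20,  # assumed request price
) -> float:
    """Estimated cost in dollars of one million invocations at a given memory size."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000)
    return gb_seconds * price_per_gb_second * 1_000_000 + price_per_million_requests


X86_PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 price

# Hypothetical measurements: memory size in MB -> average duration in ms.
measurements = {512: 800, 1024: 380, 2048: 210}

costs = {
    mb: cost_per_million(mb, ms, X86_PRICE_PER_GB_SECOND)
    for mb, ms in measurements.items()
}
cheapest = min(costs, key=costs.get)

for mb, cost in sorted(costs.items()):
    print(f"{mb} MB: ${cost:.2f} per million invocations")
print("cheapest:", cheapest, "MB")
```

With these sample numbers 1024 MB is cheapest: doubling memory from 512 MB more than halves the duration, while going to 2048 MB no longer does.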

Infracost (per-PR estimates)

# the job needs `permissions: pull-requests: write` for the comment step
- uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}
- run: |
    infracost breakdown --path . --format json --out-file breakdown.json
    infracost output --path breakdown.json --format github-comment > comment.md
- run: gh pr comment ${{ github.event.pull_request.number }} -F comment.md
  env:
    GH_TOKEN: ${{ github.token }}

Tagging enforcement (SCP)

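{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireOwnerTagOnEC2",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": { "Null": { "aws:RequestTag/Owner": "true" } }
    },
    {
      "Sid": "RequireOwnerTagOnRDS",
      "Effect": "Deny",
      "Action": "rds:CreateDBInstance",
      "Resource": "*",
      "Condition": { "Null": { "aws:RequestTag/Owner": "true" } }
    }
  ]
}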

Track per-feature cost via custom metrics

Add a dimension that names the feature to each request metric, then cross-reference metric volume against the bill.

metrics.add('FeatureRequest', 1, { feature: 'imageProcessing', tier: 'premium' });

Then attribute shared infrastructure cost roughly in proportion to each dimension value's share of total metric volume.
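
The proportional split described above can be sketched in a few lines of Python. The function name, feature names, and figures are hypothetical. Request count is only a proxy for cost, so treat the result as a rough allocation rather than an exact one.

```python
def attribute_cost(total_cost: float, volume_by_feature: dict[str, float]) -> dict[str, float]:
    """Split a shared cost across features in proportion to their metric volume."""
    total_volume = sum(volume_by_feature.values())
    if total_volume <= 0:
        raise ValueError("total metric volume must be positive")
    return {
        feature: total_cost * volume / total_volume
        for feature, volume in volume_by_feature.items()
    }


# Hypothetical month: $1000 of shared infrastructure, request counts per feature.
shares = attribute_cost(
    1000.0,
    {"imageProcessing": 300, "search": 100, "export": 100},
)
print(shares)
```

Here imageProcessing produced 300 of 500 requests, so it is assigned 60% of the shared cost, and the other two features get 20% each.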

Useful tools

AWS Cost Explorer / Compute Optimizer / Trusted Advisor.
Vantage, Cloudability, CloudHealth, Datadog Cost — multi-cloud.
Infracost — per-PR Terraform cost.
Kubecost / OpenCost — K8s cost attribution.
Karpenter — autoscaling on AWS.
CloudCustodian — policy-driven cleanup.

Canonical “30-day cost cleanup” punch-list