AWS — Theory (interview deep-dive)

When to choose what (compute)

Need	Best fit
Long-running stateful service	EC2 / ECS-EC2
Containers, hands-off	Fargate
Short, event-driven tasks, bursty	Lambda
Need K8s API everywhere	EKS
One-off batch jobs	AWS Batch / Step Functions + Lambda
Static site / SPA	S3 + CloudFront

Don’t pick Lambda for: jobs that run longer than the 15-minute timeout, latency-sensitive paths where cold starts matter, or sustained high traffic (Fargate is often cheaper).
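The table and the Lambda caveats above can be read as a small decision function. This is a rough sketch, not an official sizing rule: the Workload fields and the order of the checks are assumptions made for illustration, and only the 15-minute limit comes from the notes.

```python
from dataclasses import dataclass

LAMBDA_MAX_SECONDS = 15 * 60  # Lambda's maximum timeout


@dataclass
class Workload:
    """Hypothetical workload description; field names are illustrative."""
    max_runtime_seconds: float
    static_site: bool = False
    needs_k8s_api: bool = False
    stateful: bool = False
    bursty: bool = False
    containerized: bool = False
    one_off_batch: bool = False


def pick_compute(w: Workload) -> str:
    """Return a first-guess compute option following the table above."""
    if w.static_site:
        return "S3 + CloudFront"
    if w.needs_k8s_api:
        return "EKS"
    if w.stateful:
        return "EC2 / ECS-EC2"
    if w.one_off_batch:
        return "AWS Batch / Step Functions + Lambda"
    # Lambda only fits if every invocation finishes inside the timeout.
    if w.bursty and w.max_runtime_seconds <= LAMBDA_MAX_SECONDS:
        return "Lambda"
    if w.containerized:
        return "Fargate"
    return "EC2 / ECS-EC2"  # conservative default when nothing else matches
```

In an interview the value is less the function than the ordering: hard constraints (K8s API, state, timeout) are checked before preferences.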

RDS vs DynamoDB

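RDS: ACID, joins, complex queries, schemas, mature ORMs. Writes scale vertically (bigger instance); read replicas scale reads. Aurora storage auto-scales to 128 TiB.
DynamoDB: serverless, single-digit-millisecond reads and writes at typical item sizes (a design target, not a guaranteed p99), horizontal scaling, flexible schema. Limits: queries that don’t use the partition key (plus optional sort key) need a GSI or a Scan; no joins; reads are eventually consistent by default, and strongly consistent reads cost 2× the RCUs.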
DynamoDB shines for: known access patterns, very high write throughput, multi-region active-active (Global Tables), session/token stores.
RDS shines for: relational data, ad-hoc analytics, transactions across rows, mature reporting.
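The "strong costs 2× RCU" point is easy to verify with arithmetic. The sketch below assumes the standard provisioned-capacity rule (one RCU covers one strongly consistent read per second of an item up to 4 KB, and an eventually consistent read costs half); the function names are made up for this example.

```python
import math

RCU_BLOCK_BYTES = 4 * 1024  # one read unit covers up to 4 KB of item data


def rcus_per_read(item_size_bytes: int, strongly_consistent: bool) -> float:
    """RCUs consumed by a single read of one item."""
    blocks = max(1, math.ceil(item_size_bytes / RCU_BLOCK_BYTES))
    return blocks * (1.0 if strongly_consistent else 0.5)


def provisioned_rcus(item_size_bytes: int, reads_per_second: int,
                     strongly_consistent: bool) -> int:
    """RCUs to provision for a steady read rate, rounded up."""
    return math.ceil(rcus_per_read(item_size_bytes, strongly_consistent)
                     * reads_per_second)
```

For example, a 9,000-byte item spans three 4 KB blocks, so 100 reads per second need 300 RCUs when strongly consistent and 150 when eventually consistent.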

VPC mental model

VPC = your slice of AWS network with chosen CIDR.
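Subnet = portion of the VPC CIDR in one AZ. Public = route table has a default route to an IGW; Private = no inbound from the internet, egress via NAT; Isolated = no NAT either.
NAT Gateway = private subnet’s egress to the internet (managed; billed per hour and per GB processed, a common networking cost surprise).
Security Group = allow-list attached to ENIs (inbound + outbound rules, no deny rules), stateful (return traffic auto-allowed).
NACL = subnet-level, allow and deny rules evaluated in rule-number order, stateless (return traffic needs its own rule), used rarely.
VPC Endpoints keep AWS-service traffic off the NAT: gateway endpoints (S3, DynamoDB) have no extra charge; interface endpoints (PrivateLink) bill per hour and per GB.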
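To make the VPC/subnet split concrete, here is a minimal sketch that carves a VPC CIDR into one subnet per tier per AZ using Python's ipaddress module. The tier names follow the notes above; the /20 subnet size, the AZ names, and the function name plan_subnets are assumptions for illustration only.

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr, azs, tiers=("public", "private", "isolated"),
                 new_prefix=20):
    """Assign consecutive, non-overlapping subnets to each (tier, AZ) pair."""
    vpc = ipaddress.ip_network(vpc_cidr)
    needed = len(tiers) * len(azs)
    # subnets() yields blocks lazily; take only as many as we need.
    blocks = list(islice(vpc.subnets(new_prefix=new_prefix), needed))
    if len(blocks) < needed:
        raise ValueError(f"{vpc_cidr} cannot hold {needed} /{new_prefix} subnets")
    plan = {}
    for i, tier in enumerate(tiers):
        for j, az in enumerate(azs):
            plan[(tier, az)] = str(blocks[i * len(azs) + j])
    return plan


if __name__ == "__main__":
    for key, cidr in plan_subnets("10.0.0.0/16", ["az-a", "az-b", "az-c"]).items():
        print(key, cidr)
```

What makes a subnet public or private is still its route table, not its CIDR; the plan only reserves address space.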

IAM evaluation

For a request to an AWS resource:

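Authenticate the principal.
Any explicit Deny in any applicable policy = denied, regardless of allows elsewhere.
Organization SCP (if any): must allow. SCPs never grant access, they only cap it.
Resource policy (if any): in the same account an allow here can be enough on its own; cross-account access needs an allow in both the resource policy and the caller’s identity policy.
Identity policy: must allow, unless a same-account resource policy already did.
Permissions boundary and session policy (if any): caps, must also allow.
No matching allow = implicit deny.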
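The evaluation rules above can be sketched as a small function. This is a deliberately simplified model, not the real IAM engine: policies are plain lists of statements with hypothetical keys (effect, actions, resources), wildcards are approximated with fnmatch, and conditions, principals, session policies, and the user-versus-role nuances around permissions boundaries are ignored.

```python
from fnmatch import fnmatchcase


def _matches(stmt, action, resource):
    """True if a statement's action and resource patterns cover the request."""
    return (any(fnmatchcase(action, p) for p in stmt["actions"])
            and any(fnmatchcase(resource, p) for p in stmt["resources"]))


def _effect(statements, action, resource):
    """Return "deny", "allow", or None (no matching statement) for one policy."""
    allowed = False
    for stmt in statements:
        if _matches(stmt, action, resource):
            if stmt["effect"] == "Deny":
                return "deny"
            allowed = True
    return "allow" if allowed else None


def is_allowed(action, resource, *, identity, resource_policy=None,
               scp=None, boundary=None, cross_account=False):
    """Simplified allow/deny decision; None means that policy type is absent."""
    present = [p for p in (identity, resource_policy, scp, boundary)
               if p is not None]
    # An explicit Deny in any policy wins over every Allow.
    if any(_effect(p, action, resource) == "deny" for p in present):
        return False
    # SCPs and permissions boundaries only cap: if present, they must allow.
    for cap in (scp, boundary):
        if cap is not None and _effect(cap, action, resource) != "allow":
            return False
    identity_allows = _effect(identity, action, resource) == "allow"
    resource_allows = (resource_policy is not None
                       and _effect(resource_policy, action, resource) == "allow")
    if cross_account:
        return identity_allows and resource_allows  # both sides must agree
    return identity_allows or resource_allows  # same account: either suffices
```

The point to remember is the shape of the logic: deny short-circuits, caps intersect, and grants union within an account but intersect across accounts.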

Best practices:

Roles for everything (no long-lived keys).
Principle of least privilege.
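IAM Access Analyzer to find unused access (roles, keys, permissions) and unintended external access.
MFA + short-lived sessions; session policies to scope down assumed roles.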

S3 deep notes

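Eventually consistent? No: strong read-after-write consistency for all GET, PUT, DELETE, and LIST operations since December 2020.
Latency: tens of milliseconds is common for small objects in-region; AWS guidance cites roughly 100-200 ms as a consistent figure. For high RPS, spread keys across prefixes: limits are per prefix (at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second), and randomized prefixes have not been required since 2018.
Multipart upload for objects over ~100 MB (required above 5 GB); upload parts in parallel.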
Lifecycle: transition Standard → IA after 30d, IA → Glacier after 90d, expire 365d.
Versioning + MFA Delete helps protect against ransomware and accidental deletes.
Object Lock (compliance / governance) for immutable backups.
Pre-signed URLs for time-limited access: the holder needs no AWS credentials and acts with the signer’s permissions.
Server-side encryption: SSE-S3, SSE-KMS (audit trail), SSE-C (you provide key).
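The lifecycle line above (30d to IA, 90d to Glacier, expire at 365d) could be expressed as a rule like the one below. The dict follows the rule shape that S3 lifecycle configuration uses, but treat it as a sketch: the rule ID, the 7-day multipart cleanup, and the reading of "90d" as 90 days after object creation are assumptions, and storage_class_at is a hypothetical helper that only illustrates the timeline (real transitions run asynchronously).

```python
def lifecycle_rule(prefix=""):
    """Build one lifecycle rule matching the tiering described in the notes."""
    return {
        "ID": "tiering-and-expiry",  # illustrative name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
        # Assumed extra: clean up abandoned multipart uploads after a week.
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }


def storage_class_at(rule, age_days):
    """Storage class an object of this age should be in, or None once expired."""
    if age_days >= rule["Expiration"]["Days"]:
        return None
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current
```

With versioning enabled, noncurrent versions need their own expiration action; this sketch only covers current versions.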

DynamoDB deep notes

Partition key = hash → which physical partition. Hot key = throttling.
Sort key + partition key = composite primary; supports range queries within partition.
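GSI: alternate access pattern with its own partition/sort key; eventually consistent reads only; can be added after table creation.
LSI: same partition key, alternate sort key; only at table creation; supports strongly consistent reads.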
On-demand pricing: pay per request. Provisioned: throughput unit + auto-scaling.
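DAX = managed in-memory cache (read-through and write-through); serves eventually consistent reads.
TTL attribute auto-deletes expired items in the background, typically within a few days of expiry, so filter out expired items on read.
Streams = CDC (24h retention); trigger Lambda.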
Single-table design — one table for many entities, distinguished by pk patterns. Common in mature DynamoDB use.
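Single-table design is easier to see with data. The sketch below is an in-memory stand-in, not DynamoDB itself: SingleTable, the USER#/ORDER# key prefixes, and the attribute names are all invented for illustration. It mimics the two properties that matter: items are grouped by partition key, and a query returns one partition's items in sort-key order, optionally filtered by a sort-key prefix (like begins_with).

```python
from collections import defaultdict


class SingleTable:
    """Toy model of a table with a composite primary key (pk, sk)."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # pk -> {sk: item}

    def put(self, pk, sk, **attrs):
        """Insert or overwrite the item identified by (pk, sk)."""
        self._partitions[pk][sk] = {"pk": pk, "sk": sk, **attrs}

    def query(self, pk, sk_prefix=""):
        """Return items in one partition, sorted by sk, matching the prefix."""
        partition = self._partitions.get(pk, {})
        return [partition[sk] for sk in sorted(partition)
                if sk.startswith(sk_prefix)]


if __name__ == "__main__":
    table = SingleTable()
    table.put("USER#42", "PROFILE", name="Ada")
    table.put("USER#42", "ORDER#2024-02-11", total=12)
    table.put("USER#42", "ORDER#2024-01-05", total=30)
    print(table.query("USER#42", "ORDER#"))  # both orders, oldest first
```

Profile and orders share a partition, so "user plus recent orders" is one query instead of a join, which is the main argument for the pattern.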

Lambda deep notes

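Cold start: ~100ms-1s typical, longer for JVM, .NET, and large packages. Mitigate: provisioned concurrency, lighter runtimes (Node, Python).
VPC-attached Lambda no longer carries a large cold-start penalty since the 2019 move to shared Hyperplane ENIs.
Concurrency: account-level quota (default 1,000 concurrent executions per Region); each function can scale up by 1,000 concurrent executions every 10 seconds. Reserved concurrency sets a per-function guarantee and cap.
Idempotency: async invokes and event sources retry, so the same event can arrive more than once; design accordingly.
Lambda + SQS: Lambda’s event source mapping polls the queue (SQS does not push) and scales pollers up to MaximumConcurrency if set.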
Layers for shared code/binaries.
Architecture: x86_64 vs arm64 (Graviton). arm64 ~20% cheaper.

Multi-AZ vs multi-region

Multi-AZ: same region, different DCs. Default for HA. Cheap.
Multi-region: disaster recovery, latency, compliance. Expensive (data transfer, Aurora Global, S3 replication, DDB Global Tables).
RTO/RPO drive the choice.

Common interview Qs

Design a serverless image-processing pipeline. S3 PUT → Lambda (resize/thumbnail) → S3 → DDB (metadata) → CloudFront. Use SQS for backpressure if Lambda concurrency matters.
EC2 instance can’t reach the internet. Check route table → IGW (public) or NAT (private), SG egress, NACL, public IP, DNS resolution.
High Lambda cold starts during traffic spikes. Provisioned concurrency, lighter runtime, smaller package, SnapStart for Java.
You designed RDS Multi-AZ for HA: what does it actually do? Synchronous standby in another AZ with automatic failover, typically 60-120s. In a Multi-AZ instance deployment the standby serves no traffic, so it doesn’t scale reads (use read replicas).
DynamoDB: design a table for a tweet feed. PK = userId, SK = timestamp. GSI keyed by hashtag with timestamp as sort key. Watch for hot partitions on celebrity accounts and trending hashtags.
S3 cost ballooning, what to check? Old versions, multipart upload remnants, missing lifecycle, request cost, data transfer out, KMS calls.
EKS vs ECS — when each? EKS if K8s API/ecosystem matters or multi-cloud. ECS for simpler AWS-only with Fargate.
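How do you secure secrets for Lambda? Secrets Manager / Parameter Store; fetch at init and cache (the AWS Parameters and Secrets Lambda Extension can handle the caching); rotate with Secrets Manager rotation; encrypt env vars with a KMS key.
Compliance: EU users’ data must stay in the EU. Deploy to EU Regions only, create S3 buckets in EU Regions with no replication outside the EU, keep DDB Global Table replicas in EU Regions only, and use an Organizations SCP that denies non-EU Regions (aws:RequestedRegion condition).
CloudFront in front of API Gateway: why or why not? Edge caching for cacheable responses, AWS Shield and WAF at the edge, one CDN domain for static and API traffic. Skip if all responses are user-specific and uncacheable.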
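For the image-pipeline question, interviewers often ask what the Lambda actually receives. The sketch below parses an S3 event notification and plans the thumbnail writes without calling AWS. The Records/s3/bucket/object layout and the URL-encoded key are how S3 notifications are delivered; the thumbnails/ prefix and the plan_thumbnails name are assumptions for this example.

```python
from urllib.parse import unquote_plus

THUMB_PREFIX = "thumbnails/"  # hypothetical output prefix


def plan_thumbnails(event):
    """Turn an S3 event notification into a list of resize jobs."""
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        # Skip our own output, otherwise a same-bucket trigger loops forever.
        if key.startswith(THUMB_PREFIX):
            continue
        jobs.append({
            "bucket": bucket,
            "source_key": key,
            "thumb_key": THUMB_PREFIX + key,
        })
    return jobs
```

Writing thumbnails to a separate bucket, or scoping the trigger to an uploads/ prefix, avoids the recursion risk altogether; the prefix check here is a second line of defense.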

Cost levers (always asked)

Right-sizing (Compute Optimizer recommendations).
Savings Plans / Reserved capacity for steady workloads.
Spot for interruptible.
Graviton (arm64) ~20% cheaper.
S3 lifecycle.
Delete unattached EBS, old snapshots, idle ELBs.
VPC Endpoints to avoid NAT cost.
CloudFront for outbound bandwidth (often cheaper than S3 directly).
CloudWatch logs retention + filter / sample.
Reduce inter-AZ traffic where avoidable.
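The NAT versus VPC endpoint lever is simple arithmetic, sketched below. The rates are passed in as parameters on purpose: the example figures used here are placeholders, not quoted AWS prices, and the 730-hour month and function names are assumptions. The one pricing fact relied on is that gateway endpoints for S3 and DynamoDB carry no hourly or per-GB charge.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly estimates


def nat_monthly_cost(gb_processed, hourly_rate, per_gb_rate, gateways=1):
    """NAT Gateway bill: fixed hourly charge per gateway plus data processing."""
    return gateways * hourly_rate * HOURS_PER_MONTH + gb_processed * per_gb_rate


def gateway_endpoint_savings(gb_to_s3_and_dynamodb, per_gb_rate):
    """Data-processing charges avoided by routing S3/DynamoDB traffic
    through gateway endpoints instead of the NAT."""
    return gb_to_s3_and_dynamodb * per_gb_rate


if __name__ == "__main__":
    # Placeholder rates for illustration only; check current pricing.
    total = nat_monthly_cost(1000, hourly_rate=0.045, per_gb_rate=0.045)
    saved = gateway_endpoint_savings(800, per_gb_rate=0.045)
    print(f"NAT: {total:.2f}/month, avoidable: {saved:.2f}/month")
```

Interface endpoints change the math because they have their own hourly and per-GB charges, so they pay off only above some traffic volume.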

Anti-patterns

Long-lived IAM access keys committed in repos.
Wide-open security groups (inbound 0.0.0.0/0 on anything other than a public ALB/CDN listener).
One huge VPC for all environments.
One Lambda doing everything.
DynamoDB without thinking about access patterns.
Public S3 bucket for “convenience”.
No backups / restore tests.
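Some of these anti-patterns can be caught mechanically. As one example, the sketch below flags wide-open ingress rules in a security group description. It assumes the input is a dict shaped like one entry of the EC2 DescribeSecurityGroups response (IpPermissions, IpRanges, Ipv6Ranges); the function name and the choice of 80/443 as acceptable public ports are illustrative.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_ingress_rules(security_group, allowed_ports=frozenset({80, 443})):
    """Return ingress rules open to the world on ports outside allowed_ports."""
    findings = []
    for perm in security_group.get("IpPermissions", []):
        cidrs = [r["CidrIp"] for r in perm.get("IpRanges", [])]
        cidrs += [r["CidrIpv6"] for r in perm.get("Ipv6Ranges", [])]
        world = [c for c in cidrs if c in OPEN_CIDRS]
        if not world:
            continue
        # Rules for all protocols ("-1") carry no FromPort/ToPort.
        from_port, to_port = perm.get("FromPort"), perm.get("ToPort")
        single_allowed_port = (from_port is not None
                               and from_port == to_port
                               and from_port in allowed_ports)
        if not single_allowed_port:
            findings.append({
                "group": security_group.get("GroupId"),
                "protocol": perm.get("IpProtocol"),
                "ports": (from_port, to_port),
                "cidrs": world,
            })
    return findings
```

A check like this fits in CI or a scheduled audit; managed services such as AWS Config rules cover the same ground without custom code.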