CI/CD — Theory
What “good” looks like
- Every commit produces a deployable artifact.
- Tests + scans run automatically; broken builds block merge.
- Same artifact promoted dev → staging → prod (no rebuild per env).
- Deploys are scriptable, reversible, observable.
- Production deploy is a non-event (no all-hands required).
Trunk-based development vs feature branches
Trunk-based: short-lived feature branches (<1 day), merge to main fast, hide unfinished work behind feature flags. Forces:
- Small PRs.
- Fast CI (10-15 min).
- Frequent integration.
- Feature flags for gradual rollout.
Long-lived branches cause drift, merge hell, and integration delays. Avoid them except for hotfix lanes.
Pipeline design principles
- Fast feedback: lint/unit < 5 min. Block merge on these. Slow tests can run async.
- Fail fast: stop pipeline on first failure.
- Parallelize independent stages.
- Cache aggressively: deps, build artifacts, Docker layers.
- Idempotent: re-run same SHA → same result.
- Reproducible: pinned versions of tools, deps, base images.
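A common way to make caching both aggressive and reproducible is to derive the cache key from the content of the files that decide whether the cache is still valid (typically a lockfile). A minimal sketch, assuming a hypothetical helper `cache_key` that receives file contents as bytes:

```python
import hashlib


def cache_key(prefix: str, *file_contents: bytes) -> str:
    """Derive a cache key from the bytes of the files that define the cache.

    Same inputs -> same key (cache hit); any change -> new key (cache miss).
    """
    digest = hashlib.sha256()
    for content in file_contents:
        # Hash the length first so (b"ab", b"c") and (b"a", b"bc") differ.
        digest.update(len(content).to_bytes(8, "big"))
        digest.update(content)
    return f"{prefix}-{digest.hexdigest()[:16]}"
```

Because the key depends only on content, re-running the same SHA restores the same cache, which lines up with the idempotency and reproducibility principles above.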
Blue/Green vs Canary
| | Blue/Green | Canary |
|---|---|---|
| Risk | low (full rollback) | low + smaller blast radius |
| Cost | 2x infra during cut | small extra |
| Speed | instant cutover | gradual ramp |
| Use | infrequent deploys, big release | continuous, observe metrics |
Canary needs:
- Routing layer that splits traffic by % (Istio, ALB weighted, Argo Rollouts).
- Real-time metrics for go/no-go (error rate, latency).
- Automated rollback if SLI degrades.
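The go/no-go decision can be sketched as a comparison between the baseline and the canary over one observation window. The names (`WindowStats`, `canary_verdict`) and all thresholds are illustrative assumptions, not defaults of any of the tools listed:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,            # assumed minimum sample size
    max_error_rate_delta: float = 0.01,  # assumed tolerated increase
    max_latency_ratio: float = 1.2,      # assumed tolerated p99 slowdown
) -> str:
    """Return "wait", "rollback", or "promote" for one observation window."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge either way
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Comparing against a live baseline rather than a fixed number keeps the gate meaningful when the whole system is having a bad day.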
Feature flags
Decouple deploy from release. Code in production, gated by flag, enable per user/segment.
Tools: LaunchDarkly, Unleash, Flagsmith, GrowthBook, Split.
Pitfalls:
- Flag debt — old flags accumulating.
- Conditional logic spaghetti.
- Flag states not exercised in tests → bugs missed (cover both the on and off paths).
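Per-user enablement is usually implemented as deterministic bucketing, so a given user keeps the same answer across requests. A minimal sketch, assuming a hypothetical `is_enabled` helper rather than any vendor SDK:

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into 0..99 and compare to the rollout."""
    # Hash flag + user so the same user lands in different buckets per flag.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Raising `rollout_percent` only ever adds users, it never flips earlier ones off. In tests, pinning the flag to 0 and to 100 is a cheap way to exercise both code paths.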
Deployment safety mechanisms
- Health checks before sending traffic.
- Smoke tests: hit critical endpoints right after deploy, before shifting real traffic.
- Auto-rollback on metric breach (5xx rate, latency).
- Progressive delivery: canary → 1% → 10% → 50% → 100%.
- Database migration safety: forward-compatible (don’t drop column same release; rename in two steps).
- Maintenance mode for incompatible changes (rare).
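Progressive delivery with auto-rollback boils down to a loop: raise the traffic share, check health, and send everything back to the old version on the first breach. A minimal sketch in which `set_traffic_percent` and `is_healthy` are hypothetical callbacks standing in for a real routing layer and metrics query:

```python
from typing import Callable, Iterable, List, Tuple


def progressive_rollout(
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[int], bool],
    steps: Iterable[int] = (1, 10, 50, 100),
) -> Tuple[bool, List[int]]:
    """Ramp traffic step by step; route 0% to the new version as soon as
    a health check fails.

    Returns (succeeded, list of percentages that were applied).
    """
    applied: List[int] = []
    for percent in steps:
        set_traffic_percent(percent)
        applied.append(percent)
        if not is_healthy(percent):
            set_traffic_percent(0)  # auto-rollback
            applied.append(0)
            return False, applied
    return True, applied
```

A real controller would also wait for an observation window at each step; that is omitted here to keep the control flow visible.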
Database migration patterns
Never deploy code requiring schema before schema exists. Rule of thumb: schema first, code second; rollback in reverse.
Patterns:
- Expand-contract: add column → dual-write → backfill → switch reads → remove old. Spans multiple releases.
- Online schema change tools: gh-ost, pt-online-schema-change for big tables.
- Lock-aware migrations (PG: avoid `ACCESS EXCLUSIVE` locks on big tables in business hours).
Supply chain security
- SBOM (CycloneDX, SPDX): inventory of components.
- Signed artifacts: cosign / Sigstore.
- Provenance attestations (SLSA framework).
- Pinned dependencies: lockfiles + reproducible builds.
- Vuln scanning: Snyk, Trivy, Dependabot alerts; Renovate or Dependabot for automated dependency-update PRs.
- Secrets scanning: gitleaks, trufflehog.
Common interview Qs
- What’s the difference between continuous delivery and continuous deployment? Delivery = every change is kept shippable, but the prod deploy is a manual button. Deployment = every change that passes the pipeline goes to prod automatically.
- Pipeline takes 1h — how to speed up? Parallelize, cache, split slow tests, run E2E nightly only, smaller artifacts, faster runners.
- Deploy went bad — rollback strategy? Blue/green flip; revert deployment; revert image tag; re-deploy previous artifact.
- How do you ensure prod deploys are safe? Tests + scans + canary + auto-rollback on metric breach + observability + feature flags for risky features.
- Same code crashes prod, works in staging — why? Env diff (config, secrets, scale, traffic shape, feature flag, third-party rate limit). Reproduce with prod-like load.
- CI passing but test broken — why? Test no-op or wrong assertion. Mutation testing helps detect.
- You see flaky tests. Action plan? Quarantine, investigate root cause (timing, shared state, env), fix, then unquarantine.
- Secrets in CI: best practice? OIDC to cloud, no long-lived keys; provider secrets store; rotate regularly; never log; encrypt at rest.
- DORA metrics — pick one we should improve and how. Lead time: smaller PRs, trunk-based, faster CI.
- How would you do zero-downtime DB migration? Expand-contract + backwards-compatible code.
Common pitfalls
- One mega-pipeline doing everything → slow, brittle.
- Shared mutable test env → flaky.
- Manual approval steps that nobody understands.
- “Special” prod-only flags / configs → drift, surprise breakage.
- No production-like load test before traffic flip.
- Branch policies bypassed by admins.
- Ad-hoc hotfix deploys without same gates.
- Long-lived release branches → integration debt.
Deep dive — image scanning + SBOM + signing
Three supply-chain concerns:
- Known CVEs in your image: solved by scanners like Trivy (Aqua, OSS; scans OS packages, language deps, secrets, IaC misconfigs), Grype (Anchore), Snyk, Docker Scout.
- Inventory of what’s in the image: solved by an SBOM in SPDX or CycloneDX format, generated by Syft or Trivy (`trivy image --format cyclonedx`). The SBOM is a JSON document listing every package, version, license, hash; store it as a build artifact so when a future CVE drops you can grep your SBOM archive and know which images are affected without re-scanning.
- Provenance (that the image you pulled is the one you built): solved by cosign (Sigstore) signing. Modern keyless cosign uses OIDC (the same identity GitHub Actions issues for AWS) to bind a short-lived cert to the build identity and records the signature in the Rekor transparency log; verifiers check against Rekor without managing private keys.
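Querying an SBOM archive when a new CVE drops can be sketched as below. It assumes the archive is a mapping from image digest to a parsed CycloneDX JSON document (which keeps packages under a top-level `components` list); the function name and the digests in the usage example are hypothetical:

```python
from typing import Dict, List, Set


def affected_images(
    sbom_archive: Dict[str, dict], package: str, bad_versions: Set[str]
) -> List[str]:
    """Return image digests whose SBOM lists `package` at a vulnerable version.

    `sbom_archive` maps image digest -> parsed CycloneDX JSON document.
    """
    hits = []
    for digest, sbom in sbom_archive.items():
        for component in sbom.get("components", []):
            if (
                component.get("name") == package
                and component.get("version") in bad_versions
            ):
                hits.append(digest)
                break  # one match is enough to flag the image
    return sorted(hits)


# Usage with made-up digests: which images shipped the backdoored xz releases?
archive = {
    "sha256:aaa": {"components": [{"name": "xz-utils", "version": "5.6.1"}]},
    "sha256:bbb": {"components": [{"name": "xz-utils", "version": "5.4.5"}]},
}
print(affected_images(archive, "xz-utils", {"5.6.0", "5.6.1"}))
```

Keying the archive by image digest (not tag) is what makes the answer trustworthy, since tags can be moved after the SBOM was produced.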
Why this matters in 2024–2026
SolarWinds, the event-stream npm hijack, and the xz-utils backdoor (Mar 2024) all exploited trust in the build/distribution chain. US Executive Order 14028 pushes SBOMs and signed artifacts toward mandatory for federal software suppliers, and the EU CRA pushes in the same direction for products with digital elements sold in the EU.
Gotchas
- Scan on every push and on a daily schedule: new CVEs land against unchanged images all the time.
- `ignore-unfixed: true` filters out CVEs with no upstream patch (otherwise the queue is unworkable).
- SBOM only useful if archived and queryable: store in artifact storage with image digest as key.
- Cosign keyless signing certs expire within minutes, but the Rekor entry proves the signature was made while the cert was valid, so verification still works years later.
- Trivy’s vulnerability DB is rebuilt several times a day; pin a DB version in air-gapped environments.
Sources: trivy.dev, docs.sigstore.dev/cosign/signing/overview, snyk.io/blog/10-docker-image-security-best-practices.
Deep dive — GitHub Actions
Triggers (`on:`)
| Trigger | Purpose |
|---|---|
| `push` | Branch/path filters |
| `pull_request` | Use `pull_request_target` with extreme caution: it runs against the base repo with secrets |
| `workflow_dispatch` | Manual, with typed inputs |
| `schedule` | Cron, UTC |
| `repository_dispatch` | External trigger via REST API |
| `workflow_call` | Makes the workflow reusable |
| `workflow_run` | Chain after another workflow |
Matrix builds
Generate one job per combination of variables. Use `include` to add specific extra combinations and `exclude` to remove unwanted ones. `fail-fast: false` lets all matrix legs finish when one fails (essential for cross-version test suites). `max-parallel` throttles how many legs run at once.
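The expansion itself is a cartesian product with a filter. The sketch below is a simplified model, not GitHub’s full algorithm: it treats every `include` entry as an extra standalone job, whereas the real `include` can also add keys to existing combinations. `expand_matrix` is a hypothetical name:

```python
from itertools import product
from typing import Dict, List, Sequence


def expand_matrix(
    axes: Dict[str, list],
    include: Sequence[dict] = (),
    exclude: Sequence[dict] = (),
) -> List[dict]:
    """Simplified model of matrix expansion (not GitHub's full algorithm)."""
    names = list(axes)
    combos = [dict(zip(names, values)) for values in product(*axes.values())]
    # exclude: drop every combination that matches all keys of an entry.
    combos = [
        c for c in combos
        if not any(all(c.get(k) == v for k, v in ex.items()) for ex in exclude)
    ]
    # include (simplified): append each entry as an extra standalone job.
    combos.extend(dict(entry) for entry in include)
    return combos


jobs = expand_matrix(
    {"os": ["ubuntu", "windows"], "node": [18, 20]},
    include=[{"os": "macos", "node": 20}],
    exclude=[{"os": "windows", "node": 18}],
)
print(len(jobs))  # 2 x 2 = 4, minus 1 excluded, plus 1 included
```

Excludes are applied before includes here, so an included combination cannot be removed by an `exclude` entry.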
Reusable workflows vs composite actions
- Reusable workflows: full workflows called via `uses: org/repo/.github/workflows/x.yml@ref` at the job level. Get their own runner, can have `secrets:` and `inputs:`, limited to nesting depth 10.
- Composite actions: step-level reusable units that run inside an existing job. Lighter, no separate runner, ideal for packaging a sequence of steps.
Concurrency groups
Prevent overlapping runs on the same logical resource (e.g., one prod deploy at a time). `cancel-in-progress: true` kills superseded runs (great for PR builds where only the latest matters).
Gotchas
- `actions/setup-node` with `cache: 'npm'` auto-uses `actions/cache` keyed on `package-lock.json`; don’t add a redundant cache step.
- Default `GITHUB_TOKEN` permissions changed in 2023 to read-only for new repositories and organizations; declare `permissions:` per job explicitly.
- `pull_request` from forks does NOT have access to repo secrets (security boundary).
- Schedules (`on: schedule`) only run from the workflow file on the default branch.
- Concurrency groups are case-insensitive: `prod` == `Prod`.
Sources: docs.github.com/en/actions/using-workflows.
Deep dive — OIDC to AWS (modern way)
Why long-lived AWS keys are out
Storing `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` as a GitHub Secret means a static credential is replicated outside AWS, never auto-rotated, and grants whatever IAM policy is attached for as long as it exists. If a workflow logs it, a malicious action exfiltrates it, or an admin leaves the org, you’re rotating manually.
How OIDC flips this
GitHub acts as an OIDC identity provider; AWS IAM trusts that provider; for each workflow run GitHub mints a short-lived JWT (5–15 min validity) whose `sub` claim identifies the exact repo, branch, environment, or job; `aws-actions/configure-aws-credentials@v4` exchanges that JWT for short-lived STS credentials (default 1 hour, capped by the role’s `MaxSessionDuration`).
No secrets stored, credentials auto-expire, and the IAM trust policy can scope access by repo + branch + environment.
Trust policy is the security boundary
Use `StringEquals` on `sub` to pin it to one ref (`repo:org/repo:ref:refs/heads/main`) or one environment (`repo:org/repo:environment:prod`). Wildcards via `StringLike` are convenient but dangerous: `repo:org/repo:*` lets a workflow run from any branch or pull request in that repo assume the role.
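The difference between the two operators is easy to see by matching a few `sub` values against a pattern. This is an approximation for illustration only: `fnmatchcase` stands in for IAM’s `StringLike` (both treat `*` and `?` as wildcards, but `fnmatchcase` also honours `[...]` sets), and `trust_allows` is a hypothetical helper:

```python
from fnmatch import fnmatchcase


def trust_allows(sub: str, pattern: str, operator: str = "StringEquals") -> bool:
    """Approximate how an IAM condition on `sub` treats a token's subject.

    StringEquals: exact match. StringLike: `*` and `?` wildcards.
    """
    if operator == "StringEquals":
        return sub == pattern
    if operator == "StringLike":
        return fnmatchcase(sub, pattern)
    raise ValueError(f"unsupported operator: {operator}")


main = "repo:org/repo:ref:refs/heads/main"
feature = "repo:org/repo:ref:refs/heads/feature/x"
pull_request = "repo:org/repo:pull_request"

# Pinned: only main gets through.
print(trust_allows(main, main), trust_allows(feature, main))
# Wildcard: feature branches and pull requests get through too.
print(trust_allows(feature, "repo:org/repo:*", "StringLike"))
print(trust_allows(pull_request, "repo:org/repo:*", "StringLike"))
```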
Same model works for GCP (Workload Identity Federation; the audience is the provider’s full resource name, which embeds the numeric project number) and Azure (federated credentials on a Microsoft Entra ID, formerly AAD, app registration).
Gotchas
- Forgetting `permissions: id-token: write` → “credentials could not be loaded”: the job can’t request a JWT.
- `StringLike` with `repo:org/repo:*` is the #1 OIDC misconfiguration; pin to a specific `ref:` or `environment:`.
- `role-session-name` shows up in CloudTrail; make it traceable (include `github.run_id`).
- Cross-account: the role you assume must be in the target account; trust GitHub’s OIDC provider in that account too.
- Per-environment roles + GitHub `environment:` protection rules give you human approval gates on top of OIDC.
Q: Why is OIDC better than storing AWS access keys as GitHub Secrets?
Three reasons. (1) No long-lived credentials anywhere: STS tokens expire in an hour by default. (2) Trust is scoped by GitHub’s signed JWT claims, so an IAM trust policy can require `sub` = `repo:org/repo:environment:prod`; a token minted for a feature-branch run can’t assume the prod role. (3) Auditability: every assumption shows up in CloudTrail with `role-session-name` tied to a specific GitHub run ID. Operationally there’s nothing to rotate, nothing to re-store when team members leave, and no static key to leak in logs.
Q: Walk through configuring it end-to-end.
In AWS: create an IAM OIDC identity provider for https://token.actions.githubusercontent.com with audience sts.amazonaws.com. Create an IAM role with a trust policy that allows sts:AssumeRoleWithWebIdentity from that provider, with a Condition pinning sub to repo:org/repo:environment:prod (and aud to sts.amazonaws.com). Attach the least-privilege policy the deploy needs. In GitHub: set permissions: id-token: write, use aws-actions/configure-aws-credentials@v4 with role-to-assume, and protect the prod environment with required reviewers.
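The trust policy described above can be written out as a document. The sketch below builds it as a Python dict and prints the JSON; the account ID `123456789012` and the `org/repo` subject are placeholders, and `github_oidc_trust_policy` is a hypothetical helper name:

```python
import json


def github_oidc_trust_policy(account_id: str, sub: str) -> dict:
    """Build an IAM role trust policy for GitHub's OIDC provider."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Exact match on both claims: audience and subject.
                    "StringEquals": {
                        f"{provider}:aud": "sts.amazonaws.com",
                        f"{provider}:sub": sub,
                    }
                },
            }
        ],
    }


policy = github_oidc_trust_policy("123456789012", "repo:org/repo:environment:prod")
print(json.dumps(policy, indent=2))
```

Generating the policy from code (or IaC) rather than hand-editing it makes the pinned `sub` value reviewable in a pull request like any other change.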
Deep dive — Jenkins (in case asked)
Declarative vs scripted
Declarative pipelines wrap everything in a `pipeline { }` block with strict structure (agent, stages, steps, post): easier to read, easier to lint, the recommended default since 2017. Scripted pipelines are raw Groovy with imperative control flow: more flexible (loops, try/catch around stages) but harder to maintain.
Shared libraries
Live in a separate repo with `vars/` (each `.groovy` file becomes a global pipeline step like `deployToK8s()`), `src/` (regular Groovy classes on the classpath), and `resources/` (bundled files loadable via `libraryResource`). Loaded via `@Library('my-lib@v1.2') _`.
Credentials binding
The `credentials()` helper: `environment { AWS_CREDS = credentials('aws-prod') }` exposes `AWS_CREDS_USR` / `AWS_CREDS_PSW` (for a username/password credential), masked in logs.
`post { success {} failure {} always {} }` runs after the stages finish; `always` fires regardless of outcome, which makes it the place for Slack notifications and artifact cleanup.
Jenkins vs GitHub Actions
Jenkins wins when you need self-hosted control, deep plugin integration with on-prem tooling (older artifact repos, hardware test rigs, regulated environments where SaaS CI is forbidden), or complex pipelines that pre-date GHA. GHA wins on zero infra, native GitHub integration, marketplace ecosystem, and OIDC-everywhere. Regulated/government shops often run Jenkins on-prem behind their firewall for compliance even if GitHub Actions is allowed for public-facing code.
Gotchas
- Plugin sprawl is Jenkins’ biggest liability: every plugin is an attack surface and a future upgrade pain.
- `agent any` on a single-node Jenkins queues every build behind that node’s few executors.
- `parallel` inside declarative fails the stage if any branch fails; by default the other branches still run to completion, and `failFast true` aborts them as soon as one fails.
- Shared library versions cache aggressively; pin to a tag or SHA, not `main`, for reproducibility.
Sources: jenkins.io/doc/book/pipeline/syntax, jenkins.io/doc/book/pipeline/shared-libraries.
Deep dive — deployment strategies
Rolling is Kubernetes’ default Deployment strategy. Replace pods N at a time, controlled by `maxSurge` (extras allowed above desired) and `maxUnavailable` (pods allowed to go missing). Simple, no extra infra, no traffic splitting, but old and new versions serve traffic concurrently mid-rollout, so you must support N and N+1 schemas simultaneously. Rollback is “revert the deployment,” which itself is another rolling update (slow).
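For the rolling strategy, the two knobs translate into hard bounds on pod counts. The arithmetic below follows the Kubernetes rule that a percentage `maxSurge` rounds up and a percentage `maxUnavailable` rounds down (both default to 25%); the helper names are hypothetical:

```python
import math
from typing import Tuple, Union


def _resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string like "25%" into pods."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rolling_bounds(
    replicas: int,
    max_surge: Union[int, str] = "25%",
    max_unavailable: Union[int, str] = "25%",
) -> Tuple[int, int]:
    """Return (max total pods, min available pods) during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)  # percentages round up
    unavailable = _resolve(max_unavailable, replicas, round_up=False)  # round down
    return replicas + surge, replicas - unavailable


print(rolling_bounds(10))  # (13, 8): at most 13 pods, at least 8 available
```

Setting `max_unavailable` to 0 with a surge of 1 gives the most conservative rollout: capacity never dips, at the cost of replacing one pod at a time.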
Blue/green — runs two complete environments (blue = current, green = candidate); cut traffic at the load balancer in a single switch (DNS, ALB target group swap, Service selector flip). Rollback is instant (flip back). Cost: 2× infra during cutover. Schema migrations still need expand/contract (the database has only one copy). Best for stateful systems where you want zero-mixing of versions.
Canary — routes a small percentage (1%, 5%, 10%) of traffic to the new version, observes SLO metrics (error rate, p99 latency), progressively ramps up — or auto-aborts on regression. Needs a traffic-splitting layer: a service mesh (Istio, Linkerd), an ALB with weighted target groups, or a controller like Argo Rollouts or Flagger that automates the weight steps and integrates with Prometheus for analysis.
Feature flags (LaunchDarkly, Unleash, Flagsmith, OpenFeature as the vendor-neutral spec) decouple deploy from release: code ships dark to production, then is enabled per-user, per-segment, or globally without a redeploy. Combine with canary for “deploy continuously, release deliberately.”
Gotchas
- Database migrations break all three strategies if not expand/contract (add column nullable → deploy code that writes both → backfill → deploy code that reads new → drop old).
- Canary metrics need enough traffic — for a 5% canary on 100 req/min you’ll wait hours for signal.
- Blue/green doubles cost and doubles in-flight connections during cutover (drain old).
- Feature flag debt: every flag is a permanent `if` until removed; budget for cleanup.
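The low-traffic canary gotcha is plain arithmetic. A rough sketch that estimates how long a canary must run before a chosen number of errors would be expected at the baseline error rate; the threshold of 30 errors is an arbitrary illustrative proxy for “enough data”, not a statistical rule:

```python
def minutes_until_signal(
    total_rpm: float,
    canary_fraction: float,
    baseline_error_rate: float,
    errors_needed: int = 30,  # assumed proxy for "enough data to compare"
) -> float:
    """Minutes of canary traffic before `errors_needed` errors are expected
    at the baseline error rate."""
    canary_rpm = total_rpm * canary_fraction
    expected_errors_per_minute = canary_rpm * baseline_error_rate
    if expected_errors_per_minute <= 0:
        raise ValueError("no errors expected; this estimate does not apply")
    return errors_needed / expected_errors_per_minute


# 100 req/min, 5% canary, 1% baseline error rate -> 5 req/min on the canary.
hours = minutes_until_signal(100, 0.05, 0.01) / 60
print(round(hours, 1))
```

Under these assumptions the wait is on the order of ten hours, which is why low-traffic services often ramp in larger steps or fall back to blue/green.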
Q: When would you pick blue/green over canary?
Three cases. (1) Low-traffic services where canary slices have no statistical power. (2) Stateful apps where running N and N+1 simultaneously is unsafe (incompatible binary protocols, sticky sessions, in-memory state). (3) Compliance environments needing a single, auditable cutover moment with named approvers. Otherwise canary wins because it limits blast radius — at 5% traffic, a bad deploy hits 5% of users, not 50% mid-rolling-update.
Q: Tell me how you’d ship a breaking schema change without downtime.
Expand/contract over multiple deploys. Deploy 1: add the new column nullable, code writes both old and new, reads old. Deploy 2 (after backfill): code reads new, still writes both. Deploy 3: code stops writing old. Deploy 4: drop old column. Every deploy before the final drop is independently reversible because the previous version still works; the drop itself is one-way, so run it only once nothing reads or writes the old column. Combine with feature flags so the read-switch is a runtime toggle, not a deploy.
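The sequence can be walked through on an in-memory SQLite database. This is a toy illustration under stated assumptions: the `users` table and the rename from `fullname` to `display_name` are invented for the example, and a real backfill on a large table would run in batches rather than as one `UPDATE`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")  # old row

# Deploy 1 (expand): add the new column as nullable; code writes both columns.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def create_user(name: str) -> None:
    """Dual-write: keep old and new columns in sync for new rows."""
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


create_user("Grace Hopper")

# Backfill rows written before dual-write started.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


# Deploy 2: reads switch to the new column (writes still go to both).
def read_names() -> list:
    rows = conn.execute("SELECT display_name FROM users ORDER BY id")
    return [row[0] for row in rows]


print(read_names())

# Deploys 3 and 4 (contract): stop writing `fullname`, then drop it.
# The drop is left as a comment because it is the one irreversible step:
#   ALTER TABLE users DROP COLUMN fullname
```

At every point before the drop, the previous application version still finds the column it expects, which is what makes each step safe to roll back.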
Sources: kubernetes.io/docs/concepts/workloads/controllers/deployment, argoproj.github.io/argo-rollouts/features/canary, martinfowler.com/bliki/BlueGreenDeployment, openfeature.dev.
Closing framing — supply chain + governance
The defensible governance/supply-chain story for regulated systems:
- OIDC (no static keys in CI).
- Signed images + SBOM (supply-chain provenance).
- Expand/contract migrations + canary (zero-downtime within compliance windows).
- Jenkins-on-prem option (sovereignty for air-gapped or regulated networks).
- Pinned image digests + admission control (reproducibility + tamper resistance).
Regulators care less about raw deploy frequency and more about auditability and blast-radius control. Frame answers around those.