CI/CD — Theory

  • Every commit produces a deployable artifact.
  • Tests + scans run automatically; broken builds block merge.
  • Same artifact promoted dev → staging → prod (no rebuild per env).
  • Deploys are scriptable, reversible, observable.
  • Production deploy is a non-event (no all-hands required).

Trunk-based development vs feature branches

Trunk-based: short-lived feature branches (<1 day), merge to main fast, hide unfinished work behind feature flags. Forces:

  • Small PRs.
  • Fast CI (10-15 min).
  • Frequent integration.
  • Feature flags for gradual rollout.

Long-lived branches drift from main and cause merge hell and integration delays. Avoid them except for hotfix lanes.

  • Fast feedback: lint/unit < 5 min. Block merge on these. Slow tests can run async.
  • Fail fast: stop pipeline on first failure.
  • Parallelize independent stages.
  • Cache aggressively: deps, build artifacts, Docker layers.
  • Idempotent: re-run same SHA → same result.
  • Reproducible: pinned versions of tools, deps, base images.

| | Blue/Green | Canary |
|---|---|---|
| Risk | low (full rollback) | low + smaller blast radius |
| Cost | 2x infra during cut | small extra |
| Speed | instant cutover | gradual ramp |
| Use | infrequent deploys, big release | continuous, observe metrics |

Canary needs:

  • Routing layer that splits traffic by % (Istio, ALB weighted, Argo Rollouts).
  • Real-time metrics for go/no-go (error rate, latency).
  • Automated rollback if SLI degrades.
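
The go/no-go decision in the list above can be sketched as a small function. This is a minimal illustration, not how Argo Rollouts or any specific controller implements it: the `WindowStats` type, the function name, and all three thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Aggregated metrics for one version over one observation window."""
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,             # assumed: smaller samples are not judged
    max_error_rate_delta: float = 0.01,  # assumed: canary may exceed baseline by 1 point
    max_latency_ratio: float = 1.2,      # assumed: canary p99 may be up to 20% slower
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the current window."""
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Comparing the canary against the live baseline, rather than against a fixed threshold, keeps the gate from firing when the whole system is degraded for unrelated reasons.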

Decouple deploy from release. Code in production, gated by flag, enable per user/segment.

Tools: LaunchDarkly, Unleash, Flagsmith, GrowthBook, Split.
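
To make "enable per user/segment" concrete, here is a rough sketch of percentage rollout with a segment override. It does not reproduce any of the listed tools' SDKs; `bucket`, `is_enabled`, and the flag and segment names are hypothetical, and hashing the flag key with the user ID is one common approach, assumed here.

```python
import hashlib
from typing import Iterable


def bucket(flag_key: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(
    flag_key: str,
    user_id: str,
    rollout_percent: int,
    user_segments: Iterable[str] = (),
    enabled_segments: Iterable[str] = (),
) -> bool:
    """True if the user is in an enabled segment or inside the rollout percentage."""
    if set(user_segments) & set(enabled_segments):
        return True
    return bucket(flag_key, user_id) < rollout_percent
```

Because the bucket is derived from a hash rather than a random draw, a user who sees the feature at 10% still sees it at 50%, and including the flag key in the hash keeps different flags from always selecting the same users.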

Pitfalls:

  • Flag debt — old flags accumulating.
  • Conditional logic spaghetti.
  • Flags not exercised in tests → bugs hide in the untested flag state.

  • Health checks before sending traffic.
  • Smoke tests: hit critical endpoints right after deploy, before shifting traffic.
  • Auto-rollback on metric breach (5xx rate, latency).
  • Progressive delivery: canary → 1% → 10% → 50% → 100%.
  • Database migration safety: forward-compatible (don’t drop column same release; rename in two steps).
  • Maintenance mode for incompatible changes (rare).

Never deploy code that requires a schema change before that change exists. Rule of thumb: schema first, code second; roll back in reverse order (code first, then schema).

Patterns:

  • Expand-contract: add column → dual-write → backfill → switch reads → remove old. Spans multiple releases.
  • Online schema change tools: gh-ost, pt-online-schema-change for big tables.
  • Lock-aware migrations (PG: avoid ACCESS EXCLUSIVE on big tables in business hours).
  • SBOM (CycloneDX, SPDX): inventory of components.
  • Signed artifacts: cosign / Sigstore.
  • Provenance attestations (SLSA framework).
  • Pinned dependencies: lockfiles + reproducible builds.
  • Vuln scanning: Snyk, Trivy, Dependabot alerts; automated update PRs via Dependabot or Renovate.
  • Secrets scanning: gitleaks, trufflehog.
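  1. What’s the difference between continuous delivery and continuous deployment? Delivery = every change is always shippable, but the production deploy is a manual button press. Deployment = every change that passes the pipeline ships to production automatically.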
  2. Pipeline takes 1h — how to speed up? Parallelize, cache, split slow tests, run E2E nightly only, smaller artifacts, faster runners.
  3. Deploy went bad — rollback strategy? Blue/green flip; revert deployment; revert image tag; re-deploy previous artifact.
  4. How do you ensure prod deploys are safe? Tests + scans + canary + auto-rollback on metric breach + observability + feature flags for risky features.
  5. Same code crashes prod, works in staging — why? Env diff (config, secrets, scale, traffic shape, feature flag, third-party rate limit). Reproduce with prod-like load.
  6. CI is passing but a test is actually broken: why? The test is a no-op or asserts the wrong thing, so it passes regardless of behavior. Mutation testing helps detect this.
  7. You see flaky tests. Action plan? Quarantine, investigate root cause (timing, shared state, env), fix, then unquarantine.
  8. Secrets in CI: best practice? OIDC to cloud, no long-lived keys; provider secrets store; rotate regularly; never log; encrypt at rest.
  9. DORA metrics — pick one we should improve and how. Lead time: smaller PRs, trunk-based, faster CI.
  10. How would you do zero-downtime DB migration? Expand-contract + backwards-compatible code.
  • One mega-pipeline doing everything → slow, brittle.
  • Shared mutable test env → flaky.
  • Manual approval steps that nobody understands.
  • “Special” prod-only flags / configs → drift, surprise breakage.
  • No production-like load test before traffic flip.
  • Branch policies bypassed by admins.
  • Ad-hoc hotfix deploys without same gates.
  • Long-lived release branches → integration debt.

Deep dive — image scanning + SBOM + signing

Three supply-chain concerns:

  1. Known CVEs in your image — solved by scanners like Trivy (Aqua, OSS, scans OS packages, language deps, secrets, IaC misconfigs), Grype (Anchore), Snyk, Docker Scout.
  2. Inventory of what’s in the image: solved by an SBOM in SPDX or CycloneDX format, generated by Syft or Trivy (trivy image --format cyclonedx or --format spdx-json; trivy sbom goes the other way and scans an existing SBOM). The SBOM is a JSON document listing every package, version, license, and hash. Store it as a build artifact so that when a future CVE drops you can grep your SBOM archive and know which images are affected without re-scanning.
  3. Provenance — that the image you pulled is the one you built — solved by cosign (Sigstore) signing. Modern keyless cosign uses OIDC (the same identity GitHub Actions issues for AWS) to bind a short-lived cert to the build identity and records the signature in the Rekor transparency log; verifiers check against Rekor without managing private keys.
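
The "grep your SBOM archive" idea from item 2 might look like the sketch below. It assumes CycloneDX JSON, where packages appear under a top-level `components` array with `name` and `version` fields. The in-memory `EXAMPLE_ARCHIVE`, its digests, and the exact-string version match are simplifications for illustration; real distro version strings usually carry suffixes.

```python
from typing import Iterable, Mapping

# Hypothetical archive: image digest -> parsed CycloneDX SBOM (heavily trimmed).
EXAMPLE_ARCHIVE = {
    "sha256:1111": {
        "bomFormat": "CycloneDX",
        "components": [
            {"type": "library", "name": "xz-utils", "version": "5.6.1"},
            {"type": "library", "name": "openssl", "version": "3.0.13"},
        ],
    },
    "sha256:2222": {
        "bomFormat": "CycloneDX",
        "components": [
            {"type": "library", "name": "xz-utils", "version": "5.4.5"},
        ],
    },
}


def affected_images(
    archive: Mapping[str, dict],
    package: str,
    bad_versions: Iterable[str],
) -> list[str]:
    """Return digests of images whose SBOM lists `package` at a bad version."""
    bad = set(bad_versions)
    hits = []
    for digest, sbom in archive.items():
        for component in sbom.get("components", []):
            if component.get("name") == package and component.get("version") in bad:
                hits.append(digest)
                break  # one match is enough to flag the image
    return sorted(hits)
```

For example, `affected_images(EXAMPLE_ARCHIVE, "xz-utils", {"5.6.0", "5.6.1"})` flags only the first image, with no registry pull or re-scan involved.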

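SolarWinds, the event-stream npm hijack, and the xz-utils backdoor (Mar 2024) all exploited trust in the build/distribution chain. US Executive Order 14028 and the EU Cyber Resilience Act (CRA) push SBOMs and signed artifacts toward being mandatory: the former for suppliers to the US federal government, the latter for products with digital elements sold in the EU.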

  • Scan on every push and on a daily schedule — new CVEs land against unchanged images all the time.
  • ignore-unfixed: true filters out CVEs with no upstream patch (otherwise the queue is unworkable).
  • An SBOM is only useful if archived and queryable: store it in artifact storage with the image digest as key.
  • The cosign keyless signing certificate is short-lived, but the Rekor entry is permanent and proves the signature was made while the certificate was valid, so verification still works years later.
  • Trivy DB updates daily; pin DB version in air-gapped environments.

Sources: trivy.dev, docs.sigstore.dev/cosign/signing/overview, snyk.io/blog/10-docker-image-security-best-practices.


| Trigger | Purpose |
|---|---|
| push | Branch/path filters |
| pull_request | PR events; use pull_request_target with extreme caution, since it runs against the base repo with secrets |
| workflow_dispatch | Manual, with typed inputs |
| schedule | Cron, UTC |
| repository_dispatch | External trigger via REST API |
| workflow_call | Makes the workflow reusable |
| workflow_run | Chain after another workflow |

A matrix generates one job per combination of variables. Use include to add specific extra combinations and exclude to remove unwanted ones. Set fail-fast: false to let all matrix legs finish when one fails (essential for cross-version test suites), and max-parallel to throttle concurrency.
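
As a rough model of matrix expansion, the sketch below builds the Cartesian product, applies exclude, then appends include entries. It is a simplification under stated assumptions: real GitHub Actions also merges include entries into existing combinations when they do not overwrite matrix values, which this hypothetical `expand_matrix` helper does not attempt.

```python
from itertools import product
from typing import Iterable, Mapping, Sequence


def expand_matrix(
    axes: Mapping[str, Sequence],
    include: Iterable[Mapping] = (),
    exclude: Iterable[Mapping] = (),
) -> list[dict]:
    """Return one dict per job: product of axes, minus exclude, plus include."""
    keys = list(axes)
    combos = [dict(zip(keys, values)) for values in product(*(axes[k] for k in keys))]

    def matches(combo: Mapping, partial: Mapping) -> bool:
        # A partial entry matches when every key it names has the same value.
        return all(combo.get(k) == v for k, v in partial.items())

    excludes = list(exclude)
    combos = [c for c in combos if not any(matches(c, ex) for ex in excludes)]

    # Simplification: each include entry becomes an extra standalone job.
    for extra in include:
        if dict(extra) not in combos:
            combos.append(dict(extra))
    return combos
```

With two operating systems and two Node versions, this yields four jobs; excluding one pair and including one macOS entry keeps the count at four but changes which combinations run.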

  • Reusable workflows — full workflows called via uses: org/repo/.github/workflows/x.yml@ref at the job level. Get their own runner, can have secrets: and inputs:, limited to nesting depth 10.
  • Composite actions — step-level reusable units that run inside an existing job. Lighter, no separate runner, ideal for packaging a sequence of steps.

A concurrency group prevents overlapping runs on the same logical resource (e.g., one prod deploy at a time). cancel-in-progress: true cancels superseded runs (great for PR builds where only the latest matters).

  • actions/setup-node with cache: 'npm' auto-uses actions/cache keyed on package-lock.json — don’t add a redundant cache step.
  • Default GITHUB_TOKEN permissions became read-only for new repositories and organizations in February 2023; older ones may still default to read/write, so declare permissions: per job explicitly.
  • pull_request from forks does NOT have access to repo secrets (security boundary).
  • on: schedule triggers run only from the workflow file on the default branch.
  • Concurrency groups are case-insensitive — prod == Prod.

Sources: docs.github.com/en/actions/using-workflows.


Storing AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY as a GitHub Secret means a static credential is replicated outside AWS, never auto-rotated, and grants whatever IAM policy is attached for as long as it exists. If a workflow logs it, a malicious action exfiltrates it, or an admin leaves the org, you’re rotating manually.

GitHub acts as an OIDC identity provider, and AWS IAM is configured to trust that provider. For each workflow run GitHub mints a short-lived JWT (5–15 min validity) whose sub claim identifies the repo plus the ref, environment, or pull_request context the job runs in. aws-actions/configure-aws-credentials@v4 then exchanges that JWT for short-lived STS credentials (default 1 hour, capped by the role’s MaxSessionDuration).

No secrets stored, credentials auto-expire, and the IAM trust policy can scope access by repo + branch + environment.

Use StringEquals on sub to pin it to one ref (repo:org/repo:ref:refs/heads/main) or one environment (repo:org/repo:environment:prod). Wildcards via StringLike are convenient but dangerous — repo:org/repo:* lets a PR from any branch assume the role.
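
The difference between the two operators can be checked with a small simulation. This is only an approximation of IAM condition evaluation: `fnmatchcase` stands in for StringLike (it also understands `[...]` sets, which IAM does not, but none appear here), and the `sub_allowed` helper and sample claims are illustrative.

```python
from fnmatch import fnmatchcase


def sub_allowed(sub: str, condition: str, operator: str = "StringEquals") -> bool:
    """Approximate how a trust-policy condition on the sub claim evaluates."""
    if operator == "StringEquals":
        return sub == condition
    if operator == "StringLike":
        return fnmatchcase(sub, condition)  # '*' and '?' wildcards, case-sensitive
    raise ValueError(f"unsupported operator: {operator}")


# sub claims GitHub could present for different triggers in the same repo
MAIN_PUSH = "repo:org/repo:ref:refs/heads/main"
FEATURE_PUSH = "repo:org/repo:ref:refs/heads/feature-x"
PULL_REQUEST = "repo:org/repo:pull_request"
PROD_ENV = "repo:org/repo:environment:prod"
```

Pinned with StringEquals to `MAIN_PUSH`, only pushes to main pass. With StringLike and `repo:org/repo:*`, the feature branch and pull request claims pass as well, which is exactly the over-broad trust described above.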

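The same model works for GCP (Workload Identity Federation, where the audience defaults to the workload identity provider’s full resource name, which embeds the numeric project number) and Azure (federated credentials on a Microsoft Entra ID, formerly Azure AD, app registration).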

  • Forgetting permissions: id-token: write → “credentials could not be loaded” — the job can’t request a JWT.
  • StringLike with repo:org/repo:* is the #1 OIDC misconfiguration; pin to specific ref: or environment:.
  • role-session-name shows up in CloudTrail — make it traceable (include github.run_id).
  • Cross-account: the role you assume must be in the target account; trust GitHub’s OIDC provider in that account too.
  • Per-environment roles + GitHub environment: protection rules give you human approval gates on top of OIDC.

Q: Why is OIDC better than storing AWS access keys as GitHub Secrets?

Three reasons. (1) No long-lived credentials anywhere — STS tokens expire in an hour. (2) Trust is scoped by GitHub’s signed JWT claims, so an IAM trust policy can require sub = repo:org/repo:environment:prod — a leaked token from a feature branch can’t deploy to prod. (3) Auditability: every assumption shows up in CloudTrail with role-session-name tied to a specific GitHub run ID. Operationally there’s nothing to rotate, nothing to re-store when team members leave, and nothing to leak in logs.

Q: Walk through configuring it end-to-end.

In AWS: create an IAM OIDC identity provider for https://token.actions.githubusercontent.com with audience sts.amazonaws.com. Create an IAM role with a trust policy that allows sts:AssumeRoleWithWebIdentity from that provider, with a Condition pinning sub to repo:org/repo:environment:prod (and aud to sts.amazonaws.com). Attach the least-privilege policy the deploy needs. In GitHub: set permissions: id-token: write, use aws-actions/configure-aws-credentials@v4 with role-to-assume, and protect the prod environment with required reviewers.
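
The trust policy described in that answer can be generated rather than hand-written. The sketch below builds the JSON document as a Python dict; the `build_trust_policy` helper is hypothetical, and the account ID `123456789012` used in the usage note is a placeholder, not a real account.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def build_trust_policy(account_id: str, repo: str, environment: str) -> dict:
    """Trust policy letting one GitHub environment of one repo assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # StringEquals on both claims: no wildcards, one exact subject.
                    "StringEquals": {
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:environment:{environment}",
                    }
                },
            }
        ],
    }


def render(policy: dict) -> str:
    """Serialize the policy for use with the IAM console, CLI, or IaC tooling."""
    return json.dumps(policy, indent=2)
```

Calling `render(build_trust_policy("123456789012", "org/repo", "prod"))` produces a document that can be pasted into the role's trust relationship; generating one per environment keeps the per-environment roles consistent.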


Declarative pipelines wrap everything in a pipeline { } block with strict structure (agent, stages, steps, post) — easier to read, easier to lint, the recommended default since 2017. Scripted pipelines are raw Groovy with imperative control flow — more flexible (loops, try/catch around stages) but harder to maintain.

Shared libraries live in a separate repo with vars/ (each .groovy file becomes a global pipeline step like deployToK8s()), src/ (regular Groovy classes on the classpath), and resources/ (bundled files loadable via libraryResource). They are loaded via @Library('my-lib@v1.2') _.

The credentials() helper: for a username/password credential, environment { AWS_CREDS = credentials('aws-prod') } exposes AWS_CREDS (as user:password) plus AWS_CREDS_USR / AWS_CREDS_PSW, all masked in logs.

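post { success {} failure {} always {} } runs after the stages finish: always runs regardless of outcome, while success and failure run only on the matching result. The always block is the natural home for Slack notifications and artifact cleanup.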

Jenkins wins when you need self-hosted control, deep plugin integration with on-prem tooling (older artifact repos, hardware test rigs, regulated environments where SaaS CI is forbidden), or complex pipelines that pre-date GHA. GHA wins on zero infra, native GitHub integration, marketplace ecosystem, and OIDC-everywhere. Regulated/government shops often run Jenkins on-prem behind their firewall for compliance even if GitHub Actions is allowed for public-facing code.

  • Plugin sprawl is Jenkins’ biggest liability — every plugin is an attack surface and a future upgrade pain.
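  • agent any on a single-node Jenkins runs every build on the controller, so builds queue behind its executors (a single executor serializes them).
  • parallel inside declarative runs all branches to completion by default and fails the stage if any branch fails; set failFast true to abort the remaining branches on the first failure.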
  • Shared library versions cache aggressively — pin to a SHA, not main, for reproducibility.

Sources: jenkins.io/doc/book/pipeline/syntax, jenkins.io/doc/book/pipeline/shared-libraries.


Rolling — Kubernetes’ default Deployment strategy. Replace pods N at a time controlled by maxSurge (extras allowed above desired) and maxUnavailable (pods allowed to go missing). Simple, no extra infra, no traffic splitting — but old and new versions serve traffic concurrently mid-rollout, so you must support N and N+1 schemas simultaneously. Rollback is “revert the deployment,” which itself is another rolling update (slow).
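
The effect of maxSurge and maxUnavailable is easy to work out by hand, and the sketch below does the arithmetic. It follows the documented Kubernetes rounding (percentages round up for maxSurge and down for maxUnavailable, both defaulting to 25%); the `rolling_bounds` helper itself is illustrative, not a Kubernetes API.

```python
from typing import Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or an 'N%' string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rolling_bounds(
    replicas: int,
    max_surge: IntOrPercent = "25%",
    max_unavailable: IntOrPercent = "25%",
) -> tuple[int, int]:
    """Return (max total pods, min available pods) during the rollout."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be zero")
    return replicas + surge, replicas - unavailable
```

With 10 replicas and the defaults, the rollout may run up to 13 pods and must keep at least 8 available. Setting max_unavailable to 0 trades rollout speed for full serving capacity throughout.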

Blue/green — runs two complete environments (blue = current, green = candidate); cut traffic at the load balancer in a single switch (DNS, ALB target group swap, Service selector flip). Rollback is instant (flip back). Cost: 2× infra during cutover. Schema migrations still need expand/contract (the database has only one copy). Best for stateful systems where you want zero-mixing of versions.

Canary — routes a small percentage (1%, 5%, 10%) of traffic to the new version, observes SLO metrics (error rate, p99 latency), progressively ramps up — or auto-aborts on regression. Needs a traffic-splitting layer: a service mesh (Istio, Linkerd), an ALB with weighted target groups, or a controller like Argo Rollouts or Flagger that automates the weight steps and integrates with Prometheus for analysis.

Feature flags (LaunchDarkly, Unleash, Flagsmith, OpenFeature as the vendor-neutral spec) decouple deploy from release: code ships dark to production, then is enabled per-user, per-segment, or globally without a redeploy. Combine with canary for “deploy continuously, release deliberately.”

  • Database migrations break all three strategies if not expand/contract (add column nullable → deploy code that writes both → backfill → deploy code that reads new → drop old).
  • Canary metrics need enough traffic — for a 5% canary on 100 req/min you’ll wait hours for signal.
  • Blue/green doubles cost during cutover, and in-flight connections to the old environment must be drained before it is torn down.
  • Feature flag debt: every flag is a permanent if until removed; budget for cleanup.
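
The "you’ll wait hours" claim for low-traffic canaries is simple arithmetic. In the sketch below, the 1,000-request sample target and the 1% error rate are assumptions picked for illustration, not statistical recommendations, and both function names are hypothetical.

```python
def minutes_to_collect(total_rpm: float, canary_percent: float, required_requests: int) -> float:
    """Minutes until the canary has served `required_requests` requests."""
    canary_rpm = total_rpm * canary_percent / 100
    if canary_rpm <= 0:
        raise ValueError("canary receives no traffic")
    return required_requests / canary_rpm


def expected_errors_per_hour(total_rpm: float, canary_percent: float, error_rate: float) -> float:
    """Expected number of canary errors in one hour at a given error rate."""
    return total_rpm * canary_percent / 100 * 60 * error_rate
```

At 100 req/min with a 5% canary, the canary sees 5 req/min, so collecting 1,000 requests takes 200 minutes (over three hours), and a 1% error rate produces only about 3 errors per hour, which is too few to separate a regression from noise quickly.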

Q: When would you pick blue/green over canary?

Three cases. (1) Low-traffic services where canary slices have no statistical power. (2) Stateful apps where running N and N+1 simultaneously is unsafe (incompatible binary protocols, sticky sessions, in-memory state). (3) Compliance environments needing a single, auditable cutover moment with named approvers. Otherwise canary wins because it limits blast radius — at 5% traffic, a bad deploy hits 5% of users, not 50% mid-rolling-update.

Q: Tell me how you’d ship a breaking schema change without downtime.

Expand/contract over multiple deploys. Deploy 1: add the new column as nullable; code writes both old and new, reads old. Deploy 2 (after backfill): code reads new, still writes both. Deploy 3: code stops writing old. Deploy 4: drop the old column. Each of the first three deploys is independently reversible because the previous version still works; dropping the column is the one irreversible step, so do it last, once nothing reads or writes the old column. Combine with feature flags so the read-switch is a runtime toggle, not a deploy.
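
The sequence above can be walked through against an in-memory SQLite database. This is a toy sketch under stated assumptions: the `users` table and the `name` to `display_name` rename are hypothetical, each function stands in for what one deploy would do, and the final drop is skipped on SQLite builds older than 3.35, which lack DROP COLUMN.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Deploy 1 (schema): add the new column as nullable."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def insert_dual_write(conn: sqlite3.Connection, name: str) -> None:
    """Deploy 1 (code): write both the old and the new column."""
    conn.execute("INSERT INTO users (name, display_name) VALUES (?, ?)", (name, name))


def backfill(conn: sqlite3.Connection) -> int:
    """Copy old values into the new column for rows written before the expand."""
    cur = conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")
    return cur.rowcount


def read_new(conn: sqlite3.Connection) -> list[str]:
    """Deploy 2 (code): reads come from the new column."""
    return [row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")]


def contract(conn: sqlite3.Connection) -> bool:
    """Deploy 4: drop the old column; returns False if SQLite is too old."""
    if sqlite3.sqlite_version_info < (3, 35, 0):
        return False
    conn.execute("ALTER TABLE users DROP COLUMN name")
    return True


def demo() -> tuple[int, list[str]]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('ada')")  # written by the old code
    expand(conn)
    insert_dual_write(conn, "grace")
    backfilled = backfill(conn)
    contract(conn)
    return backfilled, read_new(conn)
```

Running `demo()` backfills the one pre-existing row and reads both names from the new column, whether or not the old column could be dropped. On a production database the backfill would run in batches and the drop would be a separate, later release.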

Sources: kubernetes.io/docs/concepts/workloads/controllers/deployment, argoproj.github.io/argo-rollouts/features/canary, martinfowler.com/bliki/BlueGreenDeployment, openfeature.dev.


Closing framing — supply chain + governance

The defensible governance/supply-chain story for regulated systems:

  • OIDC (no static keys in CI).
  • Signed images + SBOM (supply-chain provenance).
  • Expand/contract migrations + canary (zero-downtime within compliance windows).
  • Jenkins-on-prem option (sovereignty for air-gapped or regulated networks).
  • Pinned image digests + admission control (reproducibility + tamper resistance).

Regulators care less about raw deploy frequency and more about auditability and blast-radius control. Frame answers around those.