CI/CD — Theory

What “good” looks like

Every commit produces a deployable artifact.
Tests + scans run automatically; broken builds block merge.
Same artifact promoted dev → staging → prod (no rebuild per env).
Deploys are scriptable, reversible, observable.
Production deploy is a non-event (no all-hands required).

Trunk-based development vs feature branches

Trunk-based: short-lived feature branches (<1 day), merge to main fast, hide unfinished work behind feature flags. Forces:

Small PRs.
Fast CI (10-15 min).
Frequent integration.
Feature flags for gradual rollout.

Long-lived branches drift from main, cause painful merges, and delay integration. Avoid them except for hotfix lanes.

Pipeline design principles

Fast feedback: lint/unit < 5 min. Block merge on these. Slow tests can run async.
Fail fast: stop pipeline on first failure.
Parallelize independent stages.
Cache aggressively: deps, build artifacts, Docker layers.
Idempotent: re-run same SHA → same result.
Reproducible: pinned versions of tools, deps, base images.
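
The caching and idempotency principles above can be sketched as a cache key derived only from pinned inputs. This is a minimal illustration, not any CI vendor's implementation; the function name `cache_key`, the prefix, and the sample lockfile contents are assumptions made up for the example.

```python
import hashlib

def cache_key(prefix: str, lockfile_bytes: bytes, tool_version: str) -> str:
    """Derive a cache key that changes only when the lockfile or tool version changes."""
    digest = hashlib.sha256()
    digest.update(tool_version.encode("utf-8"))
    digest.update(b"\0")  # separator so adjacent fields cannot run together
    digest.update(lockfile_bytes)
    return f"{prefix}-{digest.hexdigest()[:16]}"

# Hypothetical lockfile contents, for illustration only.
lock_v1 = b'{"requests": "2.31.0"}'
lock_v2 = b'{"requests": "2.32.0"}'

key_a = cache_key("deps", lock_v1, "python-3.12")
key_b = cache_key("deps", lock_v1, "python-3.12")
key_c = cache_key("deps", lock_v2, "python-3.12")

print(key_a == key_b)  # same SHA and same pinned inputs reuse the cache
print(key_a == key_c)  # a changed lockfile invalidates it
```

Keying on content rather than on branch name or timestamp is what makes a re-run of the same SHA hit the same cache entry.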

Blue/Green vs Canary

	Blue/Green	Canary
Risk	low (full rollback)	low + smaller blast radius
Cost	2x infra during cut	small extra
Speed	instant cutover	gradual ramp
Use	infrequent deploys, big release	continuous, observe metrics

Canary needs:

Routing layer that splits traffic by % (Istio, ALB weighted, Argo Rollouts).
Real-time metrics for go/no-go (error rate, latency).
Automated rollback if SLI degrades.
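
A go/no-go check of the kind listed above might look like the following sketch. The thresholds (1 percentage point of extra errors, 20% extra p99 latency) and the names `Sli` and `canary_verdict` are illustrative assumptions, not defaults of Istio, ALB, or Argo Rollouts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sli:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float

def canary_verdict(baseline: Sli, canary: Sli,
                   max_error_delta: float = 0.01,
                   max_latency_ratio: float = 1.2) -> str:
    """Return "promote" or "rollback" by comparing canary SLIs to the baseline."""
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"

baseline = Sli(error_rate=0.002, p99_latency_ms=180.0)
print(canary_verdict(baseline, Sli(0.003, 190.0)))  # within both thresholds
print(canary_verdict(baseline, Sli(0.050, 190.0)))  # error rate regressed
```

Comparing against the live baseline rather than a fixed number keeps the check meaningful when the whole system is having a bad day.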

Feature flags

Decouple deploy from release. Code in production, gated by flag, enable per user/segment.

Tools: LaunchDarkly, Unleash, Flagsmith, GrowthBook, Split.

Pitfalls:

Flag debt — old flags accumulating.
Conditional logic spaghetti.
No flag in tests → bug missed.
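
Per-user gating can be sketched with a stable hash bucket, so the same user always lands on the same side of a percentage rollout. This is a generic technique, not the algorithm of LaunchDarkly, Unleash, or any other listed tool; `bucket`, `is_enabled`, and the flag name are hypothetical.

```python
import hashlib

def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout when their bucket falls below the percentage."""
    return bucket(flag, user_id) < rollout_percent

# Tests should exercise both sides of the flag, which addresses the
# "no flag in tests" pitfall above.
for percent in (0, 100):
    print(percent, is_enabled("new-checkout", "user-42", percent))
```

Because the bucket depends only on the flag and the user, raising the percentage only ever adds users; nobody flips back and forth between variants during a ramp.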

Deployment safety mechanisms

Health checks before sending traffic.
Smoke tests: hit critical endpoints right after deploy, before ramping traffic.
Auto-rollback on metric breach (5xx rate, latency).
Progressive delivery: canary → 1% → 10% → 50% → 100%.
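Database migration safety: keep schema changes compatible with the code that is still running (don’t drop a column in the same release that stops using it; treat a rename as add-new, migrate, then drop-old).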
Maintenance mode for incompatible changes (rare).
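
The progressive-delivery and auto-rollback items above combine into a simple loop. The sketch below assumes a caller-supplied health probe; `progressive_rollout` and the lambda standing in for a real metrics query are hypothetical.

```python
from typing import Callable, List, Tuple

def progressive_rollout(steps: List[int],
                        is_healthy: Callable[[int], bool]) -> Tuple[int, List[str]]:
    """Shift traffic step by step; return to 0% on the first unhealthy step."""
    events: List[str] = []
    current = 0
    for percent in steps:
        current = percent
        events.append(f"shift to {percent}%")
        if not is_healthy(percent):  # e.g. 5xx rate or latency breached
            events.append("rollback to 0%")
            return 0, events
    return current, events

# Hypothetical probe: pretend the release degrades once it takes 50% of traffic.
final, events = progressive_rollout([1, 10, 50, 100], lambda pct: pct < 50)
print(final)
print(events)
```

A real controller would also wait and collect metrics between steps; the sketch only shows the control flow.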

Database migration patterns

Never deploy code that requires a schema change before that change has been applied. Rule of thumb: schema first, code second; roll back in reverse order.

Patterns:

Expand-contract: add column → dual-write → backfill → switch reads → remove old. Spans multiple releases.
Online schema change tools: gh-ost, pt-online-schema-change for big tables.
Lock-aware migrations (PG: avoid ACCESS EXCLUSIVE on big tables in business hours).
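
The expand-contract sequence can be walked through on an in-memory SQLite database. This is a toy sketch: the table, the column names (`fullname`, `display_name`), and the helper functions are invented for illustration, and on a real system each phase would ship in its own release.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column as nullable so old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Dual-write: the new code writes both columns.
def create_user(name: str) -> None:
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )

create_user("Grace Hopper")

# Backfill: copy values for rows written before the dual-write release.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Switch reads: the next release reads only the new column.
def list_names() -> list:
    rows = conn.execute("SELECT display_name FROM users ORDER BY id")
    return [row[0] for row in rows]

names = list_names()
missing = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
print(names, missing)

# Contract: dropping `fullname` happens in a later release, once nothing reads it.
```

The final drop is deliberately left out of the runnable part, since it is the only step that cannot be undone by redeploying older code.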

Supply chain security

SBOM (CycloneDX, SPDX): inventory of components.
Signed artifacts: cosign / Sigstore.
Provenance attestations (SLSA framework).
Pinned dependencies: lockfiles + reproducible builds.
Vuln scanning: Snyk, Trivy, Dependabot alerts. Renovate and Dependabot version updates automate the dependency bumps that fix findings.
Secrets scanning: gitleaks, trufflehog.

Common interview Qs

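What’s the difference between continuous delivery and continuous deployment? Delivery = every change is kept shippable, but the production deploy is a manual button press. Deployment = every change that passes the pipeline goes to production automatically.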
Pipeline takes 1h — how to speed up? Parallelize, cache, split slow tests, run E2E nightly only, smaller artifacts, faster runners.
Deploy went bad — rollback strategy? Blue/green flip; revert deployment; revert image tag; re-deploy previous artifact.
How do you ensure prod deploys are safe? Tests + scans + canary + auto-rollback on metric breach + observability + feature flags for risky features.
Same code crashes prod, works in staging — why? Env diff (config, secrets, scale, traffic shape, feature flag, third-party rate limit). Reproduce with prod-like load.
CI passing but test broken — why? Test no-op or wrong assertion. Mutation testing helps detect.
You see flaky tests. Action plan? Quarantine, investigate root cause (timing, shared state, env), fix, then unquarantine.
Secrets in CI: best practice? OIDC to cloud, no long-lived keys; provider secrets store; rotate regularly; never log; encrypt at rest.
DORA metrics — pick one we should improve and how. Lead time: smaller PRs, trunk-based, faster CI.
How would you do zero-downtime DB migration? Expand-contract + backwards-compatible code.

Common pitfalls

One mega-pipeline doing everything → slow, brittle.
Shared mutable test env → flaky.
Manual approval steps that nobody understands.
“Special” prod-only flags / configs → drift, surprise breakage.
No production-like load test before traffic flip.
Branch policies bypassed by admins.
Ad-hoc hotfix deploys without same gates.
Long-lived release branches → integration debt.

Deep dive — image scanning + SBOM + signing

Three supply-chain concerns:

Known CVEs in your image — solved by scanners like Trivy (Aqua, OSS, scans OS packages, language deps, secrets, IaC misconfigs), Grype (Anchore), Snyk, Docker Scout.
Inventory of what’s in the image — solved by an SBOM in SPDX or CycloneDX format, generated by Syft or trivy sbom. The SBOM is a JSON document listing every package, version, license, hash; store it as a build artifact so when a future CVE drops you can grep your SBOM archive and know which images are affected without re-scanning.
Provenance — that the image you pulled is the one you built — solved by cosign (Sigstore) signing. Modern keyless cosign uses OIDC (the same identity GitHub Actions issues for AWS) to bind a short-lived cert to the build identity and records the signature in the Rekor transparency log; verifiers check against Rekor without managing private keys.
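
Querying an SBOM archive for a newly disclosed CVE, as described in the inventory item above, could be sketched like this. It assumes CycloneDX JSON documents (which carry a top-level `components` list with `name` and `version`) keyed by image digest; the digests, the archive contents, and `affected_images` are hypothetical.

```python
from typing import Dict, List, Set

def affected_images(sboms: Dict[str, dict], package: str,
                    bad_versions: Set[str]) -> List[str]:
    """Return the digests of images whose SBOM lists a bad version of `package`."""
    hits = []
    for image_digest, sbom in sboms.items():
        for component in sbom.get("components", []):
            if (component.get("name") == package
                    and component.get("version") in bad_versions):
                hits.append(image_digest)
                break  # one match is enough to flag the image
    return sorted(hits)

# Hypothetical archive: image digest -> parsed CycloneDX document.
archive = {
    "sha256:aaa": {"components": [{"name": "xz-utils", "version": "5.6.1"}]},
    "sha256:bbb": {"components": [{"name": "xz-utils", "version": "5.4.5"}]},
    "sha256:ccc": {"components": [{"name": "openssl", "version": "3.0.13"}]},
}

hits = affected_images(archive, "xz-utils", {"5.6.0", "5.6.1"})
print(hits)
```

Matching on exact name and version is the simplest possible query; a production lookup would more likely match on package URL (purl) and version ranges.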

Why this matters in 2024–2026

SolarWinds, the event-stream npm hijack, and the xz-utils backdoor (Mar 2024) all exploited trust in the build/distribution chain. US Executive Order 14028 pushes SBOMs and signed artifacts toward mandatory for federal software suppliers, and the EU Cyber Resilience Act (CRA) adds SBOM duties for products with digital elements sold in the EU.

Gotchas

Scan on every push and on a daily schedule — new CVEs land against unchanged images all the time.
Trivy’s ignore-unfixed: true (CLI: --ignore-unfixed) filters out CVEs with no upstream patch (otherwise the queue is unworkable).
SBOM only useful if archived and queryable — store in artifact storage with image digest as key.
Cosign keyless signing certificates are short-lived (minutes), but the Rekor entry records that the signature was made while the certificate was valid, so verification still works years later.
Trivy’s vulnerability DB is rebuilt several times a day (every six hours, per the Trivy docs); pin a DB snapshot in air-gapped environments.

Sources: trivy.dev, docs.sigstore.dev/cosign/signing/overview, snyk.io/blog/10-docker-image-security-best-practices.

Deep dive — GitHub Actions

Triggers (`on:`)

Trigger	Purpose
`push`	Branch/path filters
`pull_request`	Runs on PR events; use `pull_request_target` with extreme caution, since it runs in the base repo’s context with access to secrets
`workflow_dispatch`	Manual, with typed `inputs`
`schedule`	Cron, UTC
`repository_dispatch`	External trigger via REST API
`workflow_call`	Makes the workflow reusable
`workflow_run`	Chain after another workflow

Matrix builds

Generate one job per combination of variables. Use include to add specific extra combinations and exclude to remove specific ones. Set fail-fast: false to let all matrix legs finish when one fails (essential for cross-version test suites). Use max-parallel to throttle.
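
How a matrix turns into jobs can be modeled as a Cartesian product with filters. The sketch below is a simplified model and an assumption on my part: GitHub's real include handling also merges extra keys into matching combinations, which this version does not attempt. `expand_matrix` and the sample axes are made up for the example.

```python
from itertools import product
from typing import Dict, List, Sequence

def expand_matrix(axes: Dict[str, list],
                  exclude: Sequence[dict] = (),
                  include: Sequence[dict] = ()) -> List[dict]:
    """Cartesian product of the axes, minus excluded combos, plus extra combos."""
    keys = list(axes)
    combos = [dict(zip(keys, values))
              for values in product(*(axes[k] for k in keys))]
    # An exclude rule removes every combo that matches all of its key/value pairs.
    combos = [c for c in combos
              if not any(all(c.get(k) == v for k, v in rule.items())
                         for rule in exclude)]
    # Simplification: every include entry is appended as its own job.
    combos.extend(dict(extra) for extra in include)
    return combos

jobs = expand_matrix(
    {"os": ["ubuntu", "windows"], "node": [18, 20]},
    exclude=[{"os": "windows", "node": 18}],
    include=[{"os": "macos", "node": 20}],
)
print(len(jobs))
for job in jobs:
    print(job)
```

Two axes of two values give four combinations; one is excluded and one is added, so the example ends with four jobs.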

Reusable workflows vs composite actions

Reusable workflows — full workflows called via uses: org/repo/.github/workflows/x.yml@ref at the job level. Get their own runner, can have secrets: and inputs:, limited to nesting depth 10.
Composite actions — step-level reusable units that run inside an existing job. Lighter, no separate runner, ideal for packaging a sequence of steps.

Concurrency groups

Prevent overlapping runs on the same logical resource (e.g., one prod deploy at a time). cancel-in-progress: true kills superseded runs (great for PR builds where only the latest matters).

Gotchas

actions/setup-node with cache: 'npm' auto-uses actions/cache keyed on package-lock.json — don’t add a redundant cache step.
Default GITHUB_TOKEN permissions have been read-only for new repositories and organizations since early 2023; older ones may still default to read/write, so declare permissions: explicitly per job.
pull_request from forks does NOT have access to repo secrets (security boundary).
Scheduled workflows (on: schedule) only run from the workflow file on the default branch.
Concurrency groups are case-insensitive — prod == Prod.

Sources: docs.github.com/en/actions/using-workflows.

Deep dive — OIDC to AWS (modern way)

Why long-lived AWS keys are out

Storing AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY as a GitHub Secret means a static credential is replicated outside AWS, never auto-rotated, and grants whatever IAM policy is attached for as long as it exists. If a workflow logs it, a malicious action exfiltrates it, or an admin leaves the org, you’re rotating manually.

How OIDC flips this

GitHub acts as an OIDC identity provider, and AWS IAM trusts that provider.
For each workflow run GitHub mints a short-lived JWT (5–15 min validity) whose sub claim identifies the exact repo, branch, environment, or job.
aws-actions/configure-aws-credentials@v4 exchanges that JWT for short-lived STS credentials (default 1 hour, capped by the role’s MaxSessionDuration).

No secrets stored, credentials auto-expire, and the IAM trust policy can scope access by repo + branch + environment.

Trust policy is the security boundary

Use StringEquals on sub to pin it to one ref (repo:org/repo:ref:refs/heads/main) or one environment (repo:org/repo:environment:prod). Wildcards via StringLike are convenient but dangerous — repo:org/repo:* lets a PR from any branch assume the role.
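
The difference between a pinned and a wildcard subject can be checked with a small matcher. This approximates IAM's StringLike with Python's `fnmatchcase` (which additionally treats square brackets specially, unlike IAM); the helper name `string_like` and the sample subjects are assumptions for illustration.

```python
from fnmatch import fnmatchcase

def string_like(pattern: str, value: str) -> bool:
    """Approximate IAM StringLike: '*' matches any run of characters, '?' one."""
    return fnmatchcase(value, pattern)

# Sample sub claims in the formats GitHub issues for branches, PRs, and environments.
subjects = [
    "repo:org/repo:ref:refs/heads/main",
    "repo:org/repo:ref:refs/heads/feature/x",
    "repo:org/repo:pull_request",
    "repo:org/repo:environment:prod",
]

wide = "repo:org/repo:*"
pinned = "repo:org/repo:ref:refs/heads/main"

wide_matches = [s for s in subjects if string_like(wide, s)]
pinned_matches = [s for s in subjects if s == pinned]  # StringEquals is exact

print(len(wide_matches))
print(pinned_matches)
```

The wildcard admits every subject in the list, including the pull request one, while the pinned value admits only main.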

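Same model works for GCP (Workload Identity Federation, where the audience is the full resource name of the workload identity provider, which embeds the numeric project number) and Azure (federated credentials on a Microsoft Entra ID, formerly Azure AD, app registration).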

Gotchas

Forgetting permissions: id-token: write → “credentials could not be loaded” — the job can’t request a JWT.
StringLike with repo:org/repo:* is the #1 OIDC misconfiguration; pin to specific ref: or environment:.
role-session-name shows up in CloudTrail — make it traceable (include github.run_id).
Cross-account: the role you assume must be in the target account; trust GitHub’s OIDC provider in that account too.
Per-environment roles + GitHub environment: protection rules give you human approval gates on top of OIDC.

Q: Why is OIDC better than storing AWS access keys as GitHub Secrets?

Three reasons. (1) No long-lived credentials anywhere — STS tokens expire in an hour. (2) Trust is scoped by GitHub’s signed JWT claims, so an IAM trust policy can require sub = repo:org/repo:environment:prod — a leaked token from a feature branch can’t deploy to prod. (3) Auditability: every assumption shows up in CloudTrail with role-session-name tied to a specific GitHub run ID. Operationally there’s nothing to rotate, nothing to re-store when team members leave, and nothing to leak in logs.

Q: Walk through configuring it end-to-end.

In AWS: create an IAM OIDC identity provider for https://token.actions.githubusercontent.com with audience sts.amazonaws.com. Create an IAM role with a trust policy that allows sts:AssumeRoleWithWebIdentity from that provider, with a Condition pinning sub to repo:org/repo:environment:prod (and aud to sts.amazonaws.com). Attach the least-privilege policy the deploy needs. In GitHub: set permissions: id-token: write, use aws-actions/configure-aws-credentials@v4 with role-to-assume, and protect the prod environment with required reviewers.
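
The trust policy from the walk-through can be generated as a plain JSON document. This sketch only builds the document; it does not call AWS. The account ID `123456789012` is a placeholder and `trust_policy` is a hypothetical helper.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"

def trust_policy(account_id: str, subject: str) -> dict:
    """Build an IAM trust policy that pins both sub and aud with StringEquals."""
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": provider_arn},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{GITHUB_OIDC_HOST}:sub": subject,
                    f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }

# Placeholder account ID; the subject pins the role to the prod environment.
policy = trust_policy("123456789012", "repo:org/repo:environment:prod")
print(json.dumps(policy, indent=2))
```

Generating the policy from code makes it easy to stamp out one role per environment while keeping the subject pinned in each.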

Deep dive — Jenkins (in case asked)

Declarative vs scripted

Declarative pipelines wrap everything in a pipeline { } block with strict structure (agent, stages, steps, post) — easier to read, easier to lint, the recommended default since 2017. Scripted pipelines are raw Groovy with imperative control flow — more flexible (loops, try/catch around stages) but harder to maintain.

Shared libraries

Live in a separate repo with vars/ (each .groovy file becomes a global pipeline step like deployToK8s()), src/ (regular Groovy classes on the classpath), and resources/ (bundled files loadable via libraryResource). Loaded via @Library('my-lib@v1.2') _.

Credentials binding

The credentials() helper: for a username/password credential, environment { AWS_CREDS = credentials('aws-prod') } exposes AWS_CREDS (as user:password) plus AWS_CREDS_USR / AWS_CREDS_PSW, all masked in logs.

The post { } block runs after the stages finish. always {} runs regardless of outcome (perfect for Slack notifications and artifact cleanup); success {} and failure {} run only on the matching result.

Jenkins vs GitHub Actions

Jenkins wins when you need self-hosted control, deep plugin integration with on-prem tooling (older artifact repos, hardware test rigs, regulated environments where SaaS CI is forbidden), or complex pipelines that pre-date GHA. GHA wins on zero infra, native GitHub integration, marketplace ecosystem, and OIDC-everywhere. Regulated/government shops often run Jenkins on-prem behind their firewall for compliance even if GitHub Actions is allowed for public-facing code.

Gotchas

Plugin sprawl is Jenkins’ biggest liability — every plugin is an attack surface and a future upgrade pain.
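agent any on a single-node Jenkins queues builds behind that node’s executors; with one executor, builds run strictly one at a time.
parallel inside declarative runs every branch to completion by default and fails the stage if any branch fails; set failFast true to abort the remaining branches on the first failure.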
Shared library versions cache aggressively — pin to a SHA, not main, for reproducibility.

Sources: jenkins.io/doc/book/pipeline/syntax, jenkins.io/doc/book/pipeline/shared-libraries.

Deep dive — deployment strategies

Rolling — Kubernetes’ default Deployment strategy. Replace pods N at a time controlled by maxSurge (extras allowed above desired) and maxUnavailable (pods allowed to go missing). Simple, no extra infra, no traffic splitting — but old and new versions serve traffic concurrently mid-rollout, so you must support N and N+1 schemas simultaneously. Rollback is “revert the deployment,” which itself is another rolling update (slow).
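
The effect of maxSurge and maxUnavailable is easiest to see as arithmetic. The sketch assumes percentage values and Kubernetes' documented rounding (maxSurge rounds up, maxUnavailable rounds down) with the default of 25% for both; `rolling_bounds` is a hypothetical helper name.

```python
from typing import Tuple

def rolling_bounds(replicas: int,
                   max_surge_pct: int = 25,
                   max_unavailable_pct: int = 25) -> Tuple[int, int]:
    """Return (max total pods, min available pods) during a rolling update."""
    surge = -(-replicas * max_surge_pct // 100)           # ceiling division
    unavailable = replicas * max_unavailable_pct // 100   # floor division
    return replicas + surge, replicas - unavailable

max_pods, min_available = rolling_bounds(10)
print(max_pods, min_available)
```

For 10 replicas at the defaults, 25% is 2.5 pods: the surge rounds up to 3 and the unavailable count rounds down to 2, so the rollout runs with between 8 available and 13 total pods.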

Blue/green — runs two complete environments (blue = current, green = candidate); cut traffic at the load balancer in a single switch (DNS, ALB target group swap, Service selector flip). Rollback is instant (flip back). Cost: 2× infra during cutover. Schema migrations still need expand/contract (the database has only one copy). Best for stateful systems where you want zero-mixing of versions.

Canary — routes a small percentage (1%, 5%, 10%) of traffic to the new version, observes SLO metrics (error rate, p99 latency), progressively ramps up — or auto-aborts on regression. Needs a traffic-splitting layer: a service mesh (Istio, Linkerd), an ALB with weighted target groups, or a controller like Argo Rollouts or Flagger that automates the weight steps and integrates with Prometheus for analysis.

Feature flags (LaunchDarkly, Unleash, Flagsmith, OpenFeature as the vendor-neutral spec) decouple deploy from release: code ships dark to production, then is enabled per-user, per-segment, or globally without a redeploy. Combine with canary for “deploy continuously, release deliberately.”

Gotchas

Database migrations break all three strategies if not expand/contract (add column nullable → deploy code that writes both → backfill → deploy code that reads new → drop old).
Canary metrics need enough traffic — for a 5% canary on 100 req/min you’ll wait hours for signal.
Blue/green doubles cost and doubles in-flight connections during cutover; drain the old environment’s connections before tearing it down.
Feature flag debt: every flag is a permanent if until removed; budget for cleanup.
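
The low-traffic gotcha above comes down to sample size. The sketch below works the numbers for the 5% canary on 100 req/min case; the 1% baseline error rate and the helper `canary_sample` are assumptions added for illustration.

```python
from typing import Tuple

def canary_sample(total_rpm: float, canary_percent: float,
                  baseline_error_rate: float,
                  hours: float = 1.0) -> Tuple[float, float]:
    """Return (canary requests, expected errors at the baseline rate) for a window."""
    requests = total_rpm * canary_percent * 60 * hours / 100
    expected_errors = requests * baseline_error_rate
    return requests, expected_errors

# 100 req/min overall, 5% to the canary, assumed 1% baseline error rate.
requests, expected_errors = canary_sample(100, 5, 0.01)
print(requests, expected_errors)
```

That is about 300 canary requests and roughly 3 expected errors per hour, so a handful of extra failures is indistinguishable from noise until several hours of data accumulate.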

Q: When would you pick blue/green over canary?

Three cases. (1) Low-traffic services where canary slices have no statistical power. (2) Stateful apps where running N and N+1 simultaneously is unsafe (incompatible binary protocols, sticky sessions, in-memory state). (3) Compliance environments needing a single, auditable cutover moment with named approvers. Otherwise canary wins because it limits blast radius — at 5% traffic, a bad deploy hits 5% of users, not 50% mid-rolling-update.

Q: Tell me how you’d ship a breaking schema change without downtime.

Expand/contract over multiple deploys. Deploy 1: add the new column nullable, code writes both old and new, reads old. Deploy 2 (after backfill): code reads new, still writes both. Deploy 3: code stops writing old. Deploy 4: drop old column. Each deploy can be rolled back one step because the previous version still works against the current schema; only the final drop is destructive, so run it once nothing reads the old column. Combine with feature flags so the read-switch is a runtime toggle, not a deploy.

Sources: kubernetes.io/docs/concepts/workloads/controllers/deployment, argoproj.github.io/argo-rollouts/features/canary, martinfowler.com/bliki/BlueGreenDeployment, openfeature.dev.

Closing framing — supply chain + governance

The defensible governance/supply-chain story for regulated systems:

OIDC (no static keys in CI).
Signed images + SBOM (supply-chain provenance).
Expand/contract migrations + canary (zero-downtime within compliance windows).
Jenkins-on-prem option (sovereignty for air-gapped or regulated networks).
Pinned image digests + admission control (reproducibility + tamper resistance).

Regulators care less about raw deploy frequency and more about auditability and blast-radius control. Frame answers around those.
