Docker — Theory (interview deep-dive)

What’s actually happening (Linux primitives)

Containers are Linux processes with:

Namespaces — isolation: PID (process tree), NET (interfaces), MNT (filesystem), IPC, UTS (hostname), USER (uid mapping), CGROUP.
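Cgroups (v2): resource limits on CPU, memory, IO, and PIDs. There is no network-bandwidth controller in cgroup v2; traffic shaping is done separately (for example with tc).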
Capabilities — granular root powers (NET_ADMIN, SYS_ADMIN). Containers usually drop most.
Seccomp — syscall allow-list profile.
AppArmor / SELinux — MAC profiles.
Union filesystem (overlay2 default) — layered FS, copy-on-write.

docker is just a frontend; container runtime is containerd + runc (OCI runtime).

OCI standards

OCI Image Spec — image format.
OCI Runtime Spec — what a runtime must do.
OCI Distribution Spec — registry API.

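Because of these specs, images built by Docker run unchanged on containerd, CRI-O, Podman, and other OCI-compliant tools. Kubernetes removed dockershim in 1.24, so clusters talk to containerd or CRI-O directly rather than to the Docker daemon.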

Image internals

Image = list of layer digests + config JSON (env, cmd, ports).
Each layer = tar gzip of files added/changed/deleted.
docker history img shows layer chain.
Pulling: only missing layers. Sharing across images saves disk.
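The layer model can be illustrated with a toy sketch. This is not how overlay2 is implemented; it only assumes that each layer records added or changed files plus deletions, and that a container gets one writable layer on top. All names here (merged_view, write_file, the sample paths) are hypothetical.

```python
# Toy model of a layered, copy-on-write filesystem (illustration only).
# Each layer maps path -> content; None marks a deletion ("whiteout").

WHITEOUT = None

def merged_view(layers):
    """Return the files visible in a container, applying layers bottom to top."""
    view = {}
    for layer in layers:
        for path, content in layer.items():
            if content is WHITEOUT:
                view.pop(path, None)   # a deletion hides the file from lower layers
            else:
                view[path] = content   # an add or change shadows lower layers
    return view

def write_file(image_layers, container_layer, path, content):
    """Copy-on-write: image layers are never mutated; writes land in the top layer."""
    container_layer[path] = content
    return merged_view(image_layers + [container_layer])

base = {"/etc/os-release": "debian", "/tmp/cache": "junk"}
app = {"/app/server.py": "v1", "/tmp/cache": WHITEOUT}
image = [base, app]
container = {}  # writable layer created per container

view = write_file(image, container, "/app/server.py", "patched")
print(view)
```

Note that /tmp/cache is hidden from the merged view but still present in the base layer, which is why deleting a file in a later layer does not shrink the image.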

Build cache rules

A layer is reused if instruction + previous layer match.
COPY looks at file checksums.
RUN is opaque — Docker doesn’t know what changed inside, only the command + previous state.
Buildx + cache mounts let RUN reuse package caches across builds.
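The cache rules above can be modelled as a hash chain. The sketch below is a simplification under stated assumptions: the real builder's key derivation is more involved, and layer_key, build, and the sample digests are hypothetical names chosen for illustration.

```python
import hashlib

def layer_key(parent_key, instruction, input_digest=""):
    """Cache key for one step: depends on the parent layer, the instruction text,
    and (for COPY) a checksum of the files being copied."""
    h = hashlib.sha256()
    for part in (parent_key, instruction, input_digest):
        h.update(part.encode())
        h.update(b"\0")
    return h.hexdigest()

def build(steps, cache):
    """Return a list of (instruction, 'HIT' | 'MISS') and record new keys in cache."""
    parent, report = "", []
    for instruction, input_digest in steps:
        key = layer_key(parent, instruction, input_digest)
        report.append((instruction, "HIT" if key in cache else "MISS"))
        cache.add(key)
        parent = key
    return report

def digest(content):
    return hashlib.sha256(content.encode()).hexdigest()

cache = set()
steps_v1 = [
    ("FROM node:20-slim", ""),
    ("COPY package*.json ./", digest("lockfile-v1")),
    ("RUN npm ci", ""),               # opaque: keyed only by command + parent
    ("COPY . .", digest("source-v1")),
]
build(steps_v1, cache)

# Only the application source changes; the lockfile is untouched.
steps_v2 = steps_v1[:3] + [("COPY . .", digest("source-v2"))]
second = build(steps_v2, cache)
for instruction, status in second:
    print(status, instruction)
```

Because each key includes its parent's key, a change at any step forces a miss for every step after it, which is why rarely changing instructions belong near the top of a Dockerfile.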

ENTRYPOINT vs CMD

	ENTRYPOINT	CMD
Form	`["bin","arg"]`	`["arg"]` (or `["bin","arg"]`)
Override on `docker run`	`--entrypoint`	trailing args
Typical use	the executable	default args

Pattern:

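ENTRYPOINT ["python", "-m", "app"]
# Default subcommand. Dockerfile comments must start the line; a trailing comment would become part of CMD.
CMD ["serve"]
# docker run img          → python -m app serve
# docker run img migrate  → python -m app migrate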
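How the final argv is assembled can be sketched as a small function. This models exec-form ENTRYPOINT and CMD only (shell form wraps the command in /bin/sh -c and is not covered); effective_argv is a hypothetical helper, not part of any Docker API.

```python
def effective_argv(image_entrypoint, image_cmd, run_args=(), entrypoint_override=None):
    """Model how the process argv is assembled for exec-form ENTRYPOINT and CMD."""
    if entrypoint_override is not None:
        # --entrypoint replaces ENTRYPOINT and also discards the image's default CMD.
        entrypoint, default_cmd = [entrypoint_override], []
    else:
        entrypoint, default_cmd = list(image_entrypoint), list(image_cmd)
    # Trailing args on `docker run` replace CMD; otherwise CMD supplies the defaults.
    args = list(run_args) if run_args else default_cmd
    return entrypoint + args

ENTRYPOINT = ["python", "-m", "app"]
CMD = ["serve"]

default_run = effective_argv(ENTRYPOINT, CMD)
migrate_run = effective_argv(ENTRYPOINT, CMD, run_args=["migrate"])
shell_run = effective_argv(ENTRYPOINT, CMD, entrypoint_override="sh")

print(default_run)
print(migrate_run)
print(shell_run)
```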

Init / signal handling

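PID 1 in a container has special signal semantics: the kernel does not apply default signal actions to it, so a SIGTERM that PID 1 has no handler for is simply ignored. Many images run the app (or a shell wrapping it) as PID 1 with no init, so SIGTERM is never handled or forwarded, the container does not shut down gracefully, and docker stop falls back to SIGKILL after its timeout (10 seconds by default).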

Solutions:

docker run --init adds tini.
tini / dumb-init as ENTRYPOINT.
App-level signal handlers.

In Kubernetes, this matters for graceful shutdown during rolling updates.
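For the app-level option, a minimal Python sketch might look like the following. It assumes a simple polling worker loop; GracefulShutdown and run_until_stopped are hypothetical names, and a real server would stop accepting connections and drain in-flight requests inside cleanup.

```python
import signal
import time

class GracefulShutdown:
    """Records that SIGTERM or SIGINT arrived so the main loop can exit cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Installing a handler explicitly matters when the app runs as PID 1,
        # where an unhandled SIGTERM would otherwise be ignored.
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: only set a flag, do the real work in the loop.
        self.requested = True

def run_until_stopped(shutdown, handle_one, cleanup, poll_interval=0.5):
    """Call handle_one() repeatedly until a stop signal arrives, then clean up."""
    shutdown.install()
    handled = 0
    while not shutdown.requested:
        handle_one()
        handled += 1
        time.sleep(poll_interval)
    cleanup()  # e.g. finish in-flight work, close connections
    return handled
```

Typical usage would be run_until_stopped(GracefulShutdown(), process_next_job, close_connections), where the last two callables are supplied by the application.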

Logging

App writes to stdout/stderr.
Docker logging driver collects: json-file (default), syslog, journald, fluentd, awslogs, gelf, splunk.
In K8s: container stdout → kubelet → log collector (Fluent Bit, Vector) → backend (Loki, ES, CloudWatch).

Don’t log to files inside container — they’re ephemeral.
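A minimal sketch of stdout logging in Python, assuming one JSON object per line is the desired format. The field names (ts, level, service, msg, traceId) follow the ones suggested in this document's production checklist; JsonLineFormatter, make_logger, and the service name are hypothetical.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object on a single line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "msg": record.getMessage(),
        }
        trace_id = getattr(record, "traceId", None)  # supplied via logging's `extra`
        if trace_id is not None:
            entry["traceId"] = trace_id
        return json.dumps(entry)

def make_logger(service, stream=sys.stdout):
    """Build a logger that writes JSON lines to the given stream (stdout by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate lines if called twice
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter(service))
    logger.addHandler(handler)
    return logger

log = make_logger("orders-api")  # hypothetical service name
log.info("order accepted", extra={"traceId": "abc123"})
```

Writing to stdout leaves collection, rotation, and shipping to the logging driver or the cluster's log collector.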

Storage drivers

overlay2 is default. Old: aufs (deprecated), devicemapper (slow), btrfs.

Layer count limit ~125. Squash periodically if too deep.

Security model

Containers are not strong security boundaries. Risks:

Kernel exploit escapes the container.
Misconfigured --privileged / extra caps.
Mounted Docker socket → trivial host takeover.
Running as root inside maps to root outside (without user namespaces).

Hardening:

Non-root user.
Read-only filesystem (--read-only + tmpfs for /tmp).
Drop all capabilities, add only needed.
Seccomp default profile.
Minimal base image (distroless / scratch).
Image signing (cosign).
Vuln scan (Trivy, Grype) in CI.

For higher isolation: gVisor, Kata Containers, Firecracker.

Production readiness — checklist

Non-root user.
Specific image tag (digest in prod).
HEALTHCHECK or liveness/readiness in K8s.
Graceful shutdown handler.
Resource limits set.
Logs to stdout.
Config via env / mounted secrets, not baked.
Image scanned, signed.
Layers minimized.

Common interview Qs

Container vs VM. Shared kernel; namespaces+cgroups vs hardware virt.
Where do container processes show up on the host? As regular processes — ps -ef sees them, but PIDs differ inside.
How does layered FS work? Copy-on-write — read from lower layers; modify writes to top layer.
Why is latest tag bad for prod? Non-reproducible; same tag, different content over time.
Image bloated — how to shrink? Multi-stage, slim base, .dockerignore, dedupe deps, remove caches in same RUN.
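Container memory limit OOMs randomly. Set requests/limits in K8s; to disable swap for a container, set --memory-swap to the same value as --memory (-1 means unlimited swap, not none); analyze actual usage; for Node, cap the heap with --max-old-space-size so it stays below the container limit.
SIGTERM not delivered to app. PID 1 issue: run an init (tini, docker run --init) or handle the signal in the app.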
Difference between RUN apt install -y x && rm -rf /var/lib/apt/lists/* and not? Cleaning in the same RUN avoids leaving the cache in that layer (it’s there forever otherwise).
What’s a sidecar pattern? Companion container in the same pod sharing volumes/network — proxy, log shipper, encryption.
Multi-arch images — how built? docker buildx build --platform linux/amd64,linux/arm64 --push.

Anti-patterns

Running database in container with bind mount to NFS without testing.
One image trying to be many roles via env-driven branching.
Using docker exec as deployment workflow.
Privileged containers without justification.
Mounting /var/run/docker.sock to a service container.
Big monolithic compose file as production config.

Deep dive — multi-stage builds (why & how)

Multi-stage builds let you use multiple FROM statements in a single Dockerfile, each starting a new build stage you can selectively copy artifacts from using COPY --from=<stage>. The point: keep heavy build-time tooling (compilers, dev dependencies, test frameworks, source code) out of the final runtime image.

Docker’s docs: the goal is creating “a tiny production image with nothing but the binary inside.” A second benefit specific to BuildKit (default builder since Docker 23.0): BuildKit only builds stages that the target stage actually depends on, whereas the legacy builder ran every preceding stage regardless.

BuildKit cache & secret mounts extend multi-stage further:

RUN --mount=type=cache,target=/root/.npm persists the npm cache across builds without baking it into a layer — repeat builds skip the network entirely for unchanged deps.
--mount=type=secret,id=npm_token exposes a secret as a file inside the RUN step only; it never lands in any layer or image history.

Standard Node.js pattern is three stages: deps (install with npm ci), build (run tsc / vite build), runtime (copy only dist/ + production node_modules onto a minimal base).

Gotchas

COPY --from=build copies file ownership too; if build ran as root and runtime is non-root, you get permission errors. Use COPY --from=build --chown=1000:1000 (or whichever UID the runtime stage runs as).
npm prune --omit=dev must run after the build (TS compiler is in devDeps).
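BuildKit cache mounts live in the builder's local state and are not included in exported build cache, so they start empty on every fresh CI runner. On GitHub Actions, --cache-from/--cache-to type=gha restores layer cache only; persisting cache-mount contents needs a separate step that saves and restores them (for example via actions/cache).
--mount=type=secret requires BuildKit (the default since Docker 23.0; set DOCKER_BUILDKIT=1 on older engines) and Dockerfile syntax >= 1.2.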
Don’t COPY . . before installing deps — you’ll bust the npm install layer on every code change.

Q: Walk me through a multi-stage Dockerfile for a Node.js TypeScript service. Why three stages?

Stage 1 installs all deps with npm ci, so that layer is rebuilt only when the lockfile changes. Stage 2 builds TypeScript using those deps, then prunes devDependencies. Stage 3 starts from a minimal/distroless base and copies only dist/ + production node_modules. Result: a ~150 MB image with no tsc, no npm, and no shell, which means a smaller pull, a smaller attack surface, and build tooling that is provably absent in production.

Q: How do you handle a private npm registry token without leaking it?

Never use ARG or ENV for secrets; both end up in image history. Use BuildKit’s --mount=type=secret,id=npm_token so the token is mounted as a tmpfs file inside the RUN step only. In CI, pass --secret id=npm_token,env=NPM_TOKEN. Verify with docker history that the token isn’t in any layer.

Sources: docs.docker.com/build/building/multi-stage, docs.docker.com/build/building/secrets, docs.docker.com/build/cache/optimize.

Deep dive — distroless + minimal images

Google’s distroless images (gcr.io/distroless/nodejs20-debian12, gcr.io/distroless/static-debian13, etc.) contain only your application and its language runtime — no shell, no package manager, no busybox, no apt.

Sizes: static-debian13 ~2 MiB; Alpine ~5 MiB; Debian slim ~70+ MiB.

The security thesis: every binary in the image is a potential CVE; removing them improves “the signal-to-noise of CVE scanners” and shrinks the attack surface (an attacker who pops a shell into a distroless container can’t wget, curl, or even sh -c). Used by Kubernetes itself (since v1.15), Knative, Tekton.

Trade-offs

Alpine uses musl libc instead of glibc, which breaks native modules compiled for glibc (bcrypt, sharp, node-canvas are common pain points — they either need an alpine-specific build or won’t load). Distroless uses glibc, so native modules work, but you can’t docker exec a shell into a running container for debugging — use the :debug tag (which adds busybox) only in non-prod or use ephemeral debug containers (kubectl debug).

Chainguard Images (built on the Wolfi distro) are a newer alternative: glibc-based, signed with cosign, ship with SBOM by default, target near-zero CVEs at release time. De facto choice when you want distroless + supply-chain provenance out of the box.

Gotchas

Distroless has no shell, so shell-form CMD node server.js fails (it needs /bin/sh); use exec form CMD ["node", "server.js"]. The distroless nodejs images already set node as the ENTRYPOINT, so there use CMD ["server.js"].
No ls, no cat, no sh for debugging — use kubectl debug --image=busybox --target=app pod/x.
Alpine + Node native modules (bcrypt, sharp) frequently fail at runtime; pre-build with the node:20-alpine builder or switch base.
Sizes: node:20 (full Debian) ≈ 1 GB; node:20-slim ≈ 240 MB; node:20-alpine ≈ 150 MB; gcr.io/distroless/nodejs20 ≈ 150 MB. Slim+distroless wins on glibc compatibility.
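Distroless images include a nonroot user (UID 65532), but only the :nonroot tags run as it by default; the plain tags run as root. Pick the :nonroot tag or set USER explicitly, and align your file ownership with 65532.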

Sources: github.com/GoogleContainerTools/distroless, edu.chainguard.dev/chainguard/chainguard-images/overview.

Deep dive — non-root + security hardening

Containers default to running as UID 0 (root) inside the namespace. Even with user namespaces, a root-in-container exploit chain is still strictly more dangerous than non-root.

Dockerfile layer

USER 1000 (or a named user created via RUN useradd).

Runtime layer

--read-only mounts the rootfs read-only (force writes to explicit tmpfs mounts).
--cap-drop=ALL --cap-add=NET_BIND_SERVICE drops every Linux capability and re-adds only what you need.
--security-opt=no-new-privileges sets the kernel no_new_privs flag so a setuid binary can’t escalate.
--security-opt seccomp=profile.json restricts which syscalls the kernel allows.
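The runtime flags above can be collected into one command line. The sketch below only assembles the argument list and does not invoke Docker; hardened_run_args and the image name are hypothetical, and the seccomp flag is left out on the assumption that the default profile is acceptable.

```python
def hardened_run_args(image, uid=1000, add_caps=(), tmpfs_paths=("/tmp",)):
    """Assemble `docker run` arguments for a locked-down container."""
    args = ["docker", "run"]
    args += ["--user", f"{uid}:{uid}"]        # non-root UID:GID
    args += ["--read-only"]                   # read-only root filesystem
    for path in tmpfs_paths:
        args += ["--tmpfs", path]             # explicit writable scratch space
    args += ["--cap-drop=ALL"]                # start from zero capabilities
    for cap in add_caps:
        args += [f"--cap-add={cap}"]          # re-add only what is needed
    args += ["--security-opt=no-new-privileges"]
    args.append(image)
    return args

cmd = hardened_run_args("registry.example.com/app:1.0", add_caps=["NET_BIND_SERVICE"])
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting issues.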

Kubernetes equivalents

Live under securityContext at pod and container level. The production-grade baseline matches the Pod Security Standards “restricted” profile.

Gotchas

readOnlyRootFilesystem: true breaks any app writing to /tmp, /var/log, or framework caches — mount emptyDir volumes for those paths.
Binding to ports < 1024 needs CAP_NET_BIND_SERVICE (or just listen on 8080 and let the Service map 80 → 8080).
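runAsNonRoot: true is enforced by the kubelet, which checks only runAsUser or the image's USER metadata. An image whose user resolves to UID 0 fails to start (good), and so does one with a non-numeric USER and no runAsUser, because the kubelet cannot verify that a username is non-root.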
allowPrivilegeEscalation is forced to true if container is privileged or has CAP_SYS_ADMIN.
npm’s ~/.npm cache dir defaults to /root/.npm; if you switch to UID 1000, set npm_config_cache=/app/.npm or it silently degrades.

Sources: kubernetes.io/docs/tasks/configure-pod-container/security-context, docs.docker.com/engine/security.

Deep dive — build optimization

Layer caching is order-sensitive. Docker hashes each instruction + its inputs; the first instruction whose hash differs from the cache invalidates every subsequent layer. Order Dockerfile commands from least to most frequently changed: base image → OS packages → language deps (COPY package*.json ./ then RUN npm ci) → application source (COPY . .) → build command. The classic mistake is COPY . . before npm ci: every code change then invalidates the install layer and re-runs npm ci.

.dockerignore is not optional: without it, the whole directory (.git, node_modules, .env, coverage/, test fixtures, the Dockerfile itself) is sent as build context, and COPY . . then copies it into the image. The result is slow context transfer, needless cache busts, and possible secret leaks.

BuildKit cache mounts persist npm/pip/apt caches outside the layer — the cache survives even when the install layer is invalidated.

Multi-arch builds via docker buildx build --platform linux/amd64,linux/arm64 produce a single manifest list that points to per-arch images; clients pull the right one. Cross-compile when possible (Go, Rust trivially); QEMU emulation works for any language but is 5–20× slower for compute-heavy steps.

Gotchas

COPY package.json package-lock.json ./ (two files) — if you forget the lockfile, npm ci fails.
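BuildKit cache mounts are per-builder-instance and are not exported with the build cache; on ephemeral CI runners, --cache-from/--cache-to type=gha or type=registry brings back layer cache, but cache mounts still start empty.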
Multi-arch via QEMU compiling C/C++ is brutally slow — use native ARM runners or cross-compile.
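Apple Silicon developers building images locally produce arm64-only images that fail in production on x86 (typically with an exec format error). Build with --platform linux/amd64 or publish a multi-arch manifest, and always test on the platform you'll deploy.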

Sources: docs.docker.com/build/cache/optimize, docs.docker.com/build/building/multi-platform, docs.docker.com/develop/develop-images/dockerfile_best-practices.

Deep dive — production checklist

Every container service heading to production should pass this gate:

Item	What & why
Liveness probe	Detect deadlocks → kubelet restarts pod. HTTP `/healthz` returning 200 once event loop is responsive. Don’t check downstream deps here (one DB blip → restart loop).
Readiness probe	Detect “not ready for traffic” (warmup, full queue) → removed from Service endpoints. Check downstream deps here.
Startup probe	For slow-start apps (JVM, large model loads) → suppresses liveness until startup completes.
Graceful shutdown	Trap `SIGTERM`, mark readiness false, stop accepting new conns, finish in-flight, exit. K8s sends SIGTERM, waits `terminationGracePeriodSeconds` (default 30), then SIGKILL. A small `preStop` delay can help load balancers drain, but distroless images have no shell, so use an HTTP hook, app-level drain endpoint, or native lifecycle sleep where supported.
Resource requests + limits	`requests` reserves capacity for the scheduler; `limits` cap usage. CPU limit throttles (rarely kills); memory limit OOMKills. Set `requests = p95 actual usage`, `limits = 1.5–2× requests`.
stdout/stderr logging	Don’t write log files inside the container. Log to stdout; the container runtime captures it.
Structured JSON logs	One JSON object per line with `level`, `ts`, `traceId`, `msg`, `service` fields → ingested by Loki/Datadog/CloudWatch with zero parsing.
No secrets in image	Use BuildKit `--mount=type=secret` at build time and K8s Secrets / external secret managers (AWS Secrets Manager via CSI driver, Vault) at runtime.
Image scan + SBOM per build	Trivy gate on HIGH/CRITICAL fixed CVEs; SBOM stored as artifact keyed on image digest.
Signed image	`cosign sign` keyless via OIDC; deploy admission controller (Kyverno, Connaisseur) verifies signatures before pods start.
Pinned base image	`node:20.11.1-bookworm-slim@sha256:abc…` — tag + digest. Reproducible builds, no surprise updates.
Non-root, read-only rootfs, drop ALL caps	See non-root + security section above.
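The requests/limits rule of thumb in the table can be worked through numerically. This sketch assumes nearest-rank percentiles and memory samples in MiB; percentile, suggest_resources, and the sample data are hypothetical, and the output only mirrors the shape of a Kubernetes resources block.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def suggest_resources(memory_mib_samples, limit_factor=1.5):
    """Apply the rule of thumb: requests = p95 of observed usage, limits = factor x requests."""
    if not 1.5 <= limit_factor <= 2.0:
        raise ValueError("limit_factor is expected to be in the 1.5-2.0 range")
    request = percentile(memory_mib_samples, 95)
    limit = math.ceil(request * limit_factor)
    return {"requests": {"memory": f"{request}Mi"}, "limits": {"memory": f"{limit}Mi"}}

samples = list(range(101, 201))   # hypothetical memory readings in MiB
suggestion = suggest_resources(samples)
print(suggestion)
```

With these samples the p95 is 195 MiB, so the suggested limit at a factor of 1.5 is 293 MiB.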

Sources: kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes, kubernetes.io/docs/concepts/workloads/controllers/deployment.