Docker — Theory
Docker — Theory (interview deep-dive)
What’s actually happening (Linux primitives)
Containers are Linux processes with:
- Namespaces — isolation: PID (process tree), NET (interfaces), MNT (filesystem), IPC, UTS (hostname), USER (uid mapping), CGROUP.
- Cgroups (v2): resource limits for CPU, memory, IO, and PIDs. There is no cgroup v2 controller for network bandwidth; that is shaped with tc/eBPF instead.
- Capabilities — granular root powers (NET_ADMIN, SYS_ADMIN). Containers usually drop most.
- Seccomp — syscall allow-list profile.
- AppArmor / SELinux — MAC profiles.
- Union filesystem (overlay2 default) — layered FS, copy-on-write.
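The docker CLI is just a frontend to the dockerd daemon; the actual runtime stack is containerd (image and container lifecycle) plus runc (the OCI runtime that sets up namespaces and cgroups and starts the process).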
OCI standards
- OCI Image Spec: image format.
- OCI Runtime Spec: what a runtime must do.
- OCI Distribution Spec: registry API.
So images built by Docker run on containerd, CRI-O, Podman, etc. Kubernetes removed dockershim in v1.24, so the kubelet talks to containerd or CRI-O through the CRI rather than to Docker Engine.
Image internals
- Image = manifest listing layer digests + config JSON (env, cmd, ports).
- Each layer = gzip-compressed tar of files added/changed/deleted (deletions are recorded as whiteout entries).
- `docker history img` shows the layer chain.
- Pulling: only missing layers are downloaded. Layers shared across images save disk.
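
To make the layer model concrete, here is a toy Python sketch (not Docker's actual code). It treats each layer as a dict of path to content, with a sentinel standing in for a deletion. The `resolve` and `missing_layers` helpers, the paths, and the digests are all made up for illustration; real layers are tar archives, not dicts.

```python
"""Toy model of an image's layer stack: later layers shadow earlier ones."""

WHITEOUT = object()  # marker meaning "this path was deleted in this layer"


def resolve(layers, path):
    """Return the content visible at `path`, searching the top layer first."""
    for layer in reversed(layers):
        if path in layer:
            content = layer[path]
            return None if content is WHITEOUT else content
    return None


def missing_layers(image_digests, local_digests):
    """Digests a pull would still need to download, in image order."""
    return [d for d in image_digests if d not in local_digests]


base = {"/etc/os-release": "debian", "/tmp/cache": "junk"}
deps = {"/app/node_modules/x": "lib", "/tmp/cache": WHITEOUT}
app = {"/app/server.js": "v2", "/etc/os-release": "debian-patched"}
layers = [base, deps, app]

print(resolve(layers, "/etc/os-release"))  # debian-patched (topmost copy wins)
print(resolve(layers, "/tmp/cache"))       # None (hidden by the deletion marker)
```

Note that `/tmp/cache` is hidden but the bytes still sit in the `base` layer, which is why deleting a file in a later layer never shrinks an image.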
Build cache rules
- A layer is reused if the instruction and the previous layer both match.
- `COPY` looks at file checksums.
- `RUN` is opaque: Docker doesn’t know what changed inside, only the command + previous state.
- Buildx + cache mounts let `RUN` reuse package caches across builds.
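
The reuse rule can be pictured as a hash chain. The sketch below is a simplified mental model, not BuildKit's real cache-key algorithm: `layer_key`, `build_keys`, and `reused_steps` are hypothetical helpers, and strings such as `lock-v1` and `src-v1` stand in for the checksums of the copied files.

```python
import hashlib


def layer_key(parent_key, instruction, input_digest=""):
    """Cache key = hash(parent key + instruction text + digest of copied files)."""
    material = "\n".join([parent_key, instruction, input_digest])
    return hashlib.sha256(material.encode()).hexdigest()


def build_keys(steps):
    """steps: list of (instruction, input_digest). Returns one key per step."""
    keys, parent = [], "scratch"
    for instruction, input_digest in steps:
        parent = layer_key(parent, instruction, input_digest)
        keys.append(parent)
    return keys


def reused_steps(old_steps, new_steps):
    """Count leading steps whose keys match, i.e. layers served from cache."""
    count = 0
    for old, new in zip(build_keys(old_steps), build_keys(new_steps)):
        if old != new:
            break
        count += 1
    return count


before = [
    ("FROM node:20-slim", ""),
    ("COPY package.json package-lock.json ./", "lock-v1"),
    ("RUN npm ci", ""),  # opaque: only the command text is hashed
    ("COPY . .", "src-v1"),
]
after_code_change = before[:3] + [("COPY . .", "src-v2")]
after_lock_change = [before[0], (before[1][0], "lock-v2")] + before[2:]

print(reused_steps(before, after_code_change))  # 3
print(reused_steps(before, after_lock_change))  # 1
```

Because each key includes its parent, changing the lockfile also invalidates the unchanged `RUN npm ci` step that follows it, while a source-only change leaves the install layer cached.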
ENTRYPOINT vs CMD
| | ENTRYPOINT | CMD |
|---|---|---|
| Form | `["bin","arg"]` | `["arg"]` (or `["bin","arg"]`) |
| Override on `docker run` | `--entrypoint` | trailing args |
| Typical use | the executable | default args |
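
The override rules in the table can be sketched as a small function. This is an illustrative model of how the final argv is composed for exec-form values, not Docker's implementation; `container_argv` and its parameter names are invented here.

```python
def container_argv(entrypoint, cmd, run_args=(), entrypoint_override=None):
    """Compose the argv a container starts with (exec-form values only).

    entrypoint / cmd: lists from the image config.
    run_args: trailing arguments after the image name on `docker run`.
    entrypoint_override: value passed to `--entrypoint`, if any.
    """
    if entrypoint_override is not None:
        # --entrypoint replaces ENTRYPOINT and also clears the image's CMD.
        exe, defaults = [entrypoint_override], []
    else:
        exe, defaults = list(entrypoint), list(cmd)
    # Trailing args replace CMD entirely; they are not appended to it.
    args = list(run_args) if run_args else defaults
    return exe + args


image_entrypoint = ["python", "-m", "app"]
image_cmd = ["serve"]

print(container_argv(image_entrypoint, image_cmd))
# ['python', '-m', 'app', 'serve']
print(container_argv(image_entrypoint, image_cmd, run_args=["migrate"]))
# ['python', '-m', 'app', 'migrate']
print(container_argv(image_entrypoint, image_cmd, entrypoint_override="sh"))
# ['sh']
```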
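Pattern:
```dockerfile
# Fixed executable; CMD supplies the default subcommand.
ENTRYPOINT ["python", "-m", "app"]
CMD ["serve"]
# docker run img migrate   → python -m app migrate
```
Init / signal handling
PID 1 in a container has special signal semantics: the kernel applies no default action for signals sent to it, so SIGTERM is ignored unless the process installs a handler. Most images don’t run a proper init → the app (or a shell wrapper that doesn’t forward signals) sits at PID 1 → SIGTERM is never handled → the container is SIGKILLed after the stop timeout instead of shutting down gracefully.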
Solutions:
- `docker run --init` adds tini as PID 1.
- `tini` / `dumb-init` as ENTRYPOINT.
- App-level signal handlers.
In Kubernetes, this matters for graceful shutdown during rolling updates.
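
For the app-level option, a minimal sketch of a SIGTERM handler might look like the following. It assumes a simple polling main loop; `state`, `handle_sigterm`, and `serve` are hypothetical names, and a real service would also stop accepting new connections and finish in-flight requests before returning.

```python
import os
import signal
import time

# Shared flag; the handler only flips it and the main loop does the draining.
state = {"shutdown": False}


def handle_sigterm(signum, frame):
    """Record that a shutdown was requested."""
    state["shutdown"] = True


def serve(do_work, poll_seconds=0.1):
    """Run do_work() until SIGTERM or SIGINT arrives, then return cleanly."""
    # Installing a handler explicitly matters when this process is PID 1.
    signal.signal(signal.SIGTERM, handle_sigterm)
    signal.signal(signal.SIGINT, handle_sigterm)
    while not state["shutdown"]:
        do_work()
        time.sleep(poll_seconds)
    return "drained"
```

Calling `serve(handle_one_request)` from the main thread (where `handle_one_request` is whatever unit of work the app performs) keeps running until the runtime sends SIGTERM, then returns so the process can exit with status 0 before the grace period ends.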
Logging
- App writes to stdout/stderr.
- Docker logging driver collects: json-file (default), syslog, journald, fluentd, awslogs, gelf, splunk.
- In K8s: container stdout → kubelet → log collector (Fluent Bit, Vector) → backend (Loki, ES, CloudWatch).
Don’t log to files inside container — they’re ephemeral.
Storage drivers
overlay2 is the default. Older drivers: aufs (deprecated), devicemapper (deprecated, slow), btrfs.
Layer count is capped at roughly 125 (overlay2 supports up to 128 lower layers). Squash or flatten the image if it gets that deep.
Security model
Containers are not strong security boundaries. Risks:
- Kernel exploit escapes the container.
- Misconfigured `--privileged` / extra caps.
- Mounted Docker socket → trivial host takeover.
- Running as root inside maps to root outside (without user namespaces).
Hardening:
- Non-root user.
- Read-only filesystem (`--read-only` + `tmpfs` for /tmp).
- Drop all capabilities, add only needed.
- Seccomp default profile.
- Minimal base image (distroless / scratch).
- Image signing (cosign).
- Vuln scan (Trivy, Grype) in CI.
For higher isolation: gVisor, Kata Containers, Firecracker.
Production readiness — checklist
- Non-root user.
- Specific image tag (digest in prod).
- HEALTHCHECK or liveness/readiness in K8s.
- Graceful shutdown handler.
- Resource limits set.
- Logs to stdout.
- Config via env / mounted secrets, not baked.
- Image scanned, signed.
- Layers minimized.
Common interview Qs
- Container vs VM. Shared kernel; namespaces + cgroups vs hardware virtualization.
- Where do container processes show up on the host? As regular processes: `ps -ef` on the host sees them, but their PIDs differ from the ones seen inside the container’s PID namespace.
- How does layered FS work? Copy-on-write: reads fall through to lower layers; the first write copies the file up into the top (writable) layer.
- Why is the `latest` tag bad for prod? Non-reproducible; same tag, different content over time.
- Image bloated, how to shrink? Multi-stage, slim base, `.dockerignore`, dedupe deps, remove caches in the same RUN.
- Container hits its memory limit and gets OOM-killed seemingly at random. Set requests/limits in K8s; set `--memory-swap` equal to `--memory` to disable swap (`-1` means unlimited swap); analyze actual usage; for Node, consider `--max-old-space-size`.
- SIGTERM not delivered to app. PID 1 issue; use an init.
- Difference between `RUN apt install -y x && rm -rf /var/lib/apt/lists/*` and cleaning up in a later RUN? Cleaning in the same RUN keeps the cache out of that layer; deleted in a later layer, it stays in the image forever.
- What’s a sidecar pattern? Companion container in the same pod sharing volumes/network: proxy, log shipper, encryption.
- Multi-arch images, how are they built? `docker buildx build --platform linux/amd64,linux/arm64 --push`.
Anti-patterns
- Running a database in a container with a bind mount to NFS without testing.
- One image trying to be many roles via env-driven branching.
- Using `docker exec` as a deployment workflow.
- Privileged containers without justification.
- Mounting `/var/run/docker.sock` into a service container.
- A big monolithic compose file as production config.
Deep dive — multi-stage builds (why & how)
Multi-stage builds let you use multiple FROM statements in a single Dockerfile, each starting a new build stage you can selectively copy artifacts from using COPY --from=<stage>. The point: keep heavy build-time tooling (compilers, dev dependencies, test frameworks, source code) out of the final runtime image.
Docker’s docs: the goal is creating “a tiny production image with nothing but the binary inside.” A second benefit specific to BuildKit (default builder since Docker 23.0): BuildKit only builds stages that the target stage actually depends on, whereas the legacy builder ran every preceding stage regardless.
BuildKit cache & secret mounts extend multi-stage further:
- `RUN --mount=type=cache,target=/root/.npm` persists the npm cache across builds without baking it into a layer, so repeat builds skip most network downloads for unchanged deps.
- `--mount=type=secret,id=npm_token` exposes a secret as a file inside the RUN step only; it never lands in any layer or image history.
Standard Node.js pattern is three stages: deps (install with npm ci), build (run tsc / vite build), runtime (copy only dist/ + production node_modules onto a minimal base).
Gotchas
- `COPY --from=build` copies file ownership too; if `build` ran as root and runtime is non-root, you get permission errors. Use `COPY --from=build --chown=1000:1000`.
- `npm prune --omit=dev` must run after the build (the TS compiler is in devDeps).
- BuildKit cache mounts are not shared across CI runners by default; on GitHub Actions, persist them with `actions/cache` or fall back to layer caching with `--cache-from type=gha`.
- `--mount=type=secret` requires BuildKit (set `DOCKER_BUILDKIT=1` on engines older than 23.0) and Dockerfile syntax >= 1.2.
- Don’t `COPY . .` before installing deps: you’ll bust the npm install layer on every code change.
Q: Walk me through a multi-stage Dockerfile for a Node.js TypeScript service. Why three stages?
Stage 1 installs all deps with npm ci so that layer is rebuilt only when the lockfile changes. Stage 2 builds TypeScript using those deps, then prunes devDependencies. Stage 3 starts from a minimal/distroless base and copies only dist/ + production node_modules. Result: ~150 MB image with no tsc, no npm, no shell. That means a smaller pull, a smaller attack surface, and build tooling that is provably absent in production.
Q: How do you handle a private npm registry token without leaking it?
Never use ARG or ENV for secrets; both end up in image history. Use BuildKit’s --mount=type=secret,id=npm_token so the token is mounted as a tmpfs file inside the RUN step only. In CI, pass --secret id=npm_token,env=NPM_TOKEN. Verify with docker history that the token isn’t in any layer.
Sources: docs.docker.com/build/building/multi-stage, docs.docker.com/build/building/secrets, docs.docker.com/build/cache/optimize.
Deep dive — distroless + minimal images
Google’s distroless images (gcr.io/distroless/nodejs20-debian12, gcr.io/distroless/static-debian13, etc.) contain only your application and its language runtime: no shell, no package manager, no busybox, no apt.
Sizes: static-debian13 ~2 MiB; Alpine ~5 MiB; Debian slim ~70+ MiB.
The security thesis: every binary in the image is a potential CVE; removing them improves “the signal-to-noise of CVE scanners” and shrinks the attack surface (an attacker who pops a shell into a distroless container can’t wget, curl, or even sh -c). Used by Kubernetes itself (since v1.15), Knative, Tekton.
Trade-offs
Alpine uses musl libc instead of glibc, which breaks native modules compiled for glibc (bcrypt, sharp, node-canvas are common pain points; they either need an Alpine-specific build or won’t load). Distroless uses glibc, so native modules work, but you can’t docker exec a shell into a running container for debugging. Use the :debug tag (which adds busybox) only in non-prod, or use ephemeral debug containers (kubectl debug).
Chainguard Images (built on the Wolfi distro) are a newer alternative: glibc-based, signed with cosign, ship with SBOM by default, target near-zero CVEs at release time. De facto choice when you want distroless + supply-chain provenance out of the box.
Gotchas
- Distroless has no shell: `CMD "node server.js"` (string form) fails because Docker wraps it in `/bin/sh -c`; use exec form `CMD ["node", "server.js"]`, or `["server.js"]` for the nodejs variant, whose entrypoint is already node.
- No `ls`, no `cat`, no `sh` for debugging; use `kubectl debug -it pod/x --image=busybox --target=app`.
- Alpine + Node native modules (`bcrypt`, `sharp`) frequently fail at runtime; pre-build with the `node:20-alpine` builder or switch base.
- Sizes: `node:20` (full Debian) ≈ 1 GB; `node:20-slim` ≈ 240 MB; `node:20-alpine` ≈ 150 MB; `gcr.io/distroless/nodejs20` ≈ 150 MB. Slim + distroless wins on glibc compatibility.
- Distroless `:nonroot` tags run as UID 65532 (user `nonroot`); the default tags run as root. Align your file ownership.
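
Why the string form breaks can be shown with a small model of how a CMD value is interpreted. This is a sketch under simplifying assumptions (no ENTRYPOINT, and an image described only by the set of binaries it contains); `parse_cmd`, `can_start`, and the binary paths are hypothetical.

```python
import json


def parse_cmd(instruction_args):
    """Mimic how a Dockerfile CMD value is interpreted.

    A valid JSON array of strings is exec form and runs as-is. Anything
    else is shell form and gets wrapped in `/bin/sh -c`.
    """
    try:
        parsed = json.loads(instruction_args)
        if isinstance(parsed, list) and all(isinstance(a, str) for a in parsed):
            return parsed
    except json.JSONDecodeError:
        pass
    return ["/bin/sh", "-c", instruction_args]


def can_start(argv, binaries_in_image):
    """A container can only start if argv[0] exists in the image."""
    return argv[0] in binaries_in_image


# Illustrative contents: a shell-less image vs. one that ships /bin/sh.
shell_less_image = {"/usr/bin/node"}
slim_image = {"/usr/bin/node", "/bin/sh"}

exec_form = parse_cmd('["/usr/bin/node", "server.js"]')
shell_form = parse_cmd("node server.js")

print(shell_form)                               # ['/bin/sh', '-c', 'node server.js']
print(can_start(exec_form, shell_less_image))   # True
print(can_start(shell_form, shell_less_image))  # False
print(can_start(shell_form, slim_image))        # True
```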
Sources: github.com/GoogleContainerTools/distroless, edu.chainguard.dev/chainguard/chainguard-images/overview.
Deep dive — non-root + security hardening
Containers default to running as UID 0 (root) inside the namespace. Even with user namespaces, a root-in-container exploit chain is still strictly more dangerous than non-root.
Dockerfile layer
Set `USER 1000` (or a named user created via `RUN useradd`).
Runtime layer
- `--read-only` mounts the rootfs read-only (forces writes to explicit `tmpfs` mounts).
- `--cap-drop=ALL --cap-add=NET_BIND_SERVICE` drops every Linux capability and re-adds only what you need.
- `--security-opt=no-new-privileges` sets the kernel `no_new_privs` flag so a setuid binary can’t escalate.
- `--security-opt seccomp=profile.json` restricts which syscalls the kernel allows.
Kubernetes equivalents
These live under securityContext at pod and container level. The production-grade baseline matches the Pod Security Standards “restricted” profile.
Gotchas
- `readOnlyRootFilesystem: true` breaks any app writing to `/tmp`, `/var/log`, or framework caches; mount `emptyDir` volumes for those paths.
- Binding to ports < 1024 needs `CAP_NET_BIND_SERVICE` (or just listen on 8080 and let the Service map 80 → 8080).
- `runAsNonRoot: true` is enforced by the kubelet, which only checks the image’s USER metadata or `runAsUser`; an image with `USER 0` will fail to start (good), and so will one with a non-numeric USER unless `runAsUser` is set, because the kubelet can’t verify a user name.
- `allowPrivilegeEscalation` is forced to `true` if the container is privileged or has `CAP_SYS_ADMIN`.
- npm’s cache dir is `~/.npm`, i.e. `/root/.npm` when HOME is /root; if you switch to UID 1000, set `npm_config_cache=/app/.npm` (or a writable HOME), otherwise the cache directory is unwritable and installs fail or lose caching.
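
Several of these gotchas only surface at runtime, so one option is a small startup self-check. The sketch below is an assumption about how such a guard could look, not a standard tool; `runtime_problems` is a hypothetical helper and the directories you pass in depend on your app.

```python
import os
import tempfile


def runtime_problems(uid, writable_dirs):
    """Return human-readable problems with the current runtime setup."""
    problems = []
    if uid == 0:
        problems.append("running as root (UID 0)")
    for directory in writable_dirs:
        if not os.path.isdir(directory):
            problems.append(f"{directory} does not exist")
        elif not os.access(directory, os.W_OK):
            problems.append(f"{directory} is not writable")
    return problems


# With a read-only rootfs, only mounted volumes (emptyDir/tmpfs) are writable,
# so list every path the app expects to write to.
print(runtime_problems(os.getuid(), [tempfile.gettempdir()]))
```

Failing fast at startup with a clear message is usually easier to diagnose than a permission error deep inside a framework cache.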
Sources: kubernetes.io/docs/tasks/configure-pod-container/security-context, docs.docker.com/engine/security.
Deep dive — build optimization
Layer caching is order-sensitive. Docker hashes each instruction + its inputs; the first instruction whose hash differs from the cache invalidates that layer and every subsequent one. Order Dockerfile commands from least to most frequently changed: base image → OS packages → language deps (COPY package*.json, then npm ci) → application source (COPY . .) → build command. The classic mistake is COPY . . before npm ci: every code change then reruns the full dependency install.
.dockerignore is not optional: without it the whole directory (.git, node_modules, .env, coverage/, test fixtures, and the Dockerfile itself) is sent as build context, and COPY . . copies it into the image (slow context upload, cache busts, possible secret leaks).
BuildKit cache mounts persist npm/pip/apt caches outside the layer — the cache survives even when the install layer is invalidated.
Multi-arch builds via docker buildx build --platform linux/amd64,linux/arm64 produce a single manifest list that points to per-arch images; clients pull the right one. Cross-compile when possible (Go, Rust trivially); QEMU emulation works for any language but is 5–20× slower for compute-heavy steps.
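
How a client picks "the right one" from a manifest list can be sketched as a lookup by OS and architecture. The entries below are a trimmed, hypothetical manifest list (placeholder digests, only the fields needed here), and `select_image` and `normalize_arch` are illustrative helpers rather than any registry client's API.

```python
import platform

# Hypothetical manifest list: one entry per platform, each pointing at an image.
MANIFEST_LIST = [
    {"digest": "sha256:aaa", "platform": {"os": "linux", "architecture": "amd64"}},
    {"digest": "sha256:bbb", "platform": {"os": "linux", "architecture": "arm64"}},
]

# Map uname-style machine names to the architecture names used in manifests.
ARCH_ALIASES = {"x86_64": "amd64", "aarch64": "arm64"}


def normalize_arch(machine):
    """Translate e.g. 'x86_64' to 'amd64'; pass other names through lowercased."""
    name = machine.lower()
    return ARCH_ALIASES.get(name, name)


def select_image(manifests, os_name, machine):
    """Return the digest matching the client's platform, or None."""
    arch = normalize_arch(machine)
    for entry in manifests:
        target = entry["platform"]
        if target["os"] == os_name and target["architecture"] == arch:
            return entry["digest"]
    return None


print(select_image(MANIFEST_LIST, "linux", "x86_64"))   # sha256:aaa
print(select_image(MANIFEST_LIST, "linux", "aarch64"))  # sha256:bbb
print(select_image(MANIFEST_LIST, "linux", platform.machine()))
```

A `None` result corresponds to the case where an image was built for only one architecture and the deploying host has no matching entry.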
Gotchas
- `COPY package.json package-lock.json ./` (two files): if you forget the lockfile, `npm ci` fails.
- BuildKit cache mounts are per-builder-instance; on ephemeral CI runners they start empty, so rely on the exported layer cache instead (`--cache-from` / `--cache-to` with `type=gha` or `type=registry`).
- Multi-arch via QEMU compiling C/C++ is brutally slow; use native ARM runners or cross-compile.
- Apple Silicon developers building images locally produce arm64-only images that fail in production on x86 (exec format error); always build and test for the platform you’ll deploy to, e.g. with `--platform linux/amd64`.
Sources: docs.docker.com/build/cache/optimize, docs.docker.com/build/building/multi-platform, docs.docker.com/develop/develop-images/dockerfile_best-practices.
Deep dive — production checklist
Every container service heading to production should pass this gate:
| Item | What & why |
|---|---|
| Liveness probe | Detect deadlocks → kubelet restarts pod. HTTP /healthz returning 200 once event loop is responsive. Don’t check downstream deps here (one DB blip → restart loop). |
| Readiness probe | Detect “not ready for traffic” (warmup, full queue) → removed from Service endpoints. Check downstream deps here. |
| Startup probe | For slow-start apps (JVM, large model loads) → suppresses liveness until startup completes. |
| Graceful shutdown | Trap SIGTERM, mark readiness false, stop accepting new conns, finish in-flight, exit. K8s sends SIGTERM, waits terminationGracePeriodSeconds (default 30), then SIGKILL. A small preStop delay can help load balancers drain, but distroless images have no shell, so use an HTTP hook, app-level drain endpoint, or native lifecycle sleep where supported. |
| Resource requests + limits | requests reserves capacity for the scheduler; limits cap usage. CPU limit throttles (rarely kills); memory limit OOMKills. Set requests = p95 actual usage, limits = 1.5–2× requests. |
| stdout/stderr logging | Don’t write log files inside the container. Log to stdout; the container runtime captures it. |
| Structured JSON logs | One JSON object per line with level, ts, traceId, msg, service fields → ingested by Loki/Datadog/CloudWatch with zero parsing. |
| No secrets in image | Use BuildKit --mount=type=secret at build time and K8s Secrets / external secret managers (AWS Secrets Manager via CSI driver, Vault) at runtime. |
| Image scan + SBOM per build | Trivy gate on HIGH/CRITICAL fixed CVEs; SBOM stored as artifact keyed on image digest. |
| Signed image | cosign sign keyless via OIDC; deploy admission controller (Kyverno, Connaisseur) verifies signatures before pods start. |
| Pinned base image | node:20.11.1-bookworm-slim@sha256:abc… — tag + digest. Reproducible builds, no surprise updates. |
| Non-root, read-only rootfs, drop ALL caps | See non-root + security section above. |
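
As one way to satisfy the stdout and structured-logging rows, here is a minimal sketch using only Python's standard `logging` module. The field names follow the table (`level`, `ts`, `traceId`, `msg`, `service`); `JsonFormatter`, `make_logger`, and the `orders-api` service name are hypothetical.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "ts": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "traceId": getattr(record, "traceId", None),
            "msg": record.getMessage(),
            "service": self.service,
        }
        return json.dumps(entry)


def make_logger(service, stream=sys.stdout):
    """Logger that writes JSON lines to stdout for the runtime to capture."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log = make_logger("orders-api")  # hypothetical service name
log.info("order created", extra={"traceId": "abc123"})
```

The `extra` dict is how the standard library attaches per-call fields to a record; in a real service the trace ID would typically come from request context rather than being passed by hand.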
Sources: kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes, kubernetes.io/docs/concepts/workloads/controllers/deployment.