Docker — Theory

What’s actually happening (Linux primitives)

Containers are Linux processes with:

  • Namespaces — isolation: PID (process tree), NET (interfaces), MNT (filesystem), IPC, UTS (hostname), USER (uid mapping), CGROUP.
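  • Cgroups (v2): resource limits for CPU, memory, IO, and PIDs. There is no cgroup v2 network controller; network shaping is done with tc or eBPF programs attached to the cgroup.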
  • Capabilities — granular root powers (NET_ADMIN, SYS_ADMIN). Containers usually drop most.
  • Seccomp — syscall allow-list profile.
  • AppArmor / SELinux — MAC profiles.
  • Union filesystem (overlay2 default) — layered FS, copy-on-write.

docker is just a frontend (the CLI plus the dockerd daemon); the actual container runtime underneath is containerd + runc (the OCI runtime).

  • OCI Image Spec — image format.
  • OCI Runtime Spec — what a runtime must do.
  • OCI Distribution Spec — registry API.

So you can run images built by Docker on containerd, CRI-O, podman, etc. Kubernetes talks to containerd or CRI-O directly through the CRI; its built-in Docker integration (dockershim) was removed in 1.24.

  • Image = list of layer digests + config JSON (env, cmd, ports).
  • Each layer = tar gzip of files added/changed/deleted.
  • docker history img shows layer chain.
  • Pulling: only missing layers. Sharing across images saves disk.
  • A layer is reused if instruction + previous layer match.
  • COPY looks at file checksums.
  • RUN is opaque — Docker doesn’t know what changed inside, only the command + previous state.
  • Buildx + cache mounts let RUN reuse package caches across builds.
| | ENTRYPOINT | CMD |
|---|---|---|
| Form | ["bin","arg"] | ["arg"] (or ["bin","arg"]) |
| Override on docker run | --entrypoint | trailing args |
| Typical use | the executable | default args |

Pattern:

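ENTRYPOINT ["python", "-m", "app"]
# Default subcommand. Dockerfile comments must sit on their own line;
# a trailing "# ..." after CMD would be parsed as part of the instruction.
CMD ["serve"]
# docker run img          → python -m app serve
# docker run img migrate  → python -m app migrate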

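PID 1 in a container has special signal semantics: the kernel does not apply default signal actions to PID 1, so SIGTERM is ignored unless the process installs a handler. Most images don’t have a proper init, and a shell wrapper running as PID 1 does not forward signals to its children → SIGTERM never reaches the app → the container doesn’t shut down gracefully and is SIGKILLed when the stop timeout expires (10 s by default for docker stop).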

Solutions:

  • docker run --init adds tini.
  • tini / dumb-init as ENTRYPOINT.
  • App-level signal handlers.

In Kubernetes, this matters for graceful shutdown during rolling updates.
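
The "tini as ENTRYPOINT" option can be sketched as below. This is a minimal illustration, not a canonical setup: it assumes a Debian-based Python image (where tini is available as a package), and the module name app is the hypothetical one from the ENTRYPOINT/CMD pattern above.

```dockerfile
FROM python:3.12-slim

# Install tini from the Debian repositories (binary lands in /usr/bin/tini).
# Cleaning the apt lists in the same RUN keeps them out of the layer.
RUN apt-get update \
    && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /srv
# Assumes the build context contains the hypothetical "app" package.
COPY . .

# tini runs as PID 1, forwards signals such as SIGTERM to the child process,
# and reaps zombie processes.
ENTRYPOINT ["/usr/bin/tini", "--", "python", "-m", "app"]
# Default subcommand, still overridable with trailing docker run arguments.
CMD ["serve"]
```

With this layout, docker stop delivers SIGTERM to tini, which passes it on to the Python process. The app still needs its own handler if it must finish in-flight work before exiting.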

  • App writes to stdout/stderr.
  • Docker logging driver collects: json-file (default), syslog, journald, fluentd, awslogs, gelf, splunk.
  • In K8s: container stdout → kubelet → log collector (Fluent Bit, Vector) → backend (Loki, ES, CloudWatch).

Don’t log to files inside container — they’re ephemeral.

overlay2 is default. Old: aufs (deprecated), devicemapper (slow), btrfs.

Layer count limit ~125. Squash periodically if too deep.

Containers are not strong security boundaries. Risks:

  • Kernel exploit escapes the container.
  • Misconfigured --privileged / extra caps.
  • Mounted Docker socket → trivial host takeover.
  • Running as root inside maps to root outside (without user namespaces).

Hardening:

  • Non-root user.
  • Read-only filesystem (--read-only + tmpfs for /tmp).
  • Drop all capabilities, add only needed.
  • Seccomp default profile.
  • Minimal base image (distroless / scratch).
  • Image signing (cosign).
  • Vuln scan (Trivy, Grype) in CI.

For higher isolation: gVisor, Kata Containers, Firecracker.

  • Non-root user.
  • Specific image tag (digest in prod).
  • HEALTHCHECK or liveness/readiness in K8s.
  • Graceful shutdown handler.
  • Resource limits set.
  • Logs to stdout.
  • Config via env / mounted secrets, not baked.
  • Image scanned, signed.
  • Layers minimized.

  1. Container vs VM. Containers share the host kernel and are isolated by namespaces + cgroups; VMs each run their own kernel on hardware virtualization.
  2. Where do container processes show up on the host? As regular processes — ps -ef sees them, but PIDs differ inside.
  3. How does layered FS work? Copy-on-write — read from lower layers; modify writes to top layer.
  4. Why is latest tag bad for prod? Non-reproducible; same tag, different content over time.
  5. Image bloated — how to shrink? Multi-stage, slim base, .dockerignore, dedupe deps, remove caches in same RUN.
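  6. Container memory limit OOMs randomly. Set requests/limits in K8s; to disable swap for a Docker container, set --memory-swap equal to --memory (--memory-swap=-1 allows unlimited swap instead); analyze actual usage; for Node, consider --max-old-space-size so the V8 heap stays below the limit.
  7. SIGTERM not delivered to app. PID 1 issue: run the app under an init (tini, --init) or handle the signal in the app itself.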
  8. Difference between RUN apt install -y x && rm -rf /var/lib/apt/lists/* and not? Cleaning in the same RUN avoids leaving the cache in that layer (it’s there forever otherwise).
  9. What’s a sidecar pattern? Companion container in the same pod sharing volumes/network — proxy, log shipper, encryption.
  10. How are multi-arch images built? docker buildx build --platform linux/amd64,linux/arm64 --push.

  • Running database in container with bind mount to NFS without testing.
  • One image trying to be many roles via env-driven branching.
  • Using docker exec as deployment workflow.
  • Privileged containers without justification.
  • Mounting /var/run/docker.sock to a service container.
  • Big monolithic compose file as production config.

Deep dive — multi-stage builds (why & how)

Multi-stage builds let you use multiple FROM statements in a single Dockerfile, each starting a new build stage you can selectively copy artifacts from using COPY --from=<stage>. The point: keep heavy build-time tooling (compilers, dev dependencies, test frameworks, source code) out of the final runtime image.

Docker’s docs: the goal is creating “a tiny production image with nothing but the binary inside.” A second benefit specific to BuildKit (default builder since Docker 23.0): BuildKit only builds stages that the target stage actually depends on, whereas the legacy builder ran every preceding stage regardless.

BuildKit cache & secret mounts extend multi-stage further:

  • RUN --mount=type=cache,target=/root/.npm persists the npm cache across builds without baking it into a layer — repeat builds skip the network entirely for unchanged deps.
  • --mount=type=secret,id=npm_token exposes a secret as a file inside the RUN step only; it never lands in any layer or image history.

Standard Node.js pattern is three stages: deps (install with npm ci), build (run tsc / vite build), runtime (copy only dist/ + production node_modules onto a minimal base).

  • COPY --from=build copies file ownership too; if build ran as root and runtime is non-root, you get permission errors. Use COPY --chown=1000:1000.
  • npm prune --omit=dev must run after the build (TS compiler is in devDeps).
  • BuildKit cache mounts are not shared across CI runners by default — on GitHub Actions you need actions/cache or cache-from=type=gha.
  • --mount=type=secret requires BuildKit (the default builder since Docker 23.0; set DOCKER_BUILDKIT=1 on older engines) and Dockerfile syntax >= 1.2.
  • Don’t COPY . . before installing deps — you’ll bust the npm install layer on every code change.
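
The three-stage pattern and the gotchas above might look like the following sketch. It rests on several assumptions that are not from the original: a "build" script in package.json that emits dist/server.js, a .dockerignore that excludes node_modules and dist, and the distroless :nonroot tag (UID 65532) as the runtime base.

```dockerfile
# syntax=docker/dockerfile:1

# Stage 1 (deps): install all dependencies, dev included, from the lockfile.
FROM node:20-bookworm-slim AS deps
WORKDIR /app
COPY package.json package-lock.json ./
# The cache mount keeps npm's download cache out of the image layers.
RUN --mount=type=cache,target=/root/.npm npm ci

# Stage 2 (build): compile, then prune devDependencies AFTER the build,
# because the TypeScript compiler itself lives in devDependencies.
FROM deps AS build
COPY . .
RUN npm run build && npm prune --omit=dev

# Stage 3 (runtime): only the build output and production node_modules.
FROM gcr.io/distroless/nodejs20-debian12:nonroot AS runtime
WORKDIR /app
# --chown hands the files to the non-root runtime user (65532 = nonroot).
COPY --from=build --chown=65532:65532 /app/package.json ./package.json
COPY --from=build --chown=65532:65532 /app/node_modules ./node_modules
COPY --from=build --chown=65532:65532 /app/dist ./dist
# The distroless nodejs entrypoint is already node, so CMD is just the script.
CMD ["dist/server.js"]
```

Building with docker build --target build . stops at the second stage, which is handy for running tests in CI without producing the runtime image.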

Q: Walk me through a multi-stage Dockerfile for a Node.js TypeScript service. Why three stages?

Stage 1 installs all deps with npm ci, so that layer is rebuilt only when package.json or the lockfile changes. Stage 2 builds TypeScript using those deps, then prunes devDependencies. Stage 3 starts from a minimal/distroless base and copies only dist/ + production node_modules. Result: ~150 MB image with no tsc, no npm, no shell, which means a smaller pull, a smaller attack surface, and build tooling that is provably absent in production.

Q: How do you handle a private npm registry token without leaking it?

Never use ARG or ENV for secrets; both end up in image history. Use BuildKit’s --mount=type=secret,id=npm_token so the token is mounted as a tmpfs file inside the RUN step only. In CI, pass --secret id=npm_token,env=NPM_TOKEN. Verify with docker history that the token isn’t in any layer.
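
A hedged sketch of the secret-mount approach described in this answer. The .npmrc file and its contents are assumptions for illustration: it is presumed to contain a line such as //registry.example.com/:_authToken=${NPM_TOKEN}, where registry.example.com is a placeholder and npm substitutes the environment variable at install time.

```dockerfile
# syntax=docker/dockerfile:1
FROM node:20-bookworm-slim AS deps
WORKDIR /app

# .npmrc holds only the ${NPM_TOKEN} placeholder, never the token itself.
COPY .npmrc package.json package-lock.json ./

# The secret is mounted at /run/secrets/npm_token for this RUN step only.
# The variable is set for the npm process alone, so it is not stored in the
# image config or in any layer.
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN="$(cat /run/secrets/npm_token)" npm ci

# Build with the token taken from the CI environment:
#   docker build --secret id=npm_token,env=NPM_TOKEN .
```

Because .npmrc is copied into this stage, keeping the stage out of the final image (as in a multi-stage build) avoids shipping registry configuration to production.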

Sources: docs.docker.com/build/building/multi-stage, docs.docker.com/build/building/secrets, docs.docker.com/build/cache/optimize.


Google’s distroless images (gcr.io/distroless/nodejs20-debian12, gcr.io/distroless/static-debian13, etc.) contain only your application and its language runtime — no shell, no package manager, no busybox, no apt.

Sizes: static-debian13 ~2 MiB; Alpine ~5 MiB; Debian slim ~70+ MiB.

The security thesis: every binary in the image is a potential CVE; removing them improves “the signal-to-noise of CVE scanners” and shrinks the attack surface (an attacker who pops a shell into a distroless container can’t wget, curl, or even sh -c). Used by Kubernetes itself (since v1.15), Knative, Tekton.

Alpine uses musl libc instead of glibc, which breaks native modules compiled for glibc (bcrypt, sharp, node-canvas are common pain points — they either need an alpine-specific build or won’t load). Distroless uses glibc, so native modules work, but you can’t docker exec a shell into a running container for debugging — use the :debug tag (which adds busybox) only in non-prod or use ephemeral debug containers (kubectl debug).

Chainguard Images (built on the Wolfi distro) are a newer alternative: glibc-based, signed with cosign, ship with SBOM by default, target near-zero CVEs at release time. De facto choice when you want distroless + supply-chain provenance out of the box.

  • Distroless has no shell, so shell-form CMD node server.js fails (shell form is run via /bin/sh -c); use exec form CMD ["node", "server.js"], or just ["server.js"] for the nodejs variant, whose entrypoint is already node.
  • No ls, no cat, no sh for debugging — use kubectl debug --image=busybox --target=app pod/x.
  • Alpine + Node native modules (bcrypt, sharp) frequently fail at runtime; pre-build with the node:20-alpine builder or switch base.
  • Sizes: node:20 (full Debian) ≈ 1 GB; node:20-slim ≈ 240 MB; node:20-alpine ≈ 150 MB; gcr.io/distroless/nodejs20 ≈ 150 MB. Slim+distroless wins on glibc compatibility.
  • Distroless images run as root unless you pick a :nonroot tag (or set USER yourself); the built-in nonroot user is UID 65532, so align your file ownership with it.

Sources: github.com/GoogleContainerTools/distroless, edu.chainguard.dev/chainguard/chainguard-images/overview.


Deep dive — non-root + security hardening

Containers default to running as UID 0 (root) inside the namespace. Even with user namespaces, a root-in-container exploit chain is still strictly more dangerous than non-root.

In the Dockerfile, set USER 1000 (or a named user created via RUN useradd).
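
A minimal sketch of the named-user variant, assuming a Debian-based Node image. The user name app, UID/GID 10001, and server.js are illustrative choices, not values from the original; 10001 is picked because the official node images already ship a node user at UID 1000.

```dockerfile
FROM node:20-bookworm-slim

# Create an unprivileged user with no home directory and no login shell.
RUN groupadd --gid 10001 app \
    && useradd --uid 10001 --gid app --no-create-home --shell /usr/sbin/nologin app

WORKDIR /app
# Hand ownership to the unprivileged user at copy time.
COPY --chown=app:app . .

# A numeric UID lets Kubernetes runAsNonRoot verify the user is not root.
USER 10001:10001

# Listen on an unprivileged port such as 8080 so no extra capability is needed.
CMD ["node", "server.js"]
```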

  • --read-only mounts the rootfs read-only (force writes to explicit tmpfs mounts).
  • --cap-drop=ALL --cap-add=NET_BIND_SERVICE drops every Linux capability and re-adds only what you need.
  • --security-opt=no-new-privileges sets the kernel no_new_privs flag so a setuid binary can’t escalate.
  • --security-opt seccomp=profile.json restricts which syscalls the kernel allows.

In Kubernetes, the equivalent settings live under securityContext at pod and container level. The production-grade baseline matches the Pod Security Standards “restricted” profile.

  • readOnlyRootFilesystem: true breaks any app writing to /tmp, /var/log, or framework caches — mount emptyDir volumes for those paths.
  • Binding to ports < 1024 needs CAP_NET_BIND_SERVICE (or just listen on 8080 and let the Service map 80 → 8080).
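  • runAsNonRoot: true is enforced by the kubelet, which only checks the effective UID from runAsUser or the image’s USER metadata; an image with USER 0 will fail to start (good), and so will one whose USER is a name rather than a number, because the kubelet cannot verify that it is non-root.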
  • allowPrivilegeEscalation is forced to true if container is privileged or has CAP_SYS_ADMIN.
  • npm’s ~/.npm cache dir defaults to /root/.npm; if you switch to UID 1000, set npm_config_cache=/app/.npm or it silently degrades.

Sources: kubernetes.io/docs/tasks/configure-pod-container/security-context, docs.docker.com/engine/security.


Layer caching is order-sensitive. Docker hashes each instruction + its inputs; the first instruction whose hash differs from the cache invalidates every subsequent layer. Order Dockerfile commands from least to most frequently changed: base image → OS packages → language deps (COPY package*.json ./ followed by RUN npm ci) → application source (COPY . .) → build command. The classic mistake is COPY . . before npm ci: every code change then invalidates the install layer and reinstalls node_modules from scratch.

.dockerignore is not optional: without it the whole directory, including .git, node_modules, .env, coverage/, test fixtures, and the Dockerfile itself, is sent as build context and copied into the image by COPY . . (slow upload to the daemon, cache busts, possible secret leaks).

BuildKit cache mounts persist npm/pip/apt caches outside the layer — the cache survives even when the install layer is invalidated.
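
The ordering rule and the cache mount can be combined as in this single-stage sketch. The ca-certificates package and the "build" npm script are placeholders chosen for illustration; a .dockerignore excluding node_modules and .git is assumed.

```dockerfile
# syntax=docker/dockerfile:1

# 1. Base image: changes rarely.
FROM node:20-bookworm-slim
WORKDIR /app

# 2. OS packages: change occasionally (ca-certificates is only an example).
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# 3. Language deps: layer is reused until the manifest or lockfile changes.
COPY package.json package-lock.json ./
# The cache mount survives even when this layer is invalidated, so a changed
# lockfile reuses previously downloaded packages instead of refetching them.
RUN --mount=type=cache,target=/root/.npm npm ci

# 4. Application source: changes on every commit, so it comes late.
COPY . .

# 5. Build command (assumes a "build" script in package.json).
RUN npm run build
```

Editing a source file now invalidates only steps 4 and 5; steps 1 to 3 come straight from the layer cache.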

Multi-arch builds via docker buildx build --platform linux/amd64,linux/arm64 produce a single manifest list that points to per-arch images; clients pull the right one. Cross-compile when possible (Go, Rust trivially); QEMU emulation works for any language but is 5–20× slower for compute-heavy steps.

  • COPY package.json package-lock.json ./ (two files) — if you forget the lockfile, npm ci fails.
  • BuildKit cache mounts are per-builder-instance; on ephemeral CI runners use cache-from/to type=gha or type=registry.
  • Multi-arch via QEMU compiling C/C++ is brutally slow — use native ARM runners or cross-compile.
  • Apple Silicon developers building images locally produce arm64-only images that fail in production on x86 — always test the platform you’ll deploy.

Sources: docs.docker.com/build/cache/optimize, docs.docker.com/build/building/multi-platform, docs.docker.com/develop/develop-images/dockerfile_best-practices.


Every container service heading to production should pass this gate:

| Item | What & why |
|---|---|
| Liveness probe | Detect deadlocks → kubelet restarts pod. HTTP /healthz returning 200 once event loop is responsive. Don’t check downstream deps here (one DB blip → restart loop). |
| Readiness probe | Detect “not ready for traffic” (warmup, full queue) → removed from Service endpoints. Check downstream deps here. |
| Startup probe | For slow-start apps (JVM, large model loads) → suppresses liveness until startup completes. |
| Graceful shutdown | Trap SIGTERM, mark readiness false, stop accepting new conns, finish in-flight, exit. K8s sends SIGTERM, waits terminationGracePeriodSeconds (default 30), then SIGKILL. A small preStop delay can help load balancers drain, but distroless images have no shell, so use an HTTP hook, app-level drain endpoint, or native lifecycle sleep where supported. |
| Resource requests + limits | requests reserves capacity for the scheduler; limits cap usage. CPU limit throttles (rarely kills); memory limit OOMKills. Set requests = p95 actual usage, limits = 1.5–2× requests. |
| stdout/stderr logging | Don’t write log files inside the container. Log to stdout; the container runtime captures it. |
| Structured JSON logs | One JSON object per line with level, ts, traceId, msg, service fields → ingested by Loki/Datadog/CloudWatch with zero parsing. |
| No secrets in image | Use BuildKit --mount=type=secret at build time and K8s Secrets / external secret managers (AWS Secrets Manager via CSI driver, Vault) at runtime. |
| Image scan + SBOM per build | Trivy gate on HIGH/CRITICAL fixed CVEs; SBOM stored as artifact keyed on image digest. |
| Signed image | cosign sign keyless via OIDC; deploy admission controller (Kyverno, Connaisseur) verifies signatures before pods start. |
| Pinned base image | node:20.11.1-bookworm-slim@sha256:abc… (tag + digest). Reproducible builds, no surprise updates. |
| Non-root, read-only rootfs, drop ALL caps | See non-root + security section above. |

Sources: kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes, kubernetes.io/docs/concepts/workloads/controllers/deployment.