Walk me through one graphics-related testing project from your resume end-to-end: the goal, architecture, your exact contributions, key technical decisions, test coverage strategy, performance/correctness validation, and results. What trade-offs did you make, what went wrong, and what would you do differently now?
Quick Answer: This question evaluates leadership, project ownership, systems thinking, graphics test engineering skill, test strategy design, performance and correctness validation, and technical decision-making. It is commonly asked to assess whether a candidate can communicate complex technical work, justify trade-offs, and demonstrate measurable impact from a real project. It sits in the Behavioral & Leadership category within graphics and test engineering for software engineers, covering test system architecture, testing methodologies, tooling, and metrics, and it probes both conceptual understanding and practical application through a single concrete, end-to-end example.
Solution
# Example answer: Graphics image-based regression and performance harness for a Vulkan renderer
## Situation and goal
- Situation: Our renderer shipped to multiple platforms and GPUs. Visual regressions and performance drops were slipping into releases because tests were manual and device coverage was thin.
- Goal: Build an automated, deterministic test harness that:
- Catches visual regressions end-to-end (not just unit tests).
- Detects performance regressions with statistical confidence.
- Scales across a matrix of GPUs, drivers, and OS versions.
- Integrates with CI for per-commit gates and nightly deep coverage.
Success metrics:
- Reduce manual visual QA hours by >50%.
- Catch >80% of regressions pre-merge (measured via post-hoc incident analysis).
- Keep flake rate <1% across the farm.
## Architecture and tech stack
- Orchestrator (Python): schedules tests, collects results, computes stats, posts CI status.
- Test runner (C++): thin harness around the Vulkan backend to render scenes offscreen with pinned pipeline states and deterministic seeds. Exposes GPU timestamp queries per pass.
- Baseline store: object storage with content-addressed golden images and metadata (scene config, camera, exposure, tone map, seeds, driver hash).
- Comparator: computes PSNR and SSIM, supports alpha-aware diffs, ROI masking, and color-space-aware comparisons (linear vs sRGB).
- Device farm: bare-metal nodes with different GPUs/drivers. Agents claim jobs based on capability tags (e.g., MSAA sample counts, ray tracing support); a minimal scheduling sketch follows the data-flow steps below.
- Debug tooling integration: validation layers, shader compilation checks (glslang/spirv-val), and capture-on-fail (frame capture tool) for triage.
Data flow:
1) CI triggers per-commit smoke tests; nightly runs full matrix.
2) Runner renders scenes to offscreen framebuffers; logs GPU timestamps and counters.
3) Images compared to goldens; thresholds evaluate pass/fail.
4) Failures auto-attach captures and logs to a bug template; PRs are gated.
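To make the capability-tag matching concrete, here is a minimal scheduling sketch in Python. `Job`, `Agent`, and the tag names are illustrative stand-ins, not the production scheduler's API, which also accounts for load, driver lanes, and retries.

```python
# Illustrative sketch (not the production scheduler): an agent claims only the
# jobs whose required capability tags are a subset of what the node advertises.
from dataclasses import dataclass

@dataclass
class Job:
    scene: str
    required_tags: frozenset  # e.g. {"msaa8", "raytracing"}

@dataclass
class Agent:
    name: str
    tags: frozenset           # capabilities advertised by this node

def claim_jobs(agent: Agent, queue: list) -> list:
    """Return the jobs this agent can run and remove them from the shared queue."""
    runnable = [job for job in queue if job.required_tags <= agent.tags]
    for job in runnable:
        queue.remove(job)
    return runnable

if __name__ == "__main__":
    queue = [
        Job("shadow_cascades", frozenset({"msaa4"})),
        Job("rt_reflections", frozenset({"raytracing"})),
    ]
    node = Agent("gpu-node-07", frozenset({"msaa4", "msaa8"}))
    print([j.scene for j in claim_jobs(node, queue)])  # -> ['shadow_cascades']
```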
## My contributions
- Designed the end-to-end architecture, wrote the RFC, and aligned rendering, QA, and infra teams on requirements (determinism, thresholds, change-control for baselines).
- Implemented the comparator module with SSIM, PSNR, and alpha-aware diffs. Fixed pitfalls around linear/sRGB mismatches and pre-multiplied alpha.
- Added GPU instrumentation: per-pass timestamp queries and stable percentile-based performance gating.
- Built the scene/test generator: 150+ canonical scenes covering MSAA, shadow maps, PBR BRDFs, HDR tonemapping, clustered lighting, motion vectors, and texture filtering modes, with TAA disabled for determinism.
- Created flake mitigation: deterministic seeds, vsync off, fixed timestep, disabled fast-math where needed, and repeated runs with bootstrap CIs for performance; the pinned configuration is sketched after this list.
- Stood up the farm scheduler and tagging system; added automatic capture-on-failure and triage labels (e.g., sRGB mismatch, precision, z-fighting, driver crash).
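A minimal sketch of the kind of pinned run configuration implied by the determinism work above. The field names and defaults are illustrative, not the actual config schema.

```python
# Illustrative test-config sketch: every knob that can introduce nondeterminism
# is pinned explicitly so two runs of the same scene produce identical frames.
from dataclasses import dataclass

@dataclass(frozen=True)
class DeterministicRunConfig:
    scene: str
    seed: int = 12345                # seeds all stochastic effects (jitter, dithering, noise)
    vsync: bool = False              # avoid present-timing coupling
    fixed_timestep_ms: float = 16.0  # no wall-clock dependence in animation
    taa_enabled: bool = False        # temporal AA breaks frame-to-frame reproducibility
    auto_exposure: bool = False
    fast_math: bool = False          # keep shader math reproducible where the compiler allows

config = DeterministicRunConfig(scene="clustered_lighting_night")
```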
## Key technical decisions and alternatives
1) End-to-end image diffs vs unit/module tests
- Decision: Do both, but gate on image-based end-to-end; maintain a small set of shader unit tests with a CPU/compute reference for critical math (e.g., BRDF lobes).
- Rationale: End-to-end catches integration issues (state, barriers, color space). Unit tests help isolate precision and math.
2) Golden images strategy
- Option A: Per-GPU goldens. Option B: Single canonical golden from a reference path (CPU or high-precision compute), with per-GPU tolerances.
- Decision: B. Per-GPU goldens would have exploded storage and maintenance costs. We used a canonical baseline with device-specific PSNR/SSIM thresholds.
3) Tolerance and determinism
- Decision: Use PSNR ≥ 40 dB and SSIM ≥ 0.99 for pass, with ROI masks to ignore known non-deterministic regions (analytic AA edges). Seed all noise and disable TAA in test scenes. The comparator sketch after this list shows how the thresholds are applied.
4) Performance gating policy
- Decision: Gate on a P95 frame-time increase greater than 8%, with 10 replicate runs per scene-config; nightly runs cover the full matrix, PR runs cover a subset.
5) Storage and bandwidth
- Decision: Object storage with zstd-compressed PNGs and dedup via perceptual hashing. We also stored deltas for minor baseline shifts.
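As referenced in decision 3, here is a minimal comparator sketch assuming NumPy: PSNR against the canonical golden, an optional ROI mask, and a hypothetical per-device threshold table. The production comparator also computed SSIM and handled premultiplied alpha and linear/sRGB conversion explicitly.

```python
# Illustrative comparator sketch: PSNR against a canonical golden with an optional
# ROI mask and a per-device tolerance table (names and values are hypothetical).
from typing import Optional
import numpy as np

PSNR_THRESHOLDS_DB = {"reference": 45.0, "gpu_family_a": 40.0, "gpu_family_b": 38.0}

def psnr_db(test: np.ndarray, golden: np.ndarray, mask: Optional[np.ndarray] = None,
            max_value: float = 255.0) -> float:
    diff = (test.astype(np.float64) - golden.astype(np.float64)) ** 2
    if mask is not None:                  # mask == 0 marks known-noisy pixels to ignore
        diff = diff[mask.astype(bool)]
    mse = diff.mean()
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def passes(test: np.ndarray, golden: np.ndarray, device: str,
           mask: Optional[np.ndarray] = None) -> bool:
    """Pass if the masked PSNR clears the tolerance assigned to this device family."""
    return psnr_db(test, golden, mask) >= PSNR_THRESHOLDS_DB[device]
```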
## Test coverage strategy
- Functional matrix:
- Rasterization: MSAA variants, depth formats, blending modes, stencil ops.
- Texturing: mip bias, anisotropic filtering, wrap modes, sRGB/linear assets.
- Lighting: PBR BRDFs, IBL, clustered/forward+, shadow cascades, PCF variations.
- Post: HDR tone mapping, bloom, exposure, gamma; deterministic exposure locked for tests.
- Geometry: instancing, skinning, tessellation off/on (where supported).
- Property-based tests:
- Invariants: reflected energy ≤ incident energy for non-emissive surfaces; monotonicity of luminance as roughness increases for certain BRDF choices; rotation invariance for isotropic materials (see the property-test sketch after this section).
- Bounds: NaN/Inf guards; color ∈ [0, 1] post-tonemap.
- Negative tests:
- Invalid descriptor bindings, OOB indices (expect validation errors, not device loss).
- Intentional missing barriers to ensure validation catches hazards.
- Image-based diffs:
- Goldens in linear space; comparisons done after tone mapping for user-visible results; region masks for temporal instability.
Coverage accounting:
- Feature-to-test traceability in a matrix; gaps tracked as tech debt with target dates.
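A minimal sketch of the energy-conservation property referenced above, assuming a Lambertian reference BRDF checked by Monte Carlo integration over the hemisphere. The function names and sample counts are illustrative; the shader-side version checks the same invariant against a CPU reference on small tiles.

```python
# Illustrative property test: a Lambertian BRDF must reflect no more energy than
# it receives. Directional-hemispherical reflectance is estimated by Monte Carlo
# with uniform hemisphere sampling (pdf = 1 / (2*pi), cos(theta) uniform in [0, 1]).
import numpy as np

def lambert_brdf(albedo: float) -> float:
    return albedo / np.pi  # constant over the hemisphere

def hemispherical_reflectance(albedo: float, n_samples: int = 200_000,
                              seed: int = 7) -> float:
    rng = np.random.default_rng(seed)
    cos_theta = rng.random(n_samples)                       # uniform in [0, 1)
    weight = lambert_brdf(albedo) * cos_theta * 2.0 * np.pi  # divide by pdf = 1/(2*pi)
    return weight.mean()                                     # expected value: albedo

def test_energy_conservation():
    for albedo in (0.0, 0.18, 0.5, 1.0):
        assert hemispherical_reflectance(albedo) <= albedo + 1e-2  # Monte Carlo tolerance
```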
## Performance and correctness validation
- Correctness metrics:
- PSNR: PSNR = 10 log10(MAX^2 / MSE). Example: For 8-bit channels, MAX = 255. If MSE = 4, PSNR ≈ 10 log10(255^2 / 4) ≈ 10 log10(16256.25) ≈ 42.1 dB (passes our 40 dB threshold).
- SSIM: windowed SSIM aggregated over the frame; threshold 0.99. We use linear color values and handle premultiplied alpha.
- Performance metrics:
- GPU timestamps per pass; frame time percentiles (P50, P95, P99).
- Replicated runs (n=10) and bootstrap confidence intervals for relative changes.
- Sample-size intuition: with CV ≈ 2–3%, n=10 gives roughly a ±2–3% half-width CI at 95% for percentiles, which is sufficient to detect ≥8% regressions (see the bootstrap sketch below).
- Tooling:
- Validation layers, shader compilation checks (spirv-val), capture-on-fail, and optional hardware counters via vendor-agnostic APIs where available.
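A minimal sketch of the statistical gate, assuming NumPy and frame-time samples pooled from the replicate runs. The 8% budget matches the gating policy above; everything else (names, bootstrap count) is illustrative.

```python
# Illustrative perf-gate sketch: bootstrap a confidence interval for the relative
# change in P95 frame time between baseline and candidate samples, and fail only
# when even the lower bound of the interval exceeds the regression budget.
import numpy as np

def relative_p95_change_ci(baseline_ms: np.ndarray, candidate_ms: np.ndarray,
                           n_boot: int = 10_000, alpha: float = 0.05,
                           seed: int = 0):
    rng = np.random.default_rng(seed)
    deltas = []
    for _ in range(n_boot):
        b = rng.choice(baseline_ms, size=baseline_ms.size, replace=True)
        c = rng.choice(candidate_ms, size=candidate_ms.size, replace=True)
        deltas.append(np.percentile(c, 95) / np.percentile(b, 95) - 1.0)
    lo, hi = np.percentile(deltas, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

def gate(baseline_ms: np.ndarray, candidate_ms: np.ndarray, budget: float = 0.08) -> bool:
    lo, _ = relative_p95_change_ci(baseline_ms, candidate_ms)
    return lo <= budget  # pass unless the whole CI sits above the 8% budget
```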
Guardrails:
- Baseline change control requires an RFC and visual diff review; all thresholds versioned.
- PR runs are fast (smoke + sentinel scenes); nightly covers the full matrix to cap compute costs.
## Results and impact
- Built 1,200 test scenes/configs across 5 GPU families and 3 OSes.
- Reduced manual visual QA effort by ~70% (8 hours → ~2.5 hours per release).
- Pre-merge regression catch rate rose from ~35% to ~88% (tracked over 3 release cycles).
- Flake rate dropped from ~6% to <1% via determinism fixes and statistical gating.
- Discovered and fixed:
- 40+ shader precision/NaN bugs (e.g., roughness=0 edge cases, divide-by-zero in specular term).
- 2 depth pre-pass state issues (barrier/order problems causing flicker).
- Several platform-specific driver issues (e.g., inconsistent derivative behavior at extreme LODs), triaged with captures and minimized repros.
- Performance wins: Identified two costly passes; refactor cut P95 frame time by 12% on mid-tier GPUs in a representative scene.
## What went wrong and how we mitigated it
- Non-deterministic diffs from sRGB/linear mismatches and premultiplied alpha:
- Fix: Standardized asset IO to linear, explicit color transforms, alpha-aware comparator.
- Temporal instability from TAA and auto-exposure:
- Fix: Disabled TAA/auto-exposure in test configs; locked camera and seeds.
- Driver churn caused baseline drift on the farm:
- Fix: Pinned driver versions for CI lanes; new driver lanes run nightly-only until baselines are re-approved.
- Storage blow-up from goldens:
- Fix: Perceptual deduping, delta storage, and pruning obsolete baselines via retention policies.
- Flaky perf gates due to thermal throttling on shared nodes:
- Fix: Isolated power/thermal profiles, warm-up and cooldown windows between runs, and capped concurrent jobs per host.
## Trade-offs
- Single canonical golden with per-GPU tolerances simplified maintenance but required careful threshold tuning. We accepted rare false positives at first to avoid misses.
- Smaller PR test set vs full matrix: optimized developer velocity at the cost of pushing some catching power to nightly runs.
- Image-based testing catches integration issues but can be opaque. We invested in intermediate attachment dumps and automated stage-bisection to aid triage.
## What I would do differently now
- Adopt pairwise/combinatorial test generation earlier to cover feature interactions more efficiently.
- Add a minimal CPU or high-precision compute reference renderer for a handful of scenes to reduce tolerance debates.
- Integrate standardized conformance suites alongside our content-driven tests to cover spec edges.
- Build automatic root-cause bisection: when diffs appear, capture intermediate buffers by stage and binary-search across passes (sketched below).
- Expand to mobile/thermally constrained devices with power/thermal budgets and long-run stability tests.
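A minimal sketch of the pass-bisection idea above. `render_up_to_pass`, `golden_for_pass`, and `images_match` are hypothetical hooks into the capture and comparator tooling, not an existing API.

```python
# Illustrative root-cause bisection: find the first render pass whose intermediate
# output diverges from the golden. Assumes the final pass is known bad (a failing
# end-to-end diff triggered the search) and earlier passes may still match.
def first_bad_pass(num_passes: int, render_up_to_pass, golden_for_pass, images_match) -> int:
    lo, hi = 0, num_passes - 1          # invariant: pass `hi` is known (or assumed) bad
    while lo < hi:
        mid = (lo + hi) // 2
        if images_match(render_up_to_pass(mid), golden_for_pass(mid)):
            lo = mid + 1                # divergence happens after `mid`
        else:
            hi = mid                    # divergence at or before `mid`
    return lo
```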
## Takeaways
- For graphics testing at scale, determinism, explicit color management, and statistical performance gating are non-negotiable.
- Balance maintenance cost (goldens, storage) with signal (thresholds, references). Invest in triage to keep the system developer-friendly.
- Treat baselines and thresholds as code: version, review, and roll them out deliberately.