Vizcom Verified
AI, Design Software, SaaS, Machine Learning

Senior Platform & Reliability Engineer (SRE)

San Francisco, California, United States · Onsite · Full Time · Senior · $200,000–$250,000 /yr · Posted 2 months ago · Visa sponsorship available

About Vizcom
Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure.
We’re hiring a senior engineer to own stability and infrastructure and keep the platform reliable, fast, and resilient as we scale.
Role Mission
Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.
This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.
Compensation
$200,000 – $250,000 base salary + meaningful equity
What You’ll Own

  • Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.
  • Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.
  • Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.
  • Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation.
  • Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).
  • Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.
  • Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.

Traits We’re Looking For

  • Calm, structured incident commander under pressure.
  • Thinks in failure modes and blast radius by default.
  • Pragmatic: can stabilize quickly, then implement durable fixes.
  • High ownership and strong written communication.

First 90 Days

  • Establish baseline reliability metrics and identify top platform risks.
  • Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).
  • Deliver high-impact hardening fixes across probes/startup paths/queue safety.
  • Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

If possible, please include a write-up of one incident you personally led and send it to Jordan@vizcom.com:

  • what failed,
  • how you contained it,
  • what permanent fixes you shipped and what you measured.