Software Engineer - Site Reliability Engineering
Role summary
Zoox is looking for a Site Reliability Engineer to ensure the availability, performance, and resilience of the services that power autonomous vehicle development and operations. The role owns the full lifecycle of those services, from design through production deployment and continuous improvement, with a strong emphasis on automation. The engineer will work with systems that process large data volumes and support compute-intensive pipelines on CPUs and GPUs. Required background: SRE experience with large-scale distributed systems, a major cloud platform (AWS, GCP, or Azure), IaC tools, container orchestration such as Kubernetes, core infrastructure knowledge (networking, storage, databases), and programming skills (Python, Go, C/C++, or Java).
In this role, you will:
Architect and optimize scalable systems: You will design, implement, and continuously improve highly reliable infrastructure, directly impacting the success and safety of Zoox's autonomous vehicle platform.
Build proactive monitoring solutions: You will develop advanced monitoring, alerting, and reporting tools to ensure potential issues are identified and resolved before they affect production.
Collaborate across engineering: You will partner closely with software engineering teams to elevate our system architecture, streamline deployment processes, and drive automation initiatives.
Lead incident resolution: You will conduct thorough root cause analyses on production issues and rapidly deploy corrective actions to maintain a resilient and stable environment.
Ensure business continuity: You will safeguard the company's operations by designing and implementing robust disaster recovery plans to keep the Zoox fleet running smoothly under any circumstances.
Qualifications
SRE & Distributed Systems Experience: 5+ years of experience in site reliability engineering or a similar role, with a strong background in managing large-scale distributed systems.
Cloud & Infrastructure as Code (IaC): Proven experience operating within major cloud platforms (AWS, GCP, or Azure) and utilizing IaC tools like Terraform, Ansible, Salt, or CloudFormation.
Container Orchestration: Technical expertise in deploying, managing, and scaling systems using container orchestration technologies such as Kubernetes.
Core Infrastructure Knowledge: Deep, foundational understanding of networking protocols, storage solutions, and database technologies.
Programming Proficiency: Strong, demonstrable programming and scripting skills in languages such as Python, Go, C/C++, or Java.
Bonus Qualifications
Experience in the automotive or autonomous vehicle industry.
Knowledge of security best practices and compliance requirements.