Site Reliability Engineer (Production Systems)
Role summary:
A high-impact infrastructure initiative is seeking experienced Site Reliability Engineers to enhance the reliability and resilience of production systems supporting AI development. The role involves designing incident scenarios, performing root cause analysis, evaluating system observability, and contributing to reliability frameworks. Success requires deep operational insight and the ability to translate real-world challenges into structured problem environments. The position is remote and open only to candidates based in the United States, and it emphasizes hands-on experience with production systems, incident response, and core SRE technologies.
Description:
A high-impact infrastructure initiative focused on improving the reliability and resilience of production-grade systems is seeking experienced Site Reliability Engineers. This work contributes to advancing AI systems designed to reason about real-world operational challenges, including system failures and infrastructure performance.
This opportunity is ideal for individuals with strong hands-on experience in production environments, particularly those who have operated high-availability systems and participated in on-call rotations. Candidates with a background in diagnosing complex outages and improving system observability are a strong fit.
The work involves designing and evaluating realistic incident scenarios, performing root cause analysis, and contributing to system reliability frameworks. Success in this role depends on deep operational insight and the ability to translate real-world engineering challenges into structured problem environments.
Responsibilities:
Design and document realistic production incident scenarios
Perform detailed root cause analysis on simulated system failures
Evaluate system behavior across monitoring and alerting frameworks
Develop scenarios involving capacity planning and system scaling
Review and refine incident response and post-mortem processes
Analyze infrastructure reliability across distributed systems
Collaborate on improving AI systems' understanding of operational best practices
Requirements:
3+ years of experience in Site Reliability Engineering, DevOps, or production engineering
Hands-on experience managing production systems with uptime and SLA requirements
Direct involvement in on-call rotations and incident response workflows
Strong experience conducting structured root cause analysis (RCA)
Proficiency with observability and alerting tools such as Prometheus, Grafana, Datadog, or PagerDuty
Deep understanding of Linux systems and networking fundamentals (TCP/IP, DNS, load balancing)
Experience with containerization and orchestration tools such as Kubernetes and Docker
Familiarity with infrastructure-as-code tools (Terraform, Pulumi, or CloudFormation)
Experience building or maintaining CI/CD pipelines
Strong debugging skills across application and system layers
Ability to work independently in a remote, asynchronous environment
Must be based in the United States
Preferred: Experience contributing to system design documentation or training datasets for AI systems