Site Reliability Engineer (Production Systems)

United States | Remote | Contract | Posted 2 months ago | Visa sponsorship available

Role summary

A high-impact infrastructure initiative is seeking experienced Site Reliability Engineers to enhance the reliability and resilience of production systems supporting AI development. The role involves designing incident scenarios, performing root cause analysis, evaluating system observability, and contributing to reliability frameworks. Success requires deep operational insight and the ability to translate real-world challenges into structured problem environments. The position is remote, requires US-based candidates, and emphasizes hands-on experience with production systems, incident response, and core SRE technologies.

Description:

A high-impact infrastructure initiative focused on improving the reliability and resilience of production-grade systems is seeking experienced Site Reliability Engineers. This work contributes to advancing AI systems designed to reason about real-world operational challenges, including system failures and infrastructure performance.

This opportunity is ideal for individuals with strong hands-on experience in production environments, particularly those who have operated high-availability systems and participated in on-call rotations. Candidates with a background in diagnosing complex outages and improving system observability are an especially good fit.

The work involves designing and evaluating realistic incident scenarios, performing root cause analysis, and contributing to system reliability frameworks. Success in this role depends on deep operational insight and the ability to translate real-world engineering challenges into structured problem environments.

Responsibilities:

Design and document realistic production incident scenarios

Perform detailed root cause analysis on simulated system failures

Evaluate system behavior across monitoring and alerting frameworks

Develop scenarios involving capacity planning and system scaling

Review and refine incident response and post-mortem processes

Analyze infrastructure reliability across distributed systems

Collaborate on improving AI understanding of operational best practices

Requirements:

3+ years of experience in Site Reliability Engineering, DevOps, or production engineering

Hands-on experience managing production systems with uptime and SLA requirements

Direct involvement in on-call rotations and incident response workflows

Strong experience conducting structured root cause analysis (RCA)

Proficiency with observability and alerting tools such as Prometheus, Grafana, Datadog, or PagerDuty

Deep understanding of Linux systems and networking fundamentals (TCP/IP, DNS, load balancing)

Experience with containerization and orchestration tools such as Kubernetes and Docker

Familiarity with infrastructure-as-code tools (Terraform, Pulumi, or CloudFormation)

Experience building or maintaining CI/CD pipelines

Strong debugging skills across application and system layers

Ability to work independently in a remote, asynchronous environment

Must be based in the United States

Preferred: Experience contributing to system design documentation or training datasets for AI systems

Ready to apply?

You'll be redirected to The UVA VEC's application page.