Site Reliability Engineer

Rancho Cordova, California, United States | Onsite | Full Time | $90,000–$150,000 /yr | Posted 2 months ago

Role summary

FarmGPU is seeking a Site Reliability Engineer to ensure the stability and performance of its GPU-powered cloud computing infrastructure. This on-site role in Rancho Cordova, CA, involves monitoring and maintaining production systems, troubleshooting Linux-based bare-metal environments, and improving automation using tools like Ansible, Python, and bash. The engineer will manage configurations, coordinate hardware maintenance, support customer-facing workloads, and develop reliability metrics. Responsibilities include participating in 24/7 on-call rotations. The ideal candidate has strong Linux skills, experience with monitoring tools like Grafana and Prometheus, and a solid understanding of distributed systems and datacenter networking.

About FarmGPU

FarmGPU is redefining the future of GPU-powered cloud computing, delivering cost-effective, scalable, high-performance GPU infrastructure tailored for AI developers, startups, and enterprises globally. Our vertically integrated platform transforms data centers into AI-optimized facilities, accelerates storage-intensive training and inference workflows, and delivers on-demand compute via strategic partnerships such as with RunPod Secure Cloud. With sustainability, performance, and innovation at our core, we challenge the status quo of traditional cloud providers.

As we scale our infrastructure to support high-bandwidth, low-latency AI workloads, we're seeking a Site Reliability Engineer to keep our production GPU clusters, storage systems, and datacenter network running reliably. You'll work alongside our senior engineering team, using the automation tooling, dashboards, and runbooks they've built, while contributing your own improvements to reduce toil and improve operational consistency across our Rancho Cordova facility.

What You'll Do

- Monitor and maintain production systems across GPU servers, storage, and networking using Grafana dashboards, alerting pipelines, and documented runbooks; respond to incidents and escalate appropriately.
- Troubleshoot and resolve issues on Linux-based bare-metal systems: service failures, hardware faults, network degradation, and storage anomalies.
- Execute and improve automation using existing Ansible playbooks and Python/bash scripts; identify operational gaps and contribute improvements to reduce manual intervention.
- Manage configuration and deployments across the server fleet using pull-based configuration management tooling, ensuring consistency and auditability.
- Coordinate hardware maintenance: node replacements, firmware updates, drive swaps, and hands-on rack-level operations in our datacenter.
- Support production reliability for customer-facing GPU compute workloads hosted on RunPod Secure Cloud and direct enterprise deployments.
- Develop and track SLIs and SLOs in partnership with the engineering team to measure and improve service reliability.
- Participate in on-call and shift rotations, including evenings, nights, and weekends as part of 24/7 operations coverage.

What You Bring

- Strong working knowledge of Linux systems: comfortable with the command line, process/service management, log analysis, and hands-on troubleshooting in a production environment.
- Experience with monitoring and observability tools, particularly Grafana and Prometheus: able to navigate dashboards, interpret metric trends, and act on alerts.
- Proficiency in scripting and automation: Python and/or bash for operational task automation; experience running Ansible playbooks in production.
- Solid understanding of distributed system concepts and the ability to troubleshoot complex issues across multiple layers of the stack.
- Familiarity with datacenter networking fundamentals: IP addressing, VLANs, switching, OSI layers 3/4; enough to diagnose and resolve common connectivity issues.
- Experience with bare-metal server environments, including hardware diagnostics, BMC/IPMI management, and routine maintenance.
- Working knowledge of containerization: Docker and/or Kubernetes at an operational level.
- Solid troubleshooting methodology and attention to detail; comfortable following and improving documented runbooks.
- Willingness to work on-site in Rancho Cordova, CA, including shift rotations covering evenings, nights, and weekends.

Preferred Qualifications

- 3+ years in a production SRE, DevOps, or infrastructure operations role.
- Experience implementing and tracking SLIs and SLOs for production services.
- Familiarity with GPU server environments (NVIDIA H100/H200/B200) or HPC infrastructure.
- Experience with storage platforms such as NVMe, NAS, or VAST Data in a production setting.
- Exposure to security and compliance practices: secret management, access control, Linux hardening, SOC 2 familiarity.
- Experience with cloud platforms (AWS, GCP, or Azure) or hybrid datacenter/cloud environments.
- Relevant certifications such as RHCSA, CKA, or AWS Certified DevOps Engineer.

Why FarmGPU?

- Hands-on work with cutting-edge hardware: you'll operate some of the most advanced AI compute infrastructure available.
- Strong technical team: our senior SREs and software engineers have deep expertise and are invested in building solid operational practices.
- High ownership: your work directly impacts the reliability of customer AI workloads.
- Located in Rancho Cordova, CA, in the heart of a growing AI and robotics ecosystem.

Compensation

- $90,000 - $150,000 base salary
- This is a full-time, on-site position in Rancho Cordova, CA. Remote work is not available for this role.
