SITE RELIABILITY ENGINEER
#### Site Reliability Engineer (SRE)
#### Engineer Reliability into the Systems That Move the Nation’s Food Supply
#### Who We Are
#### US Cold owns and operates one of the most complex temperature-controlled logistics networks in North America. Every day, our systems coordinate the storage and movement of food at national scale across a network of state-of-the-art distribution centers, including multiple highly automated warehouse facilities.
#### We continue to advance our core warehouse and logistics platforms, with a current focus on modular, event-driven, API-first, and cloud-based architectures. We are also strengthening our SRE and AI practices to improve reliability and accelerate engineering productivity. This is a major investment in innovation that drives operational excellence at our facilities.
#### If you want to build durable systems that operate in the physical world at scale, this is that opportunity.
#### The Role
#### The Site Reliability Engineer is a founding member of US Cold’s SRE practice.
#### This role exists to move the organization from reactive operations to engineered reliability. You will study how our most critical systems fail — particularly our Phenix WMS and facility automation interfaces — and design controls, automation, and observability that reduce incidents over time.
#### Success in this role means fewer false alerts, faster recovery, less manual intervention, and systems that heal themselves when possible.
#### You will work closely with application, infrastructure, and operations teams and participate directly in on‑call and incident response.
#### What You Will Own
- Reliability of the Phenix WMS and its integration with facility automation systems (robotics, conveyors, and control interfaces)
- Definition and implementation of SLIs and SLOs that measure meaningful system health, not just availability
- Observability across the full stack, correlating cloud services, APIs, and on‑premise facility operations
- Automation to eliminate operational toil, including patching, data corrections, restarts, and recovery tasks
- Development of self‑healing behaviors for common failure modes
- Participation in on‑call rotations and leadership of blameless post‑incident reviews
- Design and execution of disaster recovery tests across SaaS, cloud, and on‑premise environments
#### This is hands‑on reliability engineering. The systems you improve will directly impact daily warehouse operations.
#### Technical Environment
- Hybrid environments spanning cloud and on‑premise infrastructure
- Azure cloud services
- Warehouse Management Systems (Phenix WMS) and facility automation interfaces
- Observability tooling across logs, metrics, and alerting
- Automation using Python, PowerShell, Bash, or Ansible
- CI/CD tools and modern deployment practices
- Exposure to containerized and distributed systems environments
#### What We’re Looking For
- 3+ years of experience in SRE, DevOps, Systems Engineering, or related roles
- Strong Linux and Windows systems administration and troubleshooting skills
- Hands‑on experience with automation and scripting
- Experience designing and operating monitoring, alerting, and observability solutions
- Practical experience working in Azure environments
- Strong analytical skills and a bias toward eliminating root causes, not symptoms
- Ability to collaborate across application, infrastructure, and operations teams
- Experience supporting warehouse management systems or industrial automation platforms
- Exposure to Kubernetes, microservices, or container orchestration
- Familiarity with infrastructure‑as‑code tools such as Terraform or Ansible
- Understanding of distributed systems and high‑availability design
- Experience with SRE practices such as SLO‑based operations, runbook automation, or chaos testing
#### Why This Role Is Different
#### This is not an inherited SRE function.
#### There is no mature framework to maintain.
#### You will:
- Help define what reliability means at US Cold
- Work on systems that operate in the physical world
- Engineer solutions that reduce toil and operational load
- See the direct impact of your work on warehouse uptime and performance
- Build practices that scale as the platform modernizes
#### This is an opportunity to grow as an SRE while helping establish the reliability foundation of a mission‑critical platform.
#### Operational Context
- Systems operate continuously across warehouse facilities
- Reliability failures have physical and operational consequences
- On‑call participation is part of the role
- Work occurs across cloud, SaaS, and on‑premise environments