Principal SRE
Principal Site Reliability Engineer (Principal SRE)
Location: Remote (Must be based in USA)
Pay: $175k - $225k Base + Bonus
Overview
We are partnered with a well‑funded, high‑growth AI technology company building a modern, scalable platform that enables customers to deploy advanced machine learning capabilities directly into their own environments.
This organisation is moving away from rigid, monolithic architectures toward a more flexible, event‑driven and platform‑oriented model. Infrastructure, reliability, and governance are treated as first‑class concerns rather than operational afterthoughts.
The Principal Site Reliability Engineer is a senior architectural leadership role, reporting directly into the executive technology organisation. This position sits at the intersection of infrastructure, machine learning systems, and internal platform governance, with broad influence across engineering, data, ML, and product teams.
This is not a traditional “on‑call heavy” SRE role. The focus is on architectural leverage, durable systems, and long‑term platform evolution, with hands‑on involvement where it matters most.
Responsibilities
- Define and evolve the architectural direction of a large‑scale, event‑driven infrastructure platform running primarily on AWS.
- Influence how reliability, scalability, and operational standards are applied across engineering, data science, and ML teams.
- Design and operate distributed systems that support ML training, orchestration, and model serving at scale.
- Own and guide the reliability strategy for containerised workloads deployed both internally and into customer or partner environments.
- Establish platform‑level standards around idempotency, event replay, schema governance, and operational traceability.
- Provide technical leadership across Kubernetes control planes, multi‑cluster environments, and multi‑tenant isolation strategies.
- Define and mature infrastructure‑as‑code standards, CI/CD pipelines, and governance guardrails that reduce risk and increase deployment confidence.
- Shape observability, incident response, and post‑incident learning practices to drive systemic improvement rather than reactive fixes.
- Mentor senior engineers and help scale an SRE function capable of supporting the platform as the company grows.
Must‑Have Qualifications
- 10–15+ years of experience designing, building, and operating production infrastructure at scale.
- Deep hands‑on experience with AWS, including networking, compute, storage, identity, and event‑driven services.
- Proven experience designing or operating event‑driven architectures (e.g. Kafka/MSK, Kinesis, EventBridge, RabbitMQ).
- Strong production experience managing Kubernetes control planes and multi‑cluster environments.
- Experience with Infrastructure as Code (Terraform preferred) and modern CI/CD practices.
- Background working in data‑heavy or ML‑centric environments, with an understanding of the reliability challenges unique to ML systems.
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.