Principal Site Reliability Engineer - AI Infrastructure Operations

Washington, United States | Remote | Full Time | Principal | Posted 2 months ago | Visa sponsorship available

Role summary

Nscale is seeking a Principal Site Reliability Engineer (SRE) to provide technical leadership for its AI Infrastructure Operations team. This senior role focuses on setting reliability strategy, designing large-scale control-plane systems and automation frameworks, and defining reliability standards. The Principal SRE will act as a senior technical escalation point, identify reliability risks, and mentor other engineers. The ideal candidate has 10+ years of experience in SRE, Systems, or Software Engineering, with deep expertise in Linux, networking, and distributed systems. Experience with AI/HPC platforms, Kubernetes, and observability systems is a plus. This is a remote-first, full-time position.

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you'll be contributing to building the technology that powers the future.

About The Role

At Nscale, our AI Infrastructure Operations team is responsible for the reliability and scalability of one of the most demanding AI platforms in the industry. We value engineers who think in systems, lead through influence, and raise the bar for operational excellence across the organisation.

We're looking for a Principal Site Reliability Engineer (SRE) to provide technical leadership across our AI Infrastructure Operations domain.

This is a senior, highly impactful role focused on setting reliability strategy, designing foundational systems, and driving cross-team improvements at scale. You will operate as a technical authority for reliability, automation, and operational architecture across Nscale's GPU, network, and control-plane platforms.

What You'll Be Doing

- Owning and evolving the long-term reliability strategy for Nscale's AI and HPC infrastructure
- Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
- Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams
- Acting as a senior technical escalation point during critical incidents, guiding resolution and ensuring systemic fixes
- Identifying structural reliability risks and driving cross-functional initiatives to address them at the architectural level
- Partnering with Engineering, Network Operations, and Fleet Operations leadership to influence platform design and operational maturity
- Mentoring senior and mid-level engineers, raising the overall quality and effectiveness of SRE practices
- Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability

About You

- 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles operating complex, large-scale infrastructure
- Expert-level software engineering skills, with a strong track record of building production-grade automation and systems
- Deep expertise in Linux, networking, and distributed systems design at scale
- Extensive experience debugging and resolving failures across hardware, OS, networking, and application layers
- Proven ability to lead technical initiatives across teams without direct authority
- Strong systems-thinking mindset, with the ability to balance reliability, velocity, and cost

Nice to Have

- Deep hands-on experience with AI or HPC platforms, including GPUs, high-speed interconnects (InfiniBand/RDMA), and workload schedulers (e.g. SLURM)
- Experience designing observability systems for high-cardinality, high-throughput environments
- Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures
- A history of driving step-change improvements in reliability, scalability, or operational efficiency

What We Can Offer You

At Nscale, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.

- Highly competitive package (base + equity) with reviews every 12 months.
- Join the fastest-growing tech startup: this is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI. ✨
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human-First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.

Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.

Equal Opportunities Statement

We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.

If there's anything we can do to accommodate your specific situation, please let us know.

The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position. The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.

*For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.*
