Software Engineer - Site Reliability

California, United States | Hybrid | Full Time | Posted 2 months ago | Visa sponsorship available

Role summary

FriendliAI seeks a Software Engineer, SRE to build and operate the core infrastructure for its large-scale, GPU-accelerated AI inference platform. This hands-on role involves managing Kubernetes fleets, designing cloud architectures on AWS, and leading efforts in observability, CI/CD, and service reliability. Responsibilities include provisioning infrastructure with Terraform and Helm, automating CI/CD pipelines, improving reliability through a service mesh, operating distributed data systems, and implementing observability tooling. The ideal candidate has 3+ years of SRE/DevOps experience, is proficient in AWS, Kubernetes, Terraform, and Helm, programs in Go, Java, or Python, and has strong debugging skills in distributed systems.

### About the job

FriendliAI is looking for an engineer to design, build, and operate the foundations of our large-scale, GPU-accelerated AI inference platform. As a Software Engineer, SRE, you will be responsible for ensuring the reliability, scalability, and efficiency of our cloud-native systems. You’ll work at the intersection of infrastructure, developer platforms, and reliability engineering—building tools and processes that empower engineering teams to ship with confidence.

This is a hands-on role where you will manage large Kubernetes fleets, design resilient cloud architectures, and lead efforts in observability, CI/CD automation, and service reliability.

### Key Responsibilities

- Design, build, and operate cloud-native infrastructure (primarily AWS) using Terraform and Helm.
- Manage and scale multi-cluster Kubernetes environments with strong reliability, security, and cost-efficiency in mind.
- Develop, automate, and maintain CI/CD systems (Argo CD, Argo Rollouts, Spinnaker, GitHub Actions).
- Enhance reliability through service mesh (Istio) operations, traffic routing optimization, and secure mTLS-based communication.
- Build and operate distributed data systems (e.g., Redis, Vault) with multi-AZ resiliency.
- Improve developer velocity through internal deployment platforms, canary rollouts, and self-service tooling.
- Implement observability practices (Datadog, metrics, logging, alerting) as infrastructure-as-code.
- Lead post-incident reviews, reliability improvements, and production hardening.
- Partner closely with product, infra, and security teams to deliver a reliable-by-default developer experience.

### Qualifications

- 3+ years of experience as an SRE, DevOps, or Infrastructure Engineer operating large-scale cloud systems.
- Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or an equivalent field.
- Proficiency in AWS, Kubernetes, Terraform, and Helm.
- Hands-on experience with service mesh (Istio), CI/CD platforms (Argo, Spinnaker, GitHub Actions), and infrastructure automation.
- Strong debugging skills across distributed systems (networking, containers, databases, observability).
- Programming skills in Go, Java, or Python, with the ability to develop tools and automation.
- Solid understanding of cloud networking, identity, and security fundamentals.

### Preferred Experience

- Multi-cloud or hybrid-cloud operations.
- Experience building internal developer platforms on Kubernetes.
- Knowledge of Redis or other distributed data stores.
- Policy-as-code (OPA/Gatekeeper, Kyverno) or advanced workload identity patterns.
- Cost optimization strategies for large-scale workloads (GPU or networking-intensive).
- Prior contributions to incident management, SLO/SLI design, or chaos engineering.

### Benefits

- Flexible working hours
- Daily lunch and dinner provided; unlimited snacks and beverages
- Supportive and highly collaborative work environment
- Health check-up support and top-tier equipment/hardware support
- A front-row seat to the generative AI infrastructure revolution
- Competitive compensation, startup equity, health insurance, and other benefits

### About FriendliAI

FriendliAI is building the world’s best AI inference platform that makes large language and multi-modal models fast, efficient, and deployable at scale. We power high-throughput, low-latency AI workloads for organizations worldwide and integrate directly with Hugging Face, giving developers instant access to over 500,000 open-source models.

We are a small, fast-moving team doing work that matters at one of the most exciting moments in the history of technology. With our world-class inference engine, we are building a platform that the AI industry can actually rely on.
