Software Engineer - Site Reliability
### Role summary
FriendliAI seeks a Software Engineer, SRE to build and operate the core infrastructure for its large-scale, GPU-accelerated AI inference platform. This hands-on role involves managing Kubernetes fleets, designing cloud architectures on AWS, and leading efforts in observability, CI/CD, and service reliability. Responsibilities include using Terraform and Helm for infrastructure, automating CI/CD pipelines, enhancing reliability with service mesh, operating distributed data systems, and implementing observability tools. The ideal candidate has 3+ years of experience in SRE/DevOps, proficiency in AWS, Kubernetes, Terraform, and Helm, programming skills in Go, Java, or Python, and strong debugging skills in distributed systems.
### About the job
FriendliAI is looking for an engineer to design, build, and operate the foundations of our large-scale, GPU-accelerated AI inference platform. As a Software Engineer, SRE, you will be responsible for ensuring the reliability, scalability, and efficiency of our cloud-native systems. You’ll work at the intersection of infrastructure, developer platforms, and reliability engineering—building tools and processes that empower engineering teams to ship with confidence.
This is a hands-on role where you will manage large Kubernetes fleets, design resilient cloud architectures, and lead efforts in observability, CI/CD automation, and service reliability.
### Key Responsibilities
- Design, build, and operate cloud-native infrastructure (primarily AWS) using Terraform and Helm.
- Manage and scale multi-cluster Kubernetes environments with reliability, security, and cost efficiency in mind.
- Develop, automate, and maintain CI/CD systems (Argo CD, Argo Rollouts, Spinnaker, GitHub Actions).
- Enhance reliability through service mesh (Istio) operations, traffic routing optimization, and secure mTLS-based communication.
- Build and operate distributed data systems (e.g., Redis, Vault) with multi-AZ resiliency.
- Improve developer velocity through internal deployment platforms, canary rollouts, and self-service tooling.
- Implement observability practices (Datadog, metrics, logging, alerting) as infrastructure-as-code.
- Lead post-incident reviews, reliability improvements, and production hardening.
- Partner closely with product, infra, and security teams to deliver a reliable-by-default developer experience.
### Qualifications
- 3+ years of experience as an SRE, DevOps, or Infrastructure Engineer operating large-scale cloud systems.
- Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent.
- Proficiency in AWS, Kubernetes, Terraform, Helm.
- Hands-on experience with service mesh (Istio), CI/CD platforms (Argo, Spinnaker, GitHub Actions), and infrastructure automation.
- Strong debugging skills across distributed systems (networking, containers, databases, observability).
- Programming skills in Go, Java, or Python, with the ability to develop tools and automation.
- Solid understanding of cloud networking, identity, and security fundamentals.
### Preferred Experience
- Multi-cloud or hybrid-cloud operations.
- Experience building internal developer platforms on Kubernetes.
- Knowledge of Redis or other distributed data stores.
- Policy-as-code (OPA/Gatekeeper, Kyverno) or advanced workload identity patterns.
- Cost optimization strategies for large-scale workloads (GPU or networking-intensive).
- Prior contributions to incident management, SLO/SLI design, or chaos engineering.
### Benefits
- Flexible working hours
- Daily lunch and dinner provided; unlimited snacks and beverages
- Supportive and highly collaborative work environment
- Health check-up support and top-tier equipment/hardware support
- A front-row seat to the generative AI infrastructure revolution
- Competitive compensation, startup equity, health insurance, and other benefits
### About FriendliAI
FriendliAI is building the world’s best AI inference platform that makes large language and multi-modal models fast, efficient, and deployable at scale. We power high-throughput, low-latency AI workloads for organizations worldwide and integrate directly with Hugging Face, giving developers instant access to over 500,000 open-source models.
We are a small, fast-moving team doing work that matters at one of the most exciting moments in the history of technology. With our world-class inference engine, we are building a platform that the AI industry can actually rely on.

