FriendliAI
AI/ML, Cloud Infrastructure, Developer Tools, Deep Learning

Software Engineer - Site Reliability

California, United States | Hybrid | Full Time | Posted 2 months ago | Visa sponsorship available

### Role summary

FriendliAI seeks a Software Engineer, SRE to build and operate the core infrastructure for its large-scale, GPU-accelerated AI inference platform. This hands-on role involves managing Kubernetes fleets, designing cloud architectures on AWS, and leading efforts in observability, CI/CD, and service reliability. Responsibilities include provisioning infrastructure with Terraform and Helm, automating CI/CD pipelines, improving reliability through service mesh operations, operating distributed data systems, and implementing observability tooling. The ideal candidate has 3+ years of experience in SRE/DevOps, proficiency in AWS, Kubernetes, Terraform, and Helm, programming skills in Go, Java, or Python, and strong debugging skills in distributed systems.

### About the job

FriendliAI is looking for an engineer to design, build, and operate the foundations of our large-scale, GPU-accelerated AI inference platform. As a Software Engineer, SRE, you will be responsible for ensuring the reliability, scalability, and efficiency of our cloud-native systems. You’ll work at the intersection of infrastructure, developer platforms, and reliability engineering—building tools and processes that empower engineering teams to ship with confidence.

This is a hands-on role where you will manage large Kubernetes fleets, design resilient cloud architectures, and lead efforts in observability, CI/CD automation, and service reliability.

### Key Responsibilities

  • Design, build, and operate cloud-native infrastructure (primarily AWS) using Terraform and Helm.
  • Manage and scale multi-cluster Kubernetes environments with strong reliability, security, and cost-efficiency in mind.
  • Develop, automate, and maintain CI/CD systems (Argo CD, Argo Rollouts, Spinnaker, GitHub Actions).
  • Enhance reliability through service mesh (Istio) operations, traffic routing optimization, and secure mTLS-based communication.
  • Build and operate distributed data systems (e.g., Redis, Vault) with multi-AZ resiliency.
  • Improve developer velocity through internal deployment platforms, canary rollouts, and self-service tooling.
  • Implement observability practices (Datadog, metrics, logging, alerting) as infrastructure-as-code.
  • Lead post-incident reviews, reliability improvements, and production hardening.
  • Partner closely with product, infra, and security teams to deliver a reliable-by-default developer experience.

### Qualifications

  • 3+ years of experience as an SRE, DevOps, or Infrastructure Engineer operating large-scale cloud systems.
  • Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent.
  • Proficiency in AWS, Kubernetes, Terraform, and Helm.
  • Hands-on experience with service mesh (Istio), CI/CD platforms (Argo, Spinnaker, GitHub Actions), and infrastructure automation.
  • Strong debugging skills across distributed systems (networking, containers, databases, observability).
  • Programming skills in Go, Java, or Python, with the ability to develop tools and automation.
  • Solid understanding of cloud networking, identity, and security fundamentals.

### Preferred Experience

  • Multi-cloud or hybrid-cloud operations.
  • Experience building internal developer platforms on Kubernetes.
  • Knowledge of Redis or other distributed data stores.
  • Policy-as-code (OPA/Gatekeeper, Kyverno) or advanced workload identity patterns.
  • Cost optimization strategies for large-scale workloads (GPU or networking-intensive).
  • Prior contributions to incident management, SLO/SLI design, or chaos engineering.

### Benefits

  • Flexible working hours
  • Daily lunch and dinner provided; unlimited snacks and beverages
  • Supportive and highly collaborative work environment
  • Health check-up support and top-tier equipment/hardware support
  • A front-row seat to the generative AI infrastructure revolution
  • Competitive compensation, startup equity, health insurance, and other benefits

### About FriendliAI

FriendliAI is building the world’s best AI inference platform that makes large language and multi-modal models fast, efficient, and deployable at scale. We power high-throughput, low-latency AI workloads for organizations worldwide and integrate directly with Hugging Face, giving developers instant access to over 500,000 open-source models.

We are a small, fast-moving team doing work that matters at one of the most exciting moments in the history of technology. With our world-class inference engine, we are building a platform that the AI industry can actually rely on.
