LiteLLM Verified
AI/ML, Software Development, Developer Tools
Backend Performance Engineer
San Francisco, California, United States · Remote · Full Time · $150–$200 /yr · Posted 2 months ago · Hidden Gem · YC Startup
Role summary
LiteLLM, a rapidly growing open-source LLM Gateway with 28K+ GitHub stars and $2.5M ARR, is seeking a Python Performance Engineer to join their San Francisco-based team. This role is critical for scaling the platform to handle 5K RPS by maximizing throughput, minimizing latency, and ensuring production reliability. The engineer will focus on reducing overhead latency for cache hits and misses, optimizing performance with added components and large payloads, addressing customer-specific latency issues, and resolving memory leaks. The role also involves expanding coverage to new API endpoints like realtime and audio transcriptions.
### **TLDR**
LiteLLM is an **open-source LLM Gateway with 28K+ stars on GitHub**, trusted by companies like **NASA, Rocket Money, Samsara, Lemonade, and Adobe.** We’re expanding rapidly and are seeking a performance engineer to help scale the platform to handle 5K RPS (requests per second). We’re based in San Francisco.
### **What is LiteLLM**
LiteLLM provides an **open-source Python SDK and Python FastAPI server for calling 100+ LLM APIs (Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic) in the OpenAI format.**
We just hit **$2.5M ARR** and have raised a **$1.6M seed round from Y Combinator, Gravity Fund and Pioneer Fund.** You can find more information on our [**website**](https://www.litellm.ai/), [**GitHub**](https://github.com/BerriAI/litellm) and [**Technical Documentation.**](https://docs.litellm.ai/docs/)
### About the Role
We're hiring a Python performance engineer to own throughput, latency, and production reliability for our platform.
**Roadmap for Performance Engineer:**
* By the end of this year, our RPS and latency overhead should be at parity with industry benchmarks. This covers streaming and non-streaming requests for /chat/completions, /completions, /embeddings, /realtime, /audio/transcriptions
* Reduce e2e overhead latency for cache misses. It is currently 100ms-500ms - ensure we meet industry standards.
* Reduce e2e overhead latency for cache hits - ensure we meet industry benchmarks.
* Ensure overhead latency scales well when other components are added to the platform - e.g. Redis, Redis Cluster, DB, Non-Admin Virtual Keys
* Ensure overhead latency scales well with payload size - overhead for a 1MB prompt with streaming should be under 100ms
* Address customer-specific and pipeline-specific latency issues.
  * e.g. Enterprise customers reporting high overhead - this person should be able to debug these issues, join support calls, and help adjust any environment-specific settings.
* Address memory leaks affecting paying customers
  * Enterprise clients have ongoing memory leaks that need to be resolved
* Longer term - add coverage for new endpoints - /realtime, /audio/transcriptions/, /audio/speech