LLM/ML Engineer (Inference)

San Francisco, California, United States · Onsite · Full Time · $200,000–$300,000 /yr · Posted 2 months ago · Hidden Gem · YC Startup

Role summary

We are seeking an LLM/ML Engineer specializing in inference to join our early-stage company in San Francisco. This role requires deep expertise in Python and PyTorch, with a strong understanding of low-level operating systems concepts and modern inference systems like TGI, vLLM, TensorRT-LLM, and Optimum. You will architect and implement scalable inference systems, optimize model serving for high throughput and low latency, and develop custom tooling for testing and optimization. Experience with CUDA, Triton, and compiler optimization is a plus. This is an in-person, fast-paced role for a self-critical individual with a high bar for quality and a proactive approach to problem-solving.

### **We would love to meet you if you:**

* **Philosophy:** You are your own worst critic. You have a high bar for quality and don’t rest until the job is done right—no settling for 90%. We want someone who ships fast, with high agency, and who doesn't just voice problems but actively jumps in to fix them.
* **Experience:** You have deep expertise in Python and PyTorch, with a strong foundation in low-level operating systems concepts including multi-threading, memory management, networking, storage, performance, and scale. You're experienced with modern inference systems like TGI, vLLM, TensorRT-LLM, and Optimum, and comfortable creating custom tooling for testing and optimization.
* **Approach:** You combine technical expertise with practical problem-solving. You're methodical in debugging complex systems and can rapidly prototype and validate solutions.

### **The core work will include:**

* Architecting and implementing robust, scalable inference systems for serving state-of-the-art AI models
* Optimizing model serving infrastructure for high throughput and low latency at scale
* Developing and integrating advanced inference optimization techniques
* Working closely with our research team to bring cutting-edge capabilities into production
* Building developer tools and infrastructure to support rapid experimentation and deployment

### **Bonus points if you:**

* Have experience with low-level systems programming (CUDA, Triton) and compiler optimization
* Are passionate about open-source contributions and staying current with ML infrastructure developments
* Bring practical experience with high-performance computing and distributed systems
* Have worked in early-stage environments where you helped shape technical direction
* Are energized by solving complex technical challenges in a collaborative environment

This is an in-person role at our office in SF. We’re an early-stage company, which means that the role requires working hard and moving quickly. Please only apply if that excites you.

Ready to apply?

You'll be redirected to Reducto's application page.