Member of Technical Staff

San Francisco, California, United States · Onsite · Full Time · $180,000–$250,000 /yr · Posted 2 months ago · Visa sponsorship available · Hidden Gem · YC Startup

Role summary

Lucid is seeking an experienced software engineer to build and manage the infrastructure for large-scale ML training. The role involves handling petabytes of data, managing distributed GPU workloads, and optimizing system reliability for ML researchers. The ideal candidate will have strong distributed systems experience, proficiency in Python and PyTorch, and comfort with both cloud and bare metal environments. This is an individual contributor role focused on building foundational systems to enable rapid iteration on experiments.

Lucid is a small team building real-time interactive video models that spark joy, and we are seeking a talented engineer to build the infrastructure that makes this possible. If you are passionate about our mission, we would love to hear from you!

**Role Description**

You are an experienced software engineer who builds the systems that power large-scale ML training. You understand how to move and process petabytes of data efficiently, manage distributed GPU workloads, and create infrastructure that researchers can rely on. You know how to optimize for the specific constraints of training large models while maintaining system reliability.

**What you'll do**

* Build systems to handle petabytes of training data efficiently
* Design and manage distributed training infrastructure across GPU clusters
* Evaluate and implement orchestration systems (we currently use SLURM)
* Work directly with ML researchers to identify and solve infrastructure bottlenecks
* Build the foundational systems that let our team iterate quickly on large-scale experiments

**Requirements**

* Strong distributed systems experience with a focus on performance optimization
* Deep understanding of ML training infrastructure and distributed training patterns
* Experience with Python and PyTorch in production environments
* Comfortable working with both cloud and bare metal infrastructure
* Self-starter who can evaluate technical trade-offs and make architectural decisions
* Excellent communication and collaboration skills

**Location**

San Francisco, CA

**What we offer at Lucid**

* Interesting and challenging work
* Base salary of $180,000–$250,000 plus equity
* Many learning and growth opportunities
* Health, dental, and vision insurance (US)
* Regular team events and offsites
