Harell Data
Market Research

Software Engineer - AI Infrastructure

Palo Alto, California, United States | Onsite | Full Time | Posted 2 months ago

Role summary

This is a founding AI Infrastructure Engineer role reporting to the CTO, focused on the 0-to-1 build of the company's core compute and orchestration layer. The engineer will architect GPU compute fabrics, design developer interfaces (SDKs/APIs), operationalize ML lifecycles, and ensure client success through debugging and observability tools. Key responsibilities include making build vs. buy decisions, defining infrastructure and security strategy, and shaping engineering standards and hiring processes. The role requires 5+ years of experience in ML infrastructure or backend systems, with hands-on experience in production ML/DL pipelines (PyTorch, Hugging Face) and Kubernetes on AWS/GCP, ideally for GPU workloads. Strong CS fundamentals and system design skills are essential.

About the Role - Onsite in Palo Alto CA or Bellevue WA (not eligible for relocation assistance)

As a founding AI Infrastructure Engineer, you will report directly to the CTO and lead the development of our core compute and orchestration layer. This is a high-impact role where you will hold a significant ownership stake in the company and lead the 0-to-1 build of our infrastructure. You will work closely with our customers to translate their needs into a world-class platform, while simultaneously shaping our engineering culture and technical direction from the ground up.

What You Will Do

- Architect GPU Compute Fabric: Build and manage the orchestration layer for GPU workloads, ensuring efficient resource allocation and cost management for large-scale training, fine-tuning, and inference.
- Design Developer Interfaces: Build developer-centric SDKs and APIs that transform complex ML workflows into intuitive experiences for researchers and data scientists.
- Operationalize the ML Lifecycle: Develop robust, end-to-end pipelines, from data ingestion and preprocessing to secure model serving and monitoring.
- Client Success & Observability: Work closely with customers to debug fine-tuning jobs and build the observability tools required to track model performance and resource health in real time.
- Define Systems & Culture Strategy: Lead the technical roadmap by making critical "build vs. buy" decisions on infrastructure and security, while directly shaping the team’s engineering standards and hiring processes.

Qualifications

  • 5+ years of software engineering experience, with a focus on ML infrastructure or backend systems supporting ML workloads.
  • Experience deploying and operating ML/DL training or inference pipelines in production (PyTorch, Hugging Face, or similar).
  • Hands-on experience with Kubernetes on AWS/GCP, ideally for GPU workloads.
  • Strong CS fundamentals and system design skills.
  • Ability to thrive in fast-paced, dynamic environments and navigate ambiguity.

This is an onsite role in Palo Alto, CA or Bellevue, WA only and is not eligible for relocation assistance.
