Thia
IT Services, IT Consulting

ML Platform Engineer

United States · Remote · Full Time · $110,000–$160,000/yr · Posted 1 month ago

Role summary

THIA is seeking an ML Platform Engineer to build and operate the infrastructure behind its AI-powered platform. The role focuses on model serving, evaluation pipelines, and observability, so that the ML team can iterate faster. The ideal candidate has strong Python skills and experience with backend services, distributed systems, and cloud infrastructure (GCP/AWS/Azure). Familiarity with LLM-specific infrastructure and multi-tenant SaaS is preferred. This is a fully remote, full-time position on a team that keeps a clean codebase and takes tech debt seriously.

About THIA

THIA is transforming how small and medium enterprises build internal applications and automate business processes. Our AI-powered platform enables business experts to create custom applications using natural language, eliminating the need for expensive development teams. We're well-funded, generating revenue, and solving real problems for companies that need more than off-the-shelf software.

The Role

This is the role for an engineer who builds the systems ML runs on. You'll own model serving, eval pipelines, and the observability layer that makes everything inspectable: the work that makes the rest of the ML team faster. You won't be training models, but you'll need to understand them well enough to debug serving and eval pipelines when they misbehave. You'll work closely with a small, senior team and have direct influence over how our ML stack is built.

We move fast, keep our codebase clean, and take tech debt seriously.

What You'll Do

ML Platform Engineering

  • Build and operate model serving infrastructure, covering routing, batching, autoscaling, latency, and cost
  • Build eval pipelines and observability tooling that make assistant behavior inspectable
  • Build batch inference and data pipelines that feed training and evaluation
  • Support the multi-tenant rollout: tenant-aware routing, isolation, and resource management
  • Read ML code well enough to debug serving and eval pipelines end to end

Collaboration

  • Work autonomously while staying tightly coordinated with a small, async-first team
  • Partner with the ML team to make their iteration loops faster
  • Contribute to architectural decisions and internal documentation

What We're Looking For

Must-Haves

  • Strong Python; comfortable with at least one other production language (Go, Java, TypeScript, C, etc.)
  • Production experience with backend services and one or more of: model-serving infra, batch inference pipelines, queue-based pipelines, or large-scale data processing
  • Distributed-systems fundamentals: queues, autoscaling, observability
  • Cloud infrastructure experience (GCP/AWS/Azure)
  • Able to read ML code well enough to debug serving and eval pipelines, or willing to learn

Strongly Preferred

  • LLM-specific infra: routing, batching, KV-cache management, structured generation
  • Eval pipelines or LLM observability (OpenTelemetry traces, LangSmith, Phoenix, custom)
  • Multi-tenant SaaS infrastructure experience

You Don't Need

  • Experience training models from scratch; this role is about the systems around them

How We Evaluate

We hire for skill and potential, however acquired. If you can do the work, we want to hear from you.

A Note on AI

We actively encourage using AI tools to move faster. Real-world experience is still required: you need it to direct AI effectively, catch what it misses, and spot security issues before they reach production.

Our Stack

Python · TypeScript · Modal · GCP · PostgreSQL / SQLite · Qdrant · Redis · Terraform · Docker · GitLab CI/CD · Datadog · Wiz

What You Gain

  • Ownership: end-to-end accountability for ML platform infrastructure at a growing AI company
  • Impact: direct collaboration with leadership and real influence on technical direction
  • Growth: a clear path to a lead role as the team expands
  • Equity: early-stage equity at an AI startup
  • Flexibility: fully remote with flexible hours
  • Quality: a clean codebase and a team that takes tech debt seriously

Pay: $110,000.00 - $160,000.00 per year

Benefits:

  • 401(k)
  • 401(k) matching
  • Dental insurance
  • Health insurance
  • Paid time off
  • Vision insurance

Work Location: Remote