ML Platform Engineer

United States · Remote · Full Time · $110,000–$160,000 /yr · Posted 1 month ago

Role summary

THIA is seeking an ML Platform Engineer to build and operate the infrastructure behind its AI-powered platform. This role focuses on model serving, evaluation pipelines, and observability, enabling the ML team to iterate faster. The ideal candidate will have strong Python skills and experience with backend services, distributed systems, and cloud infrastructure (GCP/AWS/Azure). Familiarity with LLM-specific infrastructure and multi-tenant SaaS is preferred. This is a fully remote, full-time position on a team that keeps a clean codebase and takes tech debt seriously.

About THIA

THIA is transforming how small and medium enterprises build internal applications and automate business processes. Our AI-powered platform enables business experts to create custom applications using natural language, eliminating the need for expensive development teams. We're well-funded, generating revenue, and solving real problems for companies that need more than off-the-shelf software.

The Role

This is the role for an engineer who builds the systems ML runs on. You'll own model serving, eval pipelines, and the observability layer that makes everything inspectable - the work that makes the rest of the ML team faster. You won't be training models, but you'll need to understand them well enough to debug serving and eval pipelines when they misbehave. You'll work closely with a small, senior team and have direct influence over how our ML stack is built.

We move fast, keep our codebase clean, and take tech debt seriously.

What You'll Do

ML Platform Engineering

Build and operate model serving infrastructure: routing, batching, autoscaling, latency, cost
Build eval pipelines and observability tooling that make assistant behavior inspectable
Build batch inference and data pipelines that feed training and evaluation
Support the multi-tenant rollout: tenant-aware routing, isolation, and resource management
Read ML code well enough to debug serving and eval pipelines end to end

Collaboration

Work autonomously while staying tightly coordinated with a small, async-first team
Partner with the ML team to make their iteration loops faster
Contribute to architectural decisions and internal documentation

What We're Looking For

Must-Haves

Strong Python; comfortable with at least one other production language (Go, Java, TypeScript, C, etc.)
Production experience with backend services and one or more of: model-serving infra, batch inference pipelines, queue-based pipelines, or large-scale data processing
Distributed-systems fundamentals: queues, autoscaling, observability
Cloud infrastructure experience (GCP/AWS/Azure)
Able to read ML code well enough to debug serving and eval pipelines, or willing to learn

Strongly Preferred

LLM-specific infra: routing, batching, KV-cache management, structured generation
Eval pipelines or LLM observability (OpenTelemetry traces, LangSmith, Phoenix, custom)
Multi-tenant SaaS infrastructure experience

You Don't Need

Experience training models from scratch - this role is about the systems around them

How We Evaluate

We hire for skill and potential, however acquired. If you can do the work, we want to hear from you.

A Note on AI

We actively encourage using AI tools to move faster. Real-world experience is still required - to direct AI effectively, catch what it misses, and spot security issues before they reach production.

Our Stack

Python · TypeScript · Modal · GCP · PostgreSQL / SQLite · Qdrant · Redis · Terraform · Docker · GitLab CI/CD · Datadog · Wiz

What You Gain

Ownership - end-to-end accountability for ML platform infrastructure at a growing AI company
Impact - direct collaboration with leadership and real influence on technical direction
Growth - clear path to a lead role as the team expands
Equity - early-stage equity at an AI startup
Flexibility - fully remote with flexible hours
Quality - a clean codebase and a team that takes tech debt seriously

Pay: $110,000.00 - $160,000.00 per year

Benefits:

401(k)
401(k) matching
Dental insurance
Health insurance
Paid time off
Vision insurance

Work Location: Remote
