AI Infrastructure Engineer

San Francisco, California, United States | Onsite | Full Time | $180,000–$220,000 /yr | Posted 2 months ago

Role summary

We are seeking an AI Infrastructure Engineer to build and own the observability and diagnostics layer for a real-time AI assistant platform. This role is crucial for making complex AI systems transparent, debuggable, and reliable through end-to-end tracing, rapid root-cause analysis, and real-time monitoring. Responsibilities include designing event tracing, building automated failure detection pipelines, creating visibility dashboards, monitoring live sessions for anomalies, and enabling human intervention tools. The ideal candidate will have strong backend experience with distributed systems and observability, proficiency in Python and event-driven architectures, and experience debugging complex systems, as well as familiarity with AI/LLM systems, workflow/state machines, and telemetry tools.

Overview

Build and own the observability and diagnostics layer for a real-time AI assistant platform. You’ll make complex AI systems transparent, debuggable, and reliable by enabling end-to-end tracing, rapid root-cause analysis, and real-time monitoring.

Responsibilities

Design event tracing across AI decisioning, workflows, and real-time communication systems
Build automated pipelines to detect, classify, and analyze system failures
Create dashboards for real-time and post-session visibility (timelines, decision paths, errors)
Monitor live sessions and surface alerts for anomalies (latency, loops, failed actions)
Enable human intervention tools for in-session issue handling
Identify recurring failure patterns and drive system improvements
Implement automated triage and alerting to route issues to the right teams

Requirements

Strong backend experience with distributed systems and observability
Proficiency in Python and event-driven architectures
Experience debugging complex systems
Familiarity with AI/LLM systems, workflow/state machines, and telemetry tools

Nice to Have

Experience with real-time/voice systems
Experience with observability tools (e.g., Grafana, OpenTelemetry)
Exposure to human-in-the-loop systems or operational tooling

Ready to apply?

You'll be redirected to High Trail's application page.
