High Trail
Financial Services, Investment Management, Hedge Fund

AI Infrastructure Engineer

San Francisco, California, United States | Onsite | Full Time | $180,000–$220,000/yr | Posted 2 months ago

Role summary

We are seeking an AI Infrastructure Engineer to build and own the observability and diagnostics layer for a real-time AI assistant platform. The role makes complex AI systems transparent, debuggable, and reliable through end-to-end tracing, rapid root-cause analysis, and real-time monitoring. Responsibilities include designing event tracing, building automated failure detection pipelines, creating visibility dashboards, monitoring live sessions for anomalies, and enabling human intervention tools. The ideal candidate has strong backend experience with distributed systems and observability, is proficient in Python and event-driven architectures, has experience debugging complex systems, and is familiar with AI/LLM systems, workflow/state machines, and telemetry tools.

Overview

Build and own the observability and diagnostics layer for a real-time AI assistant platform. You’ll make complex AI systems transparent, debuggable, and reliable by enabling end-to-end tracing, rapid root-cause analysis, and real-time monitoring.

Responsibilities

  • Design event tracing across AI decisioning, workflows, and real-time communication systems
  • Build automated pipelines to detect, classify, and analyze system failures
  • Create dashboards for real-time and post-session visibility (timelines, decision paths, errors)
  • Monitor live sessions and surface alerts for anomalies (latency, loops, failed actions)
  • Enable human intervention tools for in-session issue handling
  • Identify recurring failure patterns and drive system improvements
  • Implement automated triage and alerting to route issues to the right teams

Requirements

  • Strong backend experience with distributed systems and observability
  • Proficiency in Python and event-driven architectures
  • Experience debugging complex systems
  • Familiarity with AI/LLM systems, workflow/state machines, and telemetry tools

Nice to Have

  • Experience with real-time/voice systems
  • Experience with observability tools (e.g., Grafana, OpenTelemetry)
  • Exposure to human-in-the-loop systems or operational tooling