AI Infrastructure Engineer
Role summary
We are seeking an AI Infrastructure Engineer to build and own the observability and diagnostics layer for a real-time AI assistant platform. The role makes complex AI systems transparent, debuggable, and reliable through end-to-end tracing, rapid root-cause analysis, and real-time monitoring. Responsibilities include designing event tracing, building automated failure-detection pipelines, creating visibility dashboards, monitoring live sessions for anomalies, and building human intervention tools. The ideal candidate has strong backend experience with distributed systems and observability, proficiency in Python and event-driven architectures, and experience debugging complex systems, along with familiarity with AI/LLM systems, workflow/state machines, and telemetry tools.
Overview
Build and own the observability and diagnostics layer for a real-time AI assistant platform. You’ll make complex AI systems transparent, debuggable, and reliable by enabling end-to-end tracing, rapid root-cause analysis, and real-time monitoring.
Responsibilities
- Design event tracing across AI decisioning, workflows, and real-time communication systems
- Build automated pipelines to detect, classify, and analyze system failures
- Create dashboards for real-time and post-session visibility (timelines, decision paths, errors)
- Monitor live sessions and surface alerts for anomalies (latency, loops, failed actions)
- Build tools that let humans intervene and handle issues during a live session
- Identify recurring failure patterns and drive system improvements
- Implement automated triage and alerting to route issues to the right teams
Requirements
- Strong backend experience with distributed systems and observability
- Proficiency in Python and event-driven architectures
- Experience debugging complex systems
- Familiarity with AI/LLM systems, workflow/state machines, and telemetry tools
Nice to Have
- Experience with real-time/voice systems
- Experience with observability tools (e.g., Grafana, OpenTelemetry)
- Exposure to human-in-the-loop systems or operational tooling