Agentic QA Engineer – Generative AI & Multi-Agent Systems
Role Summary
Seeking a hands-on Agentic QA Engineer to lead end-to-end testing for agentic and multi-agent AI systems. You will define QA strategy, build scalable test frameworks, and ensure reliability, accuracy, latency, and orchestration correctness from Dev → Prod.
Key Responsibilities
Own QA strategy for agentic/multi-agent systems across Dev, Staging, Prod
Design tests for agent orchestration, tool usage, planner-executor loops, inter-agent workflows
Validate state, memory, prompts, context windows, and agent graph correctness
Build resiliency & chaos tests (failover, retries, circuit breakers, degraded modes)
Define and measure latency and reliability SLOs; run soak tests and validate canary releases
Implement accuracy validation frameworks (semantic similarity, factuality, hallucination, guardrails – PII/toxicity)
Perform load/stress testing for multi-agent systems (scale, concurrency, throughput)
Create reusable test artifacts (synthetic data, prompt libraries, simulators, agent fixtures)
Integrate testing into CI/CD pipelines & production monitoring
Drive release readiness, incident triage, and operational excellence
Collaborate with Agentic Ops, Data Science, MLOps, and Platform teams
Required Skills
7+ years of QA experience; 2+ years with AI/ML/LLM systems and agentic architectures
Strong Python or TypeScript/JavaScript (test frameworks, simulators)
Experience with LLM evaluation (BLEU, ROUGE, BERTScore, embeddings, semantic similarity)
Knowledge of prompt testing, guardrails, hallucination detection
Expertise in distributed systems testing, latency profiling, chaos engineering
Experience with LangChain, LangGraph, LlamaIndex, DSPy, OpenAI/Azure OpenAI orchestration
Strong CI/CD experience (GitHub Actions, Azure DevOps)
Observability: OpenTelemetry, Prometheus/Grafana, Datadog
Knowledge of security, PII, compliance in AI systems
Preferred Skills
Multi-agent simulation & agent graph testing
MLOps & evaluation pipelines, A/B testing
AWS, serverless, containers, event-driven architectures
Managing SLAs, cost, and latency for AI systems