Senior AI Infrastructure Engineer (LLMOps/MLOps)
**Kai**
is the AI company rebuilding cybersecurity for the machine-speed era. Founded by second-time founders and trusted by Fortune 500 enterprises, Kai is building a future where security has no categories, no silos, and no human-speed bottlenecks. The Kai Agentic Platform replaces fragmented, human-limited workflows with agentic AI systems that continuously contextualize, assess, reason, and execute security work at the speed of thought, making human defenders superhuman.
**Why Kai?**
- $125M in Funding: We are well-funded and have the resources to innovate and scale rapidly.
- Proven Early Success with Fortune 500 Customers: We have begun partnering with Fortune 500 companies, an early validation of the potential and reliability of our AI-powered cybersecurity offerings.
- Experienced Leadership: Our founding team consists of second- and third-time entrepreneurs, each with over 25 years of experience in the cybersecurity industry. Their proven expertise and vision drive our ambitious goals, positioning us to lead in AI-powered cybersecurity.
- World-Class Leadership Team: Our Heads of AI, Engineering, and Product bring extensive experience from some of the world’s most influential companies, ensuring top-tier mentorship, direction, and vision.
- Cutting-Edge AI Solutions: Our team leverages the most advanced AI technologies, including Large Language Models (LLMs) and Generative AI.
- Generous Compensation: We offer highly competitive salaries, equity options, and a supportive work environment. Your contributions will be valued and rewarded as we grow together.
- Cybersecurity Knowledge Preferred but Not Required: While experience in cybersecurity is a plus, we are primarily seeking top-tier talent in microservices architecture, software development, and/or DevOps who are passionate about solving complex problems.
As a **Senior AI Infrastructure Engineer**, you will own the design, deployment, and scaling of our **AI infrastructure and production pipelines**. You’ll bridge the gap between our **AI research team** and **engineering organization**, enabling the deployment of advanced **LLM and ML models** into secure, high-performance production systems.
You will build APIs, automate workflows, optimize GPU clusters, and ensure our models perform reliably in real-world cybersecurity applications. This role is ideal for someone who thrives in a startup environment — hands-on, cross-functional, and driven to build world-class AI systems from the ground up.
Key Responsibilities
**Core (Mission-Critical)**
- Own and manage the AI infrastructure stack — GPU clusters, vector databases, and model serving frameworks (vLLM, Triton, Ray, or similar).
- Productionize LLMs and ML models developed by the AI team, deploying them into secure, monitored, and scalable environments.
- Design and maintain REST/gRPC APIs for inference and automation, integrating tightly with the core cybersecurity platform.
- Collaborate closely with AI scientists, backend engineers, and DevOps to streamline deployment workflows and ensure production reliability.
**Infrastructure & Reliability**
- Build and maintain infrastructure-as-code (IaC) setups using Terraform or Pulumi for reproducible environments.
- Implement observability and monitoring — latency, throughput, model drift, and uptime dashboards with Prometheus / Grafana / OpenTelemetry.
- Automate CI/CD pipelines for model training, validation, and deployment using GitHub Actions, ArgoCD, or similar tools.
- Architect scalable, hybrid AI systems across on-prem and cloud, enabling cost-effective compute scaling and fault tolerance.
**Security, Data, and Performance**
- Enforce data privacy and compliance across AI pipelines (SOC2, encryption, access control, VPC isolation).
- Manage data and model artifacts, including versioning, lineage tracking, and storage for models, checkpoints, and embeddings.
- Optimize inference latency, GPU utilization, and throughput, using batching, caching, or quantization techniques.
- Build fallback and failover mechanisms to maintain service reliability in case of model or API failure.
**Innovation & Leadership**
- Research and integrate emerging LLMOps and MLOps tools (e.g., LangGraph, Vertex AI, Ollama, Triton, Hugging Face TGI).
- Create sandbox environments for AI researchers to experiment safely.
- Lead cost optimization and capacity planning, forecasting GPU and cloud needs.
- Document and maintain runbooks, architecture diagrams, and standard operating procedures.
- Mentor junior engineers and contribute to a culture of operational excellence and continuous improvement.
Qualifications
**Required**
- 5+ years of experience in ML Infrastructure, MLOps, or AI Platform Engineering.
- Proven expertise with LLM serving, distributed systems, and GPU orchestration (e.g., Kubernetes, Ray, or vLLM).
- Strong programming skills in Python and experience building APIs (FastAPI, Flask, gRPC).
- Proficiency with cloud platforms (Azure, AWS, or GCP) and IaC tools (Terraform, Pulumi).
- Solid understanding of CI/CD, Docker, containerization, and model registry practices.
- Experience implementing observability, monitoring, and fault-tolerant deployments.
**Preferred**
- Familiarity with vector databases (FAISS, Pinecone, Weaviate, Qdrant).
- Exposure to security or compliance-focused environments.
- Experience with PyTorch / TensorFlow and MLflow / Weights & Biases.
- Knowledge of distributed training or large-scale inference optimization (e.g., DeepSpeed, TensorRT, quantization).
- Prior work at startups or fast-paced R&D-to-production environments.