
Senior AI Infrastructure Engineer (LLMOps/MLOps)

San Jose, California, United States · Onsite · Full Time · Senior · Posted 2 months ago


**Kai** is the AI company rebuilding cybersecurity for the machine-speed era. Founded by second-time founders and trusted by Fortune 500 enterprises, Kai is building a future where security has no categories, no silos, and no human-speed bottlenecks. The Kai Agentic Platform replaces fragmented, human-limited workflows with agentic AI systems that continuously contextualize, assess, reason, and execute security work at the speed of thought, making human defenders superhuman.
**Why Kai?**

  • $125M in Funding: We are well-funded and have the resources to innovate and scale rapidly.
  • Proven Early Success with Fortune 500 Customers: We have begun partnering with Fortune 500 companies — an early signal of trust in the potential and reliability of our AI-powered cybersecurity offerings.
  • Experienced Leadership: Our founding team consists of second- and third-time entrepreneurs, each with over 25 years of experience in the cybersecurity industry. Their proven expertise and vision drive our ambitious goals, positioning us to lead in AI-powered cybersecurity.
  • World-Class Leadership Team: Our Heads of AI, Engineering, and Product bring extensive experience from some of the world’s most influential companies, ensuring top-tier mentorship, direction, and vision.
  • Cutting-Edge AI Solutions: Our team leverages the most advanced AI technologies, including Large Language Models (LLMs) and Generative AI.
  • Generous Compensation: We offer highly competitive salaries, equity options, and a supportive work environment. Your contributions will be valued and rewarded as we grow together.
  • Cybersecurity Knowledge Preferred but Not Required: While experience in cybersecurity is a plus, we are primarily seeking top-tier talent in microservices architecture, software development, and/or DevOps who are passionate about solving complex problems.

As a **Senior AI Infrastructure Engineer**, you will own the design, deployment, and scaling of our **AI infrastructure and production pipelines**. You'll bridge the gap between our **AI research team** and **engineering organization**, enabling the deployment of advanced **LLM and ML models** into secure, high-performance production systems.

You will build APIs, automate workflows, optimize GPU clusters, and ensure our models perform reliably in real-world cybersecurity applications. This role is ideal for someone who thrives in a startup environment — hands-on, cross-functional, and driven to build world-class AI systems from the ground up.
**Key Responsibilities**
**Core (Mission-Critical)**

  • Own and manage the AI infrastructure stack — GPU clusters, vector databases, and model serving frameworks (vLLM, Triton, Ray, or similar).
  • Productionize LLMs and ML models developed by the AI team, deploying them into secure, monitored, and scalable environments.
  • Design and maintain REST/gRPC APIs for inference and automation, integrating tightly with the core cybersecurity platform.
  • Collaborate closely with AI scientists, backend engineers, and DevOps to streamline deployment workflows and ensure production reliability.

**Infrastructure & Reliability**

  • Build and maintain infrastructure-as-code (IaC) setups using Terraform or Pulumi for reproducible environments.
  • Implement observability and monitoring — latency, throughput, model drift, and uptime dashboards with Prometheus / Grafana / OpenTelemetry.
  • Automate CI/CD pipelines for model training, validation, and deployment using GitHub Actions, ArgoCD, or similar tools.
  • Architect scalable, hybrid AI systems across on-prem and cloud, enabling cost-effective compute scaling and fault tolerance.

**Security, Data, and Performance**

  • Enforce data privacy and compliance across AI pipelines (SOC2, encryption, access control, VPC isolation).
  • Manage data and model artifacts, including versioning, lineage tracking, and storage for models, checkpoints, and embeddings.
  • Optimize inference latency, GPU utilization, and throughput, using batching, caching, or quantization techniques.
  • Build fallback and failover mechanisms to maintain service reliability in case of model or API failure.

**Innovation & Leadership**

  • Research and integrate emerging LLMOps and MLOps tools (e.g., LangGraph, Vertex AI, Ollama, Triton, Hugging Face TGI).
  • Create sandbox environments for AI researchers to experiment safely.
  • Lead cost optimization and capacity planning, forecasting GPU and cloud needs.
  • Document and maintain runbooks, architecture diagrams, and standard operating procedures.
  • Mentor junior engineers and contribute to a culture of operational excellence and continuous improvement.

**Qualifications**
**Required**

  • 5+ years of experience in ML Infrastructure, MLOps, or AI Platform Engineering.
  • Proven expertise with LLM serving, distributed systems, and GPU orchestration (e.g., Kubernetes, Ray, or vLLM).
  • Strong programming skills in Python and experience building APIs (FastAPI, Flask, gRPC).
  • Proficiency with cloud platforms (Azure, AWS, or GCP) and IaC tools (Terraform, Pulumi).
  • Solid understanding of CI/CD, Docker, containerization, and model registry practices.
  • Experience implementing observability, monitoring, and fault-tolerant deployments.

**Preferred**

  • Familiarity with vector databases (FAISS, Pinecone, Weaviate, Qdrant).
  • Exposure to security or compliance-focused environments.
  • Experience with PyTorch / TensorFlow and MLflow / Weights & Biases.
  • Knowledge of distributed training or large-scale inference optimization (e.g., DeepSpeed, TensorRT, quantization).
  • Prior work at startups or fast-paced R&D-to-production environments.