
Site Reliability Engineer
Role summary
We are seeking an experienced Site Reliability Engineer (SRE) with over 8 years of experience to support and scale the infrastructure for GenAI applications, including training, inference, and model serving. The role involves managing and automating cloud infrastructure and GPU clusters, defining SLOs/SLAs, implementing monitoring and incident response, and optimizing performance, scalability, and cost. Key technical skills include Kubernetes, Docker, IaC (Terraform, Helm), scripting (Python, Go, Java), monitoring tools (Prometheus, Grafana, ELK, Datadog), and a strong understanding of networking and system engineering fundamentals. Experience with AI/ML infrastructure and regulated environments is a plus.
Title: Site Reliability Engineer (SRE) – GenAI Platform
Location: Toronto, ON
Duration: Long term
We’re looking for an experienced SRE (8+ yrs) to support and scale infrastructure for GenAI applications (training, inference, model serving).
🔹 Key Skills:
• SRE / Infrastructure Ops for large-scale systems
• Kubernetes, Docker & IaC (Terraform, Helm, etc.)
• Strong scripting (Python, Go, Java)
• Monitoring tools (Prometheus, Grafana, ELK, Datadog)
• Networking + system engineering fundamentals
🔹 What You’ll Do:
• Manage and automate cloud infrastructure & GPU clusters
• Define SLOs/SLAs, monitoring, and incident response (RCA)
• Optimize performance, scalability & cost
• Drive reliability, security, and disaster recovery strategies
⭐ Nice to have: AI/ML infrastructure, regulated environments (Finance/Security)
#Hiring #SRE #Kubernetes #DevOps #GenAI #Cloud #Reliability