AI SRE / AI Ops Engineer
Role summary
We are seeking an AI SRE / AI Ops Engineer in Montreal, QC, for a full-time hybrid role. This position focuses on Site Reliability Engineering and AI Operations for large-scale systems. The ideal candidate will have production experience in SRE/Infrastructure/Ops, strong programming skills in Python, Go, or Java, and deep expertise in containerization (Docker) and orchestration (Kubernetes). Proficiency in Infrastructure-as-Code tools like Terraform and experience with AI compute clusters, high-performance storage, and monitoring tools (Prometheus, Grafana, Datadog) are essential. The role also requires strong networking and systems engineering knowledge, capacity planning, performance tuning, incident response, and a proven ability to reduce operational toil through automation. Experience in regulated environments is a plus.
Montréal, Quebec H1A 0A1 Posted March 29th, 2026
Job Type: Full Time
Job Category: IT
Job Description
AI SRE / AI Ops engineer
Montreal, QC - Hybrid
Skills Required:
- Production experience in SRE / Infrastructure / ops for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
- Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
- Excellent communication, documentation, and cross-team collaboration skills
- Proven track record of reducing operational toil via automation