AI SRE / AI Ops engineer-7

Montreal, Quebec, Canada | Hybrid | Full Time | Posted 2 months ago

Role summary

We are seeking an AI SRE / AI Ops Engineer in Montreal, QC, for a full-time hybrid role. This position focuses on Site Reliability Engineering and AI Operations for large-scale systems. The ideal candidate will have production experience in SRE/Infrastructure/Ops, strong programming skills in Python, Go, or Java, and deep expertise in containerization (Docker) and orchestration (Kubernetes). Proficiency in Infrastructure-as-Code tools like Terraform and experience with AI compute clusters, high-performance storage, and monitoring tools (Prometheus, Grafana, Datadog) are essential. The role also requires strong networking and systems engineering knowledge, capacity planning, performance tuning, incident response, and a proven ability to reduce operational toil through automation. Experience in regulated environments is a plus.

Montréal, Quebec H1A 0A1 | Posted March 29th, 2026

Job Type: Full Time

Job Category: IT

Job Description

AI SRE / AI Ops engineer

Montreal, QC - Hybrid

Skills Required:

Production experience in SRE / Infrastructure / ops for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation
