Realign Verified
Software, Business Process Management, Enterprise Architecture

AI SRE / AI Ops engineer-7

Montreal, Quebec, Canada | Hybrid | Full Time | Posted 2 months ago

Role summary

We are seeking an AI SRE / AI Ops Engineer in Montreal, QC, for a full-time hybrid role. This position focuses on Site Reliability Engineering and AI Operations for large-scale systems. The ideal candidate will have production experience in SRE/Infrastructure/Ops, strong programming skills in Python, Go, or Java, and deep expertise in containerization (Docker) and orchestration (Kubernetes). Proficiency in Infrastructure-as-Code tools like Terraform and experience with AI compute clusters, high-performance storage, and monitoring tools (Prometheus, Grafana, Datadog) are essential. The role also requires strong networking and systems engineering knowledge, capacity planning, performance tuning, incident response, and a proven ability to reduce operational toil through automation. Experience in regulated environments is a plus.

Montréal, Quebec H1A 0A1 | Posted March 29th, 2026

Job Type: Full Time

Job Category: IT

Job Description

AI SRE / AI Ops engineer

Montreal, QC - Hybrid

Skills Required:

  • Production experience in SRE / Infrastructure / ops for large-scale systems
  • Strong programming/scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
  • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
  • Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
  • Excellent communication, documentation, and cross-team collaboration skills
  • Proven track record of reducing operational toil via automation
