Site Reliability Engineer

Montreal, Quebec, Canada | Hybrid | Contract | Posted 2 months ago

Role summary

We are seeking a Site Reliability Engineer (SRE) with a focus on AI infrastructure to join our team in Montreal, QC. This hybrid role involves operating, monitoring, and maintaining the infrastructure that supports GenAI applications, including training, inference, and model serving. You will design and build automation to reduce manual toil, develop infrastructure-as-code for provisioning and managing resources like GPU clusters and Kubernetes, and establish SLOs/SLIs/SLAs. Responsibilities include leading incident response, performing capacity planning, optimizing cost-performance, and ensuring system security and compliance. Collaboration with various engineering teams is key, as is defining DR strategies and maintaining operational documentation. Participation in 24/7 on-call rotations is required.

Role: SRE (AI infrastructure)

Location: Montreal, QC - Hybrid

Roles and Responsibilities:

• Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
• Design and build automation for core platform capabilities, reducing manual toil
• Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
• Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
• Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
• Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
• Optimize cost vs. performance tradeoffs in large-scale compute environments
• Harden systems for security, compliance, auditability, and data governance
• Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
• Define disaster recovery (DR) strategies, backup/restore practices, and fault-tolerance mechanisms
• Maintain runbooks, operational playbooks, documentation, and training materials
• Participate in on-call rotations and respond to production incidents 24/7 as needed
• Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Skills:

• Production experience in SRE / Infrastructure / ops for large-scale systems
• Strong programming/scripting skills (Python, Go, Java, or equivalent)
• Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
• Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
• Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
• Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
• Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
• Solid experience in capacity planning, performance tuning, scaling, and incident response
• Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
• Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
• Excellent communication, documentation, and cross-team collaboration skills
• Proven track record of reducing operational toil via automation

Ready to apply?

You'll be redirected to Atlantis IT Group's application page.
