Site Reliability Engineer

Montreal, Quebec, Canada | Hybrid | Contract | Posted 1 month ago

Role summary

Seeking a Site Reliability Engineer with 7-8 years of experience in SRE, infrastructure, or operations for large-scale systems. The role requires expertise in supporting IaaS platforms and infrastructure for GenAI applications. Candidates must possess strong programming/scripting skills in Python, Go, or Java, and experience with containerization (Docker) and orchestration (Kubernetes) tools. Proficiency in Infrastructure as Code (IaC) tools like Terraform, Helm, CloudFormation, and Ansible is essential. Knowledge of GPU/AI compute clusters and experience with monitoring/alerting tools (Prometheus, Grafana, ELK/EFK, Datadog) are also required. Strong networking and systems engineering knowledge, including TCP/IP, DNS, routing, load balancing, and distributed storage, is necessary.

7-8 years of experience in SRE, infrastructure, or operations for large-scale systems

Experience supporting IaaS platforms

Experience with infrastructure supporting GenAI applications

Strong programming/scripting skills (Python, Go, or Java)

Experience with containerization (Docker) and orchestration (Kubernetes, etc.) tools

Experience with IaC tools (Terraform, Helm, CloudFormation, Ansible, etc.)

Knowledge of GPU/AI compute clusters

Experience with monitoring/alerting tools (Prometheus, Grafana, ELK/EFK, Datadog, etc.)

Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

Ready to apply?

You'll be redirected to Pacer Group's application page.
