Apptoza Inc. (Verified)
Information Technology & Services, Software Development, Mobile Apps

Site Reliability Engineering Manager

Quebec, Canada · Hybrid · Contract · Manager / Head · Posted 1 month ago

Role: SRE + AI

Location: Montreal, QC (Hybrid: 3 days in office; face-to-face interview)

Duration: 6+ Months

Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and knowledge of networking and systems engineering.

Roles and Responsibilities:

• Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)

• Design and build automation for core platform capabilities, reducing manual toil

• Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.

• Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards

• Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation

• Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting

• Optimize cost vs. performance tradeoffs in large-scale compute environments

• Harden systems for security, compliance, auditability, and data governance

• Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems

• Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms

• Maintain runbooks, operational playbooks, documentation, and training materials

• Participate in on-call rotations and respond to production incidents 24/7 as needed

• Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Skills:

• Production experience in SRE / Infrastructure / ops for large-scale systems

• Strong programming/scripting skills (Python, Go, Java, or equivalent)

• Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)

• Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)

• Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures

• Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)

• Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

• Solid experience in capacity planning, performance tuning, scaling, and incident response

• Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

• Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

• Excellent communication, documentation, and cross-team collaboration skills

• Proven track record of reducing operational toil via automation
