
SRE Lead
Role summary
We are seeking a Senior Site Reliability Engineer (SRE) with a strong platform ownership mindset to lead reliability, scalability, and performance initiatives for mission-critical distributed systems. This role blends software engineering, cloud infrastructure, and production operations, focusing on building resilient systems, enhancing observability, automating operations, and driving reliability at scale. The SRE Lead will own system availability, performance, and scalability, define SLIs/SLOs, lead incident response and root cause analysis, design monitoring systems, automate operational workflows via CI/CD and IaC, manage cloud infrastructure (AWS, Azure, or GCP), troubleshoot distributed systems and data pipelines, and diagnose and optimize networking and traffic routing. The role also involves applying AI/ML techniques to anomaly detection, alert optimization, and predictive issue identification, and collaborating with engineering and business stakeholders. The role requires 5+ years of experience in SRE, DevOps, or Production Engineering, strong Python and Bash scripting skills, hands-on cloud experience, and proficiency with observability and CI/CD tools.
Position: Senior Site Reliability Engineer (SRE) Platform Lead
Job Location: Toronto, Ontario, Canada
Job Type: Full Time
Immediate Interview
Role Overview
- We are looking for a Senior Site Reliability Engineer (SRE) with a strong platform ownership mindset to drive reliability, scalability, and performance of mission-critical, distributed systems.
- This role sits at the intersection of software engineering, cloud infrastructure, and production operations, with a focus on building resilient systems, improving observability, automating operations, and driving reliability at scale.
- You will act as a technical lead for platform reliability, working closely with engineering and business stakeholders to ensure systems are highly available, performant, and continuously improving.
Experience:
- 5+ years of experience in SRE, DevOps, or Production Engineering
- Experience supporting large-scale distributed systems
- Experience working in production-critical environments with high availability requirements
- Exposure to global systems and cross-team collaboration
Key Responsibilities
Platform Reliability & Ownership
- Own availability, performance, and scalability of production systems
- Define and implement SLIs, SLOs, and error budgets
- Drive continuous improvements in system resilience and efficiency
Incident Management & Root Cause Analysis
- Lead end-to-end incident response and service restoration
- Perform deep root cause analysis across infrastructure, application, data, and network layers
- Implement long-term fixes and reduce recurrence through engineering improvements
Observability & Monitoring
- Design and enhance monitoring, logging, and alerting systems
- Develop actionable dashboards and improve alert quality
- Enable proactive detection of system issues
Automation & DevOps Practices
- Automate operational workflows to reduce manual effort
- Build and maintain CI/CD pipelines
- Implement Infrastructure as Code (IaC) for scalable infrastructure management
Cloud & Distributed Systems
- Manage and optimize systems on modern cloud platforms
- Troubleshoot distributed systems across compute, storage, and network layers
- Diagnose latency, routing, and performance issues in globally distributed environments
Data & Workflow Reliability
- Troubleshoot data pipelines, job failures, and data inconsistencies
- Perform data validation and analysis
- Ensure reliability across data dependencies and workflows
Networking & Traffic Management
- Diagnose issues related to DNS, HTTP/S, proxies, and load balancing
- Work with CDN and edge delivery platforms (e.g., Akamai or similar) to optimize traffic routing and performance
Stakeholder Collaboration
- Act as a liaison between engineering teams and business stakeholders
- Communicate system status, incidents, and risks with clarity and context
- Partner with cross-functional teams to drive reliability improvements
AI-Driven Reliability (Emerging Focus)
- Apply AI/ML-driven techniques for anomaly detection, alert optimization, and predictive issue identification
- Leverage intelligent automation to improve incident response and operational efficiency
Core Expectations
- Demonstrates strong ownership of production systems and outcomes
- Independently drives incident resolution and follow-through
- Applies structured, analytical thinking to complex technical problems
- Communicates effectively in high-impact, production-critical scenarios
- Focuses on long-term reliability and scalability improvements
Technical Skills:
Programming & Automation
- Strong experience in Python for automation and tooling
- Proficiency in shell scripting (Bash)
- Experience with API-driven and event-driven automation
Cloud & Infrastructure
- Hands-on experience with AWS, Azure, or GCP
- Strong understanding of cloud architecture, networking, and security fundamentals
- Infrastructure as Code using Terraform, CloudFormation, or Ansible
DevOps & CI/CD
- Experience with Jenkins, GitLab CI, or similar tools
- Strong understanding of build, release, and deployment pipelines
Observability
- Experience with Datadog, Splunk, Prometheus, or Grafana
- Strong logging, monitoring, and alerting practices
- Familiarity with incident management tools (e.g., PagerDuty)
Data & Databases
- Strong SQL skills for troubleshooting and validation
- Understanding of data pipelines and system dependencies
Systems & Platform
- Strong Linux fundamentals
- Experience with Docker and containerized environments
- Exposure to Kubernetes and web servers (e.g., Nginx)
Orchestration
- Experience with Airflow, Autosys, or similar scheduling tools
Networking & CDN
- Strong understanding of DNS, HTTP/S, proxies, and load balancing
- Experience with CDN and edge delivery platforms (e.g., Akamai or similar)