Gemini Solutions Pvt

IT Services and IT Consulting

SRE Lead

Toronto, Ontario, Canada | Onsite | Full Time | Lead | Posted 12 days ago

Role summary

We are seeking a Senior Site Reliability Engineer (SRE) with a strong platform ownership mindset to lead reliability, scalability, and performance initiatives for mission-critical distributed systems. This role blends software engineering, cloud infrastructure, and production operations, focusing on building resilient systems, enhancing observability, automating operations, and driving reliability at scale. The SRE Lead will own system availability, performance, and scalability, define SLIs/SLOs, lead incident response and root cause analysis, design monitoring systems, automate operational workflows via CI/CD and IaC, manage cloud infrastructure (AWS, Azure, GCP), troubleshoot distributed systems and data pipelines, and optimize networking. The role also involves applying AI/ML for advanced reliability features and collaborating with engineering and business stakeholders. It requires 5+ years of experience in SRE, DevOps, or Production Engineering, strong Python and Bash scripting skills, hands-on cloud experience, and proficiency in observability and CI/CD tools.

Position: Senior Site Reliability Engineer (SRE) Platform Lead

Job Location: Toronto, Ontario, Canada

Job Type: Full Time

Immediate Interview

Role Overview

We are looking for a Senior Site Reliability Engineer (SRE) with a strong platform ownership mindset to drive reliability, scalability, and performance of mission-critical, distributed systems.
This role sits at the intersection of software engineering, cloud infrastructure, and production operations, with a focus on building resilient systems, improving observability, automating operations, and driving reliability at scale.
You will act as a technical lead for platform reliability, working closely with engineering and business stakeholders to ensure systems are highly available, performant, and continuously improving.

Experience:

5+ years of experience in SRE, DevOps, or Production Engineering
Experience supporting large-scale distributed systems
Experience working in production-critical environments with high availability requirements
Exposure to global systems and cross-team collaboration

Key Responsibilities

Platform Reliability & Ownership

Own availability, performance, and scalability of production systems
Define and implement SLIs, SLOs, and error budgets
Drive continuous improvements in system resilience and efficiency

Incident Management & Root Cause Analysis

Lead end-to-end incident response and service restoration
Perform deep root cause analysis across infrastructure, application, data, and network layers
Implement long-term fixes and reduce recurrence through engineering improvements

Observability & Monitoring

Design and enhance monitoring, logging, and alerting systems
Develop actionable dashboards and improve alert quality
Enable proactive detection of system issues

Automation & DevOps Practices

Automate operational workflows to reduce manual effort
Build and maintain CI/CD pipelines
Implement Infrastructure as Code (IaC) for scalable infrastructure management

Cloud & Distributed Systems

Manage and optimize systems on modern cloud platforms
Troubleshoot distributed systems across compute, storage, and network layers
Diagnose latency, routing, and performance issues in globally distributed environments

Data & Workflow Reliability

Troubleshoot data pipelines, job failures, and data inconsistencies
Perform data validation and analysis
Ensure reliability across data dependencies and workflows

Networking & Traffic Management

Diagnose issues related to DNS, HTTP/S, proxies, and load balancing
Work with CDN and edge delivery platforms (e.g., Akamai or similar) to optimize traffic routing and performance

Stakeholder Collaboration

Act as a liaison between engineering teams and business stakeholders
Communicate system status, incidents, and risks with clarity and context
Partner with cross-functional teams to drive reliability improvements

AI-Driven Reliability (Emerging Focus)

Apply AI/ML-driven techniques for anomaly detection, alert optimization, and predictive issue identification
Leverage intelligent automation to improve incident response and operational efficiency

Core Expectations

Demonstrates strong ownership of production systems and outcomes
Independently drives incident resolution and follow-through
Applies structured, analytical thinking to complex technical problems
Communicates effectively in high-impact, production-critical scenarios
Focuses on long-term reliability and scalability improvements

Technical Skills:

Programming & Automation

Strong experience in Python for automation and tooling
Proficiency in shell scripting (Bash)
Experience with API-driven and event-driven automation

Cloud & Infrastructure

Hands-on experience with AWS, Azure, or GCP
Strong understanding of cloud architecture, networking, and security fundamentals
Infrastructure as Code using Terraform, CloudFormation, or Ansible

DevOps & CI/CD

Experience with Jenkins, GitLab CI, or similar tools
Strong understanding of build, release, and deployment pipelines

Observability

Experience with Datadog, Splunk, Prometheus, or Grafana
Strong logging, monitoring, and alerting practices
Familiarity with incident management tools (e.g., PagerDuty)

Data & Databases

Strong SQL skills for troubleshooting and validation
Understanding of data pipelines and system dependencies

Systems & Platform

Strong Linux fundamentals
Experience with Docker and containerized environments
Exposure to Kubernetes and web servers (e.g., Nginx)

Orchestration

Experience with Airflow, Autosys, or similar scheduling tools

Networking & CDN

Strong understanding of DNS, HTTP/S, proxies, and load balancing
Experience with CDN and edge delivery platforms (e.g., Akamai or similar)
