Gemini Solutions Pvt (Verified)
IT Services and IT Consulting

SRE Lead

Toronto, Ontario, Canada · Onsite · Full Time · Lead · Posted 12 days ago

Role summary

We are seeking a Senior Site Reliability Engineer (SRE) with a strong platform ownership mindset to lead reliability, scalability, and performance initiatives for mission-critical distributed systems. The role blends software engineering, cloud infrastructure, and production operations, with a focus on building resilient systems, enhancing observability, and automating operations. The SRE Lead will own system availability, performance, and scalability; define SLIs and SLOs; lead incident response and root cause analysis; design monitoring systems; automate operational workflows through CI/CD and IaC; manage cloud infrastructure (AWS, Azure, or GCP); troubleshoot distributed systems and data pipelines; and diagnose and optimize networking and traffic routing. The role also involves applying AI/ML techniques to anomaly detection, alert optimization, and predictive issue identification, and collaborating with engineering and business stakeholders. Candidates need 5+ years of experience in SRE, DevOps, or Production Engineering, strong Python and Bash scripting skills, hands-on cloud experience, and proficiency with observability and CI/CD tools.

Position: Senior Site Reliability Engineer (SRE) Platform Lead

Job Location: Toronto, Ontario, Canada

Job Type: Full Time

Immediate Interview

Role Overview

  • We are looking for a Senior Site Reliability Engineer (SRE) with a strong platform ownership mindset to drive reliability, scalability, and performance of mission-critical, distributed systems.
  • This role sits at the intersection of software engineering, cloud infrastructure, and production operations, with a focus on building resilient systems, improving observability, automating operations, and driving reliability at scale.
  • You will act as a technical lead for platform reliability, working closely with engineering and business stakeholders to ensure systems are highly available, performant, and continuously improving.

Experience

  • 5+ years of experience in SRE, DevOps, or Production Engineering
  • Experience supporting large-scale distributed systems
  • Experience working in production-critical environments with high availability requirements
  • Exposure to global systems and cross-team collaboration

Key Responsibilities

Platform Reliability & Ownership

  • Own availability, performance, and scalability of production systems
  • Define and implement SLIs, SLOs, and error budgets
  • Drive continuous improvements in system resilience and efficiency

Incident Management & Root Cause Analysis

  • Lead end-to-end incident response and service restoration
  • Perform deep root cause analysis across infrastructure, application, data, and network layers
  • Implement long-term fixes and reduce recurrence through engineering improvements

Observability & Monitoring

  • Design and enhance monitoring, logging, and alerting systems
  • Develop actionable dashboards and improve alert quality
  • Enable proactive detection of system issues

Automation & DevOps Practices

  • Automate operational workflows to reduce manual effort
  • Build and maintain CI/CD pipelines
  • Implement Infrastructure as Code (IaC) for scalable infrastructure management

Cloud & Distributed Systems

  • Manage and optimize systems on modern cloud platforms
  • Troubleshoot distributed systems across compute, storage, and network layers
  • Diagnose latency, routing, and performance issues in globally distributed environments

Data & Workflow Reliability

  • Troubleshoot data pipelines, job failures, and data inconsistencies
  • Perform data validation and analysis
  • Ensure reliability across data dependencies and workflows

Networking & Traffic Management

  • Diagnose issues related to DNS, HTTP/S, proxies, and load balancing
  • Work with CDN and edge delivery platforms (e.g., Akamai or similar) to optimize traffic routing and performance

Stakeholder Collaboration

  • Act as a liaison between engineering teams and business stakeholders
  • Communicate system status, incidents, and risks with clarity and context
  • Partner with cross-functional teams to drive reliability improvements

AI-Driven Reliability (Emerging Focus)

  • Apply AI/ML-driven techniques for anomaly detection, alert optimization, and predictive issue identification
  • Leverage intelligent automation to improve incident response and operational efficiency

Core Expectations

  • Demonstrates strong ownership of production systems and outcomes
  • Independently drives incident resolution and follow-through
  • Applies structured, analytical thinking to complex technical problems
  • Communicates effectively in high-impact, production-critical scenarios
  • Focuses on long-term reliability and scalability improvements

Technical Skills

Programming & Automation

  • Strong experience in Python for automation and tooling
  • Proficiency in shell scripting (Bash)
  • Experience with API-driven and event-driven automation

Cloud & Infrastructure

  • Hands-on experience with AWS, Azure, or GCP
  • Strong understanding of cloud architecture, networking, and security fundamentals
  • Infrastructure as Code using Terraform, CloudFormation, or Ansible

DevOps & CI/CD

  • Experience with Jenkins, GitLab CI, or similar tools
  • Strong understanding of build, release, and deployment pipelines

Observability

  • Experience with Datadog, Splunk, Prometheus, or Grafana
  • Strong logging, monitoring, and alerting practices
  • Familiarity with incident management tools (e.g., PagerDuty)

Data & Databases

  • Strong SQL skills for troubleshooting and validation
  • Understanding of data pipelines and system dependencies

Systems & Platform

  • Strong Linux fundamentals
  • Experience with Docker and containerized environments
  • Exposure to Kubernetes and web servers (e.g., Nginx)

Orchestration

  • Experience with Airflow, Autosys, or similar scheduling tools

Networking & CDN

  • Strong understanding of DNS, HTTP/S, proxies, and load balancing
  • Experience with CDN and edge delivery platforms (e.g., Akamai or similar)
