Software Development

Site Reliability Engineering Manager

Markham, Ontario, Canada · Hybrid · Full Time · Manager / Head · $120,000–$130,000 /yr · Posted 1 month ago

CarltonOne is a global B2B technology leader, and part of the Goldman Sachs portfolio, helping organizations around the world reward and inspire exceptional people. Our solutions empower employees to be more productive, sales teams to perform at their best, and customers to stay engaged and loyal.

Our platform powers the global engagement industry, enabling companies to deliver impactful employee recognition, customer loyalty, rewards, sales, and channel incentive programs. We partner with over 450 clients, 500 vendors, and serve 14 million members across 185 countries.

Beyond engagement, every CarltonOne solution drives our eco-action mission: funding tree planting to help restore the planet. To date, we’ve funded over 20 million trees and are on track to plant millions more each year. Learn more at carltonone.com.

About the Opportunity:

We are seeking a strategic and technically adept SRE Manager to lead our Site Reliability Engineering team. This role is pivotal in ensuring the reliability, scalability, and performance of our cloud-native infrastructure and services. You will guide a team of SREs, collaborate cross-functionally with DevOps, Security, and Engineering, and champion best practices in observability, incident response, and automation.

Responsibilities:

Leadership & Strategy

  • Lead, mentor, and grow a team of Site Reliability Engineers, fostering a culture of ownership, continuous learning, and operational excellence
  • Define and drive SRE strategy, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budget management
  • Collaborate with cross-functional teams (Engineering, DevOps, Security, Product) to align reliability goals with business objectives
  • Build and maintain strong relationships with stakeholders across the organization

Reliability & Incident Management

  • Establish and continuously improve the end-to-end incident management lifecycle, from detection through post-incident review
  • Lead coordination of incident response efforts across engineering, DevOps, and support teams during major outages
  • Implement and maintain runbooks and playbooks for common incident scenarios
  • Facilitate blameless postmortems to identify root causes, document findings, and ensure follow-up actions are completed
  • Track and report on incident metrics (MTTR, MTTD, frequency, severity) to identify trends and drive continuous improvement
  • Drive automation initiatives to reduce toil, eliminate manual effort, and improve system resilience

Monitoring, Observability & Performance

  • Design and implement comprehensive monitoring and observability strategies using industry-leading tools including Datadog, Grafana, CloudWatch, and Prometheus
  • Deploy and optimize cloud security monitoring using Rapid7 InsightCloudSec and Wiz for threat detection and compliance
  • Leverage Cloudflare for edge performance monitoring and DDoS protection
  • Establish actionable alerting systems with proper thresholds and escalation paths
  • Analyze performance, availability metrics, and capacity trends to proactively identify and resolve issues
  • Create and maintain dashboards that provide visibility into system health and business-critical metrics

Operational Excellence & Cloud Infrastructure

  • Lead root cause analysis for recurring issues and implement long-term preventative solutions
  • Optimize cloud resource usage and costs through automation, right-sizing, and performance tuning
  • Oversee disaster recovery planning and testing to meet Recovery Point Objective (RPO) and Recovery Time Objective (RTO) requirements
  • Implement and maintain Infrastructure-as-Code (IaC) practices using Terraform, CloudFormation, and Helm
  • Champion security best practices including RBAC, IAM policies, encryption, and vulnerability management
  • Drive capacity planning initiatives to ensure infrastructure scales with business growth

Qualifications

  • Bachelor’s degree in Computer Science, Engineering, or a related field
  • 7+ years of experience in cloud infrastructure, DevOps, or SRE roles, with 2+ years in a leadership role
  • Proven experience managing incident response and reliability programs at scale
  • Deep expertise in AWS services (EKS, EC2, S3, VPC, IAM, RDS Aurora, Lambda)
  • Strong background in Kubernetes, container orchestration, and service meshes
  • Proficiency in Infrastructure-as-Code (Terraform, CloudFormation, Helm)
  • Experience with CI/CD pipelines and automation (Bamboo, Jenkins, Ansible)
  • Solid understanding of networking concepts (TCP/IP, DNS, load balancing, CDN)
  • Familiarity with monitoring and observability platforms (Datadog, Grafana, CloudWatch)
  • Excellent communication, stakeholder management, and cross-functional collaboration skills
  • Strong incident management and crisis leadership capabilities
  • Strategic thinking with focus on long-term reliability and scalability goals

Nice to Have

  • AWS Certified Solutions Architect or SRE-related certifications (SRE Practitioner, CKA, CKAD)
  • Experience with ITIL or other incident management frameworks
  • Solid understanding of security frameworks and tools (RBAC, IAM, KMS, Wiz, Rapid7)
  • Experience with multi-cloud environments (Azure, GCP)
  • Familiarity with Cloudflare, Ubuntu Server, VMware vSphere, and on-premises hosting
  • Experience with observability tools such as OpenTelemetry, Honeycomb, or New Relic
  • Familiarity with chaos engineering principles and tools (Chaos Monkey, Gremlin)
  • Background in high-scale, high-availability systems (99.99%+ uptime SLOs)

Additional Perks

We provide the following perks:

  • Competitive salary and benefits package.
  • Health, dental, and vision coverage.
  • 3 weeks’ vacation plus personal days.
  • Access to our employee benefits portal for exclusive discounts.
  • Monthly company-wide events, celebrations, and team activities.
  • Bravo reward points program for recognition and appreciation.
  • Convenient office location close to public transit.

How to Apply

If this great opportunity looks rewarding to you, let’s connect. Our online application will give you the option to apply to this role directly.

The target hiring range for this position is $120,000 to $130,000. Placement in the salary range will be based on factors such as market conditions, internal equity, and candidate experience, skills, and qualifications relevant to the role.

We value diversity and inclusion and encourage all qualified people to apply. If we can make this easier through accommodation in the recruitment process, or if you need assistance to accommodate a disability, please contact us with the “Help” button in the application.

We will review applications, with priority given to those who have completed the assessment, and look forward to hearing from you.
