Site Reliability Engineer / AI Ops Engineer

Toronto, Ontario, Canada | Onsite | Contract | Posted 2 months ago

Role summary

We are seeking a skilled Site Reliability Engineer (SRE) / AIOps Engineer to design, build, and operate intelligent, automated reliability solutions. This role focuses on combining SRE expertise with AI-driven observability, monitoring, and automation using tools like Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, GitHub Actions, and Python. Key responsibilities include implementing AI-powered monitoring, engineering AIOps workflows for automation, configuring incident response systems, defining SRE practices like SLOs, integrating CI/CD pipelines, and developing Python-based automation. The ideal candidate will have hands-on experience with the specified tools and a strong understanding of distributed systems and cloud infrastructure.

Position: Site Reliability Engineer / AI Ops Engineer

Location: Toronto, ON

Overview:

We are seeking a highly skilled Site Reliability Engineer (SRE) / AIOps Engineer to design, build, and operate intelligent, automated reliability solutions across production environments.

This role combines deep SRE expertise with modern AI-driven observability, monitoring, and automation. You will work with leading tools such as Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, GitHub Actions, and Python to create proactive, self-healing systems that improve reliability and reduce manual effort.

Key Responsibilities

AI-Driven Observability & Monitoring

Implement and optimize monitoring solutions using Dynatrace, Splunk, and Moogsoft
Leverage AI/ML capabilities (Davis AI, Splunk ITSI, Moogsoft AIOps) to:
  Detect anomalies
  Predict incidents
  Correlate events across distributed systems
  Reduce alert noise via intelligent clustering

AIOps Workflow Engineering

Design and build AI-powered workflows to automate:
  Incident detection
  Root cause analysis
  Remediation actions
  Post-incident insights
Integrate AI insights into automated pipelines and operational runbooks

Incident Response & Automation

Configure and manage PagerDuty for alerting and escalation
Build self-healing automation using Ansible, Python, and GitHub Actions
Develop automated remediation playbooks triggered by AI-driven events

Platform Reliability & SRE Practices

Define and implement SLOs, SLIs, and error budgets
Improve reliability through automation and performance tuning
Reduce operational toil via scalable engineering solutions

DevOps & CI/CD Integration

Build pipelines with Git and GitHub Actions that integrate:
  Observability signals
  AI-driven quality gates
  Automated rollback and recovery

Python Development

Develop Python-based automation and integrations
Build tooling connecting monitoring platforms, ticketing systems, and automation engines

Required Skills & Experience

Core Technical Skills

Hands-on experience with:
  Dynatrace (Davis AI preferred)
  Splunk (ITSI, MLTK preferred)
  Moogsoft AIOps
  PagerDuty
  Ansible
  Git & GitHub Actions
  Python scripting

AIOps & Automation

Experience leveraging AI/ML in observability and incident management tools
Ability to design automated workflows for:
  Event correlation
  Predictive alerting
  Automated remediation
  Intelligent routing

SRE Expertise

Strong understanding of distributed systems and cloud infrastructure
Experience with SLO/SLI frameworks and reliability engineering practices
Familiarity with Kubernetes and Docker

Regards

Patrick Fernandez

Talent Acquisition Group - Strategic Recruitment Manager

Ready to apply?

You'll be redirected to Themesoft Inc.'s application page.