SRE / AI Ops Engineer

Toronto, Ontario, Canada | Hybrid | Contract | Posted 2 months ago

Role summary

We are seeking a Site Reliability Engineer (SRE) / AI Ops Engineer to design, build, and operate intelligent, automated reliability solutions. This role involves using industry-leading tools like Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, Git/GitHub Actions, and Python to create proactive, self-healing, AI-enhanced workflows. Key responsibilities include implementing AI-driven observability, engineering AI Ops workflows for automation, configuring incident response tools, applying SRE principles, and integrating DevOps/CI/CD practices. The ideal candidate will have hands-on experience with the specified tools and a strong understanding of distributed systems and reliability engineering, aiming to transform operations into a predictive, automated, and intelligent ecosystem.

Job Title: SRE / AI Ops Engineer

Location: Toronto, ON (3 or 4 days onsite a week)

Duration: Long Term Contract

Job Description:

Overview

We are seeking a highly skilled Site Reliability Engineer (SRE) / AI Ops Engineer to design, build, and operate intelligent, automated reliability solutions across our production environments. This role blends deep operational expertise with modern AI‑driven observability, monitoring, and automation practices. You will work with industry‑leading tools (Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, Git/GitHub Actions, and Python) to create proactive, self‑healing, AI‑enhanced workflows that elevate system reliability and reduce manual toil.

This is a hands‑on engineering role for someone who thrives at the intersection of SRE, automation, and AI‑powered operations.

Key Responsibilities

AI‑Driven Observability & Monitoring

- Implement and optimize monitoring solutions using Dynatrace, Splunk, and Moogsoft, leveraging their AI/ML capabilities (e.g., Davis AI, Splunk ITSI, Moogsoft AIOps) to:
  - Detect anomalies
  - Predict incidents
  - Correlate events across distributed systems
  - Reduce alert noise through intelligent clustering

AI Ops Workflow Engineering

- Design and build AI‑powered operational workflows that automate:
  - Incident detection
  - Root cause analysis
  - Remediation actions
  - Post‑incident insights
- Integrate AI insights from observability platforms into automated pipelines and runbooks.

Incident Response & Automation

- Configure and manage PagerDuty for intelligent alerting, escalation policies, and automated incident response.
- Build self‑healing automation using Ansible, Python, and GitHub Actions.
- Develop automated remediation playbooks triggered by AI‑driven events.

Platform Reliability & SRE Practices

- Apply SRE principles such as SLOs, SLIs, error budgets, and chaos testing.
- Improve system reliability through automation, performance tuning, and proactive engineering.
- Reduce operational toil by designing scalable, automated solutions.

DevOps & CI/CD Integration

- Use Git and GitHub Actions to build automated pipelines that integrate:
  - Observability signals
  - AI‑driven quality gates
  - Automated rollback and recovery workflows

Python Scripting & Tooling

- Develop Python‑based automation, data processing, and AI‑enhanced operational tooling.
- Build integrations between monitoring platforms, ticketing systems, and automation engines.

Required Skills & Experience

Core Technical Skills

- Hands‑on experience with:
  - Dynatrace (including Davis AI)
  - Splunk (ITSI, Machine Learning Toolkit preferred)
  - Moogsoft AIOps
  - PagerDuty
  - Ansible
  - Git & GitHub Actions
  - Python scripting

AI Ops & Automation

- Experience leveraging AI/ML features within observability and incident‑management tools.
- Ability to design automated workflows that use AI insights for:
  - Event correlation
  - Predictive alerting
  - Automated remediation
  - Intelligent routing

SRE Expertise

- Strong understanding of distributed systems, cloud infrastructure, and reliability engineering.
- Experience with SLO/SLI design, error budgets, and performance optimization.
- Familiarity with containerized environments (Kubernetes, Docker) is a plus.

Soft Skills

- Strong analytical mindset with a passion for automation and continuous improvement.
- Excellent communication and cross‑team collaboration abilities.
- Ability to translate operational challenges into scalable engineering solutions.

Preferred Qualifications

- Experience with the Red Hat OpenShift cloud platform.
- Exposure to LLM‑based automation or generative AI for operational workflows.
- Background in building or integrating with ChatOps frameworks.
- Knowledge of event‑driven architectures and message queues.

What You’ll Achieve

In this role, you will help transform traditional application and infrastructure operations into a modern, AI‑enhanced reliability ecosystem. You’ll build systems that not only detect and respond to issues but also *learn* from them, driving a future where operations are predictive, automated, and intelligent.

Thanks & Regards,

Vignesh

vignesh@themesoft.com
