
SRE / AI Ops Engineer
Role summary
We are seeking a Site Reliability Engineer (SRE) / AI Ops Engineer to design, build, and operate intelligent, automated reliability solutions. This role involves using industry-leading tools like Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, Git/GitHub Actions, and Python to create proactive, self-healing, AI-enhanced workflows. Key responsibilities include implementing AI-driven observability, engineering AI Ops workflows for automation, configuring incident response tools, applying SRE principles, and integrating DevOps/CI/CD practices. The ideal candidate will have hands-on experience with the specified tools and a strong understanding of distributed systems and reliability engineering, aiming to transform operations into a predictive, automated, and intelligent ecosystem.
Job Title: SRE / AI Ops Engineer
Location: Toronto, ON (3 or 4 days onsite a week)
Duration: Long Term Contract
Job Description:
Overview
We are seeking a highly skilled Site Reliability Engineer (SRE) / AI Ops Engineer to design, build, and operate intelligent, automated reliability solutions across our production environments. This role blends deep operational expertise with modern AI‑driven observability, monitoring, and automation practices. You will work with industry‑leading tools (Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, Git/GitHub Actions, and Python) to create proactive, self‑healing, AI‑enhanced workflows that elevate system reliability and reduce manual toil.
This is a hands‑on engineering role for someone who thrives at the intersection of SRE, automation, and AI‑powered operations.
Key Responsibilities
AI‑Driven Observability & Monitoring
- Implement and optimize monitoring solutions using Dynatrace, Splunk, and Moogsoft, leveraging their AI/ML capabilities (e.g., Davis AI, Splunk ITSI, Moogsoft AIOps) to:
  - Detect anomalies
  - Predict incidents
  - Correlate events across distributed systems
  - Reduce alert noise through intelligent clustering
AI Ops Workflow Engineering
- Design and build AI‑powered operational workflows that automate:
  - Incident detection
  - Root cause analysis
  - Remediation actions
  - Post‑incident insights
- Integrate AI insights from observability platforms into automated pipelines and runbooks.
Incident Response & Automation
- Configure and manage PagerDuty for intelligent alerting, escalation policies, and automated incident response.
- Build self‑healing automation using Ansible, Python, and GitHub Actions.
- Develop automated remediation playbooks triggered by AI‑driven events.
Platform Reliability & SRE Practices
- Apply SRE principles such as SLOs, SLIs, error budgets, and chaos testing.
- Improve system reliability through automation, performance tuning, and proactive engineering.
- Reduce operational toil by designing scalable, automated solutions.
DevOps & CI/CD Integration
- Use Git and GitHub Actions to build automated pipelines that integrate:
  - Observability signals
  - AI‑driven quality gates
  - Automated rollback and recovery workflows
Python Scripting & Tooling
- Develop Python‑based automation, data processing, and AI‑enhanced operational tooling.
- Build integrations between monitoring platforms, ticketing systems, and automation engines.
Required Skills & Experience
Core Technical Skills
- Hands‑on experience with:
  - Dynatrace (including Davis AI)
  - Splunk (ITSI, Machine Learning Toolkit preferred)
  - Moogsoft AIOps
  - PagerDuty
  - Ansible
  - Git & GitHub Actions
  - Python scripting
AI Ops & Automation
- Experience leveraging AI/ML features within observability and incident‑management tools.
- Ability to design automated workflows that use AI insights for:
  - Event correlation
  - Predictive alerting
  - Automated remediation
  - Intelligent routing
SRE Expertise
- Strong understanding of distributed systems, cloud infrastructure, and reliability engineering.
- Experience with SLO/SLI design, error budgets, and performance optimization.
- Familiarity with containerized environments (Kubernetes, Docker) is a plus.
Soft Skills
- Strong analytical mindset with a passion for automation and continuous improvement.
- Excellent communication and cross‑team collaboration abilities.
- Ability to translate operational challenges into scalable engineering solutions.
Preferred Qualifications
- Experience with the Red Hat OpenShift cloud platform.
- Exposure to LLM‑based automation or generative AI for operational workflows.
- Background in building or integrating with ChatOps frameworks.
- Knowledge of event‑driven architectures and message queues.
What You’ll Achieve
In this role, you will help transform traditional application and infrastructure operations into a modern, AI‑enhanced reliability ecosystem. You’ll build systems that not only detect and respond to issues but *learn* from them, driving a future where operations are predictive, automated, and intelligent.
Thanks & Regards,
Vignesh
vignesh@themesoft.com

