
Site Reliability Engineer / AIOps Engineer
Role summary
We are seeking a skilled Site Reliability Engineer (SRE) / AIOps Engineer to design, build, and operate intelligent, automated reliability solutions. The role combines SRE expertise with AI-driven observability, monitoring, and automation using tools such as Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, GitHub Actions, and Python. Key responsibilities include implementing AI-powered monitoring, engineering AIOps automation workflows, configuring incident response systems, defining SRE practices such as SLOs, integrating CI/CD pipelines, and developing Python-based automation. The ideal candidate has hands-on experience with these tools and a strong understanding of distributed systems and cloud infrastructure.
Position: Site Reliability Engineer / AIOps Engineer
Location: Toronto, ON
Overview:
We are seeking a highly skilled Site Reliability Engineer (SRE) / AIOps Engineer to design, build, and operate intelligent, automated reliability solutions across production environments.
This role combines deep SRE expertise with modern AI-driven observability, monitoring, and automation. You will work with leading tools such as Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, GitHub Actions, and Python to create proactive, self-healing systems that improve reliability and reduce manual effort.
Key Responsibilities
AI-Driven Observability & Monitoring
- Implement and optimize monitoring solutions using Dynatrace, Splunk, and Moogsoft
- Leverage AI/ML capabilities (Davis AI, Splunk ITSI, Moogsoft AIOps) to:
  - Detect anomalies
  - Predict incidents
  - Correlate events across distributed systems
  - Reduce alert noise via intelligent clustering
AIOps Workflow Engineering
- Design and build AI-powered workflows to automate:
  - Incident detection
  - Root cause analysis
  - Remediation actions
  - Post-incident insights
- Integrate AI insights into automated pipelines and operational runbooks
Incident Response & Automation
- Configure and manage PagerDuty for alerting and escalation
- Build self-healing automation using Ansible, Python, and GitHub Actions
- Develop automated remediation playbooks triggered by AI-driven events
Platform Reliability & SRE Practices
- Define and implement SLOs, SLIs, and error budgets
- Improve reliability through automation and performance tuning
- Reduce operational toil via scalable engineering solutions
DevOps & CI/CD Integration
- Build pipelines with Git and GitHub Actions that integrate:
  - Observability signals
  - AI-driven quality gates
  - Automated rollback and recovery
Python Development
- Develop Python-based automation and integrations
- Build tooling connecting monitoring platforms, ticketing systems, and automation engines
Required Skills & Experience
Core Technical Skills
- Hands-on experience with:
  - Dynatrace (Davis AI preferred)
  - Splunk (ITSI, MLTK preferred)
  - Moogsoft AIOps
  - PagerDuty
  - Ansible
  - Git & GitHub Actions
  - Python scripting
AIOps & Automation
- Experience leveraging AI/ML in observability and incident management tools
- Ability to design automated workflows for:
  - Event correlation
  - Predictive alerting
  - Automated remediation
  - Intelligent routing
SRE Expertise
- Strong understanding of distributed systems and cloud infrastructure
- Experience with SLO/SLI frameworks and reliability engineering practices
- Familiarity with Kubernetes and Docker
Regards
Patrick Fernandez
Talent Acquisition Group - Strategic Recruitment Manager