Themesoft Inc. Verified
IT Services, Consulting, Digital Transformation, Cloud Computing

Site Reliability Engineer / AI Ops Engineer

Toronto, Ontario, Canada | Onsite | Contract | Posted 2 months ago

Role summary

We are seeking a skilled Site Reliability Engineer (SRE) / AIOps Engineer to design, build, and operate intelligent, automated reliability solutions. This role focuses on combining SRE expertise with AI-driven observability, monitoring, and automation using tools like Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, GitHub Actions, and Python. Key responsibilities include implementing AI-powered monitoring, engineering AIOps workflows for automation, configuring incident response systems, defining SRE practices like SLOs, integrating CI/CD pipelines, and developing Python-based automation. The ideal candidate will have hands-on experience with the specified tools and a strong understanding of distributed systems and cloud infrastructure.

Position: Site Reliability Engineer / AI Ops Engineer

Location: Toronto, ON

Overview:

We are seeking a highly skilled Site Reliability Engineer (SRE) / AIOps Engineer to design, build, and operate intelligent, automated reliability solutions across production environments.

This role combines deep SRE expertise with modern AI-driven observability, monitoring, and automation. You will work with leading tools such as Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, GitHub Actions, and Python to create proactive, self-healing systems that improve reliability and reduce manual effort.

Key Responsibilities

AI-Driven Observability & Monitoring

  • Implement and optimize monitoring solutions using Dynatrace, Splunk, and Moogsoft
  • Leverage AI/ML capabilities (Davis AI, Splunk ITSI, Moogsoft AIOps) to:
      • Detect anomalies
      • Predict incidents
      • Correlate events across distributed systems
      • Reduce alert noise via intelligent clustering

AIOps Workflow Engineering

  • Design and build AI-powered workflows to automate:
      • Incident detection
      • Root cause analysis
      • Remediation actions
      • Post-incident insights
  • Integrate AI insights into automated pipelines and operational runbooks

Incident Response & Automation

  • Configure and manage PagerDuty for alerting and escalation
  • Build self-healing automation using Ansible, Python, and GitHub Actions
  • Develop automated remediation playbooks triggered by AI-driven events

Platform Reliability & SRE Practices

  • Define and implement SLOs, SLIs, and error budgets
  • Improve reliability through automation and performance tuning
  • Reduce operational toil via scalable engineering solutions

DevOps & CI/CD Integration

  • Build pipelines using Git & GitHub Actions integrating:
      • Observability signals
      • AI-driven quality gates
      • Automated rollback and recovery

Python Development

  • Develop Python-based automation and integrations
  • Build tooling connecting monitoring platforms, ticketing systems, and automation engines

Required Skills & Experience

Core Technical Skills

  • Hands-on experience with:
      • Dynatrace (Davis AI preferred)
      • Splunk (ITSI, MLTK preferred)
      • Moogsoft AIOps
      • PagerDuty
      • Ansible
      • Git & GitHub Actions
      • Python scripting

AIOps & Automation

  • Experience leveraging AI/ML in observability and incident management tools
  • Ability to design automated workflows for:
      • Event correlation
      • Predictive alerting
      • Automated remediation
      • Intelligent routing

SRE Expertise

  • Strong understanding of distributed systems and cloud infrastructure
  • Experience with SLO/SLI frameworks and reliability engineering practices
  • Familiarity with Kubernetes and Docker

Regards

Patrick Fernandez

Talent Acquisition Group - Strategic Recruitment Manager
