J&M Group
Consulting, Business Services, Professional Services

Production Support Engineer/SRE

Toronto, Ontario, Canada · Onsite · Full Time · Posted 2 months ago

Role summary

We are seeking a Production Support Engineer with a Site Reliability Engineering (SRE) focus to maintain high availability of digital applications and backend systems. The role involves hands-on support, toil elimination, and infrastructure automation. Key responsibilities include patching Windows and Linux servers, handling SSL/TLS certificate updates, administering Elasticsearch, MongoDB, and Redis clusters, and developing automation with Ansible and Infrastructure-as-Code. The engineer will also troubleshoot production incidents, perform root cause analysis, and collaborate with development teams. A minimum of 5 years of experience in production support or SRE roles is required, including 3+ years with at least two of the core technologies (Elasticsearch, MongoDB, Ansible, OpenShift).

Production Support Engineer (SRE Focus)
Position Overview
We are seeking a skilled and experienced Production Support Engineer to support our digital applications. This role combines hands-on production support with Site Reliability Engineering (SRE) principles, focusing on toil elimination, infrastructure automation, and ensuring high availability of critical digital applications and backend systems.
Primary Responsibilities

  • Toil Removal & Infrastructure Maintenance (15%)
      • Execute SSL/TLS certificate updates and renewals across production environments
      • Perform Windows and Linux server patching and security updates
      • Manage NPID password updates and credential rotation protocols
      • Implement security vulnerability remediation in production systems
      • Identify, document, and eliminate repetitive manual operational tasks
  • Infrastructure & Database Cluster Management (20%)
      • Manage and support Elasticsearch cluster operations (deployment, scaling, monitoring, troubleshooting, performance tuning)
      • Administer MongoDB clusters (replication, sharding, backup, recovery, maintenance)
      • Operate and maintain Redis instances for caching and session management
      • Monitor cluster health and carry out capacity planning and optimization
      • Execute failover and disaster recovery procedures
      • Ensure data integrity and backup compliance
  • Automation & SRE Activities (15%)
      • Develop, maintain, and enhance Ansible playbooks for infrastructure automation
      • Build Infrastructure-as-Code (IaC) solutions to reduce manual intervention
      • Create and maintain runbooks and operational playbooks
      • Design monitoring, alerting, and observability solutions
      • Implement automated remediation for common operational issues
      • Identify and prioritize toil reduction opportunities
  • Production Application Support (50%)
      • Troubleshoot and resolve production incidents impacting digital applications
      • Collaborate with development and support teams for issue diagnosis
      • Participate in incident response, root cause analysis (RCA), and postmortems
      • Monitor and respond to application performance degradation

Technical Requirements
Must-Have Skills

  • Ansible – 2+ years (playbooks, roles, automation workflows)
  • Elasticsearch – 2+ years (cluster management & troubleshooting)
  • MongoDB – 2+ years (replica sets, sharding, backup/recovery, tuning)
  • Redis – Deployment, configuration, and operational support
  • OpenShift – Containerized application deployment & management
  • Microsoft Azure – Cloud services, resource management, deployments
  • Linux Administration – 3+ years (RHEL/CentOS/Ubuntu)
  • Windows Server Administration – Patching, certificates, maintenance
  • Shell Scripting – Bash scripting for automation
  • Incident Management – Handling critical production incidents

Preferred Skills

  • Kubernetes or container orchestration platforms
  • Python or Go scripting
  • CI/CD tools (Jenkins, GitLab CI, Azure DevOps)
  • Monitoring tools (Prometheus, Grafana, ELK Stack, Datadog)
  • Infrastructure as Code (Terraform, CloudFormation)
  • Security best practices and vulnerability management
  • Certifications (AZ-900, CKA, Elasticsearch, etc.)

Required Qualifications

  • 5+ years of experience in production support or SRE roles
  • 3+ years working with at least two core technologies (Elasticsearch, MongoDB, Ansible, OpenShift)
  • Experience in financial services or regulated environments (preferred)
  • Strong troubleshooting and analytical skills
  • Excellent documentation and communication abilities
  • Ability to work independently and collaboratively

Operational Expectations

  • On-Call Rotation: Participate in scheduled production support
  • Incident Response: Available for critical issues outside business hours
  • Availability: Flexible during high-priority production incidents
  • Response Time: Initial response within 30 minutes for critical incidents
  • Documentation: Maintain detailed runbooks and knowledge base articles
  • Collaboration: Work closely with infrastructure, development, and operations teams