Production Support Engineer/SRE

Toronto, Ontario, Canada | Onsite | Full Time | Posted 2 months ago

Role summary

We are seeking a Production Support Engineer with a Site Reliability Engineering (SRE) focus to maintain high availability of digital applications and backend systems. The role involves hands-on support, toil elimination, and infrastructure automation. Key responsibilities include patching Windows and Linux servers, handling SSL/TLS certificate updates, administering Elasticsearch and MongoDB clusters and Redis instances, and developing automation with Ansible and Infrastructure-as-Code. The engineer will also troubleshoot production incidents, perform root cause analysis, and collaborate with development teams. A minimum of 5 years of experience in production support or SRE roles is required, including 3+ years with at least two of the core technologies (Elasticsearch, MongoDB, Ansible, OpenShift).

Production Support Engineer (SRE Focus)
Position Overview
We are seeking a skilled and experienced Production Support Engineer to support our digital applications. This role combines hands-on production support with Site Reliability Engineering (SRE) principles, focusing on toil elimination, infrastructure automation, and ensuring high availability of critical digital applications and backend systems.
Primary Responsibilities

Toil Removal & Infrastructure Maintenance (15%)

Execute SSL/TLS certificate updates and renewals across production environments
Perform Windows and Linux server patching and security updates
Manage NPID password updates and credential rotation protocols
Implement security vulnerability remediation in production systems
Identify, document, and eliminate repetitive manual operational tasks

Infrastructure & Database Cluster Management (20%)

Manage and support Elasticsearch cluster operations (deployment, scaling, monitoring, troubleshooting, performance tuning)
Administer MongoDB clusters (replication, sharding, backup, recovery, maintenance)
Operate and maintain Redis instances for caching and session management
Monitor cluster health and carry out capacity planning and optimization
Execute failover and disaster recovery procedures
Ensure data integrity and backup compliance

Automation & SRE Activities (15%)

Develop, maintain, and enhance Ansible playbooks for infrastructure automation
Build Infrastructure-as-Code (IaC) solutions to reduce manual intervention
Create and maintain runbooks and operational playbooks
Design monitoring, alerting, and observability solutions
Implement automated remediation for common operational issues
Identify and prioritize toil reduction opportunities

Production Application Support (50%)

Troubleshoot and resolve production incidents impacting digital applications
Collaborate with development and support teams for issue diagnosis
Participate in incident response, root cause analysis (RCA), and postmortems
Monitor and respond to application performance degradation

Technical Requirements
Must-Have Skills

Ansible – 2+ years (playbooks, roles, automation workflows)
Elasticsearch – 2+ years (cluster management & troubleshooting)
MongoDB – 2+ years (replica sets, sharding, backup/recovery, tuning)
Redis – Deployment, configuration, and operational support
OpenShift – Containerized application deployment & management
Microsoft Azure – Cloud services, resource management, deployments
Linux Administration – 3+ years (RHEL/CentOS/Ubuntu)
Windows Server Administration – Patching, certificates, maintenance
Shell Scripting – Bash scripting for automation
Incident Management – Handling critical production incidents

Preferred Skills

Kubernetes or container orchestration platforms
Python or Go scripting
CI/CD tools (Jenkins, GitLab CI, Azure DevOps)
Monitoring tools (Prometheus, Grafana, ELK Stack, Datadog)
Infrastructure as Code (Terraform, CloudFormation)
Security best practices and vulnerability management
Certifications (AZ-900, CKA, Elasticsearch, etc.)

Required Qualifications

5+ years of experience in production support or SRE roles
3+ years working with at least two core technologies (Elasticsearch, MongoDB, Ansible, OpenShift)
Experience in financial services or regulated environments (preferred)
Strong troubleshooting and analytical skills
Excellent documentation and communication abilities
Ability to work independently and collaboratively

Operational Expectations

On-Call Rotation: Participate in scheduled production support
Incident Response: Available for critical issues outside business hours
Availability: Flexible during high-priority production incidents
Response Time: Initial response within 30 minutes for critical incidents
Documentation: Maintain detailed runbooks and knowledge base articles
Collaboration: Work closely with infrastructure, development, and operations teams

Applications are submitted through J&M Group's application page.