Site Reliability Engineer

Canada | Onsite | Full Time | Posted 2 months ago

### Role summary

We are seeking a Site Reliability Engineer to improve the reliability, resilience, and operational readiness of our services. You will collaborate with engineering teams to refine system design and operational practices, prevent incidents, lead response efforts, and drive improvements through post-mortems. Your core mission is to ensure our systems are reliable, scalable, and resilient. This involves implementing infrastructure improvements, managing incidents with technical expertise, automating manual processes, and responding to alerts as part of an on-call rotation. You will also define and maintain reliability targets such as SLIs, SLOs, and error budgets, and improve observability through metrics, logs, and tracing to reduce detection and resolution times.

### Who you are
- You have experience designing and operating scalable, reliable systems in AWS or a similar cloud environment
- You have handled on-call shifts for critical systems
- You are experienced with chaos engineering (e.g., Gremlin)
- You are able to dive in and debug live production systems
- You enjoy working in a growing system and writing and deploying code without any downtime
- You have experience with scripting and/or development (e.g., Linux shell, Python, JavaScript, Java)
- You are a self-starter who takes initiative in an ambiguous space, preferably within a start-up environment

### What the job involves
- As a Site Reliability Engineer, you’ll improve the reliability, resilience, and operational readiness of our services
- You’ll work closely with engineering teams to improve system design and operational excellence
- You’ll help prevent incidents, lead response efforts, and drive improvements through post-mortems
- Your mission: ensure our systems are reliable, scalable, and resilient
- Implement improvements to the reliability, fault tolerance, scalability, and performance of our infrastructure
- Manage incidents, using your technical know-how to involve the appropriate teams and automate away manual practices
- Support our critical services by responding to automated alerts through our on-call rotation
- Define and maintain SLIs, SLOs, SLAs, and error budgets to guide reliability decisions
- Improve observability across our systems (metrics, logs, tracing) to reduce time to detection and resolution
- Make production issues easier to detect, troubleshoot, and resolve
- Improve monitoring, alerting, dashboards, tracing, and runbooks for critical services
- Lead postmortems and follow-up actions to reduce repeat incidents

Ready to apply?

You'll be redirected to Newton.co's application page.
