Site Reliability Engineer (SRE)

Toronto, Ontario, Canada | Onsite | Full Time | Posted 2 months ago

Role summary

Katalyze AI is seeking a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and security of its AI-driven biotech platform. The role involves defining and maintaining SLOs and SLIs, and building and operating CI/CD pipelines, monitoring, alerting, and incident response systems. Responsibilities include designing and managing cloud infrastructure (AWS/GCP/Azure) using infrastructure-as-code (Terraform/Pulumi), implementing observability tooling, and partnering with engineering to embed reliability practices. The SRE will also lead incident response, support security and compliance requirements, and build automation to improve operational efficiency. Experience with Kubernetes, Docker, observability tools, and Python or Go for automation scripting is required.

About Katalyze AI

Katalyze AI is a fast-growing AI-driven biotech platform company on a mission to make life-saving drugs accessible and affordable for everyone. Our AI Agents help pharmaceutical and biotech companies increase production efficiency, reduce costs, and minimize waste. We're a team of humble, fast-moving, and curious craftspeople working at the intersection of science and AI.

About the Role

We're looking for a Site Reliability Engineer to ensure Katalyze AI's platform is reliable, scalable, and secure as we grow with enterprise customers. You'll build and maintain the infrastructure and practices that keep our systems running smoothly and help us move fast without breaking things.

What You'll Do

Define and maintain SLOs, SLIs, and error budgets for critical platform services
Build and operate CI/CD pipelines, monitoring, alerting, and incident response systems
Design and manage cloud infrastructure (AWS/GCP/Azure) using infrastructure-as-code (Terraform, Pulumi)
Implement observability tooling (logging, tracing, metrics) across the platform
Partner with engineering to embed reliability practices into the development lifecycle
Lead incident response and post-mortems; drive systemic improvements
Support security and compliance requirements for enterprise customer deployments
Build automation to reduce toil and improve operational efficiency

What We're Looking For

4+ years of SRE, DevOps, or platform engineering experience
Strong experience with Kubernetes, Docker, and container orchestration
Proficiency with cloud platforms (AWS preferred) and infrastructure-as-code
Experience with observability tools (Datadog, Grafana, Prometheus, or similar)
Understanding of security best practices and enterprise compliance requirements (awareness of SOC 2 and HIPAA)
Experience with Python or Go for automation scripting
Startup experience preferred; you're comfortable building from scratch
