Site Reliability Engineer (SRE)
Role summary
Katalyze AI is seeking a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and security of its AI-driven biotech platform. The role involves defining and maintaining SLOs and SLIs, and building and operating CI/CD pipelines, monitoring, alerting, and incident response systems. Responsibilities include designing and managing cloud infrastructure (AWS/GCP/Azure) using infrastructure-as-code (Terraform/Pulumi), implementing observability tooling, and partnering with engineering to embed reliability practices. The SRE will also lead incident response, support security and compliance requirements, and build automation to improve operational efficiency. Experience with Kubernetes, Docker, observability tools, and Python or Go for automation scripting is required.
About Katalyze AI
Katalyze AI is a fast-growing AI-driven biotech platform company on a mission to make life-saving drugs accessible and affordable for everyone. Our AI Agents help pharmaceutical and biotech companies increase production efficiency, reduce costs, and minimize waste. We're a team of humble, fast-moving, and curious craftspeople working at the intersection of science and AI.
About the Role
We're looking for a Site Reliability Engineer to ensure Katalyze AI's platform is reliable, scalable, and secure as we grow with enterprise customers. You'll build and maintain the infrastructure and practices that keep our systems running smoothly and help us move fast without breaking things.
What You'll Do
- Define and maintain SLOs, SLIs, and error budgets for critical platform services
- Build and operate CI/CD pipelines, monitoring, alerting, and incident response systems
- Design and manage cloud infrastructure (AWS/GCP/Azure) using infrastructure-as-code (Terraform, Pulumi)
- Implement observability tooling (logging, tracing, metrics) across the platform
- Partner with engineering to embed reliability practices into the development lifecycle
- Lead incident response and post-mortems; drive systemic improvements
- Support security and compliance requirements for enterprise customer deployments
- Build automation to reduce toil and improve operational efficiency
What We're Looking For
- 4+ years of SRE, DevOps, or platform engineering experience
- Strong experience with Kubernetes, Docker, and container orchestration
- Proficiency with cloud platforms (AWS preferred) and infrastructure-as-code
- Experience with observability tools (Datadog, Grafana, Prometheus, or similar)
- Understanding of security best practices and enterprise compliance requirements (SOC 2, HIPAA awareness)
- Experience with Python or Go for automation scripting
- Startup experience preferred — you're comfortable building from scratch