SWE - Platform / Infrastructure Engineer

San Francisco, California, United States | Hybrid | Full Time | Posted 2 months ago

Role summary

Katalyze AI is seeking a Platform/Infrastructure Engineer to build and scale reliable, secure, multi-tenant cloud infrastructure for its AI-driven biotech platform. The role is critical to running production-grade services that meet strict SOC2/HIPAA compliance standards. Responsibilities include establishing an observability stack, implementing Infrastructure as Code with Terraform, meeting high uptime SLAs, managing security controls, optimizing AWS costs, and writing essential documentation. The ideal candidate has mastery of AWS; deep experience with Terraform, containerization, CI/CD, and security best practices; and real-world incident response experience.

About Katalyze AI

Katalyze AI is a fast-growing AI-driven biotech platform company on a mission to make life-saving drugs accessible and affordable for everyone. Our AI Agents help pharmaceutical and biotech companies increase production efficiency, reduce costs, and minimize waste. We're a team of humble, fast-moving, and curious craftspeople working at the intersection of science and AI.

About the Role

We are looking for a Platform / Infrastructure Engineer to build and scale a reliable, secure, and multi-tenant infrastructure for our AI-powered pharmaceutical platforms. You will be responsible for the "production-grade" backbone of our services, ensuring that our systems meet strict SOC2/HIPAA compliance standards while serving global pharmaceutical leaders. From implementing OpenTelemetry to automating zero-downtime deployments with Terraform, you will own the reliability and security of our entire cloud ecosystem.

What You’ll Do

- Observability & Monitoring: Establish a production-grade observability stack (OpenTelemetry, distributed tracing) to achieve a <5 min mean-time-to-detection for incidents.
- Infrastructure as Code (IaC): Implement automated provisioning and CI/CD pipelines (Terraform, GitHub Actions) for multi-tenant SaaS environments.
- Reliability Engineering: Own platform metrics, aiming for a 99.9% uptime SLA and <500ms p95 API latency.
- Security & Compliance: Implement and maintain SOC2/HIPAA controls, managing secrets (AWS Secrets Manager), IAM policies, and audit logging.
- Cost Optimization: Drive architectural improvements and right-sizing strategies to optimize AWS expenditure.
- Documentation: Create postmortems, runbooks, and architectural decision records (ADRs) to ensure team autonomy and operational excellence.

What We’re Looking For

Required:

- AWS Mastery: 5+ years managing production infrastructure (ECS/Fargate, RDS, S3, CloudFront, VPC networking).
- Terraform Expertise: Deep experience with IaC patterns for multi-environment deployments (Dev/Staging/Prod).
- Containerization: Battle-tested experience managing Docker/ECS with a focus on auto-scaling and health checks.
- Incident Response: Real-world experience in on-call rotations and resolving live production outages.
- CI/CD & Automation: Strong experience implementing pipelines for monorepo applications (Nx experience is a plus).
- Security Mindset: Practical knowledge of least-privilege IAM, network isolation, and secrets management.
- Overlap: Ability to work with at least 4-6 hours of overlap with US East Coast (EST/EDT) business hours.

Nice to Have:

- Snowflake administration (role management, query optimization).
- Python scripting for infra-automation.
- Experience with Kafka, Redis, or BullMQ queue infrastructure.
- Familiarity with dbt pipeline orchestration (Airflow/MWAA).

Tech Stack:

- Cloud: AWS (ECS, RDS, S3, Lambda, CloudFront).
- Infrastructure: Terraform, Docker, GitHub Actions.
- Data: Snowflake, PostgreSQL, Redis, BullMQ.
- Observability: OpenTelemetry, CloudWatch, Datadog.
- Frameworks: Nx Monorepo (Next.js, Fastify, Django).
