Senior Site Reliability Engineer
💡 What You’ll Do
You’ll operate at the intersection of software engineering and systems engineering, building resilient systems that scale, self-heal, and empower developers to ship safely.
🔎 Reliability Engineering
- Define and manage SLIs, SLOs, and error budgets
- Reduce MTTD, MTTA, and MTTR through structured incident response
- Conduct blameless postmortems and drive preventative improvements
- Champion reliability in architectural reviews and production readiness
📊 Observability & Monitoring
- Design actionable, symptom-based alerts (not noise)
- Build dashboards and tracing systems using tools like CloudWatch, Prometheus, Grafana, New Relic, X-Ray, ADOT
- Implement synthetic monitoring to simulate real user journeys (URLs, clickpaths, APIs)
- Ensure full observability coverage across critical paths
☁️ Cloud & Infrastructure
- Operate and optimize AWS environments (EC2, EKS/ECS, Lambda, VPC, RDS, IAM, S3, ALB/NLB, CloudTrail)
- Build resilient, multi-AZ and regionally replicated systems
- Implement autoscaling and fault-tolerant architecture
- Leverage Infrastructure as Code (Terraform, CDK, CloudFormation)
🤖 Automation & Toil Reduction
- Eliminate manual processes through automation
- Build self-healing infrastructure
- Improve CI/CD pipelines with safe deployment strategies (canary releases, feature flags)
- Write production-quality code (not just scripts) in Python, Go, Ruby, Bash, or Java
📈 Performance & Capacity Planning
- Analyze system metrics and traffic patterns
- Conduct load testing, chaos testing, and capacity modeling
- Identify bottlenecks and proactively optimize systems
🤝 Cross-Functional Collaboration
You’ll work closely with:
- Engineering & Platform teams on scalable system design
- Security teams on IAM, KMS, GuardDuty, secrets management
- Product leaders to align reliability with roadmap priorities
- Cloud vendors and SaaS providers during critical incidents
🧠 What You Bring
Must-Have Experience
- Bachelor’s degree in Computer Science, Software Engineering, or related field
- Strong Linux/Unix systems knowledge
- Deep AWS experience
- Hands-on container orchestration experience with Kubernetes (EKS), ECS, and Docker
- Infrastructure as Code (Terraform, CDK, CloudFormation)
- Production on-call and incident management experience
- Strong understanding of MTTx metrics (MTTD, MTTR, MTBF, etc.)
- Experience with MongoDB, PostgreSQL, Redis, RabbitMQ
- Experience with observability and monitoring platforms
- CI/CD pipeline experience (GitHub, Kubernetes, etc.)
Nice-to-Have
- Performance engineering and chaos testing
- Experience in fintech or regulated environments
- Knowledge of distributed storage systems (NFS, HDFS, Ceph, S3)
- Familiarity with dynamic resource frameworks (Kubernetes, Mesos, YARN)