Senior Site Reliability Engineer

Hamilton, Ontario, Canada | Onsite | Full Time | Senior | Posted 1 month ago

💡 What You’ll Do

You’ll operate at the intersection of software engineering and systems engineering, building resilient systems that scale, self-heal, and empower developers to ship safely.

🔎 Reliability Engineering

- Define and manage SLIs, SLOs, and error budgets
- Reduce MTTD, MTTA, and MTTR through structured incident response
- Conduct blameless postmortems and drive preventative improvements
- Champion reliability in architectural reviews and production readiness

📊 Observability & Monitoring

- Design actionable, symptom-based alerts (not noise)
- Build dashboards and tracing systems using tools like CloudWatch, Prometheus, Grafana, New Relic, X-Ray, ADOT
- Implement synthetic monitoring to simulate real user journeys (URLs, clickpaths, APIs)
- Ensure full observability coverage across critical paths

☁️ Cloud & Infrastructure

- Operate and optimize AWS environments (EC2, EKS/ECS, Lambda, VPC, RDS, IAM, S3, ALB/NLB, CloudTrail)
- Build resilient, multi-AZ and regionally replicated systems
- Implement autoscaling and fault-tolerant architecture
- Leverage Infrastructure as Code (Terraform, CDK, CloudFormation)

🤖 Automation & Toil Reduction

- Eliminate manual processes through automation
- Build self-healing infrastructure
- Improve CI/CD pipelines with safe deployment strategies (canary releases, feature flags)
- Write production-quality code (not just scripts) in Python, Go, Ruby, Bash, or Java

📈 Performance & Capacity Planning

- Analyze system metrics and traffic patterns
- Conduct load testing, chaos testing, and capacity modeling
- Identify bottlenecks and proactively optimize systems

🤝 Cross-Functional Collaboration

You’ll work closely with:

- Engineering & Platform teams on scalable system design
- Security teams on IAM, KMS, GuardDuty, secrets management
- Product leaders to align reliability with roadmap priorities
- Cloud vendors and SaaS providers during critical incidents

🧠 What You Bring

Must-Have Experience

- Bachelor’s degree in Computer Science, Software Engineering, or related field
- Strong Linux/Unix systems knowledge
- Deep AWS experience
- Hands-on Kubernetes (EKS/ECS), Docker, and container orchestration
- Infrastructure as Code (Terraform, CDK, CloudFormation)
- Production on-call and incident management experience
- Strong understanding of MTTx metrics (MTTD, MTTR, MTBF, etc.)
- Experience with MongoDB, PostgreSQL, Redis, RabbitMQ
- Experience with observability and monitoring platforms
- CI/CD pipeline experience (GitHub, Kubernetes, etc.)

Nice-to-Have

- Performance engineering and chaos testing
- Experience in fintech or regulated environments
- Knowledge of distributed storage systems (NFS, HDFS, Ceph, S3)
- Familiarity with dynamic resource frameworks (Kubernetes, Mesos, YARN)

Ready to apply?

You'll be redirected to Devopie Inc.'s application page.
