Site Reliability Engineer
Role summary
The Site Reliability Engineer will ensure the reliability, availability, performance, and scalability of production systems by applying software engineering principles to infrastructure and operations. This role involves partnering with development teams to create resilient, observable, and automated platforms that meet defined service level objectives (SLOs). Key responsibilities include managing Linux/Unix systems, designing and operating distributed systems on cloud platforms like AWS or GCP, and utilizing containerization technologies such as Docker and Kubernetes. The engineer will also focus on monitoring, alerting, logging, and defining/managing SLIs and SLOs, with an emphasis on integrating security and compliance into operational workflows.
Job Title: Site Reliability Engineer
Location: Austin, TX - Hybrid, 2 days onsite
Interview: In-person
Job Overview:
The Site Reliability Engineer will be responsible for ensuring the reliability, availability, performance, and scalability of production systems by applying software engineering practices to infrastructure and operations. The engineer will partner with development teams to build resilient, observable, and automated platforms that meet defined service level objectives (SLOs).
Required Skills:
- 8+ years of experience in systems engineering, DevOps, or site reliability engineering roles.
- Strong experience with Linux/Unix systems and system internals.
- Proficiency in one or more programming/scripting languages (Python, Go, Java, Bash).
- Experience designing and operating highly available, distributed systems.
- Strong knowledge of cloud platforms (AWS or GCP) and cloud-native services.
- Experience with containerization and orchestration (Docker, Kubernetes).
- Strong understanding of monitoring, alerting, and logging concepts.
- Experience defining and managing SLIs, SLOs, and error budgets.
- Familiarity with incident management, root cause analysis (RCA), and postmortems.
- Experience integrating security and compliance into operational workflows.
Preferred Skills:
- Familiarity with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk).
- Experience operating 24x7 production environments with on-call rotations.
- Experience with chaos engineering and resiliency testing.
- Experience with feature flags, canary deployments, and progressive delivery.
- Strong documentation skills for runbooks, dashboards, and operational standards.