Site Reliability Engineer

Austin, Texas, United States | Hybrid | Contract | Posted 2 months ago | Visa sponsorship available

Role summary

The Site Reliability Engineer will ensure the reliability, availability, performance, and scalability of production systems by applying software engineering principles to infrastructure and operations. This role involves partnering with development teams to create resilient, observable, and automated platforms that meet defined service level objectives (SLOs). Key responsibilities include managing Linux/Unix systems, designing and operating distributed systems on cloud platforms such as AWS or GCP, and using containerization and orchestration technologies such as Docker and Kubernetes. The engineer will also handle monitoring, alerting, and logging, define and manage SLIs and SLOs, and integrate security and compliance into operational workflows.

Job Title: Site Reliability Engineer

Location: Austin, TX - Hybrid, 2 days onsite

Interview: In person

Job Overview:

The Site Reliability Engineer will be responsible for ensuring the reliability, availability, performance, and scalability of production systems by applying software engineering practices to infrastructure and operations. The engineer will partner with development teams to build resilient, observable, and automated platforms that meet defined service level objectives (SLOs).

Required Skills:

8+ years of experience in systems engineering, DevOps, or site reliability engineering roles.
Strong experience with Linux/Unix systems and system internals.
Proficiency in one or more programming/scripting languages (Python, Go, Java, Bash).
Experience designing and operating highly available, distributed systems.
Strong knowledge of cloud platforms (AWS or GCP) and cloud-native services.
Experience with containerization and orchestration (Docker, Kubernetes).
Strong understanding of monitoring, alerting, and logging concepts.
Experience defining and managing SLIs, SLOs, and error budgets.
Familiarity with incident management, root cause analysis (RCA), and postmortems.
Experience integrating security and compliance into operational workflows.

Preferred Skills:

Familiarity with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk).
Experience operating 24x7 production environments with on-call rotations.
Experience with chaos engineering and resiliency testing.
Experience with feature flags, canary deployments, and progressive delivery.
Strong documentation skills for runbooks, dashboards, and operational standards.

Ready to apply?

You'll be redirected to Kaav Inc's application page.
