Senior Site Reliability / DevOps Engineer (Cloud & Platform Reliability)

Austin, Texas, United States | Hybrid | Contract | Senior | Posted 2 months ago | Visa sponsorship available

Role summary

We are seeking a Senior Site Reliability / DevOps Engineer for a contract role supporting enterprise production systems and cloud infrastructure. This hybrid position in Austin, TX, focuses on ensuring system reliability, scalability, and performance by applying software engineering principles to infrastructure and operations. The engineer will partner with development teams to build resilient, observable, and automated platforms aligned with SLOs. Key responsibilities include designing and maintaining distributed systems, managing cloud environments (AWS/GCP), automating infrastructure, supporting containerized environments (Docker/Kubernetes), implementing monitoring and alerting, and performing incident response. The role requires 8+ years of experience in SRE/DevOps, strong Linux/Unix skills, proficiency in a scripting language (Python, Go, Java, or Bash), and experience with cloud platforms and containerization.

Location:
Austin, TX (Hybrid – 2 days onsite, 3 days remote)

Duration:
May 2026 – August 2026 (Extension Possible)

Schedule:
Monday–Friday | 8:00 AM – 5:00 PM CT

Hours:
Up to 780 hours

Work Authorization:
U.S.-based candidates only

Local Candidates Only:
Must reside within 50 miles of Austin, TX

Overview

We are seeking a highly experienced Site Reliability / DevOps Engineer to support enterprise production systems and cloud infrastructure.

This role focuses on ensuring system reliability, scalability, and performance by applying software engineering principles to infrastructure and operations. The ideal candidate will partner with development teams to build resilient, observable, and automated platforms aligned with service level objectives (SLOs).

Key Responsibilities

Platform Reliability & Engineering

- Design, build, and maintain highly available, scalable distributed systems
- Ensure system reliability, performance, and uptime across production environments
- Define and manage SLIs, SLOs, and error budgets

Infrastructure & Cloud Operations

- Manage and optimize cloud environments (AWS or GCP)
- Implement infrastructure automation and configuration management
- Support containerized environments using Docker and Kubernetes

Monitoring, Observability & Incident Management

- Implement monitoring, logging, and alerting solutions
- Perform incident response, root cause analysis (RCA), and postmortems
- Develop and maintain dashboards, runbooks, and operational standards

DevOps & Automation

- Develop scripts and tools using languages such as Python, Go, Java, or Bash
- Enable CI/CD pipelines and improve deployment reliability
- Support progressive delivery practices (canary releases, feature flags)

Security & Compliance

- Integrate security best practices into operational workflows
- Ensure compliance and reliability standards are maintained across systems

Required Qualifications

- 8+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering
- Strong experience with Linux/Unix systems and system internals
- Proficiency in at least one programming/scripting language (Python, Go, Java, or Bash)
- Experience designing and operating distributed, highly available systems
- Hands-on experience with cloud platforms (AWS or GCP)
- Experience with Docker and Kubernetes
- Strong understanding of monitoring, logging, and alerting systems
- Experience with SLIs, SLOs, and error budgets
- Proven experience in incident management and root cause analysis

Preferred Qualifications

- Experience with observability tools such as Prometheus, Grafana, Datadog, Splunk, or Application Insights
- Experience supporting 24/7 production environments and on-call rotations
- Familiarity with chaos engineering and resiliency testing
- Experience with canary deployments and progressive delivery strategies
- Strong documentation skills (runbooks, dashboards, operational processes)

Additional Details

- Hybrid role with mandatory onsite days (Monday & Thursday)
- Occasional after-hours or weekend support may be required
- All travel or relocation expenses are the responsibility of the candidate
