Site Reliability Engineer W2

Mountain View, California, United States | Onsite | Contract | Posted 2 months ago

Role summary

We are seeking an experienced Site Reliability Engineer (SRE) to join our team. This role involves partnering with product teams to address critical infrastructure challenges, focusing on designing, building, and operating our cloud platform. You will be responsible for ensuring the reliability, performance, and security of our systems. Key responsibilities include automating infrastructure with Terraform and CI/CD pipelines, defining and monitoring SLIs/SLOs, enforcing security and compliance measures, implementing robust observability, leading incident response, optimizing cloud costs, and mentoring development teams on DevOps and reliability best practices. The ideal candidate has 5+ years of experience with production systems, deep AWS knowledge, and expertise in Kubernetes and Terraform.

Role Description

We’re seeking an experienced, highly collaborative SRE to partner with product teams and tackle our most critical infrastructure challenges. You’ll be hands-on in designing, building, and operating our cloud platform, and in driving the reliability, performance, and security that empower our engineering organization.

As a Site Reliability Engineer at DrumWave, you will:

● Infrastructure as Code & CI/CD: Automate provisioning and deployments with Terraform and integrate best-practice pipelines (GitHub Actions, ArgoCD, etc.).

● Reliability Engineering: Define SLIs/SLOs, manage error budgets, and build dashboards & alerts to proactively measure and improve system health.

● Security & Compliance: Enforce least-privilege IAM policies, automate vulnerability scans, and maintain audit logging for compliance.

● Monitoring & Observability: Instrument services with metrics, logs, and distributed tracing to enable rapid troubleshooting, and help teams with alerting, custom metrics, and dashboards.

● Incident Management: Own on-call rotations, lead real-time incident response, conduct post-mortems, and drive continuous improvements.

● Cost Optimization: Implement tagging strategies, right-size resources, and use concrete usage data to decide how best to control cloud spend at scale.

● Documentation & Mentorship: Author runbooks, standards, and best-practice guides—and coach dev teams on implementing modern DevOps, reliability, and security patterns.

Qualifications

● 5+ years of experience running production-critical systems

● Deep proficiency with the AWS Cloud and Cloud-Native best practices

● Experience with Kubernetes (EKS, GKE) and Container Orchestration at scale

● Skilled in Terraform to declaratively provision and maintain infrastructure services

● Working knowledge of managing and debugging databases like Redis and Postgres

● Strong familiarity with VPC, VPN, Load Balancing, and cloud networking components

● Proficiency with Git workflows, branching strategies, and CI/CD system integrations

● Solid understanding of web and network protocols and standards (HTTP, REST, TLS, DNS, etc.)

Nice to Haves

● Bachelor's degree or equivalent in Computer Science, Engineering, or a related field

● Experience with ArgoCD, GitHub Actions, Jenkins, or other CI/CD pipeline solutions

● Working knowledge of Python, Golang, and Helm templating

● Node.js experience a plus, including running scalable, resilient Node microservices

● Grasp of foundational security best practices for cloud infrastructure

● Awareness of Terragrunt, managing Terraform state, and optimal project structure

● Seasoned in production-readiness fundamentals within a fast-moving team
