
Site Reliability Engineering (SRE) Lead
Role summary
A hybrid, contract-to-hire SRE Lead role is available to define and scale observability and reliability capabilities in an enterprise environment. This hands-on leadership position involves shaping observability strategy, building scalable monitoring solutions, and driving adoption across infrastructure, application, and platform teams. The role offers high visibility and influence in an organization that is investing heavily in cloud, platform engineering, and modernization initiatives. Responsibilities include designing and implementing observability solutions, defining standards, building telemetry pipelines, and partnering with infrastructure, application, security, and data teams to embed observability into system design. Key qualifications include experience with SRE practices, observability tools, cloud platforms, and infrastructure-as-code.
Hybrid – 2-3 days onsite, contract-to-hire role
Our direct client is seeking an SRE Lead to help define and scale observability and reliability capabilities across an enterprise environment.
This is a hands-on leadership role where you will shape observability strategy, build scalable monitoring solutions, and drive adoption across infrastructure, application, and platform teams.
Some key highlights:
- Opportunity to drive a critical reliability and observability function across the organization
- High-visibility role with influence across engineering and architecture teams
- Strong investment in cloud, platform engineering, and modernization initiatives
- Competitive compensation and benefits
What You’ll Do
- Lead the design and implementation of observability and monitoring solutions across cloud, on-prem, and hybrid environments
- Define and drive standards and best practices for reliability, monitoring, and telemetry
- Build and scale telemetry pipelines for metrics, logs, and traces
- Implement modern observability frameworks
- Partner with infrastructure, application, security, and data teams to embed observability into system design
- Establish governance around the telemetry lifecycle, including data retention, granularity, and cost optimization
- Evaluate, implement, and optimize tools such as Prometheus, Grafana, ELK, and Azure Monitor
- Act as a technical leader, influencing architecture decisions and driving adoption across engineering teams
Qualifications
- Experience in SRE, observability, infrastructure engineering, and/or DevOps environments
- Strong hands-on experience with observability and monitoring tools such as Prometheus, Grafana, ELK stack, and Azure Monitor
- Experience with OpenTelemetry, eBPF, and modern telemetry standards
- Proven experience building or improving observability platforms in enterprise environments
- Strong understanding of cloud platforms (Azure preferred), networking, and distributed systems
- Experience with infrastructure-as-code tools such as Terraform or Ansible, and with CI/CD pipelines
- Exposure to Kubernetes or other containerized environments
- Strong architectural and problem-solving skills with the ability to design scalable solutions
- Excellent communication skills and the ability to work across multiple teams and stakeholders
Nice to Have
- Experience with tools such as SolarWinds, OpsRamp, or ExtraHop
- Experience in large-scale or regulated environments
- Prior experience leading initiatives or mentoring engineers