Site Reliability Engineering (SRE) Lead

New York, New York, United States · Hybrid · Contract · Posted 2 months ago · Visa sponsorship available

Role summary

A hybrid, contract-to-hire SRE Lead role is available to define and scale observability and reliability capabilities in an enterprise environment. This hands-on leadership position involves shaping observability strategy, building scalable monitoring solutions, and driving adoption across infrastructure, application, and platform teams. The role offers high visibility and influence, with strong investment in cloud, platform engineering, and modernization initiatives. Responsibilities include designing and implementing observability solutions, defining standards, building telemetry pipelines, and partnering with various teams to embed observability into system design. Key qualifications include experience in SRE, observability tools, cloud platforms, and infrastructure-as-code.

SRE Lead

Hybrid (2-3 days onsite), contract-to-hire role

Our direct client is seeking an SRE Lead to help define and scale observability and reliability capabilities across an enterprise environment.

This is a hands-on leadership role where you will shape observability strategy, build scalable monitoring solutions, and drive adoption across infrastructure, application, and platform teams.

Some key highlights:

Opportunity to drive a critical reliability and observability function across the organization
High-visibility role with influence across engineering and architecture teams
Strong investment in cloud, platform engineering, and modernization initiatives
Competitive compensation and benefits

What You’ll Do

Lead the design and implementation of observability and monitoring solutions across cloud, on-prem, and hybrid environments
Define and drive standards and best practices for reliability, monitoring, and telemetry
Build and scale telemetry pipelines including metrics, logs, and traces
Implement modern observability frameworks
Partner with infrastructure, application, security, and data teams to embed observability into system design
Establish governance around telemetry lifecycle including data retention, granularity, and cost optimization
Evaluate, implement, and optimize tools such as Prometheus, Grafana, ELK, and Azure Monitor
Act as a technical leader, influencing architecture decisions and driving adoption across engineering teams

Qualifications

Experience in SRE, observability, infrastructure engineering, and/or DevOps environments
Strong hands-on experience with observability and monitoring tools such as Prometheus, Grafana, ELK stack, and Azure Monitor
Experience with OpenTelemetry, eBPF, and modern telemetry standards
Proven experience building or improving observability platforms in enterprise environments
Strong understanding of cloud platforms (Azure preferred), networking, and distributed systems
Experience with infrastructure-as-code tools such as Terraform or Ansible, and with CI/CD pipelines
Exposure to Kubernetes or other containerized environments
Strong architectural and problem-solving skills with the ability to design scalable solutions
Excellent communication skills and ability to work across multiple teams and stakeholders

Nice to Have

Experience with tools such as SolarWinds, OpsRamp, or ExtraHop
Experience in large-scale or regulated environments
Prior experience leading initiatives or mentoring engineers

Applications are handled through Gotham Technology Group's application page.