Sr. Site Reliability Engineer

Austin, Texas, United States | Hybrid | Full Time | Posted today

Role: Sr. Site Reliability Engineer (SRE) - Unified Observability & AIOps
Location: Austin, TX / Fort Mill, SC (Hybrid)
Job Type: Full Time
Role Summary
We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.
Key Responsibilities
Observability & Reliability Engineering

Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
Build actionable dashboards for operations, engineering, and leadership
Implement alerting strategies using static and dynamic thresholds

Proactive Detection & AIOps

Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
Transition monitoring from reactive alerts to proactive insights
Implement noise reduction, alert correlation, and root cause analysis
Apply baseline modeling, seasonality detection, and anomaly scoring

Distributed Systems & Dependency Analysis

Monitor and troubleshoot multi-service architectures involving microservices, downstream APIs, Kafka / streaming platforms, and cloud infrastructure (Terraform, IaC)
Identify whether issues originate from upstream/downstream dependencies, the streaming platform, infrastructure, or application code

Tooling & Platforms

Deep hands-on experience with Dynatrace (mandatory)
Experience with OpenTelemetry, Prometheus / Grafana, ELK / EFK, and cloud-native monitoring (AWS/Azure/GCP)
Strong JSON-based telemetry manipulation and enrichment

GenAI & LLM Enablement

Apply GenAI / LLMs for incident summarization, root cause explanation, runbook recommendations, and auto-remediation suggestions
Collaborate with platform teams to operationalize GenAI safely

Required Skills & Experience
✅ 15+ years in SRE / Production Engineering
✅ Strong Unified Observability background (not infra-only)
✅ Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
✅ SLI/SLO engineering experience in production systems
✅ Experience implementing dynamic thresholds and anomaly detection
✅ Knowledge of AI/ML concepts applied to Ops (AIOps)
✅ Distributed systems troubleshooting expertise
✅ Experience with Kafka or streaming data platforms
Differentiators (Highly Valued)

Experience in financial services or regulated environments
Proven reduction of alert noise and MTTR using AIOps
GenAI / LLM integration into operations workflows
