Sr. Site Reliability Engineer
Role: Sr. Site Reliability Engineer (SRE) – Unified Observability & AIOps
Location: Austin, TX / Fort Mill, SC (Hybrid)
Job Type: Full Time
Role Summary
We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.
Key Responsibilities
Observability & Reliability Engineering
- Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
- Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
- Build actionable dashboards for operations, engineering, and leadership
- Implement alerting strategies using static and dynamic thresholds
Proactive Detection & AIOps
- Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
- Transition monitoring from reactive alerts to proactive insights
- Implement noise reduction, alert correlation, and root cause analysis
- Apply baseline modeling, seasonality detection, and anomaly scoring
Distributed Systems & Dependency Analysis
- Monitor and troubleshoot multi-service architectures involving:
  - Microservices
  - Downstream APIs
  - Kafka / streaming platforms
  - Cloud infrastructure (Terraform, IaC)
- Identify whether issues originate from:
  - Upstream/downstream dependencies
  - Streaming platform
  - Infrastructure
  - Application code
Tooling & Platforms
- Deep hands-on experience with Dynatrace (mandatory)
- Experience with:
  - OpenTelemetry
  - Prometheus / Grafana
  - ELK / EFK
  - Cloud-native monitoring (AWS/Azure/GCP)
- Strong skills in JSON-based telemetry manipulation and enrichment
GenAI & LLM Enablement
- Apply GenAI / LLMs for:
  - Incident summarization
  - Root cause explanation
  - Runbook recommendations
  - Auto-remediation suggestions
- Collaborate with platform teams to operationalize GenAI safely
Required Skills & Experience
✅ 15+ years in SRE / Production Engineering
✅ Strong Unified Observability background (not infra-only)
✅ Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
✅ SLI/SLO engineering experience in production systems
✅ Experience implementing dynamic thresholds and anomaly detection
✅ Knowledge of AI/ML concepts applied to Ops (AIOps)
✅ Distributed systems troubleshooting expertise
✅ Experience with Kafka or streaming data platforms
Differentiators (Highly Valued)
- Experience in financial services or regulated environments
- Proven reduction of alert noise and MTTR using AIOps
- GenAI / LLM integration into operations workflows