Site Reliability Engineer

United States | Onsite | Contract | Posted 2 months ago | Visa sponsorship available

Role: Site Reliability Engineer (SRE) – Incident Automation

Skill set:

Job description:

1. Troubleshooting Guide (TSG) Authoring - Critical

Writing structured troubleshooting guides that cover symptoms, diagnostic steps, KQL queries, interpretation of expected results, and mitigation actions
Organizing TSGs into a logical folder hierarchy (by sub-service, monitor, failure category)
Creating a Root.md entry point that maps incident signals to the right TSG
Optionally creating table-of-contents (TOC) files to reduce token usage (~50% cost reduction)
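
To make the Root.md idea concrete, here is a minimal Python sketch of a signal-to-TSG map that can both render a Root.md table and look up the matching guide. The folder names, file names, and keyword matching are hypothetical illustrations, not taken from the employer's actual TSG layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TsgRoute:
    signal_keyword: str  # lowercase substring expected in the incident title (hypothetical)
    tsg_path: str        # path relative to the TSG root folder (hypothetical)


# Hypothetical routes following a sub-service/monitor/failure-category hierarchy.
ROUTES = [
    TsgRoute("high latency", "frontend/latency-monitor/slow-dependency.md"),
    TsgRoute("certificate", "gateway/cert-monitor/expired-certificate.md"),
]


def render_root_md(routes):
    """Render a Root.md body as a Markdown table of signal -> TSG links."""
    lines = ["# TSG Root", "", "| Incident signal | TSG |", "| --- | --- |"]
    for route in routes:
        lines.append(f"| {route.signal_keyword} | [{route.tsg_path}]({route.tsg_path}) |")
    return "\n".join(lines)


def find_tsg(incident_title, routes):
    """Return the first TSG path whose keyword appears in the title, else None."""
    title = incident_title.lower()
    for route in routes:
        if route.signal_keyword in title:
            return route.tsg_path
    return None
```

Keeping the routes in one list means the rendered Root.md and the lookup logic cannot drift apart; a real setup might match on monitor IDs rather than title keywords.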

2. Prompt Engineering - Critical

Writing clear system prompts that guide the AI agent's investigation and mitigation workflow
Defining tool usage patterns, decision logic, and structured output format
Crafting investigation flows that handle cross-service dependency chains
Iterating prompts based on agent output quality during testing

3. IcM Automation - Critical

Building automation workflows triggered by IcM incidents
Configuring incident routing, auto-triage, and escalation rules
Integrating DRI Agent into the IcM incident lifecycle (auto-invoke on incident creation)
Understanding severity levels, queue paths, and ownership models across sub-service teams
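
As a rough illustration of routing, auto-triage, and escalation rules, the sketch below models them as a pure Python function. It does not call any IcM API; the queue names, the sub-service map, the rule that a lower severity number is more urgent, and the paging threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    severity: int      # assumed convention: lower number = more severe
    sub_service: str
    title: str


@dataclass(frozen=True)
class Decision:
    queue: str           # queue path that should own the incident
    page_on_call: bool   # escalate by paging the on-call engineer
    invoke_agent: bool   # auto-invoke the investigation agent on creation


# Hypothetical ownership map from sub-service to queue path.
OWNING_QUEUES = {"gateway": "Gateway/OnCall", "storage": "Storage/OnCall"}
DEFAULT_QUEUE = "Triage"   # fallback when no owner is known
PAGE_AT_OR_BELOW = 2       # assumed paging threshold


def triage(incident):
    """Pick the owning queue and decide whether to page and invoke the agent."""
    queue = OWNING_QUEUES.get(incident.sub_service, DEFAULT_QUEUE)
    return Decision(
        queue=queue,
        page_on_call=incident.severity <= PAGE_AT_OR_BELOW,
        invoke_agent=True,
    )
```

Expressing the rules as data plus one small function keeps them easy to review and unit test before they are wired into a real incident pipeline.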

4. Monitor Creation - Required

5. KQL (Kusto Query Language) - Critical

Writing and validating KQL queries against service telemetry
Understanding Kusto cluster/database structure and table-to-service mapping
Time-based filtering, summarization, aggregation, and joins
Parameterized queries (e.g., placeholders for timestamps, cluster names, tenant IDs)
Interpreting query results to validate whether a mitigation was successful
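
The KQL skills listed above can be sketched as a parameterized query template filled in from Python. The table name `RequestLogs`, its columns, the 5-minute bin size, and the 1% failure threshold are hypothetical; a real TSG would use the service's own telemetry schema.

```python
from datetime import datetime, timedelta, timezone
from string import Template

# KQL with placeholders for cluster, database, tenant, and time window.
# Table and column names are assumptions for illustration only.
QUERY_TEMPLATE = Template("""\
cluster('$cluster').database('$database').RequestLogs
| where Timestamp between (datetime($start) .. datetime($end))
| where TenantId == '$tenant_id'
| summarize Total = count(), Failures = countif(ResultCode >= 500) by bin(Timestamp, 5m)
| extend FailureRate = todouble(Failures) / Total
| order by Timestamp asc
""")

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def render_query(cluster, database, tenant_id, end, lookback):
    """Fill the template; `end` must be a timezone-aware datetime."""
    for value in (cluster, database, tenant_id):
        if "'" in value:
            # Reject quotes so a parameter cannot break out of its string literal.
            raise ValueError("quote characters are not allowed in query parameters")
    end_utc = end.astimezone(timezone.utc)
    start_utc = end_utc - lookback
    return QUERY_TEMPLATE.substitute(
        cluster=cluster,
        database=database,
        tenant_id=tenant_id,
        start=start_utc.strftime(ISO_FORMAT),
        end=end_utc.strftime(ISO_FORMAT),
    )


def mitigation_succeeded(failure_rates_after, threshold=0.01):
    """True when every post-mitigation time bucket is at or below the threshold."""
    return bool(failure_rates_after) and all(r <= threshold for r in failure_rates_after)
```

For example, `render_query("mycluster", "mydb", "tenant-123", datetime.now(timezone.utc), timedelta(hours=1))` yields a query over the last hour; running it after a mitigation and passing the resulting `FailureRate` column to `mitigation_succeeded` gives a simple pass/fail check.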

Apply through the Envision Technology Solutions application page.
