Envision Technology Solutions Verified
Information Technology & Services

Site Reliability Engineer

United States · Onsite · Contract · Posted 2 months ago · Visa sponsorship available

Role: Site Reliability Engineer (SRE) – Incident Automation

Skill set:

  • Troubleshooting Guide (TSG) Authoring
  • Prompt Engineering
  • IcM Automation
  • Monitor Creation
  • KQL (Kusto Query Language)

Job description:

1. Troubleshooting Guide (TSG) Authoring - Critical

  • Writing structured troubleshooting guides with: symptoms, diagnostic steps, KQL queries, expected results interpretation, and mitigation actions
  • Organizing TSGs into a logical folder hierarchy (by sub-service, monitor, failure category)
  • Creating a Root.md entry point that maps incident signals to the right TSG
  • Optionally creating table-of-contents (TOC) files to reduce token usage (approximately 50% cost reduction)
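
As an illustration of the TSG structure listed above, here is a minimal Python sketch that renders a guide skeleton as Markdown. The `Tsg` class, its field names, and the heading layout are assumptions made for this example; the posting does not prescribe a template.

```python
from dataclasses import dataclass, field


@dataclass
class Tsg:
    """Hypothetical container for one troubleshooting guide.

    The fields mirror the sections named in the job description.
    """

    title: str
    symptoms: list[str] = field(default_factory=list)
    diagnostic_steps: list[str] = field(default_factory=list)
    kql_queries: list[str] = field(default_factory=list)
    expected_results: list[str] = field(default_factory=list)
    mitigation_actions: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render every section, even empty ones, so authors see what is missing."""
        sections = [
            ("Symptoms", self.symptoms),
            ("Diagnostic steps", self.diagnostic_steps),
            ("KQL queries", self.kql_queries),
            ("Expected results", self.expected_results),
            ("Mitigation actions", self.mitigation_actions),
        ]
        lines = [f"# {self.title}"]
        for heading, items in sections:
            lines.append("")
            lines.append(f"## {heading}")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    guide = Tsg(title="High latency", symptoms=["p99 latency above threshold"])
    print(guide.to_markdown())
```

Rendering empty sections is a deliberate choice in this sketch: a guide with a blank "Mitigation actions" heading is easy to spot during review.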

2. Prompt Engineering - Critical

  • Writing clear system prompts that guide the AI agent's investigation and mitigation workflow
  • Defining tool usage patterns, decision logic, and structured output format
  • Crafting investigation flows that handle cross-service dependency chains
  • Iterating prompts based on agent output quality during testing

3. IcM Automation - Critical

  • Building automation workflows triggered by IcM incidents
  • Configuring incident routing, auto-triage, and escalation rules
  • Integrating DRI Agent into the IcM incident lifecycle (auto-invoke on incident creation)
  • Understanding severity levels, queue paths, and ownership models across sub-service teams

4. Monitor Creation - Required

  • Creating and tuning monitors/alerts that detect service health issues
  • Mapping monitors to TSGs so the agent knows which guide to follow per alert
  • Understanding Geneva/MDM metrics for health signal definition
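
The monitor-to-TSG mapping mentioned above could be as simple as a lookup table with a fallback to the Root.md entry point. The following Python sketch assumes hypothetical monitor IDs and a `tsgs/<sub-service>/<monitor>/<failure-category>.md` folder layout; none of these names come from the posting.

```python
# Hypothetical monitor IDs and TSG paths, for illustration only.
MONITOR_TO_TSG = {
    "api-latency-p99": "tsgs/gateway/api-latency-p99/high-latency.md",
    "queue-depth": "tsgs/ingestion/queue-depth/backlog.md",
}

# Unmapped monitors fall back to the Root.md entry point.
FALLBACK_TSG = "tsgs/Root.md"


def tsg_for_monitor(monitor_id: str) -> str:
    """Return the TSG path for a monitor, or the root entry point if unmapped."""
    return MONITOR_TO_TSG.get(monitor_id.strip().lower(), FALLBACK_TSG)


if __name__ == "__main__":
    print(tsg_for_monitor("API-Latency-P99"))
    print(tsg_for_monitor("unknown-monitor"))
```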

5. KQL (Kusto Query Language) - Critical

  • Writing and validating KQL queries against service telemetry
  • Understanding Kusto cluster/database structure and table-to-service mapping
  • Time-based filtering, summarization, aggregation, and joins
  • Parameterized queries (e.g., placeholders for timestamps, cluster names, tenant IDs)
  • Interpreting query results to validate whether a mitigation was successful
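
To make the idea of parameterized queries concrete, here is a small Python sketch that fills timestamp and tenant ID placeholders in a KQL template. The table `ServiceLogs` and the columns `Timestamp`, `TenantId`, and `Level` are hypothetical, and the `$name` placeholder syntax belongs to Python's `string.Template`, not to KQL; the same pattern would extend to cluster names.

```python
import re
from datetime import datetime, timezone
from string import Template

# Hypothetical table and column names; only the placeholder pattern matters here.
KQL_TEMPLATE = Template(
    """\
ServiceLogs
| where Timestamp between (datetime($start) .. datetime($end))
| where TenantId == "$tenant_id"
| where Level == "Error"
| summarize ErrorCount = count() by bin(Timestamp, 5m)
| order by Timestamp asc
"""
)

# Accept only letters, digits, and hyphens so a value cannot break out of the string literal.
_SAFE_ID = re.compile(r"[A-Za-z0-9-]+")


def render_query(start: datetime, end: datetime, tenant_id: str) -> str:
    """Fill the placeholders after validating each value."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    if start >= end:
        raise ValueError("start must be earlier than end")
    if not _SAFE_ID.fullmatch(tenant_id):
        raise ValueError("tenant_id contains unexpected characters")
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return KQL_TEMPLATE.substitute(
        start=start.astimezone(timezone.utc).strftime(fmt),
        end=end.astimezone(timezone.utc).strftime(fmt),
        tenant_id=tenant_id,
    )


if __name__ == "__main__":
    print(
        render_query(
            datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
            datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc),
            "tenant-123",
        )
    )
```

Running the same query over a window before and after a mitigation, then comparing the `ErrorCount` values, is one way to check whether the mitigation worked.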