Envision Technology Solutions
Information Technology & Services
Site Reliability Engineer
United States · Onsite · Contract · Posted 2 months ago · Visa sponsorship available
Role: Site Reliability Engineer (SRE) – Incident Automation
Skill set:
- Troubleshooting Guide (TSG) Authoring
- Prompt Engineering
- IcM Automation
- Monitor Creation
- KQL (Kusto Query Language)
JD:
1. Troubleshooting Guide (TSG) Authoring - Critical
- Writing structured troubleshooting guides with: symptoms, diagnostic steps, KQL queries, expected results interpretation, and mitigation actions
- Organizing TSGs into a logical folder hierarchy (by sub-service, monitor, failure category)
- Creating a Root.md entry point that maps incident signals to the right TSG
- Optionally creating TOC files for token optimization (~50% cost reduction)
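To illustrate the folder hierarchy and the Root.md entry point described above, here is a minimal Python sketch. The sub-service, monitor, and failure-category names are hypothetical, and the table layout of Root.md is an assumption, not a format prescribed by this role.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TsgEntry:
    sub_service: str       # hypothetical sub-service name
    monitor: str           # hypothetical monitor/alert name
    failure_category: str  # hypothetical failure category

    @property
    def path(self) -> str:
        # Folder hierarchy: sub-service / monitor / failure category
        return f"{self.sub_service}/{self.monitor}/{self.failure_category}.md"


# Illustrative entries only; real TSGs would come from the team's repository.
ENTRIES = [
    TsgEntry("gateway", "HighLatency", "dependency-timeout"),
    TsgEntry("gateway", "Http5xxSpike", "bad-deployment"),
    TsgEntry("storage", "ReplicationLag", "capacity"),
]


def render_root_md(entries) -> str:
    """Render a Root.md table that maps incident signals to TSG files."""
    lines = ["# Root", "", "| Sub-service | Monitor | TSG |", "| --- | --- | --- |"]
    for e in sorted(entries, key=lambda e: e.path):
        lines.append(
            f"| {e.sub_service} | {e.monitor} | [{e.failure_category}]({e.path}) |"
        )
    return "\n".join(lines)


def find_tsg(entries, sub_service: str, monitor: str) -> Optional[str]:
    """Return the TSG path for an incident signal, or None if unmapped."""
    for e in entries:
        if (e.sub_service, e.monitor) == (sub_service, monitor):
            return e.path
    return None
```

Calling `find_tsg(ENTRIES, "storage", "ReplicationLag")` returns the path of the matching guide, and an unmapped signal returns `None` so it can be flagged for a new TSG.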
2. Prompt Engineering - Critical
- Writing clear system prompts that guide the AI agent's investigation and mitigation workflow
- Defining tool usage patterns, decision logic, and structured output format
- Crafting investigation flows that handle cross-service dependency chains
- Iterating prompts based on agent output quality during testing
3. IcM Automation - Critical
- Building automation workflows triggered by IcM incidents
- Configuring incident routing, auto-triage, and escalation rules
- Integrating DRI Agent into the IcM incident lifecycle (auto-invoke on incident creation)
- Understanding severity levels, queue paths, and ownership models across sub-service teams
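The routing, auto-triage, and escalation logic listed above could look roughly like the following Python sketch. It does not use any IcM API: the `Incident` and `Decision` types, the queue names, and the convention that a lower severity number means a more severe incident are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    severity: int     # assumed convention: lower number = more severe
    sub_service: str  # hypothetical owning sub-service
    title: str


@dataclass(frozen=True)
class Decision:
    queue: str          # queue path the incident is routed to
    invoke_agent: bool  # whether to auto-invoke the agent on creation
    escalate: bool      # whether to page beyond the owning team


# Hypothetical ownership model: sub-service -> queue path.
OWNERS = {
    "storage": "Storage/OnCall",
    "gateway": "Gateway/OnCall",
}
DEFAULT_QUEUE = "Triage/Unassigned"


def triage(incident: Incident, escalate_at_or_below: int = 2) -> Decision:
    """Route an incident to its owning queue and decide on escalation."""
    queue = OWNERS.get(incident.sub_service, DEFAULT_QUEUE)
    escalate = incident.severity <= escalate_at_or_below
    # In this sketch the agent is invoked for every newly created incident.
    return Decision(queue=queue, invoke_agent=True, escalate=escalate)
```

Keeping the ownership map and the escalation threshold as plain data makes the rules easy to review and change per sub-service team.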
4. Monitor Creation - Required
- Creating and tuning monitors/alerts that detect service health issues
- Mapping monitors to TSGs so the agent knows which guide to follow per alert
- Understanding Geneva/MDM metrics for health signal definition
5. KQL (Kusto Query Language) - Critical
- Writing and validating KQL queries against service telemetry
- Understanding Kusto cluster/database structure and table-to-service mapping
- Time-based filtering, summarization, aggregation, and joins
- Parameterized queries (e.g., placeholders for timestamps, cluster names, tenant IDs)
- Interpreting query results to validate whether a mitigation was successful
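As a sketch of the parameterized-query skill above, the following Python snippet fills timestamp and tenant ID placeholders into a KQL template. The table `ServiceLogs` and its columns (`Timestamp`, `TenantId`, `Level`) are hypothetical, and the tenant ID pattern is an assumption; a real query would target the service's own Kusto tables.

```python
import re
from datetime import datetime, timezone
from string import Template

# KQL template with placeholders; the table and column names are hypothetical.
QUERY = Template("""\
ServiceLogs
| where Timestamp between (datetime($start) .. datetime($end))
| where TenantId == "$tenant_id"
| summarize ErrorCount = countif(Level == "Error") by bin(Timestamp, 5m)
| order by Timestamp asc
""")

# Assumed tenant ID shape (hex digits and hyphens); rejects quotes and pipes.
_TENANT_RE = re.compile(r"^[0-9a-fA-F-]{1,64}$")


def _kql_time(ts: datetime) -> str:
    """Format a timezone-aware datetime as a UTC literal for datetime()."""
    if ts.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_query(start: datetime, end: datetime, tenant_id: str) -> str:
    """Fill the placeholders after validating each parameter."""
    start_s, end_s = _kql_time(start), _kql_time(end)
    if end <= start:
        raise ValueError("end must be after start")
    if not _TENANT_RE.match(tenant_id):
        raise ValueError("tenant_id contains unexpected characters")
    return QUERY.substitute(start=start_s, end=end_s, tenant_id=tenant_id)
```

Running the same query for a window before and a window after a mitigation, then comparing `ErrorCount` per bin, is one way to check whether the mitigation worked.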