Incident Manager SRE
Role summary
We are seeking an Incident Manager SRE for a hybrid role in Oakland, CA. This is a contract position focused on managing application-related issues in a cloud-hosted environment, not infrastructure support. The role requires 7-10 years of experience in incident management, with strong knowledge of cloud services (AWS/Azure/GCP) and site reliability engineering principles. Responsibilities include managing incident bridges, communicating with stakeholders, documenting incidents, leading DR activities, and performing data analytics on incident tickets. Experience with tools like ServiceNow, PagerDuty, and JIRA is essential. The ideal candidate will have a degree in computer science or a related field and possess excellent problem-solving and communication skills.
Hi,
We are looking for an Incident Manager SRE in Oakland, CA (Hybrid).
Role: Incident Manager SRE
Location: Oakland, CA (Hybrid) - Only locals or nearby candidates
Interview Mode: Video
Employment Type: Contract
Notice Period: Looking for immediate joiners
Note: This is NOT an infrastructure support role. It is a semi-technical role that supports an environment hosted 100% in the cloud and drives resolution of application-related issues.
JD:
Responsibilities
- Manage incident management bridge calls with support teams, on-call application support teams, and management. Manage, escalate, report status on, and assist in coordinating repair efforts for all major incidents (P1 – P4).
- Provide regular communication updates to the customer, end users, and other stakeholders throughout the entire incident management cycle
- Track and document incident updates in real time
- Because major incidents are highly escalated cases, handle them with presence of mind and innovation.
- Support the development and execution of change management plans to drive adoption and utilization of new processes, systems, and technologies.
- Review changes, their priority, and their urgency, and perform risk analysis.
- Create problem tickets and their respective action items, and review root cause analyses and their closure.
- Prepare PIR and postmortem reports.
- Lead site reliability, disaster recovery, game day, switchover, and failover activities.
- Handle multiple monitoring and collaboration tools such as ServiceNow, PagerDuty, Slack, Zoom, and JIRA.
- Perform quality audits and data analytics on incident tickets to ensure quality and uncover new trends.
- Meet the agreed SLAs and other KPIs, and produce process performance reports
- Provide documentation for the Known Error Database (KEDB) or a similar repository
- Develop processes and procedures that ensure incident management action items are tracked and completed
- Ensure process adherence and meet quality norms
- Provide management reporting on incident metrics and incident management performance
Qualifications/Skills Required
- Degree in Computer Science, Information Technology, or a related field.
- 7-10 years of experience in incident management or a related field.
- Knowledge of cloud services (AWS/Azure/GCP) is a must.
- Advanced proficiency in site reliability culture and principles, with a demonstrated ability to implement site reliability across platform teams while avoiding common pitfalls.
- Ability to plan and conduct site reliability testing
- Experience in Application Management Services (AMS).
- Knowledge of incident management/change management/problem management processes and procedures.
- Experience with and knowledge of change management principles, methodologies, and tools.
- Excellent problem-solving and analytical skills.
- Excellent verbal and written communication and interpersonal skills.
- Ability to work independently and as part of a team.
- Ability to manage multiple tasks simultaneously.