Production Support Engineer/SRE - 397

Chicago, Illinois, United States | Hybrid | Contract | Posted 2 months ago


Role summary

This role is for a Production Support Engineer/SRE responsible for day-to-day incident management, operational readiness, and continuous improvements. The candidate will work closely with development and product teams, focusing on applications running on AWS and cloud APIs. Key responsibilities include troubleshooting complex issues related to telematics data, portals, services, and APIs, managing P1/P2 incidents, and ensuring platform reliability through monitoring, alerting, and best practices. The position requires a strong understanding of AWS services like Kinesis, Fargate, and Lambda, CI/CD implementation using Azure DevOps, and experience with ServiceNow for incident management. This is a hybrid role requiring 2-4 years of experience in supporting production-grade, customer-facing platforms.

This role is also open in Peoria, IL and Irving, TX.

Role Overview:

This position is a production support / SRE role. Beyond day-to-day incident management and operational readiness, it involves a lot of heavy lifting with development and product teams, driving continuous improvements, and participating in the on-call rotation. Much of the platform and its applications run on AWS and cloud APIs, and the work spans multiple services and applications.

The candidate must be able to investigate problems such as telematics data that is not updating or is incorrect, and main portals, services, or portal components that are not working. Using AWS logs, Kibana, and other services, the candidate must be able to troubleshoot and identify the underlying issue, because these will rarely be straightforward scenarios. Issues may also include degraded APIs and services that stop working.

Top Skills:

• Cloud platform (AWS) and APIs

• Incident management

• Running P1/P2 bridges with technical resources, leaders, and executives

• Conducting technical investigations

• Working day-to-day tickets

• Heavy lifting with different teams and organizations

Education and Experience

• A degree is not required but is nice to have (top candidates will have one).

• 2-4 years of experience is a HARD requirement.

Typical task breakdown:

Own incident tickets through the full lifecycle, from initial triage to resolution and closure.

Collaborate with engineering, platform, product, and operations teams to diagnose issues and coordinate fixes.

Communicate incident status, impact, and resolution progress to stakeholders.

Lead or contribute to root cause analysis and ensure follow-up actions are identified and tracked.

Ensure platform reliability through monitoring, alerting, security, and operational best practices.

Respond to and manage production incidents impacting AWS services and APIs.

Drive reliability, stability, and operational readiness improvements across cloud platforms.

Understand end‑to‑end technical and business flows to support production services effectively.

Develop, maintain, and improve clear, actionable runbooks for operational support.

Lead knowledge transfer sessions to ensure support teams are ready to take on production support.

Interaction with the team:

Work cross-functionally with different organizations and groups (VisionLink, product, access management, etc.).

The team has 11 members.

Required Technical Skills

(Required)

Experience supporting production grade, customer facing platforms in complex, multi‑team environments.

A demonstrated ownership mindset, taking accountability for service stability, incident outcomes, and follow through beyond initial investigation.

Strong understanding of AWS Kinesis streaming and messaging services, containerized and serverless compute using Fargate and Lambda, and CI/CD pipeline implementation using Azure DevOps.

Experience using ServiceNow for incident management and Azure DevOps for tracking features, user stories, etc.

Proven ability to partner effectively with engineering, product, and platform teams to resolve issues and improve operational efficiency.

Experience driving root cause analysis and continuous improvement, turning incidents into long-term reliability gains.

Strong understanding of operational readiness standards, including monitoring, alerting, and runbooks.

Comfort operating in on-call or escalation roles, maintaining composure and clear communication during high impact incidents.

Ability to identify gaps in processes or tooling and proactively improve support models, documentation, or workflows.

Experience working within enterprise ITSM frameworks.

Soft Skills

(Required)

Strong communication skills, with the ability to translate technical issues into clear status and impact updates for stakeholders.

Applications are submitted through DSM-H Consulting's application page.