Site Reliability Engineering (SRE) Lead
Role summary
We are seeking an experienced Site Reliability Engineering (SRE) Lead / Architect to design, build, and evolve highly available, scalable, and secure payment platforms. This role requires deep expertise in AWS cloud, enterprise middleware (IBM WebSphere, IBM MQ), modern application stacks, observability, and DevOps, with a strong understanding of the Payments domain. You will define SRE strategy, reliability architecture, and operational excellence, collaborating with application, infrastructure, security, and business teams to ensure mission-critical payment services meet their high-throughput, low-latency, and fault-tolerance SLAs. Responsibilities include architecting resilient systems, defining SRE principles, leading cloud-native and hybrid system architecture, and driving HA/DR strategies.
🚀 We’re Hiring: Site Reliability Engineering (SRE) Lead / Architect (Phoenix, AZ)
📩 Dive into the details below, and if it's a match, send your resume to sakshi.khade@nablainfotech.com
We are seeking an experienced Site Reliability Engineering (SRE) Lead / Architect to design, build, and evolve highly available, scalable, and secure payment platforms. The role requires strong expertise across AWS cloud, enterprise middleware (IBM WebSphere, IBM MQ), modern application stacks, observability, and DevOps, with deep understanding of Payments domain systems.
You will define SRE strategy, reliability architecture, and operational excellence while collaborating closely with application, infrastructure, security, and business teams.
Key Responsibilities
Reliability & Architecture
- Design and architect highly resilient, fault‑tolerant payment systems supporting high throughput and low latency SLAs.
- Define SRE principles, including SLOs, SLIs, error budgets, and reliability KPIs for mission‑critical payment services.
- Lead architecture decisions for cloud‑native, hybrid, and legacy systems, including IBM WebSphere–based platforms.
- Drive active‑active, DR, and HA strategies for AWS and on‑prem integrations.
Cloud & Platform Engineering
- Architect and operate workloads on AWS (EC2, EKS/ECS, RDS, S3, IAM, VPC, CloudWatch).
- Optimize infrastructure for scalability, availability, security, and cost efficiency.
- Guide containerization and orchestration strategies where applicable.
Application & Middleware Expertise
- Partner with development teams on Java, Spring Boot–based microservices.
- Support the performance and reliability of front‑end platforms built with React and Angular.
- Architect and operate messaging platforms using Kafka and IBM MQ.
- Manage enterprise middleware including IBM WebSphere Application Server.
DevOps & Automation
- Build and maintain CI/CD pipelines using Jenkins.
- Automate infrastructure provisioning, deployments, monitoring, and recovery processes.
- Promote Infrastructure as Code (IaC) and immutable infrastructure best practices.
- Champion DevOps and SRE culture across engineering teams.
Observability & Operations
- Design and standardize monitoring, logging, and alerting using:
  - Splunk
  - AWS CloudWatch
  - Datadog
- Implement proactive monitoring and advanced alerting for payment flows.
- Lead incident response, root cause analysis (RCA), and post‑incident reviews.
- Drive reduction in MTTR and recurring incidents.
Database & Data Layer
- Architect and support PostgreSQL and Oracle databases with focus on:
  - High availability
  - Performance tuning
  - Backup, restore, and disaster recovery
Payments Domain Leadership
- Provide reliability leadership for payment processing systems (authorization, capture, settlement, reconciliation).
- Ensure compliance with PCI‑DSS, security, and regulatory standards relevant to payments.
- Understand dependencies across gateways, processors, fraud, and downstream systems.
Leadership & Collaboration
- Act as technical lead/architect for SRE initiatives.
- Mentor SREs and engineers; guide best practices and standards.
- Work closely with product, architecture, security, and operations teams.
- Influence executive stakeholders on reliability, risk, and scalability decisions.
