Infrastructure & SRE Engineer

Canada | Remote | Full Time | Posted 2 months ago

Role summary

Intuitive is seeking a Staff Infrastructure & SRE Engineer to manage the full lifecycle of their cloud-native platform, focusing on AWS and Kubernetes. This hands-on role involves provisioning, maintaining reliability through observability, release engineering, and incident response. Responsibilities include daily scripting in Python and Bash, managing EKS clusters, integrating monitoring systems, and enforcing DevOps best practices. The engineer will build automation, observability, and IaC foundations to ensure a programmable, observable, and resilient cloud platform. The role emphasizes relentless automation, end-to-end system ownership, and data-driven reliability improvements.

About us:

Intuitive is an innovation-led engineering company delivering business outcomes for hundreds of enterprises globally. With a reputation as a Tiger Team and a Trusted Partner of enterprise technology leaders, we help solve the most complex digital transformation challenges across the following Intuitive Superpowers:

Modernization & Migration

Application & Database Modernization
Platform Engineering (IaC/EaC, DevSecOps & SRE)
Cloud Native Engineering, Migration to Cloud, VMware Exit
FinOps

Data & AI/ML

Data (Cloud Native / DataBricks / Snowflake)
Machine Learning, AI/GenAI

Cybersecurity

Infrastructure Security
Application Security
Data Security
AI/Model Security

SDx & Digital Workspace (M365, G-suite)

SDDC, SD-WAN, SDN, NetSec, Wireless/Mobility
Email, Collaboration, Directory Services, Shared Files Services

Intuitive Services:

Professional and Advisory Services
Elastic Engineering Services
Managed Services
Talent Acquisition & Platform Resell Services

About the job:

Title: Infrastructure & SRE Engineer

Start Date: Immediately

# of Positions: 1

Position Type: Full Time / Contract

Location: Remote across Canada (occasional travel to the USA)

About the Role:

The Staff Infrastructure & SRE Engineer will own the full lifecycle of our cloud-native platform — from provisioning and sizing AWS and Kubernetes infrastructure, to maintaining reliability through observability, release engineering, and incident response. This is a deeply hands-on engineering role with real production ownership, where you'll balance technical depth with operational leadership to keep our platform reliable and scalable.

You will write Terraform, Python, and Shell scripts daily, manage EKS clusters at scale, integrate applications into APM and monitoring systems, and enforce DevOps best practices including change control and uptime monitoring. Your focus will be on platform reliability and operational excellence — building the automation, observability, and infrastructure-as-code foundations that make our cloud platform programmable, observable, and resilient. We value engineers who automate relentlessly, own their systems end-to-end, and drive reliability improvements through data and discipline.

Key Responsibilities

As a Staff Infrastructure & SRE Engineer, you will:

Own AWS infrastructure provisioning and operations, ensuring production reliability across VPCs, EC2, RDS, S3, IAM, Route 53, and ALB/NLB in multi-account environments following AWS Well-Architected Framework principles; implement cost optimization, right-sizing, and resource tagging strategies.
Lead Kubernetes platform operations end-to-end, from provisioning EKS clusters from scratch through full lifecycle management: sizing and capacity planning with Cluster Autoscaler/Karpenter, version upgrades, node group rotations, and breaking-change migrations.
Drive infrastructure-as-code excellence, setting standards for Terraform/OpenTofu module development with automated testing (Terratest, plan validation), reliable state management with remote backends, and governance enforcement through policy checks (OPA/Rego, TFLint).
Own end-to-end observability and APM integration, ensuring full visibility across infrastructure and applications: design monitoring frameworks with Prometheus, Grafana, Loki, Tempo, and OpenTelemetry; instrument applications for distributed tracing and structured logging; define and track SLIs/SLOs for platform services.
Lead release engineering and change control from planning through production deployment: coordinate infrastructure and application releases with rollback plans, validation gates, maintenance windows, and audit trails for all production changes.
Drive incident response and platform reliability, building on-call rotations, escalation paths, actionable runbooks, and blameless postmortem processes; implement chaos engineering practices to proactively identify platform weaknesses.
Own environment provisioning pipelines, ensuring repeatable, automated infrastructure delivery from bare AWS accounts to fully operational platforms across dev, staging, and production.
Build GitOps workflows, implementing ArgoCD or Flux for declarative cluster and application management and ensuring all changes flow through Git with PR-based review and automated validation.
Develop automation and tooling, writing Python CLI tools, Bash scripts, and CI/CD pipelines (GitLab CI/GitHub Actions) for infrastructure provisioning, deployment, health checks, and operational tasks; build Ansible roles for configuration management and OS hardening where needed.
Mentor and grow engineers on SRE practices, Kubernetes operations, observability patterns, and infrastructure-as-code standards through pairing, code reviews, and internal enablement sessions.
Champion DevOps culture, breaking down silos between development and operations, promoting shared ownership of reliability, and driving toil reduction through automation: if you're doing it manually more than twice, automate it.

Required Qualifications

10+ years in Infrastructure Engineering, SRE, or Platform Engineering roles with production ownership.
Expert-level AWS experience across core services (VPC, EC2, RDS, S3, IAM, EKS, Route 53, ALB/NLB, CloudWatch) in multi-account production environments.
Deep Kubernetes hands-on experience provisioning, sizing, operating, and troubleshooting EKS clusters in production; strong understanding of pod lifecycle, networking (CNI), storage (CSI), RBAC, and capacity planning.
Strong Terraform/OpenTofu proficiency: module development, state management, workspace strategies, and CI/CD integration for large infrastructure-as-code codebases.
Production observability experience building monitoring and alerting from scratch — Prometheus, Grafana, and at least one APM tool (Datadog, New Relic, Dynatrace, or OpenTelemetry-based stacks); proven experience instrumenting applications for distributed tracing and structured logging.
Daily Python and Bash scripting for infrastructure automation, AWS API integrations, CLI tooling, and operational tasks — not just "can write scripts" but uses them as primary engineering tools.
Hands-on GitOps experience with ArgoCD or Flux for Kubernetes delivery; comfortable with PR-based infrastructure workflows where Git is the single source of truth.
Proven DevOps and release engineering experience — change control processes, release management, incident response, on-call rotations, and blameless postmortem processes in enterprise environments.
Strong Linux systems knowledge (RHEL/Ubuntu) including systemd, networking, storage, and performance tuning.
Proven ability to troubleshoot across the full stack — from AWS infrastructure to Kubernetes to application-layer issues.

Preferred Qualifications

Helm chart development and management for production Kubernetes workloads.
Container security and supply chain hardening (image scanning, signing, admission controllers).
AWS certifications (Solutions Architect, DevOps Engineer, or SysOps).
FinOps practices — cost allocation, budget alerting, and right-sizing automation across AWS accounts.
Service mesh operations (Istio, Linkerd, or AWS App Mesh) in production environments.
Go for building infrastructure tooling, Kubernetes operators, or CLI utilities.
