AI SRE / AI Ops Engineer

Montreal, Quebec, Canada | Hybrid | Full Time | Posted 2 months ago

Role summary

Morgan Stanley is seeking an AI SRE / AI Ops Engineer for a hybrid role in Montreal, QC. This full-time position requires extensive production experience in SRE, infrastructure, and operations for large-scale systems. Key responsibilities include managing AI/ML compute clusters, containerization (Docker, Kubernetes), infrastructure-as-code (Terraform, Ansible), and monitoring tools (Prometheus, Datadog). The role demands strong programming skills in Python, Go, or Java, and expertise in networking, capacity planning, performance tuning, and incident response. Experience in regulated financial environments is a plus.

Hi,

Hope all is well!

I wanted to share an exciting opportunity with Morgan Stanley for an AI SRE / AI Ops Engineer role.

Partner:
Morgan Stanley

Role:
AI SRE / AI Ops Engineer

Type:
FTE

Location:
Montreal, QC

Work Mode:
Hybrid

Skills Required:

• Production experience in SRE / Infrastructure / ops for large-scale systems

• Strong programming/scripting skills (Python, Go, Java, or equivalent)

• Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)

• Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)

• Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures

• Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)

• Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

• Solid experience in capacity planning, performance tuning, scaling, and incident response

• Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

• Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

• Excellent communication, documentation, and cross-team collaboration skills

• Proven track record of reducing operational toil via automation

Best Regards,

Tanuj Chand

Senior - Talent Acquisition

Tanuj.chand@ibuconsulting.com

+91 8288993961

+1 240 681-9158

8716 Silver Hall Road, Perry Hall, Maryland 21128, USA

ibuconsulting.com | ibugroup.co.uk
