AI SRE / AI Ops Engineer
Role summary
Morgan Stanley is seeking an AI SRE / AI Ops Engineer for a hybrid role in Montreal, QC. This full-time position requires production experience in SRE, infrastructure, and operations for large-scale systems. Required skills include familiarity with GPU / AI compute clusters, containerization and orchestration (Docker, Kubernetes), infrastructure-as-code (Terraform, Ansible), and monitoring tools (Prometheus, Datadog). The role also calls for strong programming skills in Python, Go, or Java, and experience in networking, capacity planning, performance tuning, and incident response. Experience in regulated financial environments is a strong plus.
Hi,
I hope all is well.
I wanted to share an exciting opportunity with Morgan Stanley for an AI SRE / AI Ops Engineer role.
Partner: Morgan Stanley
Role: AI SRE / AI Ops Engineer
Type: FTE
Location: Montreal, QC
Work Mode: Hybrid
Skills Required:
• Production experience in SRE / Infrastructure / ops for large-scale systems
• Strong programming/scripting skills (Python, Go, Java, or equivalent)
• Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
• Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
• Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
• Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
• Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
• Solid experience in capacity planning, performance tuning, scaling, and incident response
• Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
• Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
• Excellent communication, documentation, and cross-team collaboration skills
• Proven track record of reducing operational toil via automation
Best Regards,
Tanuj Chand
Senior - Talent Acquisition
Tanuj.chand@ibuconsulting.com
+91 8288993961
+1 240 681-9158
8716 Silver Hall Road, Perry Hall, Maryland 21128, USA
ibuconsulting.com | ibugroup.co.uk