AI/ML Infrastructure Engineer

Canada · Onsite · Contract · Posted 2 months ago

Role Summary

The MLOps / AI Platform Engineer owns the operational lifecycle of AI models, prompts, and agents supporting the AI-Enabled Platform—ensuring reliable deployments, safe rollbacks, observability, and cost-effective performance at scale.

Key Responsibilities

Implement and manage AI/ML CI/CD:
  - Pipelines for models, prompts, and configuration changes
  - Canary deployments, rollbacks, and environment management
Operate the AI platform:
  - Model registry, feature stores, and inference infrastructure
  - SLOs and SLAs for AI endpoints used by Jira/Confluence apps and services
Monitor and optimize:
  - Model performance, drift, and data quality signals
  - Cost-to-serve, latency, and scalability for inference workloads
  - Adoption metrics, override rates, and false positives/negatives
Provide experimentation and evaluation frameworks:
  - A/B testing harnesses for new models and prompts
  - Dashboards for time saved, risk detection quality, and user engagement
Partner with AI Engineers, Backend, and Governance:
  - Enforce responsible AI and governance constraints in deployments
  - Support auditability and traceability of AI decisions and releases
Document and standardize:
  - Runbooks, playbooks, and incident management procedures
  - Platform guidelines for AI feature teams building on the platform

Qualifications

Strong experience in MLOps, ML platform engineering, or related DevOps roles
Hands-on experience with model registries, CI/CD tools, and monitoring stacks
Familiarity with serving ML/GenAI workloads in production
Solid skills in infrastructure-as-code, containerization, and cloud-native services
Understanding of responsible AI, observability, and cost optimization for ML systems
