TechDoQuest Verified
Education, IT Services, Professional Training

AI/ML Infrastructure Engineer

Canada | Onsite | Contract | Posted 2 months ago

Role summary

The MLOps / AI Platform Engineer is responsible for the operational lifecycle of AI models, prompts, and agents within the AI-Enabled Platform. This role ensures reliable deployments, safe rollbacks, robust observability, and cost-effective performance at scale. Key responsibilities include implementing and managing AI/ML CI/CD pipelines, operating the AI platform (model registry, feature stores, inference infrastructure), monitoring and optimizing model performance and costs, providing experimentation frameworks like A/B testing, and partnering with AI Engineers and Governance teams to enforce responsible AI practices. The role also involves documenting procedures and platform guidelines.

Role Summary

The MLOps / AI Platform Engineer owns the operational lifecycle of AI models, prompts, and agents supporting the AI-Enabled Platform—ensuring reliable deployments, safe rollbacks, observability, and cost-effective performance at scale.

Key Responsibilities

  • Implement and manage AI/ML CI/CD:
      ◦ Pipelines for models, prompts, and configuration changes
      ◦ Canary deployments, rollbacks, and environment management
  • Operate the AI platform:
      ◦ Model registry, feature stores, and inference infrastructure
      ◦ SLOs and SLAs for AI endpoints used by Jira/Confluence apps and services
  • Monitor and optimize:
      ◦ Model performance, drift, and data quality signals
      ◦ Cost-to-serve, latency, and scalability for inference workloads
      ◦ Adoption metrics, override rates, and false positives/negatives
  • Provide experimentation and evaluation frameworks:
      ◦ A/B testing harnesses for new models and prompts
      ◦ Dashboards for time saved, risk detection quality, and user engagement
  • Partner with AI Engineers, Backend, and Governance:
      ◦ Enforce responsible AI and governance constraints in deployments
      ◦ Support auditability and traceability of AI decisions and releases
  • Document and standardize:
      ◦ Runbooks, playbooks, and incident management procedures
      ◦ Platform guidelines for AI feature teams building on the platform

Qualifications

  • Strong experience in MLOps, ML platform engineering, or related DevOps roles
  • Hands-on experience with model registries, CI/CD tools, and monitoring stacks
  • Familiarity with serving ML/GenAI workloads in production
  • Solid skills in infrastructure-as-code, containerization, and cloud-native services
  • Understanding of responsible AI, observability, and cost optimization for ML systems