TrueML
Financial Technology (FinTech), AI/Machine Learning, Debt Management

Senior Software Engineering Manager (DevOps)

United States · Remote · Full Time · Manager / Head · Posted 1 day ago · Visa sponsorship available

TrueML Products is seeking a highly experienced and strategic Sr. Manager, DevOps to lead our infrastructure and platform engineering efforts. This role is critical in driving our cloud architecture strategy, establishing elite CI/CD standards, and ensuring the scalability and reliability of our machine learning-driven products.

Reporting to the Sr. Director, Program & Operations, you will lead the evolution of our internal developer platform and infrastructure-as-code (IaC) architecture. The ideal candidate is a hands-on leader with a "systems-thinking" mindset. We are looking for a visionary who thrives on solving complex distributed systems challenges and for whom using GenAI and AIOps tooling to optimize system performance and automation is second nature.

What You'll Do (Technical Leadership & Strategy):

  • Define and execute the long-term strategic vision for Infrastructure as Code (IaC), CI/CD evolution, and cloud-native architecture to support TrueML’s scaling needs.
  • Lead the design and implementation of self-service internal platforms to reduce developer cognitive load, enabling feature teams to deploy and manage services with minimal friction at increased velocity.
  • Act as the primary stakeholder for cloud spend (AWS); drive cost-optimization initiatives and lead contract negotiations for the DevOps toolstack and third-party vendors.
  • Ensure the infrastructure architecture supports strict High Availability (HA) requirements and robust Disaster Recovery (DR) protocols, maintaining system integrity across multiple regions.
  • Oversee the implementation and evolution of comprehensive monitoring, logging, and distributed tracing systems, leveraging AIOps to move from reactive to predictive system maintenance.
  • Champion security by design by integrating automated vulnerability scanning, secret management, and compliance checks directly into the automated build pipelines.
  • Serve as the ultimate escalation point for major production outages, facilitating blameless post-mortem reviews that focus on systemic improvements rather than individual error.
  • Maintain deep technical currency in container orchestration (Kubernetes), serverless patterns, and modern automation frameworks to provide meaningful mentorship and architectural guidance to senior engineering staff.

What You'll Do (Hands-On Engineering & Technical Execution):

  • Maintain the ability to write and review high-quality code in languages like Python, Go, or Bash to automate complex operational tasks and system integrations.
  • Develop Terraform Infrastructure as Code hands-on for resource provisioning.
  • Directly architect and troubleshoot complex CI/CD workflows (GitHub Actions, ArgoCD, Atlantis), ensuring build-and-deploy cycles are optimized for speed and reliability.
  • Proactively manage and tune container orchestration environments, including hands-on configuration of Ingress controllers, declarative GitOps workflows, and cluster autoscaling.
  • Lead from the front during critical incidents by conducting deep-dive technical analysis across the EKS stack, troubleshooting node-level kernel panics, VPC CNI networking bottlenecks, and RDS performance constraints to minimize MTTR.
  • Conduct hands-on audits of cloud configurations and IAM policies, implementing "least privilege" access controls and automated remediation scripts.
  • Directly manage the integration and API configurations between various tools in the DevOps stack (e.g., connecting Jira, VictorOps, Slack, and Observe for seamless incident flow).

What You'll Do (People Leadership & Engineering Collaboration):

  • Recruit, hire, and develop a world-class team of DevOps Engineers; provide career pathing and technical mentorship to foster a culture of continuous learning.
  • Partner closely with Engineering Managers to align infrastructure deliverables with the product roadmap, ensuring DevOps is an accelerator rather than a bottleneck.
  • Collaborate with the Quality Engineering and Security leadership to define and enforce "Definition of Done" standards that include automated testing and security gates.
  • Set clear, measurable goals (KPIs and OKRs) for the team, conducting regular performance reviews and providing feedback to drive individual and collective excellence.
  • Lead internal Brunch & Learns to educate the broader engineering organization on modern cloud-native patterns and self-service capabilities.

Who You Are (Qualifications):

  • Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
  • 10+ years of experience in DevOps, Site Reliability Engineering (SRE), or Software Engineering; 5+ years of experience managing engineers.
  • Expert-level mastery of AWS and experience managing multi-region, high-availability deployments.
  • Advanced experience with Kubernetes (K8s) and Docker, including cluster management, networking, and scaling in a production environment.
  • Proficiency in Terraform to drive consistency and automation across all infrastructure layers. Experience with Atlantis is a plus.
  • Deep experience designing and maintaining complex CI/CD pipelines (GitHub Actions, GitLab CI, or Jenkins) and mastery of scripting languages like Python, Go, or Bash.
  • Hands-on experience with modern monitoring, observability, and tracing stacks (Datadog, Observe) and a firm grasp of SRE principles (SLIs/SLOs/Error Budgets).
  • Experience acting as an Incident Commander for high-severity outages and fostering a "blameless" post-mortem culture.
  • Demonstrated ability to influence executive leadership and collaborate cross-functionally with Product, Engineering, and Security teams.
  • Experience integrating AI-assisted productivity tools (Cline, GitHub Copilot) into the engineering workflow to accelerate delivery.

Ways to "Stand Out":

  • Experience leading organizational platform migrations, including the development of rollback strategies, stakeholder communication plans, and post-migration validation.
  • Prior experience working with high-velocity, product-driven early-to-mid stage technology companies where reliability, extensibility, and availability were mission-critical to success.
  • AWS or Kubernetes certifications are a plus, but not in lieu of hands-on experience with the same technologies in production environments.
  • Notable contributions to Open Source projects or communities.