Principal AI Cloud Infrastructure Engineer

Charlotte, North Carolina, United States | Onsite | Full Time | Principal | Posted 1 day ago

The position is described below. If you want to apply, click the Apply Now button at the top or bottom of this page. After you click Apply Now and complete your application, you'll be invited to create a profile, which will let you see your application status and any communications. If you already have a profile with us, you can log in to check status.
Need Help?
*If you have a disability and need assistance with the application, you can request a reasonable accommodation. Send an email to Accessibility (accommodation requests only; other inquiries won't receive a response).*
Regular or Temporary: Regular
Language Fluency: English (Required)
Work Shift: 1st shift (United States of America)
Please review the following job description:
***This role is 5 days a week in the Charlotte office.***
The AI Cloud Infrastructure Engineer owns the cloud infrastructure, environment architecture, compute management, networking, and platform operations that enable the Forge to build, deploy, scale, and operate AI and agentic systems in production with enterprise-grade reliability, security, and governance.
This is a hands-on senior infrastructure engineering role. The engineer designs and operates the cloud environments, container platforms, networking layers, identity boundaries, deployment pipelines, and runtime infrastructure that AI and agentic workloads depend on. Azure is the primary cloud, with support for AWS and Google Cloud where specific AI services or workload requirements warrant multi-cloud deployment.
Daily work includes provisioning and managing cloud environments, designing and maintaining container orchestration platforms, building Infrastructure as Code, managing compute and GPU resources for AI workloads, configuring networking and environment isolation, operating CI/CD deployment infrastructure, implementing identity and access controls at the infrastructure layer, instrumenting observability and telemetry, optimizing cost and performance, and ensuring all infrastructure meets Forge security, governance, and operational standards.
This role is the foundation that everything else in the Forge runs on. If the infrastructure is wrong, nothing built on top of it will be reliable, secure, or scalable.
***For this opportunity, Truist will not sponsor an applicant for work visa status or employment authorization, nor will we offer any immigration-related support for this position (including, but not limited to, H-1B, F-1 OPT, F-1 STEM OPT, F-1 CPT, J-1, TN-1 or TN-2, E-3, O-1, or future sponsorship for U.S. lawful permanent residence status).***
Essential Duties and Responsibilities
Following is a summary of the essential functions for this job. Other duties may be performed, both major and minor, which are not mentioned below. Specific activities may change from time to time.
Cloud Environment Architecture & Operations

Design, provision, and operate cloud environments for AI and agentic workloads across development, testing, staging, and production tiers with clear separation, security boundaries, and promotion controls.
Manage Azure as the primary cloud platform, with support for AWS and Google Cloud where specific AI services, model hosting, or workload requirements dictate multi-cloud deployment.
Implement and maintain environment isolation patterns that protect the bank, enforce regulatory boundaries, and enable safe experimentation without production risk.
Operate cloud subscriptions, resource groups, tagging strategies, cost management, and resource lifecycle governance aligned to Forge operating standards.

Container Orchestration & Compute Management

Design, deploy, and operate container orchestration platforms (AKS, Kubernetes, or equivalent) that host AI applications, agent runtimes, API services, and supporting workloads.
Manage compute resources for AI workloads, including GPU provisioning, scaling policies, resource quotas, node pool management, and workload scheduling optimized for AI inference and training patterns.
Implement container security patterns including non-root execution, read-only filesystems, capability restrictions, image scanning, registry governance, and runtime policy enforcement.
Support containerized deployment of AI models, agent services, evaluation harnesses, and supporting microservices with production-grade reliability and performance.

Infrastructure as Code & Deployment Pipelines

Build and maintain all infrastructure using Infrastructure as Code (Terraform, Bicep, or equivalent) with version control, peer review, automated validation, and drift detection.
Design and operate CI/CD deployment infrastructure that supports automated build, test, security scan, and promotion of AI workloads through environment tiers to production.
Implement deployment patterns including blue-green, canary, rolling updates, and rollback capabilities for AI services and agent runtimes.
Manage pipeline security including secrets injection, credential rotation, service principal governance, and least-privilege deployment identities.

Networking & Security Infrastructure

Design and maintain network architecture including virtual networks, subnets, private endpoints, service endpoints, network security groups, and traffic controls for AI workloads.
Implement network isolation and segmentation that protects AI systems, data flows, and inter-service communication from unauthorized access or lateral movement.
Configure and manage API gateways, load balancers, DNS, TLS/SSL, and ingress controllers for AI application and agent service endpoints.
Partner with security teams to implement infrastructure-level controls for identity, access, encryption at rest and in transit, key management, and audit logging.

Identity & Access Management (Infrastructure Layer)

Implement and manage identity and access controls for cloud resources, container platforms, deployment pipelines, and AI service endpoints using managed identities, service principals, and role-based access control.
Enforce least-privilege access across all infrastructure tiers, with clear separation between development, testing, and production permissions.
Support secrets management, certificate lifecycle, and credential rotation for AI services and agent integrations.

Observability & Operational Excellence

Instrument infrastructure observability including metrics, logs, traces, alerts, and dashboards for cloud resources, container platforms, networking, and deployment pipelines.
Monitor infrastructure health, capacity, performance, cost, and availability for AI workloads with proactive alerting and remediation workflows.
Build and maintain operational runbooks, incident response procedures, and escalation paths for infrastructure-related issues affecting AI and agentic systems.
Drive continuous improvement in infrastructure reliability, deployment speed, cost efficiency, and operational maturity.

Governance & Compliance

Ensure all cloud infrastructure meets Forge security standards, enterprise governance requirements, and regulatory compliance expectations for a regulated financial services environment.
Implement policy-as-code and automated compliance checks for infrastructure configurations, deployment pipelines, and runtime environments.
Maintain infrastructure documentation, architecture diagrams, configuration evidence, and audit artifacts required for governance and regulatory review.
Support deployment gate validation by providing infrastructure readiness evidence for AI and agentic solution releases.

**Required Qualifications:**
The requirements listed below are representative of the knowledge, skill and/or ability required. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.

Bachelor’s degree in an Information Systems-related field, or equivalent education and related training
Minimum of five years of experience with leading-edge, complex, state-of-the-art technologies and/or techniques, with additional experience in software development
Recognized in the industry for experience and knowledge, which may be gained through intensive experience such as working at a technology development company
Strong business and financial acumen and effective communication skills
Ability to establish strong relationships within the technical community
Ability to serve as a visionary concerning future technological capabilities and operational scenarios; ability to create new business models and technologies
Ability to create, manage and drive change
Ability to unify activities within the technology community, coordinating with other businesses and engineering organizations, as needed

Additional Requirements:

5+ years of cloud infrastructure engineering experience with strong hands-on depth in Azure, including compute, networking, identity, storage, and container services.
Demonstrated experience designing and operating cloud environments for production enterprise workloads with high availability, security, and governance requirements.
Strong experience with container orchestration platforms (Kubernetes, AKS, or equivalent), including cluster management, node pools, networking, security, and workload scheduling.
Hands-on experience with Infrastructure as Code using Terraform, Bicep, or equivalent tooling with version control, peer review, and automated validation practices.
Experience designing and operating CI/CD deployment pipelines for cloud-native applications and services.
Strong understanding of cloud networking including virtual networks, subnets, private endpoints, network security groups, DNS, load balancing, and traffic controls.
Experience implementing identity and access management, secrets handling, and least-privilege controls for cloud resources and deployment infrastructure.
Experience with infrastructure observability including metrics, logging, alerting, and dashboards for cloud and container platforms.
Ability to work across architecture, implementation, security, reliability, and operational concerns rather than isolated provisioning tasks.
Strong written and verbal communication skills, especially for architecture documentation, operational runbooks, and cross-functional technical collaboration.

Preferred Requirements:

Experience provisioning and managing GPU compute, AI inference endpoints, or model-serving infrastructure in cloud environments.
Experience with multi-cloud environments, including AWS and Google Cloud alongside Azure as the primary platform.
Experience with Azure-specific AI and platform services including Azure OpenAI, Azure AI Search, Azure API Management, Azure Monitor, Microsoft Entra ID, and Microsoft Fabric.
Experience with container security, image governance, runtime policy enforcement, and supply chain security for containerized workloads.
Experience implementing policy-as-code, automated compliance scanning, and infrastructure drift detection for governed enterprise environments.
Experience in financial services, cybersecurity, or other highly regulated enterprise environments with strong audit, control, and environment separation requirements.
Experience with cost optimization, resource right-sizing, and FinOps practices for cloud AI workloads.
Experience supporting AI-specific infrastructure patterns including vector database hosting, model registry infrastructure, evaluation environments, and agent runtime platforms.
Experience mentoring engineers, reviewing infrastructure design, and operating as a senior technical contributor with broad platform impact.

Other Job Requirements / Working Conditions
Visual / Audio / Speaking
Able to access and interpret client information received from the computer and able to hear and speak with individuals in person and on the phone.
Manual Dexterity / Keyboarding
Able to operate standard office equipment, including a PC keyboard and mouse, copy/fax machines, and printers.
Availability
Able to work all hours scheduled, including overtime as directed by manager/supervisor and required by business need.
Travel
Minimal, up to 10%
General Description of Available Benefits for Eligible Employees of Truist Financial Corporation:
All regular teammates (not temporary or contingent workers) working 20 hours or more per week are eligible for benefits, though eligibility for specific benefits may be determined by the division of Truist offering the position. Truist offers medical, dental, vision, life insurance, disability, accidental death and dismemberment, tax-preferred savings accounts, and a 401k plan to teammates. Teammates also receive no less than 10 days of vacation (prorated based on date of hire and by full-time or part-time status) during their first year of employment, along with 10 sick days (also prorated), and paid holidays. For more details on Truist’s generous benefit plans, please visit our Benefits site. Depending on the position and division, this job may also be eligible for Truist’s defined benefit pension plan, restricted stock units, and/or a deferred compensation plan. As you advance through the hiring process, you will also learn more about the specific benefits available for any non-temporary position for which you apply, based on full-time or part-time status, position, and division of work.
*Truist is an Equal Opportunity Employer that does not discriminate on the basis of race, gender, color, religion, citizenship or national origin, age, sexual orientation, gender identity, disability, veteran status, or other classification protected by law. Truist is a Drug Free Workplace.*
EEO is the Law | E-Verify | IER Right to Work
