Lead AI Engineer - SRE, LLM Agents, Full-Stack Architecture

United States | Onsite | Full Time | Lead | Posted 2 months ago | Visa sponsorship available

Role summary

iVedha Inc. is seeking a Lead AI Engineer for a leading financial institution to design, build, and operationalize next-generation agentic AI systems. This leadership role focuses on LLM agents, Site Reliability Engineering (SRE), and full-stack architecture within a regulated banking environment. Responsibilities include architecting multi-agent LLM systems, implementing MCP servers, developing RAG pipelines, and leading AI observability with the ELK stack. The role requires expert proficiency in Node.js (TypeScript) and Python, a deep understanding of AI/ML, SRE best practices, and experience with enterprise AI tools. Candidates must demonstrate awareness of banking and compliance requirements such as SOC 2 Type II and PII protection. This is a high-impact opportunity to shape AI transformation in financial services.

About iVedha:

iVedha Inc. is a global AI-first digital transformation company with over 25 years of excellence. Powered by the iVedha Fabric, our AI-native operating system, we unify cloud, data, AI, security, and people to deliver measurable, resilient outcomes. Our expertise spans Agentic AI, Generative AI, Cloud Engineering, Cybersecurity, Data Modernization, Application Transformation, and Talent Enablement.

Join our team of forward-thinking innovators shaping the future of intelligent enterprises, where automation, observability, and AI-driven quality assurance redefine delivery velocity.

About the Opportunity

A leading financial institution is seeking a highly experienced Lead AI Engineer to join its advanced technology division. This is a high-impact, leadership-track role at the intersection of AI engineering, Site Reliability, and enterprise-grade software architecture. The successful candidate will design, build, and operationalize the next generation of agentic AI systems within a regulated banking environment — driving intelligent automation while maintaining the rigorous security, compliance, and availability standards demanded by the financial services industry.

You will architect multi-agent LLM systems, implement Model Context Protocol (MCP) servers, build production-grade RAG pipelines, and lead AI observability practices using the ELK stack. This role requires deep technical expertise combined with the leadership acumen to mentor engineers and influence cross-functional technical decisions.

Key Responsibilities

Pillar 1 — AI Architecture & Agentic Systems

Design and implement sophisticated LLM-powered agentic workflows and multi-agent architectures capable of autonomous reasoning, planning, and tool execution within secure financial system boundaries.
Architect and deploy scalable Model Context Protocol (MCP) servers to enable standardized, secure, and rich context management between AI models, internal banking APIs, and external data sources.
Develop production-grade Retrieval-Augmented Generation (RAG) and GraphRAG pipelines that ground AI agents in accurate, real-time enterprise financial data with full auditability.
Leverage expertise in Meta AI (Llama ecosystem), Google AI (Gemini, Vertex AI), and Microsoft Copilot to build and integrate cutting-edge AI features while adhering to financial data handling policies.
Implement prompt versioning, model drift detection, and automated evaluation pipelines to maintain AI system quality and regulatory compliance over time.

Pillar 2 — Full-Stack Engineering

Lead end-to-end development of robust, scalable AI applications using Node.js (TypeScript) and Python (FastAPI/Django) — both languages are required.
Champion AI-assisted developer workflows ('Vibe Coding') using advanced tools such as Cursor and GitHub Copilot to improve team productivity and code quality.
Design and implement secure, high-performance RESTful and GraphQL APIs to serve LLM inferences and agentic actions to frontend and downstream systems.
Develop and maintain Bash and Python automation scripts for infrastructure management, deployment orchestration, and operational efficiency.
Mentor junior and mid-level engineers in AI-native development practices and modern architectural patterns.

Pillar 3 — Site Reliability Engineering & AI Observability

Implement comprehensive observability stacks using the ELK Stack (Elasticsearch, Logstash, Kibana) specifically tuned for LLM performance metrics: latency, token usage, hallucination rates, and model drift indicators.
Apply SRE best practices to AI workloads — ensuring high availability, fault tolerance, incident response playbooks, and SLO/SLA management for LLM inference services.
Build and maintain CI/CD pipelines tailored for machine learning models, including prompt versioning, model evaluation gates, shadow deployments, and automated rollback.
Design alerting, on-call runbooks, and escalation paths for AI system incidents within a regulated financial services environment.

Required Qualifications:

- Programming Languages: Expert-level proficiency in Node.js (TypeScript/JavaScript) and Python. Both are required. Bash scripting for infrastructure automation is mandatory.
- AI & Machine Learning: Deep understanding of LLM architectures, prompt engineering, fine-tuning techniques (LoRA/QLoRA), and embedding models. Proven experience building and operating production-grade LLM applications.
- Agentic Frameworks: Hands-on experience designing autonomous agents and implementing Model Context Protocol (MCP) servers for standardized tool and context management.
- RAG & Vector Databases: Strong experience building RAG and GraphRAG pipelines. Proficiency with vector databases (Pinecone, Milvus, or Weaviate) and embedding model selection strategies.
- Observability & SRE: Extensive hands-on experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for distributed system logging, monitoring, and AI-specific metrics tracking.
- Cloud & Infrastructure: Proven experience with cloud-native architectures. Azure and AKS (Azure Kubernetes Service) experience strongly preferred for this engagement.
- Enterprise AI Tools: Demonstrated expertise with Microsoft Copilot (Copilot Studio extensibility, custom connectors), Meta AI open-source models, and Google AI infrastructure (Gemini/Vertex AI).
- Leadership: 8+ years of progressive software engineering experience. Minimum 3 years in a technical leadership or architectural role with a focus on AI/ML systems.

Banking & Compliance Requirements:

Given the regulated nature of this environment, candidates must demonstrate awareness of and experience with the following:

Working knowledge of SOC 2 Type II compliance principles and their impact on AI system design and data handling.
Understanding of financial data classification, PII protection, and audit trail requirements for AI-generated outputs.
Experience implementing secure credential management (e.g., Azure Key Vault, HashiCorp Vault) in production AI systems.
Familiarity with model governance requirements — including explainability, version control, and documentation for AI systems in regulated environments.
Knowledge of zero-trust security principles and least-privilege access patterns for AI agent tool integrations.

Preferred Qualifications:

Experience building or integrating AI observability platforms with OpenTelemetry for unified tracing across AI and infrastructure layers.
Elastic Certified Engineer or Elastic Certified Observability Engineer certification.
Familiarity with Elastic Agent and Fleet management for centralized log collection in enterprise environments.
Prior experience in financial services, banking technology, or fintech with exposure to trading systems, fraud detection, or compliance platforms.
Contributions to open-source AI/ML projects or published research in LLM applications.

Why This Role

This is a rare opportunity to be at the forefront of AI engineering within a major financial institution — building systems that push the boundaries of what autonomous agents can achieve within a complex, regulated enterprise. You will have direct architectural influence over the institution's AI transformation roadmap, work with cutting-edge models and frameworks, and lead a high-caliber engineering team. Your decisions will shape how AI is responsibly deployed in financial services for years to come.
