LLM Infrastructure Engineer
Role Summary
We are looking for a Senior Python / AI API Engineer to build and deploy production-grade services powering Large Language Model (LLM) applications. This role focuses on developing high-performance APIs for model inference, optimizing GPU workloads, and deploying AI services in cloud environments.
This is an engineering-focused role, not research. We are looking for someone who has built and shipped AI systems into production and understands the challenges of scalable inference and model serving.
Key Responsibilities
- Develop high-performance APIs using Python (3.10+) and FastAPI
- Build and deploy LLM inference services using HuggingFace Transformers and PyTorch
- Optimize GPU workloads and CUDA memory usage
- Implement streaming inference APIs for real-time model responses
- Containerize and deploy services using Docker and GPU-enabled infrastructure
- Deploy AI workloads in Azure environments (AKS, ACI, or Container Apps)
Required Skills
- Strong Python development experience (3.10+)
- Hands-on experience building production APIs with FastAPI
- Experience with HuggingFace Transformers and PyTorch
- Solid understanding of REST API design
- Experience deploying containerized applications with Docker
Nice to Have
- Experience with OpenAI-compatible APIs, vLLM, or Text Generation Inference (TGI)
- Experience deploying AI workloads on Azure GPU infrastructure
- Familiarity with LoRA / PEFT fine-tuning
- Exposure to legal or financial NLP use cases
Ideal Candidate: A hands-on engineer who understands how LLM systems run in production, from model loading and tokenization to GPU deployment and scalable APIs.