LLM Infrastructure Engineer

Houston, Texas, United States | Onsite | Full Time | Posted 2 months ago | Visa sponsorship available

Role summary

We are looking for a Senior Python / AI API Engineer to build and deploy production-grade services powering Large Language Model (LLM) applications. This role focuses on developing high-performance APIs for model inference, optimizing GPU workloads, and deploying AI services in cloud environments.

This is an engineering-focused role, not research. We are looking for someone who has built and shipped AI systems into production and understands the challenges of scalable inference and model serving.

Key Responsibilities

Develop high-performance APIs using Python (3.10+) and FastAPI
Build and deploy LLM inference services using HuggingFace Transformers and PyTorch
Optimize GPU workloads and CUDA memory usage
Implement streaming inference APIs for real-time model responses
Containerize and deploy services using Docker and GPU-enabled infrastructure
Deploy AI workloads in Azure environments (AKS, ACI, or Container Apps)

Required Skills

Strong Python development experience (3.10+)
Hands-on experience building production APIs with FastAPI
Experience with HuggingFace Transformers and PyTorch
Solid understanding of REST API design
Experience deploying containerized applications with Docker

Nice to Have

Experience with OpenAI-compatible APIs, vLLM, or Text Generation Inference (TGI)
Experience deploying AI workloads on Azure GPU infrastructure
Familiarity with LoRA / PEFT fine-tuning
Exposure to legal or financial NLP use cases

Ideal Candidate: A hands-on engineer who understands how LLM systems run in production, from model loading and tokenization to GPU deployment and scalable APIs.

Applications are handled through the AMSYS Innovative Solutions application page.