Data/ML Engineer
Role summary
Seeking a hands-on Senior Data/ML Engineer with over 10 years of experience to provide deep technical leadership and deliver production-ready solutions. This hybrid role requires expertise in Python and Java for building scalable services on Red Hat OpenShift and cloud platforms. Key responsibilities include performing advanced SQL data analysis, developing ML models, and applying Generative AI (Gemini/GPT) for data triage and test case generation. The ideal candidate will independently solve complex data problems from definition to delivery, set high standards for SDLC and CI/CD, and mentor other engineers. Experience with data lakes, containerization, and a strong understanding of ML techniques are essential. Payments domain experience is a plus.
Job Title: Data/ML Engineer
Location: Charlotte, NC / Boston, MA / Dallas, TX – Hybrid Role (3 days a week in office)
Duration: 12 months, likely to extend or convert
Interview: 2 rounds of 30-minute MS Teams calls
Job Description:
- We are seeking an experienced engineer (more than 10 years) who can independently analyze data given a problem statement and translate insights into production-ready solutions.
- The engineer will provide deep technical leadership across mission-critical platforms.
- You will design and deliver scalable services in Python and Java on Red Hat OpenShift (OCP) and cloud, while serving as a hands-on expert in GenAI (Gemini/GPT), LLM evaluation, and agentic frameworks.
- You’ll set the bar for architecture, SDLC excellence, and CI/CD automation, and mentor engineers to raise the craft across teams.
Key Responsibilities:
- Perform data analysis and exploration using SQL and statistical techniques to solve business problems.
- Design, develop, and implement solutions using Python or Java, leveraging libraries such as NumPy, SciPy, Matplotlib, and scikit-learn.
- Build and evaluate machine learning models including Random Forest and XGBoost.
- Apply AI-assisted techniques by crafting effective prompts for Gemini models to accelerate data analysis, feature exploration, and insight generation.
- Communicate findings clearly and partner with engineering and business teams to drive outcomes.
Required Skills:
- Strong SQL and data analysis skills.
- Proficiency in Python or Java for data science and ML workloads.
- Hands-on experience with ML frameworks and model development.
- Ability to work independently end to end, from problem definition to solution delivery.
- Experience using generative AI models to augment analytical workflows.
Detailed JD (in addition to above):
About the Role:
- We’re hiring a hands-on Senior Data Engineer / Data Analyst who can both build data pipelines and analyze large datasets—not just one or the other.
- You’ll design and deliver scalable services in Python and Java (Java for service implementations) on Red Hat OpenShift (OCP) and cloud platforms.
- You’ll operate inside agile backlogs, break down complex data problems independently, and be the go-to technical leader for SQL, Python, data modeling, ML application, and practical GenAI use (Gemini/GPT). Payments experience (e.g., wire payments) is a plus, not required.
What You’ll Do
- Own full-cycle data problem solving: profile large datasets, design pipelines, engineer transformations, and perform deep analysis to identify patterns, outliers, and root causes.
- Implement production-grade code: develop data services and utilities in Python (primary) and Java (for service implementations) with strong testing, observability, and reliability.
- Query at scale with SQL: write advanced SQL for data exploration, deduplication, quality checks, and performance-tuned analytics across data lakes/warehouses.
- Work from the backlog: take scoped stories, clarify requirements, and drive them to done with minimal supervision—raising risks early and proposing solutions.
- Payments data scenario (example you’ll tackle): payment objects arrive as XML in a data lake. You will parse, normalize, and analyze them to identify unique payment examples, apply deduplication strategies, and generate representative test cases that cover critical edge conditions.
- GenAI & ML in practice: write effective prompts and apply LLM-powered techniques (Gemini/GPT) to accelerate data triage, test generation, and anomaly detection (with careful evaluation and guardrails); apply machine learning knowledge (e.g., feature extraction, similarity measures, dedup/record-linkage techniques) to large-scale data problems.
- Set engineering standards: raise the bar on architecture, SDLC excellence, CI/CD automation, and code review quality; mentor engineers across teams.
- Platform leadership: contribute to services running on Red Hat OpenShift (OCP) and cloud; collaborate with SRE and platform teams for resilience, scaling, and cost efficiency.
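To illustrate the payments data scenario above, here is a minimal sketch of parsing, normalizing, and deduplicating XML payment objects using only the Python standard library. The element names (`Payment`, `MsgId`, `Amt`, `Ccy`), the sample documents, and the choice to ignore `MsgId` when fingerprinting are all assumptions made for illustration; real payment schemas will differ and typically use XML namespaces and repeated elements, which this sketch does not handle.

```python
import hashlib
import xml.etree.ElementTree as ET


def normalize(xml_text: str) -> dict:
    """Flatten one payment XML document into a {path: value} dict."""
    root = ET.fromstring(xml_text)
    fields = {}

    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            # Upper-case and trim values so cosmetic differences do not defeat dedup.
            fields[path] = text.upper()
        for child in node:
            walk(child, f"{path}/{child.tag}")

    walk(root, root.tag)
    return fields


def fingerprint(fields: dict, ignore=("Payment/MsgId",)) -> str:
    """Hash the normalized fields, skipping identifiers assumed unique per message."""
    kept = sorted((k, v) for k, v in fields.items() if k not in ignore)
    return hashlib.sha256(repr(kept).encode("utf-8")).hexdigest()


def unique_payments(documents):
    """Keep the first document seen for each fingerprint."""
    seen = {}
    for doc in documents:
        seen.setdefault(fingerprint(normalize(doc)), doc)
    return list(seen.values())


# Hypothetical sample documents; the first two differ only in MsgId and letter case.
samples = [
    "<Payment><MsgId>1</MsgId><Amt>100.00</Amt><Ccy>usd</Ccy></Payment>",
    "<Payment><MsgId>2</MsgId><Amt>100.00</Amt><Ccy>USD</Ccy></Payment>",
    "<Payment><MsgId>3</MsgId><Amt>250.00</Amt><Ccy>USD</Ccy></Payment>",
]
print(len(unique_payments(samples)))  # 2
```

The surviving documents, one per fingerprint, are a natural starting point for representative test cases; which fields to ignore is a judgment call that depends on the actual schema.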
Required Qualifications (Must-Have)
- We’re specifically seeking someone strong in both data engineering and data analysis, who can code at a senior level.
- Expert Python for data engineering & analysis (pandas, PySpark or similar, modular design, testing).
- Advanced SQL (analytical window functions, performance optimization, CTEs, partitioning, large-scale joins).
- Java experience for service implementations (APIs, data services, utilities), with strong SDLC discipline.
- Proven experience with large datasets: profiling, cleaning, deduping, and synthesizing insights; comfort with semi-structured data (XML/JSON).
- Data lake / warehouse experience (e.g., parquet, object storage, lakehouse patterns).
- Hands-on CI/CD (Git, pipelines, build/test/release automation) and containerized deployments (Docker/K8s; OpenShift/OCP highly preferred).
- Independent problem solver: break down ambiguous data issues, form hypotheses, validate with code, and communicate outcomes clearly.
- Practical GenAI usage: ability to craft prompts and evaluate LLM outputs for data triage, test case generation, and analysis acceleration; disciplined about validation and bias/error checking.
- Foundational ML knowledge: familiarity with applying models/techniques relevant to data quality (e.g., clustering, similarity, dedup/record linkage, anomaly detection)—you know when and how to apply them.
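As one small example of the similarity and record-linkage techniques mentioned above, the sketch below flags near-duplicate records with Jaccard similarity over field tokens. The record fields, sample values, and the 0.5 threshold are hypothetical; in practice the threshold would be tuned against labeled pairs, and the pairwise loop would be replaced by blocking or hashing for large datasets.

```python
def tokens(record: dict) -> set:
    """Turn a record into a set of 'field=value' tokens for comparison."""
    return {f"{key}={str(value).strip().lower()}" for key, value in record.items()}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(records, threshold=0.5):
    """Return index pairs whose similarity meets the threshold (quadratic; small samples only)."""
    sets = [tokens(r) for r in records]
    return [
        (i, j)
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Hypothetical records; the first two differ only in letter case and date.
records = [
    {"debtor": "ACME Corp", "amount": "100.00", "currency": "USD", "date": "2024-01-05"},
    {"debtor": "acme corp", "amount": "100.00", "currency": "USD", "date": "2024-01-06"},
    {"debtor": "Globex", "amount": "75.50", "currency": "EUR", "date": "2024-01-05"},
]
print(near_duplicates(records))  # [(0, 1)]
```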
Preferred Qualifications (Nice-to-Have)
- Payments domain (especially wire payments)—schemas, statuses, exceptions, reconciliation.
- Experience with streaming (Kafka), workflow/orchestration (Airflow), and feature engineering for ML.
- LLM evaluation methods, agentic frameworks, and prompt chaining; experience balancing precision/recall for operational use cases.
- Performance tuning across Python/SQL/Java and data storage formats.
- Familiarity with Microsoft SQL Server (or similar enterprise RDBMS).
- Experience hardening solutions for auditability, lineage, and data quality (DQ frameworks, profiling at ingest).
Day-to-Day Responsibilities
- Translate backlog stories into technical plans; estimate, design, implement, test, and release.
- Build and optimize Python-based data processing jobs and SQL queries to support analytics and test-case derivation.
- Parse and normalize XML payment objects; implement dedup logic to identify unique records; generate high-coverage test cases.
- Create internal tools/scripts to improve developer productivity and data observability.
- Use GenAI (Gemini/GPT) responsibly: write prompts, evaluate outputs, and integrate where it truly adds value—always with validation.
- Lead by example on code quality, documentation, and incident-free deployments.
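The SQL side of the dedup work above can be sketched with a CTE and a window function. This example uses Python's built-in `sqlite3` module as a stand-in for the actual warehouse engine; the `payments` table, its columns, and the rule "keep the most recently loaded row per debtor, amount, and currency" are assumptions for illustration only. Window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, for illustration only.
conn.execute(
    "CREATE TABLE payments (id INTEGER, debtor TEXT, amount TEXT, currency TEXT, loaded_at TEXT)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ACME", "100.00", "USD", "2024-01-05"),
        (2, "ACME", "100.00", "USD", "2024-01-06"),
        (3, "GLOBEX", "75.50", "EUR", "2024-01-05"),
    ],
)

# Rank rows within each duplicate group, newest first, then keep rank 1.
DEDUP_SQL = """
WITH ranked AS (
    SELECT id, debtor, amount, currency, loaded_at,
           ROW_NUMBER() OVER (
               PARTITION BY debtor, amount, currency
               ORDER BY loaded_at DESC
           ) AS rn
    FROM payments
)
SELECT id, debtor, amount, currency
FROM ranked
WHERE rn = 1
ORDER BY id
"""

rows = conn.execute(DEDUP_SQL).fetchall()
print(rows)  # [(2, 'ACME', '100.00', 'USD'), (3, 'GLOBEX', '75.50', 'EUR')]
```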
Tech Stack
- Python, SQL, Java, Red Hat OpenShift (OCP), Kubernetes, Docker, Object Storage/Lakehouse (Parquet/Delta), Airflow (or similar), GitHub/GitLab CI, Observability (logs/metrics/traces), XML/JSON parsing, Cloud: AWS/Azure/GCP
Education & Experience:
- Bachelor’s in CS, Engineering, Math, or equivalent practical experience.
- Typically 5–10+ years of combined data engineering/analysis experience (flexible with demonstrated impact and portfolio).