Posted on 2025/12/01

ML Infrastructure Engineer (Staff / Principal)

Jobgether

California, United States

Full-time

Full Description

This position is posted by Jobgether on behalf of a partner company.

We are currently looking for an ML Infrastructure Engineer (Staff / Principal) in California (USA).

This role offers the opportunity to lead the development and optimization of cutting-edge ML infrastructure for large-scale generative and predictive AI models.

You will work at the intersection of machine learning, physics, and computational chemistry, driving scalable, high-performance systems that accelerate AI research in molecular modeling.

The position involves designing distributed training pipelines, optimizing GPU operations, and building robust MLOps frameworks that push the boundaries of AI performance.

You will collaborate closely with researchers, engineers, and scientists, mentoring junior team members while contributing to long-term technical strategy.

This is a hands-on, high-impact role where your work directly enables groundbreaking discoveries in molecular AI.

• Accountabilities:

• Lead engineering efforts for building and scaling distributed ML training and inference infrastructure across GPU clusters and cloud environments.

• Optimize model efficiency in terms of throughput, latency, memory, and GPU utilization, pushing hardware to its performance limits.

• Design and implement MLOps tools and frameworks for automated, reliable deployment and evaluation of AI models.

• Collaborate with researchers and cross-functional teams to integrate infrastructure with generative and predictive AI workflows.

• Drive long-term platform vision, contributing to architectural decisions, tooling improvements, and best practices.

• Mentor junior engineers and research interns, fostering a culture of technical excellence and innovation.

• Requirements:

• Extensive experience in distributed ML training and inference on large-scale GPU clusters.

• Proficiency in PyTorch, PyTorch Lightning, PyTorch Geometric, Ray, or similar frameworks.

• Strong engineering skills with the ability to design, implement, and maintain robust, scalable systems.

• Experience optimizing GPU workloads and performance engineering for high-throughput ML pipelines.

• Independent thinker with a strong sense of ownership and ability to deliver from first principles to production-quality systems.

• Curiosity and problem-solving mindset for working at the intersection of AI, physics, chemistry, and biology.

• Nice to Have:

• Experience building and maintaining cluster infrastructure with Kubernetes and Terraform.

• Expertise in GPU programming, XLA, Triton, CUDA, or deep learning compiler stacks.

• Familiarity with molecular systems (proteins, small molecules, 3D structures), ML force fields, or point cloud data.

• Experience contributing to highly collaborative, cross-functional teams in research or production ML environments.

• Benefits:

• Competitive salary and equity package.

• Comprehensive health benefits: medical, dental, and vision fully covered for employees.

• 401(k) plan.

• Open (unlimited) PTO policy and paid family leave (maternity and paternity).

• Life, long-term, and short-term disability insurance.

• Free meals at office locations and other employee perks.

• Opportunities for growth, mentorship, and hands-on impact in cutting-edge molecular AI research.

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.

🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience, and achievements.

📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.

🎯 Based on this analysis, we automatically shortlist the three candidates with the highest match to the role.

🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.

The process is transparent, skills-based, and free of bias — focusing solely on your fit for the role.

Once the shortlist is completed, we share it directly with the company that owns the job opening.

The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.

Thank you for your interest!
