Posted on 2025/12/01

ML Infrastructure Engineer (Staff / Principal)

Jobgether

California, United States

Full-time

Qualifications

  • Extensive experience in distributed ML training and inference on large-scale GPU clusters
  • Proficiency in PyTorch, PyTorch Lightning, PyTorch Geometric, Ray, or similar frameworks
  • Strong engineering skills with the ability to design, implement, and maintain robust, scalable systems
  • Experience optimizing GPU workloads and performance engineering for high-throughput ML pipelines
  • Independent thinker with a strong sense of ownership and ability to deliver from first principles to production-quality systems
  • Curiosity and problem-solving mindset for working at the intersection of AI, physics, chemistry, and biology
  • Experience building and maintaining cluster infrastructure with Kubernetes and Terraform
  • Expertise in GPU programming, XLA, Triton, CUDA, or deep learning compiler stacks
  • Familiarity with molecular systems (proteins, small molecules, 3D structures), ML force fields, or point cloud data
  • Experience contributing to highly collaborative, cross-functional teams in research or production ML environments

Benefits

  • Competitive salary and equity package
  • Comprehensive health benefits: medical, dental, and vision fully covered for employees
  • 401(k) plan
  • Open (unlimited) PTO policy and paid family leave (maternity and paternity)
  • Life insurance, plus long-term and short-term disability insurance
  • Free meals at office locations and other employee perks
  • Opportunities for growth, mentorship, and hands-on impact in cutting-edge molecular AI research

Responsibilities

  • Lead engineering efforts for building and scaling distributed ML training and inference infrastructure across GPU clusters and cloud environments
  • Optimize model efficiency in terms of throughput, latency, memory, and GPU utilization, pushing hardware to its performance limits
  • Design and implement MLOps tools and frameworks for automated, reliable deployment and evaluation of AI models
  • Collaborate with researchers and cross-functional teams to integrate infrastructure with generative and predictive AI workflows
  • Drive long-term platform vision, contributing to architectural decisions, tooling improvements, and best practices
  • Mentor junior engineers and research interns, fostering a culture of technical excellence and innovation

Full Description

This position is posted by Jobgether on behalf of a partner company.

We are currently looking for an ML Infrastructure Engineer (Staff / Principal) in California (USA).

This role offers the opportunity to lead the development and optimization of cutting-edge ML infrastructure for large-scale generative and predictive AI models.

You will work at the intersection of machine learning, physics, and computational chemistry, driving scalable, high-performance systems that accelerate AI research in molecular modeling.

The position involves designing distributed training pipelines, optimizing GPU operations, and building robust MLOps frameworks that push the boundaries of AI performance.

You will collaborate closely with researchers, engineers, and scientists, mentoring junior team members while contributing to long-term technical strategy.

This is a hands-on, high-impact role where your work directly enables groundbreaking discoveries in molecular AI.

Accountabilities:

• Lead engineering efforts for building and scaling distributed ML training and inference infrastructure across GPU clusters and cloud environments.

• Optimize model efficiency in terms of throughput, latency, memory, and GPU utilization, pushing hardware to its performance limits.

• Design and implement MLOps tools and frameworks for automated, reliable deployment and evaluation of AI models.

• Collaborate with researchers and cross-functional teams to integrate infrastructure with generative and predictive AI workflows.

• Drive long-term platform vision, contributing to architectural decisions, tooling improvements, and best practices.

• Mentor junior engineers and research interns, fostering a culture of technical excellence and innovation.

Requirements:

• Extensive experience in distributed ML training and inference on large-scale GPU clusters.

• Proficiency in PyTorch, PyTorch Lightning, PyTorch Geometric, Ray, or similar frameworks.

• Strong engineering skills with the ability to design, implement, and maintain robust, scalable systems.

• Experience optimizing GPU workloads and performance engineering for high-throughput ML pipelines.

• Independent thinker with a strong sense of ownership and ability to deliver from first principles to production-quality systems.

• Curiosity and problem-solving mindset for working at the intersection of AI, physics, chemistry, and biology.

Nice to Have:

• Experience building and maintaining cluster infrastructure with Kubernetes and Terraform.

• Expertise in GPU programming, XLA, Triton, CUDA, or deep learning compiler stacks.

• Familiarity with molecular systems (proteins, small molecules, 3D structures), ML force fields, or point cloud data.

• Experience contributing to highly collaborative, cross-functional teams in research or production ML environments.

Benefits:

• Competitive salary and equity package.

• Comprehensive health benefits: medical, dental, and vision fully covered for employees.

• 401(k) plan.

• Open (unlimited) PTO policy and paid family leave (maternity and paternity).

• Life insurance, plus long-term and short-term disability insurance.

• Free meals at office locations and other employee perks.

• Opportunities for growth, mentorship, and hands-on impact in cutting-edge molecular AI research.

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.

🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience, and achievements.

📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.

🎯 Based on this analysis, we automatically shortlist the three candidates with the highest match to the role.

🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.

The process is transparent, skills-based, and free of bias, focusing solely on your fit for the role.

Once the shortlist is completed, we share it directly with the company that owns the job opening.

The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.

Thank you for your interest!
