Posted on 2026/01/16

Senior Specialist - Infrastructure

Group 42

Abu Dhabi - United Arab Emirates

Full-time

Full Description

We are seeking a highly skilled Senior Engineer, HPC Operations, to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads.

This role ensures stable, secure, and high-performing infrastructure leveraging technologies such as Slurm, Kubernetes, and modern MLOps platforms.

The ideal candidate will bring deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments.

Responsibilities:

Provide daily operational support for HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes).

Drive initiatives to optimize the efficiency and performance of HPC systems, ensuring maximum resource utilization and minimizing downtime.

Ensure the timely and effective resolution of incidents and service requests, maintaining system reliability and uptime.

Continuously monitor system health, performance, and utilization using advanced monitoring tools (e.g., Prometheus, Grafana, DCGM).

Manage and support user environments for AI/ML workloads, including container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow).

Define, implement, and manage job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments to ensure fairness, efficiency, and workload optimization.

Conduct root cause analysis (RCA) of operational issues, contributing to post-mortem documentation and driving continuous improvement initiatives.

Provide mentorship and guidance to junior engineers, fostering skills development and a collaborative environment.

Participate in the on-call rotation as needed.

Ensure compliance with security and operational policies, assisting with audits and maintaining documentation for change and incident management processes.

Qualifications:

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.

Minimum of 5 years of experience in HPC operations, systems engineering, or DevOps roles.

Advanced expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems.

Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.

In-depth knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads.

Proficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM.

Strong scripting and automation skills, including Python, Bash, Ansible, and Terraform.

Solid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies (RDMA, InfiniBand, RoCE), and storage solutions (NFS, Lustre, Ceph).
