Posted on 2026/01/16

Lead Engineer - HPC Operations

Core42

Dubai - United Arab Emirates

Full-time

Full Description

Oversee daily operations of HPC infrastructure, including compute, GPU, storage, networking, and scheduler platforms (e.g., Slurm, Kubernetes).

Drive continuous optimization of system performance, availability, and resource utilization, minimizing downtime and operational risk.

Serve as the primary escalation point for L2 support teams, ensuring rapid diagnosis and resolution of complex incidents and service requests.

Continuously monitor system health and performance using observability platforms such as Prometheus, Grafana, and DCGM, proactively identifying issues.

Manage user environments for AI and ML workloads, including container orchestration (Docker, Kubernetes) and workflow platforms such as MLflow and Kubeflow.

Define and enforce scheduling policies, priorities, partitions, and quotas within Slurm and Kubernetes to ensure fairness, efficiency, and workload optimization.

Lead root cause analysis (RCA) activities, produce post-mortem documentation, and implement preventive and continuous improvement actions.

Drive automation initiatives using scripting and infrastructure-as-code tools to improve reliability, repeatability, and operational efficiency.

Provide technical leadership, mentorship, and guidance to junior engineers; contribute to skills development and operational best practices.

Participate in on-call rotations as required.

Ensure adherence to security, operational, and compliance policies; support audits and maintain documentation for change, incident, and access management processes.

Required Skills & Qualifications

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.

Minimum of 8 years of experience in HPC operations, systems engineering, or DevOps roles, with at least 2 years in a leadership or ownership capacity.

Advanced expertise in designing, configuring, operating, and optimizing complex HPC environments, including hardware, software, and storage systems.

Hands-on experience managing Slurm clusters and/or Kubernetes-based platforms supporting AI/ML workloads.

Deep knowledge of GPU resource management, workload scheduling, and performance tuning for AI and machine learning use cases.

Strong proficiency with monitoring and observability tools such as Prometheus, Grafana, and DCGM.

Advanced scripting and automation skills using Python, Bash, Ansible, and Terraform.

Strong Linux administration skills (RHEL, CentOS, Ubuntu) and solid understanding of high-speed networking (RDMA, InfiniBand, RoCE) and storage technologies (NFS, Lustre, Ceph).

Preferred Skills & Qualifications

Experience operating large-scale, multi-tenant AI or research computing platforms.

Familiarity with MLOps frameworks and production ML pipelines.

Strong documentation, communication, and cross-functional collaboration skills.

Experience working in regulated or sovereign cloud environments.
