
Posted on 2026/02/07

Platform Engineer

Dautom

Dubai - United Arab Emirates

Full-time


The Platform Engineer is responsible for architecting, building, and operating high-performance AI infrastructure to support advanced AI workloads, including LLMs, GenAI, Computer Vision, and MLOps.

This role will focus on managing GPU clusters (NVIDIA A100/H100), deploying and maintaining Red Hat OpenShift AI (RHODS), and ensuring secure, scalable, and cost-efficient AI platforms across the company's Sovereign Cloud and hybrid/multi-cloud environments.

The engineer will enable enterprise-grade AI adoption for over 200 government entities.

Key Responsibilities:

• GPU & AI Platform Architecture: Design and implement GPU-based compute clusters; define reference architectures for LLM hosting, Vector Databases, MLOps, and high-performance storage/networking.

• Deliverables & KPIs: Fully operational GPU-based AI infrastructure; GPU Cluster Uptime and Performance Utilization; Reduction in Cost per Training/Inference Workload.

• GPU Cluster Operations: Install, configure, and optimize core components: CUDA, cuDNN, NCCL, NVIDIA Drivers, and GPU Operators. Implement GPU partitioning, scheduling, and performance tuning.

• Deliverables: High-availability architecture for all AI workloads, complete documentation, and runbooks.
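As an illustration of the GPU partitioning mentioned above, one common approach on NVIDIA A100/H100 clusters is Multi-Instance GPU (MIG). The sketch below shows a Kubernetes pod requesting a single MIG slice; it assumes the NVIDIA GPU Operator is installed with MIG enabled, and the pod name, image tag, and MIG profile are illustrative rather than taken from this posting:

```yaml
# Illustrative only: assumes the NVIDIA GPU Operator with the "single" MIG
# strategy, which exposes MIG slices as extended resources on the node.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-demo           # hypothetical pod name
spec:
  restartPolicy: Never
  containers:
    - name: inference
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example image tag
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1  # one 1g.10gb MIG slice of an A100 80GB/H100
      command: ["nvidia-smi", "-L"]  # prints the single MIG device visible to the pod
```

Scheduling against a MIG slice rather than `nvidia.com/gpu: 1` is what lets several tenants share one physical GPU with hardware isolation.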

• OpenShift AI (RHODS) Management: Deploy, configure, and maintain the Red Hat OpenShift AI (RHODS) platform for multi-tenant use.

• Deliverables & KPIs: Production-ready OpenShift AI (RHODS) platform; AI Project Onboarding Speed.

• LLM & Model Serving: Build and manage infrastructure for hosting and serving open-source LLM frameworks and supporting RAG pipelines, LoRA adapters, and Vector Databases.

• Deliverables & KPIs: Multi-model LLM serving environment for entities; MLOps Pipeline Success Rate and Deployment Frequency.
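To make the model-serving responsibility concrete: on OpenShift AI, models are typically served through KServe. A hedged sketch of an InferenceService backed by a vLLM-style serving runtime follows; the runtime name, namespace, and model URI are hypothetical placeholders, not details from this posting:

```yaml
# Illustrative KServe InferenceService; assumes a vLLM ServingRuntime is
# already registered in the cluster. All names and URIs are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-demo
  namespace: ai-serving
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM                       # must match the runtime's declared format
      runtime: vllm-runtime              # hypothetical ServingRuntime name
      storageUri: s3://models/demo-llm   # hypothetical model location
      resources:
        limits:
          nvidia.com/gpu: "1"            # one GPU reserved for inference
```

The same InferenceService pattern extends to multi-model serving by deploying one object per model (or per LoRA-adapted variant) behind a shared runtime.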

• MLOps & Automation: Implement Infrastructure as Code (IaC) and GitOps for the automated lifecycle management of the AI platform.

• Deliverables & KPIs: Infrastructure automation via Terraform & Ansible; Automation Coverage for AI Infrastructure.
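The IaC/GitOps responsibility above can be sketched with a declarative GitOps object. For example, an Argo CD Application that keeps the AI platform's manifests in sync with a Git repository; the repository URL, path, and namespaces below are hypothetical placeholders:

```yaml
# Illustrative GitOps sketch: Argo CD continuously reconciles the cluster
# against the manifests stored in Git. Repo details are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/ai-infra.git
    targetRevision: main
    path: manifests/openshift-ai
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-platform
  syncPolicy:
    automated:
      prune: true      # delete cluster resources removed from Git
      selfHeal: true   # revert out-of-band manual changes
```

Under this model, "automation coverage" can be measured as the share of platform resources managed through Git rather than changed by hand.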

Required Qualifications & Experience:

• 7–12 years in Cloud Infrastructure, DevOps, ML Infrastructure, or Platform Engineering.

• Deep Hands-On Expertise with GPU Systems (NVIDIA A100/H100), Linux, Containers, and Kubernetes.

• Experience with OpenShift AI (RHODS) or equivalent Kubernetes GPU orchestration.

• Familiarity with LLM Hosting and supporting Vector Databases.

Essential Skills & Competencies:

• Technical: Deep understanding of GPU compute, HPC architectures, and ML performance profiling.

• Soft Skills: Strong troubleshooting, optimization, and performance-engineering mindset; excellent cross-functional collaboration and documentation skills.

Preferred Certifications:

• NVIDIA Deep Learning / AI Infrastructure Certification

• Red Hat OpenShift AI specialization

• Kubernetes CKA/CKAD

• Azure AI or Oracle Cloud AI certifications

• Terraform & Ansible certifications

Work Conditions:

• Full-time, on-site position.
