Posted on 2026/02/07
Platform Engineer
Dautom
Dubai - United Arab Emirates
Full Description
The Platform Engineer is responsible for architecting, building, and operating high-performance AI infrastructure to support advanced AI workloads, including LLMs, GenAI, Computer Vision, and MLOps.
This role will focus on managing GPU clusters (NVIDIA A100/H100), deploying and maintaining Red Hat OpenShift AI (RHODS), and ensuring secure, scalable, and cost-efficient AI platforms across the company's Sovereign Cloud and hybrid/multi-cloud environments.
The engineer will enable enterprise-grade AI adoption for over 200 government entities.
Key Responsibilities:
• GPU & AI Platform Architecture: Design and implement GPU-based compute clusters. Define reference architectures for LLM hosting, Vector Databases, MLOps, and high-performance storage/networking.
• Deliverables: Fully operational GPU-based AI infrastructure. KPIs: GPU cluster uptime and utilization; reduction in cost per training/inference workload.
• GPU Cluster Operations: Install, configure, and optimize core components: CUDA, cuDNN, NCCL, NVIDIA Drivers, and GPU Operators. Implement GPU partitioning, scheduling, and performance tuning.
• Deliverables: High-availability architecture for all AI workloads, complete documentation, and runbooks.
• OpenShift AI (RHODS) Management: Deploy, configure, and maintain the Red Hat OpenShift AI (RHODS) platform for multi-tenant use.
• Deliverables: Production-ready OpenShift AI (RHODS) platform. KPI: AI project onboarding speed.
• LLM & Model Serving: Build and manage infrastructure for hosting and serving open-source LLM frameworks and supporting RAG pipelines, LoRA adapters, and Vector Databases.
• Deliverables: Multi-model LLM serving environment for entities. KPIs: MLOps pipeline success rate; deployment frequency.
• MLOps & Automation: Implement Infrastructure as Code (IaC) and GitOps for the automated lifecycle management of the AI platform.
• Deliverables: Infrastructure automation via Terraform & Ansible. KPI: automation coverage across the AI infrastructure.
Required Qualifications & Experience:
• 7–12 years in Cloud Infrastructure, DevOps, ML Infrastructure, or Platform Engineering.
• Deep Hands-On Expertise with GPU Systems (NVIDIA A100/H100), Linux, Containers, and Kubernetes.
• Experience with OpenShift AI (RHODS) or equivalent Kubernetes GPU orchestration.
• Familiarity with LLM Hosting and supporting Vector Databases.
Essential Skills & Competencies:
• Technical: Deep understanding of GPU compute, HPC architectures, and ML performance profiling.
• Soft Skills: Strong troubleshooting, optimization, and performance-engineering mindset; excellent cross-functional collaboration and documentation skills.
Preferred Certifications:
• NVIDIA Deep Learning / AI Infrastructure Certification
• Red Hat OpenShift AI specialization
• Kubernetes CKA/CKAD
• Azure AI or Oracle Cloud AI certifications
• Terraform & Ansible certifications
Work Conditions:
• Full-time, on-site position.
