Posted on 4/30/2025

AI Infrastructure Engineer

beBee Careers

Hong Kong

Full-time

Full Description

Job Description:

We are seeking a skilled and motivated AI Infrastructure Engineer to join our dynamic team. As an integral member of the InfraOps team, you will play a key role in managing and optimizing our GPU-based compute infrastructure (across multiple locations and partners), ensuring maximum performance, scalability, and reliability.

This is a mid-to-senior-level position suited to candidates with a DevOps, SRE, Sales Engineering, or Solution Architect background focused on GPU compute. You should have experience managing GPU-based compute infrastructure, including NVIDIA GPUs and CUDA programming.

Responsibilities:

• Deploy, configure, and maintain GPU-based compute infrastructure, including servers, storage, networking, and the associated software stack.

• Implement robust monitoring and alerting systems to proactively identify performance bottlenecks, resource constraints, and potential failures.

• Develop automation scripts and tools to streamline deployment, configuration, and management of infrastructure components.

• Implement security best practices to safeguard sensitive data and ensure compliance with relevant regulations and industry standards.

• Provide tier-3 support for infrastructure-related issues, investigating root causes and implementing timely resolutions.

• Collaborate with cross-functional teams to forecast resource requirements, plan capacity upgrades, and scale infrastructure to accommodate growing workloads and user demands.

Requirements:

• Experience in infrastructure operations, preferably in a DevOps, SRE, Sales Engineering, or Solution Architect role focused on GPU compute.

• Proficiency in managing GPU-based compute infrastructure, including NVIDIA GPUs and CUDA programming.

• Strong expertise in Linux system administration and scripting (e.g., Bash, Python).

• Experience with configuration management tools (e.g., Ansible, Chef, Puppet) and version control systems (e.g., Git).

• Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes).

• Solid understanding of networking concepts, protocols, and troubleshooting techniques.

• Excellent analytical and problem-solving skills, with a proactive and results-oriented mindset.

• Strong communication skills and the ability to collaborate effectively with cross-functional teams.

• Experience with cloud computing platforms (e.g., AWS, Azure, GCP) and hybrid cloud architectures.

• Knowledge of HPC frameworks and job scheduling systems (e.g., Slurm, PBS Pro).

• Familiarity with GPU-accelerated libraries and frameworks (e.g., TensorFlow, PyTorch, CUDA Toolkit).

• Understanding of cybersecurity principles and practices, including encryption, access controls, and threat detection/prevention.

• Bonus: knowledge of Web3 (cryptocurrency, tokenization of real-world assets (RWAs), mining/staking, etc.).
