Job Description:
We are seeking a skilled and motivated AI Infrastructure Engineer to join our dynamic team. As an integral member of the InfraOps team, you will play a key role in managing and optimizing our GPU-based compute infrastructure (across multiple locations and partners), ensuring maximum performance, scalability, and reliability.
This is a mid-senior level position suited to candidates with a background in DevOps, SRE, Sales Engineering, or Solution Architecture focused on GPU compute. You will have experience managing GPU-based compute infrastructure, including NVIDIA GPUs and CUDA programming.
Responsibilities:
• Deploy, configure, and maintain GPU-based compute infrastructure, including servers, storage, networking, and the associated software stack.
• Implement robust monitoring and alerting systems to proactively identify performance bottlenecks, resource constraints, and potential failures.
• Develop automation scripts and tools to streamline deployment, configuration, and management of infrastructure components.
• Implement security best practices to safeguard sensitive data and ensure compliance with relevant regulations and industry standards.
• Provide tier-3 support for infrastructure-related issues, investigating root causes and implementing timely resolutions.
• Collaborate with cross-functional teams to forecast resource requirements, plan capacity upgrades, and scale infrastructure to accommodate growing workloads and user demands.
Requirements:
• Experience in infrastructure operations, preferably in a DevOps, SRE, Sales Engineering, or Solution Architect role focused on GPU compute.
• Proficiency in managing GPU-based compute infrastructure, including NVIDIA GPUs and CUDA programming.
• Strong expertise in Linux system administration and scripting (e.g., Bash, Python).
• Experience with configuration management tools (e.g., Ansible, Chef, Puppet) and version control systems (e.g., Git).
• Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes).
• Solid understanding of networking concepts, protocols, and troubleshooting techniques.
• Excellent analytical and problem-solving skills, with a proactive and results-oriented mindset.
• Strong communication skills and the ability to collaborate effectively with cross-functional teams.
• Experience with cloud computing platforms (e.g., AWS, Azure, GCP) and hybrid cloud architectures.
• Knowledge of HPC frameworks and job scheduling systems (e.g., Slurm, PBS Pro).
• Familiarity with GPU-accelerated libraries and frameworks (e.g., TensorFlow, PyTorch, CUDA Toolkit).
• Understanding of cybersecurity principles and practices, including encryption, access controls, and threat detection/prevention.
• Bonus: familiarity with Web3 (cryptocurrency, tokenization of real-world assets (RWAs), mining/staking, etc.).