Posted on 2026/02/04
Senior AI Infrastructure Engineer
Assail
Boston, MA, United States
Full Description
Focus: Reliability, Performance, Cost Control, and Secure AI Operations
Level: Senior
Experience
• 7+ years in DevOps, SRE, or Infrastructure Engineering
• 1–2+ years owning production AI workloads, including GPU-backed inference or large-scale model serving
• Demonstrated ownership of production systems under load, not just "set it up once and move on"
Responsibilities
Scalable Inference & Serving
• Deploy and operate our vLLM-based inference stack serving a custom fine-tuned 14B+ parameter security model
• Optimize Time to First Token (TTFT) and tail latency under concurrent load from our 145-agent swarm
• Manage multi-model routing across specialized functions (code analysis, deobfuscation, reasoning)
• Ensure OpenAI-compatible API availability with <100ms p99 latency targets
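A p99 target like the one above is straightforward to check from a latency sample. The sketch below uses the nearest-rank percentile method; the function names are illustrative, not part of any real deployment tooling.

```python
# Sketch: checking a p99 latency target against sampled request latencies.
# Uses the nearest-rank percentile method; names are illustrative.
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

def meets_slo(samples_ms, target_ms=100.0, p=99):
    return percentile(samples_ms, p) < target_ms

latencies = [42.0] * 98 + [85.0, 250.0]  # one slow outlier in 100 requests
print(percentile(latencies, 99))  # 85.0 — p99 tolerates the single worst request
print(meets_slo(latencies))       # True
```

The point of a p99 (rather than average) target is visible here: one 250 ms outlier in 100 requests does not breach the SLO, but a systematic slowdown would.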
GPU Orchestration & Capacity Planning
• Manage and optimize NVIDIA RTX GPU utilization (RTX 5090 / CUDA 12.1+) within our Kubernetes clusters
• Configure GPU passthrough, tensor parallelism, and memory optimization for vLLM inference
• Design scheduling and autoscaling strategies to minimize idle GPU spend while supporting burst agent workloads
• Forecast GPU capacity needs as the agent swarm scales (currently 145 agents across 10 types)
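Capacity planning for a 14B-parameter model starts with back-of-envelope weight memory at each quantization level. The figures below are rough planning numbers, not vLLM's actual allocator behavior, and ignore KV cache, activations, and CUDA context overhead.

```python
# Back-of-envelope GPU memory for serving a 14B-parameter model at
# different quantization levels. Rough planning figures only; excludes
# KV cache, activations, and CUDA context overhead.
def weight_memory_gb(params_billions, bits_per_param):
    """Memory in GB needed just to hold the model weights."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(14, bits):.0f} GB for weights")
# FP16: 28 GB, INT8: 14 GB, INT4: 7 GB
```

On a 32 GB card, FP16 weights alone leave little headroom for KV cache under concurrent agent load, which is why the INT8/INT4 quantization mentioned under Technical Requirements matters for single-GPU serving.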
Kubernetes & Container Operations
• Own our K3s-based Kubernetes infrastructure running on RHEL 10
• Manage StatefulSets for stateful services (Neo4j, Qdrant, Kafka, ZooKeeper, Android emulators)
• Configure HPA (Horizontal Pod Autoscaler) for agent deployments (1–20 replicas per agent type)
• Operate Podman rootless containers with Buildah for secure image builds
• Maintain local container registry and image lifecycle
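The scaling behavior behind the HPA bullet above can be modeled with the core formula from the Kubernetes HPA documentation, clamped to the 1–20 replica band used for agent deployments. This models the scaling decision only, not the controller's stabilization windows or tolerances.

```python
# Sketch of the core Horizontal Pod Autoscaler formula
# (desired = ceil(current * metricValue / metricTarget)), clamped to the
# 1-20 replica band used for agent deployments. Decision logic only.
import math

def desired_replicas(current, metric_value, metric_target,
                     min_replicas=1, max_replicas=20):
    raw = math.ceil(current * metric_value / metric_target)
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(4, 180, 60))  # 12 — load tripled, replicas triple
print(desired_replicas(4, 900, 60))  # 20 — burst clamped at the replica cap
print(desired_replicas(4, 15, 60))   # 1  — quiet period scales down to the floor
```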
Event Streaming & Distributed Systems
• Operate our Apache Kafka 3.6 backbone handling 10,000+ messages/sec for agent coordination
• Monitor consumer lag, partition health, and message throughput across tenant-scoped topics
• Ensure exactly-once delivery semantics for mission-critical agent task distribution
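The sketch below shows the tenant-scoped topic naming from the Security section alongside the producer/consumer settings that exactly-once delivery typically requires, expressed as librdkafka-style configuration keys (the format confluent-kafka uses). It is purely illustrative and touches no broker.

```python
# Sketch: tenant-scoped topic names plus the producer/consumer settings
# exactly-once semantics typically require (librdkafka-style config keys,
# as used by confluent-kafka). Illustrative only; no broker needed.
def tenant_topics(tenant_id):
    return {
        "missions": f"tenant-{tenant_id}-missions",
        "agent_events": f"tenant-{tenant_id}-agent-events",
    }

def eos_producer_config(bootstrap, transactional_id):
    return {
        "bootstrap.servers": bootstrap,
        "enable.idempotence": True,            # no duplicates on retry
        "acks": "all",                         # wait for full ISR acknowledgment
        "transactional.id": transactional_id,  # enables transactional writes
    }

def eos_consumer_config(bootstrap, group_id):
    return {
        "bootstrap.servers": bootstrap,
        "group.id": group_id,
        "isolation.level": "read_committed",  # skip messages from aborted txns
        "enable.auto.commit": False,          # commit offsets inside the txn
    }

print(tenant_topics("acme")["missions"])  # tenant-acme-missions
```

The key pairing is idempotent, transactional producers on one side and `read_committed` consumers committing offsets within the same transaction on the other; either half alone degrades to at-least-once.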
Database Operations (5 Specialized Stores)
• PostgreSQL 16: Tenant data with Row-Level Security, PgBouncer connection pooling
• Neo4j 5.x: Graph database for attack chains and knowledge graphs
• Qdrant 1.7: Vector database for semantic search and pattern matching
• Redis 7: Short-term memory, caching, and pub/sub (24hr TTL patterns)
• MinIO: S3-compatible object storage for APK artifacts and reports
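The Row-Level Security setup behind the PostgreSQL bullet looks roughly like the statements below, here generated as SQL text. The table name (`missions`) and session setting (`app.tenant_id`) are illustrative assumptions, not the platform's actual schema.

```python
# Sketch of the Row-Level Security statements behind tenant isolation in
# PostgreSQL, emitted as SQL text. Table name and setting name
# (missions, app.tenant_id) are illustrative assumptions.
def rls_statements(table="missions"):
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        f"ALTER TABLE {table} FORCE ROW LEVEL SECURITY;",  # applies to table owner too
        (f"CREATE POLICY tenant_isolation ON {table} "
         "USING (tenant_id = current_setting('app.tenant_id')::uuid);"),
    ]

for stmt in rls_statements():
    print(stmt)
# Each connection then sets app.tenant_id before querying, and PostgreSQL
# filters every row to that tenant automatically.
```

One operational caveat worth knowing with PgBouncer in transaction-pooling mode: the tenant setting must be applied per transaction (`SET LOCAL` inside the transaction), since plain session-level `SET` can leak across pooled connections.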
MLOps / LLMOps Pipelines
• Build CI/CD pipelines that support:
• Model weight deployment and LoRA adapter merges
• Configuration and prompt updates
• Automated testing and canary releases for AI features
• Integrate with our custom deployment tooling (ares-cc.py) and Helm charts
• Enable fast rollback when model behavior or inference performance regresses
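An automated rollback gate for canary releases can be as simple as comparing canary metrics against the baseline. The thresholds and metric names below are illustrative assumptions, not the platform's actual policy.

```python
# Sketch of an automated rollback gate for canary model deployments:
# compare canary p99 latency and error rate against the baseline and
# roll back on regression. Thresholds are illustrative assumptions.
def should_rollback(baseline_p99_ms, canary_p99_ms,
                    baseline_error_rate, canary_error_rate,
                    latency_slack=1.10, error_slack=1.5):
    """Roll back if the canary is >10% slower or errors 1.5x more often."""
    if canary_p99_ms > baseline_p99_ms * latency_slack:
        return True
    # Floor of 1% absolute error rate avoids flapping on near-zero baselines.
    if canary_error_rate > max(baseline_error_rate * error_slack, 0.01):
        return True
    return False

print(should_rollback(80.0, 84.0, 0.002, 0.002))   # False — within slack
print(should_rollback(80.0, 120.0, 0.002, 0.002))  # True  — latency regressed
```

Gating on relative regression rather than absolute numbers keeps the check meaningful as traffic and model versions change.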
Observability & Reliability
• Implement end-to-end tracing with Jaeger across gRPC services, Kafka, and model inference
• Extend our Prometheus metrics for:
• Per-agent task duration and failure rates
• vLLM request latency and GPU utilization
• Kafka consumer lag and throughput
• Database query performance (Neo4j graph traversals, Qdrant vector searches)
• Maintain Grafana dashboards for Agent Swarm Overview, Mission Performance, Infrastructure Health, and DAST Metrics
• Design graceful degradation when models time out or agents fail
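One common shape for the graceful-degradation requirement is to bound every model call with a timeout and fall back to a safe default instead of letting an agent hang. The wrapper below is a minimal sketch; the fallback value and timeout are illustrative.

```python
# Sketch of graceful degradation for model calls: bound the wait and
# return a safe fallback instead of letting an agent hang. Names and
# defaults are illustrative.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def call_with_fallback(fn, fallback, timeout_s=5.0):
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    except FuturesTimeout:
        return fallback
    except Exception:
        return fallback  # a crashed model call also degrades, not fails
    finally:
        pool.shutdown(wait=False)  # don't block the caller on a hung call

print(call_with_fallback(lambda: "model verdict", "needs-human-review"))
```

The same pattern extends naturally to routing: on timeout, retry against a smaller or quantized model before giving up, so agents degrade in quality rather than availability.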
Security, Privacy & Isolation
• Enforce multi-tenant isolation at every layer:
• Kubernetes namespaces and NetworkPolicies
• Tenant-scoped Kafka topics (tenant-{id}-missions, tenant-{id}-agent-events)
• Row-Level Security in PostgreSQL
• Filtered vector search in Qdrant by tenant_id
• Manage secrets via HashiCorp Vault (30+ secrets across 6 categories)
• Maintain mTLS for all service-to-service communication (gRPC, database connections)
• Operate container security scanning with Trivy, Cosign (image signing), and Syft (SBOM generation)
• Enforce SELinux in production (RHEL 10)
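The Qdrant side of the isolation scheme hinges on every search carrying a tenant filter. The helper below builds the filter in the JSON shape Qdrant's search API documents (`must` / `key` / `match`); the payload field name `tenant_id` follows the scheme above.

```python
# Sketch: building a tenant-scoped Qdrant search body so every vector
# search is pinned to one tenant_id. Filter shape follows Qdrant's
# documented JSON format; the payload field name follows the scheme above.
def tenant_search_body(vector, tenant_id, limit=10):
    return {
        "vector": vector,
        "filter": {
            "must": [
                {"key": "tenant_id", "match": {"value": tenant_id}}
            ]
        },
        "limit": limit,
    }

body = tenant_search_body([0.1, 0.2, 0.3], "acme")
print(body["filter"]["must"][0]["match"]["value"])  # acme
```

Centralizing this in one helper (rather than hand-building filters at each call site) is what makes "every search is tenant-scoped" auditable: a search that bypasses the helper is a reviewable defect, not a silent leak.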
Technical Requirements
Orchestration & Containers
• Kubernetes (Expert) — K3s, EKS, or equivalent
• Podman and Buildah (rootless container runtime)
• Helm for chart management
• Experience with GPU scheduling, node pools, and StatefulSets
Infrastructure & Cloud
• Infrastructure as Code: Terraform (we deploy on AWS)
• AWS: EKS, IAM, VPC, or equivalent cloud experience
• Experience with local/on-prem GPU environments alongside cloud
Inference & Acceleration
• vLLM (required) — production deployment and optimization
• NVIDIA GPU operations (CUDA, driver management, memory optimization)
• Familiarity with quantization (INT4/INT8 via BitsAndBytes) and model optimization
Data Infrastructure
• Apache Kafka — production operations, consumer groups, exactly-once semantics
• PostgreSQL — Row-Level Security, connection pooling, replication
• Neo4j — graph database operations and Cypher queries
• Qdrant or similar vector database — HNSW indexing, filtered search
• Redis — pub/sub, caching patterns, Streams
Monitoring & Observability
• Prometheus — custom metrics, alerting rules
• Grafana — dashboard creation and maintenance
• Jaeger or OpenTelemetry for distributed tracing
• Experience with gRPC observability
Networking & Security
• gRPC and Protobuf — service mesh patterns, load balancing
• VPC design, private networking, Kubernetes NetworkPolicies
• HashiCorp Vault for secrets management
• TLS/mTLS configuration for databases and services
Required Domain Awareness
• Operating AI systems in adversarial environments (security product context)
• Preventing data leakage across multi-tenant boundaries
• Supporting reproducible, auditable AI outputs for security findings
• Understanding the blast radius of misconfigured AI infrastructure in a pentest platform
Preferred Experience
• Experience with Android emulator farms or mobile device infrastructure at scale
• Prior work supporting security products or regulated data environments
• Familiarity with Frida, mitmproxy, or similar runtime instrumentation tools
• Background in gRPC-based microservices and event-driven architectures
• Experience with supply chain security (Trivy, Cosign, SBOM generation)
What This Role Is Not
• Not a "just keep the cluster green" job
• Not infra-for-infra's-sake
• Not a research or model-tuning role
This role owns uptime, performance, cost discipline, and security posture for an autonomous AI penetration testing platform with a 145-agent swarm operating under hostile inputs.