V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms
Javier J. Poveda Rodrigo, Mohamed Amine Ahmdi, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini
2025-03-25
Summary
This paper shows how to make AI language models run faster on a newer type of computer chip based on RISC-V, an architecture that is gaining popularity because it is open-source and not controlled by any one company.
What's the problem?
Most big AI language models run on specialized computer chips called GPUs, which are expensive. Regular computer processors (CPUs) could be a cheaper option, but they are not as well optimized for running these AI models, especially newer ones designed for reasoning tasks.
What's the solution?
The researchers optimized AI models for a specific RISC-V chip, the Sophon SG2042. They focused on two AI models that are good at reasoning tasks and found ways to make them run much faster on this chip, speeding up both how quickly the AI processes input text and how quickly it generates new text by roughly 3 times compared to their starting baseline.
Why does it matter?
This work matters because it could make it cheaper and easier for more people to use powerful AI language models. By making these models run well on open-source chips like RISC-V, it could lead to more innovation and competition in AI technology, rather than having it controlled by a few big companies that make GPUs.
Abstract
The recent exponential growth of Large Language Models (LLMs) has relied on GPU-based systems. However, CPUs are emerging as a flexible and lower-cost alternative, especially when targeting inference and reasoning workloads. RISC-V is rapidly gaining traction in this area, given its open and vendor-neutral ISA. However, the RISC-V hardware for LLM workloads and the corresponding software ecosystem are not fully mature and streamlined, given the requirement of domain-specific tuning. This paper aims to fill this gap, focusing on optimizing LLM inference on the Sophon SG2042, the first commercially available many-core RISC-V CPU with vector processing capabilities. On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 token/s for token generation and 6.54/3.68 token/s for prompt processing, with a speedup of up to 2.9x/3.0x compared to our baseline.
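As a quick sanity check on the abstract's numbers, the reported optimized throughputs and speedups imply the unoptimized baseline token-generation rates. The sketch below only rearranges the figures quoted above (throughput = speedup x baseline); it is not code from the paper:

```python
# Token-generation throughput (token/s) after optimization, from the abstract.
optimized_tps = {
    "DeepSeek R1 Distill Llama 8B": 4.32,
    "DeepSeek R1 Distill QWEN 14B": 2.29,
}
# Reported speedups ("up to") over the unoptimized baseline.
speedup = {
    "DeepSeek R1 Distill Llama 8B": 2.9,
    "DeepSeek R1 Distill QWEN 14B": 3.0,
}

for model, tps in optimized_tps.items():
    # Implied baseline throughput: optimized rate divided by speedup.
    baseline = tps / speedup[model]
    print(f"{model}: {tps:.2f} tok/s optimized -> ~{baseline:.2f} tok/s baseline")
```

This suggests the out-of-the-box baseline generated well under 2 token/s on both models, which is why the domain-specific tuning the abstract describes matters on this platform.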