Performance Trade-offs of Optimizing Small Language Models for E-Commerce
Josip Tomo Licardo, Nikola Tankovic
2025-10-31
Summary
This paper explores whether smaller, more efficient AI models can perform as well as huge, expensive ones for specific tasks, focusing on understanding what customers want when shopping online.
What's the problem?
The really good AI language models, like GPT-4.1, are too big and costly to use directly for everyday business needs like figuring out what a customer means when they type a search query on an e-commerce site. Running these large models requires a lot of computing power, takes time, and is therefore expensive, making them impractical for many companies.
What's the solution?
Researchers took a smaller, open-source AI model (Llama 3.2 with one billion parameters) and customized it for understanding e-commerce searches. They created fake search queries to train the model, then used techniques called QLoRA and post-training quantization (GPTQ and GGUF) to make it even smaller and faster. They tested how well it worked on different computer hardware, like GPUs and CPUs.
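To make the "smaller and faster" step concrete, here is a minimal numpy sketch of the core idea behind 4-bit post-training quantization: store each group of weights as small integers plus one shared scale, and reconstruct approximate weights at inference time. This is an illustration of the principle only, not the actual GPTQ or GGUF algorithms (which additionally use calibration data and more elaborate grouping schemes); the group size and weight values are made up for the example.

```python
import numpy as np

def quantize_4bit(weights, group_size=32):
    """Symmetric per-group 4-bit quantization: each group of weights
    shares one scale, and values are stored as integers in [-8, 7]."""
    w = weights.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # largest value maps to 7
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Recover approximate FP32 weights. On GPUs without native 4-bit
    kernels, this extra step is the dequantization overhead that can
    make quantized inference slower despite the memory savings."""
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)
q, scales = quantize_4bit(w)
w_hat = dequantize(q, scales)

# Storage: half a byte per weight plus one 2-byte scale per group,
# versus 2 bytes per weight for an FP16 baseline.
fp16_bytes = w.size * 2
q4_bytes = w.size // 2 + (w.size // 32) * 2
print(f"compression vs FP16: {fp16_bytes / q4_bytes:.1f}x")
print(f"max reconstruction error: {np.abs(w - w_hat).max():.5f}")
```

The trade-off the paper measures follows directly from this structure: the integer storage shrinks memory use, but every forward pass has to pay for the multiply-by-scale reconstruction unless the hardware can compute on the 4-bit values directly.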
Why does it matter?
The study showed that this smaller, optimized model matched the much larger GPT-4.1 at recognizing customer intent, at a significantly lower cost. It also found that the best way to optimize the model depends on the hardware: techniques that speed things up on CPUs can actually slow inference on some GPUs. Together, these results show that smaller, specialized AI models are a practical and effective alternative to relying on massive, general-purpose models.
Abstract
Large Language Models (LLMs) offer state-of-the-art performance in natural language understanding and generation tasks. However, the deployment of leading commercial models for specialized tasks, such as e-commerce, is often hindered by high computational costs, latency, and operational expenses. This paper investigates the viability of smaller, open-weight models as a resource-efficient alternative. We present a methodology for optimizing a one-billion-parameter Llama 3.2 model for multilingual e-commerce intent recognition. The model was fine-tuned using Quantized Low-Rank Adaptation (QLoRA) on a synthetically generated dataset designed to mimic real-world user queries. Subsequently, we applied post-training quantization techniques, creating GPU-optimized (GPTQ) and CPU-optimized (GGUF) versions. Our results demonstrate that the specialized 1B model achieves 99% accuracy, matching the performance of the significantly larger GPT-4.1 model. A detailed performance analysis revealed critical, hardware-dependent trade-offs: while 4-bit GPTQ reduced VRAM usage by 41%, it paradoxically slowed inference by 82% on an older GPU architecture (NVIDIA T4) due to dequantization overhead. Conversely, GGUF formats on a CPU achieved a speedup of up to 18x in inference throughput and a reduction of over 90% in RAM consumption compared to the FP16 baseline. We conclude that small, properly optimized open-weight models are not merely a viable alternative but a more suitable one for domain-specific applications, offering state-of-the-art accuracy at a fraction of the computational cost.
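The parameter efficiency of the QLoRA fine-tuning described above comes from LoRA's low-rank update: the frozen base weight W is augmented with a trainable product B·A of rank r, so only the two small factors are trained. The following numpy sketch shows just that low-rank mechanism (QLoRA additionally keeps W in 4-bit precision); the hidden size and rank are illustrative values, not the paper's configuration.

```python
import numpy as np

d, r = 2048, 16  # hidden size and LoRA rank (illustrative values only)

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d)).astype(np.float32)              # frozen base weight
A = rng.normal(scale=0.01, size=(r, d)).astype(np.float32)  # trainable factor
B = np.zeros((d, r), dtype=np.float32)                      # trainable, zero-init

def lora_forward(x):
    # Base projection plus the rank-r update B @ A. Because B starts at
    # zero, the adapted model initially behaves exactly like the base model.
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d)).astype(np.float32)
assert np.allclose(lora_forward(x), x @ W.T)  # identical before training

full = W.size          # parameters in the full weight matrix
lora = A.size + B.size  # parameters actually trained
print(f"trainable params: {lora:,} of {full:,} ({100 * lora / full:.2f}%)")
```

With these example sizes the trainable factors hold well under 2% of the full matrix's parameters, which is why a 1B model can be fine-tuned this way on modest hardware.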