PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
Iliass Ayaou, Denis Cavallucci
2025-10-29
Summary
This paper introduces a new way to test how well computer programs understand the meaning of patents, the technical documents that describe inventions. The authors created a large, challenging set of tests called PatenTEB, designed specifically for patents, and also built a family of models, called patembed, that perform well on these tests.
What's the problem?
Existing methods for evaluating how well computers understand text don't accurately reflect the unique difficulties of patent documents. Patents use very specific language, often refer to prior inventions, and require understanding complex technical details. Current tests, designed for general text, don't capture these challenges, making it hard to know if a program *really* understands patents or is just getting lucky.
What's the solution?
The researchers created PatenTEB, a benchmark with over two million examples covering tasks such as finding relevant patents, categorizing them, identifying paraphrases, and grouping similar patents. The benchmark includes realistic scenarios from real patent work, like matching a small fragment of a patent to a full document. They then used it to train a family of models, patembed, ranging from 67 million to 344 million parameters, with a technique called multi-task learning, in which a model learns several tasks at once. This helps the models generalize better.
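Embedding models of this kind are typically trained with a contrastive objective over (fragment, document) pairs, where the other documents in a batch serve as negatives. The paper's exact loss is not reproduced here; the snippet below is a generic sketch of the standard in-batch InfoNCE objective, with toy orthonormal vectors standing in for real patent embeddings.

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """Generic in-batch contrastive (InfoNCE) loss for embedding training.

    q: (B, D) L2-normalized query embeddings (e.g. patent fragments)
    d: (B, D) L2-normalized document embeddings; d[i] is the positive
       match for q[i], and every d[j] with j != i acts as an in-batch
       negative (mined hard negatives would be appended to this set).
    """
    sims = (q @ d.T) / temperature  # (B, B) scaled cosine similarities
    # Row-wise log-softmax; the diagonal entries are the positive pairs.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: orthonormal embeddings where each query matches its document.
q = np.eye(4)
aligned = info_nce_loss(q, q)  # positives dominate, so loss is near zero
shuffled = info_nce_loss(q, np.roll(q, 1, axis=0))  # mismatched positives, large loss
print(aligned, shuffled)
```

Minimizing this loss pulls each fragment embedding toward its source document and pushes it away from the other documents in the batch; hard negatives (superficially similar but non-matching patents) make that push more informative.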
Why does it matter?
This work is important because it provides a better way to measure and improve the ability of computers to work with patents. This can lead to faster and more accurate prior art searches (checking if an invention already exists), better understanding of technology trends, and more efficient patent analysis, ultimately speeding up the innovation process.
Abstract
Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain-specific hard-negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic ablations reveal that multi-task training improves external generalization despite minor benchmark costs, and that domain-pretrained initialization provides consistent advantages across task families. All resources will be made available at https://github.com/iliass-y/patenteb.
Keywords: patent retrieval, sentence embeddings, multi-task learning, asymmetric retrieval, benchmark evaluation, contrastive learning.
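The V-measure cited for BigPatentClustering.v2 is a standard clustering metric: the harmonic mean of homogeneity (each cluster contains only one class) and completeness (each class falls in one cluster). The toy labels below are made up purely to illustrate how the score behaves, using scikit-learn's implementation.

```python
from sklearn.metrics import v_measure_score

# Ground-truth technology classes for six toy patents (illustrative only).
true_labels = [0, 0, 0, 1, 1, 1]

# A clustering that matches the classes perfectly. Cluster IDs are
# arbitrary, so a relabeled but identical partition still scores 1.0.
perfect = [1, 1, 1, 0, 0, 0]

# A clustering that puts one patent in the wrong group.
imperfect = [0, 0, 1, 1, 1, 1]

print(v_measure_score(true_labels, perfect))    # 1.0
print(v_measure_score(true_labels, imperfect))  # strictly between 0 and 1
```

Because the score is permutation-invariant and bounded in [0, 1], it lets clusterings built from different embedding models (here, patembed-base vs. prior systems) be compared directly.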