
NRGBoost: Energy-Based Generative Boosted Trees

João Bravo

2024-10-07


Summary

This paper presents NRGBoost, an energy-based generative boosting algorithm that extends tree-based machine learning models so they can model the full distribution of tabular data rather than only make predictions.

What's the problem?

Even though deep learning models dominate unstructured data, tree-based methods like Random Forests and Gradient Boosted Decision Trees (GBDT) are still widely used for structured data in fields like finance and healthcare. However, these methods are typically discriminative: they predict a single target variable without modeling the underlying data distribution. This prevents them from supporting tasks like sampling new data points or answering inference queries about arbitrary variables.

What's the solution?

To address this issue, the authors propose NRGBoost, an energy-based generative boosting algorithm that builds upon existing tree-based methods. Instead of fitting trees to predict a target, NRGBoost uses them to explicitly model the data density (up to a normalization constant), so the model captures the structure of the data itself. The training procedure is analogous to the second-order boosting implemented in popular packages like XGBoost, but the resulting model can handle inference tasks over any input variable. The authors demonstrate that NRGBoost achieves discriminative performance similar to GBDT on various real-world tabular datasets while also being competitive with neural networks for generating new samples.

Why it matters?

This research is important because it shows how traditional machine learning techniques can be improved by integrating generative approaches. By enhancing the capabilities of tree-based models, NRGBoost could lead to better predictions and more effective data analysis in various applications, making it a valuable tool for researchers and practitioners working with structured data.

Abstract

Despite the rise to dominance of deep learning in unstructured data domains, tree-based methods such as Random Forests (RF) and Gradient Boosted Decision Trees (GBDT) are still the workhorses for handling discriminative tasks on tabular data. We explore generative extensions of these popular algorithms with a focus on explicitly modeling the data density (up to a normalization constant), thus enabling other applications besides sampling. As our main contribution we propose an energy-based generative boosting algorithm that is analogous to the second order boosting implemented in popular packages like XGBoost. We show that, despite producing a generative model capable of handling inference tasks over any input variable, our proposed algorithm can achieve similar discriminative performance to GBDT on a number of real world tabular datasets, outperforming alternative generative approaches. At the same time, we show that it is also competitive with neural network based models for sampling.
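The abstract's claim that the model can "handle inference tasks over any input variable" follows from modeling a joint density: conditioning on some variables and normalizing over the rest only requires the unnormalized density. The sketch below illustrates this with a hypothetical joint energy over one feature and a binary target; the energy function is invented for the example and is not the paper's model.

```python
import math

# Hypothetical joint energy f(x, y) over a feature x and a binary
# target y; in NRGBoost this would be a sum of trees over (x, y).
def joint_energy(x, y):
    # Toy energy that rewards agreement between sign(x) and y.
    return 1.5 if (x >= 0) == (y == 1) else -1.5

def conditional(y_values, x):
    # p(y | x) is proportional to exp(f(x, y)); normalizing over the
    # candidate y values needs no global normalization constant.
    weights = [math.exp(joint_energy(x, y)) for y in y_values]
    Z = sum(weights)
    return [w / Z for w in weights]

# Discriminative use: infer the target given the feature.
probs = conditional([0, 1], x=2.0)
```

The same mechanism works in the other direction (conditioning on y and normalizing over x), which is what distinguishes this generative formulation from a purely discriminative GBDT.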