$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
Kuzma Khrabrov, Anton Ber, Artem Tsypin, Konstantin Ushenin, Egor Rumiantsev, Alexander Telepov, Dmitry Protasov, Ilya Shenbin, Anton Alekseev, Mikhail Shirokikh, Sergey Nikolenko, Elena Tutubalina, Artur Kadurin
2024-06-21

Summary
This paper introduces a new dataset and benchmark called $\nabla^2$DFT, which is designed to improve the training of neural network potentials (NNPs) used in quantum chemistry, particularly for drug discovery.
What's the problem?
Computational quantum chemistry methods provide detailed information about molecules, which is essential for developing new drugs. However, these methods can be very complex and slow, making them hard to use on a large scale. Neural network potentials (NNPs) offer a faster alternative but need large and varied datasets to learn effectively. Without these datasets, NNPs cannot perform well, limiting their usefulness in real-world applications.
What's the solution?
The researchers created the $\nabla^2$DFT dataset, which contains twice as many molecular structures and three times as many conformations as its predecessor, $\nabla$DFT. The dataset covers several kinds of data on molecular properties, such as energies and forces, and adds new tasks for evaluating NNPs. They also developed a benchmark that tests how well NNPs predict molecular properties, predict Hamiltonians, and optimize conformations, and they built an extendable training framework in which they implemented 10 models.
Why does it matter?
This work is important because it provides the necessary resources to enhance the performance of neural network potentials in quantum chemistry. By making it easier to train these models with a rich dataset, researchers can improve drug discovery processes and other areas in chemical science, ultimately leading to better medicines and advancements in technology.
Abstract
Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT that is based on $\nabla$DFT. It contains twice as many molecular structures, three times as many conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($\omega$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.
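To make the idea of a relaxation trajectory concrete, here is a minimal sketch in Python. It is not the paper's method or code: the function `toy_energy_and_forces` is a hypothetical harmonic two-atom potential standing in for a trained NNP (or a DFT calculation), and `relax`, the step size, and the force threshold are illustrative assumptions. The sketch only shows the general pattern: forces are the negative gradient of the energy, and a relaxation trajectory is the sequence of conformations visited while following those forces towards a local energy minimum.

```python
import numpy as np


def toy_energy_and_forces(positions, r0=1.0, k=1.0):
    """Hypothetical stand-in for an NNP: a harmonic bond between two atoms.

    positions: array of shape (2, 3) with Cartesian coordinates.
    Returns the energy and the forces (negative energy gradient).
    """
    delta = positions[1] - positions[0]
    r = np.linalg.norm(delta)
    energy = 0.5 * k * (r - r0) ** 2
    # Gradient of the energy with respect to the position of atom 1.
    grad_atom1 = k * (r - r0) * delta / r
    # Force = -gradient; the gradient for atom 0 is -grad_atom1.
    forces = np.stack([grad_atom1, -grad_atom1])
    return energy, forces


def relax(positions, step=0.1, fmax=1e-4, max_steps=1000):
    """Steepest-descent relaxation; returns a list of (positions, energy)."""
    pos = np.asarray(positions, dtype=float).copy()
    trajectory = []
    for _ in range(max_steps):
        energy, forces = toy_energy_and_forces(pos)
        trajectory.append((pos.copy(), energy))
        # Stop once the largest force component is below the threshold.
        if np.abs(forces).max() < fmax:
            break
        pos = pos + step * forces  # move atoms along the forces
    return trajectory


if __name__ == "__main__":
    # Start from a stretched bond (1.5 length units instead of 1.0).
    start = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
    traj = relax(start)
    final_pos, final_energy = traj[-1]
    print("steps:", len(traj))
    print("final bond length:", np.linalg.norm(final_pos[1] - final_pos[0]))
    print("final energy:", final_energy)
```

In a realistic setting the toy potential would be replaced by a model's energy and force prediction, and plain steepest descent by a more robust optimizer. The quality of the resulting optimized conformation is the kind of thing a conformational optimization benchmark measures.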