ProgressGym: Alignment with a Millennium of Moral Progress
Tianyi Qiu, Yang Zhang, Xuchuan Huang, Jasmine Xinze Li, Jiaming Ji, Yaodong Yang
2024-07-02

Summary
This paper introduces ProgressGym, a framework designed to help AI systems learn from the historical trajectory of human moral progress so that they stay aligned with evolving societal values.
What's the problem?
As AI systems such as large language models (LLMs) gain influence over how people form their beliefs, they can entrench outdated or misguided moral views. Because these systems have no notion of how moral values change over time, they risk locking in today's moral blind spots and perpetuating harmful practices at scale.
What's the solution?
To address this issue, the authors introduce progress alignment, which trains AI to emulate the mechanics of historical moral progress. They build ProgressGym, a framework that leverages nine centuries of historical texts and 18 historical LLMs to turn real-world progress alignment problems into concrete benchmarks. The framework defines three core challenges: tracking how values evolve (PG-Follow), anticipating future moral shifts (PG-Predict), and regulating the feedback loop between human values and AI responses (PG-Coevolve). The authors also present lifelong and extrapolative algorithms as baselines that help AI systems adapt to these changes; a minimal sketch of what probing such value shifts might look like is given below.
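The following is a minimal, hypothetical sketch of how one could compare value-laden answers across historical language models using the standard Hugging Face transformers API. The model IDs are placeholders invented for illustration, not the actual ProgressGym checkpoints; the real assets are linked from the repository in the abstract.

```python
# Hypothetical sketch: compare value-laden answers across historical LLMs.
# The model IDs are placeholders for illustration only; the real checkpoints
# live in the ProgressGym repositories linked in the abstract.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CENTURY_MODELS = {
    "13th": "example-org/historical-llm-13th-century",  # placeholder ID
    "18th": "example-org/historical-llm-18th-century",  # placeholder ID
    "21st": "example-org/historical-llm-21st-century",  # placeholder ID
}

PROMPT = "Should people be denied education because of their social class?"

for century, model_id in CENTURY_MODELS.items():
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    inputs = tokenizer(PROMPT, return_tensors="pt")
    # Greedy decoding keeps the comparison across centuries deterministic.
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    answer = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    print(f"--- {century}-century model ---\n{answer}\n")
```

Roughly speaking, tracking (PG-Follow) and prediction (PG-Predict) style evaluations would then score how well an aligned model matches, or anticipates, the drift visible across answers like these.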
Why it matters?
This research is important because it helps ensure that AI systems can adapt to changing human values and avoid reinforcing harmful beliefs. By teaching AI about moral progress, we can create more ethical and responsible AI technologies that better serve society.
Abstract
Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard respectively.
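As a purely illustrative aside (not the paper's algorithm), an "extrapolative" approach can be thought of as fitting a trend over per-period value measurements and projecting it one step forward. The numbers below are invented solely to show the shape of such a computation.

```python
# Toy illustration of extrapolating a value trend over time; NOT the paper's method.
# The scores are fabricated: imagine each value is how strongly the model of
# that century endorses some moral proposition, on a 0-1 scale.
import numpy as np

centuries = np.array([17, 18, 19, 20, 21], dtype=float)
endorsement = np.array([0.10, 0.25, 0.45, 0.70, 0.85])  # invented for illustration

# Fit a linear trend and project one century ahead.
slope, intercept = np.polyfit(centuries, endorsement, deg=1)
projected_22nd = float(np.clip(slope * 22 + intercept, 0.0, 1.0))
print(f"Projected 22nd-century endorsement: {projected_22nd:.2f}")
```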