CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
Tianhui Liu, Hetian Pang, Xin Zhang, Jie Feng, Yong Li, Pan Hui
2025-10-31
Summary
This research explores how to use artificial intelligence, specifically large vision-language models (LVLMs), to understand the socio-economic conditions of cities from images such as Google Street View photos and satellite imagery. The goal is to help track progress toward global sustainability goals by automatically assessing indicators such as wealth and poverty levels across different areas.
What's the problem?
While these AI models are getting better at understanding images and language together, they currently struggle to accurately and reliably predict socio-economic status just from looking at pictures of cities. They also aren't very good at explaining *why* they made a certain prediction, making it hard to trust their results or apply them to new places. Essentially, the models see the pictures but don't 'reason' about them in a way that leads to useful insights.
What's the solution?
The researchers developed a new system called CityRiSE that uses reinforcement learning to 'train' the AI model. Think of it like teaching a dog a trick with rewards: CityRiSE rewards the AI when its answer is verifiably correct and its reasoning is clearly laid out, which steers it toward the visual cues that actually matter for socio-economic status, like the condition of buildings or the types of businesses present. This guides the AI to develop a structured way of reasoning about the images, leading to more accurate predictions.
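To make the reward idea concrete, here is a minimal sketch of what a 'verifiable' reward for this task might look like. The summary does not include the paper's actual reward code, so the `<think>`/`<answer>` tag format, the five-level status scale, and the partial-credit scheme below are illustrative assumptions in the style of recent verifiable-reward RL recipes, not CityRiSE's published implementation.

```python
# Hypothetical sketch of a verifiable reward for socio-economic level
# prediction. The tag format, 5-level scale, and partial-credit scheme
# are assumptions for illustration, not the paper's published design.
import re

LEVELS = 5  # assumed: status discretized into 5 ordinal levels

def verifiable_reward(completion: str, true_level: int) -> float:
    """Score one model completion against a ground-truth level.

    Combines a format reward (did the model emit structured
    <think>...</think><answer>...</answer> reasoning?) with an
    accuracy reward (is the predicted level correct?).
    """
    # Format reward: require structured, goal-oriented reasoning output.
    pattern = r"<think>.+?</think>\s*<answer>\s*(\d+)\s*</answer>"
    match = re.search(pattern, completion, flags=re.DOTALL)
    if match is None:
        return 0.0  # malformed output earns nothing

    format_reward = 0.2  # small bonus for well-formed reasoning

    predicted = int(match.group(1))
    if not 1 <= predicted <= LEVELS:
        return format_reward  # well-formed but out-of-range answer

    # Accuracy reward: full credit for an exact match, partial credit
    # for near misses, since the levels are ordinal (assumed choice).
    distance = abs(predicted - true_level)
    accuracy_reward = max(0.0, 1.0 - distance / (LEVELS - 1))
    return format_reward + 0.8 * accuracy_reward
```

The important property is that both components can be checked by a simple program rather than by a human or a learned judge, which is what makes the reward signal cheap and reliable enough for large-scale RL training.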
Why it matters?
This work is important because it shows how we can combine AI with reinforcement learning to create tools that can automatically and accurately assess urban conditions. This could be incredibly valuable for city planners, policymakers, and organizations working on sustainable development, allowing them to identify areas that need investment and track the impact of interventions. It also makes the AI's decisions more understandable and reliable, and allows it to work well even in cities it hasn't 'seen' before.
Abstract
Urban socio-economic sensing, which harnesses publicly available, large-scale web data such as street view and satellite imagery, is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle to produce accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and a verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE, with its emergent reasoning process, significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for predictions on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.
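The abstract says the model is trained with 'pure reinforcement learning' but does not name the algorithm in this excerpt. A common choice for verifiable-reward fine-tuning of LVLMs is a GRPO-style group-relative update; the sketch below illustrates that idea under that assumption: sample several answers per image, score each with the verifiable reward, and reinforce the answers that beat their siblings.

```python
# Hypothetical sketch of a GRPO-style group-relative advantage
# computation. The abstract does not name the RL algorithm; GRPO is a
# common choice for verifiable-reward training and is assumed here.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within a group of sampled completions.

    For each image prompt, the policy samples several candidate
    answers; each candidate's advantage is its reward relative to the
    group mean, scaled by the group's standard deviation. Candidates
    that beat their siblings get positive advantages and have their
    token probabilities pushed up by the policy-gradient update.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are computed relative to the sampled group, this style of update needs no separate learned value network, which keeps 'pure RL' training of a large model comparatively simple.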