Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?

Satiyabooshan Murugaboopathy, Connor T. Jerzak, Adel Daoud

2025-08-05

Summary

This paper presents a method that uses satellite images and text data together to predict household wealth more accurately, and it shows that text generated by large language models works better than text retrieved by agents.

What's the problem?

The problem is that models that rely only on satellite images to estimate poverty are not always very accurate, and it is hard to add other kinds of information, such as text, in a way that actually improves the predictions.

What's the solution?

The paper introduces a framework that combines satellite imagery with language information by mapping both into shared codes that represent the two modalities together. It finds that text generated by large language models is more helpful for predicting poverty than text gathered by agent-based retrieval. A rough sketch of this kind of design appears below.
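To make the idea concrete, here is a minimal sketch of one way a shared-code, late-fusion model could look: embeddings from a pretrained image encoder and a pretrained text encoder are each projected into a common code space, concatenated, and passed to a small head that regresses a wealth index. This is an illustrative assumption, not the paper's actual architecture; all names (SharedCodeWealthModel, the dimensions, etc.) are hypothetical.

```python
# Hypothetical sketch: fuse satellite-image and text embeddings in a
# shared code space, then regress a household wealth index.
import torch
import torch.nn as nn

class SharedCodeWealthModel(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, code_dim=128):
        super().__init__()
        # Separate projections map each modality into one shared code space.
        self.img_proj = nn.Linear(img_dim, code_dim)
        self.txt_proj = nn.Linear(txt_dim, code_dim)
        # A small head predicts a scalar wealth index from the fused codes.
        self.head = nn.Sequential(
            nn.Linear(2 * code_dim, code_dim),
            nn.ReLU(),
            nn.Linear(code_dim, 1),
        )

    def forward(self, img_emb, txt_emb):
        z_img = self.img_proj(img_emb)   # code from satellite imagery
        z_txt = self.txt_proj(txt_emb)   # code from LLM-generated text
        fused = torch.cat([z_img, z_txt], dim=-1)
        return self.head(fused).squeeze(-1)

# Usage: embeddings would come from pretrained vision/text encoders.
model = SharedCodeWealthModel()
img_emb = torch.randn(4, 512)   # e.g., features from an image encoder
txt_emb = torch.randn(4, 768)   # e.g., features from a text encoder
wealth_pred = model(img_emb, txt_emb)  # shape: (4,)
```

In a setup like this, the "shared codes" are simply the common embedding space both modalities are projected into; the comparison between LLM-generated and agent-retrieved text would come down to which source produces the more informative txt_emb.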

Why it matters?

This matters because better poverty mapping helps governments and organizations target aid and resources more effectively to the people who need them most, improving the impact of social programs and economic planning.

Abstract

A multimodal framework using satellite imagery and text data outperforms vision-only models in predicting household wealth, with LLM-generated text proving more effective than agent-retrieved text.