Enhancing Sentiment Classification and Irony Detection in Large Language Models through Advanced Prompt Engineering Techniques

Marvin Schmitt, Anne Schwerk, Sebastian Lempert

2026-01-16

Summary

This paper explores how to get better results from two large language models, GPT-4o-mini and gemini-1.5-flash, when they're used to figure out whether a piece of text is positive, negative, or neutral, a task called sentiment analysis.

What's the problem?

Large language models are really good at understanding and generating text, but they don't always accurately determine the emotional tone, or 'sentiment,' of what they're reading. Simply asking the model to classify sentiment isn't always enough, especially in harder cases such as sarcasm and irony, or when the sentiment is aimed at one specific aspect of a text rather than the text as a whole.

What's the solution?

The researchers tested different 'prompting' techniques to see if they could improve the models' performance. Prompting is basically how you phrase your question to the AI. They tried 'few-shot learning' (putting a few worked examples in the prompt for the model to follow) and 'chain-of-thought prompting' (encouraging the model to explain its reasoning step by step). They also used 'self-consistency,' where the model generates several answers to the same prompt and the most common one is taken as the final answer. They then compared each technique against a baseline prompt that simply asks the question directly. They measured how well the models did using accuracy, recall, precision, and F1 score.
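The techniques described above can be sketched in a few lines of Python. This is a minimal illustration, not the prompts or code used in the paper: the prompt wording, the function names, and the `ask` callable (a stand-in for whatever client sends a prompt to a model and returns its reply) are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

LABELS = ("positive", "negative", "neutral")
TASK = "Classify the sentiment of the text as positive, negative, or neutral."


def baseline_prompt(text: str) -> str:
    """Baseline: ask the question directly (wording is hypothetical)."""
    return f"{TASK}\nText: {text}\nSentiment:"


def few_shot_prompt(text: str, examples: Sequence[Tuple[str, str]]) -> str:
    """Few-shot: put labelled examples in the prompt before the new text."""
    shots = "\n".join(f"Text: {t}\nSentiment: {s}" for t, s in examples)
    return f"{TASK}\n{shots}\nText: {text}\nSentiment:"


def chain_of_thought_prompt(text: str) -> str:
    """Chain-of-thought: ask for step-by-step reasoning before the label."""
    return (
        f"{TASK}\nText: {text}\n"
        "Think step by step, then end with a line 'Sentiment: <label>'."
    )


def extract_label(reply: str) -> Optional[str]:
    """Return the label mentioned last in the reply, or None if none appears."""
    lowered = reply.lower()
    found = [(lowered.rfind(label), label) for label in LABELS if label in lowered]
    return max(found)[1] if found else None


def self_consistent_label(
    ask: Callable[[str], str], prompt: str, n_samples: int = 5
) -> Optional[str]:
    """Self-consistency: sample several replies and keep the majority label.

    `ask` is a hypothetical callable; it should sample with some randomness,
    otherwise every reply is identical and the vote adds nothing.
    """
    votes = Counter()
    for _ in range(n_samples):
        label = extract_label(ask(prompt))
        if label is not None:
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None
```

In practice `ask` would wrap a real model call; for a quick check it can be replaced by a function that returns canned replies, which is enough to confirm that the vote picks the majority label.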

Why does it matter?

The study found that using these advanced prompting techniques significantly improved sentiment analysis. Importantly, the *best* technique depended on the specific model and the task. For example, giving examples ('few-shot') worked best with GPT-4o-mini, while getting the model to think step by step ('chain-of-thought') improved irony detection with gemini-1.5-flash by up to 46%. This shows that you can't just use one prompting strategy for all AI models and tasks; you need to design your prompts to match both the model and the complexity of what you're trying to analyze.

Abstract

This study investigates the use of prompt engineering to enhance large language models (LLMs), specifically GPT-4o-mini and gemini-1.5-flash, in sentiment analysis tasks. It evaluates advanced prompting techniques like few-shot learning, chain-of-thought prompting, and self-consistency against a baseline. Key tasks include sentiment classification, aspect-based sentiment analysis, and detecting subtle nuances such as irony. The research details the theoretical background, datasets, and methods used, assessing performance of LLMs as measured by accuracy, recall, precision, and F1 score. Findings reveal that advanced prompting significantly improves sentiment analysis, with the few-shot approach excelling in GPT-4o-mini and chain-of-thought prompting boosting irony detection in gemini-1.5-flash by up to 46%. Thus, while advanced prompting techniques overall improve performance, the fact that few-shot prompting works best for GPT-4o-mini and chain-of-thought excels in gemini-1.5-flash for irony detection suggests that prompting strategies must be tailored to both the model and the task. This highlights the importance of aligning prompt design with both the LLM's architecture and the semantic complexity of the task.
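For readers unfamiliar with the four metrics named in the abstract, the sketch below shows one common way to compute them for a three-class sentiment task. It assumes macro averaging (each class weighted equally); the abstract does not state which averaging the authors used, and the function name is hypothetical.

```python
from typing import Dict, Sequence


def classification_metrics(
    gold: Sequence[str], predicted: Sequence[str]
) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging is an assumption made for this sketch.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")

    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    labels = sorted(set(gold) | set(predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)  # true positives
        fp = sum(g != label and p == label for g, p in pairs)  # false positives
        fn = sum(g == label and p != label for g, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Accuracy alone can hide a class the model never predicts correctly, which is why precision, recall, and F1 are usually reported alongside it.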
