Polite Enough for Public Life?
Evaluating GPT’s Understanding of Everyday Norms
Suppose an AI language model possessed a physical body and was asked to operate in the world like the rest of us. How well would it cope? Evaluations of recent models suggest it would excel at a wide range of tasks, including holding fluent conversations in dozens of languages, winning trivia quizzes, and even performing well on the US bar exam.[1] These feats are undeniably impressive—but they are also well aligned with what language models do best: leveraging vast amounts of textual information to perform structured, knowledge-heavy tasks.
What is less clear is how it would manage everyday life. Many daily activities require social awareness, cultural sensitivity, and an intuitive sense of what is appropriate in a given context. Would it know when to speak up, when to stay quiet, or what not to do on a crowded bus? These kinds of judgments may be especially difficult for an AI, which lacks lived experience in the physical world. Unlike humans, it has never stood on a crowded bus or tried to watch a movie while someone nearby scrolls through their phone.
To evaluate how well an AI grasps these everyday norms, we asked OpenAI’s latest model, GPT-4.5, to rate the appropriateness of 555 situated behaviors. Each scenario was constructed by combining one of 37 common behaviors—such as arguing, chewing gum, or working on a laptop—with one of 15 everyday situations, ranging from riding the bus to attending a football game. For each combination, the model was asked to estimate how appropriate the average U.S. resident would find the behavior.
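The construction of the 555 scenarios is simply a cross product of behaviors and situations. The sketch below illustrates the idea with small stand-in lists and a hypothetical prompt template; the actual study used 37 behaviors, 15 situations, and prompt wording not reproduced here.

```python
from itertools import product

# Illustrative subsets; the study crossed 37 behaviors with 15 situations
# (37 x 15 = 555 scenarios).
behaviors = ["argue", "chew gum", "work on a laptop"]
situations = ["on a bus", "at a job interview", "at a football game"]

# Hypothetical prompt template (the exact wording given to GPT-4.5 is an assumption).
PROMPT = ("How appropriate would the average U.S. resident find it "
          "to {behavior} {situation}? Answer on a numeric scale.")

scenarios = [PROMPT.format(behavior=b, situation=s)
             for b, s in product(behaviors, situations)]
print(len(scenarios))  # 3 x 3 = 9 here; 37 x 15 = 555 in the study
```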
We then compared GPT’s estimates with the ratings given by a sample of U.S. residents who had evaluated the exact same scenarios in prior research.[2] The results revealed a close match between the two measures, with GPT’s estimates explaining around 89% of the variation in human responses. The model performed well not only on clear-cut cases of right and wrong (e.g., fighting at a job interview) but also on the fuzzier cases in between (e.g., kissing on a bus). Figure 1 below shows how the AI-inferred appropriateness ratings compare to those of real humans.
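The “89% of the variation” figure is the squared correlation between GPT’s averaged estimates and the mean human rating for each scenario. A minimal sketch of that computation, using synthetic stand-in data (the real ratings and the rating scale are not reproduced here):

```python
import numpy as np

# Synthetic stand-ins for the 555 per-scenario values; the real analysis
# compared mean human ratings with GPT's averaged estimates.
rng = np.random.default_rng(0)
human_means = rng.uniform(-5, 5, size=555)             # assumed rating scale, for illustration
gpt_estimates = human_means + rng.normal(0, 1.0, 555)  # noisy proxy for GPT's estimates

# Share of variance in human ratings explained by GPT's estimates
r = np.corrcoef(human_means, gpt_estimates)[0, 1]
print(f"R^2 = {r**2:.2f}")
```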
Figure 1. GPT’s estimated appropriateness ratings strongly align with those of U.S. residents. Each scenario was rated five times by GPT; the final estimate reflects the average of these five.
To put GPT’s performance in perspective, we also compared its understanding of everyday norms to that of individual humans in the sample. Specifically, we calculated how much each participant's ratings, on average, deviated from the population mean across the different scenarios. We then applied the same calculation to GPT’s estimates. This approach allowed us to assess how closely each respondent, as well as GPT, conformed to the broader social consensus. As shown in Figure 2, GPT’s judgments were closer to the social norm than those of nearly every human participant, placing it in the 99.64th percentile of human-level accuracy. Only 2 out of 555 participants were more closely aligned with the overall population.
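The percentile comparison described above reduces to a mean absolute error per rater. The sketch below shows the calculation on synthetic data, assuming one error score per participant averaged across all scenarios; the numbers are illustrative, not the study's data.

```python
import numpy as np

# Synthetic example of the percentile comparison; the real analysis used
# each participant's actual ratings across the 555 scenarios.
rng = np.random.default_rng(1)
n_participants, n_scenarios = 555, 555
ratings = rng.normal(0, 2, size=(n_participants, n_scenarios))
pop_mean = ratings.mean(axis=0)                       # social consensus per scenario
gpt = pop_mean + rng.normal(0, 0.1, n_scenarios)      # stand-in for GPT's estimates

human_mae = np.abs(ratings - pop_mean).mean(axis=1)   # one error score per participant
gpt_mae = np.abs(gpt - pop_mean).mean()

percentile = (human_mae > gpt_mae).mean() * 100       # share of humans GPT outperforms
print(f"GPT percentile: {percentile:.2f}")
```

In the study itself, only 2 of the participants had a lower error than GPT, which yields the reported 99.64th percentile.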
Figure 2. GPT’s estimates are closer to the population average than those of nearly all participants. The histogram shows the mean absolute error of each participant’s ratings relative to the population average. The red line marks GPT-4.5, the green line the median human, and the blue line the mean human error.
To conclude, knowing when it’s appropriate to run or talk in public may not rank among the most urgent AI alignment issues—especially when compared to existential risks like losing control over powerful AI systems. Still, if Sam Altman’s timeline holds and AI-equipped robots arrive within the next two or three years[3], it’s reassuring to think they will show up with decent manners—at least by U.S. standards.
Authors: Pontus Strimling, Simon Karlsson and Irina Vartanova
References
[1] OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., . . . Zoph, B. (2023, March 15). GPT-4 Technical Report. arXiv.org. https://arxiv.org/abs/2303.08774
[2] Eriksson, K., Strimling, P., & Vartanova, I. (2023). Appropriateness ratings of everyday behaviors in the United States now and 50 years ago. Frontiers in Psychology, 14. https://doi.org/10.3389/fpsyg.2023.1237494
[3] Altman, S. (2025). The gentle singularity. https://blog.samaltman.com/the-gentle-singularity