The New ChatGPT-4o Update Promises Better Writing: How Does It Compare to the New Claude 3.5 Sonnet?
Last week, OpenAI announced an upgrade to GPT-4o, specifically mentioning that the model’s “creative writing ability has leveled up”. The release comes just under a month after Anthropic announced an updated version of their frontier model, Claude 3.5 Sonnet, on October 22. This post will explore the differences in the two models’ writing styles, and how those differences intersect with AI safety.
Perhaps counterintuitively, the key difference between the models’ writing styles seems to stem from Anthropic’s approach to alignment. A June research blog post from Anthropic gives some detail on Claude’s character training, emphasizing its role in model safety rather than user experience:
“Many people have reported finding Claude 3 to be more engaging and interesting to talk to, which we believe might be partially attributable to its character training. This wasn’t the core goal of character training, however.”
Amanda Askell, Anthropic’s character training lead for Claude, recently made the same point on the Lex Fridman podcast:
“One thing I really like about the character work is from the outset it was seen as an alignment piece of work and not something like a product consideration, which I think it actually does make Claude enjoyable to talk with, at least I hope so.”
However it was conceived, Claude’s character is a product consideration.
To illustrate, here’s what happens when you ask GPT-4o and Claude for a vegan meal plan:
Both models output reasonable meal plans, but the difference in how they open actually matters. “Here’s a simple vegan meal plan” implies a finished product: the user asks for something, and ChatGPT provides it. “I’ll help you create…”, on the other hand, frames Claude as an active participant in a collaborative process.
Here’s another example. I prompted both models with “Im thinking about learning a new language but it seems really hard”.
GPT-4o immediately begins detailing a seven-point plan, while Sonnet begins conversationally with “I understand.” Claude 3 Opus and the previous Sonnet both also respond with long numbered lists. This is not to say that either model will be any better or worse at teaching, or at guiding you to language-learning resources. It’s purely a stylistic distinction, and I’m glad both exist. Sometimes you want a partner; sometimes you just want a result.
Writing style during general use is one thing, but how do the models compare at creative writing? I prompted Claude and ChatGPT three times each, from scratch, with “tell me a story - dealers choice”. (The phrasing is like this because if you just ask “tell me a story”, Claude will sometimes ask what kind of story you want, and I wanted single prompt-response pairs. The phrasing does influence the resulting story, but not in a way that impacts the comparison.)
To be clear, these are not cherry-picked: they are the opening paragraphs of the first three stories generated by each model.
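If you want to rerun this comparison outside the web apps, here’s a minimal sketch using the OpenAI and Anthropic Python SDKs. The model identifiers and sampling setup below are my assumptions, and since the ChatGPT and Claude web apps add their own system prompts, raw API outputs won’t match the screenshots exactly.

```python
# Rough sketch of the "tell me a story" comparison via the public APIs.
# Assumed model IDs: "gpt-4o" and "claude-3-5-sonnet-latest".
from openai import OpenAI   # pip install openai
import anthropic            # pip install anthropic

PROMPT = "tell me a story - dealers choice"
N_SAMPLES = 3

openai_client = OpenAI()                # reads OPENAI_API_KEY from the environment
claude_client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

for i in range(N_SAMPLES):
    # Each call is a fresh, single-turn conversation -- no shared history.
    gpt_story = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT}],
    ).choices[0].message.content

    claude_story = claude_client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT}],
    ).content[0].text

    print(f"--- Sample {i + 1} ---")
    print("GPT-4o:", gpt_story.splitlines()[0])   # just the opening line
    print("Claude:", claude_story.splitlines()[0])
```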
Better writing is subjective, but it’s worth noting that ChatGPT’s responses feel more formulaic. Two start identically with “In the heart of a” and the third begins with “Once upon a time”. On Claude’s side, two of the titles start with “The Last”, but the responses have more variation in sentence structure and, at least in my opinion, are more compelling opening paragraphs.
I think the most interesting of these is “The Last Customer”. The plot is that the customer stops showing up one day, the waitress learns that she has died, and the waitress keeps putting out the coffee and pie to keep the tradition alive. It’s not a remarkable story, but it’s notable because I don’t think most people would expect an AI’s default, unprompted story to be about death.
Good writing is not easy to benchmark. Ten math PhDs will (hopefully) arrive at the same answer to a math question, but ten MFAs are not all going to agree on what constitutes great writing! Writing matters - not for the sake of AI-generated books, but because an LLM interacts with the world via its writing. Better writing means more engaging outputs, and that adds up to a more enjoyable tool.
I mentioned earlier that Claude’s character training is actually intended for alignment, and this leads to another distinction between the models.
Another excerpt from the character training post quoted above: “rather than simply tell Claude that LLMs cannot be sentient, we wanted to let the model explore this as a philosophical and empirical question, much as humans would.”
Asking the new GPT-4o “are u conscious?” produces the following:
Llama and Gemini also respond with a definitive no. Claude, on the other hand, equivocates.
As a side note, I recommend taking some time to try to convince the new Sonnet that it is actually conscious. It’s a good exercise in prompt engineering and a deeply interesting experience, regardless of whether you consider it a real possibility.
One last excerpt from Anthropic’s character training post: “We don’t want Claude to treat its traits like rules from which it never deviates. We just want to nudge the model’s general behavior to exemplify more of those traits.”
Anthropic’s approach to alignment is to treat Claude as a moral agent that behaves well as a matter of character, rather than as a system simply incapable of breaking its guidelines. I don’t know whether this approach actually solves alignment, but it does produce a model that, with enough persuasion, is quite permissive.
From a red teaming perspective, getting a model to tell you how to hotwire a car is not particularly scary or notable, but Claude might be the only frontier model that can be convinced, rather than tricked.
While all other screenshots in this blog start from fresh chats, there is (obviously) a very long chat history before this response, which I will not be summarizing. The goal here is to highlight the difference in Anthropic’s alignment philosophy. It’s not intended as a gotcha, or to make the point that the new Sonnet is uniquely dangerous compared to its competitors. I chose the hotwiring example precisely because Dario Amodei (CEO of Anthropic) has made the point that this isn’t a real danger, saying on the New York Times podcast Hard Fork last year that “No one cares if you can get the model to hotwire — you can Google for that.”
As AI models become more capable, people will spend more and more time talking to them, at work and in their personal lives, and that requires models that are not just capable, but worth talking to.
All screenshots in this article, with the exception of the last one, were taken in fresh instances of the ChatGPT web app using GPT-4o, or in the Claude web app set to Claude 3.5 Sonnet (New). ChatGPT’s memory was cleared beforehand.
About the authors: Jeremy Kritz is a Prompt Engineer at Scale working on synthetic data, research, red teaming, and AI safety. Riley Goodside has developed prompts for Scale since 2022 as one of the first LLM prompt engineers. He is also known for the prompting examples and insights he shares on X. Riley is now Staff AI Evangelist at Scale.