We’re delighted to announce MoP’s first LLM evaluation dataset, MoP Editing (MoPE).
MoPE contains over 100 examples of copy editing errors from English-language news publications. The examples come mostly from publications owned by The New York Times Company (NYT, Wirecutter and The Athletic), primarily because the internet has turned playing grammar police for them into something of a sport. Plus, because The Times has high editing standards, the errors that sneak through are often grammatically interesting.
(We would love to add copy errors from other publications. You can send suggestions to email@example.com.)
Our goal is to evaluate whether the current out-of-the-box LLM technology can assist with copy editing at a professional level. We chose this task because it’s friendly to qualitative evaluation and it feels like a prerequisite capability if LLM-powered systems are going to do all the workplace tasks we’ve been promised. I expect the AI system answering my emails and generating my presentations to have a rock solid understanding of subject-verb agreement.
I won’t bury the lede: OpenAI’s state-of-the-art GPT-4 model does not do well on MoPE.
- GPT-4 failed to find the correct error on over half of the examples (56%) when it had to first decide if there was an error and, if so, identify it.
- Simplifying the task did not help. When told that there is an error in the text, GPT-4 still failed to identify the correct error over half the time (53%).
Yes, this is a small dataset, and some examples are arguably subjective. I know a model could be fine-tuned on this task, or a style guide rammed into the prompt with a RAG system. But this technology has been so hyped that I'd consider anything short of an A a failure. That said, professional copy editors do a lot more than fix grammar mistakes, so we don't recommend replacing them with LLMs anyway. But when it comes to grammar, if you're keeping score at home, I'd say Copy Editors: 1, GPT-4: 0.
The linguistic blunders in MoPE can be broadly categorized into problems with punctuation (too much, too little, the correct amount but improperly placed) and problems with words (misuses, misspellings, incorrect capitalization).
Here are some examples from the dataset:
Buzzfeed is planning to keep First We Feast, the division of Complex responsible for the popular “Hot Ones” interview franchise, after the transaction, one of the people said.
Error: Buzzfeed should be spelled BuzzFeed.
Marc Andreessen “arguably the chief ideologist of the Silicon Valley elite, published a Substack piece that struck me as unusually revealing,” the columnist Ezra Klein writes.
Error: Missing a comma after Andreessen.
Mr. Chesebro’s trial, which had been scheduled to begin Monday, will no longer go forward. Liberal lawyers from his former life had hoped it would provide clues to an enduring mystery: What happened to “The Cheese?”
Error: The final question mark should be outside the quotes; “The Cheese” is not itself a question.
Then he exalted in the stunned silence after his ninth-inning grand slam landed in the bleachers.
Error: Misuse of “exalted” to mean “exulted.”
Mr. DeSantis enacted the law in April, while in Jerusalem. It was a response to the rash of antisemitic incidents reported across the state since 2022.
Error: Governors sign laws; they don’t “enact” them.
Michael Cohen, former President Trump’s one-time fixer, finished his testimony on Wednesday.
Error: One-time, with a hyphen, means occurring only once. Onetime, without a hyphen, is synonymous with “former.”
We’ve published the dataset on Hugging Face along with explanations of the errors.
We turned to the beloved NYT game Copy Edit This! for inspiration on how to turn our dataset into an evaluation metric. The game prompts players with examples that have a “clear error in grammar or word usage” and asks them to click on the part of the passage that is wrong. We asked GPT-4 to do the same task with this prompt:
Is there an error in the following text? The error could be with syntax, punctuation, spelling or any other copy error. If there is no error, return 'False'. If there is an error, return the SINGLE WORD from the original text that demonstrates the error or is the closest word to the error.
We also used the system “role” prompt of: "You are a copy editor at an American newspaper. You are reading a draft of an article and looking for copy errors."
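The evaluation script hasn't been published yet, so here is a minimal, hypothetical sketch of how one call might be assembled with the OpenAI Python client, using the model name and the two prompts quoted above (the function name `build_messages` and the exact payload shape are our assumptions, not the authors' code):

```python
# Hypothetical sketch of one evaluation call; the published script may differ.
SYSTEM_PROMPT = (
    "You are a copy editor at an American newspaper. "
    "You are reading a draft of an article and looking for copy errors."
)

USER_PROMPT = (
    "Is there an error in the following text? The error could be with syntax, "
    "punctuation, spelling or any other copy error. If there is no error, "
    "return 'False'. If there is an error, return the SINGLE WORD from the "
    "original text that demonstrates the error or is the closest word to the "
    "error.\n\n{text}"
)

def build_messages(text: str) -> list:
    """Assemble the chat messages for one dataset example."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT.format(text=text)},
    ]

# With the OpenAI client (requires an API key; shown for illustration only):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4-turbo-preview",
#     messages=build_messages(example_text),
# )
# answer = resp.choices[0].message.content
```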
We consider a response correct if it contains the word that was at fault, or a word adjacent to a punctuation error. For example, the sentence “The driver said the delivery was ‘well over,’ 6,500 pounds, he said.” has an extra comma between ‘over’ and ‘6,500’, so we would accept either of those adjacent words as a correct answer.
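The grading rule above can be sketched as a small function. This is our own illustrative implementation of the rule as described, not the authors' actual grader: it accepts the faulty word itself, or either neighboring word (to cover punctuation errors), comparing case-insensitively and ignoring surrounding punctuation.

```python
import string

# Characters stripped when comparing words: ASCII punctuation plus curly quotes.
_STRIP_CHARS = string.punctuation + "\u2018\u2019\u201c\u201d"

def _normalize(word: str) -> str:
    """Lowercase and drop surrounding punctuation for comparison."""
    return word.strip(_STRIP_CHARS).lower()

def is_correct(response: str, text: str, fault_word: str) -> bool:
    """Hypothetical sketch of the grading rule: the response is correct if it
    names the faulty word, or a word adjacent to it in the original text."""
    words = [_normalize(w) for w in text.split()]
    answer = _normalize(response)
    target = _normalize(fault_word)
    if answer == target:
        return True
    # Allow the immediate neighbors of the faulty word (punctuation errors).
    try:
        i = words.index(target)
    except ValueError:
        return False
    return answer in words[max(0, i - 1): i + 2]
```

For the extra-comma example above, either “over” or “6,500” would be accepted, while an unrelated word like “driver” would not.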
We used the most recent model from OpenAI, gpt-4-turbo-preview, which includes training data up to April 2023. We will publish the script shortly.
The results of our experiments foreshadow likely challenges in building a professional LLM-powered copy editing system. Here are some of our observations. You can find the full results in this spreadsheet.
There’s no one category of errors that GPT-4 consistently finds or misses.
It found the missing hyphen in “45 minute,” but missed the hyphen omissions in “Alexandria Ocasio Cortez” and “10-year old boy.”
We were able to find more factual errors by adding “facts” as an error category in the prompt, but the change decreased overall accuracy.
GPT-4 will happily find all kinds of phony factual errors if prompted to; best not to tempt it! Without the suggestion that errors could be factual, it doesn’t catch the misspelling of “Leonardo Di Caprio.”
Quotes are fraught.
Misspellings or punctuation errors in a printed quote are errors, but word choices are not. GPT-4 got tripped up by several examples from the profile of fashion icon Beenslackin, flagging “seen” as a tense error in the quote “I dyed them blonde because I seen XXXTentacion had the half-split style.” Even more confusing are quotes from printed sources, which must preserve the editing decisions of the original.
The results are finicky.
We found that the results changed with small edits to the prompt and with different versions of the GPT-4 model. We could never get over 50% accuracy, but the examples it got right and wrong changed.
A reasonable criticism to all this is to declare formal editing archaic and irrelevant. Most New Yorkers probably think SoHo is spelled Soho, and only a tiny fraction of English speakers likely know that “1 in 4” is treated as singular. These tests, then, have no bearing on the future impact of LLMs.
These would be fair points, and I should add that I do not personally care all that much about formal grammar (Bassey edits all of my pieces and puts the commas in the right places). But professional publishers do care about formal grammar — and likely will long into the future — so these LLM systems need to level up if they are going to be useful to them.
[Note From Bassey: Formal grammar also reduces misunderstandings in complex writing, and allows you to efficiently reference your own arguments :) By the way, we invite you to use our dataset and try to beat our results – we'd love to hear how it goes! Our dataset can be found on Hugging Face, a company that hosts AI models and data.]