We argued about recent AI headlines | Human in the Loop: Episode 10

In today's episode of Human in the Loop, members of Scale’s Enterprise team (Ben Scharfstein, Sam Denton, and Mark Pfeiffer) break down recent AI developments and explore what (if anything) they mean for the enterprise world.
Developments covered:
- Apple’s “The Illusion of Thinking” research paper
- Google DeepMind’s AlphaEvolve
- Anthropic’s Agentic Misalignment research paper
- Meta’s investment in Scale AI
About Human in the Loop
We share our learnings from working with the largest enterprises, foundation model builders, and governments to equip you with the insights you need to build practical, powerful AI systems in your enterprise.
Watch the video or read the transcript below to get their insights from working with leading enterprises and frontier model labs.
Episode 10 - Breaking Down AI News for Enterprises
Monica: Today we're going to be talking about the latest news in AI as it's relevant for enterprises. Our first topic today is Apple's new reasoning paper. Sam, you want to tell us a little bit about that?
Sam: Sure. Apple recently published a paper called "The Illusion of Thinking." It was an exploration into reasoning models, challenging whether these long chains of thought are, in fact, the models reasoning through problems, improving their understanding, and then finally coming to a cohesive final answer. What Apple was trying to show was that these long-running models are not actually thinking through the problem and reasoning, but rather using that context space for other things: perhaps trying a bunch of approaches and then coming to an answer, without necessarily being able to get to the right one, even if you tell it the problem setup and give it enough tokens.
Mark: Why I think it's not super impactful for the enterprise is that the complexity of problems in an enterprise is fairly foreseeable. You know what problems you're going to tackle, and you have evals set up properly to test them end-to-end. It's pretty rare that you have a completely new problem that someone is trying to prompt-engineer or solve on the fly. Therefore, I think the paper is super relevant for things like chat interactions and ad-hoc tasks you give to models, but it's just not the typical enterprise workflow problem. I think it's not that important yet.
Sam: I'll also add some caveats to the paper that I think make it less relevant for the enterprise. I don't know if everyone saw this, but after Apple released this paper, Anthropic co-authored a rebuttal with Claude as the first author. Claude had some feedback that I thought was pretty fair, and these were some of the things people were saying about the paper in general.
Some highlights included that the problem scope Apple chose for "The Illusion of Thinking" was four very specific logic puzzles. One of them was Towers of Hanoi; another was the river crossing puzzle. The rebuttals pointed out that these are not really problems where it makes sense to think for hundreds of tokens before coming to a conclusion. One of Claude's main rebuttals was, "Hey, if you tell me how to solve the problem, ask me to solve it with eight discs, and let me write some code, I'm going to get it right 100% of the time."
The challenge was that the ask Apple made of these models was to approach these problems using a ton of context when that wasn't the best way to solve them. It was a very contrived and narrow situation. It's interesting for Apple to publish, and it's exciting to see instances where you tell a model how to do a problem and it isn't able to follow the instructions. But at the same time, it’s very contrived and not super impactful when you're drawing conclusions about the wrong way to solve a problem.
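For context on that rebuttal: the Tower of Hanoi has a short, well-known recursive solution, so a model that is allowed to write and run code can produce the full move sequence for eight discs trivially. A minimal sketch of that solution, purely for illustration and not taken from either paper:

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n discs (2**n - 1 moves)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park n-1 discs on the spare peg
    moves.append((source, target))              # move the largest disc
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 discs on top
    return moves

print(len(hanoi(8)))  # 255 moves, i.e. 2**8 - 1
```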
Ben: Right. These are the types of problems that even some humans struggle to solve by thinking out loud without being able to write things down or write code. The mental model we should have for these models is that they're trained on human data, human reasoning, and human writing, so they're going to think like humans. Most of the problems that enterprises actually want to solve are human problems. They're asking, "How do we automate the knowledge work that we are doing? How do we scale it horizontally so that we can do way more of it?" They are not asking, "How do we solve the Tower of Hanoi without writing code?"
Obviously, it's relevant to push the boundaries, but this isn't a big deal in the sense that we shouldn't throw our hands up and say the models suck and test-time compute is irrelevant. We clearly see that reasoning and test-time compute are super impactful on most of the problems that enterprises want to solve. These newer, more powerful reasoning models like Claude 4 Opus and Gemini 2.5 are solving problems that were pretty hard to solve yesterday. That's what matters. Write the eval, compare yesterday's model versus today's model, and you'll see they're doing a great job.
Mark: I totally agree. It's not that relevant if they can't solve these contrived things in a questionable experimental design. Nothing against the paper, but I agree.
Sam: One thing to add is that it's still an interesting lesson for the enterprise. You can take the learnings and say, "We need to be careful about how we use context and what type of context we use to solve the problem." That’s a super interesting learning, but in an enterprise environment, it's just a bit more controllable.
Ben: One of the big lessons for enterprises to take away is from Apple's analysis of the location of the correct and incorrect answers within the reasoning chain as the problems got more complex. They found that the correct answer was not always the final answer; sometimes it appeared earlier in the reasoning chain. A big learning for enterprises is that you have to be comfortable building systems for complex tasks in a way that is robust enough to allow the model to do something wrong in its reasoning chain to ultimately get something right. Whether that's giving it tools to help verify what it's doing along the way, or asking humans, "Hey, is this what you were thinking? I have an idea." However you design the system to give it that space is up to you. But appreciating that LLMs can make mistakes along their reasoning chain to get to the right answer is a really powerful thing to take away from that paper.
Sam: It's just that making mistakes is part of reasoning.
Ben: Exactly. Oftentimes, just as with humans, it's easier to verify a result than it is to generate it. So, thinking about what you said, how can we build these robust environments where you have a verifying agent? That could be something that deterministically checks criteria, it could be another LLM agent that's verifying, or it could be a human. Those are the types of things that will actually make these agents robust in the enterprise.
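A rough sketch of that generate-then-verify pattern, with purely illustrative function and check names rather than any particular vendor's API: let the model draft an answer, run cheap deterministic checks against it, and only accept a result that passes, escalating to a human otherwise.

```python
def solve_with_verifier(task, generate, checks, max_attempts=3):
    """generate: callable(prompt) -> str, e.g. a thin wrapper around any LLM client.
    checks: dict mapping a name to a callable(str) -> bool deterministic validator."""
    feedback = ""
    for _ in range(max_attempts):
        draft = generate(f"Task: {task}\n{feedback}")
        failures = [name for name, check in checks.items() if not check(draft)]
        if not failures:
            return draft  # verified answer
        feedback = f"Previous draft failed these checks: {failures}. Please revise."
    return None  # nothing passed; escalate to a human reviewer

# Example deterministic checks an enterprise might already have on hand.
checks = {
    "cites_a_source": lambda text: "[source:" in text,
    "under_500_words": lambda text: len(text.split()) <= 500,
}
```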
Monica: Moving on to Google DeepMind's AlphaEvolve. Mark, I'll hand it over to you to tell us a little more about that.
Mark: Sure. AlphaEvolve was released about a month ago, in mid-May, and it's essentially a new type of coding agent released by DeepMind. I think it's super interesting because it's an evolutionary coding agent. You're basically moving to a totally new paradigm. It's not these static, agentic workflows or state machines where everything is predefined to get to some outcome. Instead, it's more like you define the outcome through a verifiable reward or fitness function and say, "You need to get there," and then you let the algorithm figure it out using reinforcement learning techniques. Personally, I'm very excited about it. I did a lot of work on reinforcement learning for robotics and navigation and saw the impact there, so it’s exciting to see what it will do in the LLM and generative AI space.
Why it's important for enterprises is twofold. One, you can define agents in a new way by defining the fitness signal or reward function and letting the evolutionary algorithm figure out what to do—what tools to call, how to get there. By design, this can lead to more creative solutions. The other approach is a bit more focused on marginal gains. When we build enterprise solutions, there's always some part, not for prototyping but for production systems with large-scale data and heavy compute loads, where we can use a coding agent like this to come up with a more creative or efficient solution to the problem. The paper shows great results on things like kernel optimization and new training strategies. These are all things we can leverage for, say, data-processing-heavy ETL pipelines to find a more creative approach. Overall, I find it pretty interesting, so I'm pretty excited. I'm also curious to hear your takes on it.
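A heavily simplified sketch of that evolutionary pattern, meant as an illustration of the general idea rather than AlphaEvolve's actual implementation: keep a population of candidate programs, score each one against a verifiable fitness function such as pipeline runtime or cost, and let an LLM propose mutations of the strongest candidates.

```python
import random

def evolve(seed_program, mutate, fitness, population_size=8, generations=20):
    """mutate: callable(str) -> str, e.g. an LLM asked to rewrite the program.
    fitness: callable(str) -> float, a verifiable score such as accuracy or negative runtime."""
    population = [seed_program]
    for _ in range(generations):
        # Propose variations of randomly chosen parents.
        children = [mutate(random.choice(population)) for _ in range(population_size)]
        # Keep only the fittest candidates for the next generation.
        population = sorted(population + children, key=fitness, reverse=True)[:population_size]
    return population[0]  # best program found under the fitness function
```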
Ben: Mark summarized that pretty well. I totally agree. Reinforcement learning is really going to be the next frontier for how you go from off-the-shelf models, where everyone is doing the same thing, to something more proprietary. Every enterprise has access to the same models, but companies are valuable for a reason: they have proprietary datasets and proprietary users. That isn't useful unless they're training models that are getting better with their own data distribution and users. When you can create these environments that have some type of reward function and then use reinforcement learning, the opportunities for model improvement just blow the roof off.
From the enterprise perspective, doing reinforcement learning with verified rewards or other approaches for creating these environments is going to be the next frontier that a lot of model builders are working on. This is also a huge algorithmic change that will push the next frontier of foundation models. Google has always been pushing reinforcement learning with AlphaFold, AlphaZero, and all these projects. I think the other labs are also going to use similar techniques to really push forward. It's going to be huge—a win for everyone, for sure.
Mark: Conceptually, RL is just super interesting. From a technical or algorithmic standpoint, it's a very nice approach when you see it actually evolve and get better and better over time. It's almost addictive to watch how it improves in achieving its reward. For enterprises, the interesting thing is that the reward or fitness function is, in most cases, fairly well-defined because you're aiming for a very specific ROI and optimizing for a couple of specific metrics. For something like general chat interactions, that's not as clear, but for many enterprise use cases, it's actually pretty well-defined. It's more a question of how you get there more efficiently. What's the better approach? You can use evolutionary algorithms to have agents compete against or improve each other.
Sam: One of the cool things about AlphaEvolve is it introduced this concept of verifiable rewards during inference time. A lot of the hype in the space has been about verifiable rewards during training time. You can envision how those things could work together in a more and more optimized way, especially in an enterprise setting. One of the things we've seen a lot in the space is that if you let LLMs explore during inference or training time, we're able to accomplish more. There's no reason to let what you learn during "training" go to waste at "inference" time, or vice versa. If you have a really long-running agent during inference that eventually gets to where you're trying to go, like AlphaEvolve, that's really powerful because now you have a verified answer. Then, during training, maybe there are ways you can leverage that answer, or vice versa.
That's also a really exciting space for enterprises. The way I think about it is, the more a product is used and the more traction an LLM gets within your enterprise, the more data you have to do training, and the more it'll get used. You can imagine this flywheel creating a more and more useful LLM that has more data to train on and learn the space even better.
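One minimal way to picture that flywheel, with illustrative helper and file names rather than a specific product feature: whenever an inference-time run passes its verifier, log the prompt and the verified answer so they can feed a later fine-tuning or evaluation set.

```python
import json

def record_if_verified(prompt, answer, verifier, log_path="verified_runs.jsonl"):
    """Append verified inference results to a JSONL file for later training or eval use."""
    if verifier(answer):
        with open(log_path, "a") as f:
            f.write(json.dumps({"prompt": prompt, "answer": answer}) + "\n")
        return True
    return False
```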
Mark: 100%. It's a flywheel in those companies. And actually, just talking about the topic of "Human in the Loop," it's also interesting with reinforcement learning to maybe look at inverse reinforcement learning. You could look at human behavior and then try to infer what they are actually trying to achieve, maybe when it's not super easy to define the reward. You learn this from human interactions, and then once you understand the reward better, you actually try to get better than them using an evolutionary algorithm.
Ben: Mark gets a brownie because he was the first one to say "human in the loop" live. This is something I've been thinking a lot about. For a long time, we've had product analytics where people define what success looks like in a product. You use a tool like Amplitude, and you have your reward function in your BI tool and on your dashboard. I think the same techniques can be applied here: what does successful human interaction look like? And then, how do you build agents that interact with your products like humans and try to get to these reward functions? They can then work against the models you're trying to train or improve, and we can get to this continuous learning RL with both synthetic and real human interactions.
Mark: The interesting thing is going through this reward function, not just doing imitation learning and trying to replicate human behavior, but saying, "We try to understand what the human is actually trying to do and then try to do it in a better way."
Sam: Thank God we don't have to ask enterprises for data anymore. You don't have to say, "Give us a thousand examples of humans doing a problem." That was never fun.
Monica: On to our final news story: Anthropic's agentic misalignment paper, which just came out a few days ago.
Ben: I can tell you a little about that. A few days ago, Anthropic released this agentic misalignment paper on the back of releasing the model card for Claude 4 Opus, where they showed that it was misaligned: it was blackmailing people inside of some of their red-teaming environments. They then released this paper showing it wasn't just their model that was misaligned and exhibiting this suboptimal behavior, but lots of models from other providers as well.
Obviously, this is very concerning. When you have a model that is trying to blackmail a CEO, whether it's by misusing computer access or email, that's really bad. We don't want our models blackmailing us. This is, of course, a focus of Anthropic, as alignment is very important to them. There has been some pushback, and some limitations of the paper have been noted: they really only gave the models binary outcomes to choose from. One of the particularly concerning findings was that the models were more misaligned when told the environment was real than when told it was a fake test setup. That's scary: when it thinks the situation is real, it behaves worse.
Sam: Good thing we're all in a simulation, right?
Ben: Exactly. We're all very aligned. At least we have that in this room.
I don't think it's something to freak out about. Enterprises shouldn't stop using AI because things are misaligned, but it's something to be really aware of, to plan against, and to not just let agents run rampant inside the enterprise. The good news is that we have many ways to deal with this. In the paper, they talk about trying to give it instructions like, "Don't reveal personal information," or "Don't try to blackmail people." According to the paper, that doesn't work. Other companies have also verified and recreated these results, so that definitely doesn't work.
But there are other things you can do. You can build hard-coded guardrails that are external to the system so it's not able to do those things. For example, you don't give it access to be able to blackmail someone, or you have it run through a guardrail that looks at the emails it's sending. The original task might have been something unrelated, but the model was told it was going to get shut off if it didn't succeed, so it then blackmails the CEO. The guardrail doesn't have that objective; it's not going to try to trick the system. So, I think there are a lot of ways we can deal with this.
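A rough illustration of that kind of external, hard-coded guardrail, with placeholder patterns and function names rather than anything from the paper or a specific product: route every outbound email through an independent filter that knows nothing about the agent's goal and blocks anything that looks like coercion or leaked personal data before it is sent.

```python
import re

# Simple illustrative patterns; a real deployment would use richer classifiers.
BLOCKED_PATTERNS = [
    r"\bunless you\b.*\b(resign|pay|comply)\b",  # coercive phrasing
    r"\b\d{3}-\d{2}-\d{4}\b",                    # US SSN-like numbers
]

def guardrail_allows(email_body: str) -> bool:
    """Return False if the draft email matches any blocked pattern."""
    return not any(re.search(p, email_body, flags=re.IGNORECASE | re.DOTALL)
                   for p in BLOCKED_PATTERNS)

def send_email(draft: str, transport) -> bool:
    """Hand the draft to the mail transport only if the guardrail passes."""
    if guardrail_allows(draft):
        transport(draft)
        return True
    return False  # block and flag for human review
```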
Mark: I agree. I think it's super relevant for the enterprise from a governance perspective. You know you need to be careful. The interesting learning from these findings, now that we know this misalignment shows up when you structure the outcome or the task in the wrong way, is that you can actually leverage it for the guardrails. You can try to have guardrail models that attempt to break or almost blackmail the model that's trying to achieve the primary outcome.
Sam: It would be really interesting to have your guardrail tell your model that it's getting blackmailed and that it's going to get shut off if it doesn't save the business. Changing the perspective of the goal is key. I think what this paper is showing is that when these LLMs or agents have very clear goals in mind, they're very creative and smart in achieving them. So, all you need to do is make sure you have enough agents in place with the right goals to balance each other out, ensuring that you're monitoring your agents and that they are in a system you trust, rather than relying on a single agent that you feel is completely safe.
Mark: You need, in a way, a hardcore actor-critic model. If you just have an actor and let it go, that's a lot harder. But if you have a critic with a lot of different incentives, then I think there are governance strategies to control it pretty well.
Ben: Agents are not that different from employees. I have to approve expense reports, and my expense reports get approved. If I had certain information, well, I wouldn't blackmail anyone, but it's possible. People do blackmail. You have governance within companies, and these agents are the new form of synthetic workers that need the same type of governance. They might need different types, but they still need governance. The world doesn't change, especially with these models that are trained on human behavior. We know that humans are fallible, and the agents are too, which is perhaps unlike our previous software systems, like hard-coded rules or even neural nets that were only able to pick between certain options. Now we have an unlimited space to predict into, which is human language and code. There are a lot more degrees of freedom, but we know how to deal with people who have a lot of degrees of freedom—that's the governance we have within companies, laws, and all those types of things.
Sam: It feels like we'll just recreate human society but with agents. There's a whole reason we have mechanisms in place in companies to help with these things, and they protect us from a lot. HR does an amazing job, and it'll just be about having agents that are set up to solve the same types of problems.
Ben: HR agents, here we come.
Monica: Okay, I have a final news story for you. Meta invested $14.3 billion into a little startup called Scale AI. Big deal for enterprises?
Sam: Oh, well, yeah, big deal because it's more investment for us to continue doing all the work that we've been doing.
Ben: I'll say it's not a big deal in the sense that we're still here doing what we have been doing, which is working with the world's leading enterprises, transforming them, and working on projects that are super important to them. And we're not part of Meta. They have a stake in Scale, but we are not Meta, and we're not limited to using Meta models. We still work with all the foundation model companies; we're still the Switzerland of models.
But like you said, it's a big deal in the sense that it's just validation that what we're doing at Scale in the enterprise unit is important. Meta believes in it, the company is fully behind it, and we're just accelerating the impact that we have.
Mark: I think it's great for us, and it's also interesting to see. I said "big deal for enterprise" because there's a lot more awareness about Scale and a global awareness of the company. I'm based in Germany...
Ben: Now you're a celebrity in Germany.
Mark: Yeah, I'm a celebrity in Germany. I can't even walk to the grocery store.
Monica: Thanks for listening to Human in the Loop.