This paper coined the term Large Reasoning Models (LRMs) after the release of 'o1', in contrast to existing Large Language Models (LLMs), because of the model's ability to plan and reason ahead. Recent LLMs have been outperforming one another on most benchmarks, but one benchmark remained far from saturation:
PlanBench, an evaluation dataset created in 2022 comprising Blocksworld problems of varying sizes (given a scenario and a number of blocks, the model must rearrange the blocks into a specified configuration, moving one block at a time). This makes the dataset well suited to testing a model's reasoning and planning capabilities.
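To make the task concrete, here is a toy sketch of a Blocksworld instance solved by brute-force search. This is only an illustration of the problem structure under simplified assumptions (states as stacks of named blocks, one legal move type: lifting the top block of a stack onto another stack or the table); it is not PlanBench's actual PDDL encoding or evaluation harness.

```python
from collections import deque

def solve_blocksworld(start, goal):
    """Breadth-first search for a shortest Blocksworld plan.

    A state is a tuple of stacks (each a tuple of block names, bottom
    first). The only legal action moves the TOP block of one stack onto
    the top of another stack or onto the table -- one block at a time.
    Returns a list of (block, destination) moves, or None.
    """
    def normalize(state):
        # Drop empty stacks and sort, so equivalent states compare equal.
        return tuple(sorted(s for s in state if s))

    start, goal = normalize(start), normalize(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for i, stack in enumerate(state):
            block = stack[-1]
            # Try moving onto every other stack, or onto the table (None).
            for j in [k for k in range(len(state)) if k != i] + [None]:
                new = [list(s) for s in state]
                new[i].pop()
                if j is None:
                    new.append([block])
                    move = (block, "table")
                else:
                    move = (block, state[j][-1])
                    new[j].append(block)
                nxt = normalize(tuple(tuple(s) for s in new))
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [move]))
    return None

# C sits on A; the goal is the single tower A-B-C.
plan = solve_blocksworld(
    start=(("A", "C"), ("B",)),
    goal=(("A", "B", "C"),),
)
# Shortest plan: put C on the table, stack B on A, then C on B.
```

Even this tiny instance shows why the domain stresses planning: the solver must first undo progress (unstacking C) before it can build the goal tower, which is exactly the kind of look-ahead the benchmark probes.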

Until o1, the best-performing model on this dataset was Llama-3.1-405B, which achieved an accuracy of 62.6% on Blocksworld problems but only 0.8% on Mystery Blocksworld problems. By comparison, o1 achieves 97.8% and 52.8% on Blocksworld and Mystery Blocksworld problems, respectively. (Table 1 vs Table 2)