This paper coined the term Large Reasoning Models (LRMs) after the release of 'o1', in contrast to existing Large Language Models (LLMs), because of the model's ability to plan and reason ahead. Recent LLMs have been outperforming one another on most benchmarks, but one benchmark remained far from saturation:
PlanBench, an evaluation dataset created in 2022 comprising Blocksworld problems of varying sizes (given a scenario and a number of blocks, the model must rearrange the blocks into a specified configuration, moving one block at a time). This makes the dataset well suited to testing a model's reasoning and planning capabilities.
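To make the task concrete, here is a toy sketch of a Blocksworld instance solved by brute-force search. This is only an illustration of the problem structure under simplified assumptions (states as stacks of named blocks, one legal move type: lifting the top block of a stack onto another stack or the table); it is not PlanBench's actual PDDL encoding or evaluation harness.

```python
from collections import deque

def solve_blocksworld(start, goal):
    """Breadth-first search for a shortest Blocksworld plan.

    A state is a tuple of stacks (each a tuple of block names, bottom
    first). The only legal action moves the TOP block of one stack onto
    the top of another stack or onto the table -- one block at a time.
    Returns a list of (block, destination) moves, or None.
    """
    def normalize(state):
        # Drop empty stacks and sort, so equivalent states compare equal.
        return tuple(sorted(s for s in state if s))

    start, goal = normalize(start), normalize(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for i, stack in enumerate(state):
            block = stack[-1]
            # Try moving onto every other stack, or onto the table (None).
            for j in [k for k in range(len(state)) if k != i] + [None]:
                new = [list(s) for s in state]
                new[i].pop()
                if j is None:
                    new.append([block])
                    move = (block, "table")
                else:
                    move = (block, state[j][-1])
                    new[j].append(block)
                nxt = normalize(tuple(tuple(s) for s in new))
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [move]))
    return None

# C sits on A; the goal is the single tower A-B-C.
plan = solve_blocksworld(
    start=(("A", "C"), ("B",)),
    goal=(("A", "B", "C"),),
)
# Shortest plan: put C on the table, stack B on A, then C on B.
```

Even this tiny instance shows why the domain stresses planning: the solver must first undo progress (unstacking C) before it can build the goal tower, which is exactly the kind of look-ahead the benchmark probes.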

Until o1, the best-performing model on this dataset was Llama-3.1-405B, which achieved an accuracy of 62.6% on Blocksworld problems but only 0.8% on Mystery Blocksworld problems. By comparison, o1 achieves 97.8% and 52.8% on Blocksworld and Mystery Blocksworld problems, respectively. (Table 1 vs Table 2)