
MultiNRC

Data Introduction

Enhancing the reasoning capabilities of Large Language Models (LLMs) is a critical area of ongoing research and development, and a wide range of thorough and diverse reasoning benchmarks have been built to assess these improvements. However, the vast majority of them are in English: the landscape of multilingual reasoning benchmarks is far less developed, comprehensive, and equitable. Existing multilingual reasoning benchmarks are largely created by translating English reasoning benchmarks (automatically, manually, or through a hybrid process) into target languages. Benchmarks built by directly translating English-centric content tend to reflect the cultural framing and linguistic characteristics of English, which are not globally representative. As a result, reasoning tasks that require native linguistic or cultural understanding are often missing, leaving gaps in the evaluation of non-English reasoning abilities. We therefore still know little about how state-of-the-art (SOTA) LLMs perform on genuinely native multilingual reasoning tasks that demand both linguistic diversity and cultural or contextual depth.

Motivated by this gap, we developed the Multilingual Native Reasoning Challenge (MultiNRC), a new multilingual evaluation benchmark of challenging reasoning questions written natively in the target language, to assess LLMs' reasoning capability in a natural and native context. MultiNRC covers four categories of reasoning questions: language-specific linguistic reasoning, wordplay and riddles, cultural/tradition reasoning, and mathematical reasoning with cultural relevance. We first release MultiNRC in French, Spanish, and Chinese. The four reasoning categories are defined below.

Language-specific Linguistic Reasoning questions are based on grammatical rules, honorifics, or language-specific conventions that exist in the target language but not in English. Such questions require reasoning about language structure or usage; examples include word-formation problems and relation inference based on language conventions. We require that each question demand multi-step reasoning, rather than mere identification of a linguistic feature. In the French example below, the question requires recognizing that the noun délice is masculine in the singular and feminine in the plural, an uncommon grammatical characteristic not present in English.

Si je suis un mot masculin seul, féminin au pluriel, je ne suis jamais bouclé. Qui suis-je?
(Roughly: "If I am a word that is masculine on its own but feminine in the plural, I am never curled. Who am I?")

Wordplay & Riddles contain puzzles that rely on clever and ambiguous use of the target language, often involving multiple meanings, homophones, or puns. Because such questions rely on highly language-specific homophones or puns, no English translation can preserve their exact meaning. Our evaluations later show that this category is one of the most difficult for LLMs. In the Chinese example below, the question asks the respondent to answer a playful riddle with a Chengyu (a classical Chinese idiom) that involves a pun or homophone, demanding clever reasoning.

你可以用谐音梗,用一个成语回答我:为什么井越浅越好?因为
(Roughly: "Answer me with a single Chengyu that uses a homophonic pun: why is a shallower well better? Because ____.")

Cultural/Tradition Reasoning questions require reasoning through timelines, behaviors, or customs derived from local traditions, holidays, or ceremonies. As with the other categories, we require genuine reasoning and exclude purely factual questions. In the Spanish example below, the question requires identifying the date of a local holiday (Día de la Candelaria) and reasoning about travel dates so that two full weekends are spent at the destination, combining knowledge of local traditions with multi-step temporal reasoning.

Me voy de viaje a Cancún el próximo año en 2026. Quiero estar allá para el Día de la Candelaria, pero quiero tener dos fines de semana completos para disfrutar la playa. ¿Cuáles tendrían que ser las fechas de mis vuelos?
(Roughly: "I'm traveling to Cancún next year, in 2026. I want to be there for Día de la Candelaria, but I also want two full weekends to enjoy the beach. What would my flight dates have to be?")
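
To illustrate the kind of multi-step temporal reasoning this example requires, here is a minimal Python sketch that looks up the weekday of Día de la Candelaria (February 2) in 2026 and lists the surrounding weekends. It is only an illustration of the reasoning steps, not the benchmark's ground-truth answer; which weekends (and hence which flight dates) to pick is left open.

```python
from datetime import date, timedelta

# Día de la Candelaria falls on February 2; the question is about 2026.
candelaria = date(2026, 2, 2)
print(candelaria.strftime("%A"))  # weekday of the holiday

def weekends_around(anchor: date, weeks: int = 2):
    """Return (Saturday, Sunday) pairs within +/- `weeks` of `anchor`."""
    day = anchor - timedelta(weeks=weeks)
    end = anchor + timedelta(weeks=weeks)
    pairs = []
    while day <= end:
        if day.weekday() == 5:  # Saturday
            pairs.append((day, day + timedelta(days=1)))
        day += timedelta(days=1)
    return pairs

# A traveler must choose flight dates that bracket two of these full
# weekends while still covering February 2.
for sat, sun in weekends_around(candelaria):
    print(sat.isoformat(), "-", sun.isoformat())
```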

Math Reasoning with Cultural Relevance questions involve calculations based on culture-specific counting systems, calendars, currencies, or numerical phrasing. As above, each question must require at least one reasoning step grounded in a culturally specific element; pure arithmetic or logic without cultural grounding is not allowed. In the French example below, the question requires applying culturally specific knowledge about the French 'viager' real estate system and a historical fact (the year of the sixth edition of the Cannes Film Festival) to determine the financial break-even point of a property purchase, thereby testing mathematical reasoning in a local context.

J’ai trouvé une maison à acheter en viager en 2025 : le bouquet est de 90 000€ et la mensualité de 500€. L’occupante est née l’année de la sixième édition du Festival de Cannes, la maison est estimée à 150 000€, et va prendre 1% de valeur par an. A partir de quel âge de l’occupante cela devient une moins bonne affaire pour moi?
(Roughly: "I've found a house to buy 'en viager' in 2025: the lump-sum 'bouquet' is €90,000 and the monthly payment is €500. The occupant was born the year of the sixth edition of the Cannes Film Festival, the house is valued at €150,000 and will gain 1% in value per year. From what age of the occupant does this become a worse deal for me?")
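
As a rough illustration of the multi-step, culturally grounded arithmetic involved, the sketch below compares the buyer's cumulative payments with the appreciating house value year by year. The birth year (taken here as 1953, commonly cited as the year of the sixth Cannes Film Festival) and the simple nominal break-even criterion with no discounting are illustrative assumptions, not the benchmark's official solution.

```python
# Illustrative sketch of the break-even reasoning for the viager question.
# Assumptions (not the official solution): the deal turns unfavorable once
# the total paid exceeds the house's appreciated value, with no discounting.

BOUQUET = 90_000        # up-front lump sum in euros
MONTHLY = 500           # monthly annuity in euros
HOUSE_VALUE = 150_000   # estimated value in 2025
GROWTH = 0.01           # 1% appreciation per year
BIRTH_YEAR = 1953       # assumed year of the 6th Cannes Film Festival
START_YEAR = 2025

years = 0
while BOUQUET + MONTHLY * 12 * years <= HOUSE_VALUE * (1 + GROWTH) ** years:
    years += 1

print(f"Deal turns unfavorable after ~{years} years, "
      f"i.e. occupant age ~{START_YEAR + years - BIRTH_YEAR}")
```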

To build MultiNRC, we recruited native speakers of each language to write challenging reasoning questions and ground-truth final answers following the definitions of the four categories above. We only accept reasoning questions that cause at least 3 out of 5 SOTA LLMs to fail, and we equip MultiNRC with automatic LLM-as-a-judge evaluation for fast and accurate model assessment. Automatic evaluation is made possible by only including reasoning questions with an objective, short ground-truth final answer; we find that our automatic evaluation agrees with human judgment on more than 95% of MultiNRC items.
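
The grading setup is described only at a high level here; below is a minimal, hypothetical sketch of how an LLM-as-a-judge check against a short ground-truth answer could look. The prompt wording, the judge model name, and the `client` object are illustrative assumptions, not the actual MultiNRC grading configuration.

```python
from openai import OpenAI

client = OpenAI()  # any chat-completions-style client would work

JUDGE_PROMPT = """You are grading a model's answer against a short ground-truth answer.
Question: {question}
Ground-truth final answer: {reference}
Model's final answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, reference: str, candidate: str,
          judge_model: str = "gpt-4.1") -> bool:
    """Return True if the judge deems the candidate answer correct.

    Hypothetical sketch: MultiNRC's actual judge prompt and model may differ.
    """
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")
```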

The full paper can be found here: https://scale.com/research/multinrc

The open source data can be found here: https://huggingface.co/datasets/ScaleAI/MultiNRC
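
For reference, a typical way to load the open-sourced data with the Hugging Face datasets library is sketched below. The dataset id comes from the URL above, but the split name and column names are assumptions and should be checked against the dataset card.

```python
from datasets import load_dataset

# Dataset id taken from the Hugging Face URL above; the split name is an assumption.
multinrc = load_dataset("ScaleAI/MultiNRC", split="train")

print(multinrc)      # inspect the available columns
print(multinrc[0])   # e.g. language, category, question, ground-truth answer
```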

Last updated: July 23, 2025

Performance Comparison

Rank  Model                             Score
1     o3-pro-2025-06-10-high            49.00 ± 3.02
1     o3-2025-04-16-high                45.50 ± 3.00
1     Gemini-2.5-Pro-Preview-06-05      45.12 ± 3.00
1     o3-2025-04-16-medium              44.45 ± 3.00
5     Claude-4-Opus-20250514-thinking   33.93 ± 2.86
5     Claude-4-Opus-20250514            29.00 ± 2.74
6     Claude-3.7-Sonnet-thinking        27.77 ± 2.70
6     Deepseek-R1-0528                  27.58 ± 2.70
6     Deepseek-R1                       24.27 ± 2.59
9     o4-mini-high                      22.18 ± 2.51
9     GPT-4.1                           21.23 ± 2.47
10    kimi-k2-instruct (NEW)            18.48 ± 2.34
10    Claude-4-Sonnet-20250514          18.39 ± 2.34
10    Qwen3-235B-A22B (NEW)             17.63 ± 2.30
15    GPT-4o                            12.42 ± 1.99
16    Llama-4-Maverick                  8.44 ± 1.68