BudHistBenchmark - Testing the Knowledge of Buddhist History Encoded in Open-weights Language Models

This ongoing project develops datasets and best practices for testing the knowledge of language models in the domain of Buddhist history. The focus is on open-weights models, which give researchers full control over their experiments: because the models can be archived without constraints, experiments remain repeatable without restrictions for the foreseeable future.
We use a dataset of four-choice multiple-choice questions pertaining to different periods of Buddhist history. Currently, Ollama is used as middleware for the experiments.
The aim of the project is to track the improvement of open-weights models in the domain of Buddhist history, to understand to what degree knowledge in that field is encoded in language models, and to identify the best current models to talk to about Buddhist history.
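To make the setup concrete, the testing loop can be sketched roughly as follows. This is an illustrative sketch, not the project's actual harness: it assumes the `ollama` Python client and a running local Ollama server, and the helper names (`build_prompt`, `ask`, `extract_choice`, `score`) are hypothetical.

```python
import re

def build_prompt(question, choices):
    """Format one four-choice question; the model is asked to reply with a single letter."""
    letters = "ABCD"
    lines = [question] + [f"{l}) {c}" for l, c in zip(letters, choices)]
    lines.append("Answer with the letter of the correct choice only.")
    return "\n".join(lines)

def extract_choice(reply):
    """Pull the first standalone A-D letter out of the model's reply, or None."""
    m = re.search(r"\b([ABCD])\b", reply.strip().upper())
    return m.group(1) if m else None

def ask(model, question, choices):
    """Send one question to a local Ollama server and return the parsed letter.
    Requires `pip install ollama` and a running Ollama daemon."""
    import ollama  # imported here so the scoring helpers work without it
    resp = ollama.chat(
        model=model,  # e.g. "qwen3.5:27b", one of the tags tested above
        messages=[{"role": "user", "content": build_prompt(question, choices)}],
        options={"temperature": 0},  # deterministic answers for repeatability
    )
    return extract_choice(resp["message"]["content"])

def score(answers, key):
    """Count how many parsed answers match the answer key."""
    return sum(a == k for a, k in zip(answers, key))
```

Parsing a single letter rather than free text keeps scoring mechanical; replies that contain no standalone A-D letter are counted as wrong.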

As of , Qwen3.5:27b 'knows' the most about Buddhist history among the tested open-weights models.

Marcus Bingenheimer

April 2026 - now


Model                   Test 2025-10 (150 questions)   Test 2026-04 (210 questions)

deepseek-r1:14b         -                              128/210 (61%)
gemma2:9b               81/150 (54%)                   -
gemma3:4b               60/150 (40%)                   -
gemma3:12b              93/150 (62%)                   134/210 (64%)
gemma3:27b              96/150 (64%)                   139/210 (66%)
gemma4:e4b              -                              121/210 (58%)
gemma4:31b              -                              159/210 (76%)
glm-4.7-flash:q4_K_M    -                              158/210 (75%)
llama3:8b               84/150 (56%)                   -
llama3.1:8b             85/150 (57%)                   -
llama4:16x17b           X [1]
mistral:7b              72/150 (48%)                   -
mistral-nemo:12b        -                              124/210 (59%)
mixtral:8x7b            98/150 (65%)                   151/210 (72%)
olmo-3.1:32b            X [2]
phi3:3.8b               75/150 (50%)                   -
phi4:14b                81/150 (54%)                   132/210 (63%)
qwen2.5:7b              86/150 (57%)                   -
qwen2.5:14b             114/150 (76%)                  158/210 (75%)
qwen2.5:32b             120/150 (80%)                  165/210 (79%)
qwen3:8b                102/150 (68%)                  -
qwen3:14b               114/150 (76%)                  158/210 (75%)
qwen3:32b               110/150 (73%)                  -
qwen3.5:9b              -                              153/210 (73%)
qwen3.5:27b             -                              180/210 (86%)

[1] Not tested: needs 60 GB of local memory.
[2] Not tested: Olmo does not take "think=False"; the workaround slows the response time drastically.

Observations

Comparing 2026-04 with 2025-10 results

