BudHistBenchmark - Testing the Knowledge of Buddhist History Encoded in Open-weights Language Models

This ongoing project develops datasets and best practices for testing the knowledge of language models in the domain of Buddhist history. The focus is on open-weights models, which give researchers full control over their experiments: because the models can be archived without constraints, experiments remain repeatable without restrictions for the foreseeable future.
We use a dataset of four-choice multiple-choice questions pertaining to different periods of Buddhist history. Currently, Ollama is used as middleware for the experiments.
The aim of the project is to track the improvement of open-weights models in the domain of Buddhist history, to understand to what degree knowledge in that field is encoded in language models, and to identify the best current models to talk to about Buddhist history.
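To make the setup concrete, the testing loop can be sketched roughly as follows. This is an illustrative sketch, not the project's actual harness: it assumes the `ollama` Python client and a running local Ollama server, and the helper names (`build_prompt`, `ask`, `extract_choice`, `score`) are hypothetical.

```python
import re

def build_prompt(question, choices):
    """Format one four-choice question; the model is asked to reply with a single letter."""
    letters = "ABCD"
    lines = [question] + [f"{l}) {c}" for l, c in zip(letters, choices)]
    lines.append("Answer with the letter of the correct choice only.")
    return "\n".join(lines)

def extract_choice(reply):
    """Pull the first standalone A-D letter out of the model's reply, or None."""
    m = re.search(r"\b([ABCD])\b", reply.strip().upper())
    return m.group(1) if m else None

def ask(model, question, choices):
    """Send one question to a local Ollama server and return the parsed letter.
    Requires `pip install ollama` and a running Ollama daemon."""
    import ollama  # imported here so the scoring helpers work without it
    resp = ollama.chat(
        model=model,  # e.g. "qwen3.5:27b", one of the tags tested above
        messages=[{"role": "user", "content": build_prompt(question, choices)}],
        options={"temperature": 0},  # deterministic answers for repeatability
    )
    return extract_choice(resp["message"]["content"])

def score(answers, key):
    """Count how many parsed answers match the answer key."""
    return sum(a == k for a, k in zip(answers, key))
```

Parsing a single letter rather than free text keeps scoring mechanical; replies that contain no standalone A-D letter are counted as wrong.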

As of , Qwen3.5:27b 'knows' the most about Buddhist history among the tested open-weights models.

Marcus Bingenheimer

April 2026 - now


Model                   Test 2025-10 (150 questions)   Test 2026-04 (210 questions)

deepseek-r1:14b         -                              128/210 (61%)
gemma2:9b               81/150 (54%)                   -
gemma3:4b               60/150 (40%)                   -
gemma3:12b              93/150 (62%)                   134/210 (64%)
gemma3:27b              96/150 (64%)                   139/210 (66%)
gemma4:e4b              -                              121/210 (58%)
gemma4:31b              -                              159/210 (76%)
glm-4.7-flash:q4_K_M    -                              158/210 (75%)
llama3:8b               84/150 (56%)                   -
llama3.1:8b             85/150 (57%)                   -
llama4:16x17b           X [1]
mistral:7b              72/150 (48%)                   -
mistral-nemo:12b        -                              124/210 (59%)
mixtral:8x7b            98/150 (65%)                   151/210 (72%)
olmo-3.1:32b            X [2]
phi3:3.8b               75/150 (50%)                   -
phi4:14b                81/150 (54%)                   132/210 (63%)
qwen2.5:7b              86/150 (57%)                   -
qwen2.5:14b             114/150 (76%)                  158/210 (75%)
qwen2.5:32b             120/150 (80%)                  165/210 (79%)
qwen3:8b                102/150 (68%)                  -
qwen3:14b               114/150 (76%)                  158/210 (75%)
qwen3:32b               110/150 (73%)                  -
qwen3.5:9b              -                              153/210 (73%)
qwen3.5:27b             -                              180/210 (86%)

[1] Not tested: needs 60 GB of local memory.
[2] Not tested: Olmo does not take "think=False"; the workaround slows the response time drastically.

Observations

Comparing 2026-04 with 2025-10 results

