Benchmarking LLMs Through Knowledge Graph Analysis

May 2025 Research 5 min read

LLM, AI, Agents -- all the buzzwords we hear today fill our social media feeds, and even the news talks about incredible promises: advancing the sciences, ushering in a golden age. While all of this is nice to hear, those of us actually working on deep learning know where the progress really came from: data-driven approaches, validated by great open-source benchmarks (like MNIST, CIFAR, ImageNet) that helped develop better, more efficient algorithms.

Initially, these datasets were a one-stop shop to both train and evaluate a model. But the emergence of larger, more complex neural networks, later coined "foundation models", changed that. First, the field shifted to fine-tuning on smaller datasets to show better performance. Then, with the GPT series of models, we moved to few-shot and eventually zero-shot evaluation (no gradient updates needed). This was a very important paradigm shift: we now evaluate models on data that, in theory, should be completely different from the training sets, and generalization became a widely discussed topic as a result.

Now, fast forward to today (mid-2025), and we have multiple benchmarks evaluating these models, with newer ones popping up constantly. These benchmarks have been saturating at an insanely rapid pace, even the seemingly tough LLM-focused and adversarial ones (like ARC, AGIEval).

While a lot of benchmarks can be really useful to evaluate certain things, they generally fall into three main categories:

LLM-as-a-judge based: Benchmarks like MT-Bench and AlpacaEval use an LLM as the judge. This is great for scale, but will always be limited by the biases and capabilities of the judge LLM itself.
Human-as-a-judge based: These are the most popular and widely shared benchmarks right now, thanks to the LMSys Arena. Its rankings have been pretty consistent and tend to reflect real-world performance. Of course, these come with caveats: they can be gamed towards human preferences in writing style (which LMSys needed to fix), and they carry high compute and human-input costs.
Static dataset based: This is the traditional way. Evaluation is done on a fixed dataset, and these benchmarks have been saturating at breakneck speed. They remain crucial, however, for domain-specific evaluation.

Because of these limitations, I've been thinking more and more recently about a method that would let us have a peek into a model's performance on a certain domain without needing a static dataset, without an LLM as a judge, and without a human in the loop.

The premise is very simple: we ask the LLM to generate a comprehensive knowledge graph on a certain subject. We can then study some elementary properties of that graph; in particular, we look at how well connected it is, using several metrics.

The underlying hypothesis is: if the model has a comprehensive understanding of the subject, it will be able to actually connect the different underlying concepts to each other. To measure this, we'll look at the average node degree (the average number of connections per concept), which is a really nice, largely verbosity-invariant metric: listing more concepts does not raise it unless they are also connected to each other.
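To make the metric concrete, here is one common way to write it down. This is a sketch under an assumption: every edge is counted once for each of its two endpoints (in-degree plus out-degree for a directed graph). The post does not say exactly how the evaluation code counts direction, so treat the symbols below as illustrative. With $|V|$ nodes and $|E|$ edges:

```latex
\bar{k} \;=\; \frac{1}{|V|} \sum_{v \in V} \deg(v) \;=\; \frac{2\,|E|}{|V|}
```

For example, five concepts linked in a simple chain (4 edges) give $\bar{k} = 2 \cdot 4 / 5 = 1.6$, while the same five concepts with 8 cross-links give $\bar{k} = 2 \cdot 8 / 5 = 3.2$. Adding a sixth concept without connecting it to anything lowers both values, which is the sense in which verbosity alone does not help.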

A byproduct of this is we get some really nice graphs to look at (see here: Knowledge Graph Visualizations).

Note that this method is far from perfect, but it could be a good proxy for model 'comprehension'. I've evaluated it on the following models:

OpenAI o4-mini-high
OpenAI GPT-4.1
OpenAI GPT-4.1 nano
Gemini 2.5 Pro Exp
DeepSeek V3
Claude 3.7 Sonnet (non-thinking)

Each model is evaluated on the following subjects (which are a very broad proxy for general knowledge):

Engineering and applied sciences
Life Sciences
Mathematics and Computer Science
Physical Sciences
Social and behavioral sciences

The prompt for each model is the following:

Generate the most comprehensive knowledge graph possible for the concept "{concept}". Include as many relevant nodes (representing concepts, entities, attributes, etc.) and directed edges (representing relationships) as possible to capture the structure of knowledge about this concept.

We also provide each model with a JSON schema for the output format (which some models follow better than others). The output JSON file is then sanitized and standardized before being evaluated and eventually turned into images.
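As a rough illustration of what "schema plus sanitization" could look like, here is a minimal sketch. The schema, the field names (`nodes`, `edges`, `id`, `source`, `target`, `relation`) and the cleaning rules are all assumptions made for this example; the actual schema and sanitizer used in the project may differ.

```python
import json

# Hypothetical output schema: the real one used in the project is not shown here.
GRAPH_SCHEMA = {
    "type": "object",
    "required": ["nodes", "edges"],
    "properties": {
        "nodes": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["id"],
                "properties": {"id": {"type": "string"}},
            },
        },
        "edges": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["source", "target"],
                "properties": {
                    "source": {"type": "string"},
                    "target": {"type": "string"},
                    "relation": {"type": "string"},
                },
            },
        },
    },
}


def _norm(value):
    """Normalize an identifier: cast to str, collapse whitespace, lowercase."""
    return " ".join(str(value).split()).lower()


def sanitize_graph(raw_text):
    """Parse raw model output and return (nodes, edges).

    Assumed cleaning rules: duplicate nodes, duplicate edges, self-loops and
    edges pointing at undeclared nodes are dropped.
    """
    data = json.loads(raw_text)
    if not isinstance(data, dict):
        return [], []

    nodes, seen_nodes = [], set()
    for node in data.get("nodes", []):
        if not isinstance(node, dict) or "id" not in node:
            continue
        node_id = _norm(node["id"])
        if node_id and node_id not in seen_nodes:
            seen_nodes.add(node_id)
            nodes.append(node_id)

    edges, seen_edges = [], set()
    for edge in data.get("edges", []):
        if not isinstance(edge, dict):
            continue
        src = _norm(edge.get("source", ""))
        dst = _norm(edge.get("target", ""))
        if src == dst or src not in seen_nodes or dst not in seen_nodes:
            continue
        if (src, dst) not in seen_edges:
            seen_edges.add((src, dst))
            edges.append((src, dst))
    return nodes, edges
```

Normalizing identifiers before deduplication matters here because models often refer to the same concept with different casing or stray whitespace, which would otherwise inflate the node count.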

Before going into the results, I want to make it clear that we don't check the model's output against any ground truth. That's a limitation that should be taken into account. However, I truly believe this would not hurt the benchmark unless models are specifically trained to game it.

And these are the results: Image of the final model ranking.

The results correlate strongly with LMSys Arena and other metrics.

Image of lmsys arena

Evaluation Metrics

The evaluation system ranks models based on several graph metrics:

Metric	Description	Weight in Ranking
Average Node Degree	Average number of connections per node	2.0 (Primary)
Node Count	Total number of concepts/entities	1.0
Edge Count	Total number of relationships	1.0
Graph Density	Ratio of actual connections to possible connections	0.8
Connected Components	Number of disconnected subgraphs (lower is better)	1.0
Largest Component Ratio	Size of largest component relative to total graph	0.8
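
The metrics in this table can be computed with nothing but the standard library. The sketch below is one possible implementation, not the project's actual code: the function name `graph_metrics` and the dictionary keys are made up for illustration, and it assumes edges are treated as undirected for degree and connectivity, while density uses the directed formula E / (N * (N - 1)).

```python
from collections import defaultdict, deque


def graph_metrics(nodes, edges):
    """Compute the six table metrics for one graph.

    nodes: iterable of hashable node ids
    edges: iterable of (source, target) pairs
    Assumptions: self-loops and edges to unknown nodes are ignored; degree and
    components ignore edge direction; density is E / (N * (N - 1)).
    """
    node_set = set(nodes)
    edge_set = {
        (s, t) for s, t in edges
        if s in node_set and t in node_set and s != t
    }
    n, e = len(node_set), len(edge_set)

    # Undirected adjacency, used only for the component search.
    neighbors = defaultdict(set)
    for s, t in edge_set:
        neighbors[s].add(t)
        neighbors[t].add(s)

    # Breadth-first search to collect the size of each connected component.
    unseen = set(node_set)
    component_sizes = []
    while unseen:
        queue = deque([unseen.pop()])
        size = 1
        while queue:
            current = queue.popleft()
            for nxt in neighbors[current]:
                if nxt in unseen:
                    unseen.remove(nxt)
                    queue.append(nxt)
                    size += 1
        component_sizes.append(size)

    return {
        "node_count": n,
        "edge_count": e,
        "avg_degree": 2 * e / n if n else 0.0,
        "density": e / (n * (n - 1)) if n > 1 else 0.0,
        "connected_components": len(component_sizes),
        "largest_component_ratio": max(component_sizes) / n if n else 0.0,
    }
```

On a toy graph with a triangle `a-b-c` plus a separate pair `d-e`, this yields 5 nodes, 4 edges, an average degree of 1.6, two components, and a largest component ratio of 0.6.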

Ranking Methodology

The overall ranking is calculated using a weighted average of normalized metrics. The evaluation gives double weight to average node degree, as this metric best captures the knowledge graph's interconnectedness and usefulness.

For each subject and model combination:

Each metric is normalized to a [0,1] scale.
Metrics are weighted according to importance.
A composite score is calculated.
Models are ranked by their average scores across all subjects.
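
The four steps above can be sketched as follows. The weights come from the metrics table; everything else is an assumption made for this example: min-max scaling across models within each subject as the normalization, inverting the connected-components score because lower is better, mapping all-equal values to 0.5, and the metric key names themselves.

```python
# Weights taken from the metrics table; key names are hypothetical.
WEIGHTS = {
    "avg_degree": 2.0,
    "node_count": 1.0,
    "edge_count": 1.0,
    "density": 0.8,
    "connected_components": 1.0,
    "largest_component_ratio": 0.8,
}
LOWER_IS_BETTER = {"connected_components"}


def min_max(values, invert=False):
    """Scale a {model: value} dict to [0, 1] (assumed normalization).

    If every model has the same value, all of them get 0.5.
    """
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {m: 0.5 for m in values}
    scaled = {m: (v - lo) / (hi - lo) for m, v in values.items()}
    return {m: 1.0 - s for m, s in scaled.items()} if invert else scaled


def subject_scores(metrics_by_model):
    """Weighted average of normalized metrics for a single subject.

    metrics_by_model: {model: {metric_name: value}}
    """
    totals = {m: 0.0 for m in metrics_by_model}
    for metric, weight in WEIGHTS.items():
        raw = {m: vals[metric] for m, vals in metrics_by_model.items()}
        norm = min_max(raw, invert=metric in LOWER_IS_BETTER)
        for m in totals:
            totals[m] += weight * norm[m]
    weight_sum = sum(WEIGHTS.values())
    return {m: t / weight_sum for m, t in totals.items()}


def rank_models(metrics_by_subject):
    """Average each model's composite score over subjects, best first.

    metrics_by_subject: {subject: {model: {metric_name: value}}}
    """
    per_model = {}
    for by_model in metrics_by_subject.values():
        for model, score in subject_scores(by_model).items():
            per_model.setdefault(model, []).append(score)
    averages = {m: sum(s) / len(s) for m, s in per_model.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

One consequence of min-max scaling worth keeping in mind: scores are relative to the set of models being compared, so adding or removing a model can shift everyone else's composite score.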

A really important thing to notice is that this approach also lets us test models on a particular subject (a programming language, a type of cuisine, etc.). In my case, I've evaluated it on pure mathematics, and the results were interesting.

Now, let's look at some of the most beautiful graphs that our study generated...

Images of different graphs generated by the llms

All of this work is open source, and I hope that you will enjoy testing it on your favorite models.

The GitHub link for this project: NessimBenA/LLMEvaluationWithGraphs.
