Knowledge Graph QASP Generation
Master's Thesis — Automated Question-Answer-SPARQL triplet generation from Knowledge Graphs using open-source LLMs
Leuphana University · 2025
Overview
This thesis investigates whether open-source Large Language Models can generate coherent Question-Answer-SPARQL (QASP) triplets directly from Knowledge Graph triples in a single prompt — without fine-tuning or external supervision.
The task combines semantic parsing and natural language generation simultaneously: the model must infer entities, relations and constraints from structured KG triples, and produce a natural language question, a corresponding answer, and a syntactically valid SPARQL query that retrieves that answer from the DBLP knowledge graph. The research explores how design choices — model size, triple representation format, structural metadata, and prompting strategy — affect generation accuracy.
Key Result
87.38% accuracy
Best configuration: Qwen2.5-Coder-32B · hop-info+-linearised-str format · zero-shot · no structural metadata · 10,704 correct QASP triplets generated from 12,250 total
Pipeline
The automated pipeline takes raw KG triples from the DBLP Knowledge Graph (accessed via a Virtuoso SPARQL endpoint) and transforms them through several stages before prompting the LLM.
1 — KG Data Extraction
Raw 1-hop and 2-hop triple chains are extracted from the DBLP KG via SPARQL queries targeting the Publication entity class. Structural metadata (node types, predicates, relation schema) is extracted separately for use in higher-level SMI experiments.
2 — Triple Preprocessing
Triples are cleaned and URI-prefixed to reduce token length without losing semantic information. Semantically irrelevant triples (e.g. hasIdentifier) are filtered out. 6 triple format types (TFT) are generated: array, linearised-array, linearised-str, hop-repeat, hop-info+, and special-tokens.
3 — Prompt Construction
Prompts are assembled from the formatted triple chain, optional structural metadata (SMI levels −1, 0, 1), a primitive question task composition (single-fact), and a 0-shot or 1-shot example. The LLM is run in chat mode with temperature 0 and a configurable repeat parameter to generate multiple QASPs per prompt.
4 — Extraction & Validation
Question, answer and SPARQL are extracted from LLM output via regex matching **Question:**, **Answer:**, **SPARQL:**. Each SPARQL is executed against the DBLP KG API — accuracy is measured by whether the queried answer exactly matches the LLM-generated answer.
Experiments
| Experiment | Finding |
|---|---|
| Triple Format Type (TFT) | Linearised string formats outperform array formats across most models. hop-info+-linearised-str is the most consistent top performer. |
| Structural Metadata (SMI) | Counterintuitively, no metadata (SMI = −1) gives the highest accuracy. Adding node class and relation schema context confuses models more than it helps. |
| N-shot prompting | Zero-shot outperforms one-shot for larger models (Gemma, Mistral). Smaller models benefit from examples. One example can introduce bias towards that model's style. |
| Model comparison | Qwen2.5-Coder-32B achieves 87.38% avg. in SMI experiment. Gemma 3 27B and Qwen3 32B are close. Mistral 123B is not proportionally better than 32B models. |
Models Evaluated
| Model | Parameters | Best Avg. Accuracy |
|---|---|---|
| Qwen2.5-Coder-32B-Instruct | 32B | 87.38% |
| Gemma 3 27B Instruct | 27B | 72.2% |
| Qwen3 32B | 32B | 71.4% |
| Mistral Large Instruct 123B | 123B | 63.3% |
| Codestral 22B | 22B | 32.2% |
| Llama 3.1 8B Instruct | 8B | 14.4% |
Stack
| Layer | Technology |
|---|---|
| Language | Python 3 · Poetry |
| LLM API | GWDG HPC interface (open-source models via API) |
| Knowledge Graph | DBLP KG · Virtuoso SPARQL endpoint |
| Data | DBLP-QASP dataset — 10,704 validated QASP triplets |