Whose Name Comes Up? II: Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation (arxiv.org)
arXiv:2602.08873v2 Announce Type: replace-cross
Abstract: Large language models (LLMs) are now used for academic expert recommendation. Existing audits typically evaluate such recommendations in isolation, ignoring end-user inference-time interventions. Thus, it remains unclear whether failures (e.g., refusals, hallucinations, uneven coverage) stem from model choice or deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench measures technical quality and social representation using nine metrics. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Our results show that each intervention entails distinct tradeoffs. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, while RAG primarily improves technical quality while reducing diversity and parity. Overall, end-user interventions reshape trade-offs rather than providing uniform gains. LLMScholarBench makes all these dynamics auditable across models and interventions in LLM-based scholar recommendations.
Abstract: Large language models (LLMs) are now used for academic expert recommendation. Existing audits typically evaluate such recommendations in isolation, ignoring end-user inference-time interventions. Thus, it remains unclear whether failures (e.g., refusals, hallucinations, uneven coverage) stem from model choice or deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench measures technical quality and social representation using nine metrics. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Our results show that each intervention entails distinct tradeoffs. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, while RAG primarily improves technical quality while reducing diversity and parity. Overall, end-user interventions reshape trade-offs rather than providing uniform gains. LLMScholarBench makes all these dynamics auditable across models and interventions in LLM-based scholar recommendations.
Comments