Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models

1 Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo
2 Meituan Inc.
3 Salesforce Research
4 School of Software Engineering, Huazhong University of Science and Technology
5 Engineering Research Center of Chiplet Design and Manufacturing of Zhejiang Province
*Indicates Equal Contribution
†Indicates Corresponding Author

InfoSearch consists of six dimensions, each representing a document-level feature with values drawn from predefined conditions. Queries are paired with one dimension and evaluated in three retrieval modes based on the given instructions.

  • Original Mode: This mode serves as a baseline, evaluating the model’s basic ability to retrieve pertinent information without any additional constraint.
  • Instructed Mode: In this mode, the model must retrieve documents that are both relevant in content and satisfy the condition specified in the instruction.
  • Reversely Instructed Mode: In this mode, the model must retrieve documents that are relevant in content but do not satisfy the condition specified in the instruction, which tests its ability to understand negation.
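As a rough illustration of how a single query could be scored under the three modes, the minimal sketch below assumes a generic retriever callable, per-mode graded relevance labels, and instructions appended to the query text; the retrieve interface, the helper names, and that way of passing instructions are all illustrative assumptions rather than the benchmark's actual evaluation code.

# Minimal sketch: scoring one query under InfoSearch's three modes.
# Assumptions (not the benchmark's actual code): `retrieve` is any callable that
# maps a query string to a ranked list of document ids, instructions are appended
# to the query text, and graded relevance labels are supplied per mode.
import math
from typing import Callable, Dict, List


def ndcg_at_k(ranking: List[str], labels: Dict[str, int], k: int = 10) -> float:
    # Standard nDCG@k with exponential gain, corresponding to the nDCG@10 scores
    # reported on this page.
    dcg = sum(
        (2 ** labels.get(doc, 0) - 1) / math.log2(rank + 2)
        for rank, doc in enumerate(ranking[:k])
    )
    ideal = sorted(labels.values(), reverse=True)[:k]
    idcg = sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate_modes(
    query: str,
    instruction: str,
    reversed_instruction: str,
    retrieve: Callable[[str], List[str]],   # hypothetical retriever interface
    labels: Dict[str, Dict[str, int]],      # per-mode relevance judgments: "ori"/"ins"/"rev"
) -> Dict[str, float]:
    queries = {
        "ori": query,                              # Original: content relevance only
        "ins": f"{query} {instruction}",           # Instructed: must satisfy the condition
        "rev": f"{query} {reversed_instruction}",  # Reversely Instructed: must NOT satisfy it
    }
    return {mode: ndcg_at_k(retrieve(q), labels[mode]) for mode, q in queries.items()}

The key point the sketch captures is that the relevance judgments differ across modes: a document that is content-relevant counts as relevant in the Instructed mode only if it satisfies the condition, and in the Reversely Instructed mode only if it does not.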

Abstract

Instruction-following capabilities in LLMs have progressed significantly, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances: most still rely on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these focus primarily on intrinsic content relevance and neglect customized preferences over broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce two novel metrics, the Strict Instruction Compliance Ratio (SICR) and the Weighted Instruction Sensitivity Evaluation (WISE), to accurately assess the models' responsiveness to instructions. Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets and increasing model size enhance performance, most models still fall short of instruction compliance.

πŸ† InfoSearch Leaderboard πŸ†

Model | Type | WISE | SICR | p-MRR | nDCG@10_ori | nDCG@10_ins | nDCG@10_rev
πŸ“€ Contact Us πŸ“€

You can send your results to us via email or GitHub issues. Here is an example of the JSON format for the results:
{
    "model_name": "bge-Large-v1.5",
    "type": "Dense Retrieval",
    "ndcg_ori": 53.2,
    "ndcg_ins": 34.9,
    "ndcg_rev": 34.9,
    "p_mrr": 21.3,
    "wise": -29.5,
    "sicr": 1.0,
    "model_url": "https://huggingface.co/BAAI/bge-large-en-v1.5",
    "Audience": {
      "ndcg_ori": 48.6,
      "ndcg_ins": 38.1,
      "ndcg_rev": 37.6,
      "mrr_ori": 22.9,
      "mrr_ins": 12.9,
      "mrr_rev": 11.9,
      "wise": -16.8,
      "sicr": 0.5
    },
    ...
}
We will update the leaderboard and the paper with your results. If you have any questions, please feel free to contact us.
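Before sending results, it may help to sanity-check the file against the fields shown in the example. The sketch below is a minimal validator; the required-key lists are inferred from the example above (our assumption, not an official schema), and result.json is a hypothetical file name.

import json

# Expected fields, inferred from the example submission above; this key list is
# an assumption based on that example, not an official schema.
TOP_LEVEL_KEYS = {"model_name", "type", "ndcg_ori", "ndcg_ins", "ndcg_rev",
                  "p_mrr", "wise", "sicr", "model_url"}
DIMENSIONS = {"Audience", "Keyword", "Format", "Language", "Length", "Source"}
DIMENSION_KEYS = {"ndcg_ori", "ndcg_ins", "ndcg_rev",
                  "mrr_ori", "mrr_ins", "mrr_rev", "wise", "sicr"}


def validate_submission(path: str) -> None:
    # Raise ValueError if the result file is missing any expected field.
    with open(path) as f:
        result = json.load(f)
    missing = TOP_LEVEL_KEYS - result.keys()
    if missing:
        raise ValueError(f"missing top-level fields: {sorted(missing)}")
    for dim in DIMENSIONS & result.keys():
        missing_dim = DIMENSION_KEYS - result[dim].keys()
        if missing_dim:
            raise ValueError(f"{dim}: missing fields {sorted(missing_dim)}")


if __name__ == "__main__":
    validate_submission("result.json")  # hypothetical file name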

πŸ” Analysis of Different Models πŸ”


Retrieval Model Comparison - WISE Radar Chart
Reranking Model Comparison - WISE Radar Chart

BibTeX

@misc{zhou2024contentrelevanceevaluatinginstruction,
      title={Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models},
      author={Jianqun Zhou and Yuanlei Zheng and Wei Chen and Qianqian Zheng and Zeyuan Shang and Wei Zhang and Rui Meng and Xiaoyu Shen},
      year={2024},
      eprint={2410.23841},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2410.23841},
}