🧩 About the Project

This is the official repository for the paper:

Meta Data Retrieval for Data Infrastructure via RAG
Presented at IEEE ICWS 2024

We propose DOR-RAF (Digital Object Retrieval via RAG-Agent Fusion), a framework that integrates Retrieval-Augmented Generation (RAG) with an Agent mechanism to address three problems in modern data infrastructures: inefficient metadata retrieval, fuzzy user requirements, and high retrieval cost.

📌 Highlights

✨ Why DOR-RAF?

  • 💡 Supports fuzzy & complex queries with multi-round dialogue
  • 🔍 Combines keyword + vector search for robust metadata retrieval
  • 🤖 Agent decomposes and rewrites queries for intelligent feedback
  • 📚 Uses a custom-built dataset simulating realistic digital infrastructure
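
The keyword + vector combination above can be illustrated with a small sketch. Since the DOR-RAF code is not yet released, everything here is an assumption made for illustration: the function names are hypothetical, a bag-of-words cosine similarity stands in for a real embedding model, and reciprocal rank fusion is one common way to merge two rankings, not necessarily the method used in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for the sketch)."""
    return text.lower().split()


def keyword_scores(query, docs):
    """Fraction of query terms that appear in each document."""
    q_terms = set(tokenize(query))
    if not q_terms:
        return [0.0 for _ in docs]
    return [len(q_terms & set(tokenize(d))) / len(q_terms) for d in docs]


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_scores(query, docs):
    """Bag-of-words cosine similarity; a stand-in for embedding search."""
    vocab = sorted({t for text in [query, *docs] for t in tokenize(text)})

    def to_vector(text):
        counts = Counter(tokenize(text))
        return [counts[t] for t in vocab]

    q_vec = to_vector(query)
    return [cosine(q_vec, to_vector(d)) for d in docs]


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several rankings of document indices into one fused ranking."""
    fused = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking, start=1):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (k + rank)
    # Higher fused score first; ties broken by document index.
    return sorted(fused, key=lambda i: (-fused[i], i))


def hybrid_search(query, docs, top_k=3):
    """Return indices of the top_k documents under the fused ranking."""
    kw = keyword_scores(query, docs)
    vec = vector_scores(query, docs)
    kw_rank = sorted(range(len(docs)), key=lambda i: (-kw[i], i))
    vec_rank = sorted(range(len(docs)), key=lambda i: (-vec[i], i))
    return reciprocal_rank_fusion([kw_rank, vec_rank])[:top_k]


if __name__ == "__main__":
    # Hypothetical metadata descriptions, not taken from the paper's dataset.
    metadata = [
        "temperature sensor readings from weather stations",
        "customer purchase records for retail stores",
        "satellite images of urban areas",
    ]
    print(hybrid_search("weather sensor temperature", metadata, top_k=2))
```

Fusing ranks rather than raw scores avoids having to put keyword and vector scores on a common scale, which is why it is a convenient choice for a sketch like this.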

📊 Key Results

| Task | DOR-RAF vs. Baselines |
| --- | --- |
| 🎯 Digital Object Retrieval | +18.6% ↑ F1-score vs. keyword search |
| 📝 Data Characterization | +14.5 ↑ Answer Correctness vs. Original RAG |
| 🧠 Faithfulness, Precision, Similarity | Substantially improved! |
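
For readers unfamiliar with the retrieval metric, the F1-score reported above is the harmonic mean of precision and recall. The sketch below uses the standard set-based definition; the function name and the example identifiers are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical digital object identifiers for illustration only.
    returned = ["do-1", "do-2", "do-3", "do-4"]
    ground_truth = ["do-1", "do-3", "do-5"]
    print(precision_recall_f1(returned, ground_truth))
```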

🛠️ What's Inside (Coming Soon)

We're actively preparing the open-source codebase, including:

  • 📦 DOR-RAF Framework (Agent + DO-RAG)
  • 🔧 LangGraph-based Self-RAG implementation
  • 🧪 RAGAS-based evaluation pipeline
  • 🧾 Custom digital object dataset generator
  • 🧠 Scripts for embedding models, LLM interface, Elasticsearch, etc.

🕒 Stay tuned! The code and documentation will be released shortly. Follow the repo and ⭐ star it to get updates.

🧪 Abstract

Data infrastructures face challenges in metadata retrieval due to inefficiency, ambiguity, and manual costs. DOR-RAF tackles these by integrating Large Language Models (LLMs) and RAG tools via an intelligent Agent. It handles vague queries, decomposes complex tasks, and achieves interactive multi-turn retrieval. Experiments on a custom dataset show clear improvements over traditional keyword-based and vanilla RAG methods in terms of F1, precision, and semantic alignment.

📖 Citation

If our work inspires yours, please cite us:

@inproceedings{shi2024meta,
  title={Meta data retrieval for data infrastructure via RAG},
  author={Shi, Zhuo-Fan and Liu, Kun and Bai, Shan and Jiang, Yun-Tao and Huo, Tong and Jing, Xiang and Li, Rui-Zhi and Ma, Xin-Jian},
  booktitle={2024 IEEE International Conference on Web Services (ICWS)},
  pages={100--107},
  year={2024},
  organization={IEEE}
}

📬 Contact

💬 Corresponding Author: Xin-jian Ma

🧑‍💻 Lead Maintainer: Zhuo-fan Shi

📍 Affiliations: National Key Lab of Data Space Technology & Advanced Institute of Big Data, Beijing