About the Project
This is the official repository for the paper:
Meta Data Retrieval for Data Infrastructure via RAG
Presented at IEEE ICWS 2024
We propose DOR-RAF (Digital Object Retrieval via RAG-Agent Fusion), a new framework that integrates Retrieval-Augmented Generation (RAG) and Agent mechanisms to tackle metadata inefficiency, fuzzy requirements, and costly retrieval in modern data infrastructures.
Highlights
Why DOR-RAF?
- Supports fuzzy & complex queries with multi-round dialogue
- Combines keyword + vector search for robust metadata retrieval
- Agent decomposes and rewrites queries for intelligent feedback
- Uses a custom-built dataset simulating realistic digital infrastructure
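The keyword + vector combination listed above can be illustrated with a minimal sketch. This is not the DOR-RAF implementation (the code has not been released yet); it assumes a simple weighted sum of two scores, and every name in it (`keyword_score`, `vector_score`, `hybrid_search`, the `alpha` weight, the `do-00x` record IDs) is hypothetical. Term-count vectors stand in for learned embeddings so the example runs without external libraries.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (toy tokenizer)."""
    return text.lower().split()


def keyword_score(query, doc):
    """Fraction of distinct query tokens that appear in the document."""
    q_tokens = set(tokenize(query))
    d_tokens = set(tokenize(doc))
    if not q_tokens:
        return 0.0
    return len(q_tokens & d_tokens) / len(q_tokens)


def vector_score(query, doc):
    """Cosine similarity of term-count vectors (stand-in for embeddings)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    norm = q_norm * d_norm
    return dot / norm if norm else 0.0


def hybrid_search(query, records, alpha=0.5, top_k=3):
    """Rank records by alpha * keyword score + (1 - alpha) * vector score."""
    scored = [
        (alpha * keyword_score(query, text)
         + (1 - alpha) * vector_score(query, text), rec_id)
        for rec_id, text in records.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop records that match on neither signal.
    return [rec_id for score, rec_id in scored[:top_k] if score > 0]


# Hypothetical metadata descriptions keyed by digital-object ID.
records = {
    "do-001": "hourly air temperature readings from rooftop sensors",
    "do-002": "customer purchase transactions for the online store",
    "do-003": "daily rainfall and temperature summary by city",
}
results = hybrid_search("temperature sensors", records)
print(results)
```

In this sketch `alpha` trades exact term matches against vector similarity; a real system would likely swap `vector_score` for an embedding model and a vector index.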
Key Results
| Task | DOR-RAF vs. Baselines |
|---|---|
| Digital Object Retrieval | +18.6% F1-score vs. keyword search |
| Data Characterization | +14.5 Answer Correctness vs. Original RAG |
| Faithfulness, Precision, Similarity | Substantially improved |
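For readers unfamiliar with the retrieval metric in the table, F1 is the harmonic mean of precision and recall. The sketch below shows the standard set-based definition; how the paper computes and aggregates F1 across queries is not stated in this README, so treat this as an assumption. The function name `retrieval_f1` and the object IDs are hypothetical.

```python
def retrieval_f1(retrieved, relevant):
    """Set-based F1 for a single query: harmonic mean of precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = len(retrieved & relevant)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(retrieved)
    recall = true_pos / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 3 objects retrieved, 4 relevant, 2 in common.
score = retrieval_f1(
    ["do-001", "do-003", "do-007"],
    ["do-001", "do-003", "do-004", "do-005"],
)
print(round(score, 4))
```

Here precision is 2/3 and recall is 2/4, so F1 is 4/7.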
What's Inside (Coming Soon)
We're actively preparing the open-source codebase.
Stay tuned! The code and documentation will be released shortly. Follow the repo and star it to get updates.
Abstract
Data infrastructures face challenges in metadata retrieval due to inefficiency, ambiguity, and manual costs. DOR-RAF tackles these by integrating Large Language Models (LLMs) and RAG tools via an intelligent Agent. It handles vague queries, decomposes complex tasks, and achieves interactive multi-turn retrieval. Experiments on a custom dataset show clear improvements over traditional keyword-based and vanilla RAG methods in terms of F1, precision, and semantic alignment.
Citation
If our work inspires yours, please cite us:
@inproceedings{shi2024meta,
title={Meta data retrieval for data infrastructure via RAG},
author={Shi, Zhuo-Fan and Liu, Kun and Bai, Shan and Jiang, Yun-Tao and Huo, Tong and Jing, Xiang and Li, Rui-Zhi and Ma, Xin-Jian},
booktitle={2024 IEEE International Conference on Web Services (ICWS)},
pages={100--107},
year={2024},
organization={IEEE}
}
Contact
Corresponding Author: Xin-jian Ma
Lead Maintainer: Zhuo-fan Shi
Affiliations: National Key Lab of Data Space Technology & Advanced Institute of Big Data, Beijing