RAGtag: A Retrieval-Augmented Generation-Based Topic Modeling Framework

National Key Laboratory of Data Space Technology and System, Beijing
Advanced Institute of Big Data, Beijing

DMBD 2024
*Indicates Equal Contribution

Abstract

Topic modeling is a pivotal task in natural language processing: given a large text corpus, an unsupervised model produces a set of word bags or phrases that best characterize the documents. Traditional approaches such as LDA offer users little interpretability or control, a limitation that prompting Large Language Models (LLMs) to generate topics has largely addressed. Prior work such as TopicGPT uses a prompt-based framework to uncover latent topics in a corpus. However, despite its innovative use of LLMs for topic generation, TopicGPT is constrained by LLM context-length limits and by its reliance on predefined topic directions. To overcome these challenges, we propose RAGtag, a Retrieval-Augmented Generation (RAG)-based method that dynamically generates topics and assigns them to documents. Unlike TopicGPT, RAGtag requires no prior knowledge of document themes and imposes no length limit, making it suitable for unpredictable industrial use cases. Our evaluation shows that RAGtag surpasses TopicGPT when handling large topic sets and unknown thematic directions. By incorporating external knowledge effectively, RAGtag offers greater flexibility and scalability, making it well suited to advanced document analysis and categorization in real-world settings.
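The retrieve-or-generate loop the abstract describes can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the toy token-overlap `embed`/`similarity` functions stand in for a neural embedding model, `generate_topic` is a placeholder for an LLM call, and the similarity threshold is an assumed parameter.

```python
def embed(text):
    """Toy 'embedding': a bag of lowercase word tokens.
    A real system would use a neural sentence encoder."""
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def generate_topic(doc):
    """Placeholder for an LLM call that proposes a short topic label
    for a document; here we just pick one token as a stand-in."""
    return min(embed(doc))

def assign_topics(documents, threshold=0.2):
    """For each document, retrieve the most similar topic already in the
    store; if none is similar enough, generate a new topic. The topic set
    grows with the corpus, so no predefined topic list (and no full-corpus
    context window) is needed."""
    topic_store = []   # list of (label, embedding) pairs
    assignments = []
    for doc in documents:
        vec = embed(doc)
        best = max(topic_store, key=lambda t: similarity(vec, t[1]),
                   default=None)
        if best is not None and similarity(vec, best[1]) >= threshold:
            label = best[0]          # reuse the retrieved topic
        else:
            label = generate_topic(doc)
            topic_store.append((label, vec))  # add a new topic
        assignments.append(label)
    return assignments, [label for label, _ in topic_store]
```

Because documents are processed one at a time against an external topic store, the approach is not bounded by a single prompt's context length.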
