RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Published at

2024/05/12

Last edited time

2025/01/12 04:31

Created

2024/03/13 15:53

Section

Prompt Engineering

Status

Done

Series

Tags

Paper

AI summary

Keywords

LLM

Long Context Search

Retrieval

Language

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Retrieval-augmented language models can better adapt to changes in world state and incorporate long-tail knowledge. However, most existing methods retrieve only short contiguous chunks from a retrieval corpus, limiting holistic understanding of the overall document context. We introduce the novel approach of recursively embedding, clustering, and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, our RAPTOR model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy.

https://arxiv.org/html/2401.18059v1

Abstract

•

Problems of existing retrieval-augmented approaches

◦

Most existing methods retrieve only a few short, contiguous text chunks, which limits holistic understanding of the overall document context.

▪

The authors designed an indexing and retrieval system that uses a tree structure to capture both high-level and low-level details about a text.

•

Selecting the most relevant information for knowledge-intensive tasks is still crucial.

Background

•

Recursive Summarization as Context

◦

The recursive-abstractive summarization model by Wu et al. (2021) employs task decomposition to summarize smaller text chunks, which are later integrated to form summaries of larger sections.

◦

While this method is effective for capturing broader themes, it can miss granular details. LlamaIndex (Liu, 2022) mitigates this issue by similarly summarizing adjacent text chunks while also retaining the intermediate nodes, so that varying levels of detail, including the granular ones, are stored.

◦

Because both methods group or summarize only adjacent chunks, they may still overlook distant interdependencies within the text.

•

Long texts often present subtopics and hierarchical structures (Cao & Wang, 2022; Dong et al., 2023b).

Method

Tree Construction Process

•

 RAPTOR recursively clusters chunks of text based on their vector embeddings and generates text summaries of those clusters, constructing a tree from the bottom up. Nodes clustered together are siblings; a parent node contains the text summary of that cluster.
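The bottom-up construction described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Node`, `build_tree`, and the `embed`, `cluster`, and `summarize` callables are hypothetical names introduced here, and `max_layers` is an assumed safety cap rather than a parameter taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class Node:
    text: str                      # chunk text (leaf) or cluster summary (parent)
    embedding: Vector              # embedding of `text`
    children: List["Node"] = field(default_factory=list)


def build_tree(
    chunks: Sequence[str],
    embed: Callable[[str], Vector],                      # hypothetical embedding function
    cluster: Callable[[List[Vector]], List[List[int]]],  # hypothetical: returns index groups
    summarize: Callable[[List[str]], str],               # hypothetical LLM summarizer
    max_layers: int = 5,                                 # assumed cap, not from the paper
) -> List[Node]:
    """Return the top layer of the tree; lower layers are reachable via .children."""
    # Layer 0: one leaf node per text chunk.
    layer = [Node(text=c, embedding=embed(c)) for c in chunks]

    for _ in range(max_layers):
        if len(layer) <= 1:
            break
        groups = cluster([n.embedding for n in layer])
        if len(groups) >= len(layer):
            break  # clustering no longer merges anything, so stop

        parents = []
        for idxs in groups:
            members = [layer[i] for i in idxs]           # siblings in the tree
            summary = summarize([m.text for m in members])
            # The parent holds the summary of its cluster and is re-embedded.
            parents.append(Node(summary, embed(summary), members))
        layer = parents

    return layer
```

Passing the three steps in as callables keeps the sketch independent of any particular embedding model, clustering library, or summarization model.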

The clustering approach in tree construction includes a few interesting ideas.

GMM (Gaussian Mixture Model)

•

Models the distribution of data points across clusters as a mixture of Gaussian components

•

The optimal number of clusters is chosen by evaluating the model's Bayesian Information Criterion (BIC)
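For reference, the standard definition of BIC is reproduced below. The symbols follow the usual convention and are not taken from these notes: N is the number of data points (here, chunk embeddings), k is the number of free parameters of the fitted model, and L-hat is the maximized likelihood. Under this definition, the candidate cluster count with the lowest BIC is preferred, since the first term penalizes model complexity and the second rewards fit.

```latex
\mathrm{BIC} = k \ln(N) - 2 \ln(\hat{L})
```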

UMAP (Uniform Manifold Approximation and Projection)

•

Supports clustering by projecting the embeddings into a lower-dimensional space before the GMM is fitted

•

Reduces the dimensionality of high-dimensional data

•

UMAP helps to highlight the natural grouping of data points based on their similarities

Local and Global Clustering

•

Used to analyze data at different scales

•

Both fine-grained and broader patterns within the data are captured effectively

Thresholding

•

Applied in the context of GMM to determine cluster membership

•

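Based on the probability distribution: a data point is assigned to every cluster for which its membership probability exceeds the threshold, so it can belong to more than one cluster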
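The thresholding step can be illustrated with a small Python sketch. It assumes the per-cluster membership probabilities have already been computed by a fitted GMM. The function name `soft_assign`, the default threshold of 0.1, and the fallback to the single most likely cluster are all assumptions made for illustration, not details confirmed by these notes.

```python
from typing import List, Sequence


def soft_assign(
    probs: Sequence[Sequence[float]],
    threshold: float = 0.1,  # assumed value, chosen only for illustration
) -> List[List[int]]:
    """For each point, list every cluster whose membership probability exceeds `threshold`."""
    memberships: List[List[int]] = []
    for row in probs:
        chosen = [k for k, p in enumerate(row) if p > threshold]
        if not chosen:
            # Assumption: fall back to the most likely cluster so no point is dropped.
            chosen = [max(range(len(row)), key=lambda k: row[k])]
        memberships.append(chosen)
    return memberships
```

Because a point can appear in several clusters, its text can contribute to more than one parent summary, which is what distinguishes this soft assignment from hard clustering.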
