RAG

Research papers, repositories, and articles about retrieval-augmented generation (RAG)

opendatalab/MinerU

Pipeline that converts messy PDFs and Office docs into clean markdown or JSON tuned for LLM and agent workflows. It's quickly becoming a standard pre-processing tool. Plug it in if you're serious about document-heavy RAG. ([github.com](https://github.com/trending?since=daily))

71,556

mvanhorn/last30days-skill

An AI agent skill that scrapes Reddit, X, YouTube, Hacker News, Polymarket, and the web for the last 30 days, then synthesizes a grounded summary. Use it to replace generic web search with a behavior-driven snapshot of what real people and real money care about.

41,912

chopratejas/headroom

Headroom compresses tool outputs, logs, and RAG chunks before they ever hit the model, often cutting tokens by 60–95%. It acts as a library, proxy, and MCP server so you can slash running costs without sacrificing answer quality.

27,419

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

The authors design a reward scheme that scores agents on how well they build evidence chains with proper citations, not just final answers. Their new training method reduces shortcut tricks and hallucinated claims, so deep research agents behave more like careful analysts.

Jiajie Zhang, Xin Lv

Over-Searching in Search-Augmented Large Language Models

This work shows that search-augmented models often call tools even when search hurts answers or wastes tokens. It introduces a cost-aware metric and mitigation tricks, so teams can dial back needless retrieval instead of just adding more context.

Roy Xie, Deepak Gopinath

LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation

Shows how to poison graph-structured knowledge used by retrieval-augmented systems. Focuses on attacks that subtly flip logical conclusions, not just surface facts.

Yilin Xiao, Jin Chen

ruvnet/ruflo

Agent orchestration platform tuned for Claude-based systems. Focuses on multi-agent swarms, enterprise deployments, and built-in RAG and code workflows. If you’re standardizing on Claude for serious products, study this before rolling your own orchestrator. ([github.com](https://github.com/trending?since=daily))

38,646

GoogleCloudPlatform/generative-ai

Large collection of Gemini on Vertex AI notebooks and sample apps. Great starting point if you want to build production-style systems on Google Cloud fast.

14,457

openai/openai-cookbook

The OpenAI cookbook is a large set of worked examples for building with OpenAI’s API. Treat it as a pattern library for chat apps, agents, RAG systems, and fine-grained evaluations.

70,628

Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

HGMem turns the “scratchpad” of a multi-step retrieval system into a hypergraph that connects many related facts at once. This richer memory structure helps language models keep global context straight over long tasks, boosting performance on challenging reasoning and long-document benchmarks.

Chulun Zhou, Chunkang Zhang
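
To make the hypergraph idea concrete, here is a minimal sketch in which a single hyperedge links any number of facts, whereas an ordinary graph edge links exactly two. It is an illustration only, not the HGMem authors' implementation: the `HypergraphMemory` class, its method names, and the sample facts are all hypothetical.

```python
from collections import defaultdict


class HypergraphMemory:
    """Toy memory in which one hyperedge can link any number of facts.

    Hypothetical illustration; not the data structure from the paper.
    """

    def __init__(self):
        self.edges = {}                          # edge label -> set of facts
        self.fact_to_edges = defaultdict(set)    # fact -> labels of edges containing it

    def add_edge(self, label, facts):
        """Record one relation that ties several facts together at once."""
        self.edges[label] = set(facts)
        for fact in facts:
            self.fact_to_edges[fact].add(label)

    def related(self, fact):
        """Return every other fact sharing at least one hyperedge with `fact`."""
        neighbours = set()
        for label in self.fact_to_edges.get(fact, ()):
            neighbours |= self.edges[label]
        neighbours.discard(fact)
        return neighbours


# Made-up facts gathered over two retrieval steps.
memory = HypergraphMemory()
memory.add_edge(
    "step-1",
    ["Alice founded Acme", "Acme is based in Oslo", "Acme makes sensors"],
)
memory.add_edge("step-2", ["Acme is based in Oslo", "Oslo is in Norway"])

print(sorted(memory.related("Acme is based in Oslo")))
```

Because the fact "Acme is based in Oslo" sits in both hyperedges, one lookup surfaces everything recorded alongside it in either retrieval step.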

ObjectGraph: From Document Injection to Knowledge Traversal — A Native File Format for the Agentic Era

Proposes a new file format that treats documents as typed graphs instead of long strings dumped into context windows. Agents query and traverse nodes, cutting token usage by up to roughly 95% while keeping task accuracy. If your agents still paste whole PDFs into prompts, this hints at a cleaner architecture layer. ([arxiv.org](https://arxiv.org/abs/2604.27820))

Mohit Dubey, Open Gigantic

yichuan-w/LEANN

LEANN is a compact retrieval system for "RAG on everything" with big storage savings. It compresses document representations while keeping accuracy high, making private, on-device retrieval far cheaper.

8,971

RyanCodrai/turbovec

Turbovec is a vector index built on TurboQuant with Rust internals and Python bindings. It targets high-speed similarity search for embeddings. Drop it into your stack if your current vector store is the bottleneck.

7,194

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

Uses a language model’s own feedback as a training signal for retrieval rerankers in RAG pipelines. Aims to pick more useful documents for question answering.

Yuhang Wu, Xiangqing Shen

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

Defines a benchmark that scores how models actually write answers when given retrieved documents. Helps teams compare RAG setups on answer quality, not just retrieval hit rates.

Koki Itai, Shunichi Hasegawa

lfnovo/open-notebook

open-notebook recreates NotebookLM as an open-source app with more control and features. It lets you spin up your own AI notebook that reasons over your documents without being locked into a single provider.

30,471

Tencent/WeKnora

Tencent’s LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answering with a RAG paradigm. Essentially a production-grade answer engine stack rather than a toy demo. ([github.com](https://github.com/trending?since=daily))

8,623

Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model

Introduces Neuro-RIT, a neuron-guided instruction-tuning method for adapting language models to retrieval-heavy tasks. The aim is steadier answers when retrieved documents shift or are noisy.

Jaemin Kim, Jae O Lee

refactoringhq/tolaria

Tolaria is a desktop app for managing markdown knowledge bases, often paired with local LLMs. It makes it easier to turn notes into an AI-ready memory store. Try it if your personal or team knowledge is scattered across files and you want AI on top.

12,880

langchain-ai/rag-from-scratch

Step-by-step notebooks for building retrieval-augmented generation systems without heavy frameworks. Walks through indexing, retrieval, and response patterns. If your team keeps misusing generic RAG libraries, force everyone to work through this once. ([github.com](https://github.com/trending/jupyter-notebook?since=daily))

8,181

sinaptik-ai/pandas-ai

pandas-ai turns DataFrames and SQL/CSV/Parquet sources into a conversational interface. It translates natural-language questions into code or SQL, runs them in a configurable sandbox, and can optionally use RAG and semantic schemas to answer more complex queries. It’s attractive for quickly giving analysts or business users an LLM front end on top of existing data, but pay attention to its security configuration: the project has a history of prompt-injection/RCE issues that were later mitigated with new settings. ([github.com](https://github.com/sinaptik-ai/pandas-ai?utm_source=openai))

22,805

Naiad: Novel Agentic Intelligent Autonomous System for Inland Water Monitoring

Naiad chains an AI agent with weather data, satellite imagery, and domain tools to monitor lakes and rivers end to end. It lets non-experts ask plain-language questions and get tailored environmental reports, showing how agent stacks can tackle real infrastructure problems.

Eirini Baltzi, Tilemachos Moumouris