Data

Research papers, repositories, and articles about data

opendatalab/MinerU

Pipeline that converts messy PDFs and Office docs into clean Markdown or JSON tuned for LLM and agent workflows. It's quickly becoming a standard preprocessing tool. Plug it in if you're serious about document-heavy RAG. ([github.com](https://github.com/trending?since=daily))

71,556

Exploring Autonomous Agentic Data Engineering for Model Specialization

The authors build an LLM-based "data engineer" that plans, collects, cleans, and filters domain data without hand-written pipelines. The approach can specialize models to new fields with less manual effort, but the paper also exposes how brittle current agent setups remain over long, messy workflows. If you’re building vertical copilots, this paper is a blueprint for automating your data pipeline end to end.

Yujie Luo, Xiangyuan Ru

What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

Using a NeurIPS data curation challenge, this paper shows that picking hard, aligned examples beats simply adding more data or more diverse data. For vision–language reasoning, curation quality matters more than dataset size.

Yosub Shin, Michael Buriek

cocoindex-io/cocoindex

A high-performance data transformation engine built for AI pipelines. It focuses on incremental processing, so you can keep large feature stores and training datasets in sync cheaply. ([github.com](https://github.com/trending))

4,395
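
To make the incremental-processing idea in the cocoindex entry concrete, here is a minimal sketch of the general technique: hash each source record and re-run the transformation only when the hash changes. This is not cocoindex's API. The names `content_hash`, `incremental_update`, and the in-memory `cache` dict are hypothetical and exist only for illustration; a real engine would persist this state and track much more.

```python
import hashlib
from typing import Callable, Dict, Tuple


def content_hash(text: str) -> str:
    """Fingerprint a source record so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def incremental_update(
    sources: Dict[str, str],
    cache: Dict[str, Tuple[str, str]],
    transform: Callable[[str], str],
) -> int:
    """Bring `cache` in sync with `sources`.

    `cache` maps each key to (source hash, transformed output).
    Returns how many records were actually re-transformed.
    """
    recomputed = 0
    for key, text in sources.items():
        digest = content_hash(text)
        cached = cache.get(key)
        if cached is not None and cached[0] == digest:
            continue  # source unchanged: reuse the stored output
        cache[key] = (digest, transform(text))
        recomputed += 1

    # Drop outputs whose source record no longer exists.
    for key in list(cache.keys() - sources.keys()):
        del cache[key]
    return recomputed
```

With this shape, the cost of a sync scales with the number of changed records rather than the size of the whole dataset, which is the property the entry describes as keeping data "in sync cheaply".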

D4Vinci/Scrapling

Adaptive web-scraping framework that scales from one-off fetches to large crawls. It’s designed to play nicely with AI agents that need to browse and extract data. If your agents keep breaking on websites, Scrapling is worth testing as the web layer.

56,576