Data
Research papers, repositories, and articles about data
opendatalab/MinerU
Pipeline that converts messy PDFs and Office docs into clean markdown or JSON tuned for LLM and agent workflows. It's quickly becoming a standard pre-processing tool. Plug it in if you're serious about document-heavy RAG. ([github.com](https://github.com/trending?since=daily))
Exploring Autonomous Agentic Data Engineering for Model Specialization
The authors build an LLM-based "data engineer" that plans, collects, cleans, and filters domain data without hand-written pipelines. It can specialize models to new fields with less manual effort, but also exposes how brittle current agent setups remain over long, messy workflows. If you’re building vertical copilots, this paper is a blueprint for automating your data pipeline end to end.
What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge
Using a NeurIPS data curation challenge, this paper shows that picking hard, aligned examples beats simply adding more data or more diverse data. For vision–language reasoning, curation quality matters more than dataset size.
cocoindex-io/cocoindex
A high-performance data transformation engine built for AI pipelines. It focuses on incremental processing, so you can keep large feature stores and training datasets in sync cheaply. ([github.com](https://github.com/trending))
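To make "incremental processing" concrete, here is a minimal sketch of the general idea in plain Python. It is not CocoIndex's API: `fingerprint`, `incremental_sync`, and the in-memory `cache` dict are hypothetical names chosen for illustration. The assumption is that each source item can be fingerprinted by content hash, so only new or changed items are re-transformed.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Content hash used to detect whether a source item changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def incremental_sync(sources, cache, transform):
    """Bring `cache` in sync with `sources`, recomputing only what changed.

    sources:   dict mapping key -> raw text (hypothetical input shape)
    cache:     dict mapping key -> (fingerprint, transformed output)
    transform: the expensive per-item function
    Returns the list of keys that were actually recomputed.
    """
    recomputed = []
    for key, text in sources.items():
        fp = fingerprint(text)
        entry = cache.get(key)
        # Re-run the transform only for new items or items whose content changed.
        if entry is None or entry[0] != fp:
            cache[key] = (fp, transform(text))
            recomputed.append(key)
    # Drop outputs whose source item no longer exists.
    for key in list(cache):
        if key not in sources:
            del cache[key]
    return recomputed


if __name__ == "__main__":
    cache = {}
    print(incremental_sync({"a": "x", "b": "y"}, cache, str.upper))
    print(incremental_sync({"a": "x", "b": "z"}, cache, str.upper))
```

On the second call only `b` is recomputed, because `a` has the same fingerprint as before. A real engine would persist the cache and track dependencies between transformation steps; this sketch covers only the change-detection step.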
D4Vinci/Scrapling
Adaptive web-scraping framework that scales from one-off fetches to large crawls. It’s designed to play nicely with AI agents that need to browse and extract data. If your agents keep breaking on websites, Scrapling is worth testing as the web layer.