"Data preparation pipelines"

20 results for “Data preparation pipelines”

Hybrid search: Keyword + semantic, ranked by combined score.

cs.AI · cs.DB · Recent · May 27, 2026

A Query Engine for the Agents

The paper introduces Hyperparam, a set of lightweight JavaScript libraries designed to enable direct, model-aware querying of unstructured data (like agent traces) within client-side AI applications.

cs.DB · cs.AI · Recent · May 29, 2026

Sophrosyne: Agentic Exploration of Relational Data Systems Needs Moderation

Madhav Jivrajani, Ramnatthan Alagappan, Aishwarya Ganesan

The paper introduces Sophrosyne, a system that moderates LLM agent exploration in relational data systems, significantly reducing over-exploration and boosting SQL generation accuracy by guiding the a…

cs.DB · cs.AI · Recent · May 29, 2026

SpecDB: LLM-Generated Customized Databases via Feature-Oriented Decomposition

Yunkai Lou, Longbin Lai, Shunyang Li, Zhengping Qian +1 more

SpecDB is a novel system that uses LLMs to synthesize highly customized, purpose-built relational databases, achieving performance comparable to commercial systems while significantly reducing code si…

cs.CL · Recent · May 29, 2026

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren +1 more

The paper proposes Speculative Pipeline Decoding (SPD), a novel framework that uses pipeline parallelism to accelerate LLM inference by processing multiple tokens in parallel, achieving higher speedup…

cs.AI · Recent · May 31, 2026

"Skill issues": data-centric optimization of lakehouse agents

Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…

cs.CL · Recent · May 31, 2026

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta

This study benchmarks four local LLMs for natural-language-to-SQL querying in biopharma manufacturing, finding that general-purpose code-tuned models like Llama 3.1 8B and Qwen 2.5 Coder 7B outperform…

cs.CR · cs.AI · cs.LG · Recent · Mar 27, 2026

Machine Learning Transferability for Malware Detection

César Vieira, João Vitorino, Eva Maia, Isabel Praça

This study evaluates various data preprocessing pipelines to improve the transferability and generalization of Machine Learning models for detecting malware in Portable Executable (PE) files across di…

cs.AI · Recent · May 31, 2026

SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback

Leo Luo, Haining Xie, Siqi Shen, Zhipeng Ma +7 more

SIRIUS-SQL introduces a robust multi-candidate text-to-SQL system that addresses weaknesses in candidate generation, error handling, and selection, achieving state-of-the-art performance on complex be…

cs.CR · Recent · May 20, 2026

An Evidence-driven Protocol for Trustworthy CI Pipelines

Fernando Castillo, Eduardo Brito, Pille Pullonen-Raudvere, Sebastian Werner +1 more

The paper proposes an evidence-driven protocol combining Deterministic Build Systems and Trusted Execution Environments to provide cryptographically verifiable guarantees of software artifact integrit…

cs.IR · cs.AI · cs.CL · Recent · Jun 1, 2026

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou +2 more

The paper introduces Open-Domain Tabular Question Answering for Future Data Forecasting and Reasoning, a new dataset and framework that enables LLMs to perform time-series forecasting and reasoning on…

cs.AI · cs.CL · Recent · May 28, 2026

Demystifying Data Organization for Enhanced LLM Training

Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang +7 more

This paper proposes four guidelines and two novel data ordering methods (STR and SAW) to systematically optimize data organization, significantly enhancing the stability and performance of LLM trainin…

cs.CR · Recent · May 21, 2026

Parser-Free Querying of Security Logs

Evan Luo, Julien Piet, David Wagner

The paper introduces Sieve, a system that uses a large language model (LLM) to generate executable query code from natural language security questions, significantly improving the ability to perform c…

cs.LG · Recent · Jun 1, 2026

TabPrep: Closing the Feature Engineering Gap in Tabular Benchmarks

Andrej Tschalzev, Nick Erickson, Yuyang Wang, Huzefa Rangwala +3 more

The paper introduces TabPrep, a feature engineering pipeline that systematically improves performance across various tabular machine learning models by addressing structural data patterns ignored by c…

cs.CR · cs.DB · Recent · May 20, 2026

Polars inside Intel SGX2 Enclaves: An Empirical Study of Confidential Analytical Query Processing

Wei Wang, Burns Smith, Kenny Leftin

This paper empirically evaluates the performance of the Polars DataFrame engine running within Intel SGX2 enclaves, finding that while the overall security overhead is manageable, the performance is s…

cs.CL · cs.DS · Recent · May 29, 2026

Incremental BPE Tokenization

Shenghu Jiang, Ruihao Gong

The paper introduces an efficient, novel algorithm for incremental Byte Pair Encoding (BPE) tokenization that processes input text prefix by prefix, achieving significant speedups and enabling streami…

cs.AI · Recent · Jun 1, 2026

From Capability Models to Automated Planning: An AAS-Native Approach for Automatic PDDL Generation

Hamied Nabizada, Thomas Wirt, Luis Miguel Vieira da Silva, Felix Gehlhoff +1 more

This paper proposes an automated method to generate complete PDDL planning problems directly from Asset Administration Shell (AAS) capability models, eliminating the need for specialized planning expe…

quant-ph · cs.CR · Recent · May 13, 2026

QCIVET: A Quantum-Classical Pipeline Integrity Framework with Contract-Based Subtype Verification and Hash-Chained Audit Traces

Esra Yeniaras, Muhammad Amin Karimov

QCIVET introduces a novel contract-based framework to ensure the integrity of hybrid quantum-classical pipelines by verifying both the structure (syntactic) and the behavior (semantic) of quantum stag…

cs.CR · Recent · Apr 5, 2026

Styx: Collaborative and Private Data Processing With TEE-Enforced Sticky Policy

Shixuan Zhao, Weicheng Wang, Ninghui Li, Zhiqiang Lin

Styx is a novel framework that enhances data privacy and security in collaborative data processing, such as joint AI training, by integrating sticky policies with Trusted Execution Environments (TEEs)…

cs.LG · cs.AI · Recent · May 29, 2026

dashi: A Python library for Dataset Shift Characterization to Support Trustworthy AI Development and Deployment

David Fernández-Narro, Pablo Ferri, Ángel Sánchez-García, Juan M. García-Gómez +1 more

The paper introduces 'dashi,' an open-source Python library that provides comprehensive tools for characterizing dataset shifts (covariate, prior, concept) to ensure robust and trustworthy AI developm…

cs.CR · cs.SE · Recent · May 4, 2026

SCRIBE: Practical Static Binary Patching via Binary-Aware Recompilation of Decompiled Code

Han Dai, Soumyakant Priyadarshan, Abdullah Imran, Ruoyu Wang +1 more

SCRIBE is a novel framework that enables reliable source-level patching of binaries by performing 'binary-aware' recompilation, successfully resolving syntactic and semantic inaccuracies inherent in d…
