~ similar to 2606.01469· 19 results
The paper introduces Latent Terms, a method that shows dense retrieval models implicitly learn sparse, Zipfian vocabularies that can be used for classical BM25-style sparse scoring without requiring s…
The paper proposes InSemRAG, an enhanced RAG framework that improves retrieval accuracy and knowledge integrity by incorporating intent-aware retrieval and semantics-preserving chunking, achieving sta…
This study systematically evaluates a wide range of chunking methods for Retrieval-Augmented Generation (RAG) to assess their effectiveness and highlight the overlooked challenges associated with chun…
The paper introduces FOSSIL, a new multilingual dataset and specialized workflow designed to significantly improve the extraction of citations embedded within complex footnotes common in law and human…
Sherzod Turaev, Mary John, Mamoun Awad, Nazar Zaki +1 more
The paper introduces a robust four-stage NLP framework that uses schema-constrained LLMs and ESCO vocabulary to accurately extract and align educational competencies with labor market demands, quantif…
The authors introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database.
The paper introduces a novel, scalable framework to monitor and classify dataset usage within research literature, addressing the current lack of infrastructure for tracking data citations.
Yutong Cheng, Changze Li, Raihan Sultan Pasha Basuki, Qian Cui +2 more
TTPrint proposes a novel diverge-then-converge framework for extracting MITRE ATT&CK techniques from CTI reports, significantly improving both recall and precision compared to existing methods.
The paper proposes an embarrassingly simple detector that monitors model extraction attacks by testing whether the aggregate distribution of incoming LLM queries deviates from the historical distribut…
This paper proposes a lightweight encoder-based MEL solution called FAST-MEL that meets three objectives: high linking accuracy, computational efficiency, and storage efficiency.
The paper systematically compares multiple content representations for RAG pipelines and finds that answer retention—the ability of the representation to preserve the original answer-bearing content—i…
The study compares agentic data retrieval using unstructured web data versus structured, semantically-annotated datasets, concluding that semantic metadata remains essential for high-precision, reliab…
The paper proposes a novel KAN-enhanced BiGRU architecture to improve legal document classification and summarization in a low-resource, multilingual setting using Bengali and English legal texts.
Zelin Guan, Shengda Zhuo, Zeyan Li, Jinchun He +3 more
E-MIA introduces a novel, stealthy black-box membership inference attack that converts verifiable hard evidence within a candidate document into an objective, multi-part exam score to determine if the…
The paper systematically evaluates advanced retrieval-augmented generation (RAG) architectures for Cyber Threat Intelligence (CTI), demonstrating that a hybrid graph-text approach significantly improv…
The paper introduces 'infilling extraction' to accurately model training data memorization in Diffusion Language Models (DLMs), finding that bidirectional masking significantly increases the extractab…
Wenna Lai, Haoran Xie, Guandong Xu, Qing Li +1 more
The paper proposes FiVeD, a fine-grained verification framework that uses diagnostic reasoning supervision to significantly improve the reliability and performance of Aspect Sentiment Triplet Extracti…
Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su +1 more
The paper introduces Ryze, an automated system that synthesizes evidence-enriched Question-Answering (QA) pairs from raw biomedical papers, resulting in a specialized VLM (BioVLM-8B) that significantl…
Yung-Yu Shih, Shang-Yu Su, Tzu-I Ho, Dongzhe Wang +1 more
The paper presents BEATS, a human-in-the-loop LLM framework for bootstrapping product attribute taxonomies from scratch.