20 results for “text mining”
CS papers only. Hybrid search: keyword + semantic, ranked by combined score.
The authors introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database.
The paper proposes a low-cost and interpretable fine-tuning extraction strategy for automatic term extraction, demonstrating consistent and balanced performance on the ATE Shared Task.
The paper introduces an agentic framework for text clustering that dynamically adapts the taxonomy generation process using specialized LLM agents, achieving state-of-the-art performance on multiple b…
The paper introduces a novel, scalable framework to monitor and classify dataset usage within research literature, addressing the current lack of infrastructure for tracking data citations.
The paper introduces IPO-Mine, a comprehensive toolkit and large-scale dataset designed to enable standardized, multimodal analysis of extremely long and structurally complex Initial Public Offering (…
This study compares various authorship attribution methods on Japanese web reviews, finding that while BERT fine-tuning performs best, TF-IDF+LR offers superior stability and efficiency for large-scal…
The paper proposes a novel KAN-enhanced BiGRU architecture to improve legal document classification and summarization in a low-resource, multilingual setting using Bengali and English legal texts.
The paper introduces TaDaS, a framework that analyzes large-scale text archives to measure professional sentiment, finding that while AI discussion among economists is initially negative, the trend sh…
Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková +1 more
This paper introduces SkMTEB, a comprehensive text embedding benchmark for Slovak, and develops efficient, locally-deployable Slovak embeddings.
The paper introduces TorchSight, an open-source local system using a fine-tuned Qwen 3.5 27B model that achieves high accuracy (95.0%) in classifying sensitive security documents without relying on ex…
This paper conducts a large-scale audit of human annotation reporting in NLP, finding that while reporting has improved, critical details needed to assess annotation validity, such as training and agr…
This paper introduces a machine learning system that detects phishing emails by analyzing contextual features from the entire email body content, achieving 95.41% accuracy using Logistic Regression.
The paper introduces FOSSIL, a new multilingual dataset and specialized workflow designed to significantly improve the extraction of citations embedded within complex footnotes common in law and human…
This paper introduces KliniskVestBERT, a suite of BERT models specialized by pre-training on a large, diverse corpus of real-world Norwegian clinical texts, demonstrating superior performance for clin…
The paper introduces 'bundesrecht,' an open-source, end-to-end pipeline for processing complex German statutory references, which parses, normalizes, and resolves raw citation strings into structured,…
Xiaoqi He, Kaixin Lan, Mu You, Tao Fang +2 more
The paper proposes MACAT, a Multi-Agent Culture-Aware Translation framework, to selectively translate culture-loaded words in ancient Chinese texts, achieving superior performance over existing method…
The paper enhances French parsing accuracy by integrating data from a syntactic lexicon and applying word clustering methods to verbs within a Probabilistic Context-Free Grammar framework.
Liangyi Huang, Zichen Liu, Fei Shao, Shang Ma +4 more
The paper introduces GRID, an end-to-end framework that significantly improves the construction of security knowledge graphs from cyber threat intelligence by replacing unstable LLM-based supervision…
This paper proposes a joint BERT-GNN architecture to systematically extract entities and relationships from diverse historical texts, achieving superior performance over conventional methods.
The paper introduces KnowledgeGain, a novel metric that measures the actual knowledge gained by readers from science news, and demonstrates its use in optimizing news generation to improve reader lear…