"text mining" | ArxivCSExplorer

20 results for “text mining”

CS papers only

Hybrid search: Keyword + semantic, ranked by combined score.

cs.IR | cs.CL | Dataset | Recent | Jun 9, 2026

A PubMed-Scale Dataset of Structured Biomedical Abstracts

Chia-Hsuan Chang, Haerin Song, Brian Ondov, Hua Xu

The authors introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database.

cs.CL | Recent | May 31, 2026

Peacemaker at ATE-IT: Automatic term extraction from Italian text for waste management data using encoder model

Mahdi Bakhtiyarzadeh, Hadi Bayrami Asl Tekanlou, Jafar Razmara

The paper proposes a low-cost and interpretable fine-tuning extraction strategy for automatic term extraction, demonstrating consistent and balanced performance on the ATE Shared Task.

cs.CL | Recent | May 31, 2026

Agentic Clustering: Controllable Text Taxonomies via Multi-Agent Refinement

Simon Löwe, Emily Silcock

The paper introduces an agentic framework for text clustering that dynamically adapts the taxonomy generation process using specialized LLM agents, achieving state-of-the-art performance on multiple b…

cs.CL | Recent | May 28, 2026

AI for Monitoring and Classifying Data Used in Research Literature

Rafael Macalaba, Aivin V. Solatorio

The paper introduces a novel, scalable framework to monitor and classify dataset usage within research literature, addressing the current lack of infrastructure for tracking data citations.

cs.CL | cs.AI | Recent | May 27, 2026

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, Sagnik Nandi +7 more

The paper introduces IPO-Mine, a comprehensive toolkit and large-scale dataset designed to enable standardized, multimodal analysis of extremely long and structurally complex Initial Public Offering (…

cs.CL | cs.CR | Recent | Mar 24, 2026

Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis

Hiroshi Matsubara, Shingo Matsugaya, Taichi Aoki, Masaki Hashimoto

This study compares various authorship attribution methods on Japanese web reviews, finding that while BERT fine-tuning performs best, TF-IDF+LR offers superior stability and efficiency for large-scal…

cs.CL | cs.AI | cs.LG | Recent | May 27, 2026

Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization

Ahmed Faizul Haque Dhrubo, Souvik Pramanik, Most. Aysha Siddika Sumona, Shahnewaz Siddique +3 more

The paper proposes a novel KAN-enhanced BiGRU architecture to improve legal document classification and summarization in a low-resource, multilingual setting using Bengali and English legal texts.

cs.CE | Recent | Jun 1, 2026

Are Economists Open to AI? Text as Data as Survey on Professional Sentiment and Academic Research Trends

Yi Wang, Lei Ge

The paper introduces TaDaS, a framework that analyzes large-scale text archives to measure professional sentiment, finding that while AI discussion among economists is initially negative, the trend sh…

cs.CL | cs.AI | cs.LG | Empirical | Recent | Jun 11, 2026

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková +1 more

This paper introduces SkMTEB, a comprehensive text embedding benchmark for Slovak, and develops efficient, locally-deployable Slovak embeddings.

cs.CR | cs.AI | Recent | May 19, 2026

Security Document Classification with a Fine-Tuned Local Large Language Model: Benchmark Data and an Open-Source System

Ivan Dobrovolskyi

The paper introduces TorchSight, an open-source local system using a fine-tuned Qwen 3.5 27B model that achieves high accuracy (95.0%) in classifying sensitive security documents without relying on ex…

cs.CL | cs.AI | Recent | Jun 1, 2026

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen +9 more

This paper conducts a large-scale audit of human annotation reporting in NLP, finding that while reporting has improved, critical details needed to assess annotation validity, such as training and agr…

cs.CR | Recent | Mar 28, 2026

Context-Aware Phishing Email Detection Using Machine Learning and NLP

Amitabh Chakravorty, Matthew Price, Nelly Elsayed, Zag ElSayed

This paper introduces a machine learning system that detects phishing emails by analyzing contextual features from the entire email body content, achieving 95.41% accuracy using Logistic Regression.

cs.DL | cs.CL | Recent | May 31, 2026

Digging Up Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities

Luca Foppiano, Christian Boulanger

The paper introduces FOSSIL, a new multilingual dataset and specialized workflow designed to significantly improve the extraction of citations embedded within complex footnotes common in law and human…

cs.CL | cs.AI | Recent | Jun 1, 2026

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

Christian Autenried, Cosimo Persia

This paper introduces KliniskVestBERT, a suite of BERT models specialized by pre-training on a large, diverse corpus of real-world Norwegian clinical texts, demonstrating superior performance for clin…

cs.CL | Recent | May 29, 2026

Bundesrecht: An Open Library and Corpus for German Statutory Reference Processing

Harshil Darji, Martin Heckelmann, Christina Kratsch, Gerard de Melo

The paper introduces 'bundesrecht,' an open-source, end-to-end pipeline for processing complex German statutory references, which parses, normalizes, and resolves raw citation strings into structured,…

cs.CL | Recent | May 31, 2026

Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination

Xiaoqi He, Kaixin Lan, Mu You, Tao Fang +2 more

The paper proposes MACAT, a Multi-Agent Culture-Aware Translation framework, to selectively translate culture-loaded words in ancient Chinese texts, achieving superior performance over existing method…

cs.CL | cs.LG | Recent | May 30, 2026

French parsing enhanced with a word clustering method based on a syntactic lexicon

Anthony Sigogne, Matthieu Constant, Eric Laporte

The paper enhances French parsing accuracy by integrating data from a syntactic lexicon and applying word clustering methods to verbs within a Probabilistic Context-Free Grammar framework.

cs.AI | cs.CR | Recent | May 15, 2026

GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

Liangyi Huang, Zichen Liu, Fei Shao, Shang Ma +4 more

The paper introduces GRID, an end-to-end framework that significantly improves the construction of security knowledge graphs from cyber threat intelligence by replacing unstable LLM-based supervision…

cs.CL | cs.AI | Recent | Jun 1, 2026

Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks

Ping Li, Bartlomiej Brzozka

This paper proposes a joint BERT-GNN architecture to systematically extract entities and relationships from diverse historical texts, achieving superior performance over conventional methods.

cs.CL | cs.AI | Recent | May 29, 2026

KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning

Dominik Soós, Meng Jiang, Jian Wu

The paper introduces KnowledgeGain, a novel metric that measures the actual knowledge gained by readers from science news, and demonstrates its use in optimizing news generation to improve reader lear…
