|
Selected Publications and Preprints
My research interests lie in the following topics: Data Management (Data Annotation, Data Integration, and Data Organization), RAG, Agent Memory, and RL. My vision is to make data usable: from ensuring annotation and preparation quality, to developing effective data organization paradigms, to improving the usability of data in downstream applications (e.g., RAG). Here are my selected publications. Please check my Google Scholar for the full list of my publications. Feel free to drop me an email if you are interested in potential collaborations! :)
|
|
APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL
Bowen Cao, Weibin Liao, Yushi Sun+, Dong Fang+, Haitao Li, Wai Lam
Preprint, Under Review, 2026, + indicates corresponding author.
We propose APEX-SQL, an agentic Text-to-SQL framework that shifts the paradigm from passive translation to agentic exploration. Our framework employs a hypothesis-verification loop to ground model reasoning in real data. In the schema linking phase, we use logical planning to verbalize hypotheses, dual-pathway pruning to reduce the search space, and parallel data profiling to validate column roles against real data, followed by global synthesis to ensure topological connectivity. For SQL generation, we introduce a deterministic mechanism to retrieve exploration directives, allowing the agent to effectively explore data distributions, refine hypotheses, and generate semantically accurate SQL queries.
paper
[LLM Agent]
|
|
LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation
Yushi Sun, Xijia Li, Nan Tang, Quanqing Xu, Chuanhui Yang, Lei Chen
Preprint, Under Review, 2026
We introduce a novel PLM-LLM collaborative approach to cross-data-lake column type annotation that significantly reduces the generalization cost of domain-specific column type annotation models.
paper
[Data Preparation]
|
|
MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring
Qipeng Wang, Rui Sheng, Yafei Li, Huamin Qu, Yushi Sun+, Min Zhu+
Preprint, Under Review, 2026, + indicates corresponding author.
We propose MedKGI, a diagnostic framework grounded in clinical practice. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns.
paper
[KG]
|
|
CacheRAG: A Novel Approach to Enhance KG-based RAG Through Caching Mechanisms
Yushi Sun, Lei Chen
VLDB, Under Revision, 2026
We proposed a novel KGQA approach that uses a caching structure to significantly improve the QA accuracy of KG-based RAG pipelines.
data & code;
paper
[LLM, RAG, KG]
|
|
KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering
Yushi Sun, Kai Sun, Ethan Yifan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen
EMNLP (Findings), 2025
We proposed a novel KGQA approach that resolves the low-recall issue of existing methods.
data & code;
paper
[LLM, RAG, KG]
|
|
CRAG - Comprehensive RAG Benchmark
Xiao Yang*, Kai Sun*, Hao Xin*, Yushi Sun*, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran,
Jiaqi Wang, Ethan Yifan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
NeurIPS, 2024, * indicates equal contribution.
We constructed a comprehensive RAG benchmark and hosted the KDD Cup competition.
paper;
KDD Cup
[LLM, RAG, Data Organization]
|
|
Are Large Language Models a Good Replacement of Taxonomies?
Yushi Sun, Hao Xin, Kai Sun, Ethan Yifan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen
VLDB, 2024
We conducted an extensive evaluation of SOTA LLMs on taxonomies.
data & code;
paper;
slides
[LLM, Data Organization]
|
|
Cross-domain-aware Worker Selection with Training for Crowdsourced Annotation
Yushi Sun, Jiachuan Wang, Peng Cheng, Libin Zheng, Lei Chen, Jian Yin
ICDE, 2024
We proposed a novel cross-domain-aware approach that combines worker selection with training for crowdsourced data labeling.
data & code;
paper;
slides
[Data Annotation]
|
|
RECA: Related Tables Enhanced Column Semantic Type Annotation Framework
Yushi Sun, Hao Xin, Lei Chen
VLDB, 2023
We defined a novel named entity schema to discover and align related and sub-related tables, improving the annotation quality of column semantic types.
data & code;
paper;
slides
[Data Preparation]
|
|
|
Project Up Scholar at Tencent (2025)
HKUST Research Travel Grant (2024, 2025)
RedBird Academic Excellence Award for Continuing PhD Students (2023-2024)
RedBird Academic Excellence Award for Continuing PhD Students (2022-2023)
RedBird PhD Scholarship (2021)
HKUST Academic Achievement Medal (2021)
First Class Honors graduate of HKUST (2021)
Hong Kong PhD Fellowship Scheme (2021-2025)
|
|
|
Conference reviewer: CIKM 2023
Journal reviewer: TKDE 2024
|
|
|
Teaching Assistant for COMP 1021 Introduction to Computer Science (2024 Fall)
Teaching Assistant for COMP 2711H Honors Discrete Mathematical Tools for Computer Science (2022 Fall)
Teaching Assistant for COMP 5712 Introduction to Combinatorial Optimization (2022 Spring)
|
|