ACM SIGKDD 2026 · Special Day
Putting data back at the center of machine learning and AI — a half-day of keynotes and debate on the science of data itself.
✦ Closing Panel · Will AI Agents Make Data Scientists Obsolete?
The Vision
Every breakthrough in predictive modeling, every advance in generative AI, and every leap in autonomous systems rests on a foundation of high-quality, representative, and well-understood data. Without data, models are hollow shells; with the right data, even simple algorithms can yield extraordinary insights.
Yet for too long the field has operated under a model-centric paradigm — celebrating ever-more-powerful architectures while treating data as a fixed, almost incidental commodity. This imbalance has created silent but profound crises: models that fail in production because training data did not reflect deployment distributions; datasets riddled with hidden biases that surface only after release; and a widening gap between cutting-edge research and the messy, evolving data realities of industry.
Data Day reframes data as what it truly is: a dynamic, fragile, and invaluable resource that demands as much scientific rigor and engineering excellence as the models themselves.
Keynote Speakers
Leaders from academia and industry — listed alphabetically.
Xin Luna Dong
Principal Scientist, Meta Wearables AI
“From Sight to Insight: Visual Memory for Smarter Assistants”
Jiawei Han
Michael Aiken Chair Professor, UIUC
“Towards Theme-Based and Structure-Guided Knowledge Discovery with LLMs”
Qi He
VP & Head of Science for Outlook, Microsoft
“From Big Data to Right Data: Post-Training Agentic AI for Outlook”
Xia “Ben” Hu
Lead Scientist, Shanghai AI Laboratory
“The Path to Trustworthy AI Agents”
Haixun Wang
VP of Engineering & Head of AI, EvenUp · U. Washington
“Document AI: How ready is it for mission-critical tasks?”
Joyce Jiyoung Whang
Associate Professor, KAIST
“Connect Data, Discover Knowledge: Empowering AI with Graph Intelligence”
Program
All times local to Jeju (KST).
Jiawei Han · “Towards Theme-Based and Structure-Guided Knowledge Discovery with LLMs”
Data mining aims for knowledge discovery from massive amounts of data across multiple themes. Large language models can be a powerful tool to help us understand such data of various forms. We argue that mining structures from theme-focused data plays a key role in effective extraction of theme-related data and knowledge for successful LLM-based retrieval, reasoning, knowledge discovery, and problem solving. We introduce recent methods for mining knowledge structures for theme-centered retrieval and for constructing theme-specific reasoning graphs to facilitate structure-augmented reasoning generation, and show how theme-centered, structure-augmented generation enhances knowledge discovery and multi-hop reasoning with LLMs.
Jiawei Han is Michael Aiken Chair Professor in the Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign. A Fellow of ACM and IEEE with over 1,000 research publications, he received the ACM SIGKDD Innovation Award (2004), the IEEE Computer Society Technical Achievement Award (2005), the IEEE W. Wallace McDowell Award (2009), and Japan's Funai Achievement Award (2018), and was elevated to Fellow of the Royal Society of Canada (2022).
Xin Luna Dong · “From Sight to Insight: Visual Memory for Smarter Assistants”
Imagine a personal assistant that, with user permission, persistently remembers moments from daily life — answering “When and where did I see this lady?” or offering suggestions like “You might enjoy The Little Prince — it relates to the statue you liked in Lyon.” Realizing this vision requires capturing visual memories under hardware constraints, extracting personalization signals from noisy visual histories, and supporting real-time QA under tight latency. We present early work toward this goal: Pensieve, a memory-based QA system that improves accuracy by 11% over state-of-the-art multimodal RAG baselines, and VisualLens, which infers user interests from casual photos and outperforms leading recommenders by 5–10%.
Xin Luna Dong is a Principal Scientist at Meta Wearables AI, leading agentic-AI efforts for trustworthy, personalized assistants on wearable devices. Previously she advanced knowledge-graph technology including the Amazon Product Graph and the Google Knowledge Graph. An ACM and IEEE Fellow, she received the VLDB Women in Database Research Award and the VLDB Early Career Research Contribution Award, and co-authored the books “Big Data Integration” and “Machine Knowledge.”
Haixun Wang · “Document AI: How ready is it for mission-critical tasks?”
Large language models can now read dense contracts, physician notes, and photographed invoices with impressive fluency. But in courtrooms, clinics, and underwriting desks, “pretty good” still courts unacceptable risk. This talk frames Document AI as an end-to-end product challenge: reliable performance emerges not from bigger models alone but from principled evaluation and structured design. We show how QA-driven metrics — especially reinforcement learning with automated reward signals — enable scalable self-improvement, and how agentic systems that decompose schema-driven tasks into orchestrated micro-agents offer a more transparent, cost-effective path than monolithic LLMs.
Haixun Wang is an ACM and IEEE Fellow, Editor-in-Chief of the IEEE Data Engineering Bulletin, a VLDB trustee, and affiliate professor at the University of Washington. He is VP of Engineering and Head of AI at EvenUp, building LLM support for legal and medical document understanding. He previously held leadership roles at Instacart, WeWork, Amazon, and Meta, and was a researcher at Microsoft, Google, and IBM. His work has earned the 10-Year ICDE Influential Paper Award (2024) and the ICDM 10-Year Best Paper Award (2013), among others.
Joyce Jiyoung Whang · “Connect Data, Discover Knowledge: Empowering AI with Graph Intelligence”
Graph-structured data underpins advances from recommendation to complex AI reasoning, with Knowledge Graphs representing real-world facts as interconnected entities and relations. Conventional KG representation learning relies heavily on memorizing fixed embeddings, limiting generalizability in dynamic environments. This talk explores a paradigm shift from rote memorization toward deep structural understanding, introducing methods that model intrinsic topological relationships to enable robust inductive inference on entirely unseen entities and relations — and extending the framework to hyper-relational KGs and the generative synthesis of complex facts from scratch.
Joyce Jiyoung Whang is an associate professor in the Department of AI / School of Computing at KAIST, leading the Big Data Intelligence Lab since 2020. She received her Ph.D. from the University of Texas at Austin under Inderjit Dhillon. She serves as Area Chair for ICML, NeurIPS, and ICLR, as Workshop Chair for KDD 2026, and as associate editor for ACM TKDD. Her research focuses on graph machine learning, knowledge-graph representation learning, and graph neural networks.
Xia “Ben” Hu · “The Path to Trustworthy AI Agents”
AI models increasingly assist humans by leveraging external tools, but tool-empowered agent systems can harm the physical world if they make unsafe decisions. This talk introduces representative agent frameworks, illustrating how integrating different tools reduces controllability, then details recent efforts to ensure safety across all four core steps of the agent workflow: user-input scanning, inherent model safety, tool-usage monitoring, and output verification. Together these advances drive a development paradigm that governs the co-evolution of AI systems' safety and performance.
Xia “Ben” Hu is Lead Scientist at Shanghai AI Laboratory; previously he was Full Professor and Director of the Data Science Institute at Rice University. He has published over 200 papers with more than 40,000 citations; his leading works include AutoKeras and the NCF algorithm (cited over 9,000 times). He has won the NSF CAREER Award and the KDD Rising Star Award, with best-paper recognitions at ICML, WWW, WSDM, ICDM, and beyond.
Qi He · “From Big Data to Right Data: Post-Training Agentic AI for Outlook”
Traditional data mining focuses on learning from large-scale data. In the era of agentic AI, the central question shifts from big data to right data: how do we construct the right tasks, environments, and reward signals so models learn useful behaviors through post-training? Using Outlook's email and calendar agents as a case study, this talk shows how the unit of data becomes a trajectory of states, actions, observations, and rewards; how evaluation moves to sandboxed, auditable environments; and how data efficiency hinges on the “learnable frontier” — cases with high pass@k but low pass@1 where a model can sometimes, but not yet reliably, succeed.
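For readers unfamiliar with the pass@k notation in the abstract above, here is a minimal sketch of how a “learnable frontier” could be picked out of sampled agent runs. It is an illustration under stated assumptions, not the speaker's method: it uses the widely used unbiased pass@k estimator (n attempts per task, c of them successful), and the function names, task names, sample counts, and both thresholds are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of which succeeded.

    Probability that at least one of k attempts drawn without replacement
    from the n recorded attempts is a success.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k draws include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def learnable_frontier(results, k=8, min_pass_at_k=0.8, max_pass_at_1=0.3):
    """Return task ids with high pass@k but low pass@1.

    `results` maps task id -> (n attempts, c successes).
    The thresholds are illustrative assumptions, not values from the talk.
    """
    frontier = []
    for task_id, (n, c) in results.items():
        p1 = pass_at_k(n, c, 1)
        pk = pass_at_k(n, c, k)
        if pk >= min_pass_at_k and p1 <= max_pass_at_1:
            frontier.append(task_id)
    return frontier


# Hypothetical tasks: (attempts, successes) per task.
results = {
    "reschedule_meeting": (16, 3),   # succeeds sometimes: frontier candidate
    "draft_reply": (16, 15),         # already reliable: little left to learn
    "merge_calendars": (16, 0),      # never succeeds: no signal to learn from
}

print(learnable_frontier(results))  # ['reschedule_meeting']
```

With these made-up counts, only the task that succeeds in 3 of 16 attempts qualifies: its pass@1 is 0.1875 while its pass@8 is 0.9, the “sometimes, but not yet reliably” regime the abstract describes.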
Qi He is Vice President and Head of Science for Outlook at Microsoft, leading quality, evaluation, post-training, and GenAI applications. With more than 20 years of experience, he previously held senior roles at Amazon, Nextdoor, and LinkedIn. He serves on the ACM CIKM Steering Committee, is PC Chair of the SIGKDD 2025–2026 ADS Track and General Chair of SIGKDD 2027, and is an IEEE Fellow and ACM Distinguished Member with 100+ papers and 10,000+ citations.
Closing Panel · 5:00 – 6:00 PM
For decades, students were told that data science was the career of the future. Today many ask a more unsettling question: is that future already disappearing? As foundation models, AI agents, and autonomous analytics systems rapidly absorb tasks once performed by data scientists, analysts, and researchers, universities and employers face a reckoning.
This panel brings together leading voices from academia and industry for a candid, unfiltered discussion on one of the most consequential questions facing our field: not how AI will change data science, but whether data science as we know it will survive the age of AI.