New approaches to heterogeneous reference data mining
With the mass availability of historical and contemporary digitized texts, the computer-assisted extraction of scholarly references has become increasingly important in many research disciplines. The key challenge is to identify and extract cluttered and fragmented bibliographic information, encoded in multiple ways, from a mass of noise. The correct extraction of such heterogeneous references from texts, especially historical documents and works in the humanities or law, remains an unsolved problem.
For this reason, Andreas Wagner, Digital Humanities specialist in the Institute Library, and Christian Boulanger, researcher in the Department of Marietta Auer, invited scholars and practitioners from the social sciences, humanities, and computational disciplines to a workshop dedicated to this challenge. Co-organizers were Robin Haunschild from the Max Planck Institute for Solid State Research and Malte Vogl from the Max Planck Institute for the History of Science. The goal of the workshop was to define the key issues and identify the state of the art. Most importantly, it aimed to launch a community capable of sharing resources and advancing the state of the art in the midst of the unfolding "AI revolution."
The workshop included two interactive sessions to gather ideas on how to move the field forward. One tool suggested by the organizers is a white paper that can provide an overview of the current state of the field and potential pathways for it. At the end of the first day, participants gave their opinions on what topics could be covered in such a white paper. The second day ended with a brainstorming session on what other forms of collaborative research and development might look like. The event will certainly not be the last of its kind, not least because Large Language Models are currently changing the technological landscape and will help unlock previously untapped reference/citation data in the humanities, law, and social sciences.