CfP: Workshop “New Approaches for Extracting Heterogeneous Reference Data”
Call for Papers
Extracting heterogeneous references from texts, in particular from historical documents and from humanities or legal scholarship, is an unresolved problem. Yet there is currently no coordinated effort to develop solutions. We therefore invite scholars and practitioners from the social sciences, the humanities, and the informational and computational disciplines to join us in a workshop in which we want to define the problem(s), establish the state of the art, and share resources. The overarching aim of the event is to find ways to jointly develop new tools and workflows that can unlock previously untapped reference/citation data in the humanities, law, and the social sciences. A particular focus lies on newly emerging technologies based on (pre-trained) language models.
Problem Outline
With masses of historical and contemporary digitized texts becoming available, the task of computationally extracting scholarly references is coming into focus for many research disciplines. The core challenge of reference data extraction is to identify and extract messy, fragmentary bibliographic information, encoded in a multitude of ways, from a great deal of noise. While there are well-established reference extraction tools, most of them have been developed for and evaluated on a particular genre of literature only: English-language texts that list their references in a bibliography at the end, following a somewhat consistent citation style. These tools may thus work well enough for some typical use cases. However, they are hardly fit for extracting literature references from texts that organise their references differently, as is common in the humanities, law, and parts of the social sciences:1
- Citation Style: Complete reference information is often replaced by referencing terms like “ibid.”, “op. cit.” or other forms of abbreviation, and citation practice is rarely fully consistent. Historical documents frequently refer to their sources with just an abbreviation of the author’s name, sometimes an incipit of the passage, or other canonical ways of referencing “classic works”.
- Dispersed References: References are found in footnotes as well as in the main text. In many cases, a reference stretches over one or several sentences, with the author name and the title of the referenced work interspersed with commentary or criticism. Sometimes several references are listed in sequence without a standard separator between one reference and the next.
- Language: Finally, the processing of texts in languages other than English (or in historical variants/stages of otherwise well-resourced languages like English, German, Spanish, Italian) has not been tested extensively and is likely to suffer from a lack of training material.
These features of the target documents (plus other ambiguities, inconsistencies, and OCR noise) derail document segmentation, reference recognition, and reference parsing algorithms that expect homogeneous textual data: neatly separated, mostly English-language references with self-contained, consistently styled bibliographic information.
In principle, however, the current state of technological development may be able to cope with this: With language processing revolutionized by the transformer architecture (2017) and pre-trained language models (since 2018), and with substantial recent investment in building very large language models,2 these tools seem to have acquired some abstract “understanding” of texts and some capacity to perform tasks they have not explicitly been trained for (“transfer learning”).3 In fact, an ad hoc experiment we did with the OpenAI API, a commercial tool based on one of the aforementioned large language models, shows stunning performance when prompted to extract and segment references from a footnote of a German scholarly work.4
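To make this concrete, the following is a minimal sketch of how such a prompt-based extraction could be scripted with the OpenAI Python library. The model name, the prompt wording, and the example footnote are illustrative assumptions, not a reproduction of our experiment.

```python
# Minimal sketch of prompt-based reference extraction via the OpenAI API.
# Assumptions: the model choice, prompt wording, and the example footnote
# are illustrative only; they do not reproduce our actual experiment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# An invented German footnote mixing commentary with several references.
footnote = (
    "Vgl. dazu grundlegend Müller, Geschichte des Zivilprozesses, 2. Aufl. 1912, "
    "S. 44 ff.; anders noch Schmidt, in: Festschrift für Wach, 1903, S. 201."
)

prompt = (
    "Extract every bibliographic reference from the following footnote. "
    "Return one reference per line in the form: author | title | year | pages.\n\n"
    f"Footnote: {footnote}\n\nReferences:"
)

response = openai.Completion.create(
    model="text-davinci-003",  # assumed model; any instruction-tuned model may do
    prompt=prompt,
    max_tokens=256,
    temperature=0,  # deterministic output is preferable for extraction tasks
)

print(response["choices"][0]["text"].strip())
```

In practice, token limits and per-request processing speed quickly become relevant once such an approach is scaled beyond individual footnotes to entire documents (see below).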
However, the established tools for reference extraction do not make use of this latest available language technology. Workflows using large language models have not been developed yet, and, as is the case with our ad hoc experiment, it is unclear whether and for how long these models will remain free and open for research purposes. Moreover, the (current) APIs have token limits and processing speeds that hinder their use in real-world scenarios. Finally, their fitness for historical and less-resourced languages remains to be assessed.
Thus, in order to develop suitable tools and workflows, there is a current need to chart available models and APIs with their respective strengths and limitations, to assess and secure the necessary computing power, and to understand what kind of fine-tuning is necessary (and what resources this in turn requires).
Call for Participation
For the hybrid workshop on 15/16 May 2023, we invite contributions on the following topics/questions:
Problem/task definition
- What are use cases and current projects, in which the issue of extracting heterogeneous reference data arises?
- Can we define different (sub-)tasks, like parsing of references vs. just finding them vs. identifying what they refer to (linking)? In the main text vs. in footnotes? In which languages?
State of the art
- What is the performance of up-to-date approaches (be it OpenAI, other large language models, or approaches based on other technology like AnyStyle or GROBID)?
- What are their respective limitations, and how do we go about evaluating them?
- How can the newly emerging technologies be leveraged to improve upon existing solutions?
Pooling of tools and resources
- What training corpora are available? Which are needed?
- What format of training data is best (annotation schema, file format, convertibility etc.)?
- What workflows/toolchains are best suited to process the available data?
Future work
- What infrastructure is required/available for continuing work in this regard (repositories, communication channels, computing platforms etc.)?
- Should we establish a Shared Task? Write a whitepaper? Hold a hackathon?
Please also consider contributing if:
- you have data that would be suitable as training data
- your previous or future work contributes to related tasks like preparation of datasets (incl. normalization/tokenization etc.), linking of references, classification of citation contexts, long-term storage and reuse of generated citation data etc. (See annex, section “Related tasks”.)
As a submission, we expect a short abstract of your presentation that also references previous work in the subject area (papers and code, if applicable). Presentations at the workshop should not exceed 20 minutes each.
While the workshop will be organized in a hybrid format, we hope to be able to meet you in person. Participation is free of charge. Travel and accommodation cannot be reimbursed.
Please send your submission to wagner@lhlt.mpg.de. The deadline for submissions is 28 February 2023. All submitters will be notified of their status by 13 March 2023.