Nick Thieberger

Amanda Harris
Director, PARADISEC Sydney Unit

Being home to over a quarter of the world’s languages, the Pacific is a particularly good place to focus on how language records can be made accessible. The creation and description of research records has not always been a priority for humanities academics and any records that are created have typically not been provided with good archival solutions. This is despite these records often being of cultural or historical relevance beyond academia. Many cultural agencies struggle to keep track of recordings they have made, and it is the same for many researchers. Often it is only when researchers prepare recordings for archiving that they realize how many (or few) are described adequately, or have been transcribed or translated.
Many academic researchers at the end of their careers despair at the task of making sense of a lifetime’s output of papers, notes, images, and recordings. Our project, the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), a collaboration between the University of Sydney, University of Melbourne, and the Australian National University, began in 2002 by digitizing analog tape collections and providing sufficient metadata (contextual information) to make them discoverable. These tapes belonged to retired or deceased researchers and would otherwise have been stored in houses or libraries where they would be difficult to find and even more difficult to access.
In the past 18 years we have added 14,000 hours of audio in a 125-terabyte collection representing 1,280 languages. It is a significant collection that has been entered into the UNESCO Memory of the World register for ICH.1

Ambong Thompson and Asal Lazare of the Vanuatu Kaljoral Senta, receiving tapes digitized by PARADISEC in 2016. © Nick Thieberger

Endangered Records of Endangered Languages
Imagine you speak one of the 130 languages of Vanuatu, and you remember a strange visitor to your village forty years ago who recorded your family talking and singing. You want to find those recordings because you know of no other old recordings in your language. First, you have to find out who that stranger was, and try to contact them. If they published something about their time in your village then a web search may turn up their name. If they did not then the search is harder. And, if you do find them, are they able to find the tape they made? If they used a tape recorder, how can tape now be played in the village?
A group of linguists and musicologists in Australia engaged with this problem in 2002, seeing that many analog tapes had been recorded in countries around the Pacific region and those tapes were now almost all orphaned, sitting in offices and homes, not accessible to the people whose voices were in the recordings or to their families. The result was PARADISEC, a research repository that acts as a conduit for research outputs to a range of audiences, within and outside of academia. The focus is on recordings and transcripts in the many small languages of the world, and on songs and stories that are unique cultural expressions.
The research data is typically oral tradition from places where little else has been recorded, and has huge value beyond academic research. This is the basic data for research, but it is also cultural material that has value to the people recorded and their descendants, and so we, as outsider researchers, have special responsibilities to treat the materials with respect and to ensure they are accessible to the people we have worked with.
There are some seven thousand languages in the world, and few records exist for most of them. High-quality records are often made by linguists, musicologists, and anthropologists who have spent time studying performance. But, without a digital repository to store these unique records, they are at risk of being lost. The PARADISEC project does not ‘save languages’ and does not ‘save musics’. We are saving records of performance that serve to reflect the diversity of language and performance that exists in the world. These records give presence to voices that are usually marginal and excluded from the internet.
Until the 1990s, many of the unique recordings in PARADISEC were made on analog tape. Analog tape, on reels or cassettes, has a major problem in that it is likely to become unplayable within the next few years.2 The lack of playback equipment for these tapes is one factor in their inaccessibility, in particular for open reels but increasingly also for cassettes.
More critically, the tapes themselves will begin falling apart, having reached the end of their lives.
Paradoxically, while analog tape is fragile, we know that digital records can be even more fragile, yet digitization is currently the recommended means of preservation of analog audio.3 We have probably all had the experience of being unable to open digital files made even ten years ago due to changes in formats and software. A partial solution is to ensure that all files are converted to a format we know has more chance of surviving, and here we follow established standards. Thus, we archive .wav, .txt, .xml, and .tif files but also store lower-resolution copies in .mp3, .pdf, or .jpg format for delivery. Additionally, we make daily backup copies in different physical locations. In 2019 we received the World Data System data seal, signifying we conform to all required standards. In 2013 PARADISEC’s collections were inscribed in the UNESCO Australian Memory of the World heritage list.

A mobile phone displaying a catalog of locally relevant records, streamed on a local Wi-Fi network from a Raspberry Pi computer at Erakor village in Vanuatu. © Nick Thieberger

Planning for Project Ending
PARADISEC has received many accolades, but it exists in a research infrastructure environment—both at universities and at a national level—that does not provide for long-term curation of research outputs. We have pioneered a system for describing and curating particular kinds of research data, and know that our methods can be extended to other types of data.
Our database controls metadata entry, and provides checking of incoming materials for conformance to our standards (e.g., file naming, metadata terms, file types, unique identifiers, assignment of digital object identifiers). A complete description of each item is written into a file in the directory that holds the data, so it is a self-describing collection, independent of cataloging software, and the catalog of the entire collection can be reconstituted from the collection itself. Each item stores collection-level and item-level metadata that is updated every time the catalog entry is edited and saved.
We are able to create arbitrary subsets of the collection from these self-describing directories and not lose the metadata.
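The idea of a self-describing collection can be illustrated with a minimal Python sketch. It is not PARADISEC's actual code: the metadata file name (`item-metadata.json`), the JSON layout, and the field names (`item_id`, `languages`) are all assumptions made for the example. The point is only that a catalog, or any subset of it, can be rebuilt by reading the description stored alongside each item.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical name for the per-item description file; the real archive
# may use a different name and format.
METADATA_FILENAME = "item-metadata.json"


def make_demo_collection(root: Path) -> None:
    """Create two item directories, each holding its own description."""
    items = [
        ("AB1/001", {"item_id": "AB1-001", "languages": ["Example language A"]}),
        ("AB1/002", {"item_id": "AB1-002", "languages": ["Example language B"]}),
    ]
    for relative_dir, metadata in items:
        item_dir = root / relative_dir
        item_dir.mkdir(parents=True)
        (item_dir / METADATA_FILENAME).write_text(
            json.dumps(metadata), encoding="utf-8"
        )


def rebuild_catalog(root: Path) -> list:
    """Reconstitute the catalog purely from the files in the collection."""
    catalog = []
    for metadata_path in sorted(root.rglob(METADATA_FILENAME)):
        record = json.loads(metadata_path.read_text(encoding="utf-8"))
        # Remember where the item lives, relative to the collection root.
        record["path"] = metadata_path.parent.relative_to(root).as_posix()
        catalog.append(record)
    return catalog


def select_subset(catalog: list, language: str) -> list:
    """Pick out the items for one language; their metadata travels with them."""
    return [r for r in catalog if language in r.get("languages", [])]


if __name__ == "__main__":
    demo_root = Path(tempfile.mkdtemp())
    make_demo_collection(demo_root)
    for entry in rebuild_catalog(demo_root):
        print(entry["item_id"], entry["path"])
```

Because nothing here depends on a database, the same function works on the full collection or on a copy of a few directories placed on a hard disk.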
We routinely make files available to cultural centers or museums in the Pacific. Simply putting files on a hard disk does make them available, but without a catalog they are difficult, if not impossible, to navigate. To address this, we have written an app that looks into the hard disk and writes an HTML catalog of just those files to create a local viewer.4 The same services that we provide in the online catalog (media players, image viewers, and so on) are also available in this local viewer. We can write these bundles of sub-collections and a catalog to Raspberry Pi units that serve a small Wi-Fi network, allowing access on a mobile device.
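As a rough illustration of this approach (not the app itself, whose internals are not described here), the following Python sketch walks a directory and writes a single HTML page listing its files, with a media player for audio and video and an inline viewer for images. The function names, the output file name `index.html`, and the choice of file extensions are assumptions for the example.

```python
import html
import tempfile
from pathlib import Path
from urllib.parse import quote

# Assumed mapping from delivery formats to HTML media elements.
MEDIA_TAGS = {".mp3": "audio", ".wav": "audio", ".mp4": "video"}
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def render_entry(relative_path: str) -> str:
    """Return one list item with a player, viewer, or plain link."""
    label = html.escape(relative_path)
    url = quote(relative_path)  # percent-encode spaces and similar characters
    suffix = Path(relative_path).suffix.lower()
    if suffix in MEDIA_TAGS:
        tag = MEDIA_TAGS[suffix]
        return f'<li>{label}<br><{tag} controls src="{url}"></{tag}></li>'
    if suffix in IMAGE_SUFFIXES:
        return f'<li>{label}<br><img src="{url}" alt="{label}"></li>'
    return f'<li><a href="{url}">{label}</a></li>'


def write_catalog(root: Path, output_name: str = "index.html") -> Path:
    """Write an HTML catalog of just the files found under root."""
    entries = [
        render_entry(path.relative_to(root).as_posix())
        for path in sorted(root.rglob("*"))
        if path.is_file() and path.name != output_name
    ]
    page = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Local catalog</title></head>\n'
        "<body><h1>Local catalog</h1>\n<ul>\n"
        + "\n".join(entries)
        + "\n</ul></body></html>\n"
    )
    output_path = root / output_name
    output_path.write_text(page, encoding="utf-8")
    return output_path


if __name__ == "__main__":
    demo_root = Path(tempfile.mkdtemp())
    (demo_root / "story.mp3").write_bytes(b"")
    print(write_catalog(demo_root))
```

The resulting page uses only relative links, so it can be opened straight from the disk or served by any small web server, such as one running on a Raspberry Pi.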
The archive’s provision of fixed, reproducible, citable forms of research data makes it a locus of activity, a place from which research materials can be re-used in novel ways, with new knowledge reflected back into the collection over time.
Far from being the endpoint for research, the archive reinserts these materials into an ongoing and dialogic relationship with the people recorded and with future researchers. Without the archival effort, these materials would remain inaccessible once the project that created them ended.
The system automates most of the processes of file ingestion, quality assurance, user management, and access for collections of research materials, especially media recordings, transcripts and material associated with linguistic or musicological fieldwork. We provide advice on our web page about data management and file naming, and we run regular training sessions to encourage thinking about archiving from the beginning of fieldwork, and the use of appropriate tools whose output can be archived and is not locked into proprietary formats.
Once files are in the collection, we assign digital object identifiers (DOIs) and our system enforces access conditions.
Each registered user of the catalog accepts a set of conditions,5 and each depositor specifies how their materials can be used.
For items listed as “open,” a registered user can download the file. Even if an item is given “closed” status, meaning there are restrictions on access, the depositor can assign individual rights to other registered users for that item. We also allow for “private” status as a collection is being built, closing even the metadata from public view; no DOI is assigned until that private status is ended.
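These access rules can be summarized in a short Python sketch. It is an illustration only, not the catalog's real implementation: the class and function names are invented, and two points are assumptions rather than statements from the text above, namely that a depositor always retains access to their own items and that the metadata of a private item is visible only to its depositor.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class Status(Enum):
    OPEN = "open"
    CLOSED = "closed"
    PRIVATE = "private"


@dataclass
class Item:
    identifier: str
    depositor: str
    status: Status
    # Registered users to whom the depositor has granted individual rights.
    granted_users: Set[str] = field(default_factory=set)


def can_view_metadata(item: Item, user: Optional[str]) -> bool:
    """Metadata is public unless the item is still private."""
    if item.status is not Status.PRIVATE:
        return True
    return user == item.depositor  # assumption: depositor can still see it


def can_download(item: Item, user: Optional[str]) -> bool:
    """Only registered users download; closed items need an explicit grant."""
    if user is None:  # not a registered user
        return False
    if user == item.depositor:  # assumption: depositors keep access
        return True
    if item.status is Status.OPEN:
        return True
    if item.status is Status.CLOSED:
        return user in item.granted_users
    return False  # private items are not available to other users


if __name__ == "__main__":
    recording = Item("AB1-001", "depositor", Status.CLOSED, {"family_member"})
    print(can_download(recording, "family_member"))
    print(can_download(recording, "other_user"))
```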
As our current system is now aging, we are moving to use the Oxford Common File Layout (OCFL) to store the files and Research Object Crate (RO-Crate) to describe them. This provides the same kind of description we had in our metadata files but now in a standards-compliant format. Both OCFL and RO-Crate express their metadata in JSON, which is a commonly used technology and so should be more robust for our next phase of development than the current Ruby on Rails system. We have built a demonstrator using these standards that indicates it is a viable and fruitful direction for our collection.
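To give a sense of what such a description looks like, here is a minimal Python sketch that builds the skeleton of an RO-Crate metadata document, following the general shape of the RO-Crate 1.1 specification (a metadata descriptor, a root dataset, and one entry per file). The function name and the example values are hypothetical, and this is not the demonstrator's code; a complete crate would carry further properties, such as a license, and the OCFL storage layout is not shown.

```python
import json


def build_ro_crate(name: str, description: str, date_published: str, files: list) -> dict:
    """Return a minimal RO-Crate style description as a Python dictionary."""
    graph = [
        {
            # The descriptor identifies this file as the crate's metadata.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root dataset describes the item as a whole.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "hasPart": [{"@id": file_name} for file_name in files],
        },
    ]
    # One entity per file, so each recording or transcript is described.
    graph.extend({"@id": file_name, "@type": "File"} for file_name in files)
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


if __name__ == "__main__":
    crate = build_ro_crate(
        name="Example item",  # hypothetical values throughout
        description="A recording and its transcript.",
        date_published="2020-01-01",
        files=["AB1-001-A.wav", "AB1-001-A.txt"],
    )
    # In a real crate this JSON would be saved as ro-crate-metadata.json
    # in the same directory as the files it describes.
    print(json.dumps(crate, indent=2))
```

As with the earlier per-item metadata files, the description sits beside the data, so the collection stays self-describing while gaining a format that other tools can read.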

1. UNESCO National Committee of Australia, “Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC),” Memory of the World #42.
2. National Film and Sound Archive of Australia (2017). “Deadline 2025: Collections at Risk.”
3. International Association of Sound and Audio-visual Archives (2009). “Guidelines on the Production and Preservation of Digital Audio Objects.”
4. Arc Centre of Excellence for the Dynamics of Language, “Data Loader.”
5. PARADISEC, “Conditions of Access.”