Hangingtogether.org

Getting identifiers created for legacy names

2015-10-30

That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by John Riemer of UCLA and Jennifer Baxmeyer of Princeton. A sizable quantity of legacy names are represented only by text strings in bibliographic records; authority records are created only by institutions involved in the Program for Cooperative Cataloging’s Name Authority Cooperative Program (NACO) or in national library programs. Even then, authority records are created only selectively, for certain headings or sometimes only when references are involved. The LC/NACO name authority file contains only 30% of the total names reflected in WorldCat’s bibliographic record access points (9 million LC/NACO records compared to the 30 million total names reported on the WorldCat Identities project page as of 2012).

The library community has become aware of the importance of getting persistent identifiers created for all names. These identifiers are crucial for the transition to linked data. (This overlapped with the previous topic, persistent identifiers for local collections, which was meant to focus on special collections materials, persons and bodies associated with the local institution that were unlikely to exist elsewhere. “If the local institution doesn’t create identifiers for these local resources, who else will?”) Personal name text strings in bibliographic records often reflect commonly held resources, so the work of creating the needed identifiers can be shared in more ways, such as through a library cooperative, collaboration among multiple libraries, or authority vendors.

Focus group members were split on the importance of creating persistent identifiers for legacy names relative to “newly-encountered” names (which may belong to people who have published their first works or who are referenced in materials that have just been digitized). Generally, names that are part of current workflows are important regardless of whether they are old or “new”.

Authority records contributed to the LC/NACO authority file or other national authority files become part of the Virtual International Authority File (VIAF), available as linked data. Everyone acknowledged they didn’t have the resources to create authority records for everyone – authority work is not sustainable at scale. A few admitted they were “treading water.” ORCIDs (Open Researcher and Contributor IDs) are becoming more prevalent among faculty, and are included in research information management systems such as Symplectic Elements, but they do not overlap much with the names in VIAF. The Remixing Archival Metadata Project (RAMP) generates authority records for creators of archival collections using the EAC-CPF (Encoded Archival Context—Corporate Bodies, Persons and Families) format, which can then be enhanced with additional data from sources like VIAF and WorldCat Identities.

Some of the issues raised:

We should not conflate authority work with assigning identifiers. Differentiating entities, followed by assigning an identifier and providing a unique text string (as is currently required), can be viewed as compatible with authority work and as constituting its first phase.

Our focus should be on identifying a person as distinct from another. We need tools and techniques to make this easier. Matt Carruthers at the University of Michigan has developed a “LCNAF named entity reconciliation” tool to automatically search VIAF for matches to personal and corporate names, look for a Library of Congress source authority record in the matching VIAF cluster, and extract the authorized heading. The resulting dataset pairs the corresponding authorized LCNAF heading with the original name heading, along with a link to the authority record on id.loc.gov.
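The reconciliation workflow described above (search for a matching cluster, look for a Library of Congress source record in it, extract the authorized heading and link) can be sketched roughly as follows. This is a minimal illustration, not the University of Michigan tool itself: the `Cluster` and `SourceHeading` shapes, the in-memory `search` function, and the sample names and identifiers are all hypothetical stand-ins for a real VIAF lookup, and the id.loc.gov link pattern is assumed from the usual form of LC name authority URIs.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Assumed URI pattern for LC name authority records.
LC_NAMES_BASE = "https://id.loc.gov/authorities/names/"


@dataclass(frozen=True)
class SourceHeading:
    source: str      # contributing authority file, e.g. "LC" or "DNB"
    record_id: str   # identifier within that source (hypothetical values below)
    heading: str     # authorized form of the name in that source


@dataclass(frozen=True)
class Cluster:
    cluster_id: str  # hypothetical cluster identifier
    headings: tuple  # tuple of SourceHeading


def normalize(name: str) -> str:
    """Fold case, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    cleaned = "".join(
        c if c.isalnum() else " "
        for c in decomposed
        if not unicodedata.combining(c)
    )
    return " ".join(cleaned.casefold().split())


def make_search(clusters: Iterable[Cluster]) -> Callable[[str], list]:
    """Build an in-memory search that stands in for a remote cluster lookup."""
    clusters = list(clusters)

    def search(name: str) -> list:
        key = normalize(name)
        return [
            c for c in clusters
            if any(normalize(h.heading) == key for h in c.headings)
        ]

    return search


def reconcile(name: str, search: Callable[[str], list]) -> Optional[dict]:
    """Pair an original heading with the LC authorized heading, if one is found."""
    for cluster in search(name):
        for h in cluster.headings:
            if h.source == "LC":
                return {
                    "original": name,
                    "authorized": h.heading,
                    "cluster": cluster.cluster_id,
                    "link": LC_NAMES_BASE + h.record_id,
                }
    return None  # no cluster matched, or no LC record in the matching clusters


# Made-up sample data, for illustration only.
SAMPLE = [
    Cluster("cluster-1", (
        SourceHeading("LC", "n00000001", "Example, Alice, 1900-1980"),
        SourceHeading("DNB", "000000001", "Example, Alice"),
    )),
    Cluster("cluster-2", (
        SourceHeading("DNB", "000000002", "Sample, Bob"),
    )),
]

search = make_search(SAMPLE)
print(reconcile("example, alice 1900-1980", search))
print(reconcile("Sample, Bob", search))
```

Exact matching on a normalized string is the weakest part of this sketch; it only shows where a more careful matching step would plug in.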

Disambiguating names is the most labor-intensive part of authority work. Minting identifiers with the idea of merging or splitting them later will not avoid the work to determine whether they represent the same entity or not. We need to remember that people come to libraries not with identifiers, but with names expressed in different languages. We need to have each name associated with an identifier that’s flexible and includes other differentiating information about the entity.

Authority work and algorithms based on text string matching have limits; we will still need manual or expert review. We may have an opportunity to tap the expertise in our user communities to verify whether two identifiers represent the same person or not.

We should link to other personal name sources rather than duplicating work done elsewhere. We compared the entry for the comedian Dame Edna (Barry Humphries) in the National Library of Australia’s Trove service’s People database, which merges information from AusStage, Design and Art Australia Online and Libraries Australia, with the one in WorldCat Identities, mined from the over 350 million bibliographic records in WorldCat. Both the Social Networks and Archival Context (SNAC) database and Wikidata were suggested as linked data sources that could be used for name identification and disambiguation. (The image accompanying this post is from SNAC’s “featured identities” page.)

We are already encountering multiple identifiers for the same legacy name. For example, George Washington University’s Linked Catalog Prototype includes identifiers from both loc.gov (Library of Congress) and ISNI (International Standard Name Identifier) for Teru Miyamoto (宮本輝).

Given the different name identifier systems already in use, we need a third-party name reconciliation service. Several focus group members are collaborating with OCLC on a Person Entity Lookup pilot to link related sets of person identifiers and authorities.
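To make the idea of linking related sets of person identifiers concrete, here is a minimal sketch of a crosswalk that groups identifiers once someone has asserted that two of them refer to the same person. It is not a description of the Person Entity Lookup pilot; the class name, the method names, and the prefixed identifier strings are hypothetical, and the grouping technique used is a plain union-find (disjoint-set) structure.

```python
class IdentifierCrosswalk:
    """Groups identifiers that have been asserted to denote the same person."""

    def __init__(self):
        self._parent = {}  # identifier -> parent identifier in its group tree

    def _find(self, ident):
        """Return the representative identifier of ident's group."""
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[ident] != root:
            next_ident = self._parent[ident]
            self._parent[ident] = root
            ident = next_ident
        return root

    def assert_same(self, a, b):
        """Record a (reviewed) judgment that a and b identify the same person."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def equivalents(self, ident):
        """Return all identifiers currently linked to ident, itself included."""
        root = self._find(ident)
        return sorted(i for i in list(self._parent) if self._find(i) == root)


# Made-up identifiers, for illustration only.
crosswalk = IdentifierCrosswalk()
crosswalk.assert_same("lccn:n00000001", "viaf:000001")
crosswalk.assert_same("viaf:000001", "isni:0000000000000001")
crosswalk.assert_same("lccn:n00000002", "viaf:000002")

print(crosswalk.equivalents("isni:0000000000000001"))
```

Merging groups is cheap in this structure, but splitting a group after a mistaken assertion is not. That reflects the point made above: the work of deciding whether two identifiers represent the same entity cannot be deferred away.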

Some focus group members wished that vendors could be more involved, and that enhancements they make to authority records are shared on the network level. Vendors should compete on what they do with the data, not the data itself.

We discussed the potential of a gamification approach to recruit the community to establish relationships among entities. Wikidata recently released a distributed game, Source Metadata, so people can verify whether the authors of scientific articles are the same or not (you have to log in to play). Universities might take advantage of graduate student energy to disambiguate names on scholarly publications this way.

Some of the issues discussed are reflected in the OCLC Research report, Registering Researchers in Authority Files, published last year.

About Karen Smith-Yoshimura

Karen Smith-Yoshimura, program officer, works on topics related to renovating descriptive and organizing practices with a focus on large research libraries and area studies requirements.
