A weblog following developments around the world in FRBR: Functional Requirements for Bibliographic Records.

Maintained by William Denton, Web Librarian at York University. Suggestions and comments welcome at wtd@pobox.com.


Confused? Try What Is FRBR? (2.8 MB PDF) by Barbara Tillett, or Jenn Riley's introduction. For more, see the basic reading list.

Books: FRBR: A Guide for the Perplexed by Robert Maxwell (ISBN 9780838909508) and Understanding FRBR: What It Is and How It Will Affect Our Retrieval Tools edited by Arlene Taylor (ISBN 9781591585091) (read my chapter FRBR and the History of Cataloging).

Superduping: slow introduction

Posted by: William Denton, 18 April 2007 7:30 am
Categories: Implementations, LibraryThing, OCLC

My superduping experiments were interesting in a few different ways, and I’m still trying some things out and hacking my scripts. I’ll give a few examples over the next few days.

First, a brief introduction. Let’s consider Ross Thomas’s novel The Seersucker Whipsaw. My item of this work is an exemplar of the 1985 Perennial Library paperback manifestation, which is an embodiment of the author’s final edited text. The ISBN is 0060807288.

If we query thingISBN for 0060807288, we get 3 ISBNs back:

0060807288
0060808497
0446401692

And if we query xISBN for 0060807288, we also get 3 ISBNs back:

0060807288
0060808497
0446401692

The two result sets are identical. Nothing further need be done. That was simple, eh? As far as we can tell, this work has had only three manifestations.

In fact that’s false: these are all paperbacks, respectively from 1985, 1987, and 1992. The first edition was published by Morrow in 1967. Why isn’t it included in the results? It doesn’t have an ISBN! It was published too early to have one. That first ISBNless manifestation is out of luck and won’t show up in any xISBN or thingISBN results.

“That isn’t fair,” I hear you cry. It isn’t. Books that predate International Standard Book Numbers get the cold shoulder from xISBN and thingISBN, which, as you may have noticed from their names, are about ISBNs. “How do we get around that?” I hear you ask. Every work, expression, manifestation, and item will need to have a unique identifier. If one exists (like an ISBN for a manifestation), we can use it. If none exists, we’ll have to make one up and have everyone agree on it. (Or make up several and map them from one to the other.)
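Here’s a rough sketch in Python of the “use one if it exists, otherwise make one up” idea. Everything in it is hypothetical: the function name manifestation_id, the "isbn:" and "local:" prefixes, and the simple counter are made up for illustration, and a locally minted identifier is only useful to others once everyone agrees on it.

```python
from itertools import count

# Hypothetical local registry: description -> identifier we made up.
_counter = count(1)
_minted = {}

def manifestation_id(description, isbn=None):
    """Return an identifier for a manifestation.

    Use the ISBN when one exists; otherwise mint a local identifier
    and remember it, so the same description always maps to the same id.
    """
    if isbn:
        return "isbn:" + isbn
    if description not in _minted:
        _minted[description] = "local:%04d" % next(_counter)
    return _minted[description]

# The 1985 paperback has an ISBN; the 1967 Morrow first edition does not.
paperback = manifestation_id("Perennial Library paperback, 1985", isbn="0060807288")
first_edition = manifestation_id("Morrow first edition, 1967")
print(paperback)
print(first_edition)
```

Mapping between several made-up schemes would then be a matter of keeping a table from one identifier to another.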

For the second example, let’s use another Ross Thomas novel, The Fools In Town Are On Our Side. (The title is from The Adventures of Huckleberry Finn: “Hain’t we got all the fools in town on our side? And ain’t that a big enough majority in any town?”) My item is an exemplar of the 2003 St. Martin’s trade paperback reprint, ISBN 0312315821. The first manifestation was published in 1970. It’s one of his best novels and has been reprinted more than The Seersucker Whipsaw.

If we query thingISBN for 0312315821, we get 4 ISBNs back:

0312315821
0380006871
0445405600
0445408677

And if we query xISBN for 0312315821, we get 8 ISBNs back:

0312315821
0340127376
0380006871
0417052502
0445405600
0445405619
0445408677
3548014402

The 4 thingISBN numbers appear in xISBN’s result set. In set theory lingo, one might say that xISBN’s results are a proper superset of thingISBN’s.
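The comparison is easy to check with Python’s built-in sets. This is only a sketch: the ISBNs are copied from the two result lists above, and the variable names are mine.

```python
# Result sets for ISBN 0312315821, copied from the lists above.
thing_isbns = {"0312315821", "0380006871", "0445405600", "0445408677"}
x_isbns = {
    "0312315821", "0340127376", "0380006871", "0417052502",
    "0445405600", "0445405619", "0445408677", "3548014402",
}

# ">" on sets tests for a proper superset: every thingISBN number is
# in xISBN's results, and xISBN has at least one more.
is_proper_superset = x_isbns > thing_isbns

# ISBNs that only xISBN knows about.
only_in_xisbn = sorted(x_isbns - thing_isbns)

# Merging both lists and dropping duplicates is a set union.
combined = sorted(x_isbns | thing_isbns)

print(is_proper_superset)
print(only_in_xisbn)
print(len(combined))
```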

If we combine and dedupe the results, we’ll get 8 ISBNs, all the ones from xISBN. What would superduping give us? Might it find more?

As it turns out, no. Here’s what happens:

First run through ISBNs in thingISBN result set
0312315821 is in xISBN's result set
0380006871 is in xISBN's result set
0445405600 is in xISBN's result set
0445408677 is in xISBN's result set
Now run through the ISBNs in xISBN's result set
The above four have been examined already; don't look at them again
0340127376 is unknown at thingISBN
0417052502 is unknown at thingISBN
0445405619 is unknown at thingISBN
3548014402 is unknown at thingISBN

thingISBN can’t give us any leads on new ISBNs. Four ISBNs were known to both places; their clusters sort of lined up. We knew four other ISBNs, from xISBN, and threw them at thingISBN, but we didn’t turn up any previously undiscovered manifestations.
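The run above can be sketched in Python as a loop that keeps querying until no new ISBNs turn up. This is one reading of the trace, not necessarily how my scripts do it. The two stub functions stand in for the real web services and are assumptions: they pretend that any ISBN in a cluster returns the whole cluster and that an unknown ISBN returns nothing, which the post only confirms for 0312315821 and for the four ISBNs unknown at thingISBN.

```python
from collections import deque

THING_CLUSTER = ["0312315821", "0380006871", "0445405600", "0445408677"]
X_CLUSTER = ["0312315821", "0340127376", "0380006871", "0417052502",
             "0445405600", "0445405619", "0445408677", "3548014402"]

def stub_thingisbn(isbn):
    # Stand-in for a thingISBN query (assumed behaviour, see above).
    return list(THING_CLUSTER) if isbn in THING_CLUSTER else []

def stub_xisbn(isbn):
    # Stand-in for an xISBN query (assumed behaviour, see above).
    return list(X_CLUSTER) if isbn in X_CLUSTER else []

def superdupe(seed, lookups):
    """Follow every ISBN through every service until nothing new appears."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        isbn = queue.popleft()
        for lookup in lookups:
            for related in lookup(isbn):
                if related not in seen:
                    # A lead we have not examined yet: queue it.
                    seen.add(related)
                    queue.append(related)
    return sorted(seen)

result = superdupe("0312315821", [stub_thingisbn, stub_xisbn])
print(len(result))
for isbn in result:
    print(isbn)
```

Because each ISBN is queued at most once, the loop stops once the services run out of new leads. With real services that return different clusters for different ISBNs, the set could keep growing, so a cap on its size would be a sensible addition.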

So in this case, combining and deduping gives the same results as superduping. “That’s boring,” I hear you say. Next time I’ll give examples of where superduping breaks apart clusters and gives more complete results. And I’ll show examples of how this can fly out of control and go haywire.