Superduping: slow introduction
My superduping experiments were interesting in a few different ways, and I’m still trying some things out and hacking my scripts. I’ll give a few examples over a few days.
First, a brief introduction. Let’s consider Ross Thomas’s novel The Seersucker Whipsaw. My item of this work is an exemplar of the 1985 Perennial Library paperback manifestation, which is an embodiment of the author’s final edited text. The ISBN is 0060807288.
If we query thingISBN for 0060807288, we get 3 ISBNs back:
0060807288 0060808497 0446401692
And if we query xISBN for 0060807288, we also get 3 ISBNs back:
0060807288 0060808497 0446401692
The two result sets are identical. Nothing further need be done. That was simple, eh? As far as we can tell, this work has had only three manifestations.
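As a minimal sketch of that comparison, here are the two result sets above held as Python sets. The variable names are mine, and the ISBNs are copied in by hand rather than fetched from either service.

```python
# ISBNs returned for 0060807288, copied from the results above.
thing_isbn = {"0060807288", "0060808497", "0446401692"}
x_isbn = {"0060807288", "0060808497", "0446401692"}

# Sets compare equal when they hold exactly the same members,
# regardless of the order in which the ISBNs were listed.
identical = thing_isbn == x_isbn
print(identical)
```

If the sets are equal, neither service knows anything the other doesn't, so there is nothing left to follow up.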
In fact that’s false: these are all paperbacks, respectively from 1985, 1987, and 1992. The first edition was published by Morrow in 1967. Why isn’t it included in the results? It doesn’t have an ISBN! It was published too early to have one. That first ISBNless manifestation is out of luck and won’t show up in any xISBN or thingISBN results.
“That isn’t fair,” I hear you cry. It isn’t. Books that predate International Standard Book Numbers get the cold shoulder from xISBN and thingISBN, which, as you may have noticed from their names, are about ISBNs. “How do we get around that?” I hear you ask. Every work, expression, manifestation, and item will need to have a unique identifier. If one exists (like an ISBN for a manifestation), we can use it. If none exists, we’ll have to make one up and have everyone agree on it. (Or make up several and map them from one to the other.)
For the second example, let’s use another Ross Thomas novel, The Fools In Town Are On Our Side. (The title is from The Adventures of Huckleberry Finn: “Hain’t we got all the fools in town on our side? And ain’t that a big enough majority in any town?”) My item is an exemplar of the 2003 St. Martin’s trade paperback reprint, ISBN 0312315821. The first manifestation was published in 1970. It’s one of his best novels and has been reprinted more than The Seersucker Whipsaw.
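Here is one hypothetical way to picture "use the identifier if one exists, otherwise make one up." The function name, the "isbn:" and "local:" prefixes, and the hashing scheme are all assumptions for illustration. No such scheme has been agreed on by anyone, and a real one would need far more careful rules for describing a manifestation.

```python
import hashlib


def manifestation_id(isbn=None, *, title="", publisher="", year=None):
    """Return an identifier for a manifestation (hypothetical scheme)."""
    if isbn:
        # An ISBN exists, so use it as is.
        return f"isbn:{isbn}"
    # No ISBN: derive a made-up identifier from a normalized description,
    # so that the same description always produces the same identifier.
    key = "|".join(
        [title.strip().lower(), publisher.strip().lower(), str(year or "")]
    )
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
    return f"local:{digest}"


print(manifestation_id("0060807288"))
print(manifestation_id(title="The Seersucker Whipsaw", publisher="Morrow", year=1967))
```

The made-up identifier is only as stable as the description behind it: two people who record the publisher differently would get different identifiers, which is why everyone would have to agree on the rules, or map one identifier to another.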
If we query thingISBN for 0312315821, we get 4 ISBNs back:
0312315821 0380006871 0445405600 0445408677
And if we query xISBN for 0312315821, we get 8 ISBNs back:
0312315821 0340127376 0380006871 0417052502 0445405600 0445405619 0445408677 3548014402
The 4 thingISBN numbers appear in xISBN’s result set. In set theory lingo, one might say that xISBN’s results are a proper superset of thingISBN’s.
If we combine and dedupe the results, we’ll get 8 ISBNs, all the ones from xISBN. What would superduping give us? Might it find more?
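The superset claim and the combine-and-dedupe step can both be checked with plain set operations. This is a small sketch using the two result sets above, copied in by hand; the variable names are mine.

```python
# ISBNs returned for 0312315821, copied from the results above.
thing_isbn = {"0312315821", "0380006871", "0445405600", "0445408677"}
x_isbn = {
    "0312315821", "0340127376", "0380006871", "0417052502",
    "0445405600", "0445405619", "0445408677", "3548014402",
}

# ">" on sets tests for a proper superset: every thingISBN number is in
# xISBN's results, and xISBN has at least one more.
is_proper_superset = x_isbn > thing_isbn

# Combining and deduping is a set union.
combined = sorted(thing_isbn | x_isbn)

print(is_proper_superset)
print(len(combined))
```

Because one set contains the other, the union is simply the larger set.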
As it turns out, no. Here’s what happens:
First run through ISBNs in thingISBN result set
0312315821 is in xISBN's result set
0380006871 is in xISBN's result set
0445405600 is in xISBN's result set
0445408677 is in xISBN's result set
Now run through the ISBNs in xISBN's result set
The above four have been examined already; don't look at them again
0340127376 is unknown at thingISBN
0417052502 is unknown at thingISBN
0445405619 is unknown at thingISBN
3548014402 is unknown at thingISBN
thingISBN can’t give us any leads on new ISBNs. Four ISBNs were known to both places; their clusters sort of lined up. We knew four other ISBNs, from xISBN, and threw them at thingISBN, but we didn’t turn up any previously undiscovered manifestations.
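One way to read the run above is as a loop that keeps throwing every newly seen ISBN at each service until nothing new turns up. The sketch below follows that reading; it is not my actual script. The two lookup functions are stand-ins backed by the results quoted above and make no network calls. They assume each service returns the same cluster for any ISBN it knows and nothing for one it doesn't.

```python
from collections import deque

# Canned results for this example, copied from the result sets above.
THING_CLUSTER = frozenset(
    {"0312315821", "0380006871", "0445405600", "0445408677"}
)
X_CLUSTER = frozenset({
    "0312315821", "0340127376", "0380006871", "0417052502",
    "0445405600", "0445405619", "0445408677", "3548014402",
})


def query_thingisbn(isbn):
    """Stand-in for a thingISBN lookup (assumed behavior, no network)."""
    return set(THING_CLUSTER) if isbn in THING_CLUSTER else set()


def query_xisbn(isbn):
    """Stand-in for an xISBN lookup (assumed behavior, no network)."""
    return set(X_CLUSTER) if isbn in X_CLUSTER else set()


def superdupe(seed, services):
    """Follow every ISBN through every service until no new ones appear."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        isbn = queue.popleft()
        for lookup in services:
            for found in lookup(isbn):
                if found not in seen:  # each ISBN is examined only once
                    seen.add(found)
                    queue.append(found)
    return seen


result = superdupe("0312315821", [query_thingisbn, query_xisbn])
print(len(result))
```

With these canned results the loop ends with the same eight ISBNs that combining and deduping gave. Superduping can only add something when a lookup on one of the ISBNs returns a number that is in neither starting set.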
So in this case, combining and deduping gives the same results as superduping. “That’s boring,” I hear you say. Next time I’ll give examples of where superduping breaks apart clusters and gives more complete results. And I’ll show examples of how this can fly out of control and go haywire.
It’s an ISBN thing :-)
Comment by Mia Massicotte — 18 April 2007 @ 6:38 pm
In fact, ‘super duping’ doesn’t give you any extra manifestations in that last example _because_ one set was a proper superset of the other. In any case where this is true, your ‘super duping’ won’t give you any extras.
This is true because both services implement the property where querying on any of the ISBNs in a work set will give the same work set. What do you call this property mathematically? I forget. (For that matter, there’s probably a more rigorous and precise name for ‘super duping’ too.)
Anyway, your point about identifiers being needed for pre-ISBN works is a good one. On the RDA lists and elsewhere, certain traditionalist catalogers seem to be arguing that our traditional ‘primary access points’ (ie, ‘work identifiers’ and ‘author identifiers’) serve perfectly well. I’m having trouble explaining why they don’t.
Incidentally, ISBNs are somewhat problematic as identifiers for us, in that they don’t line up perfectly 1-to-1 with the way we’ve always divided up the universe. A single manifestation (according to us) can frequently have multiple ISBNs. We don’t divide up the world in the same way the ISBN standard does. This is potentially problematic: ISBN does not exactly serve as a manifestation identifier. What ISBN identifies is an entity that does not exist in our historical practice or in FRBR.
Comment by Jonathan Rochkind — 19 April 2007 @ 9:53 am
If one result set is a subset of the other, you’re right, there’s no point in going any further. I dragged it out for the sake of the example.
Mathematically, I think of the result sets/clusters as sometimes imperfectly defined equivalence classes.
Comment by William Denton — 19 April 2007 @ 10:28 am
Ha, except I was wrong, as you show in your next example, where one set is a subset of the other, but they can still be super-duped. Now I’m confused and have to think about it more. But I was wrong!
Comment by Jonathan Rochkind — 19 April 2007 @ 12:09 pm
You made me wrong, too! I’m sure we both had our minds on more important things, like morning tea or coffee. However, we can agree that if the results are equal then there’s no point in going further.
Comment by William Denton — 19 April 2007 @ 12:48 pm