A weblog following developments around the world in FRBR: Functional Requirements for Bibliographic Records.

Maintained by William Denton, Web Librarian at York University. Suggestions and comments welcome at wtd@pobox.com.


Confused? Try What Is FRBR? (2.8 MB PDF) by Barbara Tillett, or Jenn Riley's introduction. For more, see the basic reading list.

Books: FRBR: A Guide for the Perplexed by Robert Maxwell (ISBN 9780838909508) and Understanding FRBR: What It Is and How It Will Affect Our Retrieval Tools edited by Arlene Taylor (ISBN 9781591585091) (read my chapter FRBR and the History of Cataloging).

30 March 2007

Comparing xISBN and thingISBN (4): Fiction

Filed under: Implementations, LibraryThing, OCLC — William Denton @ 7:11 am

Today I compare what OCLC’s xISBN and LibraryThing’s thingISBN know about the fiction in my personal library. I have everything catalogued and stored in a MySQL database so it wasn’t hard to whip up a script to query the database, pull the ISBNs, and then run through each one and see what results the two services gave back. Remember, given an ISBN, they’ll give you a list of ISBNs of other manifestations (editions) of the same work. xISBN does this based on algorithms that OCLC people run on the enormous WorldCat database. thingISBN does this based on which books LibraryThing users have decided are two different editions of the same thing.

982 of my fiction books (in which I include plays and poetry) have ISBNs. Of the 982, xISBN’s results were greater for 520 and thingISBN’s were greater for 276. In 398 cases it was necessary to combine and de-dupe the results to get the best answers: that is, both services knew about some ISBNs the other one didn’t, so to get the most complete coverage I combined the sets of ISBNs and tossed out any duplicates. 398 / 982 = 41%.
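The combine-and-de-dupe step above amounts to a set union. Here is a minimal Python sketch of it, not the script I actually ran: the function names are made up for illustration, the ISBNs are placeholder strings rather than real service responses, and normalising to upper case (so a trailing "x" and "X" match) is my own assumption.

```python
def normalise(isbns):
    """Strip whitespace and upper-case each ISBN so 'x' and 'X' check digits match."""
    return {isbn.strip().upper() for isbn in isbns}


def combine(x_isbns, t_isbns):
    """Return (merged set, ISBNs only xISBN gave, ISBNs only thingISBN gave)."""
    x, t = normalise(x_isbns), normalise(t_isbns)
    return x | t, x - t, t - x


def needs_both(x_isbns, t_isbns):
    """True when each service knows at least one ISBN the other does not."""
    _, only_x, only_t = combine(x_isbns, t_isbns)
    return bool(only_x) and bool(only_t)


if __name__ == "__main__":
    # Placeholder data, not real results from either service.
    from_x = ["1111111111", "222222222x"]
    from_t = ["222222222X", "3333333333"]
    merged, only_x, only_t = combine(from_x, from_t)
    print(len(merged), sorted(only_x), sorted(only_t), needs_both(from_x, from_t))
```

Counting the books for which needs_both is true is how a figure like 398 out of 982 would be reached; round(100 * 398 / 982) gives the 41 quoted above.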

xISBN knew about all of my books, but thingISBN had never seen 94 of them. The most-manifested book unknown to LibraryThing is K.C. Constantine’s The Man Who Liked to Look at Himself, the second in the great series of crime novels about Mario Balzic and Rocksburg, Pennsylvania. xISBN knows ten different manifestations of it.

Herewith, the books for which the combined and de-duped total (the x+t column) is over 200. The t column is thingISBN’s results and the x column is xISBN’s.

x+t   t   x
715 170 642 0140449094 Don Quixote (Cervantes)
647 203 576 0140350160 Treasure Island (Stevenson)
519 213 442 0486280616 Adventures of Huckleberry Finn (Twain)
451 241 353 0192833553 Pride and Prejudice (Austen)
423 189 333 0670821624 The Odyssey (Homer)
402 195 330 014043237X Frankenstein (Shelley)
386 168 314 0192815989 Dracula (Stoker)
353 126 239 0048231134 The Two Towers (Tolkien)
320 152 263 0140366857 The Wind In the Willows (Grahame)

There are a lot of editions of those books! 715 of Don Quixote! Nothing surprising in the results, except that The Two Towers is the only one of the three books in The Lord of the Rings to appear. See below for more on that. One other Stevenson shows up further down, and I’m sure The Iliad, Tom Sawyer, and other Austens would be here too, if I owned them.

x+t   t   x
314 284  36 0393099776 Alice in Wonderland (Carroll)

So far xISBN has had the higher numbers, and here it really seems to be letting the side down. But perhaps it’s not that simple: my copy is a Norton Critical Edition that includes essays and commentary. It’s not a manifestation of the work Alice in Wonderland, it’s a manifestation of a newer work that contains Alice in Wonderland and a number of derivative works. Complicated. I can see why LibraryThing users grouped it together with all other editions of the book. However, by the FRBR model, it’s a separate work. Perhaps xISBN gives such a low number because it’s keeping it apart from the others. I didn’t check.

x+t   t   x
304 137 240 0486410250 Anne of Green Gables (Montgomery)
300 136 233 0192828398 Twenty Thousand Leagues Under the Sea (Verne)
294 140 217 0452269695 The Essential Dr. Jekyll and Mr. Hyde (Stevenson)

The last book there is a similar case to my Alice in Wonderland: it’s “The Definitive Annotated Edition of Robert Louis Stevenson’s Classic Novel.” Both services group it together with the regular editions. In FRBR terms, it’s a separate work.

x+t   t   x
293 122 225 0330242407 The Jungle Book (Kipling)
280 111 226 0670037796 The Three Musketeers (Dumas)
280 111 226 0192827510 The Three Musketeers (Dumas)
278 136 206 0192830937 Around the World in 80 Days (Verne)
271  81 240 0140444300 Les Miserables (Hugo)
233  91 190 034547242X The Hunchback of Notre Dame (Hugo)

Two editions of The Three Musketeers are grouped together at both places and recognized as being the same work. That’s good. thingISBN shows lower numbers than I’d expect for Les Miserables and The Hunchback of Notre Dame. Perhaps they’re just a bit less popular with its users so there aren’t as many manifestations to group.

x+t   t   x
221 217   5 0048231541 The Hobbit (Tolkien)

Problem at xISBN! This is a 1970s Unwin paperbacks edition, and there’s nothing special about it. By some mistake it’s not getting clustered with the hundreds of other editions WorldCat knows about.

x+t   t   x
201  98 157 0140126708 Animal Farm (Orwell)

That’s the last of the fiction in my collection that has over 200 different manifestations. To go back a bit, what about the other two books in The Lord of the Rings?

x+t   t   x
154 150   5 004823155X The Fellowship of the Ring (Tolkien)
130 126   5 0048231576 The Return of the King (Tolkien)

xISBN is preventing two of my three Unwin LOTRs, and The Hobbit, from clustering with other manifestations of the same works! Why does it handle my The Two Towers properly, but not the others? I have no idea. thingISBN groups them all properly.

A solution to this problem is what I’m going to call superduping. (I mentioned this a couple of days ago, and OCLC’s Xiaoming Liu suggested it too.) If thingISBN’s and xISBN’s results differ by, say, an order of magnitude, or there’s something that leads you to think one of them is missing a lot of ISBNs, or whenever you want the absolute maximum number of related ISBNs, try running through ISBNs from one set of results and query the other service about them until you have merged all possible sub-clusterings.

For example, for The Hobbit, where thingISBN said 217, and xISBN said 5, pick one of the 217 that’s not in xISBN’s 5 and query xISBN with it. You’ll probably get back hundreds of results, most of them duplicating thingISBN’s. If thingISBN still knows about numbers that xISBN hasn’t told you about, pick one of them and query xISBN with it. Continue until done, then go the other way and query thingISBN about any results from xISBN that haven’t shown up at thingISBN so far. When finished, you’ll know that all possible clusterings at both services have been joined by you into one big new set. Thus you use each service to correct any failings or lack of knowledge at the other.
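The back-and-forth described above can be sketched as a loop that keeps querying each service with ISBNs it hasn't yet accounted for, until neither has anything new to say. This is only a sketch under assumptions: lookup_x and lookup_t are hypothetical stand-ins for functions that take one ISBN and return a list of related ISBNs from xISBN and thingISBN, and make_lookup fakes a service from hand-written clusters so the example runs without touching either real one.

```python
def superdupe(seed, lookup_x, lookup_t):
    """Merge the clusters from both services, starting from one seed ISBN."""
    known = {seed}
    services = {"x": lookup_x, "t": lookup_t}
    # ISBNs each service has either returned or already been asked about.
    covered = {"x": set(), "t": set()}
    progress = True
    while progress:
        progress = False
        for name, lookup in services.items():
            missing = known - covered[name]
            while missing:
                isbn = missing.pop()
                result = set(lookup(isbn)) | {isbn}
                covered[name] |= result
                known |= result
                missing = known - covered[name]
                progress = True
    return known


def make_lookup(clusters):
    """Build a fake service: return the cluster containing the ISBN, else nothing."""
    def lookup(isbn):
        for cluster in clusters:
            if isbn in cluster:
                return sorted(cluster)
        return []
    return lookup


if __name__ == "__main__":
    # Toy clusters with placeholder labels instead of real ISBNs.
    fake_x = make_lookup([{"A", "B"}, {"C", "D", "E"}])  # one work split in two
    fake_t = make_lookup([{"A", "C", "F"}])
    print(sorted(superdupe("A", fake_x, fake_t)))
```

In the toy data the first fake service splits one work into two clusters, much as xISBN does with my Unwin Hobbit, and the second one bridges them. The sketch assumes a service returns the same cluster whichever member you ask about, so an ISBN a service has already returned is never sent back to it.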

Monday: my nonfiction. I’ll give you the top 20 or so, only two of which have over 200 manifestations. After that I’ll do some experiments with superduping (what will the new numbers be for my Tolkiens?) and perhaps I’ll pick out some oddities from all these results.