A weblog following developments around the world in FRBR: Functional Requirements for Bibliographic Records.

Maintained by William Denton, Web Librarian at York University. Suggestions and comments welcome at wtd@pobox.com.


Confused? Try What Is FRBR? (2.8 MB PDF) by Barbara Tillett, or Jenn Riley's introduction. For more, see the basic reading list.

Books: FRBR: A Guide for the Perplexed by Robert Maxwell (ISBN 9780838909508) and Understanding FRBR: What It Is and How It Will Affect Our Retrieval Tools edited by Arlene Taylor (ISBN 9781591585091) (read my chapter FRBR and the History of Cataloging).

Superduping: slow introduction

Posted by: William Denton, 18 April 2007 7:30 am
Categories: Implementations,LibraryThing,OCLC

My superduping experiments were interesting in a few different ways, and I’m still trying some things out and hacking my scripts. I’ll give a few examples over a few days.

First, a brief introduction. Let’s consider Ross Thomas’s novel The Seersucker Whipsaw. My item of this work is an exemplar of the 1985 Perennial Library paperback manifestation, which is an embodiment of the author’s final edited text. The ISBN is 0060807288.

If we query thingISBN for 0060807288, we get 3 ISBNs back:

0060807288
0060808497
0446401692

And if we query xISBN for 0060807288, we also get 3 ISBNs back:

0060807288
0060808497
0446401692

The two result sets are identical. Nothing further need be done. That was simple, eh? As far as we can tell, this work has had only three manifestations.

In fact that’s false: these are all paperbacks, respectively from 1985, 1987, and 1992. The first edition was published by Morrow in 1967. Why isn’t it included in the results? It doesn’t have an ISBN! It was published too early to have one. That first ISBNless manifestation is out of luck and won’t show up in any xISBN or thingISBN results.

“That isn’t fair,” I hear you cry. It isn’t. Books that predate International Standard Book Numbers get the cold shoulder from xISBN and thingISBN, which, as you may have noticed from their names, are about ISBNs. “How do we get around that?” I hear you ask. Every work, expression, manifestation, and item will need to have a unique identifier. If one exists (like an ISBN for a manifestation), we can use it. If none exists, we’ll have to make one up and have everyone agree on it. (Or make up several and map them from one to the other.)

For the second example, let’s use another Ross Thomas novel, The Fools In Town Are On Our Side. (The title is from The Adventures of Huckleberry Finn: “Hain’t we got all the fools in town on our side? And ain’t that a big enough majority in any town?”) My item is an exemplar of the 2003 St. Martin’s trade paperback reprint, ISBN 0312315821. The first manifestation was published in 1970. It’s one of his best novels and has been reprinted more than The Seersucker Whipsaw.

If we query thingISBN for 0312315821, we get 4 ISBNs back:

0312315821
0380006871
0445405600
0445408677

And if we query xISBN for 0312315821, we get 8 ISBNs back:

0312315821
0340127376
0380006871
0417052502
0445405600
0445405619
0445408677
3548014402

The 4 thingISBN numbers appear in xISBN’s result set. In set theory lingo, one might say that xISBN’s results are a proper superset of thingISBN’s.
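Ruby's standard Set class can check that relationship directly. Here is a minimal sketch using the two lists quoted above; the constant names are my own, and nothing here talks to either service.

```ruby
require 'set'

# Result sets copied from the two queries for ISBN 0312315821 above.
THING_ISBNS = Set.new(%w[0312315821 0380006871 0445405600 0445408677])
X_ISBNS = Set.new(%w[
  0312315821 0340127376 0380006871 0417052502
  0445405600 0445405619 0445408677 3548014402
])

# A set union keeps a single copy of each ISBN.
COMBINED = THING_ISBNS | X_ISBNS

# ISBNs that only xISBN reported.
ONLY_AT_XISBN = X_ISBNS - THING_ISBNS

puts "combined: #{COMBINED.size}"
puts "xISBN is a proper superset: #{X_ISBNS.proper_superset?(THING_ISBNS)}"
puts "only at xISBN: #{ONLY_AT_XISBN.to_a.sort.join(' ')}"
```

The set difference is the list of ISBNs that still need to be thrown at thingISBN.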

If we combine and dedupe the results, we’ll get 8 ISBNs, all the ones from xISBN. What would superduping give us? Might it find more?

As it turns out, no. Here’s what happens:

First run through ISBNs in thingISBN result set
0312315821 is in xISBN's result set
0380006871 is in xISBN's result set
0445405600 is in xISBN's result set
0445408677 is in xISBN's result set
Now run through the ISBNs in xISBN's result set
The above four have been examined already; don't look at them again
0340127376 is unknown at thingISBN
0417052502 is unknown at thingISBN
0445405619 is unknown at thingISBN
3548014402 is unknown at thingISBN

thingISBN can’t give us any leads on new ISBNs. Four ISBNs were known to both places; their clusters sort of lined up. We knew four other ISBNs, from xISBN, and threw them at thingISBN, but we didn’t turn up any previously undiscovered manifestations.

So in this case, combining and deduping gives the same results as superduping. “That’s boring,” I hear you say. Next time I’ll give examples of where superduping breaks apart clusters and gives more complete results. And I’ll show examples of how this can fly out of control and go haywire.


Superduping results next week

Posted by: William Denton, 13 April 2007 7:09 am
Categories: Implementations,LibraryThing,OCLC

A brief note: I got my superduping script working and next week I’ll post some results. With it, I take an ISBN and check thingISBN and xISBN to get the ISBNs that they cluster with it as being other manifestations of the same work. Instead of combining and de-duping the results, as I did before, I run through the ISBNs one by one, and if an ISBN has only been seen at one service I look it up at the other and grab all of the ISBNs that were clustered with it. All of those ISBNs will be new, and I can check on them back at the first service. I go back and forth, using each service’s results to break apart the fragmentation at the other service, forming a maximal superset of ISBNs. This process I call superduping.
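For the curious, here is a rough sketch of that back-and-forth in Ruby. It is not my actual script. The two services are stood in for by in-memory lookups (hypothetical lambdas that take an identifier and return the cluster it belongs to, or an empty array if it is unknown), and the identifiers are made-up placeholders, not real ISBNs.

```ruby
require 'set'

# Build a stand-in "service" from a list of clusters. The returned lambda
# maps an ID to the cluster containing it, or [] if the ID is unknown.
def cluster_lookup(clusters)
  index = {}
  clusters.each { |cluster| cluster.each { |id| index[id] = cluster } }
  ->(id) { index.fetch(id, []) }
end

# Superdupe: keep asking each service about every ID it has not yet
# accounted for, until no service can tell us anything new.
def superdupe(start_id, services)
  found = Set[start_id]
  # covered[i] holds the IDs already accounted for at service i, either
  # because we queried them or because they came back in one of its clusters.
  covered = services.map { Set.new }
  loop do
    progress = false
    services.each_with_index do |lookup, i|
      (found - covered[i]).each do |id|
        next if covered[i].include?(id) # arrived in a cluster earlier in this pass
        cluster = lookup.call(id)
        covered[i] << id
        covered[i].merge(cluster)
        found.merge(cluster)
        progress = true
      end
    end
    break unless progress
  end
  found
end

# Made-up data: each toy service splits the same work into two clusters.
TOY_THING = cluster_lookup([%w[id1 id2 id3 id4], %w[id5 id7]])
TOY_X     = cluster_lookup([%w[id1 id5], %w[id2 id3 id6]])

SIMPLE_MERGE = Set.new(TOY_THING.call('id1')) | Set.new(TOY_X.call('id1'))
SUPERDUPED   = superdupe('id1', [TOY_THING, TOY_X])

puts "combine and de-dupe: #{SIMPLE_MERGE.size} IDs"
puts "superduping: #{SUPERDUPED.size} IDs"
```

In the toy data, combining and de-duping from id1 never reaches id6 or id7, because each sits in a cluster that only turns up once an ID learned from the other service is looked up. Passing the lookups in as arguments also means the same loop could be pointed at real clients later.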

When I combined and de-duped the results for my copy of The Hobbit from thingISBN (217 ISBNs) and xISBN (5) I got a total of 221 ISBNs. xISBN’s number was so low that obviously it didn’t have the ISBN of my manifestation grouped in with others. By superduping, I got over 1200 ISBNs! That kind of result won’t happen often, but even for more common works the results were interesting. More on all this next week.


LibraryThing gizmo for libraries

Posted by: William Denton, 11 April 2007 7:11 am
Categories: Implementations,LibraryThing

Over at LibraryThing, Tim Spalding announced a new gizmo for libraries:

LibraryThing for Libraries is composed of a series of widgets, designed to enhancing library catalogs with LibraryThing data and functionality. The achievement is that the widgets require NO back-end integration.

We’re serious. Just add a single Javascript tag, and one tag for every widget you want to display and we do the rest. To make sure the widgets use your library’s version of a title and that some widgets only refer to books you have, you also need to upload a file with ISBNs in it—just ISBNs or all mixed together in MARC records or whatever. The whole thing should work with any catalog.

In a subsequent blog post there is a sample of the XML output the gizmo would offer.

Spalding did up a demo of what the gizmo would look like in action on a New York Public Library page, and you can see why I mention it here: he’s using thingISBN data to group together other manifestations of the same work under Related Editions.

It’s great he’s making this available, especially as XML so that libraries with programmers can do what they want with the data. Libraries without programmers can just bung in a line of HTML and get something that will help users. Or will it? Would they find it confusing? I wonder what usability testing will say. I see two reasons why LibraryThing’s service may not get much uptake from libraries.

First, they’re unable or scared to make the slightest change to their catalogue, especially one that comes from an organization that doesn’t have thousands of employees and doesn’t charge them lots of money.

Second, users may be confused by seeing a list of books that appear to be identical to the one they’re looking at. In Spalding’s example, if I’m an NYPL user looking for Harry Potter and the Half-Blood Prince, why would I follow any of the related links? Would it help if more information were there, such as if it’s a translation? That’s at the FRBR Expression level, and not available yet. Where thingISBN’s information is really needed is inside the workings of the catalogue, not pasted on top, but library systems vendors move slowly and most seem uninterested in FRBRizing.

That said, perhaps libraries running the free and open source systems Koha and Evergreen could sign up and work to get this data handled inside the catalogue, where it could be used to give the users better results, instead of just layering it on top.

However, the new gizmo is certainly a step in the right direction and congrats to LibraryThing.


Comparing xISBN and thingISBN (5): Nonfiction

Posted by: William Denton, 2 April 2007 7:58 am
Categories: Implementations,LibraryThing,OCLC

In my previous post I compared how much thingISBN and xISBN know about my fiction books. Today I look at how much the two services know about nonfiction in my personal library.

Exactly 1000 works of nonfiction in my collection have ISBNs. Of those, xISBN knew more manifestations for 371, and thingISBN knew more for 338. In 187 cases (call it 20%, or half the rate for fiction) it was necessary to combine and de-dupe the results to get the most ISBNs.

xISBN knew about all of my books, but thingISBN had never seen 116 of them. There was a tie for the most-manifested book unknown to LibraryThing: Clark Blaise’s Time Lord: Sir Sandford Fleming and the Creation of Standard Time and Dorothy Gardiner and Kathrine Sorley Walker’s Raymond Chandler Speaking have eight manifestations at xISBN but LibraryThing has never seen the ISBNs of my copies so it can’t give any results. It does know about other editions. Superduping will help in such cases.

Herewith, the top nonfiction results. There are twenty books covering nineteen works, ten of which are translated into English. The x+t column shows how many combined and de-duped ISBNs there are; t is thingISBN’s results; x is xISBN’s.

x+t   t   x
322 140 280 0877735425 Tao Teh Ching
273  68 241 0140441212 The Bhagavad Gita
141  28 132 0140390448 Walden and Civil Disobedience (Thoreau)
130  59 107 0140209158 The Communist Manifesto (Marx and Engels)
 92  45  71 0486290735 The Autobiography of Benjamin Franklin
 88   2  87 0140150609 The Portable Gibbon

The Portable Gibbon is the first odd one. It’s in the Viking Portable series, and it’s an abridged (very abridged, but it’s still thick) edition of The Decline and Fall of the Roman Empire by Edward Gibbon. In FRBR terms, abridgments are new expressions of a work. (See Barbara Tillett’s great “Family of Works” diagram in What Is FRBR? for more on when something is the same expression, a new expression, or a new work.) xISBN gives 87 because it considers The Portable Gibbon and Decline and Fall to be the same work, as FRBR dictates, but thingISBN gives 2 because LibraryThing users consider it a separate work. That’s understandable, especially in this case, because it does really seem like a different work.

x+t   t   x
 87  33  70 0380010003 The Interpretation of Dreams (Freud)
 83  51  59 0812968255 Meditations (Marcus Aurelius)
 83  51  59 0140441409 Meditations (Marcus Aurelius)

Both of my editions of the great Stoic work are grouped together, as they should be. They are different translations, so they are separate expressions.

x+t   t   x
 79  61  53 0020867409 The Screwtape Letters (Lewis)
 78  60  41 055305340X A Brief History of Time (Hawking)

Interesting that thingISBN is ahead on the Hawking book. I suspect a clustering problem at xISBN.

x+t   t   x
 69  23  56 1857150848 Confessions (Rousseau)
 60  23  45 0553370901 The Tibetan Book of the Dead
 59  39  47 0879510188 A Book of Five Rings (Musashi)
 52  40  29 0062700375 Halliwell's Film Guide (Halliwell)

My Halliwell’s Film Guide is the eighth revised edition, the last one he edited; he died afterwards and someone else took over the series. Both services group together all the revised editions as the same work.

x+t   t   x
 50  20  42 0141182768 Seven Pillars of Wisdom (Lawrence)
 47  47   2 1580085415 What Color Is Your Parachute? (Bolles)

Something odd at xISBN with this one. What Color Is Your Parachute? is a book about how to find a job, and it’s revised every year. WorldCat mostly considers it an annual serial, and xISBN groups all of the yearly editions together, but for some reason mine is in a separate cluster. thingISBN groups all the years together as the same work.

x+t   t   x
 46  29  35 048642703X The Protestant Ethic and the Spirit of Capitalism (Weber)
 46   2  46 0801856620 History of My Life (6 vols) (Casanova)
 42  35  17 0553225987 Robert's Rules of Order (Robert)

Nice to follow What Color Is Your Parachute? with Weber’s classic work of sociology, and then get on to Casanova’s memoirs. There’s a trio for you.

The Casanova is an interesting case. (He was a librarian at the end of his life, by the way.) The ISBN I’ve shown is for the first volume of six in W.R. Trask’s translation. This manifestation comes from Johns Hopkins University Press. The other five volumes all had almost identical numbers at both services. xISBN groups this translation in with lots of others, including the various printings of Arthur Machen’s translation, but thingISBN doesn’t. There are other editions of Casanova’s memoirs in LibraryThing, including the Penguin Classics abridgment, but they’re not clustered with this. It gets confusing, because some English translations are called The Memoirs of Jacques Casanova, and different manifestations have different numbers of volumes. If you look at the LibraryThing page for combining works by Casanova you’ll get an idea of how bibliographically challenging the memoirs are. A full FRBRization, with the expression layer dealing with translations, would help a lot. Anyway, here it seems like thingISBN is missing out, whereas with What Color Is Your Parachute? xISBN was missing out. Superduping will fix this.

That ends the big comparisons of fiction and nonfiction. Lesson learned: If you’re using either xISBN or thingISBN, you should be using both and combining and de-duping results. I think superduping will give even more interesting results and I’ll post on that soon.


Comparing xISBN and thingISBN (4): Fiction

Posted by: William Denton, 30 March 2007 7:11 am
Categories: Implementations,LibraryThing,OCLC

Today I compare what OCLC’s xISBN and LibraryThing’s thingISBN know about the fiction in my personal library. I have everything catalogued and stored in a MySQL database so it wasn’t hard to whip up a script to query the database, pull the ISBNs, and then run through each one and see what results the two services gave back. Remember, given an ISBN, they’ll give you a list of ISBNs of other manifestations (editions) of the same work. xISBN does this based on algorithms that OCLC people run on the enormous WorldCat database. thingISBN does this based on which books LibraryThing users have decided are two different editions of the same thing.

982 of my fiction books (in which I include plays and poetry) have ISBNs. Of the 982, xISBN’s results were greater for 520 and thingISBN’s were greater for 276. In 398 cases it was necessary to combine and de-dupe the results to get the best answers: that is, both services knew about some ISBNs the other one didn’t, so to get the most complete coverage I combined the sets of ISBNs and tossed out any duplicates. 398 / 982 = 41%.
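The per-book test behind those tallies can be sketched in a few lines of Ruby. This is an illustration rather than the script I ran: the method names are invented for the example, and the sample IDs are placeholders, not real ISBNs.

```ruby
require 'set'

# Compare the two result sets for one book. Reports which service listed
# more ISBNs, and whether each service knew ISBNs the other lacked (the
# case where combining and de-duping beats using either service alone).
def compare_results(thing_isbns, x_isbns)
  thing = thing_isbns.to_set
  x = x_isbns.to_set
  larger =
    if x.size > thing.size
      :xisbn
    elsif thing.size > x.size
      :thingisbn
    else
      :tie
    end
  { larger: larger, needs_merge: !(thing - x).empty? && !(x - thing).empty? }
end

# Share of books that needed a merge, as a whole-number percentage.
def merge_rate(needs_merge_count, total)
  (100.0 * needs_merge_count / total).round
end

# Made-up placeholder IDs.
p compare_results(%w[id1 id2 id3], %w[id1 id4])
p compare_results(%w[id1], %w[id1 id2])
puts "fiction merge rate: #{merge_rate(398, 982)}%"
```

Note that a book can count towards one service's "greater" tally and still need a merge, which is why the three numbers do not add up to 982.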

xISBN knew about all of my books, but thingISBN had never seen 94 of them. The most-manifested book unknown to LibraryThing is K.C. Constantine’s The Man Who Liked to Look at Himself, the second in the great series of crime novels about Mario Balzic and Rocksburg, Pennsylvania. xISBN knows ten different manifestations of it.

Herewith, the works for which the combined and de-duped total (the x+t column) is over 200. t is thingISBN’s result count and x is xISBN’s.

x+t   t   x
715 170 642 0140449094 Don Quixote (Cervantes)
647 203 576 0140350160 Treasure Island (Stevenson)
519 213 442 0486280616 Adventures of Huckleberry Finn (Twain)
451 241 353 0192833553 Pride and Prejudice (Austen)
423 189 333 0670821624 The Odyssey (Homer)
402 195 330 014043237X Frankenstein (Shelley)
386 168 314 0192815989 Dracula (Stoker)
353 126 239 0048231134 The Two Towers (Tolkien)
320 152 263 0140366857 The Wind In the Willows (Grahame)

There are a lot of editions of those books! 715 of Don Quixote! Nothing surprising in the results, except that The Two Towers is alone of the three books in The Lord of the Rings. See below for more on that. One other Stevenson is in this list, and I’m sure The Iliad, Tom Sawyer, and other Austens would be too, if I owned them.

x+t   t   x
314 284  36 0393099776 Alice in Wonderland (Carroll)

So far xISBN has had the higher numbers, and here it really seems to be letting the side down. But perhaps it’s not that simple: my copy is a Norton Critical Edition that includes essays and commentary. It’s not a manifestation of the work Alice in Wonderland, it’s a manifestation of a newer work that contains Alice in Wonderland and a number of derivative works. Complicated. I can see why LibraryThing users grouped it together with all other editions of the book. However, by the FRBR model, it’s a separate work. Perhaps xISBN gives such a low number because it’s keeping it apart from the others. I didn’t check.

x+t   t   x
304 137 240 0486410250 Anne of Green Gables (Montgomery)
300 136 233 0192828398 Twenty Thousand Leagues Under the Sea (Verne)
294 140 217 0452269695 The Essential Dr. Jekyll and Mr. Hyde (Stevenson)

The last book there is actually a similar thing to Alice in Wonderland: it’s “The Definitive Annotated Edition of Robert Louis Stevenson’s Classic Novel.” Both services group it together with the regular editions. In FRBR terms, it’s a separate work.

x+t   t   x
293 122 225 0330242407 The Jungle Book (Kipling)
280 111 226 0670037796 The Three Musketeers (Dumas)
280 111 226 0192827510 The Three Musketeers (Dumas)
278 136 206 0192830937 Around the World in 80 Days (Verne)
271  81 240 0140444300 Les Miserables (Hugo)
233  91 190 034547242X The Hunchback of Notre Dame (Hugo)

Two editions of The Three Musketeers are grouped together at both places and recognized as being the same work. That’s good. thingISBN shows lower numbers than I’d expect for Les Miserables and The Hunchback of Notre Dame. Perhaps they’re just a bit less popular with its users so there aren’t as many manifestations to group.

x+t   t   x
221 217   5 0048231541 The Hobbit (Tolkien)

Problem at xISBN! This is a 1970s Unwin paperbacks edition, and there’s nothing special about it. By some mistake it’s not getting clustered with the hundreds of other editions WorldCat knows about.

x+t   t   x
201  98 157 0140126708 Animal Farm (Orwell)

That’s the last of the fiction in my collection that has over 200 different manifestations. To go back a bit, what about the other two books in The Lord of the Rings?

x+t   t   x
154 150   5 004823155X The Fellowship of the Ring (Tolkien)
130 126   5 0048231576 The Return of the King (Tolkien)

xISBN is preventing two of my three Unwin LOTRs, and The Hobbit, from clustering with other manifestations of the same works! Why does it handle my The Two Towers properly, but not the others? I have no idea. thingISBN groups them all properly.

A solution to this problem is what I’m going to call superduping. (I mentioned this a couple of days ago, and OCLC’s Xiaoming Liu suggested it too.) If thingISBN’s and xISBN’s results differ by, say, an order of magnitude, or there’s something that leads you to think one of them is missing a lot of ISBNs, or whenever you want the absolute maximum number of related ISBNs, try running through ISBNs from one set of results and query the other service about them until you have merged all possible sub-clusterings.

For example, for The Hobbit, where thingISBN said 217, and xISBN said 5, pick one of the 217 that’s not in xISBN’s 5 and query xISBN with it. You’ll probably get back hundreds of results, most of them duplicating thingISBN’s. If thingISBN still knows about numbers that xISBN hasn’t told you about, pick one of them and query xISBN with it. Continue until done, then go the other way and query thingISBN about any results from xISBN that haven’t shown up at thingISBN so far. When finished, you’ll know that all possible clusterings at both services have been joined by you into one big new set. Thus you use each service to correct any failings or lack of knowledge at the other.
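The "order of magnitude" rule of thumb for deciding when superduping is worth the extra queries could be written as a tiny helper. This is only a sketch: the method name and the default factor of 10 are my own choices for illustration, and the counts fed to it are taken from the tables above.

```ruby
# Flag a book when one service's count is at least `factor` times the
# other's. The default of 10 stands in for "an order of magnitude".
# A count of zero at one service (ISBN unknown there) is also flagged,
# as long as the other service returned something.
def worth_superduping?(thing_count, x_count, factor: 10)
  small, large = [thing_count, x_count].minmax
  large > 0 && large >= small * factor
end

# [title, thingISBN count, xISBN count] from the tables above.
[
  ['The Hobbit', 217, 5],
  ['The Fellowship of the Ring', 150, 5],
  ['The Two Towers', 126, 239],
  ['Don Quixote', 170, 642]
].each do |title, t, x|
  puts "#{title}: #{worth_superduping?(t, x) ? 'superdupe' : 'combine and de-dupe'}"
end
```

A fixed ratio is crude. It would not flag Alice in Wonderland (284 against 36), so a lower factor, or simply superduping everything, may suit some uses better.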

Monday: my nonfiction. I’ll give you the top 20 or so, only two of which have over 200 manifestations. After that I’ll do some experiments with superduping (what will the new numbers be for my Tolkiens?) and perhaps I’ll pick out some oddities from all these results.


Ruby gem: xisbn

Posted by: William Denton, 29 March 2007 7:03 am
Categories: LibraryThing,OCLC

Tonight I was working on comparing all my be-ISBNed books at thingISBN and xISBN, but while testing and debugging I hit xISBN’s daily limit and had to stop. They very quickly fixed that but I’m going to delay a day in posting the results. The numbers were looking very interesting, both for how xISBN and thingISBN compared and for how many manifestations there are of books like Don Quixote and Treasure Island.

Instead, today I give a pointer to something that may help people who use the programming language Ruby.

A helpful commenter named James left a note last Friday with a pointer to the xisbn Ruby gem written by Ed Summers. Thank you, James! Thank you, Ed! If you’re a Ruby hacker doing anything with xISBN or thingISBN, it’ll be handy. You can say

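require 'xisbn'
include XISBN

# Ask OCLC's xISBN which ISBNs it clusters with this one
xs = xisbn('0394821998')
# Ask LibraryThing's thingISBN the same question about another ISBN
things = thing_isbn('0812548345')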

You’ll get back arrays of ISBNs from both services. Run gem install xisbn to install it.

If you know of anything similar for other languages, please leave a comment or drop me a note.


Comparing xISBN and thingISBN (3)

Posted by: William Denton, 28 March 2007 7:31 am
Categories: Implementations,LibraryThing,OCLC

Today I’m comparing how some of my mathematics books fare in LibraryThing’s thingISBN and OCLC’s xISBN services. Given an ISBN, they each return a list of ISBNs of other manifestations (that is, editions) of the same work. Other manifestations that they know about. Of course, if they don’t know about a book, or don’t think it matches with any others, or in LibraryThing’s case the users haven’t grouped it, they won’t have anything to say about it.

Here’s a table showing the results. Each book takes up two rows. Yes, the formatting is a bit ugly, but you can bear it. The top row has the title and author. On the second row are some numbers. The first is the combined and de-duped count of how many ISBNs both thingISBN and xISBN know about. Next is the thingISBN count, then the xISBN count, then the count taken from WorldCat’s Editions tab. (WorldCat’s numbers will never be greater than 25, because 25 is the limit of results it will show.) xISBN and WorldCat’s Editions tab are both from OCLC, but their sources aren’t always in sync. Follow the links to see the raw results.

Some things to notice about the list:

  • These books tend to the academic side of things, but some are quite popular. (As math books go.)
  • Most of them are paperback. University libraries would more likely have them in hardcover; however, xISBN is bound to do a good job of grouping the two together.
  • No-one on LibraryThing has my old Linear Algebra textbook. It’s probably not in use in first-year algebra courses now. My edition of Flatland is completely unknown to thingISBN, which is very surprising. No-one there has Mathematics and the Imagination either. My edition is a Penguin paperback, and I see it in used bookstores occasionally. The latter two results are unexpected.
  • thingISBN has a 28 count for my edition of Gödel, Escher, Bach, but xISBN doesn’t know about any others. xISBN is failing, or missing something.
  • Forever Undecided by Raymond Smullyan (mine is a trade paperback) gets a 5 at thingISBN but just a 2 at xISBN. I imagine it’s in a lot of libraries, though.
  • My two volumes of Heath’s translation of Euclid give confusing results. Volume 1 isn’t matched up with other editions at either place. Volume 2 gets a 19 from xISBN, but has no companions at thingISBN. Strange. Is it something to do with being Dover reprints? All of the 0-486 books are from Dover, who do a great job of reprinting old math books. Perhaps it’s because Euclid’s Elements has a confusing printing history.
  • Gödel’s Proof by Nagel and Newman is a classic, and thingISBN gives an 8, but xISBN only 1. I’m sure it’s widely held in many libraries and personal collections, so xISBN is failing or missing something.
  • Most things that aren’t extreme cases or probable misses or mistakes do well at both places. For example, Bertrand Russell’s Introduction to Mathematical Philosophy and Boolos and Jeffrey’s Computability and Logic, both old textbooks and classics in their fields, do about equally well.
  • I’m a bit surprised by the number of cases where thingISBN knows more than xISBN.

x+t   t   x  WC
Alan Turing: The Enigma (Hodges)
  8   8   1   4 0099116413
On Numbers and Games (Conway)
  2   2   2   4 0121863506
Elementary Differential Equations with Applications (Penney and Edwards)
  9   6   5   6 0132541297
Linear Algebra (Insel, Spence, and Friedberg)
  6   0   6   8 0135370191
Gödel, Escher, Bach: An Eternal Golden Braid (Hofstadter)
 28  28   1   0 0140055797
Mathematics and the Imagination (Kasner and Newman)
  4   0   4  20 0140803882
Forever Undecided: A Puzzle Guide to Gödel (Smullyan)
  5   5   2   3 0192821962
Reflections on Kurt Gödel (Wang)
  2   2   1   0 0262730871
The Fifty-Nine Icosahedra (Coxeter et al)
  3   2   3   1 038790770X
Differential Equations and Their Applications (Braun)
  9   4   7   7 0387908064
Uses of Infinity (Zippin)
  3   0   3   4 0394015630
The Universal History of Numbers (Ifrah)
  9   9   2   2 0471375683
Introduction to Mathematical Philosophy (Russell)
  8   6   6  23 0486277240
The Thirteen Books of Euclid's Elements (v 1) (Euclid and Heath)
  1   1   1   8 0486600882
The Thirteen Books of Euclid's Elements (v 2) (Euclid and Heath)
 19   1  19  25 0486600890
On Formally Undecidable Propositions (Gödel)
  1   1   1   4 0486669807
Proofs and Refutations (Lakatos)
  4   2   4   8 0521290384
Philosophy of Mathematics (Putnam and Benacerraf)
  4   4   2   2 052129648X
Computability and Logic (Boolos and Jeffrey)
  8   7   8   6 0521389232
Flatland (Abbott)
 28   0  28  25 0631029605
The Man Who Knew Infinity (Kanigel)
  4   4   3   8 0684192594
Gödel's Proof (Nagel and Newman)
  8   8   1   2 0710070780
Geometry Revisited (Coxeter and Greitzer)
  3   2   2   3 088385600X
Calculus (Spivak)
  5   5   4   6 0914098772
x+t   t   x  WC

It would be interesting, though somewhat onerous, to do a more in-depth project comparing thingISBN and xISBN, perhaps by comparing results for random samples of different kinds of books from different kinds of libraries. This would tell us something about how well xISBN works and what sorts of books LibraryThing users have and how well they’ve made their clusters. On the other hand, if you’re actually implementing something and need the best results, the same holds true as yesterday: use both.

Upshot of this comparison based on a small sample of my math books: Sometimes xISBN misses manifestations that must be there; something about the data or its algorithm stops it from doing the clustering. Sometimes thingISBN doesn’t know anything about a given book. For best results, combine and de-dupe results from both services.

Tomorrow: who knows more about all the books in my library? Summary results only! No big table.

(Slightly edited after first posting.)


Comparing xISBN and thingISBN (2)

Posted by: William Denton, 27 March 2007 7:32 am
Categories: Implementations,LibraryThing,OCLC

Last week I posted Comparing xISBN and thingISBN, where I took a quick informal look at how the two services handled four books from my collection. The comments were interesting. Mia Massicotte speculated that paperbacks and fiction will probably do better in thingISBN and hardcovers and scholarly books will probably do better in xISBN. Today and tomorrow I’m doing some more comparisons to test this. Nothing scientific, just a bit of poking around with a few sets of books to see if any patterns emerge.

Today I do some paperback fiction and some picture books. Tomorrow I’ll do some mathematics books, most of which are fairly academic. Thursday I’ll try to run all my books through the services and post some aggregate numbers on who knows more.

In the result sets, you’ll see four columns of numbers at the left. x+t is first, but let me explain it third. t is the result count from thingISBN. x is the result count from xISBN. x+t is the count of the combined and de-duped results from both; that is, the two sets of ISBNs are put together and any duplicates removed. (Hence, this will always be equal to or greater than the greater of the thingISBN and xISBN result counts.)
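The three counts also say how much the two services overlap: the number of ISBNs known to both is t plus x minus x+t. A small sketch of that arithmetic follows; the method names are invented for the example, and the sample counts are the Rogue Male row on this page and the Hobbit row from the later fiction post.

```ruby
# ISBNs known to both services, derived from the three table columns:
# |T and X| = |T| + |X| - |T or X|
def overlap(combined, thing_count, x_count)
  thing_count + x_count - combined
end

# The combined count can never be below the larger of the two inputs,
# nor above their sum.
def consistent_counts?(combined, thing_count, x_count)
  combined >= [thing_count, x_count].max && combined <= thing_count + x_count
end

puts "Rogue Male: #{overlap(13, 6, 11)} ISBNs known to both"
puts "The Hobbit: #{overlap(221, 217, 5)} ISBN known to both"
puts consistent_counts?(13, 6, 11)
```

An overlap of just 1 next to a large count on one side, as with The Hobbit, is a hint that the other service has the queried ISBN sitting in a stray cluster.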

WC is the count from WorldCat’s Editions tab. WorldCat never displays more than 25 other manifestations of a work, so this number will never be over 25. I asked Thom Hickey why xISBN and WorldCat’s Editions tabs sometimes showed different numbers and he said that the two systems get their clustering data from two different sources that may not be synchronized. They’re continuing to work on algorithms and the xISBN implementation so I expect both xISBN and the WorldCat numbers to get more accurate. For now, though, they can sometimes be quite different, which is interesting, so I’ve included them.

These books are all from my own library. I wrote a Ruby script to query my collection database and then check with LibraryThing and OCLC. My paperback fiction subjects here aren’t of Stephen King, Dan Brown, or Danielle Steel’s level of popularity, but I don’t have any of their books. I grabbed two sets of novels by writers I thought would give interesting results. Next time I might check George MacDonald Fraser, Kim Stanley Robinson, and Donald E. Westlake. They’re all still publishing today. Come to think of it, they’d probably be better subjects than the two I picked, but it’s too late now.

First, some paperback novels by Geoffrey Household. Rogue Male is certainly his best known book, and one of the best and most unusual thrillers of the past century. xISBN knows about 11 manifestations in a set including the one I have, thingISBN knows about 6, and between them they know about 13 different ones. WorldCat’s Editions tab matches two manifestations, and shows the one I have and one other. Household’s other novels are less popular and the numbers show that they’ve been printed in few manifestations. xISBN knows more about them than thingISBN does.

x+t   t   x  WC
  4   1   4   4   0140048359 Hostage: London
  5   2   5   7   0140052739 The Last Two Weeks of George Rivac
  3   0   3   6   0140045228 Red Anger
  3   1   3   4   0140068538 Rogue Justice
 13   6  11   2   0140006958 Rogue Male
  4   0   4  10   0140022732 A Rough Shoot

Next, here are books, almost all paperbacks, by John D. MacDonald, one of the greats of the paperback original era who didn’t get into hardcover originals until the early 1970s. His Travis McGee series for Fawcett Gold Medal was massively popular. These are the JDMs for which I have ISBNs. His early books came out before ISBNs were invented. I think they’ve all been reprinted during the ISBN era, so more recent editions have them, but a few of my copies are too early. I skipped them to save time.

You can see that I have two different manifestations (paperback and hardcover) of both Cinnamon Skin and One More Sunday. thingISBN and xISBN’s numbers for them all match up, which shows that they correctly group both of my manifestations together as being the same work.

All of the books with colours in the title, such as Free Fall in Crimson and The Lonely Silver Rain, are in the McGee series and have been reprinted many times. It’s not surprising to see double-digit numbers for most of them. Darker Than Amber and Nightmare in Pink are unusual: xISBN doesn’t know about any other matching manifestations, but thingISBN does. Seems odd. thingISBN wins there. For most of the others, xISBN has a slight edge, but both know about some that the other doesn’t.

WorldCat’s Editions tab usually groups more together than xISBN does, such as for Cinnamon Skin, where it groups 15 manifestations to xISBN’s 4 (and thingISBN’s 6).

x+t   t   x  WC
  3   3   1   0   0449129578 All These Condemned
  1   0   1   0   044902380X Ballroom of the Skies
  8   3   8  11   0449131793 Barrier Island
  3   3   2   6   0449137147 Border Town Girl
  3   3   1   0   0449141411 The Brass Cupcake
  4   3   3   7   0449141063 A Bullet for Cinderella
  7   6   4  15   0060149906 Cinnamon Skin
  7   6   4  15   044912505X Cinnamon Skin
  2   0   2   3   0449123596 Clemmie
  2   2   1   0   0449134296 Cry Hard, Cry Fast
 10  10   1   0   0449127524 Darker Than Amber
 14  11   8  19   039701032X A Deadly Shade of Gold
  4   0   4   8   0449143236 Death Trap
  3   1   3   4   0449140164 The Deceivers
 17  11  14  18   0449141497 The Empty Copper Sea
  4   4   1   1   0449140598 The Executioners
 17  10  15  16   0449144410 Free Fall in Crimson
 16  11  11  16   0449129152 The Girl in the Plain Brown Wrapper
 16  10  15  18   0449123995 The Green Ripper
  1   0   1   0   0449024814 A Key to the Suite
  8   6   7  11   0449125092 The Lonely Silver Rain
  8   8   1   0   0449129659 The Long Lavender Look
  2   2   1   0   0449129667 A Man of Affairs
  3   2   3   3   0449136027 Murder in the Wind
 10  10   1  21   0449133125 Nightmare in Pink
  8   4   8   9   044920703X One More Sunday
  8   4   8   9   0394536738 One More Sunday
  3   3   2   8   0449140806 Please Write for Details
  1   1   1   0   0881840114 Two

Upshot of paperback fiction: Seems like more often than not xISBN has the edge, but sometimes thingISBN knows more. Sometimes xISBN will fail to group your manifestation with others and give a misleading answer. For best results, combine them.

Next, some picture books. The Denton ones are by Kady MacDonald Denton, my mother. They’ve come out in hardcover and paperback, and often in a fresh edition a few years later. (All are excellent and I highly recommend them!) Most have been translated into several other languages, but that wouldn’t show up here. The two manifestations each of A Second is a Hiccup and Two Homes are hardcover and paperback; xISBN groups them but thingISBN hasn’t seen both. Le carrousel is the French version of A Second is a Hiccup but it’s alone. For these books, xISBN definitely knows more.

The Flack/Wiese and McCloskey classics are odd because for my editions of The Story About Ping and Make Way for Ducklings, xISBN doesn’t group them with the dozens of other manifestations. If it did, you’d see higher numbers for it than for thingISBN, as is true for Blueberries for Sal. More xISBN oddness, or a grouping failure.

Upshot: In general xISBN knows more about children’s books than thingISBN. However, in some cases xISBN will fail to group your manifestation with others. As usual, group both sets of results together.

x+t   t   x  WC
  3   0   3   4   1550745549 A Child's Treasury of Nursery Rhymes (Denton)
  2   0   2   2   0416130127 The Christmas Boot (Denton)
  6   0   6   5   0744514401 Granny is a Darling (Denton)
  6   1   6   2   0753452243 In the Light of the Moon and Other Bedtime Stories (Denton and McBratney)
  1   0   1   1   0439974011 Le carrousel: Un poeme sur l'enfance (Denton and Hutchins)
  4   1   4   3   0439949033 A Second is a Hiccup: A Child's Book of Time (Denton and Hutchins)
  4   0   4   3   0439974003 A Second is a Hiccup: A Child's Book of Time (Denton and Hutchins)
  4   0   4   4   0744589258 Two Homes (Denton and Masurel)
  4   2   4   4   0763605115 Two Homes (Denton and Masurel)
 10  10   1   1   0140502416 The Story About Ping (Flack and Wiese)
 19   8  15  25   014050169X Blueberries for Sal (McCloskey)
 21  21   1   7   0140501711 Make Way for Ducklings (McCloskey)

A few points:

  • For best results, check both xISBN and thingISBN, and combine and de-dupe the results.
  • xISBN usually knows more, but sometimes gives back strange results.
  • For a possibly more expansive, though more resource-intensive, set of matching manifestations, form a new set of ISBNs by taking the first results from thingISBN and looking up each ISBN in turn at xISBN, and taking the first results from xISBN and looking up each ISBN in turn at thingISBN. That is, if thingISBN gives a result count of 4 ISBNs and xISBN gives 6, look up each of thing’s 4 at xISBN and each of x’s 6 at thing. Form a new set of all the ISBNs returned, and de-dupe. Perhaps thingISBN groups two ISBNs that xISBN has in two different clusters, or vice versa. This would get around the strange behaviour shown above where xISBN only returns 1 result for Make Way for Ducklings: you’d have 20 fresh ISBNs from thingISBN to use when re-searching xISBN.
  • xISBN draws on WorldCat, which is made up of data from libraries all over the United States, and many from elsewhere around the world. Libraries do buy a lot of books, and in lots of different editions, and they’ve been doing so for decades. WorldCat’s database is huge, and I wouldn’t underestimate its holdings of any kind of book, be it cheap paperback or expensive academic text.
  • On the other hand, LibraryThing’s results are damned impressive. I also wouldn’t underestimate its holdings.
  • Its children’s book numbers are low, however. The Kady MacDonald Denton results above make me suspect that it won’t do well on children’s books that have not yet become classics that adults buy for themselves. How many LibraryThing users who are parents catalogue their children’s picture books? And how do those ownership numbers compare to the number of picture books they borrow from the library?
  • What about pre-ISBN books? I may test some of them.
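
The cross-lookup idea in the third point can be sketched roughly like this. It is only a sketch: `cross_expand`, `thing_lookup` and `x_lookup` are hypothetical names of mine, the lookups are stubbed with lambdas instead of calling either service, and the single-letter identifiers are placeholders, not real ISBNs.

```ruby
require 'set'

# Expand the set of matching ISBNs by cross-querying two services.
# thing_lookup and x_lookup are hypothetical callables: each takes an
# ISBN string and returns an array of ISBN strings.
def cross_expand(isbn, thing_lookup, x_lookup)
  from_thing = thing_lookup.call(isbn)
  from_x     = x_lookup.call(isbn)

  # Start with the combined first results; a Set de-dupes for us.
  all = Set.new(from_thing + from_x)

  # Look up each of thing's first results at x, and each of x's at thing.
  from_thing.each { |i| all.merge(x_lookup.call(i)) }
  from_x.each     { |i| all.merge(thing_lookup.call(i)) }

  all.to_a.sort
end

# Stubbed data (placeholders): "thing" groups A, B and C together,
# while "x" thinks A stands alone but groups B with D and E.
thing_groups = { 'A' => %w[A B C], 'B' => %w[A B C], 'C' => %w[A B C] }
x_groups     = { 'A' => %w[A], 'B' => %w[B D E] }

thing_lookup = lambda { |i| thing_groups.fetch(i, [i]) }
x_lookup     = lambda { |i| x_groups.fetch(i, [i]) }

puts cross_expand('A', thing_lookup, x_lookup).join(' ')
```

With real services each extra lookup is another request, so this would eat into any daily query limit quickly.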

Comparing xISBN and thingISBN

Posted by: William Denton, 23 March 2007 7:51 am
Categories: Implementations,LibraryThing,OCLC

I whipped up a little Ruby script to compare results from LibraryThing’s thingISBN and OCLC’s xISBN. (Tim Spalding of LibraryThing does some comparisons in his announcement of thingISBN, which is where I linked. He’d even added an option to thingISBN so it would return xISBN results as well, but OCLC put the kibosh on that.)

#!/usr/local/bin/ruby

# Use thingISBN and xISBN and put their answers together to get
# the most ISBNs of other manifestations of a work, given the ISBN of
# one manifestation of said work. Eliminate duplicates.

# Change the ISBN to anything you want. 

# Richard Pevear's new translation of THE THREE MUSKETEERS by Alexandre Dumas.
isbn = '0670037796'

# Oxford Classics edition that WorldCat has only one other manifestation for
# isbn = '0192835750'

# Anthony Powell, BOOKS DO FURNISH A ROOM, Fontana pb
# isbn = '0006130879'

# Charles Willeford, THE BURNT ORANGE HERESY, Black Lizard
# isbn = '0887390250'

require 'net/http'

require 'rubygems'
require 'xmlsimple'

puts "Finding manifestations of #{isbn} ..."

# First, get data from thingISBN at LibraryThing

thingURL = "http://www.librarything.com/api/thingISBN/"

url = thingURL + isbn
xml_data = Net::HTTP.get_response(URI.parse(url)).body

data = XmlSimple.xml_in(xml_data)

thingISBNs = []

data['isbn'].each do |i|
  thingISBNs << i
  # puts "thingISBN: #{i}"
end

# Next, get data from xISBN at OCLC

xISBNURL = "http://xisbn.worldcat.org/webservices/xid/isbn/"

url = xISBNURL + isbn + "?method=getEditions&format=xml"
xml_data = Net::HTTP.get_response(URI.parse(url)).body

data = XmlSimple.xml_in(xml_data)

xISBNs = []

data['isbn'].each do |i|
  xISBNs << i
  # puts "xISBN: #{i}"
end

allISBNs = (thingISBNs + xISBNs).uniq

xNotThing = []
thingNotX = []

allISBNs.each do |isbn|
   xNotThing << isbn if xISBNs.include?(isbn) and not thingISBNs.include?(isbn)
   thingNotX << isbn if thingISBNs.include?(isbn) and not xISBNs.include?(isbn)
end

puts " Known to thingISBN: #{thingISBNs.size} (#{thingNotX.size} of which not known to xISBN)"
puts " Known to     xISBN: #{xISBNs.size} (#{xNotThing.size} of which not known to thingISBN)"

puts "              Total: #{allISBNs.size}"

# Print ISBNs known to LibraryThing but not xISBN.
# thingNotX.sort.each do |isbn|
#   puts isbn
# end

I ran it on that first ISBN, the new and reportedly excellent Pevear translation of Dumas’s The Three Musketeers, and got this:

Finding manifestations of 0670037796 ...
 Known to thingISBN: 109 (52 of which not known to xISBN)
 Known to     xISBN: 226 (169 of which not known to thingISBN)
              Total: 278

“I say, what’s this?” I ejaculated, because I had just finished a P.G. Wodehouse novel. I’d imagined that xISBN, emerging as it does from OCLC’s vast WorldCat, made up of catalogue information from libraries all over (mostly from the United States), would vastly outnumber thingISBN, which draws on the work groupings done by users of LibraryThing, which, granted, is globally popular. But for this manifestation of this work, thingISBN knew of 109 manifestations of the same work (including the one I’d specified, so that’s 108 others), and 52 weren’t known to xISBN. xISBN, in turn, knew of 226 manifestations, 169 of which weren’t known to thingISBN. 278 different manifestations are known between the two.

So xISBN does outnumber thingISBN, but thingISBN knows about 52 manifestations that xISBN doesn’t! Is it because they aren’t in WorldCat, or because the work-grouping algorithm didn’t catch them?

I didn’t check this for all of the 52, but I did try it on my Oxford Classics edition of The Three Musketeers:

Finding manifestations of 0192835750 ...
 Known to thingISBN: 109 (108 of which not known to xISBN)
 Known to     xISBN: 1 (0 of which not known to thingISBN)
              Total: 109

This shows that this manifestation is one of the ones thingISBN has grouped into the work of The Three Musketeers, but xISBN thinks it stands alone. However, when you look it up at WorldCat, you find it’s been grouped with a 1956 edition from Longman’s (look under the Editions tab). I’m not sure what’s going on here but it seems odd. I’d expect both OCLC sources to agree.

(Conversely, I didn’t check the 169 manifestations that xISBN knows about that thingISBN doesn’t, so I don’t know if they’re not in LibraryThing at all or if they are but haven’t been grouped.)

Anthony Powell’s Books Do Furnish a Room has been published in a number of editions, and thingISBN wins for my Fontana paperback:

Finding manifestations of 0006130879 ...
 Known to thingISBN: 8 (7 of which not known to xISBN)
 Known to     xISBN: 1 (0 of which not known to thingISBN)
              Total: 8

Looking for other manifestations of Charles Willeford‘s The Burnt Orange Heresy (an excellent noir crime novel about modern art — Willeford’s one of the great American writers of the twentieth century) gave me the sorts of results I’d expected in the first place, with thingISBN’s result being a proper subset of xISBN’s. This is for my Black Lizard edition:

Finding manifestations of 0887390250 ...
 Known to thingISBN: 3 (0 of which not known to xISBN)
 Known to     xISBN: 7 (4 of which not known to thingISBN)
              Total: 7
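
The four runs above show three different relationships between the two result sets: overlapping (the Pevear Three Musketeers), xISBN’s inside thingISBN’s (the Oxford Classics and the Powell), and thingISBN’s inside xISBN’s (the Willeford). Ruby’s Set class can report which case you have. This is just a sketch: the `relation` method and its symbols are names I made up, and the letters below are placeholders standing in for ISBNs.

```ruby
require 'set'

# Describe how two lists of ISBNs relate to each other.
def relation(thing_isbns, x_isbns)
  t = Set.new(thing_isbns)
  x = Set.new(x_isbns)

  if t == x
    :identical
  elsif t.proper_subset?(x)
    :thing_inside_x      # everything thing knows, x knows too
  elsif x.proper_subset?(t)
    :x_inside_thing      # everything x knows, thing knows too
  else
    :overlapping         # each knows some the other doesn't
  end
end

# Placeholder data shaped like the Willeford run: 3 from thing, 7 from x.
puts relation(%w[a b c], %w[a b c d e f g])
```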

Upshot: If you have an ISBN in hand and want to find ISBNs of other manifestations of the same work, use both thingISBN and xISBN.


All thingISBN data available in one huge file

Posted by: William Denton, 16 March 2007 7:45 am
Categories: LibraryThing

LibraryThing has done something very useful and generous. As you know, Bob, their thingISBN service is akin to OCLC’s xISBN: give it the ISBN of a book and it will return a list of ISBNs of other editions of the same book. In FRBR terms, we say: give it the ISBN of a manifestation and it will give you a list of ISBNs of manifestations of the same work.

Tim Spalding says in thingISBN Data in One File:

APIs, while nifty, can be a pain. Both thingISBN and xISBN have a 1,000-per-day limit. So, starting today, thingISBN is also available in feed format—one giant XML file with all the data from over two million unique ISBNs.

He doesn’t give a direct link to the full file, probably because it’s 16 MB in size and search engine crawlers would go at it relentlessly, but all you have to do is look for where he gives the URL in plain text. He even gives some sample SQL to help get the stuff into a database. Download it, fool around with it, find new uses for it! Good for LibraryThing.

