Here’s a simple example of superduping working well. We’ll start with an item in my collection, my copy of the 2005 HarperCollins trade paperback manifestation of Flashman on the March, the latest in the series of novels by George MacDonald Fraser about the outrageously libidinous and cowardly scoundrel Harry Flashman. The ISBN is 0007201532. It’s a UK edition; I ordered it from over the pond because the release here was delayed by six months or so.
If we query thingISBN for 0007201532, we get back a cluster of five ISBNs:
000719739X 0007197403 0007201532 1400044758 1400096464
And if we query xISBN for 0007201532, we get back a singleton of just one ISBN, the one we asked about:
0007201532
xISBN’s one result is also in thingISBN’s results, so xISBN’s cluster is a proper subset (fully contained in and not equal to) of thingISBN’s. This doesn’t happen often, and it shows some kind of problem or lack of information in how xISBN does its clustering. Happily, we can use the human-generated thingISBN cluster to improve results.
Notice that if we combine and dedupe the results, we just end up with thingISBN’s cluster.
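A minimal sketch of that comparison, using Python sets. The variable names are my own, and the xISBN result is taken to be the queried ISBN itself, which is what the array counts later in the walk-through imply.

```python
# Result sets for 0007201532 as reported above (no web service is called here).
thing_isbn = {"000719739X", "0007197403", "0007201532", "1400044758", "1400096464"}
x_isbn = {"0007201532"}  # assumed: the singleton is the queried ISBN itself

# "<" on sets tests for a proper subset: contained in, and not equal to.
is_proper_subset = x_isbn < thing_isbn

# Combining and deduping is just a set union.
combined = sorted(thing_isbn | x_isbn)

print(is_proper_subset)  # True
print(len(combined))     # 5
```

The union is the same five ISBNs thingISBN gave us, so plain combining adds nothing in this case.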
Here’s how we’ll superdupe it. I’ll show the output from my superduping script and explain it line by line. We start off with two arrays of ISBNs, ts and xs, which at the start are set equal to the result sets. (They are pronounced tees and exes, as in t-plural and x-plural.) Whenever an ISBN is in both arrays we’re going to remove it from both and add it to the superdupe array. If it isn’t in both, we’ll look it up at the other service.
The Super column is how many ISBNs are in the superdupe array when this iteration starts. Source is T if the ISBN is coming out of ts and X if it’s coming from xs. The ts and xs columns show how many ISBNs are left in each array.
Super  Source  ISBN        ts     xs
0      T       000719739X  5      1 + 1
Explanation: Start with superdupe empty, with 0 items. Start with the thingISBN numbers (T) and take the first ISBN from the sorted list: 000719739X. Right now there are five ISBNs in ts and one in xs, that is, our original unaltered result sets. Look up 000719739X at xISBN and get back one ISBN: 000719739X, the one we queried about. xISBN doesn’t have anything clustered with it; it’s another singleton. The + 1 means we add that ISBN to xs because now we have checked it at xISBN. Then, because that number is in both arrays (it was in ts to start with and we just added it to xs), delete it from both arrays and add it to superdupe. Now superdupe has one ISBN in it, as shown at the start of the next line, and ts has four and xs has one.
Super  Source  ISBN        ts     xs
1      T       0007197403  4      1 + 4
When this ISBN is looked up at xISBN, a cluster of four come back! They are:
0007197403 1400044758 1405611154 1405621028
All of these are pushed onto xs. The first two were in thingISBN’s initial result set (in fact, the first is the one we queried about), but the last two are new. This is the third cluster of xISBN results we’ve seen so far (two singletons plus this) and we are using thingISBN’s cluster to group them all together. That’s superduping! Remove 0007197403 from both arrays and push it onto superdupe, which now has two numbers.
Super  Source  ISBN        ts     xs
2      T       0007201532
3      T       1400044758
4      T       1400096464  1      2 + 1
Three more ISBNs pulled out of ts. The first two above are also in xs, so there’s no need to look them up at xISBN. They are deleted from ts and xs and pushed onto superdupe. (The counts for ts and xs aren’t there partly because of where things get printed in my script and partly because nothing interesting happens so I don’t bother reporting it.) The third line there, for 1400096464, shows that it’s the last ISBN in ts (the 1 under ts) and that we have two in xs, plus one added by querying xISBN about 1400096464 and getting another singleton back. Remove 1400096464 from both arrays and push it into superdupe.
Now comes the really interesting part as we run through the ISBNs remaining in xs. These are ISBNs that were part of xISBN result sets but that we have not yet seen or checked at thingISBN. Just as we checked thingISBN-generated ISBNs at xISBN, now we will check xISBN-generated numbers at thingISBN.
5      X       1405611154  0 + 0  2
6      X       1405621028  0 + 0  1
But we don’t find anything interesting. The + 0 shows that thingISBN doesn’t even know about these numbers, much less have any more clusters of numbers to give back. If it had, we’d have pushed those numbers onto ts and started over again, back and forth until all numbers have been checked everywhere.
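The whole back-and-forth can be sketched in a few lines of Python. This is not my actual script, just a minimal reconstruction of the steps described above. The two dictionaries are canned stand-ins for the web services, filled with the responses from this walk-through, and the names lookup and superdupe_isbns are hypothetical.

```python
# Canned responses copied from the walk-through; no network calls are made.
THING = {
    "0007201532": ["000719739X", "0007197403", "0007201532",
                   "1400044758", "1400096464"],
}
XISBN = {
    "0007201532": ["0007201532"],
    "000719739X": ["000719739X"],
    "0007197403": ["0007197403", "1400044758", "1405611154", "1405621028"],
    "1400096464": ["1400096464"],
}

def lookup(table, isbn):
    """Hypothetical stand-in for a service call; unknown ISBNs return []."""
    return table.get(isbn, [])

def superdupe_isbns(start):
    ts = set(lookup(THING, start))   # thingISBN's initial result set
    xs = set(lookup(XISBN, start))   # xISBN's initial result set
    superdupe = []
    while ts or xs:
        if ts:
            # Take the next ISBN from ts; check it at xISBN unless xs has it.
            isbn = min(ts)
            if isbn not in xs:
                found = lookup(XISBN, isbn)
                xs.update(i for i in found if i not in superdupe)
        else:
            # ts is empty: check a leftover xISBN number at thingISBN.
            isbn = min(xs)
            found = lookup(THING, isbn)
            ts.update(i for i in found if i not in superdupe)
        # The ISBN has now been seen by both services: move it to superdupe.
        ts.discard(isbn)
        xs.discard(isbn)
        superdupe.append(isbn)
    return superdupe

result = superdupe_isbns("0007201532")
print(len(result))  # 7
```

Skipping ISBNs that are already in superdupe when new results come back is my own addition; without it a real service that returns the whole cluster for every member would keep the loop going forever.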
Combining and deduping: 5
Superduping: 7 ISBNs
thingISBN: 5 at start; 3 calls; 0 ISBNs added; 2 unknown
xISBN: 1 at start; 4 calls; 6 ISBNs added; 0 unknown
Combining and deduping gave us five ISBNs but superduping gave us seven. That’s not a huge improvement, but I suspect this is all of the existing manifestations.
We made three calls to thingISBN and four to xISBN. Two ISBNs were unknown to thingISBN, and none were unknown to xISBN.
“How do I know all those ISBNs really represent manifestations of Flashman on the March?” I hear you cry. I checked, and they do. But that’s not always the case, as we’ll soon see.