Do any of you know of a research project that actually tested the effectiveness of work record identification algorithms such as the FRBR Work Set Algorithm? That is, how effective is it in actually finding all of the records representing a work? What percentage of records does it identify, and what percentage does it not find? I'm writing an article and want to know for sure if anyone has done this kind of research.
For some relaxing weekend listening, you might be interested in the first in a new podcast series, Librarypages. In the first one, Joan Wilton talks with Prof. Shawne Miksa about cataloguing (9.4 MB MP3, 20 minutes long). (Here’s Miksa’s home page.)
They don’t mention FRBR, but they do discuss Anglo-American Cataloguing Rules and the upcoming FRBR-influenced Resource Description and Access. If you’re a cataloguer or the kind of person who debates MARC records with your friends, there won’t be much new here, but it’s still a good conversation. If you’re from outside the library field it’ll give you some background on cataloguing and give you a sense of what librarians talk about.
Karen Coyle posted FRBR-izing on her blog on Wednesday. Give it a read, then come back and I’ll comment on a few things. I think she underestimates the power of what a fully FRBR-aware catalogue will look like and what will be required to build it.
Then I hear about people FRBR-izing their catalogs, and I have to say that I can find nothing in the FRBR analysis that would support or encourage that activity. FRBR is not about catalogs, it’s not even about creating cataloging records, and it definitely does not advocate the clustering of works for user displays. I’m not sure where FRBR-izing came from, but it definitely didn’t come from FRBR. FRBR defines something called the Work, but does not tell you what to do with it. In addition, the Work is not a new idea (see section 25.2 of AACR2 where it describes the use of Uniform Titles).
I don’t understand the statement that FRBR isn’t about catalogues. It’s Functional Requirements for Bibliographic Records, and when bibliographic records are shown to users, that’s called a catalogue. As to creating catalogue records, the FRBR Final Report says in section 2.1, “Objectives of the Study:”
The study has two primary objectives. The first is to provide a clearly defined, structured framework for relating the data that are recorded in bibliographic records to the needs of the users of those records. The second objective is to recommend a basic level of functionality for records created by national bibliographic agencies.
To say “it’s not even about creating bibliographic records” doesn’t match with that first objective. When bibliographic records are created, they need to fulfill that framework and meet those user needs (the four tasks of find, identify, select, and obtain). Current cataloguing rules do that partially, but not completely. FRBR is about creating bibliographic records that use the full entity-relationship model and allow the four user tasks.
The idea of a “work” isn’t new, no. It goes back to the Paris Principles, Seymour Lubetzky, and before him to a 1936 paper by Julia Pettee (“The Development of Authorship Entry and the Formulation of Authorship Rules as Found in the Anglo-American Code,” which can be found in Foundations of Cataloging: A Sourcebook, edited by Michael Carpenter and Elaine Svenonius (Littleton, CO: Libraries Unlimited, 1985)).
FRBR does say what we’re supposed to do with a work: we’re supposed to relate it to its expressions and any Group 2 or 3 entities, and to allow users to perform any of the four tasks on it. That’s the start of full FRBRization, by which I mean identifying all of the entities involved and all of their relationships, and then making all that open to the four tasks. (A partial FRBRization is the second objective of the Final Report, the “basic level of functionality for records created by national bibliographic agencies.”)
I think that those of us in the systems design arena have confused FRBR, or perhaps co-opted it, to solve two pressing problems of our own: 1) the need to provide a better user interface to the minority of prolific works, that is, the Shakespeare’s and the oft-translated works; 2) and the need to manage works that appear in many physical formats, such as a printed journal and the microform copy of that journal, or an article that is available in both HTML and PDF. We can find elements of FRBR that help us communicate about these issues; we can talk about Works (in the FRBR sense) and Manifestations. But solving these problems is not a FRBR-ization of the catalog.
Those are two of the reasons FRBR was made. Doing those isn’t co-opting FRBR, it’s using it the way it was meant. Neither is a full FRBRization, but the more use of FRBR, the better for the users. The Final Report doesn’t say what an interface should look like; for that, have a look at something like IFLA’s Guidelines for Online Public Access Catalogue (OPAC) Displays:
5.2 Provide for the option of displaying records in an order consistent with the FRBR model (see Example 4)
In a catalogue where the FRBR model is implemented, the result of a search could consist of bibliographic records representing bibliographic entities of different levels (works, expressions, manifestations, items).
In that case the display of multiple brief bibliographic records should consist only of entities at the same level. The level should correspond to the level of attributes given in the query. Tools that enable navigation between corresponding bibliographic entities of different levels have to be provided (e.g., from a work to all expressions of the work, etc.)
I’m late on reporting this, but I just noticed it on the weekend. If you read the letters section in Library Resources and Technical Services, you already know about this, but if you don’t, you may not.
A year ago, in October 2005, Ed Jones had a paper called “The FRBR Model as Applied to Continuing Resources” in Library Resources and Technical Services (49: 4). I mentioned it last December.
Barbara Tillett wrote a letter to the editor about it, and the letter was published in the July 2006 LRTS (50: 3). (People call it “Lurts.”) It’s four pages long, and begins with this:
In November I received the October 2005 (49, no. 4) issue of LRTS, and after reading your glowing remarks about the editorial board and how this is a carefully refereed journal, I launched into the article by Ed Jones about the Functional Requirements for Bibliographic Records (FRBR) titled “The FRBR Model As Applied to Continuing Resources” (p. 227-42). I was disappointed to find so many errors and information presented in a misleading fashion could have so well been addressed through editorial review working with the author. Mr. Jones has an excellent message about our being at a great time of opportunity, and he points to the inconsistencies and varying practices that have evolved over the years for continuing resources through our cataloging rules, rule interpretation, and practices and the MARC format. Using FRBR for such analysis is precisely what that conceptual model is for. I just wish the statements had been clearer about what is really Mr. Jones’ opinion and what FRBR states.
Jones gives a one-page response that begins:
When Peggy Johnson notified me that LRTS had received a letter from Barbara Tillett relating to my paper, my first reaction was excitement that the paper had attracted the attention of so distinguished a colleague. However, after learning of the length of the letter and that I would have a chance to respond, I suspected my reaction might be premature. The reader will probably have reached the same conclusion.
The thing to remember is that if you read Jones’s original article, you should follow up and read Tillett’s letter and Jones’s response.
Allyson Carlyle, a professor at the Information School at the University of Washington, has a paper called Understanding FRBR as a Conceptual Model: FRBR and the Bibliographic Universe in the October 2006 issue of Library Resources and Technical Services (50: 4). (The link is to a web version of the paper; the illustrations are at the bottom.) I recommend it.
The blurb about the author says the paper is based on a 2004 talk (which was mentioned here in June 2005). Carlyle says, “It is the outcome of six years of teaching the FRBR model, answering student questions about it, and explaining it to inquisitive faculty who return from conferences at which it is mentioned.”
[Abstract:] Functional Requirements for Bibliographic Records (FRBR) presents a complex conceptual model. Because of this, it is not easy for everyone to understand. The purpose of this paper is to make some of the more difficult aspects of the FRBR model, in particular the Group 1 entities work, expression, manifestation, and item, easier to understand by placing FRBR in the context of what it is: a conceptual entity-relationship model. To this end, a definition of the term “model” is presented, a variety of types and functions of models are introduced, conceptual models are discussed in detail, modeling an abstraction is explained, and different ways of interpreting FRBR are suggested. Various models used in the history of cataloging are introduced to place FRBR in the context of the historical development of document models.
On Thursday Thom Hickey (of OCLC) commented on how the Melvyl Recommender Project handles FRBRization, which I mentioned in this post. The Melvyl people use scoring to decide when two things are really the same work (so many points for matching titles exactly, so many for matching authors exactly, etc.) but Hickey recommends using a decision table. He does one to describe how the Melvyl thing works:
Titles   E  E  E  -  P  P  P
Authors  P  E  -  E  E  P  P
Idents   -  -  E  E  -  -  P
Dates    P  -  E  E  P  E  -
Here’s how to use the table. For each of the rows, you decide whether the records have an Exact, Partial, or no match. These are ordered, so a P in the table means that value has to be at least a partial match. The first column then says that if you have an Exact title match, and at least Partial author and date match, then your records match. The hyphen in the Idents row means that for this column it doesn’t matter how well the identifiers match. The last column shows that partial matches on all but dates result in a match, whether dates match or not. In order to match two records they have to satisfy at least one column.
Interesting and useful.
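Here is a minimal sketch of how a decision table like Hickey's could be applied in code. It is only an illustration in Python, not anything from Hickey or the Melvyl project: the names `LEVELS`, `ROWS`, `COLUMNS` and `records_match` are made up for the example, and it assumes that something else has already decided whether each field is an Exact, Partial, or no match for a pair of records.

```python
# Match levels are ordered: no match < partial < exact.
LEVELS = {"-": 0, "P": 1, "E": 2}

# The decision table, one string per row, one character per column.
# (Row names are hypothetical labels for the four rows of the table.)
ROWS = {
    "titles":  "EEE-PPP",
    "authors": "PE-EEPP",
    "idents":  "--EE--P",
    "dates":   "P-EEPE-",
}

# Turn the rows into columns: each column is one sufficient set of conditions.
COLUMNS = list(zip(*ROWS.values()))


def records_match(observed):
    """Return True if the observed match levels satisfy at least one column.

    `observed` maps each row name to "E", "P" or "-" for a pair of records.
    A table entry is a minimum, so an observed "E" also satisfies a "P".
    """
    levels = [LEVELS[observed[field]] for field in ROWS]
    return any(
        all(have >= LEVELS[need] for have, need in zip(levels, column))
        for column in COLUMNS
    )


# Exact title with partial author and date: satisfies the first column.
print(records_match({"titles": "E", "authors": "P", "idents": "-", "dates": "P"}))  # True
# Exact title and nothing else: satisfies no column.
print(records_match({"titles": "E", "authors": "-", "idents": "-", "dates": "-"}))  # False
```

Keeping the table as data means the matching rules can be changed by editing one column, without touching the function.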
Kent Fitch of the National Library of Australia dropped me some e-mail about a very interesting project he’s doing: Searching Bibliographic Records, a test of using Lucene, the free search engine. Some FRBRizing is done, so you’ll want to go have a look.
They say on the home page:
The current Libraries Australia database contains many “duplicates”: records not merged due to subtle differences in metadata which are often inconsequential or errors. Many people also think it would be a good idea to combine various editions of works in the search results interface, although how far this combining should go is debatable. Should it be the equivalent of an FRBR work, or of an FRBR expression? Should it include works across languages and material types?
… What we’re trying to achieve is a set of groupings most likely to be useful to a searcher wanting to find a resource. The searcher probably has very strong preferences for the form and language of the resource they’re seeking, which is why they’re our top two layers/groupings. After that, they may have a preference for a particular edition or, less likely but possibly, even a particular manifestation (publisher, publication year, place of publication).
Of course, they don’t actually care about the bibliographic record; they want to get their hands on the resource, so we have to think about how they can easily tell the system to:
- Locate any edition I can get today for free
- Locate any edition published after 1960 I can get today for free
- Locate either of these two editions I can get cheapest and soonest
- Locate any French edition available for electronic access…
Whenever I think of Australian literature I think of Sean McMullen, whose great novel Souls in the Great Machine is set a thousand years in the future, in an Australia where electricity cannot be used and librarians settle fights with shotgun duels. That link will take you to the basic display of the book, but notice the “This title can be viewed as part of an experimental FRBR group” link. That takes you here: FRBRized view of Souls in the Great Machine.
A better example is the FRBRized view of Harry Potter and the Prisoner of Azkaban by J.K. Rowling, which has lots of translations and is much juicier FRBRarily but, certainly, less Australian.
Go have a look yourself: poke around, try a search, see what results they show. Kent Fitch is interested in hearing your comments.
Ian Strang, a librarian not far from the FRBR Blog central office, has an interesting post on his blog: FRBRising with the Folks. He was thinking about the failures of FRBRizing by algorithms and automated processes:
The problem that I just don’t see them getting around is that often the “work” is simply not represented in the traditional bibliographic record, not even as a combination of elements. If this is the case no amount of processing by computer or librarian will be able to accurately and consistently identify and group “works”. What the FRBRisation process needs is just a little added information about each record. This seems like a perfect task for a social bookmarking application.
I’ve been thinking along the same lines as he was: that Amazon’s Mechanical Turk would be a good way of doing this. Strang found something interesting, though:
Interestingly Amazon developed the Mechanical Turk initially for internal use, to do much the same thing as I’m suggesting. Amazon had a problem with duplicate records. They realized that many products were virtually the same and could be sold/inventoried as a single product but were in their database as two items. It was far too large a problem to give to one person or even a group of people so they created a task marketplace, what would evolve into the Mechanical Turk. A program would identify two similar records and then submit them to the marketplace as a task. All the Amazon employee had to do to earn a few extra bucks was glance at each record and answer yes or no to the program. If the answer was yes the records were merged, if no the program moved on. All I’m suggesting is that something like the “work set” algorithm replace the Amazon program. Sure it would cost, but looking at how things are priced, not as much as one might think.
I haven’t mentioned the Melvyl Recommender Project at the California Digital Library before. They recently put out Full Text Extension Supplementary Report (652 KB PDF) and it has a very interesting section on FRBR. Check it out.
Using Lucene, they indexed a bunch of bibliographic records. Then they used an algorithm of their own devising to group manifestations together into works. (They found OCLC’s Work-Set Algorithm too restrictive.)
Title
  All titles match exactly                                        +100
  All titles match after subtitles are removed                     +80
  (as above) One list is a (nonempty) subset of the other          +80
  No match                                                        -100
Author
  All authors match exactly                                       +100
  Keyword match (all words in shorter author match longer
    author), for all authors in list                               +80
  One list is a (nonempty) subset of the other                     +80
  No match                                                        -100
Date
  Exact match                                                      +50
  +/- two years                                                    -25
  No match                                                         -50
Identifier
  All identifiers match exactly                                   +100
  One list is a (nonempty) subset of the other                     +80
  No match                                                           0
Total score
  Minimum threshold for match                                      150
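To make the arithmetic concrete, here is a minimal sketch of the scoring in Python. It is my own illustration, not the project's code: the outcome labels (`"exact"`, `"subset"` and so on) and the names `SCORES`, `THRESHOLD`, `total_score` and `same_work` are hypothetical, it assumes the comparison of each field has already been done, and it treats a total equal to the threshold as a match.

```python
# Points per field and outcome, copied from the table as printed above.
# The outcome labels are hypothetical names for the table's rows.
SCORES = {
    "title": {
        "exact": 100,
        "exact_without_subtitles": 80,
        "subset": 80,
        "none": -100,
    },
    "author": {"exact": 100, "keyword": 80, "subset": 80, "none": -100},
    "date": {"exact": 50, "within_two_years": -25, "none": -50},
    "identifier": {"exact": 100, "subset": 80, "none": 0},
}

# Minimum total for two records to be grouped into the same work.
THRESHOLD = 150


def total_score(outcomes):
    """Add up the points for one outcome per field."""
    return sum(SCORES[field][outcome] for field, outcome in outcomes.items())


def same_work(outcomes):
    """Assumption: a total equal to the threshold counts as a match."""
    return total_score(outcomes) >= THRESHOLD


pair = {"title": "exact", "author": "exact", "date": "none", "identifier": "none"}
print(total_score(pair), same_work(pair))  # 150 True
```

With these numbers, exact title and author matches are just enough on their own, even when the dates disagree and there are no identifiers in common.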
They end by saying (and I think they mean “expression” where they say “item”):
The current algorithm attempts to form groups that represent a single FRBR “work.” It would be interesting to pursue a two level decomposition of the records into “work” and “items” within each work. It is unknown whether the metadata would support such a decomposition.
Certainly further tuning of the matching algorithm would be necessary and desirable if it were to be used in a production system. Inevitably there will be many corner cases that result in poor groupings using the current simplistic algorithm. Additionally, the creators of Melvyl developed a separate, somewhat different algorithm for grouping serials (as opposed to monographs), and it seems likely we would discover the need for this as well.
Though our project was able to obtain Library of Congress authority files, which are generally considered a necessary step in FRBRization, we ran out of time to integrate them into our FRBR process. Certainly this should be considered as a likely way to improve grouping for that fraction of documents that match the authority files (a baby step would be to quantify that fraction.)