Melvyl Recommender Project
I haven’t mentioned the Melvyl Recommender Project at the California Digital Library before. They recently put out the Full Text Extension Supplementary Report (652 KB PDF), which has a very interesting section on FRBR. Check it out.
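Using Lucene, they indexed a bunch of bibliographic records. Then they used an algorithm of their own devising to group manifestations together into works. (They found OCLC’s Work-Set Algorithm too restrictive.) It scores how well two records match on title, author, date, and identifier, and treats them as a match when the total reaches a minimum threshold: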
| Field | Condition | Score |
|---|---|---|
| Title | All titles match exactly | +100 |
| | All titles match after subtitles are removed, as above | +80 |
| | One list is a (nonempty) subset of the other | +80 |
| | No match | -100 |
| Author | All authors match exactly | +100 |
| | Keyword match (all words in shorter author match longer author), for all authors in list | +80 |
| | One list is a (nonempty) subset of the other | +80 |
| | No match | -100 |
| Date | Exact match | +50 |
| | +/- two years | -25 |
| | No match | -50 |
| Identifier | All identifiers match exactly | +100 |
| | One list is a (nonempty) subset of the other | +80 |
| | No match | 0 |
| Total score | Minimum threshold for match | 150 |
They end by saying (and I think they mean “expression” where they say “item”):
The current algorithm attempts to form groups that represent a single FRBR “work.” It would be interesting to pursue a two level decomposition of the records into “work” and “items” within each work. It is unknown whether the metadata would support such a decomposition.
Certainly further tuning of the matching algorithm would be necessary and desirable if it were to be used in a production system. Inevitably there will be many corner cases that result in poor groupings using the current simplistic algorithm. Additionally, the creators of Melvyl developed a separate, somewhat different algorithm for grouping serials (as opposed to monographs), and it seems likely we would discover the need for this as well.
Though our project was able to obtain Library of Congress authority files, which are generally considered a necessary step in FRBRization, we ran out of time to integrate them into our FRBR process. Certainly this should be considered as a likely way to improve grouping for that fraction of documents that match the authority files (a baby step would be to quantify that fraction.)