A weblog following developments around the world in FRBR: Functional Requirements for Bibliographic Records.

Maintained by William Denton, Web Librarian at York University. Suggestions and comments welcome at wtd@pobox.com.


Confused? Try What Is FRBR? (2.8 MB PDF) by Barbara Tillett, or Jenn Riley's introduction. For more, see the basic reading list.

Books: FRBR: A Guide for the Perplexed by Robert Maxwell (ISBN 9780838909508) and Understanding FRBR: What It Is and How It Will Affect Our Retrieval Tools edited by Arlene Taylor (ISBN 9781591585091) (read my chapter FRBR and the History of Cataloging).

Pride and Prejudice 7.5: Harry Potter and the Bride of Pemberley

Posted by: William Denton, 17 May 2007 10:34 am
Categories: Pride and Prejudice

Last night I realized there was perhaps another way of running MARC records through the LC FRBR Display Tool, and I tried it. It’s easier, cleaner, and involves less character-obliterating. I’ll post about it next week, but in the meantime, here’s a FRBRization of all the Harry Potter books by J.K. Rowling. I superduped the ISBNs of my copies of the books, found MARC records for 404 different manifestations, ran them through the tool, and ended up with this XML file. The results are good. Even some of the American Harry Potter and the Sorcerer’s Stone expressions and manifestations are grouped in with Harry Potter and the Philosopher’s Stone.

The question marks (“?”) you’ll see are my replacements for problem characters I had to clear out to get things to work. I didn’t wipe any MARC fields.


Pride and Prejudice 7: Triumph!

Posted by: William Denton, 7:24 am
Categories: Pride and Prejudice

Yesterday, in the sixth entry in this series, Bad MARC Data, I left you at this thrilling error:

Transforming the MARCXML into FRBR XML and saving to pp.xml ...
Error on line 15908 column 46 of file:///usr/home/wtd/frbr-lc-tool/tmp/slimfrbr.xml:
  Error reported by XML parser: Character reference "&#31" is an invalid XML character.

As it turned out, &#30 caused problems too, and I got rid of them both by the old Perl technique of editing files in place:

perl -pi.bak -e 's/\&#(30|31)//g' slimfrbr.xml

Now I’m editing generated files mid-process, which is bad. Tough.
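
For anyone repeating this, here is a minimal sketch of a slightly more careful cleanup. It assumes the references in the file are the complete forms &#30; and &#31; (the parser message shows no semicolon, so that is a guess), and the function names list_char_refs and strip_marc_delims are made up for illustration. The first helper shows which numeric character references are present; the second strips only those two.

```shell
# Sketch only: helper names are hypothetical, not part of the LC tool.

list_char_refs() {
    # Print each distinct decimal character reference in file $1 with a count.
    # grep -o puts every match on its own line; uniq -c counts the repeats.
    grep -oE '&#[0-9]+;' "$1" | sort | uniq -c
}

strip_marc_delims() {
    # Filter: reads stdin, writes stdout.
    # Matching the trailing semicolon means a reference such as &#300;
    # is left alone and no stray ";" remains after the removal.
    sed -E 's/&#(30|31);//g'
}
```

Used as a filter it would look like: strip_marc_delims < slimfrbr.xml > slimfrbr-stripped.xml (the output file name is only an example). Writing to a new file keeps the generated original around for comparison.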

The next step in the process now worked:

java -jar saxon7.jar -u -o clean.xml slimfrbr.xml \
  http://www.loc.gov/standards/marcxml/frbr/v2/clean.xsl

But then this didn’t:

java -jar saxon7.jar -u -o match.xml clean.xml \
  http://www.loc.gov/standards/marcxml/frbr/v2/match.xsl

It generated a big ugly stack trace. I switched to a newer Saxon (version 8.9). Why it works and the older version doesn’t, I don’t know, and at that point I didn’t particularly care.

saxon -u -o match.xml clean.xml \
 http://www.loc.gov/standards/marcxml/frbr/v2/match.xsl

It complained: Running an XSLT 1.0 stylesheet with an XSLT 2.0 processor. But it worked.

But then the final FRBRization XSL failed!

$ saxon -u -o pp.xml match.xml \
 http://www.loc.gov/standards/marcxml/frbr/v2/FRBRize.xsl
Validation error on line 18 of http://www.loc.gov/standards/marcxml/frbr/v2/FRBRize.xsl:
  Cannot convert string " " to a double
Transformation failed: Run-time errors were reported

What does line 18 of that XSL file say?

<xsl:sort
 select="normalize-space(translate(substring(marc:datafield[@tag=130
 or @tag=240 or @tag=243 or
 @tag=245][1]/marc:subfield[@code='a'][1],marc:datafield[@tag=130]/@ind1 |
 marc:datafield[@tag=240 or
 @tag=243 or @tag=245][1]/@ind2),
 'abcdefghijklmnopqrstuvwxyz,.;/-:[]()','ABCDEFGHIJKLMNOPQRSTUVWXYZ'))"/>

The XSL expects a number for the nonfiling-character count, which it takes from the first indicator of a 130 field or the second indicator of a 240, 243, or 245 field. In a few 245 Title Statement fields it’s seeing a space instead, and it gets confused.

$ grep 245 match.xml | grep '" "'
      <datafield tag="245" ind1=" " ind2=" ">
      <datafield tag="245" ind1="1" ind2=" ">
      <datafield tag="245" ind1=" " ind2="0">
      <datafield tag="245" ind1=" " ind2=" ">
      <datafield tag="245" ind1=" " ind2=" ">
      <datafield tag="245" ind1=" " ind2=" ">
      <datafield tag="245" ind1=" " ind2=" ">
      <datafield tag="245" ind1=" " ind2=" ">
      <datafield tag="245" ind1=" " ind2=" ">

The first indicator says whether or not there should be a title added entry, 0 for no, 1 for yes. The second indicator tells how many nonfiling characters there are at the start of the title (for The Three Musketeers it would be 4, so the title sorts under Three, not The).
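
To make the nonfiling-character idea concrete, here is a rough sketch of the kind of sort key that xsl:sort line builds: skip the nonfiling characters, drop the same punctuation the translate() call removes, uppercase, and tidy the spaces. It is not a port of the stylesheet, and the function name sort_key is invented for the example.

```shell
# Sketch only: sort_key is a hypothetical helper, not part of the LC tool.
# $1 = title (245 subfield a), $2 = nonfiling character count (default 0).
sort_key() {
    printf '%s\n' "$1" |
        # Drop the first $2 characters, then uppercase what is left.
        awk -v skip="${2:-0}" '{ print toupper(substr($0, skip + 1)) }' |
        # Remove , . ; / - : [ ] ( ) and then collapse and trim spaces.
        sed -e 's|[],.;/:[()-]||g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
}

sort_key "The Three Musketeers" 4   # prints THREE MUSKETEERS
sort_key "Pride and prejudice /" 0  # prints PRIDE AND PREJUDICE
```

One difference worth noting: awk quietly treats a blank count as 0, which is the same outcome my fix forces by hand, whereas Saxon 8.9 refused to convert the blank at all.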

The second indicator might matter for how the FRBRizing algorithm works, but I didn’t care. I just set all these bad indicators to 0 with Perl again. This is the second time I edited generated files mid-process, but I was so close to the end nothing could restrain me now.

perl -npi.bak -e 'next unless /tag="245/; s/" "/"0"/g;' match.xml
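
The one-liner above rewrites every blank indicator on a 245 line, first or second. A narrower variant is sketched below, assuming each datafield start tag sits on a single line with its attributes in the order the grep output shows. It touches only the second indicator and leaves the first alone. The name fix_245_ind2 is made up for the example.

```shell
# Sketch only: fix_245_ind2 is a hypothetical helper name.
fix_245_ind2() {
    # Filter: reads MARCXML on stdin, writes stdout.
    # Only on lines that open a 245 datafield, turn a blank second
    # indicator into "0". Every other line passes through unchanged.
    sed '/<datafield tag="245"/ s/ind2=" "/ind2="0"/'
}
```

It would be run as: fix_245_ind2 < match.xml > match-fixed.xml (again, the output name is just an example), with the FRBRize.xsl step then reading the fixed file.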

And then this step worked:

saxon -u -o pp.xml match.xml \
 http://www.loc.gov/standards/marcxml/frbr/v2/FRBRize.xsl

And the final step worked:

saxon -a -o pp.html pp.xml

Phew! You can see the results of FRBRizing a superduped Pride and Prejudice here.
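
Pulled together, the sequence that finally worked can be sketched as one script. The commands are the ones from this post, starting from slimfrbr.xml. The run helper and the DRY_RUN switch are my own scaffolding for illustration, and by default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only: run() and DRY_RUN are hypothetical scaffolding.
XSL=http://www.loc.gov/standards/marcxml/frbr/v2

run() {
    # Show the command (shell quoting is not reproduced in the display),
    # and execute it only when DRY_RUN is set to something other than 1.
    printf '%s\n' "$*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

frbrize() {
    # Each step runs only if the previous one succeeded.
    run perl -pi.bak -e 's/\&#(30|31)//g' slimfrbr.xml &&
    run java -jar saxon7.jar -u -o clean.xml slimfrbr.xml "$XSL/clean.xsl" &&
    run saxon -u -o match.xml clean.xml "$XSL/match.xsl" &&
    run perl -npi.bak -e 'next unless /tag="245/; s/" "/"0"/g;' match.xml &&
    run saxon -u -o pp.xml match.xml "$XSL/FRBRize.xsl" &&
    run saxon -a -o pp.html pp.xml
}
```

Calling frbrize prints the six steps in order. Setting DRY_RUN=0 first would actually run them, which assumes saxon7.jar is in the current directory and a saxon command for version 8.9 is on the path.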

Tomorrow: some comments. Have a look at the FRBRized results and give them a think, and leave a comment below or on tomorrow’s post.