Yesterday, in the sixth entry in this series, Bad MARC Data, I left you at this thrilling error:
Transforming the MARCXML into FRBR XML and saving to pp.xml ...
Error on line 15908 column 46 of file:///usr/home/wtd/frbr-lc-tool/tmp/slimfrbr.xml:
Error reported by XML parser: Character reference "" is an invalid XML character.
As it turned out, the character references &#30; and &#31; both caused problems, and I got rid of them both by the old Perl technique of editing files in place:
perl -pi.bak -e 's/\&#(30|31);//g' slimfrbr.xml
Now I’m editing generated files mid-process, which is bad. Tough.
The next step in the process now worked:
java -jar saxon7.jar -u -o clean.xml slimfrbr.xml \
http://www.loc.gov/standards/marcxml/frbr/v2/clean.xsl
But then this didn’t:
java -jar saxon7.jar -u -o match.xml clean.xml \
http://www.loc.gov/standards/marcxml/frbr/v2/match.xsl
It generated a big ugly stack trace. I switched to a newer Saxon (version 8.9). Why it worked when the older version didn’t, I don’t know, nor, at this point, did I particularly care.
saxon -u -o match.xml clean.xml \
http://www.loc.gov/standards/marcxml/frbr/v2/match.xsl
It complained: Running an XSLT 1.0 stylesheet with an XSLT 2.0 processor. But it worked.
But then the final FRBRization XSL failed!
$ saxon -u -o pp.xml match.xml \
http://www.loc.gov/standards/marcxml/frbr/v2/FRBRize.xsl
Validation error on line 18 of http://www.loc.gov/standards/marcxml/frbr/v2/FRBRize.xsl:
Cannot convert string " " to a double
Transformation failed: Run-time errors were reported
What does line 18 of that XSL file say?
<xsl:sort
select="normalize-space(translate(substring(marc:datafield[@tag=130
or @tag=240 or @tag=243 or
@tag=245][1]/marc:subfield[@code='a'][1],marc:datafield[@tag=130]/@ind1 |
marc:datafield[@tag=240 or
@tag=243 or @tag=245][1]/@ind2),
'abcdefghijklmnopqrstuvwxyz,.;/-:[]()','ABCDEFGHIJKLMNOPQRSTUVWXYZ'))"/>
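The XSL uses an indicator as the starting position for substring(): the first indicator of a 130 field, or the second indicator of a 240, 243, or 245 field. It is expecting to find a number there, but in a few of the 245 Title Statement fields it’s seeing a space and it gets confused.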
$ grep 245 match.xml | grep '" "'
<datafield tag="245" ind1=" " ind2=" ">
<datafield tag="245" ind1="1" ind2=" ">
<datafield tag="245" ind1=" " ind2="0">
<datafield tag="245" ind1=" " ind2=" ">
<datafield tag="245" ind1=" " ind2=" ">
<datafield tag="245" ind1=" " ind2=" ">
<datafield tag="245" ind1=" " ind2=" ">
<datafield tag="245" ind1=" " ind2=" ">
<datafield tag="245" ind1=" " ind2=" ">
The first indicator says whether or not there should be a title added entry, 0 for no, 1 for yes. The second indicator tells how many nonfiling characters there are at the start of the title (for The Three Musketeers it would be 4, so the title sorts under Three, not The).
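To make the nonfiling count concrete, here is a minimal shell sketch. The helper name filing_title is hypothetical and is not part of the LC tool; it only drops the given number of leading characters, which is the effect the second indicator is meant to have on sorting.

```shell
# filing_title SKIP TITLE: print TITLE without its first SKIP characters.
# Hypothetical helper for illustration only.
filing_title() {
    # cut counts characters from 1, so skipping N means starting at N + 1.
    printf '%s\n' "$2" | cut -c "$(( $1 + 1 ))"-
}

filing_title 4 "The Three Musketeers"   # prints: Three Musketeers
filing_title 0 "Pride and Prejudice"    # prints: Pride and Prejudice
```

A count of 0 leaves the title alone, which is why 0 is a harmless value to fall back on when the indicator is blank.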
The second indicator might matter for how the FRBRizing algorithm works, but I didn’t care. I just set all these bad indicators to 0 with Perl again. This is the second time I edited generated files mid-process, but I was so close to the end nothing could restrain me now.
perl -npi.bak -e 'next unless /tag="245/; s/" "/"0"/g;' match.xml
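That one-liner zeroes every blank indicator on a 245 line, first and second alike. A narrower variant, sketched here on a made-up file called sample-match.xml, touches only ind2, the indicator the sort expression reads for a 245. It assumes each datafield start tag sits on one line with the attributes written as in the grep output; I did not run this against the real match.xml.

```shell
# Hypothetical sample lines in the same shape as the grep output.
cat > sample-match.xml <<'EOF'
<datafield tag="245" ind1=" " ind2=" ">
<datafield tag="245" ind1="1" ind2=" ">
<datafield tag="100" ind1="1" ind2=" ">
EOF

# On 245 lines only, replace a blank ind2 with 0; everything else is
# printed unchanged. The original is kept as sample-match.xml.bak.
perl -pi.bak -e 's/ind2=" "/ind2="0"/ if /tag="245"/' sample-match.xml

cat sample-match.xml
```

The 100 field keeps its blank second indicator, and a blank first indicator on a 245 is left as it was.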
And then this step worked:
saxon -u -o pp.xml match.xml \
http://www.loc.gov/standards/marcxml/frbr/v2/FRBRize.xsl
And the final step worked:
saxon -a -o pp.html pp.xml
Phew! You can see the results of FRBRizing a superduped Pride and Prejudice here.
Tomorrow: some comments. Have a look at the FRBRized results and give them a think, and leave a comment below or tomorrow.