A weblog following developments around the world in FRBR: Functional Requirements for Bibliographic Records.

Maintained by William Denton, Web Librarian at York University. Suggestions and comments welcome at wtd@pobox.com.


Confused? Try What Is FRBR? (2.8 MB PDF) by Barbara Tillett, or Jenn Riley's introduction. For more, see the basic reading list.

Books: FRBR: A Guide for the Perplexed by Robert Maxwell (ISBN 9780838909508) and Understanding FRBR: What It Is and How It Will Affect Our Retrieval Tools edited by Arlene Taylor (ISBN 9781591585091) (read my chapter FRBR and the History of Cataloging).

16 May 2007

Pride and Prejudice 6: Bad MARC Data

Filed under: Pride and Prejudice — William Denton @ 7:00 am

Today we’re going to try to run the big MARC file through the Library of Congress’s FRBR tool. If that sentence is complete gibberish, look at the previous entry in this series and get caught up.

I have my marc2frbr.sh script, I have the MARC file, I have the LC stuff. When I ran the test with the Mahler MARC file the LC provided, everything went perfectly fine. Not an error in sight. Now I’ve got hundreds of MARC records I downloaded from hither and possibly yon. Perhaps one or two of them will cause a problem?

$ ./marc2frbr.sh pride-and-prejudice.marc pp
Transforming pride-and-prejudice.marc to MARCXML ...
** Error: Invalid directory length
   Record Number: 4240822
   Character: 90496

** Error: Directory not terminated
   Record Number: 4240822
   Character: 90604
[blah blah more errors blah blah]
[blah blah ugly stack trace blah blah]
[blah blah more errors blah blah]

“I say, Denton old bean,” you say. “That doesn’t look good.”

Indeed it doesn’t. There are several things I could have done at this stage. I took the “just hack on it until it works” approach. I didn’t care what the problem was with record number 4240822. I just wanted it gone. “Begone,” quoth I. If I had some fancy MARC editor, I might have fired it up and fixed the problem. I didn’t. I brute-forced it.

I did so with MARC/Perl, “a Perl 5 library for reading, manipulating, outputting and converting bibliographic records in the MARC format.” It’s a pretty hairy Perl module, more complicated and more powerful than the Ruby MARC library I’ve mentioned before.

Before I could get rid of that record, I wanted a Perl script that would just open up the MARC file and parse it. That’s always the first step in doing anything like this. I wrote:

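#!/usr/local/bin/perl -w

use strict;
use MARC::Batch;

# The MARC file to read is the first command-line argument.
my $marcfile = shift;
die "Usage: $0 marc.mrc\n" unless defined $marcfile;

my $batch = MARC::Batch->new('USMARC', $marcfile);
# Keep going past records with errors instead of stopping at the first one.
$batch->strict_off();

# Print the title of each record, one per line.
while (my $record = $batch->next()) {
    print $record->title(), "\n";
}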

I ran it and got this error: utf8 "\xB9" does not map to Unicode at /usr/local/lib/perl5/5.8.8/mach/Encode.pm line 166.

The second record in the file had some kind of character encoding problem. I hate those. I asked about it on the perl4lib mailing list, and Jason Ronallo said it was probably there because the ruby-marc library I’d used in my MARC-record-grabbing script had a bug. I upgraded the library to the new bug-free version and started my script over, but it was going to take hours to run.

Ronallo had suggested editing out the offending character. I fired up Emacs and used the hex mode (M-x hexl-find-file) to get that character out of the way. And then another character. And another one. And another one. I ended up searching for multiple occurrences of thirty different characters and replacing them all with spaces. They were mostly in Chinese and Eastern European records.

If anyone ever says to you, “I say, old thing, would you mind editing this MARC file in a hex editor — your favourite, of course, I don’t care which — and removing all of these thirty different Unicode characters wherever they appear, until this Perl script runs more or less cleanly and doesn’t die with an error in the Encode module,” then I suggest you reply, “Sorry, old man, not even for one of Antoine’s famous omelettes and a half bot. of the Widow. And stop speaking like you’re in a Wodehouse novel.” (Why didn’t I do it in Perl? I didn’t think to try. My sed doesn’t seem to handle hex, and for some reason I went straight to my editor after that.)
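For what it's worth, the same blanking-out can be scripted. Here is a minimal sketch in Python (not the Perl used elsewhere in this post, and not what was actually done to the file). Only 0xB9 comes from the error message above; the other values in BAD_BYTES are placeholders, because the thirty real characters aren't listed here. The function and file names are made up for illustration. Swapping each byte for exactly one space keeps every record the same length, so the lengths stored in each MARC leader and directory stay true.

```python
# Hypothetical list of offending byte values. Only 0xB9 appears in the
# error message quoted in the post; 0xB2 and 0xB3 are placeholders.
BAD_BYTES = bytes([0xB9, 0xB2, 0xB3])


def blank_out(data: bytes, bad: bytes = BAD_BYTES) -> bytes:
    """Return data with every byte listed in `bad` replaced by a space."""
    table = bytes.maketrans(bad, b" " * len(bad))
    return data.translate(table)


def clean_file(src: str, dst: str) -> None:
    """Read src as raw bytes, blank out the bad bytes, write to dst."""
    with open(src, "rb") as infile:
        data = infile.read()
    with open(dst, "wb") as outfile:
        outfile.write(blank_out(data))


if __name__ == "__main__":
    sample = b"Pride\xb9and\xb2Prejudice"
    print(blank_out(sample))
```

This works on raw bytes, not characters, so blanking one byte of a multi-byte UTF-8 character can leave its other bytes behind. That may be part of why one offending character tends to lead to another.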

Anyway, I ended up with a file that could, with errors, be parsed by a Perl script: pride-and-prejudice-cleaned-chars.marc. I wasn’t any closer to running the MARC file through the LC tool, though: I still had record number 4240822 to get rid of.

I’m not going to go into all of the gory detail here, but I wrote another Perl script, marc-wiper.pl, which deleted that record and twenty-four others. They were all bad enough that the LC tool just couldn’t parse them. So, away with them.
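The marc-wiper.pl script isn't reproduced here, so what follows is only a rough sketch of the general idea, in Python rather than Perl: split the file on the record terminator, run a few structural checks on each record, and keep the ones that pass. It relies on the standard MARC layout (a 24-byte leader, 12-byte directory entries, 0x1E ending the directory, 0x1D ending the record). All function names are invented for illustration, the checks are surely less thorough than the LC tool's, and the record numbers it reports are just positions in the file, not the "Record Number" the LC tool prints.

```python
from typing import List, Optional, Tuple

FIELD_TERMINATOR = b"\x1e"   # ends the directory and every field
RECORD_TERMINATOR = b"\x1d"  # ends every record
LEADER_LEN = 24              # fixed-size leader at the start of a record
DIR_ENTRY_LEN = 12           # tag (3) + field length (4) + start offset (5)


def split_records(data: bytes) -> List[bytes]:
    """Split on the record terminator and put it back on each piece."""
    chunks = data.split(RECORD_TERMINATOR)
    return [chunk + RECORD_TERMINATOR for chunk in chunks if chunk]


def structural_problem(record: bytes) -> Optional[str]:
    """Describe the first structural problem found, or return None."""
    if len(record) < LEADER_LEN + 2:
        return "record too short"
    leader = record[:LEADER_LEN]
    if not (leader[0:5].isdigit() and leader[12:17].isdigit()):
        return "non-numeric length or base address in leader"
    if int(leader[0:5]) != len(record):
        return "record length does not match leader"
    base = int(leader[12:17])
    if base <= LEADER_LEN or base > len(record):
        return "base address out of range"
    directory = record[LEADER_LEN:base]
    if directory[-1:] != FIELD_TERMINATOR:
        return "directory not terminated"
    if (len(directory) - 1) % DIR_ENTRY_LEN != 0:
        return "invalid directory length"
    return None


def keep_well_formed(data: bytes) -> Tuple[bytes, List[Tuple[int, str]]]:
    """Return the passing records joined together, plus a rejection list."""
    kept, rejected = [], []
    for number, record in enumerate(split_records(data), start=1):
        problem = structural_problem(record)
        if problem is None:
            kept.append(record)
        else:
            rejected.append((number, problem))
    return b"".join(kept), rejected
```

Throwing records away like this is the brute-force route. A gentler tool would try to repair the leader or directory instead.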

On top of that, the LC tool was complaining about some particular MARC fields in some records. Some, such as 400, don’t exist in the MARC 21 specification. Others, like 520 Summary, Etc. do, but I didn’t care.

I ended up wiping all mentions of these fields: 200, 240, 280, 380, 400, 450, 500, 520, 600, 840. “240!” you cry. Yes, I deleted all mentions of 240 Uniform Title. Two records had the same record number and had some kind of problem with their 240, and when I couldn’t get rid of them individually I got fed up and just wiped all 240s. That is not the proper way to treat such an important field in an experiment like this, but I told you I was brute-forcing it.
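Wiping a field from a raw MARC record means more than cutting out its bytes: the directory entry has to go too, the remaining entries need new offsets, and the leader's record length and base address have to be recalculated. Here is a hedged sketch of that bookkeeping in Python. It is not marc-wiper.pl, the function names are invented, and it assumes the record is already structurally sound and that no field is longer than 9999 bytes. The tag list is the one given above.

```python
from typing import Iterable, List, Set, Tuple

FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"

# The tags the post says were wiped.
UNWANTED = {"200", "240", "280", "380", "400", "450", "500", "520", "600", "840"}

Field = Tuple[bytes, bytes]  # (three-byte tag, field data with its terminator)


def parse_fields(record: bytes) -> List[Field]:
    """Read the directory and return (tag, data) pairs in directory order."""
    base = int(record[12:17])          # leader 12-16: base address of data
    directory = record[24:base - 1]    # drop the directory's own terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, length, start = entry[0:3], int(entry[3:7]), int(entry[7:12])
        fields.append((tag, record[base + start:base + start + length]))
    return fields


def build_record(leader: bytes, fields: Iterable[Field]) -> bytes:
    """Assemble a record, recomputing offsets, base address and length."""
    directory, body, offset = b"", b"", 0
    for tag, data in fields:
        directory += b"%b%04d%05d" % (tag, len(data), offset)
        body += data
        offset += len(data)
    directory += FIELD_TERMINATOR
    base = 24 + len(directory)
    total = base + len(body) + 1       # +1 for the record terminator
    new_leader = b"%05d" % total + leader[5:12] + b"%05d" % base + leader[17:24]
    return new_leader + directory + body + RECORD_TERMINATOR


def drop_fields(record: bytes, unwanted: Set[str] = UNWANTED) -> bytes:
    """Return a copy of the record without any field whose tag is unwanted."""
    kept = [(tag, data) for tag, data in parse_fields(record)
            if tag.decode("ascii") not in unwanted]
    return build_record(record[:24], kept)
```

Deleting only the 240s in two particular records would mean checking each record's control number before calling drop_fields, which is the more careful approach this experiment skipped.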

I seem to have mixed up something and now when I run the Perl script I get some warnings, so this may not work for you. I don’t recommend trying it, in any case. If you run marc-wiper.pl on the previous MARC file, you should end up with clean-pride-and-prejudice-cleaned-chars.marc. You may not. You can just download it directly, but I wouldn’t bother. There’s more ugliness to come.

Finally I could get through the first step in the process:

$ ./marc2frbr.sh clean-pride-and-prejudice-cleaned-chars.marc pp
Transforming clean-pride-and-prejudice-cleaned-chars.marc to MARCXML ...

But then …


Transforming the MARCXML into FRBR XML and saving to pp.xml ...
Error on line 15908 column 46 of file:///usr/home/wtd/frbr-lc-tool/tmp/slimfrbr.xml:
  Error reported by XML parser: Character reference "&#31" is an invalid XML character.
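
A note on that message: XML 1.0 allows only three control characters below 0x20 (tab, line feed and carriage return), so character 31 can never appear in an XML document, even written as a character reference. Decimal 31 is 0x1F, which happens to be the MARC subfield delimiter; whether a stray delimiter is what leaked into this record isn't established here. A small Python sketch (names invented for illustration) of the XML 1.0 rule for allowed characters:

```python
from typing import List, Tuple


def is_xml_char(code_point: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (code_point in (0x9, 0xA, 0xD)
            or 0x20 <= code_point <= 0xD7FF
            or 0xE000 <= code_point <= 0xFFFD
            or 0x10000 <= code_point <= 0x10FFFF)


def invalid_positions(text: str) -> List[Tuple[int, int]]:
    """List (index, code point) for every character XML 1.0 forbids."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if not is_xml_char(ord(ch))]


if __name__ == "__main__":
    print(is_xml_char(31))
    print(invalid_positions("Pride\x1fand Prejudice"))
```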

And on that cliffhanger, I leave you until tomorrow.


5 Comments »

  1. Hmm, deleting 240 seems rather disastrous for then applying algorithms to “FRBR-ize”. The 240 is about the biggest clue we have of already controlled work set groupings, isn’t it? I mean, I know you admit that this isn’t proper and you’re just experimenting, but to me, this is a disastrous flaw.

    It is instructive to see how incredibly difficult it is to deal with these records to do something like this. It would be useful to try to explain this to certain cataloging traditionalists who seem to think our metadata records are just fine for doing things like this.

    Comment by Jonathan Rochkind — 16 May 2007 @ 11:20 am
  2. Welcome to Hell my friend. The thing that makes me crazy is that MARBI is busy extending the life of MARC-8 (scan for MARC-8) instead of deprecating it in favor of unicode encodings.

    Comment by Ed Summers — 16 May 2007 @ 3:17 pm
  3. Jonathan: The 240 problem could be fixed up with some direct MARC editing or a bit more knowledge of MARC/Perl, instead of my late-night hacking. A serious implementation would do a better job than this. Interestingly, though, things turn out pretty well. Not that many records had Uniform Title specified.

    Comment by William Denton — 16 May 2007 @ 3:59 pm
  4. Rather than editing out the offending character, I suggested changing Leader byte 9 from ‘a’ to ‘ ’ (blank). If set to ‘a’ it means that the record is UCS/Unicode encoded, but if it’s blank it’s probably MARC8. For some reason your mangled records were read in by MARC Perl if leader byte 9 was changed to blank. I don’t know why or if it might have solved or caused other problems but your “utf8 "\xB9" does not map to Unicode at…” problem would have gone away and I suspect most other character encoding issues would have been resolved as well.

    The better suggestion I made was to retrieve the records again in a batch with the upgraded ruby-marc gem.

    Comment by Jason Ronallo — 17 May 2007 @ 9:58 am
  5. William,

    I have been using the java library marc4j for reading MARC records for a project at the University of Virginia. The library as distributed doesn’t play well with ill-formed MARC records. I have been working on modifying the library to read more permissively, and was looking for a set of ill-formed MARC records to test it with. The pride-and-prejudice.marc file you compiled really fits the bill. After a week or so of tweaking the library I have a Java based MARC reader that can read in all of the MARC records in that file and write out structurally well-formed MARC records.

    Comment by Robert Haschart — 16 May 2008 @ 2:20 pm
