My First Month at the DSU or: how I was given a file and asked to turn it into something else

Oh, they do so many things they never stop. Oh, the things they do there, my stars.

Why hello, I'm the new contract hire at the DSU, here since May.  So far it's been lovely - I love the work pace and I immediately felt like part of the team.  The first thing I worked on was getting content into their shiny new online repository (three weeks my senior): moving all of the metadata for the Doris McCarthy Image Collection out of CONTENTdm (the old asset management software) and into Islandora (the new asset management software).  My aim is to be as transparent as possible, in the hopes that this will be of value to someone like me who is starting out in libraries and working with library data and metadata.  Of course, I will be more than happy to answer any questions if you share a similar pain.

Hey, let's make it easy.  The code is available on GitHub.

What we had:

  • Doris McCarthy Simple DC Export (.xml)
  • Rename map (document)

What we used:

  • Oxygen XML Editor (30-day trial)
  • text editor (we used Sublime Text, plus Excel)
  • XML::Twig (installed from CPAN; it provides the xml_split tool)
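
If you've never installed a CPAN module before, it's a one-liner (assuming a working Perl; your setup may prefer cpanm instead):

    # install XML::Twig from CPAN - the xml_split tool comes along with it
    cpan XML::Twig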

Scripts:

  • XSLT rename map (.xsl)
  • rename (.sh)
  • LOC DC2MODS (.xsl)

To start:

  • the exported XML file from CONTENTdm, in Simple DC, containing 750 records
  • the renaming map project document, mapping each old filename to its new filename (compiled manually)

To end:

  • one individual .xml record in MODS for each associated .tif object (each pair needs to share the same filename in order to be properly batch ingested using the Large Image Content Model)
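
For illustration (these filenames are invented), a properly paired package looks like:

    mccarthy_0001.tif    <- the image object
    mccarthy_0001.xml    <- its MODS record, with the same basename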

Steps

  1. Create a rename map: write an XML stylesheet (XSLT) that reads the current name inside each <dc:source>, looks it up in the stylesheet's table, and swaps in the corresponding replacement identifier (sketch below)
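
    Here's a minimal sketch of the idea - the element names and map entries are invented, and the real stylesheet is in the GitHub repo.  It's an identity transform plus one template that replaces the text of <dc:source> with its mapped value:

        <?xml version="1.0" encoding="UTF-8"?>
        <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:map="urn:rename-map">

          <!-- hypothetical lookup table: old filename -> new identifier -->
          <map:entry old="scan_123.tif" new="mccarthy_0001.tif"/>
          <map:entry old="scan_124.tif" new="mccarthy_0002.tif"/>

          <!-- identity transform: copy everything through unchanged -->
          <xsl:template match="@*|node()">
            <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
          </xsl:template>

          <!-- swap the text of dc:source for its mapped value (keep it if unmapped) -->
          <xsl:template match="dc:source">
            <xsl:variable name="new"
                select="document('')/*/map:entry[@old = current()]/@new"/>
            <xsl:copy>
              <xsl:choose>
                <xsl:when test="$new"><xsl:value-of select="$new"/></xsl:when>
                <xsl:otherwise><xsl:value-of select="."/></xsl:otherwise>
              </xsl:choose>
            </xsl:copy>
          </xsl:template>
        </xsl:stylesheet>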

  2. Run the rename transformation in Oxygen XML Editor -> 750 records (no loss)

    a. ~20 identified duplicates: some had identical identifiers, and some objects simply had two metadata records associated with them (e.g. one record included an OCR transcription while its dupe didn't)
    b. ~30 container metadata records didn't have a name in the mapping, so they weren't transformed - acceptable

  3. Split the file -> 750 files (no loss): using the xml_split tool from XML::Twig (invocation sketched below)
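
    A typical invocation looks something like this - the element name and file names here are guesses on my part, so check the repo for the real ones:

        # one output file per <record> element in the renamed export;
        # -c picks the element to split on, -b sets the output filename base
        xml_split -c record -b mccarthy renamed_export.xml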

  4. Rename the split files -> 730 records, as predicted from the ~20 duplicates in 2a: using convert.sh
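
    Our convert.sh isn't reproduced here, but the gist is a loop like this hypothetical sketch: pull the (already remapped) filename out of each record's <dc:source> and rename the file to match.  Two records that map to the same name collapse into one file at this point, which is where the ~20 duplicates disappear:

        #!/bin/sh
        # hypothetical sketch: rename each split record after the filename
        # found in its <dc:source> element (assumes one dc:source per file)
        for f in mccarthy-*.xml; do
          src=$(sed -n 's/.*<dc:source>\(.*\)<\/dc:source>.*/\1/p' "$f" | head -n 1)
          # swap the .tif extension for .xml so record and image share a basename
          [ -n "$src" ] && mv "$f" "${src%.tif}.xml"
        done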

  5. Transform the metadata records from DC to MODS -> 730 records (no loss from step 4): using Oxygen XML Editor; LOC publishes stylesheets for MODS transformations, which we modified to match our CONTENTdm metadata export (see the trimmed sketch below)
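
    The real LOC stylesheet covers the full DC element set; this drastically trimmed sketch (just title and creator, with the namespaces LOC defines) only shows the shape of the mapping:

        <?xml version="1.0" encoding="UTF-8"?>
        <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:mods="http://www.loc.gov/mods/v3"
            exclude-result-prefixes="dc">

          <xsl:output method="xml" indent="yes"/>

          <!-- wrap each record's converted fields in a mods:mods root -->
          <xsl:template match="/*">
            <mods:mods>
              <xsl:apply-templates select="dc:title | dc:creator"/>
            </mods:mods>
          </xsl:template>

          <!-- dc:title -> mods:titleInfo/mods:title -->
          <xsl:template match="dc:title">
            <mods:titleInfo>
              <mods:title><xsl:value-of select="."/></mods:title>
            </mods:titleInfo>
          </xsl:template>

          <!-- dc:creator -> mods:name/mods:namePart -->
          <xsl:template match="dc:creator">
            <mods:name>
              <mods:namePart><xsl:value-of select="."/></mods:namePart>
            </mods:name>
          </xsl:template>
        </xsl:stylesheet>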

  6. Ready for ingest: each single image + XML package (book batches too, steps not included).  Yay!


Notes:

  • in almost every step something didn't work - you will need to go back a few steps, fix it, and proceed
  • it's difficult to figure out the right order for everything, so don't be afraid to try a different way (doing one step first may cause more problems than doing it another way - e.g. deciding whether to do the DC to MODS transformation early or wait until the very end)
  • cleanup is crucial at every step - the more time you devote to cleanup early in your workflow, the easier the rest of the process will be

All in all, it's been a very exciting month, not just for me but for everyone at the DSU.  Or maybe it's always like this...