Arrr-chivesSpace Migration at ECU: Scurvy Dog
About the series:
The East Carolina University Pirates are engaged in a large-scale migration project to evaluate, prep, and load data from several dispersed databases into ArchivesSpace. Over the next two years, ECU will share the journey from careening data to weighing anchor to sailing into production. By regularly posting progress, ECU aims to empower you, the ArchivesSpace community, to know that you can do it, too!
Scurvy Dog
When our manuscript container list database showed signs of Roman numerals mixed with numbers, duplicate boxes, sub-sub-series, container lists that start with box #3, and, I’m not kidding, an entire series devoted to “Empty Photo Albums” (really?!?), we knew we needed to go straight to the naval surgeon.
Enter our Lead Programmer and months of painstaking work. It was immediately apparent that updating the container lists from EAD would have required replacing the whole record and breaking accession links. We also considered importing the container lists from our local database into Archivists’ Toolkit (AT) and pushing them through the AT-to-AS migration tool. In testing, however, that migration would run for hours before producing results. Blimey!
So, the Lead Programmer tested the Harvard Excel import template’s ability to handle hierarchy, instance types, and date strings with multiple dates (of which there were thousands). He ran reports to assess the number and scale of issues, often conversing with the migration team on the value of retaining data as-is. This, too, ended up not being the most viable option for our migration.
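For crews who want to run a similar assessment, here is a minimal sketch of the kind of report described above. It is not our Lead Programmer's actual code: the function names and the idea of passing in a plain list of box labels are assumptions made for illustration. It flags Roman-numeral labels, labels it cannot parse, and the lowest box number found (handy for spotting lists that start at box #3).

```python
import re

# Strict pattern for Roman numerals I..MMMCMXCIX; the lookahead rejects empty strings.
ROMAN_RE = re.compile(
    r"^(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(label):
    """Convert an uppercase Roman numeral to an integer."""
    total = 0
    prev = 0
    # Walk right to left: a smaller value before a larger one is subtracted.
    for ch in reversed(label):
        value = ROMAN_VALUES[ch]
        if value < prev:
            total -= value
        else:
            total += value
        prev = value
    return total


def assess_boxes(labels):
    """Summarize numbering problems in one collection's box labels (hypothetical helper)."""
    roman, unparsed, numbers = [], [], []
    for raw in labels:
        label = raw.strip()
        if label.isdigit():
            numbers.append(int(label))
        elif ROMAN_RE.match(label.upper()):
            roman.append(label)
            numbers.append(roman_to_int(label.upper()))
        else:
            unparsed.append(label)
    return {
        "roman": roman,            # labels written as Roman numerals
        "unparsed": unparsed,      # labels needing a human look, e.g. "6a"
        "starts_at": min(numbers) if numbers else None,
    }
```

One caveat: a single-letter label such as "C" or "D" would be read as a Roman numeral here, so a real report would need a rule for collections that letter their boxes.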
Ultimately, our Lead Programmer studied the AS database schema while the migration team created and updated test records directly in AS to illustrate structure, allowing him to work backward to the migration code. He developed a console application that restructured the container lists from our local database, wiped existing AS container lists, and generated a ship-load of SQL commands that were saved and then run to programmatically insert the container lists into AS. As the team worked on running the script, we checked that series were implemented properly, that boxes spanning multiple series were handled correctly, and that merged container lists for partially processed collections were accurate. We had a surgical scare when we thought that the top container relationships had flatlined and gone missing. Thankfully, though, a full re-index was all it took to bring them back from the brink.
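The wipe-then-insert pattern described above can be sketched in a few lines. This is only an illustration under loud assumptions: the table `demo_container` and its columns are hypothetical and are not the ArchivesSpace schema, and the helper names are invented for this example. The sketch builds a script of SQL statements as text, so it can be saved and reviewed before anyone runs it.

```python
import sqlite3


def sql_literal(value):
    """Render a Python value as a SQL literal, doubling single quotes in strings."""
    if value is None:
        return "NULL"
    if isinstance(value, int):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def build_script(collection_id, rows):
    """Build a script that wipes one collection's rows, then re-inserts them.

    `demo_container` is a hypothetical table used only for this sketch.
    Each row is a (box, folder, title) tuple.
    """
    cid = sql_literal(collection_id)
    statements = [
        "BEGIN;",
        f"DELETE FROM demo_container WHERE collection_id = {cid};",
    ]
    for box, folder, title in rows:
        values = ", ".join(sql_literal(v) for v in (collection_id, box, folder, title))
        statements.append(
            "INSERT INTO demo_container (collection_id, box, folder, title) "
            f"VALUES ({values});"
        )
    statements.append("COMMIT;")
    return "\n".join(statements)


def run_script(conn, script):
    """Execute a previously generated script against an open sqlite3 connection."""
    conn.executescript(script)
```

Because the delete and the inserts sit in one transaction, running the same script twice leaves the same rows behind, which makes test runs against a scratch database repeatable. The sketch uses SQLite only so that it is easy to try out; a production AS instance runs on a different database server and needs a backup first.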
Now that the container lists are in AS, it’s back to data cleanup and quality control for this crew. We’ve identified 50 collections (out of about 2,000) that came out of SQL surgery with known errors such as duplicate box and folder instances. Currently, a sub-team is looking at data mapping for the resource descriptions, consulting our online collection guides to validate the container lists against AS, and double-checking the physical material for numbering discrepancies. Our sails might be shortened, but we’ve replenished our stock of vitamin C and will soon be able to haul wind towards our digital objects.
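For this kind of quality control, a rough sketch of the two checks might look like the following. It assumes, purely for illustration, that each container list has already been exported as a list of (box, folder) pairs; the function names are hypothetical and this is not the sub-team's actual tooling.

```python
from collections import Counter


def find_duplicates(instances):
    """Return (box, folder) pairs that appear more than once, in sorted order."""
    counts = Counter(instances)
    return sorted(pair for pair, n in counts.items() if n > 1)


def compare_lists(guide, as_export):
    """Compare a collection guide's container list with the one exported from AS."""
    guide_set, as_set = set(guide), set(as_export)
    return {
        "missing_from_as": sorted(guide_set - as_set),  # in the guide, not in AS
        "extra_in_as": sorted(as_set - guide_set),      # in AS, not in the guide
    }
```

Anything these checks flag still needs a trip to the stacks, since neither list can say which numbering matches the physical boxes.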