Lab Report 1: Google Refine
While I was in class I managed to complete several tasks with my iTunes data. When I first uploaded the file into Google Refine, there were only 1409 records. By locating the last row in the original spreadsheet exported from iTunes, I was able to see that a lot of data had ended up in one cell because the song “Lady” by DeAngelo had quotes around the title. After removing the quotes from a song title twice, the number of records increased to 3502, which matches my iTunes data. I then used clustering to consolidate the terms for genres, artists, albums, names, and composers. Next, by renaming the genres and changing terms like “Alternative & Punk” to “Alternative”, I reduced the number of categories. While in class I managed to reconcile the Artist column of the first (1409-record) dataset against Freebase, stayed on to reconcile the full (3502-record) dataset, and emailed them to myself.
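The jump from 1409 to 3502 records is easier to picture with a small sketch. The example below is only an illustration, not what Google Refine does internally: it assumes a tab-delimited export and uses a made-up four-line sample in which one title begins with a stray quote. A parser that treats quotes as special reads everything after that quote as a single cell, while a parser told to ignore quotes sees every row.

```python
import csv
import io

# Hypothetical sample standing in for an iTunes export (tab-delimited).
# The third line has a title that starts with an unmatched quote.
raw = (
    "Name\tArtist\n"
    "Song A\tArtist A\n"
    "\"Lady\tArtist B\n"
    "Song C\tArtist C\n"
)

def count_records(text, quoting):
    """Count data rows (excluding the header) under a given quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    return sum(1 for _ in reader) - 1

# Quotes treated as field delimiters: the stray quote swallows later rows.
print(count_records(raw, csv.QUOTE_MINIMAL))

# Quotes treated as ordinary characters: every row is counted.
print(count_records(raw, csv.QUOTE_NONE))
```

Removing the quotes from the source file, as I did, has the same effect as the second call: the rows that were hidden inside one cell become separate records again.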
Interestingly, when I downloaded Google Refine at home and placed it in C:\Program Files (x86)\Google, I could not import the spreadsheets, either from the computer or by uploading them to Google Drive and accessing them from there. However, by moving the Refine folder out of the Google folder and onto the desktop, I was able to get it to work. The most frustrating part of this experience is what happened next. At home, I reconciled the data again using the album and the artist columns, and it took several hours to search for matches and create new categories for over 2,000 records. Then I accidentally hit the backspace button and everything disappeared! All of a sudden I was back at the screen for selecting a file to upload. I tried for a while to recover the file I was working on, and finally had to accept that it was gone. I then started looking around for a feature to save the file in Google Refine while I am working on it, and it seems the only thing to do is export the file periodically. This is a weakness in the program. I searched online to see if anyone has had similar issues, and the most I found was in a Google product forum, where folks posted in 2009 and 2010 about losing data when multiple users were working on spreadsheets containing tens of thousands of rows.
In order to relieve my frustration, I decided to play around with quilt data. I used the Quilt Index website to search for the quilt patterns listed in Gracie Mitchell’s transcript, which I attempted to map last spring (2012) in Digital Humanities. The Quilt Index is sort of like WorldCat, in that it is a compilation of records from many different quilt collections in the USA, so it offers a larger domain to work with than the International Quilt Study Center online collection that I used last year. The metadata is less consistent, but the pattern name and year are almost always there, which is essential to the interactive map I want to create for the Runaway Quilt Project. I searched the collection and selected 741 records to compare, which produced a large horizontal table on the webpage. I copied and pasted this into Microsoft Excel and transposed the table, so that the records were listed vertically instead of horizontally, and the facets were in the top row instead of the left column. Unfortunately, transposing lost the thumbnail images in the original spreadsheet. I tried transposing the data a few more times, and finally decided to worry about pictures later. They are key to the work in general, but since I am practicing with Refine and had already experienced frustration with the iTunes debacle, I left that to figure out later. Next, I uploaded the file into Refine and clustered/merged the patterns and quilter names. Then I clustered/merged the dates as text facets in order to delete the “c” for circa in front of some of them, and to turn date ranges like “1860-1890” into the earliest possible occurrence, “1860”. After that I transformed the column into dates. There were three date columns, so I also did this for the “Date Est.” column. I am hoping that I can figure out how to manage the images, so that I can create an interactive map with facets using the DataPress plugin in WordPress that I’ve been trying to figure out for a year now.
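The date cleanup described above (dropping the “c” for circa and keeping the earliest year of a range) can also be sketched outside of Refine. This is a minimal illustration of the rule, not the clustering and merging I actually did in Refine; the function name earliest_year and the sample values are my own assumptions, and it simply takes the first four-digit number it finds in each value.

```python
import re

def earliest_year(value):
    """Return the first four-digit year found in a date string, or None.

    Hypothetical helper: it ignores a leading "c" (circa) and, for a
    range such as "1860-1890", keeps the earliest year.
    """
    match = re.search(r"\d{4}", value)
    return int(match.group()) if match else None

# Sample values in the style of the quilt date column (made up for illustration).
for text in ["c1860", "1860-1890", "c. 1875", "unknown"]:
    print(text, "->", earliest_year(text))
```

Values with no recognizable year come back as None, which makes them easy to find and review by hand before turning the column into dates.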