Cleaning More Messy Data: Tutorial Assignment (Week 9)

The Techniques

For my midterm, I decided to use both of the Tate datasets, one that listed Tate artworks that were from the years 1950-55 and another that listed non-UK Tate artists who were born between 1900-19. I wanted to plot the artworks at the place of birth of the work’s artist. To achieve that, I had to learn how to remove rows in a dataset in OpenRefine if a column’s values did not match any of the values in a column from a different dataset. Then, I learned how to create a new column based on the values from a different dataset. This tutorial will go through the step-by-step process on how to complete these important techniques.

The Techniques’ Uses in Context

This technique could be used in other digital humanities projects for cleaning up messy datasets that rely on information from another dataset. For example, you may have a dataset that lists the highest performing campuses for the SATs in Texas and another dataset that includes basic information about campuses in the Houston Independent School District (HISD), including data like the school’s address, name, or demographics. You could check if any of the schools in the HISD dataset are present in the SAT dataset and analyze any patterns you may notice. For my midterm, I used these techniques to remove artworks from the artwork dataset that were not made by the artists in the Tate artists dataset and create a new column that included the artists’ places of birth. 

Tutorial

  1. Import both datasets into OpenRefine. Determine which dataset you will be cleaning.
    •  In my case, the Tate Artworks dataset is checking if the values in the artist column match with the values in the name column in the Tate Artist dataset.
screenshot of openrefine home page after importing a file
screenshot of openrefine before you create a project
  1. Add a custom text facet to the column that you will be checking.
    • Click the dropdown by the column name > Facet > Custom text facet…
    • In my case, I will be checking the artist column in the Tate Artworks dataset.
screenshot of the shortcut paths to custom text facet
  1. In the box add this GREL function: value.cross(“otherProjectName”, “columnName”).length()
    • If the value is present in the other dataset, the output will be 1. If the value is not present, the output will be 0.
screenshot of the GREL function and its outputs
  1. In the facet, click 0.
screenshot of the text facet
  1. Click the dropdown by All > Edit rows > Remove matching rows. Then, click exclude.
screenshot of the shortcut path to remove matching rows
  1. Next, to create a new column based on the values from a different dataset, click the dropdown for the column you are checking > Edit column > Add column based on this column…
screenshot of shortcut path to add column based on this column
  1. Add this GREL function: cell.cross(“otherProjectName“,”columnName”)[0].cells[“otherColumnName“].value. Remember to add a column name!
screenshot of grel function and its outputs

Now, you should have a clean dataset!

More Resources

You can find more information about these features here:

  1. OpenRefine Documentation for GREL Functions
  2. Tutorial in OpenRefine Google group

1 thought on “Cleaning More Messy Data: Tutorial Assignment (Week 9)

  1. Hi Trixie! I love how detailed each of your steps were in explaining this complex function. When we were working on OpenRefine in class I knew there were many opportunities to use it in fascinating ways, but I didn’t fully get a grasp on it. However, your explanation on how to merge two datasets was very informative and easy for me to follow. Good job!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

css.php