Cleaning More Messy Data: Tutorial Assignment (Week 9)

The Techniques

For my midterm, I decided to use both of the Tate datasets, one that listed Tate artworks that were from the years 1950-55 and another that listed non-UK Tate artists who were born between 1900-19. I wanted to plot the artworks at the place of birth of the work’s artist. To achieve that, I had to learn how to remove rows in a dataset in OpenRefine if a column’s values did not match any of the values in a column from a different dataset. Then, I learned how to create a new column based on the values from a different dataset. This tutorial will go through the step-by-step process on how to complete these important techniques.

The Techniques’ Uses in Context

This technique could be used in other digital humanities projects for cleaning up messy datasets that rely on information from another dataset. For example, you may have a dataset that lists the highest performing campuses for the SATs in Texas and another dataset that includes basic information about campuses in the Houston Independent School District (HISD), including data like the school’s address, name, or demographics. You could check if any of the schools in the HISD dataset are present in the SAT dataset and analyze any patterns you may notice. For my midterm, I used these techniques to remove artworks from the artwork dataset that were not made by the artists in the Tate artists dataset and create a new column that included the artists’ places of birth.

Tutorial

Import both datasets into OpenRefine. Determine which dataset you will be cleaning.
- In my case, the Tate Artworks dataset is checking if the values in the artist column match with the values in the name column in the Tate Artist dataset.

screenshot of openrefine home page after importing a file

screenshot of openrefine before you create a project

Add a custom text facet to the column that you will be checking.
- Click the dropdown by the column name > Facet > Custom text facet…
- In my case, I will be checking the artist column in the Tate Artworks dataset.

screenshot of the shortcut paths to custom text facet

In the box add this GREL function: value.cross(“otherProjectName”, “columnName”).length()
- If the value is present in the other dataset, the output will be 1. If the value is not present, the output will be 0.

screenshot of the GREL function and its outputs

In the facet, click 0.

Click the dropdown by All > Edit rows > Remove matching rows. Then, click exclude.

screenshot of the shortcut path to remove matching rows

Next, to create a new column based on the values from a different dataset, click the dropdown for the column you are checking > Edit column > Add column based on this column…

screenshot of shortcut path to add column based on this column

Add this GREL function: cell.cross(“otherProjectName“,”columnName”)[0].cells[“otherColumnName“].value. Remember to add a column name!

screenshot of grel function and its outputs

Now, you should have a clean dataset!

More Resources

You can find more information about these features here:

Tags: grel functions, OpenRefine, Tate Artists, Tate Collection, Tate Museum, Tutorial

1 thought on “Cleaning More Messy Data: Tutorial Assignment (Week 9)”

Mya says:

March 7, 2025 at 5:10 pm

Hi Trixie! I love how detailed each of your steps were in explaining this complex function. When we were working on OpenRefine in class I knew there were many opportunities to use it in fascinating ways, but I didn’t fully get a grasp on it. However, your explanation on how to merge two datasets was very informative and easy for me to follow. Good job!

Cleaning More Messy Data: Tutorial Assignment (Week 9)

The Techniques

The Techniques’ Uses in Context

Tutorial

More Resources

1 thought on “Cleaning More Messy Data: Tutorial Assignment (Week 9)”

Leave a Reply Cancel reply

Final Project Update

Final Project Update (Assata/Eliza/Josephine)

Blog week 5

Final Project Assignment: Project Update

3D Model of the Amaravati Guardian Lion

Lab Week 4

Lab Week 6

Final Project Website- Historical Northfield Buildings

Blog Week 2

Baby names graph

Lab Week 4

Lab Week 6

Baby names graph

Final Project — Data Visualization

Final Project: Data Visualization (Ngelek, Dylan, Reed, Khizar)

Hacking the Humanities 2025W