Draft: OpenRefine Introduction

Why OpenRefine

Commonly, OpenRefine is used to »refine« data tables. This might consist of cleaning, formatting, merging and extracting columns. The main functionalities of OpenRefine are:

  • Importing various formats
  • Filtering huge datasets
  • Merging similar/incorrect values
  • Operations on entire columns or multiple cells
  • Extracting / applying operation commands
  • Exporting in various formats

Download and start

Download OpenRefine from https://github.com/OpenRefine/OpenRefine/releases and install. Start the application and open http://127.0.0.1:3333/ in any browser.

Import

Create a new project by importing data from Excel, CSV, TSV, JSON, … file or from the clipboard. For this guide, use the example file MoviesList.csv. If you use the file, use the following options:

  • Select to use comma for separation
  • Set character encoding to UTF-8
  • Ignore the first four lines
  • Select to read the first line as header

General interface

What makes OpenRefine efficient for big datasets is that it allows viewing only an excerpt of the whole dataset. Usually, the user applies tasks to the complete column. This is for example »convert every cell in this column to lowercase« or »replace the word foo with bar in every cell in this column«. The task is applied to the whole column without the user seeing the actual cell. This saves a great amount of computational power.

The user can choose how many rows this excerpt should include just above the table. Even though it is not possible to view more than 50 rows at once, the user can navigate through each page of the dataset through the option on the right top of the table.

Rows and records

Rows are the absolute lines the tables consists of like in an Excel spreadsheet. Records, however, can consist of multiple rows that are grouped under the same item in the first column. For this, the first row in the record contains a value (i.e. North America) while the next items do not contain any value in the first column. In this example, the dataset consists of three records and 18 rows.

Rows and records can be easily converted into each other by clicking on the arrow in the header of the first column and selecting Edit cells/Fill down or Edit cells/Blank down.

Column operations

All operations that can be executed on a single column can be selected from the arrow left to the title of the column. Operations that will be executed on all columns can be selected from the default, first column All.

Renaming a column

Select Edit column/Rename this column from the Column column and type in IMDb Rating.

Hiding a column

Select View/Collapse this column from any column. Click on the remaining space to show the column again.

Reorder/remove multiple columns

Select Edit columns/Re-order / remove columns… from the All column. Change the order by dragging them to the top or bottom or delete them by dragging them to the right field.

Facets

Facets allow viewing all occurring values for a column. Through this list, the dataset can be filtered to only show rows with the selected value included.

Facet types

Text facet

The most basic form is the text facet. Like every operation we can apply to a column, we click on the arrow left to Title and select Facet/Text facet.

On the left, we now see a unique list of all titles. The numbers indicate how many times this word appears. If we select count from the header the list is ordered by this number. In order to export this list we can click on the 17 choices link in the header.

One big advantage of OpenRefine is to filter the list of viewed items. This allows to apply operations to a subset of the dataset and also to export only a fraction of rows.

In this example, we use it to remove duplicates: Sort by count and click on the first item (Ulysses' Gaze). Since we can only remove all viewed rows in OpenRefine we need to make one item differ from the other. Do that by clicking on the flag icon in the beginning of the row. Next, select from the All column Facet/Facet by flag. In order to remove all currently visible rows, select Edit rows/Remove all matching rows for the All column.

In order to filter by multiple items simply select one and then select include when hovering the additional items. Once one item or more are selected, you can also select invert from the header of the facet list. Filtering with multiple facets is also possible.

Numeric facet

If the column we want to apply the facet on consists of numeric values the display as list might not be the best solution. Instead, we can use the Numeric facet that displays the values as a histogram.

Select Facet/Numeric facet from the RottenTomatos UserRating column. In the facet the message No numeric value present will appear. This is because the column is currently formatted as strings (text). To solve this, select Edit cells/Common transforms/To number from the RottenTomatos UserRating column. The values will now appear green and on the right side.

Filtering works the same way by using the handles on both sides.

Custom text/numeric facet

Currently, the items in the facet list match the value in the column. What if we want to modify the value for the facet? Select Facet/Custom text facet… from the Title column. Facet by the first character? Use value[0]. Check if the title contains The? Use contains(value, 'The').

The same works for custom numeric facets. Select Facet/Custom Numeric Facet… from the RottenTomatos UserRating column and use value[0].toNumber() + value[1].toNumber() to facet by cross sum.

In the menu Facet/Customized facets you can find commonly used custom facts.

Array facts

OpenRefine can also handle array values for facts. For example, select Select Facet/Custom text facet… from the Genre column. Type value.split(', ') to split up the columns by the separator , .

Date facet

Date facets are similar to numeric facts but require date values. This requires properly formatted string values that can be converted to dates.

Scatterplots facet

One easy way to see possible correlating columns is to use scatterplots. Even though the implementation is very basic, it is a good way to plot every column against every other. Convert the RottenTomatos column to numeric values and select Facet/Scatterplot facet. Since the dataset currently, consists of two columns with numeric values, we only select the one scatterplot available. Unfortunately, filtering or labels on the points are not available.

Clustering

One important functionality is clustering values by similarity. This allows to rapidly clean messy data. Start by creating a text facet for the Director column and click on Cluster above the facet list.

The first group of methods are key collision methods that work with alternative/simplified representations of the string. These methods are fast since these alternative versions need to created only once for each clustering. The second group are the nearest neighbour methods that calculate the difference between every pair of words.

Fingerprint

The first cluster method is fingerprinting. This simple method is good for matching strings with special characters or an alternative order of words. It works by

  1. removing leading and trailing whitespace
  2. changing all characters to their lowercase representation
  3. removing all punctuation and control characters (tab, new line, …)
  4. splitting the string into words
  5. sorting the words and removing duplicates
  6. joining the words back together
  7. normalising extended western characters to their ASCII representation (for example IñárrituInarritu)

OpenRefine already selects the most occurring appearance as new cells value. You can also click on one of the listed appearances or type your own. In order to apply this change select Merge? for each cluster and then Merge Selected & Re-Cluster.

N-Gram fingerprint

This method uses a similar process but with n-grams. These are parts of the word with a user-defined length. By default, words are split into 2-character chunks that are then sorted and uniquely joined.

This method is most useful with 2- or 1-grams. It can detect shuffled or words with repeating characters.

Phonetic

Most languages allow different writings for similar-sounding words. These methods convert strings into their pronounced writing. Metaphone3 is used for English words, cologne-phonetic for German words.

These methods are useful for incorrect writing, especially for manually written data.

Levenshtein

This nearest neighbour method calculates the number of changes required from one string to another. For example, starting from Pak to Park we only need one change; hence the Levenshtein distance is 1. Try changing the radius to create clusters that are more diverse.

Lowering the block Chars will dramatically raise the computational cost. It first performs a key collision clustering that groups words that contain a substring of this size.

This method is useful for common typos or simple alternative writings.

PPM

The Prediction by partial matching method is the most complex method. It works by compressing strings into smaller descriptions. For example, the string ababab can be described as 3 times ab. Now, if you have two strings that are similar then description of theses strings combined should be similar to their independent description. Again, you can change the radius to raise the threshold.

This method is useful if complete parts of a string are missing but can also create many false positive matches.

Further information

You can find more information about clustering in the OpenRefine Wiki: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth

Cell operations

Just like in Excel, we can edit single cells. For this hover over the first cell in the Year column and select edit. Make the changes and select Apply.

We can also apply changes to cells that consist of the same value. For this select one of the cells containing Asian from the Continent column. Type Asia and select Apply to All Identical Cells.

Column transformations

One of the most important applications is applying transformations on entire columns.

Common transformations

Commonly used transformations can be found under Edit cells/Common Transformations. This allows, for example, to convert all cells in the Title column into title case.

Every new dataset should experience a basic cleanup. This includes:

  • Edit cells/Common transforms/Trim leading and trailing whitespace
  • Edit cells/Common transforms/Collapse consecutive whitespace
  • and Edit cells/Common transforms/To number on numerical columns

Custom transformations

Since we can use GREL (a subset of JavaScript functions), Python or Clojure the applications and combinations are endless. Here are just a few examples:

Replacing substrings

Select Edit cells/Transform… on the Year column. Type in

value.replace('Veröffentlichung', 'Release')

and click OK.

Convert German number notation to English notation

Select Edit cells/Transform… on the IMDb Rating column. Type in

value.replace('.', '').replace(',', '.').toNumber()

and click OK.

Extracting substrings

Extracting the pure year value can either be done by using this Regex expression

value.match(/.*(\d{4})/)[0].toNumber()

or by removing the additional text.

Sort words in a column

Select Edit cells/Transform… on the Genre column. Type in

value.split(', ').sort().join(', ')

and click OK.

Splitting columns

Select Edit column/Split into several columns… on the Cinematography/Music column. Use , as a separator and set the number of columns to 2. Next, you might want to rename the two new columns.

Create a new column based on another column

Select Edit column/Add column based on this column… on the Genre column and type in

value.contains('Drama')

If you want to reference another column use cells['Director'].value.

Merge two tables

Another great feature of OpenRefine is the ability to merge two tables based on a common column. Make sure you have a second table named Movies Countries (MoviesCountries.csv) with a column MovieTitle and second column MovieCountries. In the original table select Edit column/Add column based on this column… from the MovieTitle column. Type in

cell.cross("Movies Countries", "MovieTitle")[0].cells["MovieCountries"].value

This opens the table Movies Countries and searches for all values from the column Title in the column MovieTitle in the target dataset. If it finds equal values it copies the value from the column MoviesCountries from that row.

Advanced transformation

Select Edit cells/Transform… on the Duration column. Type in

value.match(/(\d+)h(\d+)m/)[0].toNumber() * 60 + value.match(/(\d+)h(\d+)m/)[1].toNumber()

and click OK.

Further information

You can find all available information in the OpenRefine Wiki: https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

History

All the changes in OpenRefine are non-destructive. The complete history can be viewed in the Undo/Redo tab below the OpenRefine logo.

Exporting and applying changes

If you need to edit dataset with the same structure multiple times, it is recommended to extract all changes done to one dataset and then apply these to the others. Select Extract… from the header of the Undo / Redo tab and copy the JSON object on the right. In the next dataset, select Apply… and put in the JSON object.

Export

OpenRefine can export to multiple formats. All options can be found under the Export option in the top right.

Export entire project

You can also export the entire project through Export/Export project. Save the file and select it when you create a new dataset under the Import Project section on the start screen.

Custom tabular exporter

If you want to specify exactly which columns, their format and the output file select Export/Custom tabular exporter…. There you can change the order, select which columns to export and also file format under the Download tab.

You can also use the JSON object in the Option Code tab if you want to use these settings multiple times.