OpenRefine Introduction
Published and edited .OpenRefine, formerly Google Refine, is a mighty browser-based application to work with huge datasets. It is a commonly used tool for reshaping data before visualising it. In this guide, I will show the main functionalities in a beginner’s friendly way. You will learn how to inspect the data, cluster columns, merge datasets and export them in a new format.
Why OpenRefine
Commonly, OpenRefine is used to »refine« or »wrangle« data tables. This might consist of cleaning, formatting, merging and extracting columns. The main functionalities of OpenRefine are:
- Importing various formats
- Filtering huge datasets
- Merging similar/incorrect values
- Operations on entire columns or multiple cells
- Extracting / applying operation commands
- Exporting in various formats
These tasks could of course also be solved by scripts written in Julia, Python and JavaScript or alternatively by tools like Excel, OpenRefine brings advantages from both sides. The graphical user interface that OpenRefine provides makes it more beginner-friendly than coding these tasks. And compared to Excel, OpenRefine offers better performance for huge data sets and offering these common tasks out of the box. OpenRefine also allows the user to apply all previously performed tasks on a new data set, which makes it handy for recurring tasks.
Download and start
Download OpenRefine from Github and install. Start the application and open http://127.0.0.1:3333/ in any browser.
Import
Create a new project by importing data from Excel, CSV, TSV, JSON, … file or from the clipboard. For this guide, use the example file MoviesList.csv. If you use the file, use the following options:
- Select to use comma for separation
- Set character encoding to UTF-8
- Ignore the first four lines
- Select to read the first line as header
General interface
What makes OpenRefine efficient for big datasets is that it allows viewing only an excerpt of the whole dataset. Usually, the user applies tasks to the complete column. This is for example »convert every cell in this column to lowercase« or »replace the word foo with bar in every cell in this column«. The task is applied to the whole column without the user seeing the actual cell. This saves a great amount of computational power.
The user can choose how many rows this excerpt should include just above the table. Even though it is not possible to view more than 50 rows at once, the user can navigate through each page of the dataset through the option on the right top of the table.
Rows and records
Rows are the absolute lines the tables consists of like in an Excel spreadsheet. Records, however, can consist of multiple rows that are grouped under the same item in the first column. For this, the first row in the record contains a value (i.e. North America
) while the next items do not contain any value in the first column. In this example, the dataset consists of three records and 18 rows.
Rows and records can be easily converted into each other by clicking on the arrow in the header of the first column and selecting Edit cells/Fill down
or Edit cells/Blank down
.
Column operations
All operations that can be executed on a single column can be selected from the arrow left to the title of the column. Operations that will be executed on all columns can be selected from the default, first column All
.
Renaming a column
Select Edit column/Rename this column
from the Column
column and type in IMDb Rating
.
Hiding a column
Select View/Collapse this column
from any column. Click on the remaining space to show the column again.
Reorder/remove multiple columns
Select Edit columns/Re-order / remove columns…
from the All
column. Change the order by dragging them to the top or bottom or delete them by dragging them to the right field.
Facets
Facets allow viewing all occurring values for a column. Through this list, the dataset can be filtered to only show rows with the selected value included.
Facet types
Text facet
The most basic form is the text facet. Like every operation we can apply to a column, we click on the arrow left to Title
and select Facet/Text facet
.
On the left, we now see a unique list of all titles. The numbers indicate how many times this word appears. If we select count
from the header the list is ordered by this number. In order to export this list, we can click on the 17 choices
link in the header.
One big advantage of OpenRefine is to filter the list of viewed items. This allows to apply operations to a subset of the dataset and also to export only a fraction of rows.
In this example, we use it to remove duplicates: Sort by count
and click on the first item (Ulysses' Gaze
). Since we can only remove all viewed rows in OpenRefine we need to make one item differ from the other. Do that by clicking on the flag icon at the beginning of the row. Next, select from the All
column Facet/Facet by flag
. In order to remove all currently visible rows, select Edit rows/Remove all matching rows
for the All
column.
In order to filter by multiple items simply select one and then select include
when hovering the additional items. Once one item or more are selected, you can also select invert
from the header of the facet list. Filtering with multiple facets is also possible.
Numeric facet
If the column we want to apply the facet on consists of numeric values the display as a list might not be the best solution. Instead, we can use the Numeric facet
that displays the values as a histogram.
Select Facet/Numeric facet
from the RottenTomatos UserRating
column. In the facet, the message No numeric value present
will appear. This is because the column is currently formatted as strings (text). To solve this, select Edit cells/Common transforms/To number
from the RottenTomatos UserRating
column. The values will now appear green and on the right side.
Filtering works the same way by using the handles on both sides.
Custom text/numeric facet
Currently, the items in the facet list match the value in the column. What if we want to modify the value for the facet? Select Facet/Custom text facet…
from the Title
column. Facet by the first character? Use value[0]
. Check if the title contains The
? Use contains(value, 'The')
.
The same works for custom numeric facets. Select Facet/Custom Numeric Facet…
from the RottenTomatos UserRating
column and use value[0].toNumber() + value[1].toNumber()
to facet by cross sum.
In the menu Facet/Customized facets
, you can find commonly used custom facts.
Array facts
OpenRefine can also handle array values for facts. For example, select Select Facet/Custom text facet…
from the Genre
column. Type value.split(', ')
to split up the columns by the separator ,
.
Date facet
Date facets are similar to numeric facts but require date values. This requires properly formatted string values that can be converted to dates.
Scatterplots facet
One easy way to see possible correlating columns is to use scatterplots. Even though the implementation is very basic, it is a good way to plot every column against every other. Convert the RottenTomatos
column to numeric values and select Facet/Scatterplot facet
. Since the dataset currently, consists of two columns with numeric values, we only select the one scatterplot available. Unfortunately, filtering or labels on the points are not available.
Clustering
One important functionality is clustering values by similarity. This allows to rapidly clean messy data. Start by creating a text facet for the Director
column and click on Cluster
above the facet list.
The first group of methods are key collision methods that work with alternative/simplified representations of the string. These methods are fast since these alternative versions need to be created only once for each clustering. The second group are the nearest neighbour methods that calculate the difference between every pair of words.
Fingerprint
The first cluster method is fingerprinting. This simple method is good for matching strings with special characters or an alternative order of words. It works by
- removing leading and trailing whitespace
- changing all characters to their lowercase representation
- removing all punctuation and control characters (tab, newline, …)
- splitting the string into words
- sorting the words and removing duplicates
- joining the words back together
- normalising extended western characters to their ASCII representation, i.e. removing diacritical signs (for example
Iñárritu
→Inarritu
)
OpenRefine already selects the most occurring appearance as new cells value. You can also click on one of the listed appearances or type your own. In order to apply this change select Merge?
for each cluster and then Merge Selected & Re-Cluster
.
N-Gram fingerprint
This method uses a similar process but with n-grams. These are parts of the word with a user-defined length. By default, words are split into 2-character chunks that are then sorted and uniquely joined. This method is most useful with 2- or 1-grams. It can detect shuffled or words with repeating characters.
Phonetic
Most languages allow different writings for similar-sounding words. These methods convert strings into their pronounced writing. Metaphone3 is used for English words, cologne-phonetic for German words. These methods are useful for incorrect writing, especially for manually written data.
Levenshtein
This nearest neighbour
method calculates the number of changes required from one string to another. For example, starting from Pak
to Park
we only need one change; hence the Levenshtein distance is 1. Try changing the radius to create clusters that are more diverse.
Lowering the block Chars
will dramatically raise the computational cost. It first performs a key collision clustering that groups words that contain a substring of this size.
This method is useful for common typos or simple alternative writings.
PPM
The Prediction by partial matching method is the most complex method. It works by compressing strings into smaller descriptions. For example, the string ababab
can be described as 3 times ab
. Now, if you have two strings that are similar then the description of these strings combined should be similar to their independent description. Again, you can change the radius to raise the threshold.
This method is useful if complete parts of a string are missing but can also create many false positive matches.
Further information
You can find more information about clustering in the OpenRefine Wiki.
Cell operations
Just like in Excel, we can edit single cells. For this hover over the first cell in the Year
column and select edit. Make the changes and select Apply
.
We can also apply changes to cells that consist of the same value. For this select one of the cells containing Asian
from the Continent
column. Type Asia
and select Apply to All Identical Cells
.
Column transformations
One of the most important applications is applying transformations on entire columns.
Common transformations
Commonly used transformations can be found under Edit cells/Common Transformations
. This allows, for example, to convert all cells in the Title
column into title case.
Recommended transformations
Every new dataset should experience a basic cleanup. This includes:
Edit cells/Common transforms/Trim leading and trailing whitespace
Edit cells/Common transforms/Collapse consecutive whitespace
- and
Edit cells/Common transforms/To number
on numerical columns
Custom transformations
Since we can use GREL
(a subset of JavaScript functions), Python or Clojure the applications and combinations are endless. Here are just a few examples:
Replacing substrings
Select Edit cells/Transform…
on the Year
column. Type in
value.replace('Veröffentlichung', 'Release')
and click OK
.
Convert German number notation to English notation
Select Edit cells/Transform…
on the IMDb Rating
column. Type in
value.replace('.', '').replace(',', '.').toNumber()
and click OK
.
Extracting substrings
Extracting the pure year value can either be done by using this Regex expression
value.match(/.*(\d{4})/)[0].toNumber()
or by removing the additional text.
Sort words in a column
Select Edit cells/Transform…
on the Genre
column. Type in
value.split(', ').sort().join(', ')
and click OK
.
Splitting columns
Select Edit column/Split into several columns…
on the Cinematography/Music
column. Use ,
as a separator and set the number of columns to 2
. Next, you might want to rename the two new columns.
Create a new column based on another column
Select Edit column/Add column based on this column…
on the Genre
column and type in
value.contains('Drama')
If you want to reference another column use cells['Director'].value
.
Merge two tables
Another great feature of OpenRefine is the ability to merge two tables based on a common column. Make sure you have a second table named Movies Countries
(MoviesCountries.csv) with a column MovieTitle
and a second column MovieCountries
. In the original table select Edit column/Add column based on this column…
from the MovieTitle
column. Type in
cell.cross("Movies Countries", "MovieTitle")[0].cells["MovieCountries"].value
This opens the table Movies Countries
and searches for all values from the column Title
in the column MovieTitle
in the target dataset. If it finds equal values it copies the value from the column MoviesCountries
from that row.
Advanced transformation
Select Edit cells/Transform…
on the Duration
column. Type in
value.match(/(\d+)h(\d+)m/)[0].toNumber() * 60 + value.match(/(\d+)h(\d+)m/)[1].toNumber()
and click OK
.
Further information
You can find all available information in the OpenRefine Wiki.
History
All the changes in OpenRefine are non-destructive. The complete history can be viewed in the Undo/Redo
tab below the OpenRefine logo.
Exporting and applying changes
If you need to edit a dataset with the same structure multiple times, it is recommended to extract all changes done to one dataset and then apply these to the others. Select Extract…
from the header of the Undo / Redo
tab and copy the JSON object on the right. In the next dataset, select Apply…
and put in the JSON object.
Export
OpenRefine can export to multiple formats. All options can be found under the Export
option in the top right.
Export entire project
You can also export the entire project through Export/Export project
. Save the file and select it when you create a new dataset under the Import Project
section on the start screen.
Custom tabular exporter
If you want to specify exactly which columns, their format and the output file select Export/Custom tabular exporter…
. There you can change the order, select which columns to export and also file format under the Download
tab.
You can also use the JSON object in the Option Code
tab if you want to use these settings multiple times.