RapidMiner to enrich Gender data

[UPDATE September 2014: watch the 3-minute tutorial video]

[UPDATE July 2014: the NamSor Onomastics Extension is now available in the RapidMiner Marketplace]

[UPDATE June 2014: we have built an open-source (AGPL) extension for RapidMiner, get it on GitHub]

With Open Data from the Internet Movie Database (IMDb) and a gender prediction API, it was possible to assess the gender gap in the global film industry in minutes. We found that only ~22% of three hundred thousand movie directors worldwide are women.

That first analysis required programming skills and a small custom program. Could it be done with a friendlier data mining tool? This article shows how a similar gender study can be conducted with RapidMiner.

Get RapidMiner

Install RapidMiner from SourceForge, then add two extensions via Help->Updates and Extensions: Text Mining and Web Mining.

In this example, we will read an Excel file with two columns (firstName, lastName) and enrich it with a new column containing the gender (on a -1..+1 scale). Our test file is the 2010 list of members of the exclusive club ‘Le Siècle’, which periodically gathers the French élite: Club_LeSiecle.xlsx (Source: La Marseillaise/cryptome).

Import Excel Data

Drag and drop the Read Excel operator (Import->Data->Read Excel) and launch the Import Configuration Wizard.

[Screenshot: the Read Excel operator]

The wizard's default values should be fine, except that Encoding should be set to UTF-8 (Unicode). This matters in particular if you want to genderize Chinese, Russian or Arabic names.
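To illustrate why the encoding setting matters, here is a minimal Python sketch (not part of the RapidMiner process) showing what happens when UTF-8 text is read with a single-byte encoding. The sample names are only illustrative.

```python
# An accented name encoded as UTF-8, then wrongly decoded as Latin-1.
name = "Jaffré"
utf8_bytes = name.encode("utf-8")

garbled = utf8_bytes.decode("latin-1")   # each UTF-8 byte becomes one character
restored = utf8_bytes.decode("utf-8")    # the correct round trip

# Names outside Western European scripts cannot be stored in Latin-1 at all.
try:
    "李".encode("latin-1")
    chinese_fits_latin1 = True
except UnicodeEncodeError:
    chinese_fits_latin1 = False

print(garbled)               # JaffrÃ©
print(restored)              # Jaffré
print(chinese_fits_latin1)   # False
```

A garbled name such as the first one would be sent to the gender API as is, so the prediction would be made on the wrong string.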

Enrich Data by Webservice

Next, you will call the Gender prediction API to infer the likely sex/gender for each row in your Excel file.

Drag and drop the Enrich Data by Webservice operator (Web Mining->Services->Enrich Data by Webservice) and connect it to the Read Excel operator.

[Screenshot: the Enrich Data by Webservice operator connected to Read Excel]

You can use our free Gender API or the freemium plan on Mashape. For this example, we shall use the free plain-text API, which is called with a URL of the kind given for the url parameter below.

NB: we also provide a REST JSON format, which is not used in this example.

We need to configure the Enrich Data by Webservice operator to pass the parameters and assign the result to a new attribute GenderScale, which ranges from -1 (male) to +1 (female):

– query type : ‘Regular Region’

– attribute type : ‘Numerical’

– regular region queries : add a single attribute ‘GenderScale’ containing the entire result of the API call (i.e. anything between the beginning of the line ^ and the end of the line $)

– request method : ‘GET’

– url : http://api.namsor.com/onomastics/api/gendre/<%FN%>/<%LN%>/fr (the placeholders <%FN%> and <%LN%> are replaced at runtime by each row's first name and last name)

– encoding : UTF-8
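For readers who want to see what the operator does for each row, the configuration above can be sketched in Python. This is only an illustration under assumptions: the helper names (build_url, parse_gender_scale, fetch_gender_scale) are hypothetical, and the sketch assumes the endpoint answers with a single plain-text number, as described in this article.

```python
import re
from urllib.parse import quote
from urllib.request import urlopen

# URL template taken from the operator configuration above.
URL_TEMPLATE = "http://api.namsor.com/onomastics/api/gendre/<%FN%>/<%LN%>/fr"


def build_url(first_name: str, last_name: str, template: str = URL_TEMPLATE) -> str:
    """Fill the placeholders with percent-encoded UTF-8 names (hypothetical helper)."""
    return (template
            .replace("<%FN%>", quote(first_name, safe=""))
            .replace("<%LN%>", quote(last_name, safe="")))


def parse_gender_scale(body: str) -> float:
    """Mimic the '^ ... $' regular region: the whole response line is the value."""
    match = re.match(r"^(.*)$", body.strip())
    if match is None:
        raise ValueError("expected a single-line response")
    return float(match.group(1))


def fetch_gender_scale(first_name: str, last_name: str, timeout: float = 10.0) -> float:
    """GET the URL and parse the answer; assumes the service is reachable."""
    with urlopen(build_url(first_name, last_name), timeout=timeout) as response:
        return parse_gender_scale(response.read().decode("utf-8"))


print(build_url("Philippe", "Jaffré"))
# http://api.namsor.com/onomastics/api/gendre/Philippe/Jaffr%C3%A9/fr
```

Percent-encoding the names as UTF-8 plays the same role as the encoding parameter of the operator: accented or non-Latin characters reach the API intact.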

[Screenshot: parameters of the Enrich Data by Webservice operator]

Write CSV

Next, you will write the output to a CSV file (Export->Data->Write CSV), setting an output file name and selecting UTF-8 encoding again.

Run the Process

Last, set the process encoding to UTF-8 and run it.

[Screenshot: running the process in RapidMiner]

The output should look like:

"FN";"LN";"GenderScale"
"Philippe";"Jaffré";-1.0
"Bertrand";"Collomb";-1.0
"André";"Lévy-Lang";-0.96

What’s the verdict? Women account for ~17% of the members of the French elite club ‘Le Siècle’ (2010).
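A share like this one can be computed from the CSV output with a few lines of Python. The sketch below is an assumption about how to do the counting, not the method used for the figure above: it treats any positive GenderScale as female, and the function name share_of_women is hypothetical. It runs on the three sample rows shown earlier, not on the full file.

```python
import csv
import io

# The three sample rows of the output, semicolon-separated as written by Write CSV.
SAMPLE = '''"FN";"LN";"GenderScale"
"Philippe";"Jaffré";-1.0
"Bertrand";"Collomb";-1.0
"André";"Lévy-Lang";-0.96
'''


def share_of_women(csv_text: str) -> float:
    """Return the fraction of rows whose GenderScale is positive (assumed female)."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    scores = [float(row["GenderScale"]) for row in reader]
    if not scores:
        raise ValueError("no data rows found")
    women = sum(1 for score in scores if score > 0)
    return women / len(scores)


print(share_of_women(SAMPLE))  # 0.0 (all three sample rows are scored as male)
```

To process the real output file, the text could be read with open(path, encoding="utf-8") and passed to the same function. Scores close to 0 are ambiguous, so a stricter analysis might exclude them instead of forcing a male or female label.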

[Chart: gender gap among the members of ‘Le Siècle’ (2010)]

Further reading:

Meet us on 29 April 2014 at DataTuesday Paris with Girls in Tech Paris, on the topic ‘Women & Data’.
