Tag Archives: Gender Studies

An API to measure the Gender Gap across all professional fields

NamSor API was presented yesterday at the amazing APIDays.io Paris conference. The Gender Gap Grader project will be featured as a Keynote at the next APIDays conference in May.

Download presentation : APIDays-slides.pdf

Further reading:

About NamSor

NamSor™ Applied Onomastics is a European designer of name recognition software. NamSor is committed to promoting diversity and equal opportunity. NamSor launched GendRE API, a free API to extract gender from personal names. We support the @GenderGapGrader initiative. http://namsor.com

About GenderGapGrader

GenderGapGrader’s mission is to publish gender gap estimates at the finest grain level, using whatever reference database we can identify for a particular industry: The Internet Movie Database (IMDB) for the film industry, “The Airman Database” for pilots… and more to come. http://gendergapgrader.com

Filed under General

NamSor and Gender Gap Grader are in the AngelList database of Startups, VCs and Angels

We’ve analyzed the gender gap in the AngelList database of 650k profiles… and we’re in it too, in perfect balance. Follow us on AngelList to hear more about our development in 2015 :) #datamining #machinelearning #bigdata #opendata

https://angel.co/namsor

https://angel.co/gender-gap-grader-1

Gender Gap Grading : read about the making-of and make yours!

GGG & AngelList – a making-of

Tools, methodology, data sources, data output used to produce the article GenderGapGrader: AngelList.

We’ve opened the free GendRE API which extracts gender from names. To make it usable by everyone, we’ve built an extension for RapidMiner, a leading open source data mining and predictive analytics software.

So you can run your own gender gap analysis, where and when it matters to you!

Data Sources:

Data Mining Tools:

Data Output:

Estimates:

Tutorial:

Video tutorial – How to Extract the Gender of Personal Names, using RapidMiner

RapidMiner is a leading software package for advanced analytics, including predictive analytics, data mining, and text mining. We’ve built an onomastics extension for RapidMiner to enrich any database and infer the gender of personal names in all languages, cultures, alphabets and countries. The GendRE API aims for high accuracy by using the whole name as context, recognizing that “Andrea Rossini” is most likely an Italian name and so male, whereas “Andrea Parker” is most likely an Anglo-Saxon name and so female; “声涛周” is most likely male; “O. Sokolova” is most likely female.

We’ve used RapidMiner and GendRE API to measure the gender gap among EU Officials, mining the 2014 European Union Directory. This video tutorial will show you step-by-step how it was done:

To redo this study or make your own, download RapidMiner with Onomastics extension and Documentation.

GGG & The Airman Directory – a making-of

Tools, methodology and data used to produce the article ‘GenderGapGrader: Airline Pilots’:

We’re disclosing the data used in the study. We’ve opened GendRE API which extracts gender from names. To make it usable by everyone, we’ve built an extension for RapidMiner, a leading open source data mining and predictive analytics software.

So you can run your own gender gap analysis, where and when it matters to you!

 

Data Files:

Data Mining Tools:

Data Scope:

  • Commercial/Airline Pilots :

SELECT ALL_PILOTS_GENDERIZED_LICENCED.COUNTRY,
       ALL_PILOTS_GENDERIZED_LICENCED.TYPE,
       ALL_PILOTS_GENDERIZED_LICENCED.LEVEL,
       ALL_PILOTS_GENDERIZED_LICENCED.gender,
       Sum(ALL_PILOTS_GENDERIZED_LICENCED.gender_scale) AS SumOfgender_scale,
       Count(ALL_PILOTS_GENDERIZED_LICENCED.[UNIQUE ID]) AS [CountOfUNIQUE ID]
FROM ALL_PILOTS_GENDERIZED_LICENCED
WHERE (COUNTRY='USA' And TYPE='P' And (LEVEL='C' Or LEVEL='A'))
   Or (COUNTRY<>'USA' And TYPE='Y' And LEVEL='Y')
GROUP BY ALL_PILOTS_GENDERIZED_LICENCED.COUNTRY,
         ALL_PILOTS_GENDERIZED_LICENCED.TYPE,
         ALL_PILOTS_GENDERIZED_LICENCED.LEVEL,
         ALL_PILOTS_GENDERIZED_LICENCED.gender;

Raw Estimates:

In bold, the estimates cited in the article or the infographics.

All pilots gender gap, overall statistics:

gender         SumOfgender_scale   CountOfUNIQUE ID
(blank)                                         219
Female                    35,419             48,106
Male                    -490,805            596,368
Unknown                       -6              7,513
Total Count                  n/a            652,206
% Female                   6.73%                n/a
% Male                    93.27%                n/a

All commercial pilots gender gap, overall statistics:

Gender         SumOfgender_scale   CountOfUNIQUE ID
Male                    -225,955            272,994
Female                    13,011             17,697
Unknown                       -4              2,322
Total Count                  n/a            293,013
% Female                   5.44%                n/a
% Male                    94.56%                n/a
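The percentages in the two tables above appear to come from the gender_scale sums rather than from the head counts: dividing the female sum by the combined magnitude of the female and male sums reproduces the published figures. Here is a minimal sketch of that arithmetic in Python. The function name is ours, and the reading of the scale (male scores negative, female scores positive, Unknown rows left out) is an assumption, not something the study states.

```python
def female_share(sum_female_scale: float, sum_male_scale: float) -> float:
    """Share of women estimated from summed gender_scale values.

    Assumption: male scores are negative and female scores positive, so the
    magnitude of each sum acts as a confidence-weighted head count.
    """
    female = abs(sum_female_scale)
    male = abs(sum_male_scale)
    return female / (female + male)


# Sums taken from the two tables above (Unknown rows are ignored).
all_pilots = female_share(35_419, -490_805)
commercial_pilots = female_share(13_011, -225_955)

print(f"All pilots: {all_pilots:.2%} female")
print(f"Commercial pilots: {commercial_pilots:.2%} female")
```

Weighting by the scale instead of counting heads means that names with an uncertain gender contribute less to the estimate.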

 

Commercial/airline pilots gender gap, by country:

NB: this table was uploaded to Wikipedia to facilitate the sharing of alternative statistics, such as actual gender gap disclosures by major national airlines.

Country | Pilots | Estimated % Female (from gender scale)
USA 218229 5.12%
UNITED KINGDOM 14684 6.37%
GERMANY 11881 7.11%
CANADA 6852 6.78%
SWITZERLAND 4736 6.45%
FRANCE 4396 7.62%
ITALY 2984 4.89%
AUSTRIA 2405 5.50%
SPAIN 2081 5.28%
NETHERLANDS 2068 6.10%
BELGIUM 2037 7.53%
MEXICO 1530 2.33%
AUSTRALIA 1472 6.58%
BRAZIL 1315 2.20%
SWEDEN 1260 8.20%
IRELAND 957 6.80%
JAPAN 732 5.58%
NORWAY 665 4.47%
ISRAEL 628 5.71%
SOUTH AFRICA 551 7.54%
DENMARK 485 4.37%
NEW ZEALAND 465 7.76%
INDIA 445 7.72%
ICELAND 427 15.63%
ARGENTINA 362 1.83%
GREECE 359 2.62%
KENYA 329 8.78%
POLAND 291 5.26%
TRINIDAD & TOBAGO 289 7.06%
VENEZUELA 280 3.79%
FINLAND 277 12.07%
CZECH REPUBLIC 263 2.73%
COLOMBIA 261 2.96%
LUXEMBOURG 215 9.56%
HONG KONG 214 3.89%
SRI LANKA 204 15.54%
CHILE 202 4.26%
PORTUGAL 193 2.93%
SINGAPORE 193 7.46%
UNITED ARAB EMIRAT 189 3.81%
CYPRUS 185 4.63%
GUATEMALA 160 2.78%
ECUADOR 156 4.47%
COSTA RICA 148 5.75%
HUNGARY 139 7.86%
NIGERIA 137 4.32%
PANAMA 131 7.07%
JAMAICA 127 8.86%
DOMINICAN REPUBLIC 120 2.04%
SAUDI ARABIA 120 3.19%
EL SALVADOR 105 3.02%
BAHAMAS 102 5.62%
PHILIPPINES 99 7.19%
NETHERLANDS ANTILL 86 5.48%
THAILAND 80 12.06%
WEST INDIES 78 5.51%
PERU 76 2.95%
SLOVENIA 74 14.21%
RUSSIA 69 9.76%
FRENCH WEST INDIES 69 2.74%
EGYPT 64 1.83%

 

Student pilots gender gap, by country:

Country | Estimated % Female | Students
USA 12.0%           93,395
GERMANY 9.0%             1,753
UNITED KINGDOM 8.5%             1,224
NETHERLANDS 10.8%                 746
SAUDI ARABIA 3.8%                 407
JAPAN 12.9%                 394
BELGIUM 11.3%                 306
SWITZERLAND 10.1%                 269
INDIA 12.5%                 262
EGYPT 0.9%                 233
NIGERIA 8.8%                 213
CANADA 11.9%                 210
MEXICO 2.9%                 177
ITALY 8.1%                 169
COLOMBIA 8.5%                 152
FRANCE 7.1%                 138
NORWAY 12.3%                 136
IRELAND 9.1%                 108
TURKEY 2.6%                 107
BRAZIL 6.2%                 101
BAHAMAS 8.9%                   89
AUSTRIA 4.7%                   88
ISRAEL 4.5%                   79
BAHRAIN 2.8%                   79
HONG KONG 9.5%                   71
RUSSIA 9.9%                   71
UNITED ARAB EMIRAT 17.1%                   66
SINGAPORE 23.3%                   65
ECUADOR 3.6%                   60
SPAIN 14.4%                   53
DENMARK 7.6%                   49
PANAMA 14.5%                   45
INDONESIA 10.2%                   43
SWEDEN 10.6%                   43
AUSTRALIA 4.1%                   39

 

Onomastic sampling for migration studies

On Friday morning, I had the opportunity to present our breakthrough data mining technology at Regent’s University Turkish Migration Conference (TMC2014, London).

The supporting presentation can be downloaded here (20140530_TMS2014_Pitch_vFf.pdf) or viewed online here.

During the following sessions by researchers from various countries (Turkey, US, UK, Germany, the Netherlands, Sweden, Norway, Belgium …), I learned some of the ‘jargon’ of migration studies and also something about the particular research methodologies applied in that field.

My initial vision was that onomastics (the recognition of personal names) could be applied to discover new migration patterns. It was based on several preliminary meetings with international organizations concerned with migration issues. Census data can take up to three years to process. As states struggle to provide timely and accurate data to international organizations (such as the OECD, the IOM, the United Nations High Commissioner for Refugees (UNHCR), …), these organizations can turn to Big Data to identify and monitor new trends. There are challenges in identifying relevant data sources to provide valuable information about less digitally connected migrants. Twitter, LinkedIn, Google, Facebook, D&B, Thomson WoS … combined with applied onomastics can tell us a lot about the changing migration patterns of STEM Workers, innovators and entrepreneurs.

STEM Workers: workers in science, technology, engineering, and mathematics; art is occasionally considered as well (STEAM Workers).

With several TMS2014 sessions focused on the question of Turkish identity, or the particular migration and integration patterns of the Turkish, Kurdish, Alevi or Circassian communities, applied onomastics clearly offers an innovative tool to look at data from different angles (nationality/birth place/ethnicity/gender/…).

However, I found that many research studies are conducted based on an initial theoretical hypothesis. Researchers then apply various qualitative or quantitative methods (occasionally both) to assess the hypothesis. Pure quantitative methods such as ‘data mining’ or ‘graph analysis’ are seen as de-humanizing by researchers (anthropologists, sociologists, historians …) who are primarily interested in the human story of migration. Most researchers conduct surveys to gather the data for their study : they find people, talk to them, ask questions. How do researchers identify the group of people to be surveyed (the sample)? During the conference, I learned another piece of jargon: network/snowball sampling.

Network/snowball sampling: Snowball sampling is based on the selection of target people in personal networks. In a first step, important people within the target group are identified (initial sample) who themselves identify further people who can be also addressed for the survey (McKenzie & Mistiaen, 2007, p. 2; Salentin, 1999, p. 124).

As is often the case, this new word was the magic keyword to find additional resources and understand how NamSor technology could fit with the current state of migration research methodology:

This document clearly describes the various methodologies to identify the initial population of a study and the various sampling procedures. Onomastic sampling is one of them.

‘In many countries, migrants constitute a substantial part of society. In public opinion research, however, they are often inadequately or not at all considered. This paper gives a systematic overview of the underlying methodological challenges that cause this situation. Those challenges are twofold and concern (1) the definition and distinction of the terms migrant and foreigner to describe the target group and (2) the selection of adequate sampling procedures.’

‘The methodological challenge of selecting adequate sampling procedures

Even after defining the target population, researchers still face difficulties regarding sampling. The problems tackled can be divers, for instance in what way the target population can be contacted (which survey modes are culturally accepted?) and how the individual respondents can be selected (e.g. does last-birthday work?). The paper discusses four central sampling procedures which regularly come up in the literature and which are seemingly appropriate for these kinds of surveys:

1. Sampling procedures on the basis of administrative records,

2. Area sampling, like e.g. random-route-procedures,

3. Network/snowball sampling, and

4. Onomastic sampling procedures based on foreign names from directories.’

How can NamSor software help?

1. Sampling procedures on the basis of administrative records

In this sampling method, the administrative records do not reflect the fine-grain identity of the populations: ‘Turkish nationality’ or ‘Born in Turkey’ encompasses many different populations. Applied onomastics can help refine samples to more targeted populations (Turkish, Alevi, Kurdish, Syrian, …)

2. Area sampling, like e.g. random-route-procedures

In this sampling method, it’s critical to understand the geo-demographics of a territory to know where different migrant populations are concentrated. Applied onomastics can help assess the density of migrant populations at various levels (region/city/district or road) from various public data sources.

3. Network/snowball sampling

In this sampling method, the personal network of the researcher is used as an initial seed to identify further prospects for interviews. Applied onomastics could help analyse personal networks of researchers (from social networks such as Twitter, or academic sources such as bibliographic databases) to identify larger seed networks and generate better sampling. That could help reduce the risk of biases induced by the researcher’s network (reinforcing the researcher’s own personal or cultural biases).

4. Onomastic sampling procedures based on foreign names

Dictionaries of given names and family names associated with a particular culture have been used for sampling.

NamSor software goes beyond this technique, using sociolinguistics to recognize the likely origin of a person from a (firstName, lastName) pair, with high accuracy. NamSor software can help researchers conduct onomastic sampling, not just from telephone directories but also from a wide range of modern data sources : social networks, opt-in commercial databases, … with high precision and fine-grain targeting.

Conclusion

NamSor’s powerful technology raises many data privacy and ethical questions, but we’re glad to say that if science and migration studies can be good for society, NamSor can be too.

About NamSor:
NamSor’s mission is to help understand international flows of money, ideas and people. NamSor launched GendRE API, a free API to conduct analysis of gender equality using opendata. http://namesorts.com/api/

RapidMiner to enrich Gender data

[UPDATE September-2014 : watch the 3-minute tutorial video]

[UPDATE July-2014 : NamSor Onomastics Extension is now available in RapidMiner MarketPlace]

[UPDATE June-2014 : we have built an open source (AGPL) extension for RapidMiner, get it on GitHub]

With Open Data from the Internet Movie Database (IMDb) and a gender prediction API, it was possible to assess the gender gap in the global film industry in minutes. We found that only ~22% of three hundred thousand movie directors worldwide are women.

We used technical skills and a small program to do this first analysis. Could it be done using a friendlier data mining tool? This article shows how a similar gender study can be conducted with RapidMiner.

Get RapidMiner

Install RapidMiner from SourceForge with additional extensions (Help->Updates and Extensions) : Text Mining and Web Mining.

In this example, we will read an Excel file with two columns (firstName, lastName) and enrich it with an additional column containing the Gender (on a -1..+1 scale). Our test file is a list of members of the exclusive Club ‘Le Siècle’ (2010), which periodically gathers the French élite : Club_LeSiecle.xlsx (Source : La Marseillaise/cryptome).

Import Excel Data

Drag and drop the Read Excel operator (Import->Data->Read Excel) and launch the Import Configuration Wizard.

Default values should be OK through the wizard, except Encoding should be set to UTF-8 (Unicode, especially required if you would like to genderize Chinese, Russian or Arabic names).

Enrich Data by Webservice

Next, you will call the Gender prediction API to infer the likely sex/gender for each row in your Excel file.

Drag and drop the Enrich Data by Webservice operator (Web Mining->Services->Enrich Data by Webservice) and connect it to the Read Excel operator.

You can use our free Gender API or the Freemium on Mashape. For this example, we shall use the free plain text API, entering this kind of URL:

NB: we also provide a REST JSON format, not used in this example

We need to configure the Enrich Data by Webservice operator to pass the parameters and assign the result to a new variable GenderScale (-1 is Male ..+1 is Female):

– query type : ‘Regular Region’

– attribute type : ‘Numerical’

– regular region queries : add a single attribute ‘GenderScale’ containing the entire result from calling the API (i.e. anything between the beginning of the line ^ and the end of the line $)

– request method : ‘GET’

– url : http://api.namsor.com/onomastics/api/gendre/<%FN%>/<%LN%>/fr (<%FN%> and <%LN%> will be replaced by the firstName and lastName at runtime)

– encoding : UTF-8
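For readers who prefer a script to the operator, the same request can be sketched in a few lines of Python. This is only an illustration of the URL template and the plain-text response described above: the helper names are hypothetical, the range check is our own addition, and whether the endpoint is still served at this address is an assumption. No network call is made here.

```python
from urllib.parse import quote

# URL template from the operator configuration above; whether this endpoint
# is still served is an assumption.
URL_TEMPLATE = "http://api.namsor.com/onomastics/api/gendre/{fn}/{ln}/fr"


def build_gendre_url(first_name: str, last_name: str) -> str:
    """Fill the template, percent-encoding each name as UTF-8."""
    return URL_TEMPLATE.format(
        fn=quote(first_name, safe=""),
        ln=quote(last_name, safe=""),
    )


def parse_gender_scale(body: str) -> float:
    """Parse the plain-text response: a single number between -1 and +1."""
    value = float(body.strip())
    if not -1.0 <= value <= 1.0:
        raise ValueError(f"unexpected gender scale: {value}")
    return value


print(build_gendre_url("André", "Lévy-Lang"))
print(parse_gender_scale("-0.96\n"))
```

Percent-encoding the names as UTF-8 plays the same role as the encoding setting in the operator: accented, Chinese, Russian or Arabic names survive the trip to the API.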

Write CSV

Next, you will write the output to a CSV file (Export->Data->Write CSV), setting an output file name and selecting UTF-8 encoding again.

Run the Process

Last, set the process encoding to UTF-8 and run it.

The output should look like:

"FN";"LN";"GenderScale"
"Philippe";"Jaffré";-1.0
"Bertrand";"Collomb";-1.0
"André";"Lévy-Lang";-0.96

What’s the verdict ? Women account for ~17% of the French Elite Club ‘Le Siècle’ (2010).
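The article does not say exactly how the ~17% was derived from the CSV. One plausible reading is to count the rows with a positive GenderScale. The sketch below does that in Python on the three sample rows shown above (all of them male, so the share is zero for this extract); the function name and the counting rule are assumptions.

```python
import csv
import io

# The three sample rows from the output above.
SAMPLE = '''"FN";"LN";"GenderScale"
"Philippe";"Jaffré";-1.0
"Bertrand";"Collomb";-1.0
"André";"Lévy-Lang";-0.96
'''


def female_share(csv_text: str) -> float:
    """Fraction of rows whose GenderScale is positive (scored as female)."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    scores = [float(row["GenderScale"]) for row in reader]
    females = sum(1 for score in scores if score > 0)
    return females / len(scores)


print(f"{female_share(SAMPLE):.0%} of the sample rows are scored as female")
```

For a real run, replace SAMPLE with the contents of the file written by the Write CSV operator, read with UTF-8 encoding.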

Further reading:

Meet us on 29 April 2014 at DataTuesday Paris with Girls in Tech Paris, on the topic ‘Women & Data’.

New Gender App to enrich Android contacts Title information (Mr,Ms,…)

As a sample use case for the Gendre API, we’ve created an open source Android application on GitHub to enrich Android contacts with classic Title information (Mr,Ms,…) or the more iconic Gender (♀,♂,∅) and Heart (♥,♤,♢), inferred from the contact name.

To recognize name gender, the statistical approach works really well and we’ve deployed a free API with more than half a million unique names to deliver excellent results. But I have seen objections as to the feasibility of that statistical approach working globally (“But I don’t think it is feasible to cover all the names in the world. – Ramesh“)

We think we can cover all the names in the world by combining different approaches, including sociolinguistics and machine learning, and we have a roadmap to do that.

For example, in most countries, the gender is ‘encoded’ in the first name (John, Isabel, …) but in other countries, the gender is encoded in the last name (O. Sokolova is probably Slavic/of Slavic origin and a Female).

Rare names (or invented names) are also difficult to classify using the statistical approach but we can guess their likely gender by looking at whether they ‘sound’ male or female, according to a particular culture (again having the last name is critical to pin down a particular culture/locale).

The current GENDRE App features are:

  • The gender prediction runs as a background service (every 10 sec, or 1 min, or 10 min or 1 hour)
  • Possibility to choose between three Title formats : Classic (Ms.,Mr.,M.), Gender (♀,♂,∅), Heart (♥,♤,♢)
  • Your existing Title data (Mr., Dr. etc.) is not overwritten, unless you specifically request a wipe
  • Once all contacts are genderized, the App shows a summary of how many Female / Male contacts were detected
  • You can share this #funstat on Twitter, if you like
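As an illustration of the three Title formats listed above, here is a small Python sketch that maps a gender scale to a title. It is not the App’s code (the App is an Android application on GitHub): the function name, the assumption that each triple is ordered female, male, unknown, and the 0.1 cut-off for ‘unknown’ are ours.

```python
# Title formats from the feature list above. The (female, male, unknown)
# ordering of each triple is an assumption.
TITLE_FORMATS = {
    "classic": ("Ms.", "Mr.", "M."),
    "gender": ("♀", "♂", "∅"),
    "heart": ("♥", "♤", "♢"),
}


def title_for(scale: float, style: str = "classic", threshold: float = 0.1) -> str:
    """Pick a title from a gender scale in [-1, +1].

    The threshold below which a name counts as unknown is a hypothetical
    value chosen for this example.
    """
    female, male, unknown = TITLE_FORMATS[style]
    if scale >= threshold:
        return female
    if scale <= -threshold:
        return male
    return unknown


print(title_for(0.998, "gender"))
print(title_for(-0.96, "classic"))
print(title_for(0.02, "heart"))
```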

You can find GENDRE App on GitHub (https://github.com/namsor/gendreapp). Feedback, as well as Open source contributors, are both welcome 🙂

To get the current version of GENDRE, we recommend using the F-Droid App Store.

Onomastics API for Gender Studies

[AGENDA] Meet us on 29 April 2014 at DataTuesday Paris with Girls in Tech Paris, on the topic ‘Women & Data’.

Gender Equality in French Politics

Women fill 26.17% of the seats at the French National Assembly (‘L’Assemblée Nationale’), according to the count of ‘M.’ and ‘Mme’ at
http://www.assemblee-nationale.fr/qui/xml/liste_alpha.asp?legislature=14
That’s more than double the figure of ten years ago (2002: 10.9%), good job ladies!

If that list did not indicate M. and Mme, could we still recognize the gender from the politician’s name? NamSor has published a simple API for Gender Studies which would give the following result: 26.31% (more than 99% accurate compared to the actual figure).
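The ‘more than 99%’ figure can be checked with a line of arithmetic, assuming it means the agreement between the estimated and the actual share of women (one minus the relative error), not the accuracy on each individual name. A minimal Python sketch, with a function name of our own:

```python
def aggregate_agreement(actual_pct: float, estimated_pct: float) -> float:
    """Return 1 minus the relative error between two percentages."""
    return 1 - abs(estimated_pct - actual_pct) / actual_pct


# Figures quoted above: 26.17% counted from M./Mme, 26.31% inferred from names.
agreement = aggregate_agreement(26.17, 26.31)
print(f"Agreement on the share of women: {agreement:.2%}")
```

Individual misclassifications in opposite directions can cancel out in such an aggregate, so the per-name accuracy may be lower than this number suggests.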

What about the Corporate World?

Playing with old data from a previous life in the corporate world (which cannot be disclosed), applied onomastics tells us that among ~4,000 top company executives with a median base salary of $230,000 (USD), men landed a neat $890 million while women got $143 million in total. This huge gap is the result of fewer women having a top job and men earning ~20% more on average for the same job.

Currently, the Gendre API is in Beta Version and free to use.

Read also:

GenderEquality.java

You can download the sample program GenderEquality.java.zip

Detailed input/output

https://namsor-gendre.p.mashape.com/gendre/Damien/Abad/fr returned -0.9979281991518565
https://namsor-gendre.p.mashape.com/gendre/Laurence/Abeille/fr returned 0.9984725610426144
https://namsor-gendre.p.mashape.com/gendre/Ibrahim/Aboubacar/fr returned -1.0
https://namsor-gendre.p.mashape.com/gendre/Élie/Aboud/fr returned -0.9749559773038545
https://namsor-gendre.p.mashape.com/gendre/Bernard/Accoyer/fr returned -0.9996548690100067
https://namsor-gendre.p.mashape.com/gendre/Patricia/Adam/fr returned 0.9997681752981121
[…] namsor_api_calls.zip
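A log in the format above is easy to tally. The Python sketch below is not the GenderEquality.java sample program. It is a hedged illustration that parses the six lines shown and counts the names with a positive score. The regular expression and function name are our own, and treating any positive score as female is an assumption.

```python
import re
from urllib.parse import unquote

# The six log lines shown above.
LOG = """\
https://namsor-gendre.p.mashape.com/gendre/Damien/Abad/fr returned -0.9979281991518565
https://namsor-gendre.p.mashape.com/gendre/Laurence/Abeille/fr returned 0.9984725610426144
https://namsor-gendre.p.mashape.com/gendre/Ibrahim/Aboubacar/fr returned -1.0
https://namsor-gendre.p.mashape.com/gendre/Élie/Aboud/fr returned -0.9749559773038545
https://namsor-gendre.p.mashape.com/gendre/Bernard/Accoyer/fr returned -0.9996548690100067
https://namsor-gendre.p.mashape.com/gendre/Patricia/Adam/fr returned 0.9997681752981121
"""

# Matches ".../gendre/<first>/<last>/<country> returned <score>".
LINE = re.compile(
    r"/gendre/(?P<fn>[^/]+)/(?P<ln>[^/]+)/(?P<cc>[^/\s]+)"
    r" returned (?P<scale>-?\d+(?:\.\d+)?)"
)


def parse_calls(log: str) -> list[tuple[str, str, float]]:
    """Extract (first name, last name, gender scale) from each log line."""
    return [
        (unquote(m["fn"]), unquote(m["ln"]), float(m["scale"]))
        for m in LINE.finditer(log)
    ]


calls = parse_calls(LOG)
women = sum(1 for _, _, scale in calls if scale > 0)
print(f"{women} of {len(calls)} names scored as female")
```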
