
NamSor presented during Symposium on Academic Excellence

Our friend Tania Vichnevskaia of the French National Institute for Health (INSERM) presented the paper ‘Applying onomastics to scientometrics’ yesterday at the IREG international symposium organised by the University of Maribor and Shanghai Jiao Tong University.

In 2014, NamSor, a private start-up company, was approached by a European country to help measure the ‘brain drain’ affecting its competitiveness in the BioTech sector and to produce a global map of its scientific Diaspora (who they are, where they are and what they are doing). The objective was to build up the country’s international scientific cooperation and to engage its Diaspora.

Serendipity led analysts to discover interesting patterns in the way scientists’ names affect co-authorship and citation – not just for this particular country, but globally.

Last year, at the ICOS2014 conference at Glasgow University, we presented how data mining millions of scientific articles in the PubMed/PMC life sciences database uncovered striking patterns in the way scientists’ names correlate with whom they publish and whom they cite in their papers.

We were interested in mining the large commercial bibliographic databases (Thomson WoS, Scopus) because they offer better data quality on citations and useful additional information, compared to PubMed:

– firstly, they have the full name in addition to the short name cited with just initials; this significantly reduces the error rate of onomastic classification

– secondly, they link scientists to research institutions (affiliations) and geographies (country of affiliation); this allows additional analysis on the topic of Diasporas and brain drain, comparing, for example, the research output of Chinese / Chinese American scientists in the US with that of scientists of Mainland China;

– thirdly, those databases have a larger coverage in terms of scientific disciplines, allowing comparison between different fields of research.

So a collaboration started between NamSor and bibliometric experts at INSERM, the French National Institute for Health, to evaluate and visualize the effects of migration, Diaspora engagement and possibly cultural biases in Science.

This is Tania’s presentation at the conference:

What does the ‘onomastic millefeuille‘ of the global Cancer Research community look like?

201501_ThomsonWoS_CancerResearch

On this same topic:

The agenda of the Symposium is presented below

2nd Maribor Academicus Event

Academic Excellence: BETWEEN HOLY GRAIL AND MEASURABLE OBJECTIVES

International symposium  organised by University of Maribor and Shanghai Jiao Tong University

within the IREG Project on Academic Excellence

19-20 January 2015, Maribor, Slovenia

Higher education can importantly benefit from the rankings and league tables when used in a context with clear perspective of what ranking actually reflects (Prof. Jan Sadlak, President of IREG)

Active participants at the conference will be:

  • Prof. Jan Sadlak, President of IREG,
  • Prof. Gero Federkeil, CHE (Coordinator of Multi-Ranking),
  • Prof. Nian Cai Liu, Jiao Tong University in Shanghai (Author of the Shanghai ranking list),
  • Prof. Seeram Ramakrishna, National University of Singapore,
  • Prof. Santo Fortunato, Aalto University,
  • Prof. Karin Stana Kleinschek, University of Maribor,
  • Prof. Henryk Ratajczak, member of Czech Academy of Sciences,
  • Prof. Edvard Kobal, Slovenian Science Foundation,
  • Roberta Sinatra, PhD, Northeastern University,
  • Tania Vichnevskaia, French National Institute for Health (INSERM),
  • Prof. Andrée Sursock, Senior Adviser at EUA,
  • Prof. Øivind Andersen, University of Oslo.

About NamSor

NamSor™ Applied Onomastics is a European vendor of Name Recognition Software (NamSor sorts names). NamSor’s mission is to help understand international flows of money, ideas and people.

NamSor launched FDIMagnet, a consulting offering to help Investment Promotion Agencies and High-Tech Clusters leverage a Diaspora to connect with business and scientific communities abroad.


Popular names of Delhi, Incredible India

Incredible India is the name of a successful place marketing initiative launched in 2002 by the Indian Government, using the incredible diversity of India in terms of colors, landscapes, people, languages etc. to promote tourism in India. Because of this same diversity, Indian onomastics is a tough nut to crack. At NamSor, we’ve opened a free API to predict the gender of personal names according to the various languages and cultures (Andrea Rossini is male, Andrea Parker is female, Jean Durieux is …). We aim for 95 to 99% accuracy and 95 to 99% recall, in every country where possible.

This work is important. The status of women in India is a very current issue. At NamSor, we believe in the value of open data mining initiatives (such as Gender Gap Grader) to advance the empowerment of women worldwide. So we work hard to understand how names vary across the different states and regions of India. For example, below are the most frequent male/female names in Delhi.

Most popular female and male names in Delhi, India

Name          Female   Male    Likely Gender   Total   Pct
Sunita        51526    82      Female          51614   0.5%
Poonam        33524    79      Female          33603   0.3%
Raj Kumar     127      33313   Male            33440   0.3%
Anita         32816    45      Female          32861   0.3%
Ashok Kumar   29       29747   Male            29776   0.3%
Manoj Kumar   35       29197   Male            29232   0.3%
Anil Kumar    32       28943   Male            28975   0.3%
Geeta         28547    30      Female          28577   0.3%
Sunil Kumar   27       27438   Male            27465   0.3%
Santosh       22557    4465    Both            27022   0.3%
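
The ‘Likely Gender’ column can be read as a simple rule applied to the two counts. The sketch below is only an illustration: the function name `likely_gender` and the 90% threshold are assumptions, not the rule NamSor actually applies, although that threshold happens to reproduce the labels shown in the table.

```python
def likely_gender(female: int, male: int, threshold: float = 0.9) -> str:
    """Label a name from its female/male counts.

    The 0.9 threshold is an assumption made for illustration only.
    """
    total = female + male
    if total == 0:
        return "Unknown"
    female_share = female / total
    if female_share >= threshold:
        return "Female"
    if female_share <= 1 - threshold:
        return "Male"
    # Neither gender dominates clearly enough.
    return "Both"


# Counts copied from the Delhi table above.
delhi = {
    "Sunita": (51526, 82),
    "Raj Kumar": (127, 33313),
    "Santosh": (22557, 4465),
}
labels = {name: likely_gender(f, m) for name, (f, m) in delhi.items()}
print(labels)
```

Santosh is the interesting case: roughly 83% of the records are female, which is below the assumed threshold, so the name is left as ‘Both’.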

About NamSor

NamSor™ Applied Onomastics is a European designer of name recognition software. NamSor is committed to promoting diversity and equal opportunity. NamSor launched GendRE API, a free API to extract gender from personal names. We support the @GenderGapGrader initiative. http://namsor.com

About GenderGapGrader

GenderGapGrader’s mission is to publish gender gap estimates at the finest grain level, using whatever reference database we can identify for a particular industry: The Internet Movie Database (IMDB) for the film industry, “The Airman Database” for pilots… and more to come. http://gendergapgrader.com

 


NamSor and Gender Gap Grader are in the AngelList database of Startups, VCs and Angels

We’ve analyzed the gender gap in the AngelList database of 650k profiles… we’re in it too. In perfect balance. Follow us on AngelList and hear more about our development in 2015 :) #datamining #machinelearning #bigdata #opendata

https://angel.co/namsor

https://angel.co/gender-gap-grader-1

GENDERGAP_infoviz_web

Gender Gap Grading : read about the making-of and make yours!


Popular baby names in Tripura, Incredible India

Incredible India is the name of a successful place marketing initiative launched in 2002 by the Indian Government, using the incredible diversity of India in terms of colors, landscapes, people, languages etc. to promote tourism in India. Because of this same diversity, Indian onomastics is a tough nut to crack. At NamSor, we’ve opened a free API to predict the gender of personal names according to the various languages and cultures (Andrea Rossini is male, Andrea Parker is female, Jean Durieux is …). We’ve already made it very accurate for most countries. Still, we have a lot of work to do on Indian names, as the precision of GendRE API v0.0.17 for India is not yet at the right standard. We aim for 95 to 99% accuracy and 95 to 99% recall, in every country where possible.

This work is important. The status of women in India is a very current issue. At NamSor, we believe in the value of open data mining initiatives (such as Gender Gap Grader) to advance the empowerment of women worldwide. So we work hard to understand how names vary across the different states and regions of India. For example, below are the most frequent male/female names in the state of Tripura. Stay tuned for the next version of GendRE API, as it will have much better precision when predicting the gender of Indian names. In the meantime, further reading:

Most popular female and male names in Tripura, India

Name Female Male Total Gender
Sabita 6364 6 6370 Female
Pradip 9 5077 5086 Male
Kalpana 4481 4481 Female
Anita 4424 13 4437 Female
Rina 4182 6 4188 Female
Ratna 4133 13 4146 Female
Ratan 20 4112 4132 Male
Narayan 11 4102 4113 Male
Namita 4079 6 4085 Female
Uttam 19 4043 4062 Male
Sbapan 6 3981 3987 Male
Dilip 7 3903 3910 Male
Dipali 3898 6 3904 Female
Bishbajit 12 3853 3865 Male
Gita 3853 3853 Female
Tapan 9 3798 3807 Male
Anjali 3617 3617 Female
Ranjit 7 3508 3515 Male
Sanjit 14 3443 3457 Male
Sbapna 3429 3429 Female
Lakshi 3366 29 3395 Female
Soma 3347 3347 Female
Sabitri 3287 3287 Female
Kajal 1930 1349 3279 Both
Suman 58 3218 3276 Male
Shipra 3244 3244 Female
Purnima 3168 3168 Female
Sunil 6 3119 3125 Male
Sujit 14 3103 3117 Male
Rita 3111 3111 Female
Sumitra 3088 10 3098 Female
Bimal 10 3086 3096 Male
Shefali 3091 3091 Female
Ajit 12 3050 3062 Male
Aarati 3022 3022 Female
Anjana 2983 8 2991 Female
Malati 2963 2963 Female
Babul 8 2952 2960 Male
Archana 2915 2915 Female
Samir 9 2852 2861 Male
Gautam 2840 2840 Male
Gopal 9 2813 2822 Male
Dipak 2791 2791 Male
Rekha 2756 2756 Female
Dulal 7 2736 2743 Male
Basanti 2727 2727 Female
Shyamal 11 2711 2722 Male
Minati 2677 2677 Female
Manik 21 2644 2665 Male
Shilpi 2632 7 2639 Female
Sima 2619 6 2625 Female
Bikash 10 2569 2579 Male
Sanjay 2540 2540 Male
Subhash 2529 2529 Male
Sajal 68 2443 2511 Male
Abhijit 8 2488 2496 Male
Litan 26 2465 2491 Male
Parimal 16 2465 2481 Male
Pratima 2421 2421 Female
Anima 2374 2374 Female
Manju 2309 50 2359 Female
Bina 2344 2344 Female
Sandhya 2327 10 2337 Female
Anil 2321 2321 Male
Mamata 2303 8 2311 Female
Ruma 2299 2299 Female
Rabindr 2295 2295 Male
Rajib 9 2263 2272 Male
Biplab 7 2250 2257 Male
Sukumar 2242 2242 Male
Abdul 2230 2230 Male
Chandan 13 2212 2225 Male
Shikha 2213 2213 Female
Rajesh 2196 2196 Male
Manoranjan 2195 2195 Male
Milan 1435 752 2187 Both
Nirmal 9 2173 2182 Male
Sarasbati 2177 2177 Female
Raju 46 2117 2163 Male
Aparna 2146 2146 Female
Zarna 2061 2061 Female
Rakhi 2044 10 2054 Female
Mayarani 2033 2033 Female
Tapas 2030 2030 Male
Rakesh 10 2009 2019 Male
Jyotasna 2005 2005 Female
Jayanti 1985 7 1992 Female
Santosh 1950 1950 Male
Subrat 1918 1918 Male
Ranajit 1917 1917 Male
Sandhyarani 1912 1912 Female
Bijay 22 1845 1867 Male
Suchitra 1833 15 1848 Female
Mira 1846 1846 Female
Haradhan 1815 1815 Male
Kabita 1784 8 1792 Female
Niranjan 1792 1792 Male
Gitarani 1772 1772 Female
Pramila 1761 6 1767 Female
Manika 1740 1740 Female


moving voices from the grave

It’s not every day one hears the voice of Tolstoy, expressing himself in French. It didn’t happen during a séance, but during an interview organized by KazTV at the ‘Institut des Archives Sonores’, Paris. Thank you Mr Edison!

This institute, created by Franklin PICARD, resulted from the merger of two large private collections of voice recordings from all over the world. Where do they come from? From the grave, but before that – where did they come from? Onomastics, the science of proper names, is a powerful tool for searching such archives for the origin of the voices. They often have several origins, whether you think in terms of geography, political boundaries or culture: the voice of President Kennedy is the voice of America, but it’s also the voice of the Irish Diaspora.

During Soviet times, thousands of voices of poets, writers, philosophers and scientists were published under one flag – the USSR. How can we find them today? In the collections of the IAS, long unheard voices are sleeping treasures for countries such as Kazakhstan – as they rediscover their cultural identities and long for their dead poets.

We’ve recently used NamSor software to help Lithuania attract FDI or build expert networks in Life Sciences. But there could be hidden treasures in cultural archives (cinema, music, photography, literature, arts, …) which data mining can help unearth, opening a whole range of new possibilities.

Watch the interview by Kazakhstan National TV (in Russian, dubbing in Kazakh)

201409_NamSor_KazTv

video mirror


Onomastics to measure cultural bias in medical research

Elian CARSENAT, NamSor Applied Onomastics

Dr. Evgeny Shokhenmayer, e-onomastics

Abstract

This project involves the analysis of over ten million medical research articles from PubMed and over three million names of scientists, either authors or people mentioned in citations. We propose to evaluate the correlation between the onomastic class of the article authors and that of the citation authors. We will demonstrate that a cultural bias exists and that it evolves over time. Between 2007 and 2008, the ratio of articles authored by Chinese scientists (or scientists with Chinese names) nearly tripled. We will evaluate how fast this surge in Chinese research material (or research material produced by scientists of Chinese origin) became cross-referenced by other authors with Chinese or non-Chinese names. We hope to find that onomastics provides a good enough estimation of the cultural bias of a research community. The findings can improve the efficiency of a particular research community, for the benefit of Science and of humanity as a whole.

This paper was prepared for ICOS2014, the 25th International Congress of Onomastic Sciences, the premier conference in the field of name studies

Introduction

PubMed/PMC is a large collection of scientific publications in the life sciences. We used the 2013 data dump for data mining, with 14 million articles and 3.3 million author names. Some of the names are duplicates due to different spellings, inconsistent use of initials and other data quality issues.

We used NamSor software to allocate an onomastic class to each author name. NamSor software was initially designed to analyse big data in the fields of economic development[1], business and marketing. The method for anthroponymic classification can be summarized as follows: judging from the name only and the publicly available list of all ~150k Olympic athletes since 1896 (and other similar lists of names), for which national team would the person most likely run? Here, the United States is typically considered a melting pot of other ‘cultural origins’ (Ireland, Germany, etc.) and not an onomastic class of its own.
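
To make the ‘which national team?’ question concrete, here is a deliberately naive sketch. It is not NamSor’s algorithm: it only looks up a surname in a reference roster and returns the team it appears with most often. The roster `REFERENCE` and the function names are hypothetical and invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical reference roster of (surname, team) pairs, standing in for a
# public list such as the Olympic athletes mentioned above.
REFERENCE = [
    ("Rossi", "IT"), ("Rossi", "IT"), ("Rossi", "CH"),
    ("Dupont", "FR"), ("Dupont", "BE"), ("Dupont", "FR"),
    ("Wang", "CN"), ("Wang", "CN"), ("Wang", "TW"),
]


def build_index(pairs):
    """Count how often each surname appears for each team."""
    index = defaultdict(Counter)
    for surname, team in pairs:
        index[surname.lower()][team] += 1
    return index


def most_likely_team(surname, index):
    """Return (team, share of roster entries), or (None, 0.0) if unseen."""
    counts = index.get(surname.lower())
    if not counts:
        return None, 0.0
    team, hits = counts.most_common(1)[0]
    return team, hits / sum(counts.values())


index = build_index(REFERENCE)
print(most_likely_team("Rossi", index))
print(most_likely_team("Smith", index))
```

A real classifier would also have to cope with spelling variants, given names and names absent from every roster; the share returned here is only a crude stand-in for a confidence level.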

The breakdown of author names by onomastic classes is represented below :

2014_ICOS_NamSor_paper_vF_pic1

The largest groups of unique names in PubMed are British, French, German, Italian, Indian, Spanish, Dutch, etc.

An author with a French name might have a name from Brittany, Corsica or Limousin … or might have a French Canadian name, or a Belgian French name, or might be an American professor with French ancestry.

Scientists’ performance is often measured by the number of publications and the number of times a publication is cited by other publications (bibliometric rankings).

The table below shows the number of publications and the number of citations, by onomastic classes (top 20), as well as the ratio between the two metrics:

Onomastic class   Articles (A)   Citations (C)   Ratio (C/A)
(GB,LATIN)        557,177        1,664,415       3.0
(FR,LATIN)        272,150        743,471         2.7
(DE,LATIN)        192,778        448,103         2.3
(JP,LATIN)        172,866        361,682         2.1
(IT,LATIN)        187,564        323,771         1.7
(IE,LATIN)        86,161         422,103         4.9
(NL,LATIN)        102,982        321,787         3.1
(AT,LATIN)        78,199         339,819         4.3
(CN,LATIN)*       219,040        186,464         0.9
(IN,LATIN)        153,555        221,332         1.4
(ES,LATIN)        113,407        228,650         2.0
(PL,LATIN)        47,961         268,115         5.6
(SE,LATIN)        65,717         237,017         3.6
(FI,LATIN)        35,533         247,231         7.0
(KR,LATIN)        146,444        105,605         0.7
(TW,LATIN)*       88,822         162,132         1.8
(GR,LATIN)        51,564         196,056         3.8
(DK,LATIN)        42,403         181,199         4.3
(BE,LATIN)        44,647         162,146         3.6
(CH,LATIN)        32,295         162,495         5.0
*CN+TW            307,862        348,596         1.1

This table tells us that scientists with British names have published 557 thousand articles in PubMed and have been cited about 1.66 million times in other PubMed articles: the ratio is 3.

Articles written by authors with Italian names have been relatively less cited (with a ratio of 1.7) while the articles written by authors with Irish names or Finnish names have been more cited (with ratios respectively 4.9 and 7).
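
The ratio quoted here is simply citations divided by articles authored. As a quick check, the snippet below recomputes it for the four classes mentioned in the text, using the counts from the table; the variable names are only illustrative.

```python
# (articles authored, citations) copied from the table above.
rows = {
    "GB": (557_177, 1_664_415),
    "IT": (187_564, 323_771),
    "IE": (86_161, 422_103),
    "FI": (35_533, 247_231),
}

# Citations per authored article, rounded to one decimal as in the table.
ratios = {onoma: round(c / a, 1) for onoma, (a, c) in rows.items()}
print(ratios)
```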

We cannot draw conclusions about the overall performance of British, Italian or Finnish scientists (many of them might be American scientists), but we can already observe interesting cultural biases emerging that cannot be explained by the imprecision of onomastic classification alone. They raise interesting questions:

– can linguistic mastery of the English language explain why authors with British or Irish names have more citations?

– can features of a particular culture (e.g. the Irish are excellent networkers and have great pubs) explain why scientific articles are more cited?

– do scientists with Italian names tend to cite more scientists with foreign-sounding names (English, Irish, etc.)?

– do scientists with Finnish names tend to cite more scientists with Finnish names?

– are there additional cultural biases in the publication process itself (selection, curation, promotion of scientific publications)?

– is there a gender bias worth noting (e.g. male scientists are more cited; a culture with fewer female scientists would get a higher ratio)?

Altogether, scientists with Chinese names (names from mainland China or Taiwan) have produced over 307 thousand articles and been cited over 348 thousand times: a ratio of 1.1, in the low range. The rest of this paper focuses on Chinese names: publications authored by a scientist with a Chinese name, or citations of scientists with Chinese names.

Scientists with Chinese names in PubMed

Globally, the number of publications in life sciences has been growing exponentially. Many countries and institutions encourage scientists to publish and link performance to bibliometric rankings (i.e. publications in reputable journals, number of citations, etc.).

2014_ICOS_NamSor_paper_vF_pic2

From this chart, we can observe,

– that the absolute number of publications authored by scientists with a Chinese name has nearly tripled between 2007 and 2008 (x2.5, from 7k to 17k);

– that the relative share of publications authored by scientists with a Chinese name (compared to other onomastic classes) is also growing steadily.

This growth in the number of publications by authors with Chinese names, in absolute and relative terms, is matched by a drop in the ratio of citation/authorship :

2014_ICOS_NamSor_paper_vF_pic3

Year Articles (A) Citations (C) Ratio (C/A)
2012 81326 68038  0.8
2011 52396 42371  0.8
2010 33821 49260  1.5
2009 24726 35715  1.4
2008 17258 26321  1.5
2007 6944 17234  2.5
2006 4770 11299  2.4
2005 3260 6910  2.1
2004 1830 3782  2.1
2003 1195 2211  1.9
2002 849 1436  1.7
Before 3477 3823  1.1
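
The figures in this table can be recomputed directly from the two count columns. The snippet below does so for a few years and also derives the 2007 to 2008 growth factor mentioned above; the variable names are only illustrative.

```python
# year -> (articles authored, citations), copied from the table above.
yearly = {
    2012: (81326, 68038),
    2008: (17258, 26321),
    2007: (6944, 17234),
    2002: (849, 1436),
}

# Citations per authored article for each year.
ratio = {year: round(c / a, 1) for year, (a, c) in yearly.items()}

# Growth in the number of authored articles between 2007 and 2008.
growth_2007_2008 = yearly[2008][0] / yearly[2007][0]

print(ratio)
print(round(growth_2007_2008, 1))
```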

Next, we will look at co-authorships. We expect co-authorships to be more frequent within the same onomastic class because of the correlation with geography: scientists with an Italian name might live in Italy, work in the same university on a research project and publish the results of their research together. We also expect to find diversity: many publications are the result of international cooperation; scientists are internationally mobile; and, last but not least, countries like the US and Switzerland attract talent from everywhere and, as a result of this global ‘brain drain’, produce very international research teams.

Both aspects, affinity and diversity, are reflected in the following matrix – displaying the number of co-authorships between onomastic classes:

2014_ICOS_NamSor_paper_vF_pic4

For example, the first column of the matrix (reflected in the pie chart below) shows that scientists with British names have a strong affinity for co-authoring with other scientists with British names, but also that they are likely to publish (in order) with scientists with French names, German names, Irish names, Italian names etc.

2014_ICOS_NamSor_paper_vF_pic5

Scientists with Chinese names have an even stronger affinity for co-authoring with other scientists with Chinese names; they are likely to publish (in order) with scientists with British names, French names, German names, Italian names, Irish names, Korean names etc.

2014_ICOS_NamSor_paper_vF_pic6
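
A co-authorship matrix like the one discussed above can be built by counting, for every paper, each pair of authors by onomastic class. The sketch below assumes each paper is reduced to the list of its authors’ classes; the sample `papers` are hypothetical and the paper’s own counting rules may differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: one list of author onomastic classes per paper.
papers = [
    ["GB", "GB", "FR"],
    ["CN", "CN"],
    ["GB", "DE"],
]


def coauthorship_counts(papers):
    """Count co-author pairs by onomastic class (unordered pairs)."""
    counts = Counter()
    for classes in papers:
        for a, b in combinations(classes, 2):
            # Store each unordered pair under a canonical, sorted key.
            counts[tuple(sorted((a, b)))] += 1
    return counts


counts = coauthorship_counts(papers)
print(counts)
```

Pairs such as ("GB", "GB") sit on the diagonal of the matrix and capture affinity; the off-diagonal pairs capture diversity.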

Next, we will look at citations. In a perfect world, we expect citations to be made based on the merits of scientific research only. We assume some ‘invisible hand’ will self-regulate the visibility of publications among research communities, so that all relevant research is known to the experts of the field. If scientific excellence is equally distributed, we expect the number of publications citing authors of a particular onomastic class to be proportional to the number of authors of that particular onomastic class. However, the following table tells a different story.

Onomastic Class  Authored %  Self-Citations %  Bias Factor
(GB,LATIN) 16.6% 17.0% 1.02
(FR,LATIN) 8.1% 7.6% 0.94
(IT,LATIN) 5.6% 3.8% 0.68
(DE,LATIN) 5.8% 6.1% 1.05
(CN+TW,LATIN) 9.2% 12.1% 1.32
(ES,LATIN) 3.4% 3.8% 1.13
(JP,LATIN) 5.2% 19.3% 3.73
(IE,LATIN) 2.6% 4.4% 1.73
(NL,LATIN) 3.1% 5.6% 1.83
(AT,LATIN) 2.3% 4.2% 1.79
(SE,LATIN) 2.0% 3.5% 1.76
(IN,LATIN) 4.6% 4.1% 0.89
(PT,LATIN) 1.9% 2.3% 1.17
(GR,LATIN) 1.5% 2.8% 1.82
(KR,LATIN) 4.4% 3.0% 0.68
(BE,LATIN) 1.3% 2.6% 1.98
(DK,LATIN) 1.3% 3.4% 2.65

In this table, we observe that authors with British names represent 16.6% of publications, but 17% of their citations : a bias factor of 1.02 (almost no bias). Conversely, we observe that authors with French names represent 8.1% of publications, but only 7.6% of their citations : a bias factor of 0.94 indicating that authors with French names tend to cite authors with foreign names more.

As for authors with Chinese names, they represent 9.2% of the publications, but 12.1% of their citations : a bias factor of 1.32 indicating that they tend to cite authors with Chinese names more.
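
Read this way, the bias factor is the share of a class in its own citations divided by its share of publications. The snippet below recomputes it from the rounded percentages in the table; the function name `bias_factor` is only illustrative. Because the published factors were presumably computed from unrounded shares, a few rows of the table differ from this recomputation in the second decimal.

```python
def bias_factor(cited_share: float, authored_share: float) -> float:
    """Share among citations divided by share among publications."""
    return cited_share / authored_share


# class -> (authored %, self-citations %), copied from the table above.
table = {
    "GB": (16.6, 17.0),
    "FR": (8.1, 7.6),
    "CN+TW": (9.2, 12.1),
}

factors = {
    onoma: round(bias_factor(cited, authored), 2)
    for onoma, (authored, cited) in table.items()
}
print(factors)
```

A factor of 1 means a class cites its own names exactly as often as their share of publications would suggest; values above 1 indicate a preference for same-class names.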

Authors with Chinese names have a positive bias in citing authors with Chinese names, however we can see other cases where the bias is even stronger: authors with Japanese names citing authors with Japanese names, authors with Danish names…

More interestingly, the following table shows that, apart from authors with a Chinese name, every other onomastic class (British, French, Italian, German etc.) has a negative bias towards citing authors with a Chinese name.

Onomastic class  Citations of Chinese names %  Bias Factor
(GB,LATIN) 3.9% 0.43
(FR,LATIN) 3.9% 0.42
(IT,LATIN) 3.9% 0.43
(DE,LATIN) 4.1% 0.44
(CN+TW,LATIN) 12.1% 1.32
(ES,LATIN) 4.0% 0.43
(JP,LATIN) 5.2% 0.56
(IE,LATIN) 4.0% 0.44
(NL,LATIN) 3.5% 0.38
(AT,LATIN) 4.1% 0.44
(SE,LATIN) 3.6% 0.40
(IN,LATIN) 5.9% 0.65
(PT,LATIN) 4.0% 0.43
(GR,LATIN) 3.9% 0.42
(KR,LATIN) 6.8% 0.74
(BE,LATIN) 3.8% 0.42
(DK,LATIN) 3.9% 0.42

Authors with a Chinese name tend to cite authors with a Chinese name more. By comparison, scientists with non-Chinese names (British, French, Italian, German etc.) have a bias factor of 0.46 and are about 3 times less likely to cite publications authored by a scientist with a Chinese name.

We will now see how the bias factors evolved between 2002 and 2012:
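
The ‘3 times’ figure follows from comparing the two bias factors quoted above, as this short check shows (the variable names are only illustrative):

```python
# Bias factors towards Chinese names, as quoted in the text above.
chinese_names_factor = 1.32
non_chinese_names_factor = 0.46

# How much more likely authors with Chinese names are to cite Chinese names.
relative_likelihood = chinese_names_factor / non_chinese_names_factor
print(round(relative_likelihood, 1))
```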

2014_ICOS_NamSor_paper_vF_pic7

According to this table, the positive bias factor of authors with Chinese names in citing other authors with Chinese names remains roughly stable. On the other hand, the negative bias factor of scientists with non-Chinese names in citing authors with Chinese names is generally increasing.

Manual controls

Given the large number of names automatically classified in a taxonomy based on geographic origin (China, etc.), we could not manually verify the entire database. We manually verified two randomly selected subsets:

– firstly, a list of 1280 names recognized by the software as Chinese names;

– secondly, a list of ~10000 names classified by the software into the full taxonomy (over 100 onomastic classes, corresponding to different countries of origin)

According to the first validation method, 83% of the names the software recognized as Chinese were manually verified as Chinese, 2% as unknown and 15% as non-Chinese (i.e. misclassifications).

The software outputs a confidence level. 76% of the names were classified with positive confidence. For the names recognized as Chinese with a positive confidence, 94% were manually verified as Chinese, 1% as unknown and 4% as non-Chinese (i.e. misclassifications).

2014_ICOS_NamSor_paper_vF_pic8

In PubMed, many names do not have a full first name, only initials.

For names classified with positive confidence, we found that first names of just one or two characters (e.g. J or JH) accounted for 90% of misclassifications. When the input includes a full name (as would generally be the case with other bibliometric sources such as Thomson WoS, Scopus or ORCID) the accuracy is 99%.

2014_ICOS_NamSor_paper_vF_pic9

According to the second validation method, we can calculate the usual metrics used in classification : precision and recall.

10172 names were classified independently by a human operator. With this method, errors could be made by the computer and also by the human operator.

For the calculations below, we assume the human operator made no mistakes (this is not the case: to err is human). The operator could classify 50% of the names and left the rest as ‘Not Sure’.

For Chinese and non-Chinese names, the software precision was respectively 81% and 97% and the recall was 59% and 99%. For names classified by the software with positive confidence (52% of all names), the precision was 93% and the recall was 69%. Excluding the names whose first name is two characters or fewer (initials, such as J or JH), the precision was 97% and the recall was 72%.
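
For readers less familiar with these metrics: precision is the share of names the software labelled Chinese that the operator also labelled Chinese, and recall is the share of names the operator labelled Chinese that the software found. The sketch below computes both on hypothetical labels. Dropping the ‘Not Sure’ rows before counting is an assumption about how the evaluation was done, and the function name is illustrative.

```python
def precision_recall(predicted, actual, positive):
    """Precision and recall for one class, ignoring 'Not Sure' rows."""
    # Assumption: rows the operator could not classify are excluded.
    pairs = [(p, a) for p, a in zip(predicted, actual) if a != "Not Sure"]
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: software output vs. manual classification.
predicted = ["CN", "CN", "CN", "Other", "Other", "Other", "CN"]
actual = ["CN", "CN", "Other", "CN", "Other", "Other", "Not Sure"]

precision, recall = precision_recall(predicted, actual, positive="CN")
print(precision, recall)
```

Swapping the `predicted` and `actual` arguments gives the reverse comparison, in which the computer is treated as the reference and the operator is scored against it.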

If conversely, we assume that the computer made no mistakes, then we can compare the precision and recall of the operator with that of the computer:

All Names
             Chinese Names        Non-Chinese Names
             Computer   Human     Computer   Human
Precision    81%        59%       97%        99%
Recall       59%        42%       99%        48%

Confidence>0
             Chinese Names        Non-Chinese Names
             Computer   Human     Computer   Human
Precision    93%        69%       96%        99%
Recall       69%        49%       99%        48%

Confidence>0 && Len(firstName)>2
             Chinese Names        Non-Chinese Names
             Computer   Human     Computer   Human
Precision    97%        72%       96%        100%
Recall       72%        51%       100%       48%

This method of cross validation between computer and human could be improved by having several manual checks by different operators to obtain a good validation sample.

Future work

For future work, we would like to mine the large commercial bibliographic databases (Thomson WoS, Scopus and possibly ORCID) because they offer better data quality and useful additional information:

– firstly, they have the full name in addition to the short name cited with just initials; this significantly reduces the error rate of onomastic classification

– secondly, they link scientists to research institutions (affiliations) and geographies (country of affiliation); this allows additional analysis on the topic of Diasporas and brain drain, comparing, for example, the research output of Chinese / Chinese American scientists in the US with that of scientists of Mainland China;

– thirdly, those databases have a larger coverage in terms of scientific disciplines, allowing comparison between different fields of research.

Conclusions

Significant cultural biases exist, not only in the way scientists co-author publications, but also in the way they make citations. Scientific publications authored by scientists with Chinese names are three times less cited by the international research community than they are cited by other scientists with Chinese names. We cannot draw conclusions about the quality of Chinese research, but we can challenge the commonly accepted idea that the volume of publications and citations alone indicates that China is becoming a superpower in Science and Technology.

Given the importance of bibliometric rankings in the way countries build and monitor public policies on Science and Education or international cooperation, and in the way research institutions measure and reward the scientific excellence of researchers and teams, those biases should be accounted for. Otherwise, international comparisons are not ‘scientific’, not fair and can lead to wrong decisions.

[PDF 2014_ICOS_NamSor_paper_vF.pdf] [Pitch 20140828_ICOS2014_Pitch_vF.pdf]

[1] Onomastics and Big Data Mining, ParisTech Review 2013, arXiv:1310.6311 [cs.CY]

Source Data


What’s in a name in 1914, in 2014?

(an onomastics.co.uk reblog)

This month, starting 25th of August, the University of Glasgow will host the 25th International Congress of Onomastic Sciences, the premier conference in the field of name studies.

For this occasion, we have started to calibrate NamSor software to recognize Scottish names. This is work in progress, but I’d like to share some preliminary data visualizations of regional names.

2014 marks 100 years since the start of the First World War. All across Europe and beyond, families lost dear ones, children were raised without knowing their father and grand-children were born in the aftermath of this trauma – only to live through another global war, WWII. Let’s respect the people who died in both wars, and let’s also listen to the message their names convey to us about who they were, about who we are.

What do personal names tell us about the world in 1914?

In 1914, Europe was composed of Nations and Nations of Regions with deeply rooted people. This was the situation before the massive rural exodus and before the international migration flows caused by either decolonization or what we call today ‘globalization’. This first global war was fought by local people who lived close by among themselves, married in their local community, often spoke their own local language…

Scottish names

We’ve analysed the Commonwealth War Grave Commission (CWGC) database to see if we could correlate onomastics and regiments. The result is presented below:

20140801_Scottish_WWI_Onomastic_Millefeuille_v002

 

We’ve found a majority of Scottish names in regiments such as: the Gordon Highlanders, the Mercantile Marine Reserve, the Royal Scots, the Cameron Highlanders, the Seaforth Highlanders, the King’s Own Scottish Borderers and also the Royal Flying Corps.

The onomastic mille-feuille is dense but hard to understand. You can think of it as a sorted list of pie charts, like this one:

20140801_Scottish_WWI_Onomastic_PieChart_MercantileMarineReserve_v001

This pie chart tells us that the Mercantile Marine Reserve was composed mostly of Scottish and Welsh soldiers.

By looking at the soldiers’ ranks for that particular regiment, we can produce a new onomastic mille-feuille: names DO matter when it comes to rank in 1914.
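
The ‘sorted list of pie charts’ reading can be sketched in a few lines: compute, for each regiment, the share of each onomastic class, then order the regiments by the share of one class. The records and regiment labels below are hypothetical placeholders, not CWGC data, and the function names are illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical records: (regiment, onomastic class of the soldier's name).
records = [
    ("Regiment A", "Scottish"), ("Regiment A", "Scottish"),
    ("Regiment A", "Scottish"), ("Regiment A", "English"),
    ("Regiment B", "Scottish"), ("Regiment B", "English"),
    ("Regiment C", "English"), ("Regiment C", "English"),
]


def class_shares(records):
    """One 'pie chart' per regiment: share of each onomastic class."""
    by_regiment = defaultdict(Counter)
    for regiment, onoma in records:
        by_regiment[regiment][onoma] += 1
    shares = {}
    for regiment, counts in by_regiment.items():
        total = sum(counts.values())
        shares[regiment] = {o: n / total for o, n in counts.items()}
    return shares


def sort_by_class(shares, onoma):
    """Order regiments by decreasing share of one onomastic class."""
    return sorted(shares, key=lambda r: shares[r].get(onoma, 0.0), reverse=True)


shares = class_shares(records)
ranking = sort_by_class(shares, "Scottish")
print(ranking)
```

Stacking the per-regiment shares in this order is what gives the mille-feuille its layered look.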

20140801_Scottish_WWI_Onomastic_MilleFeuille_MercantileMarineReserve_v001

In more easily understandable pie chart language, this means that the Firemen were mostly Scottish and Welsh, whereas the Carpenters were English.

20140801_Scottish_WWI_Onomastic_Ranks_PieChart_MercantileMarineReserve_v001

Indian names

The First World War started as a European war, but populations from Africa and Asia were immediately mobilized by the colonial powers of the time: the British Colonial Empire, France, … Many soldiers came from far away to meet their death in the trenches of eastern France.

The Indian names in the CWGC database are recorded without any given name, but with the son’s and father’s names, for example:

sonName fatherName place regiment
PURANBAHADUR GHARTI KAMANSING GHARTI NEPAL 9th Gurkha Rifles
PUNE THAPA NAIN SING THAPA GULMI NEPAL 4th Gurkha Rifles
RADHA KISHN GANGA RAM RAJPUTANA Bharatpur Infantry
SITARAM SAWANT NILU SAWANT BOMBAY 117th Mahrattas
NAMDAR KHAN HAYAT KHAN N W F  PROVINCE 21st Punjabis
SHAHAB UDDIN KARAM ILAHI PUNJAB 53rd Sikhs (Frontier Force)
RAM RAKHA CHHOTE PUNJAB Sirmur Imperial Service Sapper Corps
AMAR SINGH GURDITT SINGH PUNJAB 15th Ludhiana Sikhs
LALITBIKRAM THAPA RAMBIKRAM THAPA NEPAL 5th Gurkha Rifles (Frontier Force)
PANCHAM DHUNDA UNITED PROVINCES Army Bearer Corps
CHINNASWAMI DURUGAYA MYSORE 2nd Queen Victoria’s Own Sappers and Miners
LAKKHI JAHANGIR UNITED PROVINCES Indian Royal Artillery
SHIU DAS DUBE RAM SEWAK DUBE UNITED PROVINCES 3rd Brahmans
BATAN SINGH BELA SINGH PUNJAB 57th Wilde’s Rifles (Frontier Force)
KALU GHALE KAMI GHALE NEPAL 8th Gurkha Rifles
ISMAIL HAIDAR MANUBUDDIN SIKDAR BENGAL Indian Railway Department
FATTEH KHAN DIL DOST KHAN PUNJAB 82nd Punjabis
SUJAWAL KHAN BAHADUR KHAN PUNJAB 38th King George’s Own Central India Horse
MAHABIR MURAI LACHHMAN MURRAI UNITED PROVINCES 3rd Sappers and Miners
SURENDRANATH RIAWA CHANDI CHARAN BISWAS BENGAL Indian Labour Corps

 

So we have used a different algorithm to automatically cluster Indian names into onomastic classes. Some onomastic classes might be related to geography, to Indian castes, to social status or to religious beliefs…

We can again use an onomastic mille-feuille to visualize the correlation between names and geography, but here a classic geographical map would probably tell a better story.

20140801_Indian_WWI_Onomastic_Millefeuille_v001

Distinctive patterns are recognized in names from Bombay, Madras, Delhi or Peshawar, allowing the software to cluster them into distinct onomastic classes.

And again we can then look at regiments to visualize how ethnically/linguistically diverse they were:

20140801_Indian_Regiments_WWI_Onomastic_Millefeuille_v001

 

Italian names

All regions of Italy paid a heavy toll in the Great War:

2014_Italian_WWI_Casualties

 

Italian regional names are particularly well differentiated, as can be seen in the following onomastic mille-feuille:

2014_Italian_WWI_Onomastics

We display here some examples of typical names from different regions. Can you see how different they are?

  • IT/Abruzzi e Molise: MEZZACAPPA GIUSEPPE DI ANTONIO, PAOLILLI-TREONZE PASQUALE DI DOMENICO, BONITATIBUS ERMANNO DI ANGELO, FIDELIBUS ANGELANTONIO DI EUGENIO, PAOLILLI-TREONZE DONATO DI GAETANO, VASQUENZ AUGUSTO ANGELO DI ANTONIO, AMMAZZALORSO ANTONIO DI ANGELO.
  • IT/Basilicata: LATERZA GIOVANNI DI GIUSEPPE, SCAMORCIA GIUSEPPE DI GAETANO, ALAGIA NICOLA DI GIUSEPPE, CLAPS VITO CANIO DI GAETANO, CLOROFORMIO VITO DOMENICO DI TADDEO, SCANDIFFIO DOMENICO DI INNOCENZO, CLAPS ANGELO VITO DI VITANTONIO, CASAMASSIMA FRANCESCO PAOLO DI GIOVANNI, PENNIMPEDE GIUSEPPE DI PIETRO.
  • IT/Calabria: PROCOPIO FRANCESCO DI NICOLA, CANDREVA FRANCESCO DI GIUSEPPE, SCICCHITANO FRANCESCO DI GIUSEPPE, SPACCAROTELLA GIOVANNI DI ANGELO, CICCIU CONSOLATO DI ANTONIO, LULJ GIUSEPPE DI VINCENZO, TRUNCELLITO DOMENICO PASQUALE DI GIUSEPPE, DAVOLOS DOMENICO DI PASQUALE, CHIDICHIMO GIOVANNI DI SALVATORE.
  • IT/Campania: ANNUNZIATA GIOVANNI DI ANTONIO, PISCOPO GIOVANNI DI ANTONIO, PISCOPO GIUSEPPE DI ANTONIO, SARRAPOCHIELLO LORENZO DI NICOLA, GENETIEMPRO GIUSEPPE DI MATTEO, VALIANTAE ANIELLO DI CARMINE, DONNIACUO ALFONSO DI GIUSEPPE.
  • IT/Emilia-Romagna: SCHIAVAZAPPA BONFIGLIO DI CRISTOFORO, SAVRIE ADELCHI DI GIUSEPPE, VACONDIO BONFIGLIO DI PIETRO, GUAGLIUMI GEMINIANO DI CESARE, ASTROLOGI GIOVANNI DI FERDINANDO, SAVRIE GIUSEPPE DI PRIMO, GUAGLIUMI GIOVANNI DI LEANDRO, MANSERVIGI GIOVANNI DI SALINGUERRA.
  • IT/Lazio: ASTROLOGO ANGELO DI PACIFICO, FAPERDUE SALVATORE DI VALENTINO, CENTOSCUDI NAZZARENO DI SANTE, CARLODALATRI UMBERTO DI FRANCESCO, CAPPADOCIA GIUSEPPE DI GIOVANNI, SCHIETROMA GIUSEPPE DI PASQUALE, PALAMIDES GIOVANNI DI GIUSEPPE, GIANFERMI GIOVANNI BATTISTA DI DOMENICO, CAPPADOCIA AMEDEO DI GIUSEPPE, PIETROBONO GUGLELMO DI BENIAMINO.
  • IT/Liguria: GAGGERO GIOVANNI BATTISTA DI GIUSEPPE, KONIG GIOVANNI BATTISTA DI GIOVANNI BATTISTA FILIPPO, MONTEGHIRFO GIOVANNI DI LUIGI, MAGIONCALDA GIOVANNI BATTISTA DI GIOVANNI, BACIGALUPO GIOVANNI BATTISTA DI DOMENICO, REDEGOSO GIOVANNI BATTISTA DI BARTOLOMEO, KONIG GUGLIELMO DI PIETRO, ARBOCO GIOVANNI BATTISTA DI EMANUELE VINCENZO.
  • IT/Lombardia: SANTAMBROGIO GIUSEPPE DI FRANCESCO, RUEFF GIOVANNI DI GIOVANNI, RECALCATI GIUSEPPE DI AMBROGIO, TAGLIABUE GIUSEPPE DI ANGELO, RANZENIGO FRANCESCO DI GIOVANNI, PIANTANIDA ANTONIO DI FELICE, SALMOIRAGHI GIUSEPPE DI ATTILIO, CONSONNI GIUSEPPE DI DOMENICO.
  • IT/Marche: CUCCU GIUSEPPE DI FRANCESCO, FIORDOLIVA GIUSEPPE DI PACIFICO, CINGOLANI NAZZARENO DI PIETRO, ANGELOME MARONE DI GIUSEPPE, VOLTATTORNI NAZZARENO DI FRANCESCO, CARSTANJEN GUSTAVO DI PAOLO, MENGHI-CERRA NAZZARENO DI DAVID, VOLTATTORNI CIRIACO DI LUIGI, CARSTANJEN EDOARDO DI PAOLO, BRUZZECHESSE DOMENICO DI FRANCESCO.
  • IT/Piemonte: DESTEFANIS GIOVANNI DI GIUSEPPE, RIVOIRA GIOVANNI DI PIETRO, CUTTICA GIUSEPPE DI CARLO, BELLINO-ROCI GIUSEPPE DI NICOLAO, NEPOTE GIOVANNI DI DOMENICO, AIMAR BARTOLOMEO DI BARTOLOMEO, LANTELME GIORGIO DI FRANCESCO, GUELPA GIOVANNI DI GIOVANNI, VALSANIA GIOVANNI DI ANTONIO, ARNEODO GIUSEPPE DI GIOVANNI.
  • IT/Puglia: SPAGNULO COSIMO DAMIANO DI FRANCESCO, VANTAGGIATO GIUSEPPE DI VINCENZO, SEMERARO GIOVANNI DI GIUSEPPE, EPICOCO DOMENICO DI GIOVANNI, AGHILAR RUGGIERO DI LUIGI, CANNABONA CROCIFISSO DI PASQUALE, BAGLIVO CROCIFISSO DI ORONZO, SPEDICATO CROCEFISSO DI SALVATORE, GIANCANE CROCIFISSO DI RAFFAELE.
  • IT/Sardegna: MARONGIU SALVATORE DI ANTONIO, PORCU GIOVANNI DI FRANCESCO, MARONGIU FRANCESCO DI SALVATORE, PUTZOLU GIOVANNI DI GIUSEPPE, DESOGUS GIOVANNI DI ANTONIO, MURTAS GIOVANNI DI GIUSEPPE, LAMPIS ANTIOCO DI FRANCESCO.
  • IT/Sicilia: RAPISARDA SALVATORE DI GIUSEPPE, GIONFRIDDO PAOLO DI SALVATORE, MACALUSO GIUSEPPE DI GIUSEPPE, SPAMPINATO ANTONINO DI GIUSEPPE, PRIVITERA ANTONINO DI GIUSEPPE, SCACCIANOCE SALVATORE DI ROSARIO, RAPISARDA SALVATORE DI CARMELO, CANGIALOSI ANTONINO DI MICHEL.
  • IT/Toscana: SCHIUMARINI IACOPO DI ANTONIO, DIOLAIUTI FERRUCCIO DI GIULIO, MAZZEI EFREM DI GIUSEPPE, DELL’EUGENIO ANGIOLO DI ANTONIO, DELL’ARINGA GABBRIELLO DI DANIELE, PISTOI ASTAROTTE DI OLIMPIO, BIENTINESI MILZIADE DI GIOVANNI, ANZEMPAMBER FILIPPO DI ADOLFO, BEMPORAD DUILIO DI POLICARPO, DELL’OMODARME RANIERI DI DEMETRIO.
  • IT/Trentino-Alto Adige: DALPIAZ GIUSEPPE, ANDERLE GIOVANNI, DEVIGILI GIUSEPPE, PONTALTI GIUSEPPE, CASAGRANDA GIUSEPPE, FLAIM GIOVANNI, PALLAORO GIUSEPPE, STEDILE GIUSEPPE, DETASSIS GIUSEPPE, DELVAI GIUSEPPE.
  • IT/Umbria: DESANTIS GIUSEPPE DI DOMENICO, MAGARINI-MONTENERO DOMENICO DI BONAVENTURA, QUONDAM GIOVANNI DI NAZZARENO, GAMBELUNGHE SALVATORE DI CESARE, CENTOGAMBE DOMENICO DI FELICE, QUONDAM CASTORINO DI GIUSEPPE, BESTIACCIA GIOVENALE DI GIUSEPPE, BELLACHIOMA ASTORRE DI ALBERTO, SFORNA CRISPOLTO DI NAZZARENO, CENTOGAMBE GIUSEPPE DI PIETRO.
  • IT/Veneto: DELL’OSBEL GIOVANNI DI ANTONIO, MESTRINER GIOVANNI DI GIUSEPPE, RODIGHIERO GIOVANNI DI ANTONIO, BOF GIOVANNI DI LUIGI, DALL’OSTO GIUSEPPE DI PIETRO, SKREZENEK GIUSEPPE DI CARLO, FILOSOFO GIOBATTA DI PAOLO, MENEGUZ GIOBATTA DI ANTONIO, MESCALCHIN GIOBATTA DI ANDREA, CIPOLAT-GOTET GIOVANNI DI GRAZIADIO.

French names

The equivalent of CWGC in France is the Mémoire des Hommes database. We’ve used it to calibrate NamSor recognition of French regional names. After calibration, about 70% of names can be allocated to a particular region and we can produce the following onomastic mille-feuille, sorted according to the relative number of Bretons (people from Brittany):

20140801_France_WWI_Millefeuille_v001

We can also view the total number of casualties, broken down by onomastic class. It shows the large number of people originally from Brittany who died during WWI, regardless of their birthplace. However, this remains debatable, as ~30% of names could not be allocated to a specific region of origin (only recognized as French).

20140801_France_WWI_RegionalBreakdown_v001

Baptiste COULMONT, a sociologist, published a very interesting study on given names analysing the results of students at the French Baccalaureate in 2014. We’ve used a similar dataset to compare regional names in 1914 and in 2014. Unfortunately, we didn’t have enough time to align the geographic mappings, but the result is visual and self-explanatory. We can see how rural exodus and internal migration have eroded the regional identity in personal names. Still, even in 2014, the correlation between onomastics and geography remains strong, especially in Brittany, in the North of France, in Alsace, in Lorraine, in Loire, in Lyon, in Aquitaine and in Corsica.

20140801_France_Millefeuille_1914_2014_v001

What do names tell us about the world in 2014?

A lot! Some say: too much!

Enough to make ICOS2014 a very exciting and current event. We look forward to being in Glasgow on 24th August and meeting you there. Long live onomastics.co.uk

Feel free to contact us, mailto:contact@namsor.com

About

NamSor™ Applied Onomastics is a European designer of name recognition software. Our mission is to help make sense of Big Data and understand international flows of money, ideas and people.
http://namsor.com/

NamSor is committed to promoting diversity and equal opportunity and has launched GendRE API, a free API to conduct analysis of gender equality using open data.

 


Filed under EthnoViz

Onomastic sampling for migration studies

On Friday morning, I had the opportunity to present our breakthrough data mining technology at Regent’s University Turkish Migration Conference (TMC2014, London).

The supporting presentation can be downloaded here (20140530_TMS2014_Pitch_vFf.pdf) or viewed online here.

20150601_TurkishMigrationStudies

During the following sessions by researchers from various countries (Turkey, US, UK, Germany, Netherlands, Sweden, Norway, Belgium …), I learned some of the ‘jargon’ of migration studies and also something about the particular research methodologies applied in that field.

My initial vision was that onomastics (the recognition of personal names) could be applied to discover new migration patterns. It was based on several preliminary meetings with international organizations concerned with migration issues. Census data can take up to three years to process. As states struggle to provide timely and accurate data to international organizations (such as the OECD, IOM, United Nations High Commissioner for Refugees UNHCR, …), these organizations can turn to Big Data to identify and monitor new trends. There are challenges in identifying relevant data sources to provide valuable information about less digitally connected migrants. Twitter, LinkedIn, Google, Facebook, D&B, Thomson WoS … combined with applied onomastics can tell us a lot about the changing migration patterns of STEM Workers, innovators and entrepreneurs.

STEM Workers: workers in science, technology, engineering, and mathematics; art is occasionally considered as well (STEAM Workers).

With several TMS2014 sessions focused on the question of Turkish identity, or the particular migration and integration patterns of the Turkish, Kurdish, Alevi or Circassian communities, applied onomastics clearly offers an innovative tool to look at data from different angles (nationality/birth place/ethnicity/gender/…)

However, I found that many research studies are conducted based on an initial theoretical hypothesis. Researchers then apply various qualitative or quantitative methods (occasionally both) to assess the hypothesis. Pure quantitative methods such as ‘data mining’ or ‘graph analysis’ are seen as de-humanizing by researchers (anthropologists, sociologists, historians …) who are primarily interested in the human story of migration. Most researchers conduct surveys to gather the data for their study: they find people, talk to them, ask questions. How do researchers identify the group of people to be surveyed (the sample)? During the conference, I learned another piece of jargon: network/snowball sampling.

Network/snowball sampling: Snowball sampling is based on the selection of target people in personal networks. In a first step, important people within the target group are identified (initial sample) who themselves identify further people who can be also addressed for the survey (McKenzie & Mistiaen, 2007, p. 2; Salentin, 1999, p. 124).
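The definition above can be sketched as a breadth-first walk over a referral network. This is only an illustration: the network, the respondent labels and the `max_waves` cut-off are hypothetical, and real surveys add refusals, quotas and other constraints that are ignored here.

```python
from collections import deque

# Hypothetical referral network: who each respondent names as a further contact.
referrals = {
    "seed_1": ["a", "b"],
    "seed_2": ["b", "c"],
    "a": ["d"],
    "b": [],
    "c": ["a", "e"],
    "d": [],
    "e": [],
}

def snowball_sample(referrals, seeds, max_waves):
    """Collect respondents wave by wave, starting from the initial sample."""
    sampled = list(seeds)
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        person, wave = frontier.popleft()
        if wave == max_waves:
            continue  # do not follow referrals beyond the last wave
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                sampled.append(contact)
                frontier.append((contact, wave + 1))
    return sampled

sample = snowball_sample(referrals, ["seed_1", "seed_2"], max_waves=1)
print(sample)
```

The composition of the final sample depends entirely on the seeds, which is why the choice of the initial sample matters so much in this method.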

As often, this new word was the magic keyword to find additional resources and understand how NamSor technology could fit with the current state of migration research methodology:

This document clearly describes the various methodologies to identify the initial population of a study and the various sampling procedures. Onomastic sampling is one of them.

‘In many countries, migrants constitute a substantial part of society. In public opinion research, however, they are often inadequately or not at all considered. This paper gives a systematic overview of the underlying methodological challenges that cause this situation. Those challenges are twofold and concern (1) the definition and distinction of the terms migrant and foreigner to describe the target group and (2) the selection of adequate sampling procedures.’

‘The methodological challenge of selecting adequate sampling procedures

Even after defining the target population, researchers still face difficulties regarding sampling. The problems tackled can be divers, for instance in what way the target population can be contacted (which survey modes are culturally accepted?) and how the individual respondents can be selected (e.g. does last-birthday work?). The paper discusses four central sampling procedures which regularly come up in the literature and which are seemingly appropriate for these kinds of surveys:

1. Sampling procedures on the basis of administrative records,

2. Area sampling, like e.g. random-route-procedures,

3. Network/snowball sampling, and

4. Onomastic sampling procedures based on foreign names from directories.’

How can NamSor software help?

1. Sampling procedures on the basis of administrative records

In this sampling method, the administrative records do not reflect the fine-grained identity of the populations: ‘Turkish nationality’ or ‘Born in Turkey’ encompasses many different populations. Applied onomastics can help refine samples to more targeted populations (Turkish, Alevi, Kurdish, Syrian, …)

2. Area sampling, like e.g. random-route-procedures

In this sampling method, it’s critical to understand the geo-demographics of a territory to know where different migrant populations are concentrated. Applied onomastics can help assess the density of migrant populations at various levels (region/city/district or road) from various public data sources.

3. Network/snowball sampling

In this sampling method, the personal network of the researcher is used as an initial seed to identify further prospects for interviews. Applied onomastics could help analyse the personal networks of researchers (from social networks such as Twitter, or academic sources such as bibliographic databases) to identify larger seed networks and generate better samples. That could help reduce the risk of biases induced by the researcher’s network (which tends to reinforce the researcher’s own personal or cultural biases).

4. Onomastic sampling procedures based on foreign names

Dictionaries of given names and family names associated with a particular culture have been used for sampling.

NamSor software goes beyond this technique to use sociolinguistics and recognize in a (firstName, lastName) pair the likely origin of a person, with high accuracy. NamSor software can help researchers conduct onomastic sampling, not just from telephone directories but also from a wide range of modern data sources (social networks, opt-in commercial databases, …) with high precision and fine-grained targeting.
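For comparison, the classic dictionary-based technique can be sketched in a few lines. This is not NamSor’s sociolinguistic algorithm, which is not described here; the mini-dictionaries, the rule that both names must agree, and the function names are all assumptions made for the example.

```python
# Hypothetical mini-dictionaries; a real study would use curated name lists.
SURNAMES = {"yilmaz": "Turkish", "kaya": "Turkish", "demir": "Turkish", "smith": "English"}
GIVEN_NAMES = {"mehmet": "Turkish", "ayse": "Turkish", "john": "English"}

def likely_origin(first_name, last_name):
    """Return an origin label only when both dictionaries agree, else None."""
    first = GIVEN_NAMES.get(first_name.strip().lower())
    last = SURNAMES.get(last_name.strip().lower())
    if first is not None and first == last:
        return first
    return None

def onomastic_sample(directory, target_origin):
    """Keep directory entries whose (firstName, lastName) pair matches the target."""
    return [entry for entry in directory if likely_origin(*entry) == target_origin]

directory = [("Mehmet", "Yilmaz"), ("John", "Smith"), ("Ayse", "Smith"), ("Ayse", "Kaya")]
sample = onomastic_sample(directory, "Turkish")
print(sample)
```

A pure lookup like this misses every name absent from the dictionaries and cannot handle mixed pairs such as the third entry, which is the limitation a classifier working on the whole (firstName, lastName) pair aims to address.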

Conclusion

NamSor’s powerful technology raises many data privacy and ethical questions, but we’re glad to say that if science and migration studies can be good for society, NamSor can be too.

About NamSor:
NamSor’s mission is to help understand international flows of money, ideas and people. NamSor launched GendRE API, a free API to conduct analysis of gender equality using open data. http://namesorts.com/api/


Filed under General

Turkish Onomastics and Migration Patterns

Next week at Regent’s University Turkish Migration Conference (TMC2014, London), Elian Carsenat will present breakthrough data mining technology to apply onomastics (the recognition of personal names) to the discovery of new migration patterns.

20140522_TMC_Flyer

As states struggle to provide timely and accurate data to international organizations (such as the OECD, IOM, United Nations High Commissioner for Refugees UNHCR, …), these organizations can turn to Big Data to identify and monitor new trends. What can Twitter, LinkedIn, Google, Facebook, D&B, Thomson WoS … tell us about the changing migration patterns of highly educated professionals and entrepreneurs? We’ll present how applied onomastics and Big Data can be a game changer in migration studies, with vast implications for how countries or even regions can engage their Diaspora (to attract FDI, remittances, to build networks of expertise, …)

We look forward to seeing you at Regent’s University Turkish Migration Conference (TMC2014, London). Full program here.

The supporting presentation can be downloaded here: 20140530_TMS2014_Pitch_vFf.pdf



Filed under EthnoViz

Cannes2014, Mind the Gender Gap

[UPDATE Sept 2014 – many of the mis-classifications listed below are now handled properly by NamSor Gender API, combining classic baby name dictionaries and more advanced sociolinguistics]

We’ve used applied onomastics to recognize the likely gender of about 5 million people in different lists of The Internet Movie Database (IMDb, 2005), using personal names. How reliable is the method? The number of actresses that we’ve classified as male (or actors classified as female) offers an immediate answer: misclassifications are negligible compared to the wide gender gap that is seen in stereotypically ‘male jobs’ such as Cinematographer and ‘female jobs’ such as Costume designer.
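The reliability check described above relies on lists where the gender is already known (actors, actresses). A minimal sketch of that check follows; the toy pairs are hypothetical and are not IMDb figures, and names the classifier could not decide are simply left out of the rate.

```python
def misclassification_rate(pairs):
    """pairs: (known gender from the list, gender inferred from the name)."""
    decided = [(known, guess) for known, guess in pairs if guess in ("male", "female")]
    if not decided:
        return 0.0
    errors = sum(1 for known, guess in decided if known != guess)
    return errors / len(decided)

# Hypothetical toy sample, not IMDb figures.
actresses = [("female", "female"), ("female", "female"), ("female", "male"), ("female", "unknown")]
actors = [("male", "male"), ("male", "male"), ("male", "male"), ("male", "female")]

rate = misclassification_rate(actresses + actors)
print(f"misclassification rate: {rate:.3f}")
```

The rate measured on the known lists can then be compared with the size of the gender gap observed in the other job lists, which is the comparison the post makes.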

20140516_IMDb_GenderGap_Methodology_v002

20140518_IMDb_GenderGap_Table_byRole

Read the original article on Elena’s blog. The following post discusses the methodology in more detail.

For number crunchers who would like to perform their own statistical analysis, we’ve disclosed the full data file (imdb_jobs_gender.zip). Also, NamSor Gender API to infer gender from personal names is open and free to use.

We will now disclose the main reasons for misclassifications and -when relevant- how we plan to address them in future versions of the API.

Gender of English names:

We have misclassified a few English names, like Jamie or Taylor. According to social security card applications, 83,831 male and 264,571 female people bore the name Jamie in the USA since 1879. Conversely in IMDb (or in the USA today), more men than women bear that name: 2,020 actors versus 933 actresses. Some names are genderless and name demographics change across time.
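The Jamie figures can be turned into a signed score to show how the same name leans differently in two populations. The formula below is our assumption for illustration only; it is not documented as the way the API computes its `scale` value, although it follows the same sign convention (negative for male, positive for female).

```python
def gender_scale(male_count, female_count):
    """Signed score in [-1, 1]: negative leans male, positive leans female."""
    total = male_count + female_count
    if total == 0:
        return 0.0
    return (female_count - male_count) / total

ssa_jamie = gender_scale(83_831, 264_571)   # US social security card applications since 1879
imdb_jamie = gender_scale(2_020, 933)       # IMDb actors versus actresses

print(f"US historical: {ssa_jamie:+.2f}, IMDb: {imdb_jamie:+.2f}")
```

The two scores have opposite signs, which is exactly the Jamie problem: a dictionary built from one population gives the wrong answer in the other.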

Gender of Italian names:

We’ve misclassified a few Italian Andreas. Andrea is a male name in Italy and a female name in the US:

https://api.namsor.com/onomastics/api/json/gendre/Andrea/Parker/us
returns {"scale":0.97,"gender":"female"}

whereas

https://api.namsor.com/onomastics/api/json/gendre/Andrea/Rossini/it
returns {"scale":-1.0,"gender":"male"}

In a later version of GendRE API, we will deploy our sociolinguistic algorithm to recognize that Andrea Rossini is most likely an Italian name (and consequently most likely a male name), without requiring any indication of country.
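For readers who want to script such calls, here is a minimal sketch that builds a URL in the shape quoted above and parses a response body of the form shown. It makes no network request, and the endpoint is taken from this post as written, so it may have changed since; the helper names `gendre_url` and `parse_gendre` are ours.

```python
import json
from urllib.parse import quote

BASE_URL = "https://api.namsor.com/onomastics/api/json/gendre"  # as quoted in the post

def gendre_url(first_name, last_name, country=None):
    """Build a request URL in the shape shown above; country is optional."""
    parts = [quote(first_name, safe=""), quote(last_name, safe="")]
    if country:
        parts.append(quote(country.lower(), safe=""))
    return BASE_URL + "/" + "/".join(parts)

def parse_gendre(body):
    """Parse a response body such as {"scale":0.97,"gender":"female"}."""
    data = json.loads(body)
    return data["gender"], float(data["scale"])

url = gendre_url("Andrea", "Rossini", "it")
gender, scale = parse_gendre('{"scale":-1.0,"gender":"male"}')
print(url, gender, scale)
```

Percent-encoding the name parts keeps accented or non-Latin names valid in the URL path.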

Gender of French names:

Jean is a male name in France and often a female name in the US.
https://api.namsor.com/onomastics/api/json/gendre/Jean/Valjean/fr
https://api.namsor.com/onomastics/api/json/gendre/Jean/Johnson/us

Conversely, Laurence is a female name in France and a male name in the US.
https://api.namsor.com/onomastics/api/json/gendre/Laurence/Valjean/fr
http://api.namsor.com/onomastics/api/json/gendre/Laurence/Johnson/us

Gender of Spanish/Hispanic names:

Joan is a male name in Spain and often a female name in the US.
https://api.namsor.com/onomastics/api/json/gendre/Joan/Viñas/es
http://api.namsor.com/onomastics/api/json/gendre/Joan/Smith/us

The case of IMDb names with just initials:

A few misclassifications come from having just the initials, instead of a full given name. It’s hard to guess the gender of N. Watts-Phillips but in a later version of GendRE API, we’ll recognize that N. Zhuravlyov and N. Zeynalova are most likely Slavic (respectively) male and female names. GendRE API already does recognize the gender of Russian names when spelled in Cyrillic:
https://api.namsor.com/onomastics/api/gendre/О/Зейналова/ru

The same goes for Lithuanian names: V. Mainialite and V. Rucyte are most likely female names whereas V. Nikulajevas and V. Belopetravicius are most likely male names.

Chinese and Korean names:

About half the names from China (or Korea) are misclassified. There aren’t that many Chinese names in IMDb, so they have little effect on the overall result.

Guessing the gender of a Chinese name transliterated in Latin characters is no better than flipping a coin. NamSor Gender API works when the Chinese name is in Chinese characters:

https://api.namsor.com/onomastics/api/json/gendre/声涛/周/cn

In a later version, we will recognize Chinese names written in the Latin alphabet, in order to filter them out from name-based gender studies.
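The easy half of this problem, telling whether a name is written in Chinese characters at all, can be sketched as below. It checks only the basic CJK Unified Ideographs block and says nothing about transliterated names, which is the harder case mentioned above; the function names and the ‘han’/‘other’ labels are assumptions for the example.

```python
def has_cjk_ideograph(text):
    """True if any character falls in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

def script_hint(first_name, last_name):
    """Rough routing hint: 'han' if written in Chinese characters, else 'other'."""
    return "han" if has_cjk_ideograph(first_name + last_name) else "other"

print(script_hint("声涛", "周"))
print(script_hint("Andrea", "Rossini"))
```

A study could use such a hint to send names in Chinese characters to the gender classifier while setting aside Latin-script names that need the separate origin check.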

Other special cases:

The above list of causes of misclassification is not exhaustive: names are strongly correlated to gender, but some names are truly genderless (e.g. Kerry). Let’s not forget that gender and sex can also be different concepts, with Eurovision 2014 winner Conchita Wurst being an eminent example:
https://api.namsor.com/onomastics/api/json/gendre/Conchita/Wurst
returns {"scale":1.0,"gender":"female"}

Conclusion

2014.05.18 _ Cannes Day 1 04

Misclassifications in the NamSor Gender API are scarce and negligible given the huge gender gap seen in the film industry. We’ve identified several opportunities to improve, combining name gender demographics with our unique algorithm of name linguistic/cultural classification. NamSor Gender API as it is could be a useful tool for gender researchers and women citizens to perform gender gap analysis on their own, using open data.

We wish Elena – and other women – the best of luck to make their way in the Festival and in the film industry. We hope NamSor Gender API will be useful to monitor gender equality progress in the years to come.

Read the full original article here on Elena’s blog.

About

Elian CARSENAT, a computer scientist trained at ENSIIE/INRIA, started his career at JP Morgan in Paris in 1997. He later worked as consultant and managed business & IT projects in London, Paris, Moscow and Shanghai.

Elian founded NamSor™ Applied Onomastics (NamSor sorts names). NamSor’s mission is to help understand international flows of money, ideas and people.


Filed under General