
Onomastics to measure cultural bias in medical research

Elian CARSENAT, NamSor Applied Onomastics

Dr. Evgeny Shokhenmayer, e-onomastics

Abstract

This project involves the analysis of over ten million medical research articles from PubMed and over three million names of scientists, either authors or people mentioned in citations. We propose to evaluate the correlation between the onomastic class of the article authors and that of the citation authors. We will demonstrate that a cultural bias exists and that it evolves over time. Between 2007 and 2008, the number of articles authored by Chinese scientists (or scientists with Chinese names) nearly tripled. We will evaluate how fast this surge in Chinese research material (or research material produced by scientists of Chinese origin) became cross-referenced by other authors with Chinese or non-Chinese names. We hope to find that onomastics provides a good enough estimation of the cultural bias of a research community. The findings can improve the efficiency of a particular research community, for the benefit of science and all of humanity.

This paper was prepared for ICOS2014, the 25th International Congress of Onomastic Sciences, the premier conference in the field of name studies.

Introduction

PubMed/PMC is a large collection of scientific publications in the life sciences. We used the 2013 data dump for data mining, with 14 million articles and 3.3 million author names. Some of the names are duplicates due to different orthographies, inconsistent use of initials and other data quality issues.

We used NamSor software to allocate an onomastic class to each author name. NamSor software was initially designed to analyse big data in the fields of economic development[1], business and marketing. The method for anthroponymic classification can be summarized as follows: judging from the name only and the publicly available list of all ~150k Olympic athletes since 1896 (and other similar lists of names), for which national team would the person most likely run? Here, the United States is typically considered a melting pot of other ‘cultural origins’ (Ireland, Germany, etc.) and not an onomastic class of its own.

The breakdown of author names by onomastic class is shown below:
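The "which national team would this person run for?" question can be illustrated with a toy majority vote over a reference list of (surname, country) pairs. This is only a sketch: the reference entries, the function names and the vote-share score are illustrative assumptions, and this is not NamSor's actual algorithm, which the paper does not describe in detail.

```python
from collections import Counter, defaultdict

# Per the text, the United States is treated as a melting pot, not as a class.
EXCLUDED = {"US"}

def build_index(reference):
    """Count, for each surname, how often it appears under each country."""
    index = defaultdict(Counter)
    for surname, country in reference:
        if country not in EXCLUDED:
            index[surname.lower()][country] += 1
    return index

def classify(surname, index):
    """Return (most frequent country, its share of the votes) for a surname."""
    votes = index.get(surname.lower())
    if not votes:
        return None, 0.0  # surname absent from the reference list
    country, count = votes.most_common(1)[0]
    return country, count / sum(votes.values())

# Hypothetical reference list standing in for a list of athletes.
reference = [
    ("Rossi", "IT"), ("Rossi", "IT"), ("Rossi", "CH"), ("Rossi", "US"),
    ("Dupont", "FR"), ("Dupont", "FR"), ("Dupont", "BE"),
]
index = build_index(reference)
rossi = classify("Rossi", index)      # IT wins with 2 of the 3 counted votes
unknown = classify("Nakamura", index)  # not in the toy list
```

The vote share gives a crude idea of how contested a name is; a real system would need far larger lists and would also use given names.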

2014_ICOS_NamSor_paper_vF_pic1

The largest groups of unique names in PubMed are British, French, German, Italian, Indian, Spanish, Dutch, etc.

An author with a French name might have a name from Brittany, Corsica or Limousin … or he might have a Canadian French name, or a Belgian French name. Or he might be an American professor with French ancestry.

Scientists' performance is often measured according to the number of publications, and the number of times a publication is cited by other publications (bibliometric rankings).

The table below shows the number of publications and the number of citations, by onomastic classes (top 20), as well as the ratio between the two metrics:

Onomastic class   Authored (A)   Cited (C)    Ratio (C/A)
(GB,LATIN)        557,177        1,664,415    3.0
(FR,LATIN)        272,150        743,471      2.7
(DE,LATIN)        192,778        448,103      2.3
(JP,LATIN)        172,866        361,682      2.1
(IT,LATIN)        187,564        323,771      1.7
(IE,LATIN)        86,161         422,103      4.9
(NL,LATIN)        102,982        321,787      3.1
(AT,LATIN)        78,199         339,819      4.3
(CN,LATIN)*       219,040        186,464      0.9
(IN,LATIN)        153,555        221,332      1.4
(ES,LATIN)        113,407        228,650      2.0
(PL,LATIN)        47,961         268,115      5.6
(SE,LATIN)        65,717         237,017      3.6
(FI,LATIN)        35,533         247,231      7.0
(KR,LATIN)        146,444        105,605      0.7
(TW,LATIN)*       88,822         162,132      1.8
(GR,LATIN)        51,564         196,056      3.8
(DK,LATIN)        42,403         181,199      4.3
(BE,LATIN)        44,647         162,146      3.6
(CH,LATIN)        32,295         162,495      5.0
*CN+TW            307,862        348,596      1.1

This table tells us that scientists with British names have published 557 thousand articles in PubMed and have been cited nearly 1.7 million times in other PubMed articles: the ratio is 3.0.

Articles written by authors with Italian names have been cited relatively less (with a ratio of 1.7), while articles written by authors with Irish or Finnish names have been cited more (with ratios of 4.9 and 7.0 respectively).
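As a quick check, the ratio column can be recomputed from the two counts. The sketch below copies four rows from the table; the function and variable names are illustrative.

```python
# (articles authored A, citations received C), copied from the table above.
counts = {
    "GB": (557_177, 1_664_415),
    "IT": (187_564, 323_771),
    "IE": (86_161, 422_103),
    "FI": (35_533, 247_231),
}

def citation_ratio(authored: int, cited: int) -> float:
    """Citations received per article authored (C/A), to one decimal place."""
    return round(cited / authored, 1)

ratios = {onoma: citation_ratio(a, c) for onoma, (a, c) in counts.items()}
for onoma, ratio in ratios.items():
    print(onoma, ratio)
```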

We cannot draw conclusions about the overall performance of British, Italian or Finnish scientists (many of them might be American scientists), but we can already observe interesting cultural biases emerging that cannot be explained by the imprecision of onomastic classification alone. They raise interesting questions:

– can mastery of the English language explain why authors with British or Irish names have more citations?

– can features of a particular culture (e.g. the Irish are excellent networkers and have great pubs) explain why scientific articles are more cited?

– do scientists with Italian names tend to cite more scientists with foreign-sounding names (English, Irish, etc.)?

– do scientists with Finnish names tend to cite more scientists with Finnish names?

– are there additional cultural biases in the publication process itself (selection, curation, promotion of scientific publications)?

– is there a gender bias worth noting (e.g. male scientists are more cited; a culture with fewer female scientists would get a higher ratio)?

Altogether, scientists with Chinese names (names from mainland China or Taiwan) have produced about 308 thousand articles and been cited about 349 thousand times: a ratio of 1.1, in the low range. For the rest of this paper, we focus on Chinese names: publications authored by a scientist with a Chinese name, or citations of scientists with Chinese names.

Scientists with Chinese names in PubMed

Globally, the number of publications in life sciences has been growing exponentially. Many countries and institutions encourage scientists to publish and link performance to bibliometric rankings (i.e. publications in reputable journals, number of citations, etc.).

2014_ICOS_NamSor_paper_vF_pic2

From this chart, we can observe:

– that the absolute number of publications authored by scientists with a Chinese name has nearly tripled between 2007 and 2008 (x2.5, from 7k to 17k);

– that the relative share of publications authored by scientists with a Chinese name (compared to other onomastic classes) is also growing steadily.

This growth in the number of publications by authors with Chinese names, in absolute and relative terms, is matched by a drop in the citation/authorship ratio:

2014_ICOS_NamSor_paper_vF_pic3

Year A C Ratio (C/A)
2012 81326 68038  0.8
2011 52396 42371  0.8
2010 33821 49260  1.5
2009 24726 35715  1.4
2008 17258 26321  1.5
2007 6944 17234  2.5
2006 4770 11299  2.4
2005 3260 6910  2.1
2004 1830 3782  2.1
2003 1195 2211  1.9
2002 849 1436  1.7
Before 3477 3823  1.1
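The year-on-year growth and the falling ratio can be read straight off these rows. The following sketch copies a few rows from the table; the helper names are illustrative.

```python
# (articles authored A, citations received C) per year, copied from the table above.
yearly = {
    2006: (4770, 11299),
    2007: (6944, 17234),
    2008: (17258, 26321),
    2012: (81326, 68038),
}

def growth(year: int) -> float:
    """Factor by which authored articles grew compared with the previous year."""
    return yearly[year][0] / yearly[year - 1][0]

def ratio(year: int) -> float:
    """Citations received per article authored (C/A) for one year."""
    return yearly[year][1] / yearly[year][0]

growth_2008 = round(growth(2008), 1)  # 2.5, the "x2.5" quoted in the text
ratio_2007 = round(ratio(2007), 1)    # 2.5
ratio_2008 = round(ratio(2008), 1)    # 1.5
ratio_2012 = round(ratio(2012), 1)    # 0.8
```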

Next, we will look at co-authorships. We expect co-authorships to be more frequent within the same onomastic class, because of the correlation with geography: scientists with an Italian name might live in Italy, work in the same university on a research project, and publish the results of their research together. We also expect to find diversity: many publications are the result of international cooperation; scientists are internationally mobile; last but not least, countries like the US and Switzerland attract talent from everywhere and, as a result of this global ‘brain drain’, produce very international research teams.

Both aspects, affinity and diversity, are reflected in the following matrix – displaying the number of co-authorships between onomastic classes:

2014_ICOS_NamSor_paper_vF_pic4

For example, the first column of the matrix (reflected in the pie chart below) shows that scientists with British names tend strongly to co-author with other scientists with British names, but also that they are likely to publish (in order) with scientists with French names, German names, Irish names, Italian names etc.

2014_ICOS_NamSor_paper_vF_pic5

Scientists with Chinese names tend even more strongly to co-author with other scientists with Chinese names; they are likely to publish (in order) with scientists with British names, French names, German names, Italian names, Irish names, Korean names etc.

2014_ICOS_NamSor_paper_vF_pic6

Next, we will look at citations. In a perfect world, we expect citations to be made based on the merits of scientific research only. We assume some ‘invisible hand’ will self-regulate the visibility of publications among research communities, so that all relevant research is known by the experts of the field. If scientific excellence is equally distributed, we expect the number of publications citing authors of a particular onomastic class to be proportional to the number of authors of that particular onomastic class. However, the following table tells a different story.

Onomastic Class   Onoma Authored %   Onoma Self Citations %   Bias Factor
(GB,LATIN) 16.6% 17.0% 1.02
(FR,LATIN) 8.1% 7.6% 0.94
(IT,LATIN) 5.6% 3.8% 0.68
(DE,LATIN) 5.8% 6.1% 1.05
(CN+TW,LATIN) 9.2% 12.1% 1.32
(ES,LATIN) 3.4% 3.8% 1.13
(JP,LATIN) 5.2% 19.3% 3.73
(IE,LATIN) 2.6% 4.4% 1.73
(NL,LATIN) 3.1% 5.6% 1.83
(AT,LATIN) 2.3% 4.2% 1.79
(SE,LATIN) 2.0% 3.5% 1.76
(IN,LATIN) 4.6% 4.1% 0.89
(PT,LATIN) 1.9% 2.3% 1.17
(GR,LATIN) 1.5% 2.8% 1.82
(KR,LATIN) 4.4% 3.0% 0.68
(BE,LATIN) 1.3% 2.6% 1.98
(DK,LATIN) 1.3% 3.4% 2.65

In this table, we observe that authors with British names represent 16.6% of publications, but 17% of their citations: a bias factor of 1.02 (almost no bias). Conversely, we observe that authors with French names represent 8.1% of publications, but only 7.6% of their citations: a bias factor of 0.94, indicating that authors with French names tend to cite authors with foreign names more.

As for authors with Chinese names, they represent 9.2% of the publications, but 12.1% of their citations: a bias factor of 1.32, indicating that they tend to cite authors with Chinese names more.

Authors with Chinese names have a positive bias in citing authors with Chinese names. However, we can see other cases where the bias is even stronger: authors with Japanese names citing authors with Japanese names, authors with Danish names…

More interestingly, the following table shows that, apart from authors with a Chinese name, every other onomastic class (British, French, Italian, German etc.) has a negative bias towards citing authors with a Chinese name.
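Read this way, the bias factor is the share of a class in its own citations divided by its share of publications. The sketch below recomputes three rows under that reading; the function name is illustrative. Because the published percentages are rounded to one decimal, other rows of the table are reproduced only approximately.

```python
def bias_factor(cited_share: float, authored_share: float) -> float:
    """Share of citations going to a class divided by its share of publications.

    1.0 means no bias, above 1.0 over-citation, below 1.0 under-citation.
    """
    return cited_share / authored_share

# (authored %, self-citation %), copied from the table above.
shares = {
    "GB": (16.6, 17.0),
    "FR": (8.1, 7.6),
    "CN+TW": (9.2, 12.1),
}

factors = {
    onoma: round(bias_factor(cited, authored), 2)
    for onoma, (authored, cited) in shares.items()
}
print(factors)
```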

Onomastic class   Chinese Onoma Citation %   Bias Factor
(GB,LATIN) 3.9% 0.43
(FR,LATIN) 3.9% 0.42
(IT,LATIN) 3.9% 0.43
(DE,LATIN) 4.1% 0.44
(CN+TW,LATIN) 12.1% 1.32
(ES,LATIN) 4.0% 0.43
(JP,LATIN) 5.2% 0.56
(IE,LATIN) 4.0% 0.44
(NL,LATIN) 3.5% 0.38
(AT,LATIN) 4.1% 0.44
(SE,LATIN) 3.6% 0.40
(IN,LATIN) 5.9% 0.65
(PT,LATIN) 4.0% 0.43
(GR,LATIN) 3.9% 0.42
(KR,LATIN) 6.8% 0.74
(BE,LATIN) 3.8% 0.42
(DK,LATIN) 3.9% 0.42

Authors with a Chinese name tend to cite authors with a Chinese name more. By comparison, scientists with non-Chinese names (British, French, Italian, German etc.) have a bias factor of 0.46 and are about three times less likely to cite publications authored by a scientist with a Chinese name.

We will now see how the bias factors evolved between 2002 and 2012:
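The "three times" figure follows from dividing the two bias factors. The sketch below also takes the plain unweighted mean of the 16 non-Chinese rows listed above as a rough cross-check; the paper does not say how its 0.46 is aggregated, so the two numbers need not match exactly.

```python
own_bias = 1.32          # authors with Chinese names citing Chinese names
others_reported = 0.46   # non-Chinese classes citing Chinese names, as stated

# Bias factors of the 16 non-Chinese classes, copied from the table above.
listed_other_factors = [
    0.43, 0.42, 0.43, 0.44, 0.43, 0.56, 0.44, 0.38,
    0.44, 0.40, 0.65, 0.43, 0.42, 0.74, 0.42, 0.42,
]

relative_likelihood = own_bias / others_reported  # about 2.9
unweighted_mean = sum(listed_other_factors) / len(listed_other_factors)  # about 0.47

print(round(relative_likelihood, 1), round(unweighted_mean, 2))
```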

2014_ICOS_NamSor_paper_vF_pic7

According to this table, the positive bias factor of authors with Chinese names in citing other authors with Chinese names remains roughly stable. On the other hand, the negative bias factor of scientists with non-Chinese names in citing authors with Chinese names is generally increasing.

Manual controls

Given the large number of names automatically classified in a taxonomy based on geographic origin (China, etc.), we could not manually verify the entire database. We manually verified two randomly selected subsets:

– firstly, a list of 1280 names recognized by the software as Chinese names;

– secondly, a list of ~10000 names classified by the software into the full taxonomy (over 100 onomastic classes, corresponding to different countries of origin)

According to the first validation method, 83% of the names the software recognized as Chinese were manually verified as Chinese; 2% as unknown; 15% as non-Chinese (i.e. misclassifications).

The software outputs a confidence level. 76% of the names were classified with positive confidence. For the names recognized as Chinese with a positive confidence, 94% were manually verified as Chinese; 1% as unknown; 4% as non-Chinese (i.e. misclassifications).

2014_ICOS_NamSor_paper_vF_pic8

In PubMed, many names do not have a full first name, only initials.

For names classified with positive confidence, we found that first names of just one or two characters (e.g. J or JH) accounted for 90% of misclassifications. When the input includes a full name (as would generally be the case with other bibliometric sources such as Thomson WoS, Scopus or ORCID), the accuracy is 99%.

2014_ICOS_NamSor_paper_vF_pic9

According to the second validation method, we can calculate the usual classification metrics: precision and recall.

10172 names were classified independently by a manual operator. In this method, errors could be made by the computer and also by the manual operator.

For the calculations below, we assume the manual operator made no mistakes (this is not the case: to err is human). The manual operator could classify 50% of the names and left the rest as ‘Not Sure’.

For Chinese and non-Chinese names, the software precision was respectively 81% and 97% and the recall was 59% and 99%. For names classified by the software with positive confidence (52% of all names), the precision was 93% and the recall was 69%. Excluding the names with a first name of at most two characters (initials, such as J or JH), the precision was 97% and the recall was 72%.

If conversely, we assume that the computer made no mistakes, then we can compare the precision and recall of the operator with that of the computer:

All Names                          Chinese Names        Non Chinese Names
                                   Computer   Human     Computer   Human
Precision                          81%        59%       97%        99%
Recall                             59%        42%       99%        48%

Confidence>0                       Chinese Names        Non Chinese Names
                                   Computer   Human     Computer   Human
Precision                          93%        69%       96%        99%
Recall                             69%        49%       99%        48%

Confidence>0 && Len(firstName)>2   Chinese Names        Non Chinese Names
                                   Computer   Human     Computer   Human
Precision                          97%        72%       96%        100%
Recall                             72%        51%       100%       48%
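For readers less familiar with the two metrics, here is a minimal sketch of how precision and recall are computed for one class, and of what happens when the roles of "reference" and "classifier" are swapped. The label lists are made-up toy data, not the paper's validation sample, and every name in them is assumed to be labelled by both sides (no ‘Not Sure’).

```python
def precision_recall(predicted, reference, positive):
    """Precision and recall of `predicted` for one class, taking `reference` as truth."""
    pairs = list(zip(predicted, reference))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels for six names (hypothetical).
software = ["CN", "CN", "CN", "CN", "other", "other"]
operator = ["CN", "CN", "other", "other", "CN", "other"]

# Operator taken as the reference: precision 2/4, recall 2/3.
software_scores = precision_recall(software, operator, "CN")
# Software taken as the reference: the two values swap.
operator_scores = precision_recall(operator, software, "CN")
```

When both sides label every name, swapping the reference simply exchanges precision and recall, which is why the computer's recall and the human's precision for Chinese names coincide in each block of the table.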

This method of cross validation between computer and human could be improved by having several manual checks by different operators to obtain a good validation sample.

Future work

For future work, we would like to mine the large commercial bibliographic databases (Thomson WoS, Scopus and possibly ORCID) because they offer better data quality and useful additional information:

– firstly, they have the full name in addition to the short name cited with just initials; this significantly reduces the error rate of onomastic classification;

– secondly, they link scientists to research institutions (affiliations) and geographies (country of affiliation); this allows additional analysis on the topic of diasporas and brain drain, comparing, for example, the research output of Chinese / Chinese American scientists in the US with that of scientists in mainland China;

– thirdly, those databases have a larger coverage in terms of scientific disciplines, allowing comparison between different fields of research.

Conclusions

Significant cultural biases exist, not only in the way scientists co-author publications, but also in the way they make citations. Scientific publications authored by scientists with Chinese names are three times less cited by the international research community than they are by other scientists with Chinese names. We cannot draw conclusions about the quality of Chinese research, but we can challenge the commonly accepted idea that the volume of publications and citations alone indicates that China is becoming a superpower in science and technology.

Bibliometric rankings are important in the way countries build and monitor public policies on science, education and international cooperation, and in the way research institutions measure and reward the scientific excellence of researchers and teams. These biases should therefore be accounted for. Otherwise, international comparisons are not ‘scientific’, not fair and can lead to wrong decisions.

[PDF 2014_ICOS_NamSor_paper_vF.pdf] [Pitch 20140828_ICOS2014_Pitch_vF.pdf]

[1] Onomastics and Big Data Mining, ParisTech Review 2013, arXiv:1310.6311 [cs.CY]

Source Data


What’s in a name in 1914, in 2014?

(an onomastics.co.uk reblog)

This month, starting 25th of August, the University of Glasgow will host the 25th International Congress of Onomastic Sciences, the premier conference in the field of name studies.

For this occasion, we have started to calibrate NamSor software to recognize Scottish names. This is work in progress, but I’d like to share some preliminary data visualizations of regional names.

2014 marks 100 years since the start of the First World War. All across Europe and beyond, families lost dear ones, children were raised without knowing their father and grandchildren were born in the aftermath of this trauma, only to live through another global war, WWII. Let’s respect the people who died in both wars, and let’s also listen to the message their names convey to us about who they were, about who we are.

What do personal names tell us about the world in 1914?

In 1914, Europe was composed of Nations and Nations of Regions with deeply rooted people. This was the situation before the massive rural exodus and before the international migration flows caused by either decolonization or what we call today ‘globalization’. This first global war was fought by local people who lived close to one another, married in their local community, often spoke their own local language…

Scottish names

We’ve analysed the Commonwealth War Graves Commission (CWGC) database to see if we could correlate onomastics and regiments. The result is presented below:

20140801_Scottish_WWI_Onomastic_Millefeuille_v002

 

We’ve found a majority of Scottish names in regiments such as: the Gordon Highlanders, the Mercantile Marine Reserve, the Royal Scots, the Cameron Highlanders, the Seaforth Highlanders, the King’s Own Scottish Borderers and also the Royal Flying Corps.

The onomastic mille-feuille is dense but hard to understand. You can think of it as a sorted list of pie charts, like this one:

20140801_Scottish_WWI_Onomastic_PieChart_MercantileMarineReserve_v001

This pie chart tells us that the Mercantile Marine Reserve was composed mostly of Scottish and Welsh soldiers.

By looking at the soldiers’ ranks for that particular regiment, we can produce a new onomastic mille-feuille: names DO matter when it comes to rank in 1914.

20140801_Scottish_WWI_Onomastic_MilleFeuille_MercantileMarineReserve_v001

In more easily understandable pie chart language, this means that the Firemen were mostly Scottish and Welsh, whereas the Carpenters were English.

20140801_Scottish_WWI_Onomastic_Ranks_PieChart_MercantileMarineReserve_v001

Indian names

The First World War started as a European war, but populations from Africa and Asia were immediately mobilized by the colonial powers of the time: the British Colonial Empire, France, … Many soldiers came from far away to meet their death in the trenches of Eastern France.

The Indian names in CWGC are indicated without any given name, but with the son’s and father’s name, for example:

sonName fatherName place regiment
PURANBAHADUR GHARTI KAMANSING GHARTI NEPAL 9th Gurkha Rifles
PUNE THAPA NAIN SING THAPA GULMI NEPAL 4th Gurkha Rifles
RADHA KISHN GANGA RAM RAJPUTANA Bharatpur Infantry
SITARAM SAWANT NILU SAWANT BOMBAY 117th Mahrattas
NAMDAR KHAN HAYAT KHAN N W F  PROVINCE 21st Punjabis
SHAHAB UDDIN KARAM ILAHI PUNJAB 53rd Sikhs (Frontier Force)
RAM RAKHA CHHOTE PUNJAB Sirmur Imperial Service Sapper Corps
AMAR SINGH GURDITT SINGH PUNJAB 15th Ludhiana Sikhs
LALITBIKRAM THAPA RAMBIKRAM THAPA NEPAL 5th Gurkha Rifles (Frontier Force)
PANCHAM DHUNDA UNITED PROVINCES Army Bearer Corps
CHINNASWAMI DURUGAYA MYSORE 2nd Queen Victoria’s Own Sappers and Miners
LAKKHI JAHANGIR UNITED PROVINCES Indian Royal Artillery
SHIU DAS DUBE RAM SEWAK DUBE UNITED PROVINCES 3rd Brahmans
BATAN SINGH BELA SINGH PUNJAB 57th Wilde’s Rifles (Frontier Force)
KALU GHALE KAMI GHALE NEPAL 8th Gurkha Rifles
ISMAIL HAIDAR MANUBUDDIN SIKDAR BENGAL Indian Railway Department
FATTEH KHAN DIL DOST KHAN PUNJAB 82nd Punjabis
SUJAWAL KHAN BAHADUR KHAN PUNJAB 38th King George’s Own Central India Horse
MAHABIR MURAI LACHHMAN MURRAI UNITED PROVINCES 3rd Sappers and Miners
SURENDRANATH RIAWA CHANDI CHARAN BISWAS BENGAL Indian Labour Corps

 

So we have used a different algorithm to automatically cluster Indian names into onomastic classes. Some onomastic classes might be related to geography, to Indian castes, to social status or religious beliefs …

We can again use an onomastic mille-feuille to visualize the correlation between names and geography, but here a classic geographical map would probably tell a better story.

20140801_Indian_WWI_Onomastic_Millefeuille_v001

Distinctive patterns are recognized in names from Bombay, Madras, Delhi or Peshawar, allowing the software to cluster them into distinct onomastic classes.

And again we can then look at regiments to visualize how ethnically/linguistically diverse they were:

20140801_Indian_Regiments_WWI_Onomastic_Millefeuille_v001

 

Italian names

All regions of Italy paid a heavy toll in the Great War:

2014_Italian_WWI_Casualties

 

Italian regional names are particularly well differentiated, as can be seen in the following onomastic mille-feuille:

2014_Italian_WWI_Onomastics

We display here some examples of typical names from different regions. Can you see how different they are?

  • IT/Abruzzi e Molise: MEZZACAPPA GIUSEPPE DI ANTONIO, PAOLILLI-TREONZE PASQUALE DI DOMENICO, BONITATIBUS ERMANNO DI ANGELO, FIDELIBUS ANGELANTONIO DI EUGENIO, PAOLILLI-TREONZE DONATO DI GAETANO, VASQUENZ AUGUSTO ANGELO DI ANTONIO, AMMAZZALORSO ANTONIO DI ANGELO.
  • IT/Basilicata: LATERZA GIOVANNI DI GIUSEPPE, SCAMORCIA GIUSEPPE DI GAETANO, ALAGIA NICOLA DI GIUSEPPE, CLAPS VITO CANIO DI GAETANO, CLOROFORMIO VITO DOMENICO DI TADDEO, SCANDIFFIO DOMENICO DI INNOCENZO, CLAPS ANGELO VITO DI VITANTONIO, CASAMASSIMA FRANCESCO PAOLO DI GIOVANNI, PENNIMPEDE GIUSEPPE DI PIETRO.
  • IT/Calabria: PROCOPIO FRANCESCO DI NICOLA, CANDREVA FRANCESCO DI GIUSEPPE, SCICCHITANO FRANCESCO DI GIUSEPPE, SPACCAROTELLA GIOVANNI DI ANGELO, CICCIU CONSOLATO DI ANTONIO, LULJ GIUSEPPE DI VINCENZO, TRUNCELLITO DOMENICO PASQUALE DI GIUSEPPE, DAVOLOS DOMENICO DI PASQUALE, CHIDICHIMO GIOVANNI DI SALVATORE.
  • IT/Campania: ANNUNZIATA GIOVANNI DI ANTONIO, PISCOPO GIOVANNI DI ANTONIO, PISCOPO GIUSEPPE DI ANTONIO, SARRAPOCHIELLO LORENZO DI NICOLA, GENETIEMPRO GIUSEPPE DI MATTEO, VALIANTAE ANIELLO DI CARMINE, DONNIACUO ALFONSO DI GIUSEPPE.
  • IT/Emilia-Romagna: SCHIAVAZAPPA BONFIGLIO DI CRISTOFORO, SAVRIE ADELCHI DI GIUSEPPE, VACONDIO BONFIGLIO DI PIETRO, GUAGLIUMI GEMINIANO DI CESARE, ASTROLOGI GIOVANNI DI FERDINANDO, SAVRIE GIUSEPPE DI PRIMO, GUAGLIUMI GIOVANNI DI LEANDRO, MANSERVIGI GIOVANNI DI SALINGUERRA.
  • IT/Lazio: ASTROLOGO ANGELO DI PACIFICO, FAPERDUE SALVATORE DI VALENTINO, CENTOSCUDI NAZZARENO DI SANTE, CARLODALATRI UMBERTO DI FRANCESCO, CAPPADOCIA GIUSEPPE DI GIOVANNI, SCHIETROMA GIUSEPPE DI PASQUALE, PALAMIDES GIOVANNI DI GIUSEPPE, GIANFERMI GIOVANNI BATTISTA DI DOMENICO, CAPPADOCIA AMEDEO DI GIUSEPPE, PIETROBONO GUGLELMO DI BENIAMINO.
  • IT/Liguria: GAGGERO GIOVANNI BATTISTA DI GIUSEPPE, KONIG GIOVANNI BATTISTA DI GIOVANNI BATTISTA FILIPPO, MONTEGHIRFO GIOVANNI DI LUIGI, MAGIONCALDA GIOVANNI BATTISTA DI GIOVANNI, BACIGALUPO GIOVANNI BATTISTA DI DOMENICO, REDEGOSO GIOVANNI BATTISTA DI BARTOLOMEO, KONIG GUGLIELMO DI PIETRO, ARBOCO GIOVANNI BATTISTA DI EMANUELE VINCENZO.
  • IT/Lombardia: SANTAMBROGIO GIUSEPPE DI FRANCESCO, RUEFF GIOVANNI DI GIOVANNI, RECALCATI GIUSEPPE DI AMBROGIO, TAGLIABUE GIUSEPPE DI ANGELO, RANZENIGO FRANCESCO DI GIOVANNI, PIANTANIDA ANTONIO DI FELICE, SALMOIRAGHI GIUSEPPE DI ATTILIO, CONSONNI GIUSEPPE DI DOMENICO.
  • IT/Marche: CUCCU GIUSEPPE DI FRANCESCO, FIORDOLIVA GIUSEPPE DI PACIFICO, CINGOLANI NAZZARENO DI PIETRO, ANGELOME MARONE DI GIUSEPPE, VOLTATTORNI NAZZARENO DI FRANCESCO, CARSTANJEN GUSTAVO DI PAOLO, MENGHI-CERRA NAZZARENO DI DAVID, VOLTATTORNI CIRIACO DI LUIGI, CARSTANJEN EDOARDO DI PAOLO, BRUZZECHESSE DOMENICO DI FRANCESCO.
  • IT/Piemonte: DESTEFANIS GIOVANNI DI GIUSEPPE, RIVOIRA GIOVANNI DI PIETRO, CUTTICA GIUSEPPE DI CARLO, BELLINO-ROCI GIUSEPPE DI NICOLAO, NEPOTE GIOVANNI DI DOMENICO, AIMAR BARTOLOMEO DI BARTOLOMEO, LANTELME GIORGIO DI FRANCESCO, GUELPA GIOVANNI DI GIOVANNI, VALSANIA GIOVANNI DI ANTONIO, ARNEODO GIUSEPPE DI GIOVANNI.
  • IT/Puglia: SPAGNULO COSIMO DAMIANO DI FRANCESCO, VANTAGGIATO GIUSEPPE DI VINCENZO, SEMERARO GIOVANNI DI GIUSEPPE, EPICOCO DOMENICO DI GIOVANNI, AGHILAR RUGGIERO DI LUIGI, CANNABONA CROCIFISSO DI PASQUALE, BAGLIVO CROCIFISSO DI ORONZO, SPEDICATO CROCEFISSO DI SALVATORE, GIANCANE CROCIFISSO DI RAFFAELE.
  • IT/Sardegna: MARONGIU SALVATORE DI ANTONIO, PORCU GIOVANNI DI FRANCESCO, MARONGIU FRANCESCO DI SALVATORE, PUTZOLU GIOVANNI DI GIUSEPPE, DESOGUS GIOVANNI DI ANTONIO, MURTAS GIOVANNI DI GIUSEPPE, LAMPIS ANTIOCO DI FRANCESCO.
  • IT/Sicilia: RAPISARDA SALVATORE DI GIUSEPPE, GIONFRIDDO PAOLO DI SALVATORE, MACALUSO GIUSEPPE DI GIUSEPPE, SPAMPINATO ANTONINO DI GIUSEPPE, PRIVITERA ANTONINO DI GIUSEPPE, SCACCIANOCE SALVATORE DI ROSARIO, RAPISARDA SALVATORE DI CARMELO, CANGIALOSI ANTONINO DI MICHEL.
  • IT/Toscana: SCHIUMARINI IACOPO DI ANTONIO, DIOLAIUTI FERRUCCIO DI GIULIO, MAZZEI EFREM DI GIUSEPPE, DELL’EUGENIO ANGIOLO DI ANTONIO, DELL’ARINGA GABBRIELLO DI DANIELE, PISTOI ASTAROTTE DI OLIMPIO, BIENTINESI MILZIADE DI GIOVANNI, ANZEMPAMBER FILIPPO DI ADOLFO, BEMPORAD DUILIO DI POLICARPO, DELL’OMODARME RANIERI DI DEMETRIO.
  • IT/Trentino-Alto Adige: DALPIAZ GIUSEPPE, ANDERLE GIOVANNI, DEVIGILI GIUSEPPE, PONTALTI GIUSEPPE, CASAGRANDA GIUSEPPE, FLAIM GIOVANNI, PALLAORO GIUSEPPE, STEDILE GIUSEPPE, DETASSIS GIUSEPPE, DELVAI GIUSEPPE.
  • IT/Umbria: DESANTIS GIUSEPPE DI DOMENICO, MAGARINI-MONTENERO DOMENICO DI BONAVENTURA, QUONDAM GIOVANNI DI NAZZARENO, GAMBELUNGHE SALVATORE DI CESARE, CENTOGAMBE DOMENICO DI FELICE, QUONDAM CASTORINO DI GIUSEPPE, BESTIACCIA GIOVENALE DI GIUSEPPE, BELLACHIOMA ASTORRE DI ALBERTO, SFORNA CRISPOLTO DI NAZZARENO, CENTOGAMBE GIUSEPPE DI PIETRO.
  • IT/Veneto: DELL’OSBEL GIOVANNI DI ANTONIO, MESTRINER GIOVANNI DI GIUSEPPE, RODIGHIERO GIOVANNI DI ANTONIO, BOF GIOVANNI DI LUIGI, DALL’OSTO GIUSEPPE DI PIETRO, SKREZENEK GIUSEPPE DI CARLO, FILOSOFO GIOBATTA DI PAOLO, MENEGUZ GIOBATTA DI ANTONIO, MESCALCHIN GIOBATTA DI ANDREA, CIPOLAT-GOTET GIOVANNI DI GRAZIADIO.

French names

The equivalent of CWGC in France is the Mémoire des Hommes database. We’ve used it to calibrate NamSor recognition of French regional names. After calibration, about 70% of names can be allocated to a particular region and we can produce the following onomastic mille-feuille, sorted according to the relative number of Bretons (people from Brittany):

20140801_France_WWI_Millefeuille_v001

We can also view the total number of casualties, broken down by onomastic class. It shows the large number of people originally from Brittany who died during WWI, regardless of their birthplace. However, this remains debatable, as ~30% of names could not be specifically allocated to a region of origin (only recognized as French).

20140801_France_WWI_RegionalBreakdown_v001

Baptiste COULMONT, a sociologist, published a very interesting study on given names, analysing the results of students at the French Baccalaureate in 2014. We’ve used a similar dataset to compare regional names in 1914 and in 2014. Unfortunately, we didn’t have enough time to align the geographic mappings, but the result is visual and self-explanatory. We can see how rural exodus and internal migration have eroded the regional identity in personal names. Still, we can see that even in 2014, the correlation between onomastics and geography remains strong, especially in Brittany, in the North of France, in Alsace, in Lorraine, in Loire, in Lyon, in Aquitaine and Corsica.

20140801_France_Millefeuille_1914_2014_v001

What do names tell us about the world in 2014?

A lot! Some say: too much!

Enough to make ICOS2014 a very exciting and current event. We look forward to being in Glasgow on 24th August and meeting you there. Long live onomastics.co.uk

Feel free to contact us at contact@namsor.com

About

NamSor™ Applied Onomastics is a European designer of name recognition software. Our mission is to help make sense of the Big Data and understand international flows of money, ideas and people.
http://namsor.com/

NamSor is committed to promoting diversity and equal opportunity and launched GendRE API, a free API to conduct analysis of gender equality using open data.

 


Onomastic sampling for migration studies

On Friday morning, I had the opportunity to present our breakthrough data mining technology at Regent’s University Turkish Migration Conference (TMC2014, London).

The supporting presentation can be downloaded here (20140530_TMS2014_Pitch_vFf.pdf) or viewed online here.

20150601_TurkishMigrationStudies

During the following sessions by researchers from various countries (Turkey, US, UK, Germany, Netherlands, Sweden, Norway, Belgium …), I learned some of the ‘jargon’ of migration studies and also something about the particular research methodologies applied in that field.

My initial vision was that onomastics (the recognition of personal names) could be applied to discover new migration patterns. It was based on several preliminary meetings with international organizations concerned with migration issues. Census data can take up to three years to process. As states struggle to provide timely and accurate data to international organizations (such as the OECD, IOM, United Nations High Commissioner for Refugees UNHCR, …), these organizations can turn to the Big Data to identify and monitor new trends. There are challenges in identifying relevant data sources to provide valuable information about less digitally connected migrants. Twitter, LinkedIn, Google, Facebook, D&B, Thomson WoS … combined with applied onomastics can tell us a lot about the changing migration patterns of STEM Workers, innovators and entrepreneurs.

STEM Workers: workers in science, technology, engineering, and mathematics; art is occasionally considered as well (STEAM Workers).

With several TMS2014 sessions focused on the question of Turkish identity, or the particular migration and integration patterns of the Turkish, Kurdish, Alevi or Circassian communities, applied onomastics clearly offers an innovative tool to look at data from different angles (nationality/birth place/ethnicity/gender/…).

However, I found that many research studies are conducted based on an initial theoretical hypothesis. Researchers then apply various qualitative or quantitative methods (occasionally both) to assess the hypothesis. Purely quantitative methods such as ‘data mining’ or ‘graph analysis’ are seen as de-humanizing by researchers (anthropologists, sociologists, historians …) who are primarily interested in the human story of migration. Most researchers conduct surveys to gather the data for their study: they find people, talk to them, ask questions. How do researchers identify the group of people to be surveyed (the sample)? During the conference, I learned another piece of jargon: network/snowball sampling.

Network/snowball sampling: Snowball sampling is based on the selection of target people in personal networks. In a first step, important people within the target group are identified (initial sample) who themselves identify further people who can be also addressed for the survey (McKenzie & Mistiaen, 2007, p. 2; Salentin, 1999, p. 124).
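The definition above can be sketched as a breadth-first walk over a referral network: start from the initial sample and keep adding the people they name until the target size is reached. The network, the names and the size cap below are hypothetical, and real surveys add practical steps (consent, refusals, limits on referrals per person) that this sketch ignores.

```python
from collections import deque

def snowball_sample(referrals, seeds, max_size):
    """Collect people by following referrals outward from the initial sample."""
    sampled = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sampled) < max_size:
        person = queue.popleft()
        sampled.append(person)
        # Each surveyed person names further contacts from the target group.
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return sampled

# Hypothetical referral network: who names whom.
referrals = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["F"],
}
sample = snowball_sample(referrals, ["A"], max_size=5)
print(sample)  # ['A', 'B', 'C', 'D', 'E']
```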

As often, this new word was the magic keyword to find additional resources and understand how NamSor technology could fit with the current state of migration research methodology:

This document clearly describes the various methodologies to identify the initial population of a study and the various sampling procedures. Onomastic sampling is one of them.

‘In many countries, migrants constitute a substantial part of society. In public opinion research, however, they are often inadequately or not at all considered. This paper gives a systematic overview of the underlying methodological challenges that cause this situation. Those challenges are twofold and concern (1) the definition and distinction of the terms migrant and foreigner to describe the target group and (2) the selection of adequate sampling procedures.’

‘The methodological challenge of selecting adequate sampling procedures

Even after defining the target population, researchers still face difficulties regarding sampling. The problems tackled can be diverse, for instance in what way the target population can be contacted (which survey modes are culturally accepted?) and how the individual respondents can be selected (e.g. does last-birthday work?). The paper discusses four central sampling procedures which regularly come up in the literature and which are seemingly appropriate for these kinds of surveys:

1. Sampling procedures on the basis of administrative records,

2. Area sampling, like e.g. random-route-procedures,

3. Network/snowball sampling, and

4. Onomastic sampling procedures based on foreign names from directories.’

How can NamSor software help?

1. Sampling procedures on the basis of administrative records

In this sampling method, the administrative records do not reflect the fine-grained identity of the populations: ‘Turkish nationality’ or ‘Born in Turkey’ encompasses many different populations. Applied onomastics can help refine samples to more targeted populations (Turkish, Alevi, Kurdish, Syrian, …).

2. Area sampling, like e.g. random-route-procedures

In this sampling method, it’s critical to understand the geo-demographics of a territory to know where different migrant populations are concentrated. Applied onomastics can help assess the density of migrant populations at various levels (region/city/district or road) from various public data sources.

3. Network/snowball sampling

In this sampling method, the personal network of the researcher is used as an initial seed to identify further prospects for interviews. Applied onomastics could help analyse personal networks of researchers (from social networks such as Twitter, or academic sources such as bibliographic databases) to identify larger seed networks and generate better samples. That could help reduce the risk of biases induced by the researcher’s network (which tends to reinforce the researcher’s own personal or cultural biases).

4. Onomastic sampling procedures based on foreign names

Dictionaries of given names and family names associated with a particular culture have been used for sampling.

NamSor software goes beyond this technique: it uses sociolinguistics to recognize the likely origin of a person from a (firstName, lastName) pair, with high accuracy. NamSor software can help researchers conduct onomastic sampling, not just from telephone directories but also from a wide range of modern data sources (social networks, opt-in commercial databases, …), with high precision and fine-grained targeting.
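
For comparison, the classic dictionary-based procedure (the baseline that name recognition software aims to improve on) can be sketched in a few lines. The surname dictionary, the directory entries and the function name below are hypothetical, and this is not NamSor’s method: it simply keeps directory entries whose surname appears in the dictionary and draws a random sample from them.

```python
import random

# Hypothetical surname dictionary for the target group (illustrative only).
TARGET_SURNAMES = {"yilmaz", "kaya", "demir"}

# Hypothetical directory entries: (first name, last name, phone).
directory = [
    ("Anna", "Schmidt", "555-0101"),
    ("Emre", "Yilmaz", "555-0102"),
    ("Elif", "Kaya", "555-0103"),
    ("Jan", "Becker", "555-0104"),
    ("Murat", "Demir", "555-0105"),
    ("Selin", "Kaya", "555-0106"),
]

def onomastic_sample(entries, surnames, size, seed=0):
    """Keep entries whose surname is in the dictionary, then draw a random sample."""
    frame = [e for e in entries if e[1].lower() in surnames]
    rng = random.Random(seed)          # fixed seed so the draw is reproducible
    return rng.sample(frame, min(size, len(frame)))

sample = onomastic_sample(directory, TARGET_SURNAMES, size=2)
print(sample)
```

A plain lookup like this misses spelling variants and names absent from the dictionary, which is the gap a (firstName, lastName) classifier is meant to close.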

Conclusion

NamSor’s powerful technology raises many data privacy and ethical questions, but we’re glad to say that if science and migration studies can be good for society, NamSor can be too.

About NamSor:
NamSor’s mission is to help understand international flows of money, ideas and people. NamSor launched GendRE API, a free API to conduct analysis of gender equality using open data. http://namesorts.com/api/

Filed under General

Onomastics API for Gender Studies

[AGENDA] Meet us on 29 April 2014 at DataTuesday Paris with Girls in Tech Paris, on the topic ‘Women & Data’.

Gender Equality in French Politics

Women fill 26.17% of the seats at the French National Assembly (‘L’Assemblée Nationale’), according to the count of ‘M.’ and ‘Mme’ at
http://www.assemblee-nationale.fr/qui/xml/liste_alpha.asp?legislature=14
That’s more than double the figure of ten years earlier (2002: 10.9%). Good job, ladies!

If that list did not indicate M. and Mme, could we still recognize the gender from the politician’s name? NamSor has published a simple API for Gender Studies, which gives the following result: 26.31% (more than 99% accurate compared to the actual figure).

What about the Corporate World?

Playing with old data from a previous life in the corporate world (which cannot be disclosed), applied onomastics tells us that among ~4000 top company executives with a median base salary of USD 230,000, men landed a neat USD 890 million while women got USD 143 million in total. This huge gap is the result of fewer women having a top job and of men earning ~20% more on average for the same job.
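
As a rough back-of-the-envelope check, the quoted totals can be turned into an implied headcount split. This sketch rests on two simplifying assumptions that are not in the original data: the ‘~4000’ is treated as exactly 4000, and the ~20% premium is applied to average pay across the whole sample (the text states it ‘for the same job’). The result is an order of magnitude, not a measured figure.

```python
# Figures quoted in the text.
TOTAL_EXECUTIVES = 4000          # "~4000" in the text, taken as exact here
MEN_TOTAL_PAY = 890e6            # USD
WOMEN_TOTAL_PAY = 143e6          # USD
PAY_RATIO = 1.20                 # assumption: men's average pay = 1.2 x women's

# men_total / women_total = (n_men * PAY_RATIO * avg_w) / (n_women * avg_w),
# so the headcount ratio n_men / n_women follows directly.
headcount_ratio = MEN_TOTAL_PAY / (PAY_RATIO * WOMEN_TOTAL_PAY)
n_women = TOTAL_EXECUTIVES / (1 + headcount_ratio)
n_men = TOTAL_EXECUTIVES - n_women

women_share = n_women / TOTAL_EXECUTIVES
women_pay_share = WOMEN_TOTAL_PAY / (MEN_TOTAL_PAY + WOMEN_TOTAL_PAY)

print(f"implied women among executives: {women_share:.1%}")
print(f"women's share of total pay:     {women_pay_share:.1%}")
```

Under these assumptions women would hold roughly one top job in six while receiving a smaller share of total pay, which is consistent with the two effects named above.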

20140314_GenderEquality_Teaser_v001

Currently, the GendRE API is in beta and free to use.

GenderEquality.java

You can download the sample program GenderEquality.java.zip

Detailed input/output

https://namsor-gendre.p.mashape.com/gendre/Damien/Abad/fr returned -0.9979281991518565
https://namsor-gendre.p.mashape.com/gendre/Laurence/Abeille/fr returned 0.9984725610426144
https://namsor-gendre.p.mashape.com/gendre/Ibrahim/Aboubacar/fr returned -1.0
https://namsor-gendre.p.mashape.com/gendre/Élie/Aboud/fr returned -0.9749559773038545
https://namsor-gendre.p.mashape.com/gendre/Bernard/Accoyer/fr returned -0.9996548690100067
https://namsor-gendre.p.mashape.com/gendre/Patricia/Adam/fr returned 0.9997681752981121
[…] namsor_api_calls.zip
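
The sample output above is enough to sketch how a share of women can be derived from the scores. This is not the GenderEquality.java program itself; it is a small illustration that assumes, as the six sample names suggest, that a positive score means female and a negative score means male. The function names are made up for this example.

```python
# Sample lines copied from the output above (URL, then the returned score).
LOG = """\
https://namsor-gendre.p.mashape.com/gendre/Damien/Abad/fr returned -0.9979281991518565
https://namsor-gendre.p.mashape.com/gendre/Laurence/Abeille/fr returned 0.9984725610426144
https://namsor-gendre.p.mashape.com/gendre/Ibrahim/Aboubacar/fr returned -1.0
https://namsor-gendre.p.mashape.com/gendre/Élie/Aboud/fr returned -0.9749559773038545
https://namsor-gendre.p.mashape.com/gendre/Bernard/Accoyer/fr returned -0.9996548690100067
https://namsor-gendre.p.mashape.com/gendre/Patricia/Adam/fr returned 0.9997681752981121
"""

def parse_line(line):
    """Split one log line into (first name, last name, score)."""
    url, score = line.split(" returned ")
    first, last, _country = url.split("/")[-3:]
    return first, last, float(score)

def share_of_women(lines):
    """Assumption: positive score = female, negative score = male."""
    scores = [parse_line(line)[2] for line in lines if line.strip()]
    women = sum(1 for s in scores if s > 0)
    return women / len(scores)

lines = LOG.splitlines()
print(f"{share_of_women(lines):.2%} women in this six-name sample")
```

Scores close to zero would deserve separate handling (for example, an ‘unknown’ bucket); none of the six sample lines falls in that range.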

Filed under General

Hispanic, French, German names in the United States

NamSor has mapped Hispanic Twitter accounts around the world. Not just Hispanic: French and German as well.

This interactive world map of the Hispanic, French and German e-Diasporas was produced using Twitter account data.

To access the interactive map, click here: http://cdb.io/1dqVd2n

20140503_US_Twitter_GEOnomastics_vF

Twitter is an interesting source because about 3 per cent of Twitter accounts opt in to show their Tweet location (using GPS from a smartphone) and can be visualised on a map.

Our method of anthroponomical classification can be summarized as follows: judging from the Twitter name only and the publicly available list of all ~150k Olympic athletes since 1896, for which team (France, Spain or Germany) would the person most likely run?
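
The ‘which team would this person run for?’ question can be illustrated with a toy classifier. The sketch below is not NamSor’s algorithm: it assumes a tiny, hypothetical reference list in place of the Olympic athletes list and scores each team by the smoothed frequency of the name’s tokens, in the spirit of a naive Bayes model.

```python
import math
from collections import Counter, defaultdict

# Hypothetical reference list of (given name, surname, team); a stand-in for
# the list of Olympic athletes mentioned in the text.
REFERENCE = [
    ("Pierre", "Martin", "FRA"), ("Marie", "Dubois", "FRA"), ("Jean", "Martin", "FRA"),
    ("Carlos", "Garcia", "ESP"), ("Maria", "Lopez", "ESP"), ("Jose", "Garcia", "ESP"),
    ("Hans", "Schmidt", "GER"), ("Anna", "Muller", "GER"), ("Karl", "Schmidt", "GER"),
]

counts = defaultdict(Counter)        # team -> counts of name tokens
for given, surname, team in REFERENCE:
    counts[team][given.lower()] += 1
    counts[team][surname.lower()] += 1

vocabulary = {token for c in counts.values() for token in c}

def most_likely_team(full_name):
    """Pick the team with the highest smoothed log-likelihood of the name tokens."""
    tokens = full_name.lower().split()
    best_team, best_score = None, -math.inf
    for team, c in counts.items():
        total = sum(c.values())
        # Add-one smoothing so an unseen token does not rule a team out.
        score = sum(math.log((c[t] + 1) / (total + len(vocabulary))) for t in tokens)
        if score > best_score:
            best_team, best_score = team, score
    return best_team

print(most_likely_team("Maria Garcia"))
```

With a name that appears nowhere in the reference list, all teams tie and the first one is returned, so a realistic version would need far more data and some confidence threshold before assigning a class.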

NamSor Applied Onomastics is a European vendor of name recognition software (NamSor sorts names), which aims to help understand international flows of money, ideas and people. namsor.com

Filed under EthnoViz

Revealing the Irish, French, Indonesian digital diasporas

An Irish Times @GenerationEmigration reblog

NamSor technology has mapped the location of Irish-owned Twitter accounts around the world.
To access the interactive map, click here: http://cdb.io/1h8kTDG

20140101_TwitterGEOnomastics_IrishTimes_1

Elian Carsenat and Michel Fortin

Before Christmas, we came to Ireland to present NamSor, a piece of name recognition software which uncovered the Irish ‘digital diaspora’ for the first time. This interactive world map of the Irish, French and Indonesian e-Diasporas was produced using Twitter account data.

Twitter is an interesting data source because about 3 per cent of Twitter accounts opt in to show their Tweet location (using GPS from a smartphone) and can be visualised on a map. We were interested to visualise the Irish digital diaspora, not just in the US and the UK, but globally. Our assumption was that the Irish themselves are familiar with the history and sociology of the Irish diaspora in the US and the UK (and organisations such as IDA Ireland and Tourism Ireland have been successful in leveraging those), but what about Latin America, Eastern Europe, the Middle East and Asia? It is interesting to see how large and dispersed the Irish diaspora is in the US, and how small and concentrated it is in populous Indonesia.

20140101_TwitterGEOnomastics_IrishTimes_2

The scientific jargon for this special data mining is applied onomastics. We’ve worked with many different databases before, using onomastics for a specific purpose. For example, to help the Lithuanian Investment Promotion Agency understand the sociology of its diaspora and attract foreign direct investments (FDI), we’ve data mined Factiva C&E, a large database of company directors worldwide. We’ve also analysed PubMed, a scientific database used by doctors and biotechnology researchers, to recognise where international talent flows in that competitive field.

We spent a lot of time in Dublin with Kingsley Aikins, chief executive of Diaspora Matters, who is well known internationally in the diaspora field and has worked with many other countries as well as Ireland.

He believes the product could be a real game changer in the diaspora field and could help answer the perennial question all countries ask about their diasporas: who are they, where are they and what are they doing? He believes that we now live in a networked age and that the key to successful diaspora engagement lies in building global networks. NamSor will help find these people and enable new diaspora networks to be developed.

He also referred to the emerging global war for talent and how diasporas are going to be critically important sources of talent. Countries that know and keep in touch with their diasporas will have a competitive advantage. This will apply not only to those wishing to return to their home country but also to those wishing to be involved and help with DDI (Diaspora Direct Investment). Malaysia, Vietnam and Indonesia have already introduced initiatives in these areas.

We were impressed with the success of the Gathering, bringing several hundred thousand people to Ireland. This is an innovative initiative and must have strengthened the bonds between the Irish diaspora and Ireland.

There may be no such thing as a ‘French diaspora’, but we see more and more French people going abroad, especially the young and talented seeking an international experience. We’ve seen a lot of them in Dublin! Our impression is that the French abroad don’t really know or help each other as effectively as in other cultures, such as the Irish. French diplomats, large companies, entrepreneurs established abroad, exporting SMEs, professors and students all seem to live in separate worlds. France could learn a lot from what Ireland is doing.

NamSor Applied Onomastics is a European vendor of name recognition software (NamSor sorts names), which aims to help understand international flows of money, ideas and people. namsor.com

Diaspora Matters is a consultancy company based in Dublin advising governments, companies, organisations and individuals on how to develop strategies and programmes to connect with their Diasporas. diasporamatters.com

This article was inspired by an original article published in onomastics.co.uk

Filed under EthnoViz, General

Making sense of Big Data : mining Twitter names

Millions of geotagged tweets in various languages, discussing anything from ‘hey, I’m here’ to finance, geopolitics or marketing. How do you make sense of them?

We’ve used name recognition (applied onomastics) to filter information and produce unique maps of the e-Diasporas. Where are the digitally connected Italians, Turks and Russians today? They may be migrants, tourists, business travellers, students, visiting scientists…

To jump directly to the interactive map, click here : http://cdb.io/1iSeWw2 or read more about our methodology.

Italian, Russian, Turkish Twitter

TIP : Filter out layers and zoom in/out.
Below we filtered out the Turkish Twitter layer to visualize where Russian and Italian tourists go on holiday in Turkey.

Russian, Italians in Turkey

The Italian America :

Italian America

Filed under EthnoViz

Cap Digital Meetup : What Tourism in 5 years?

On October 23, the French IT Cluster CapDigital organized the first Digital Tourism Meetup and on that same day Paris City Council launched the Welcome City Lab, an initiative to incubate tourism startups.

We presented a new offering for Tourism Promotion Agencies, to analyse tourism flows using name recognition and big data. Read our detailed example on recognizing Irish, Russian and Swedish digital Diasporas on Twitter.

Irish Twitter GEOnomastics

France is among the top destinations for Tourism, yet challenged by other destinations. Can innovation reverse the current trend?

AtoutFrance would know. Unfortunately, nobody from AtoutFrance (the French Tourism Promotion Agency) or from the CNT (the governmental advisory body for Tourism) attended this event.

You will find our presentation here : 20131023_NamSor_eTourism_vFf.pdf

France Dubai Tourism 1/2

France Dubai Tourism 2/2

Filed under General

Onomastics and international event marketing : post hoc analysis of OECD Forum 2013

Choosing the right location to organize international exhibitions and events is critical to reach your targeted audience. For open events, choosing a particular location will generally induce an undesired local bias, which you can compensate for by proactively trying to attract the right mix of international participants.

Best practices to attract international visitors

To do that, onomastics (name recognition) can be a useful communication tool to leverage the right channels. It is also a great data mining tool to analyse, after the event, the international mix of participants. A company’s nationality, or a participant’s nationality or residence, is often misleading information. For example, let us look at the participants list of the OECD Forum 2013, which took place on 28-29 May at the OECD headquarters in Paris and attracted about 1500 visitors.

Firstly, attributing a nationality to companies is becoming increasingly difficult. ‘The Parliamentary Network on the World Bank & IMF’ is clearly international. ‘NetworkIrlande’ is based in Paris but with a clear Irish touch. What about ‘Ernst & Young’? It could be classified as international, if you attracted a partner in charge of a global business unit. But it could also be classified as local if you attracted a local consultant hoping to network with his clients in Paris.

Secondly, considering only the Country of Residence leads to overestimating the local bias. Give the chart below a casual glance and you will get the impression that the OECD, being in Paris, struggles to attract visitors from outside of France.

20130621 OECDForum2013 by Country Of Residence

That’s not a fair view of reality. For example, the French residents include a large number of international diplomats based in France as well as international researchers at the OECD itself.

Thirdly, looking at people’s names provides additional information though it remains imperfect. It partially corrects the local bias. In the chart below, the OECD Forum looks indeed more international.

20130621 OECDForum2013 by Onomastic Class

Our method of anthroponomical classification can be summarized as follows: if an OECD participant were to become an Olympic athlete (after a bit of training of course), judging from his name only and the publicly available list of all ~150k Olympic athletes since 1896, for which team would he most likely run?

This information is also imperfect. In this chart ‘France’ means French, as in French name. Many names from, say, Québec, Luxembourg, Belgium, etc. will be misclassified. The United States does not appear in the chart but ‘Great Britain’ and ‘Ireland’ will include names from North America.

Combining both pieces of information (residency and onomastics), we can drill down into the demographics of French residents and further reduce our false impression of a disproportionate local bias.
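
Combining the two attributes amounts to a simple cross-tabulation. The sketch below uses a handful of hypothetical participant records (not the OECD dataset) to show how the share of ‘French residents’ can differ from the share of participants who are both resident in France and classified as French by name.

```python
from collections import Counter

# Hypothetical participant records: (country of residence, onomastic class).
participants = [
    ("France", "France"), ("France", "France"), ("France", "Great Britain"),
    ("France", "Germany"), ("France", "Japan"), ("Germany", "Germany"),
    ("United Kingdom", "Great Britain"), ("Japan", "Japan"),
]

# Count each (residence, onomastic class) combination.
crosstab = Counter(participants)

residents = sum(1 for residence, _ in participants if residence == "France")
local = crosstab[("France", "France")]   # French residents with a French name

print(f"share of French residents:         {residents / len(participants):.0%}")
print(f"resident and onomastically French: {local / len(participants):.0%}")
```

In this toy data the second share is much smaller than the first, which mirrors the point made about diplomats and international researchers residing in France.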

20130621 OECDForum2013 French Residends by Onomastic Class

We finally get a more realistic picture of the proportion of local visitors (though it may still include a few people from Belgium, Luxembourg, Québec, etc.). Now, how can we measure the local bias? The question is: how many Frenchmen would normally be present at the OECD Forum if it took place in a neighbouring country? To answer that, we could use other international events as a benchmark. Our conclusion: there is a bias, but it is reasonably small.

20130621 OECDForum2013 localBias

Good news, the OECD Forum is a truly international event: no need to move it to London, Geneva or Berlin !

Get the dataset here OECD_FORUM_2013_PARTICIPANTS.zip

Filed under EthnoViz

IPA innovates to originate FDI deals using onomastics

NamSor™ announces FDI Magnet, a new offering for Investment Promotion Agencies.

NamSor™ name recognition software filters data from millions of meaningless elements to a few dozen actionable names. Domas Girtavicius, a Senior consultant at Invest Lithuania, said “we were impressed by the accuracy of the name recognition software: it reliably predicts the country of origin and the number of false positives is fully manageable”. Elian Carsenat, the founder of NamSor™, said “searching for names in the Big Data is like seeking a gold needle in a haystack: doable once the right tool exists”.

What is the idea behind it? “As recently as 1986 Ireland was one of the poorest countries in the European Union (EU), but today it is one of the richest. The engine of this new Irish prosperity has been Foreign Direct Investment (FDI). [Between 1986 and 2002], the Irish have done almost everything right. They have attracted huge amounts of money from America – due largely to a century of personal and familial ties – and they have used this money to build factories”[i].

It is a successful approach, which Milda Darguzaite, the Managing Director of Invest Lithuania, considers relevant for her own country. With three million people living in Lithuania and nearly one million people of Lithuanian origin living abroad, there are a good many personal and familial ties to be leveraged to attract new investment projects to the country. NamSor name recognition software helped discover those ties.

Recognizing names and their origin in global professional databases allows Investment Promotion Agencies to identify potentially interesting high profile contacts in different countries / industrial sectors and reach out to them. Another method to accelerate the origination of new leads is to better understand and leverage the existing network of foreign businessmen in the country itself.

May 2013

About NamSor
NamSor™ Applied Onomastics[ii] is a European vendor of Name Recognition Software. It also offers consulting services to help Investment Promotion Agencies and countries reconnect with their business communities abroad.

About FDIMagnet

FDIMagnet is NamSor™’s offering for Investment Promotion. We use our unique data mining software to offer differentiated Foreign Direct Investment (FDI) services:

–   Diaspora Direct Investments (DDI)

–   Smart Investors Targeting & CRM

–   FDI Targeted Communication


[i] U.S. Foreign Direct Investment in Ireland: Making the Most of Other People’s Money, Rebekah Berry (2002)

[ii] Onomastics (or onomatology) is the science of proper names. NamSor and NomTri are registered trademarks.

PDF version : 201305 InvestLithuania.pdf

Filed under FDI Magnet, General