Tag Archives: Chinese names

What’s in a scientist name? Applying onomastics in scientometrics: the case of Cancer Research

The IREG Observatory on Academic Ranking and Excellence is an international institutional non-profit association of ranking organizations, universities and other bodies interested in university rankings and academic excellence.

Our friend Tania Vichnevskaia of the French National Institute for Health (INSERM) presented the following paper ‘Applying onomastics to scientometrics’ on Monday at IREG International symposium organised by University of Maribor and Shanghai Jiao Tong University.

Download PDF 20150119_IREG2015_INSERM_NamSor_vF.pdf

On this same topic:

About NamSor

NamSor™ Applied Onomastics is a European vendor of Name Recognition Software (NamSor sorts names). NamSor mission is to help understand international flows of money, ideas and people.

NamSor launched FDI Magnet,  a consulting offering to help Investment Promotion Agencies and High-Tech Clusters leverage a Diaspora to connect with business and scientific communities abroad.

Filed under FDI Magnet, General

Onomastics to measure cultural bias in medical research

Elian CARSENAT, NamSor Applied Onomastics

Dr. Evgeny Shokhenmayer, e-onomastics


This project involves the analysis of over ten million medical research articles from PubMed and over three million names of scientists, whether article authors or authors mentioned in citations. We propose to evaluate the correlation between the onomastic class of the article authors and that of the citation authors. We will demonstrate that the cultural bias exists and also that it evolves in time. Between 2007 and 2008, the ratio of articles authored by Chinese scientists (or scientists with Chinese names) nearly tripled. We will evaluate how fast this surge in Chinese research material (or research material produced by scientists of Chinese origin) became cross-referenced by other authors with Chinese or non-Chinese names. We hope to find that onomastics provides a good enough estimation of the cultural bias of a research community. The findings can improve the efficiency of a particular research community, for the benefit of Science and the whole of humanity.

This paper was prepared for ICOS2014, the 25th International Congress of Onomastic Sciences, the premier conference in the field of name studies.


PubMed/PMC is a large collection of scientific publications in the Life Sciences. We used the 2013 data dump for data mining, with 14 million articles and 3.3 million author names. Some of the names are duplicates due to different spellings, inconsistent use of initials and other data quality issues.

We used NamSor software to allocate an onomastic class to each author name. NamSor software was initially designed to analyse big data in the fields of economic development[1], business and marketing. The method for anthroponomical classification can be summarized as follows: judging from the name only and the publicly available list of all ~150k Olympic athletes since 1896 (and other similar lists of names), for which national team would the person most likely run? Here, the United States is typically considered as a melting pot of other ‘cultural origins’ (Ireland, Germany, etc.) and not as an onomastic class of its own.

The breakdown of author names by onomastic classes is represented below :


The largest groups of unique names in PubMed are British, French, German, Italian, Indian, Spanish, Dutch, etc.

An author with a French name might have a name from Brittany, Corsica or Limousin … or he might have a Canadian French name, or a Belgian French name. Or he might be an American professor with French ancestry.

Scientists’ performance is often measured by the number of publications and the number of times a publication is cited by other publications (bibliometric rankings).

The table below shows the number of publications and the number of citations, by onomastic classes (top 20), as well as the ratio between the two metrics:

Onomastic class   Authored (A)   Cited (C)   Ratio (C/A)
(GB,LATIN)             557,177   1,664,415           3.0
(FR,LATIN)             272,150     743,471           2.7
(DE,LATIN)             192,778     448,103           2.3
(JP,LATIN)             172,866     361,682           2.1
(IT,LATIN)             187,564     323,771           1.7
(IE,LATIN)              86,161     422,103           4.9
(NL,LATIN)             102,982     321,787           3.1
(AT,LATIN)              78,199     339,819           4.3
(CN,LATIN)*            219,040     186,464           0.9
(IN,LATIN)             153,555     221,332           1.4
(ES,LATIN)             113,407     228,650           2.0
(PL,LATIN)              47,961     268,115           5.6
(SE,LATIN)              65,717     237,017           3.6
(FI,LATIN)              35,533     247,231           7.0
(KR,LATIN)             146,444     105,605           0.7
(TW,LATIN)*             88,822     162,132           1.8
(GR,LATIN)              51,564     196,056           3.8
(DK,LATIN)              42,403     181,199           4.3
(BE,LATIN)              44,647     162,146           3.6
(CH,LATIN)              32,295     162,495           5.0
*CN+TW                 307,862     348,596           1.1

This table tells us that scientists with British names have published 557 thousand articles in PubMed and have been cited nearly 1.7 million times in other PubMed articles: the ratio is 3.

Articles written by authors with Italian names have been relatively less cited (with a ratio of 1.7) while the articles written by authors with Irish names or Finnish names have been more cited (with ratios respectively 4.9 and 7).
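As a quick check of how the ratio column is obtained, here is a minimal Python sketch that recomputes C/A for a few rows of the table above. The dictionary layout and the function name are ours, chosen for illustration; only the counts come from the table.

```python
# Authored (A) and cited (C) counts copied from the table above.
counts = {
    "GB": (557_177, 1_664_415),
    "IT": (187_564, 323_771),
    "IE": (86_161, 422_103),
    "FI": (35_533, 247_231),
}

def citation_ratio(authored: int, cited: int) -> float:
    """Citations received per authored publication (C/A)."""
    return cited / authored

# Round to one decimal, as in the table.
ratios = {onoma: round(citation_ratio(a, c), 1) for onoma, (a, c) in counts.items()}
print(ratios)
```

The rounded values match the table: 3.0 for British names, 1.7 for Italian names, 4.9 for Irish names and 7.0 for Finnish names.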

We cannot draw conclusions about the overall performance of British, Italian or Finnish scientists (many of them might be American scientists), but we can already observe interesting cultural biases emerging that cannot be explained by the imprecision of onomastic classification alone. They raise interesting questions:

– can linguistic mastery of the English language explain why authors with British or Irish names have more citations?

– can features of a particular culture (ex. the Irish are excellent networkers and have great pubs) explain why scientific articles are more cited?

– do scientists with Italian names tend to cite more scientists with Foreign sounding names (English, Irish, etc.)?

– do scientists with Finnish names tend to cite more scientists with Finnish names?

– are there additional cultural biases in the publication process itself (selection, curation, promotion of scientific publications)?

– is there a gender bias worth noting (ex. male scientists are more cited; a culture with fewer female scientists would get a higher ratio)?

Altogether, scientists with Chinese names (names from mainland China or Taiwan) have produced 307 thousand articles and been cited 348 thousand times: a ratio of 1.1, in the low range. The rest of this paper focuses on Chinese names: publications authored by a scientist with a Chinese name, or citations of scientists with Chinese names.

Scientists with Chinese names in PubMed

Globally, the number of publications in life sciences has been growing exponentially. Many countries and institutions encourage scientists to publish and link performance to bibliometric rankings (ie. publications in reputable journals, number of citations, etc.)


From this chart, we can observe,

– that the absolute number of publications authored by scientists with a Chinese name has nearly tripled between 2007 and 2008 (x2.5, from 7k to 17k);

– that the relative share of publications authored by scientists with a Chinese name (compared to other onomastic classes) is also growing steadily.

This growth in the number of publications by authors with Chinese names, in absolute and relative terms, is matched by a drop in the ratio of citation/authorship :


Year     Authored (A)   Cited (C)   Ratio (C/A)
2012            81326       68038           0.8
2011            52396       42371           0.8
2010            33821       49260           1.5
2009            24726       35715           1.4
2008            17258       26321           1.5
2007             6944       17234           2.5
2006             4770       11299           2.4
2005             3260        6910           2.1
2004             1830        3782           2.1
2003             1195        2211           1.9
2002              849        1436           1.7
Before           3477        3823           1.1
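The x2.5 jump between 2007 and 2008 and the falling citation/authorship ratio can both be read off the yearly table. The sketch below recomputes them for a few rows; the variable names are illustrative and only the counts are taken from the table.

```python
# (year, authored, cited) rows copied from the yearly table above.
rows = [
    (2006, 4770, 11299),
    (2007, 6944, 17234),
    (2008, 17258, 26321),
    (2012, 81326, 68038),
]
by_year = {year: (authored, cited) for year, authored, cited in rows}

# Growth in authored publications from 2007 to 2008.
growth_2007_2008 = by_year[2008][0] / by_year[2007][0]

# Citation/authorship ratio per year, rounded as in the table.
ratio_by_year = {year: round(c / a, 1) for year, (a, c) in by_year.items()}

print(round(growth_2007_2008, 1), ratio_by_year)
```

Authorship grows by a factor of about 2.5 in one year, while the ratio falls from 2.5 in 2007 to 0.8 in 2012.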

Next, we will look at co-authorships. We expect co-authorships to be more frequent within the same onomastic class, because of the correlation with geography: scientists with an Italian name might live in Italy, work in the same university on a research project and publish the results of their research together. We also expect to find diversity: many publications are the result of international cooperation; scientists are internationally mobile; last but not least, countries like the US and Switzerland attract talent from everywhere and, as a result of this global ‘brain drain’, produce very international research teams.

Both aspects, affinity and diversity, are reflected in the following matrix – displaying the number of co-authorships between onomastic classes:


For example, the first column of the matrix (reflected in the pie chart below) shows that scientists with British names have a strong affinity for co-authoring with other scientists with British names, but also that they are likely to publish (in order) with scientists with French names, German names, Irish names, Italian names etc.


Scientists with Chinese names have an even stronger affinity for co-authoring with other scientists with Chinese names; they are likely to publish (in order) with scientists with British names, French names, German names, Italian names, Irish names, Korean names etc.


Next, we will look at citations. In a perfect world, we expect citations to be made based on the merits of scientific research only. We assume some ‘invisible hand’ will self-regulate the visibility of publications among research communities, so that all relevant research is known by the experts of the field. If scientific excellence is equally distributed, we expect the number of publications citing authors of a particular onomastic class to be proportional to the number of authors of that particular onomastic class. However, the following table tells a different story.

Onomastic class   Onoma authored %   Onoma self-citations %   Bias factor
(GB,LATIN)                   16.6%                    17.0%          1.02
(FR,LATIN)                    8.1%                     7.6%          0.94
(IT,LATIN)                    5.6%                     3.8%          0.68
(DE,LATIN)                    5.8%                     6.1%          1.05
(CN+TW,LATIN)                 9.2%                    12.1%          1.32
(ES,LATIN)                    3.4%                     3.8%          1.13
(JP,LATIN)                    5.2%                    19.3%          3.73
(IE,LATIN)                    2.6%                     4.4%          1.73
(NL,LATIN)                    3.1%                     5.6%          1.83
(AT,LATIN)                    2.3%                     4.2%          1.79
(SE,LATIN)                    2.0%                     3.5%          1.76
(IN,LATIN)                    4.6%                     4.1%          0.89
(PT,LATIN)                    1.9%                     2.3%          1.17
(GR,LATIN)                    1.5%                     2.8%          1.82
(KR,LATIN)                    4.4%                     3.0%          0.68
(BE,LATIN)                    1.3%                     2.6%          1.98
(DK,LATIN)                    1.3%                     3.4%          2.65

In this table, we observe that authors with British names represent 16.6% of publications, but 17% of their citations : a bias factor of 1.02 (almost no bias). Conversely, we observe that authors with French names represent 8.1% of publications, but only 7.6% of their citations : a bias factor of 0.94 indicating that authors with French names tend to cite authors with foreign names more.

As for authors with Chinese names, they represent 9.2% of the publications, but 12.1% of their citations : a bias factor of 1.32 indicating that they tend to cite authors with Chinese names more.
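The bias factor used here is simply the share of same-class citations divided by the share of publications. A minimal sketch, using the rounded percentages from the table (so other rows may differ from the published factors in the last digit); the function and variable names are ours:

```python
def bias_factor(citation_share: float, authored_share: float) -> float:
    """Share of citations going to a class, relative to its share of publications.

    A value of 1.0 means no bias; above 1.0 the class is cited more than its
    share of publications would suggest, below 1.0 it is cited less.
    """
    return citation_share / authored_share

# (authored %, self-citations %) copied from the table above.
shares = {"GB": (16.6, 17.0), "FR": (8.1, 7.6), "CN+TW": (9.2, 12.1)}

self_bias = {
    onoma: round(bias_factor(cited, authored), 2)
    for onoma, (authored, cited) in shares.items()
}
print(self_bias)
```

For these three rows the recomputed factors are 1.02, 0.94 and 1.32, as in the table.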

Authors with Chinese names have a positive bias in citing authors with Chinese names. However, we can see other cases where the bias is even stronger: authors with Japanese names citing authors with Japanese names, authors with Danish names…

More interestingly, the following table shows that, apart from authors with a Chinese name, every other onomastic class (British, French, Italian, German etc.) has a negative bias towards citing authors with a Chinese name.

Onomastic class   Chinese onoma citation %   Bias factor
(GB,LATIN)                            3.9%          0.43
(FR,LATIN)                            3.9%          0.42
(IT,LATIN)                            3.9%          0.43
(DE,LATIN)                            4.1%          0.44
(CN+TW,LATIN)                        12.1%          1.32
(ES,LATIN)                            4.0%          0.43
(JP,LATIN)                            5.2%          0.56
(IE,LATIN)                            4.0%          0.44
(NL,LATIN)                            3.5%          0.38
(AT,LATIN)                            4.1%          0.44
(SE,LATIN)                            3.6%          0.40
(IN,LATIN)                            5.9%          0.65
(PT,LATIN)                            4.0%          0.43
(GR,LATIN)                            3.9%          0.42
(KR,LATIN)                            6.8%          0.74
(BE,LATIN)                            3.8%          0.42
(DK,LATIN)                            3.9%          0.42

Authors with a Chinese name tend to cite authors with a Chinese name more. Comparatively, scientists with non-Chinese names (British, French, Italian, German etc.) have a bias factor of 0.46, which makes them about 3 times less likely to cite publications authored by a scientist with a Chinese name.
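The "3 times" figure is the ratio between the two bias factors. The sketch below reproduces it from the table, under one loud assumption: we take an unweighted mean of the non-Chinese bias factors, whereas the 0.46 quoted in the text may well be weighted by publication volume.

```python
# Bias factors towards citing Chinese names, copied from the table above
# (all classes except CN+TW).
non_chinese_bias = [
    0.43, 0.42, 0.43, 0.44, 0.43, 0.56, 0.44, 0.38,
    0.44, 0.40, 0.65, 0.43, 0.42, 0.74, 0.42, 0.42,
]
chinese_self_bias = 1.32

# Assumption: a simple unweighted mean across onomastic classes.
mean_non_chinese = sum(non_chinese_bias) / len(non_chinese_bias)

# How many times less likely a non-Chinese-named author is to cite a
# Chinese-named author, compared with a Chinese-named author.
times_less_likely = chinese_self_bias / mean_non_chinese

print(f"{mean_non_chinese:.2f} {times_less_likely:.1f}")
```

The unweighted mean lands between 0.46 and 0.47 and the ratio is a little under 3, consistent with the figures quoted above.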

We will now see how the bias factors evolved between 2002 and 2012:


According to this table, the positive bias factor of authors with Chinese names in citing other authors with Chinese names remains roughly stable. On the other hand, the negative bias factor of scientists with non-Chinese names in citing authors with Chinese names is generally increasing.

Manual controls

Given the large number of names automatically classified in a taxonomy based on geographic origin (China, etc.), we could not manually verify the entire database. We manually verified two randomly selected subsets:

– firstly, a list of 1280 names recognized by the software as Chinese names;

– secondly, a list of ~10000 names classified by the software into the full taxonomy (over 100 onomastic classes, corresponding to different countries of origin)

According to the first validation method, 83% of names the software recognized as Chinese were manually verified as Chinese; 2% unknown; 15% as non-Chinese (ie. mis-classifications).

The software outputs a confidence level. 76% of the names were classified with positive confidence. For the names recognized as Chinese with a positive confidence, 94% were manually verified as Chinese; 1% unknown; 4% as non-Chinese (ie. mis-classification).


In PubMed, many names do not have a full first name, only initials.

For names classified with positive confidence, we found that first names of just one or two characters (ex. J or JH) accounted for 90% of mis-classifications. When the input includes a full name (as would generally be the case with other bibliometric sources such as Thomson WoS, Scopus or ORCID) the accuracy is 99%.


According to the second validation method, we can calculate the usual metrics used in classification : precision and recall.

10172 names were classified independently by a manual operator. In this method, errors could be made by the computer and also by the manual operator.

For the calculations below, we assume the manual operator made no mistakes (this is not the case: to err is human). The manual operator could classify 50% of the names and left the rest as ‘Not Sure’.

For Chinese and non-Chinese names, the software precision was respectively 81% and 97% and the recall was 59% and 99%. For Chinese names classified by the software with positive confidence (52% of all names), the precision was 93% and the recall was 69%. Excluding the names with first name length ≤ 2 (initials, such as J or JH), the precision was 97% and the recall was 72%.
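For readers less familiar with these metrics, here is a minimal sketch of how precision and recall are computed from counts of true positives, false positives and false negatives. The counts below are hypothetical, chosen only so that the result matches the 81% / 59% reported for Chinese names; they are not the study's actual counts.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall for one class.

    tp: names the software labelled Chinese that the operator also labelled Chinese
    fp: names the software labelled Chinese that the operator labelled otherwise
    fn: names the operator labelled Chinese that the software missed
    """
    precision = tp / (tp + fp)  # how many of the software's "Chinese" labels are right
    recall = tp / (tp + fn)     # how many of the truly Chinese names were found
    return precision, recall

# Hypothetical counts for illustration only.
precision, recall = precision_recall(tp=81, fp=19, fn=56)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

Swapping the roles of computer and operator, as in the comparison that follows, amounts to swapping fp and fn, which is why the operator's precision equals the computer's recall.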

If conversely, we assume that the computer made no mistakes, then we can compare the precision and recall of the operator with that of the computer:

All Names
            Chinese Names        Non Chinese Names
            Computer   Human     Computer   Human
Precision   81%        59%       97%        99%
Recall      59%        42%       99%        48%

Confidence>0
            Chinese Names        Non Chinese Names
            Computer   Human     Computer   Human
Precision   93%        69%       96%        99%
Recall      69%        49%       99%        48%

Confidence>0 && Len(firstName)>2
            Chinese Names        Non Chinese Names
            Computer   Human     Computer   Human
Precision   97%        72%       96%        100%
Recall      72%        51%       100%       48%

This method of cross validation between computer and human could be improved by having several manual checks by different operators to obtain a good validation sample.

Future work

For future work, we would data mine the large commercial bibliographic databases (Thomson WoS, Scopus and possibly ORCID) because they offer better data quality and useful additional information:

– firstly, they have the full name in addition to the short name cited with just initials; this significantly reduces the error rate of onomastic classification

– secondly, they link scientists to research institutions (affiliations) and geographies (country of affiliation) ; this allows additional analysis on the topic of Diasporas and brain drain, comparing -for example- the research output of Chinese / Chinese American scientists in the US with that of scientists of Mainland China;

– thirdly, those databases have a larger coverage in terms of scientific disciplines, allowing comparison between different fields of research.


Significant cultural biases exist, not only in the way scientists co-author publications together, but also in the way they make citations. Scientific publications authored by scientists with Chinese names are three times less cited by the international research community than they are cited by other scientists with Chinese names. We cannot draw conclusions about the quality of Chinese research, but we can challenge the commonly accepted idea that the volume of publications and citations alone indicates that China is becoming a superpower in Science and Technology.

Given the importance of bibliometric rankings in the way countries build and monitor public policies on Science and Education or international cooperation; in the way research institutions measure and reward scientific excellence of researchers and teams,  those biases should be accounted for. Otherwise, international comparisons are not ‘scientific’, not fair and can lead to wrong decisions.

[PDF 2014_ICOS_NamSor_paper_vF.pdf] [Pitch 20140828_ICOS2014_Pitch_vF.pdf]

[1] Onomastics and Big Data Mining, ParisTech Review 2013, arXiv:1310.6311 [cs.CY]

Source Data


Filed under FDI Magnet, General

Is China really becoming a science and technology superpower?

[read the FULL PAPER : ‘Measuring cultural biases in medical research‘] [in French]

Every now and then we hear stories about the making of China as a scientific superpower. How it overtook France in the global University Rankings:

‘In the 2014 edition of the world 500 top research universities, China keeps progressing. China has 9 universities among the top 200 (77 in the USA, 20 in Great-Britain, 14 in Germany, 8 in France). The trend is impressive: in 2004, China had only one University in the top 200.’ Source: LesEchos.fr

How it keeps increasing its volume of scientific publications produced:

‘The increased share of scientific publications produced by China (+231% between 2002 and 2012) is another indicator of the Chinese scientific growth. According to Ghislaine Filliatreau [of the French OST], it’s not just an increase in volume but also in quality.’ LesEchos.fr

Is China really becoming a scientific superpower? It may be so. We should however be careful how we interpret bibliometric information (volume and quality of scientific publications). There could be huge cultural biases currently unaccounted for, impacting international bibliometric rankings.

For example, let’s look at Scimago Journal and Country Ranking, an index based on the Scopus® database (Elsevier B.V.). China is already the second country in the world according to the number of scientific publications produced between 1996 and 2003. But if we consider the number of citations excluding self-citations, then China comes after Spain. The reason is that the ratio of citations per citable document (excluding self-citations) is lower than average.

Scimago Country Rankings

Next week, at ICOS2014 (the 25th International Congress of Onomastic Sciences and premier conference in the field of name studies), we will explore some of the cultural biases at play in LifeSciences. A presentation of PubMed (MedLine/PMC) data mining using NamSor software, conducted with onomast Eugène Schochenmayer, will take place at Glasgow University on the 28th of August.

Onomastics to measure cultural bias in medical research (ABSTRACT)

This project involves the analysis of about one million medical research articles from PubMed. We propose to evaluate the correlation between the onomastic class of the article authors and that of the citation authors. We will demonstrate that the cultural bias exists and also that it evolves in time. Between 2007 and 2008, the ratio of articles authored by Chinese scientists (or scientists with Chinese names) nearly tripled. We will evaluate how fast this surge in Chinese research material (or research material produced by scientists of Chinese origin) became cross-referenced by other authors with Chinese or non-Chinese names. We hope to find that the onomastics provide a good enough estimation of the cultural bias of a research community. The findings can improve the efficiency of a particular research community, for the benefit of Science and the whole humanity.

Some of the tools we’ve used to produce this research:

  • MonetDB, the open-source column-store pioneer; due to the multiplicative aspect of some queries (ex. counting articles authored by a scientist with a Chinese name, cited by a scientist with -say- an Italian name) the volume was huge and we couldn’t do with a classic database
  • RapidMiner, a leading open-source data mining and predictive analytics software
  • Our own RapidMiner Onomastics Extension, to predict the gender and likely origin of personal names
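To illustrate why such queries are multiplicative, here is a small plain-Python sketch of the cross-class citation count (articles authored by one onomastic class, cited by another). It is not the MonetDB query used in the study; the toy data, identifiers and function name are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data (not from the study):
# article id -> onomastic classes of its authors
authors = {
    "a1": ["CN", "CN"],
    "a2": ["IT"],
    "a3": ["GB", "CN"],
}
# (citing article, cited article) pairs
citations = [("a2", "a1"), ("a3", "a1"), ("a3", "a2")]

def cross_class_counts(authors, citations):
    """Count (citing class, cited class) pairs, once per citation link."""
    counts = Counter()
    for citing, cited in citations:
        # Every class on the citing side pairs with every class on the
        # cited side; this product is what makes the volume grow so fast.
        for pair in product(set(authors[citing]), set(authors[cited])):
            counts[pair] += 1
    return counts

counts = cross_class_counts(authors, citations)
print(counts[("IT", "CN")], counts[("GB", "CN")])
```

With millions of articles, each with several authors and dozens of citations, the number of intermediate pairs is far larger than the number of articles, which is where a column store helps.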

About Evgeny Shokhenmayer

Dr. Evgeny Shokhenmayer (MoDyCo, Paris 10) is the editor of e-Onomastics, an online blog focused on onomastics research and publications.

Filed under FDI Magnet, General

Chinese Name Gender Guesser API

[Read also : Is China really becoming a science and technology superpower?] [What’s in a Scientist Name?]

NamSor Gender API, the free online service to predict the gender of a personal name, now supports Chinese names.

The need for such a service is not immediately obvious, since most Internet companies in China have access to PRC’s National identification number database, with an odd/even number indicating respectively Male/Female with 100% accuracy.

The API is in the form api/gendre/firstName/lastName/countryIso2 and returns a value in the range -1 (Male) to +1 (Female). Let’s try with 周声涛 and 张淑珍:

The API also works with traditional characters used in Taiwan, for example 張淑珍.
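As a rough sketch of how a client might use this URL form, the Python below builds the request path and interprets the returned score. Only the path pattern and the -1..+1 range come from the description above; the helper names, the "cn" country code, the family-name-first split of 周声涛 and the zero threshold are our assumptions, and no host name is shown because none is given here.

```python
from urllib.parse import quote

def gender_path(first_name: str, last_name: str, country_iso2: str) -> str:
    """Build the relative path api/gendre/firstName/lastName/countryIso2.

    Each segment is percent-encoded so that Chinese characters are safe in a URL.
    """
    parts = (first_name, last_name, country_iso2)
    return "api/gendre/" + "/".join(quote(part, safe="") for part in parts)

def interpret(score: float, threshold: float = 0.0) -> str:
    """Map a score in [-1, +1] to a label: negative is male, positive is female.

    The threshold is a hypothetical knob for treating weak scores as unknown.
    """
    if score < -threshold:
        return "male"
    if score > threshold:
        return "female"
    return "unknown"

# Assumption: the family name comes first in the written name, so 周声涛
# splits into last name 周 and first name 声涛.
path = gender_path("声涛", "周", "cn")
print(path)
print(interpret(-0.8), interpret(0.9), interpret(0.0))
```

A caller would append this path to the service's base URL and pass the numeric response to `interpret`.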

To predict the likely gender of a Chinese name, the API was originally using Wudi’s Gender Guesser data file with rewritten code. It has now been refactored to benefit from NamSor socio-linguistics algorithm.

Other resources :

Reach us at: contact@namsor.com

Filed under General

Onomastics for Business Data Mining

This is a reblog of the original ParisTech Review article.

Can name data mining help economic development?

As of today, the main business application of onomastics is naming, or branding: finding the proper name for your company or your product to stand out in the world. Meaningfully, Onoma – the Greek root for name – is also a registered trademark of Nomen, the naming agency founded by Marcel Botton in 1981. Nomen initially licensed one of Roland Moreno’s inventions, the Radoteur name generator, and created many distinctive and global brand names such as Vinci, Clio or Amundi. But once your business has a name, should you forget about onomastics? Not anymore. Globalization, digitalization and Big Data open new fields for experimenting with disruptive applications in Sales & Marketing, Communication, HR and Risk Management. Though discriminating names carries a high risk of abuse, it can also drive new, unexpected ways of developing poor areas.

Our human brain interprets names every day, as we understand a language, as we know a particular culture or region of the world: the likely menu of a restaurant, the industrial sector of a company… even a dog’s name might tell you something about its owner. Personal names (first name, last name, a Twitter handle) carry meanings which vary according to one’s language and culture, but often form an essential part of one’s identity.

Extracting semantics from names

How exactly my brain works is not clear even to myself, but what if I could program a computer to extract semantics from names: would it provide valuable business intelligence? Some people in the US think so. The Central Intelligence Agency (CIA) has long-standing experience in extracting intelligence from personal names: back in the 80s they used LAS name recognition software to help identify Russian spies, recognize false identities and track Soviet influence. LAS could rely on the CIA to help collect a database with one billion names to calibrate the software. That’s roughly the total population of the developed world at the time.

After thriving on the surge in US security and foreign intelligence budgets post-9/11, LAS considered diversification and started to address other markets: Marketing, Financial Services Compliance (notably KYC, ie. Know Your Customer). LAS was acquired by IBM in 2006. But to further increase their leadership, in 2011 the US security agencies used the MITRE Corporation to help foster further “innovation in technologies of interest to the federal government. Challenge #1 entailed multicultural name matching—a technology that is a key component of identity matching, which involves measuring the similarity of database records referring to people. Uses include verifying eligibility for Social Security or medical benefits, identifying and reunifying families in disaster relief operations, vetting persons against a travel watch list, and merging or eliminating duplicate records in databases. Person name matching can also be used to improve the accuracy and speed of document searches, social network analysis, and other tasks in which the same person might be referred to by multiple versions or spellings of a name”.

A name tells more – or something different – than just a nationality of origin. For example, Boston terrorists Tamerlan and Dzhokhar Tsarnaev have names with a -v ending typical of Slavic names (as found in Russia or in Bulgaria) but can be recognized as originally from the Caucasus. There were some media reports in the aftermath of the bombing that the FBI didn’t know that the Boston bomber had travelled to the volatile Dagestan region in Russia in 2012 because “his name was misspelled on travel documents”. However, this information remained unconfirmed and is probably not accurate given the massive US investment in name-matching technology.

In Europe, the legal framework to leverage such tools varies from country to country, but is generally very strict. The directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, article 8, states that “Member States shall prohibit the processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, […]” . In principle, this directive applies to Security Agencies as well, however there are exemptions which member states can interpret differently.

By using the term ‘discriminatory ethnic profiling’, rather than the more common ‘ethnic profiling’, to describe the practice of basing law enforcement decisions solely or mainly on an individual’s race, ethnicity or religion, the European Union recognizes the need of security forces to understand the complex relationships that exist between nationality, geography and more subjective concepts such as ethnic origins, cultural backgrounds, civilisations and religions. How the knowledge might be applied and how the data might be collected remain matters of national security. The UK and France, for example, are known to have different views on this topic. In any case, what is done in practice by anti-terrorism agencies is not public information.

Security, border control, etc. is a business in its own right. What about other sectors?

Customer intelligence: business potential and ethical issues

In Sales & Marketing, onomastics can be used to enrich a customer database with information extracted from names that would not be practically or economically available otherwise. So retailers and luxury brands – especially in food, clothing and cosmetics where ethnicity plays a significant role – can improve customer intelligence and use those insights to better interact through online channels. Echoing the concern expressed in the early 20th century by John Wanamaker (“Half the money I spend on advertising is wasted; the trouble is I don’t know which half”), companies like L’Oreal that spend several billion dollars a year on communication and advertising continuously try to improve the efficiency of their targeting.

Let us look at a more sophisticated example: public–private partnership (PPP) projects in mining, energy or infrastructure. Those projects can have significant social impacts in a territory and raise various political or economic issues. Understanding the human geography and recognizing the interests of the communities cohabiting in that territory can be critical to obtain a buy-in from all stakeholders. Onomastics, combined with geo-demographic segmentation, can help rapidly build geographic maps that can be used both for decision making and communication purposes. Automatic name clustering is the underlying technology that will help decrypt the complex identities present in large or small territories (from a continent, to a road). The objective is to answer tough questions and manage unavoidable frustrations through appropriate communication. Where should a tramway line pass in a multi-ethnic region? How to redistribute offshore oil revenues in the lands?

Concerning HR, I recently spoke with an executive at a large European bank who regretted that not enough trustworthy expatriates had been sent to control a large acquisition in a BRIC country, costing several hundred million Euros in write-offs. Among thousands of employees at the European head-office, the bank could have recognized the names of a few people likely to accept an expatriation back to their home country. Having some people who knew both languages and both corporate cultures would have helped bridge the inter-cultural gap between the local management and other expatriates, saving millions of Euros.

In the digital world, onomastics brings a new angle to social graph analysis: it can help colourize online communities and profile opinion leaders according to their audience. On Twitter, for example, you can more easily create communication channels that are well targeted at a particular community (business expatriates, tourists, migrants, but also international investors…)

Let’s now consider a provocative and controversial use of onomastics that will help us move on to the topic of ethics. Different cultures, nationalities and social backgrounds imply different behaviours with respect to Money and Risk taking: earning, saving, spending, gambling, investing, donating, risking death and losing it all… It is a fact that people with aristocratic names (in places where there is such a thing as aristocratic names) would earn more and obtain cheaper credit than people with names typical of the lower class or a recent immigration wave. Why not take shortcuts: a bank could adjust the price of a credit, according to the borrower’s name; a car insurance company could adjust its evaluation of the risk (including the risks of insurance fraud, dangerous driving…) according to the name on the application form. They would better measure their risk. Furthermore, they could offer more competitive prices for categories of clients and they could better target them commercially.

Such use is highly controversial, since it raises the question of Equality (or inequalities) and discrimination. But discrimination is a fact, and onomastics can allow us to better see and understand how it works. Why should people with different sounding names hit glass ceilings in the first place, regardless of their skills? Casanova chose his own name de Seingalt and wondered if D’Alembert would have attained his high fame, his universal reputation, if he had been satisfied with his name of M. Le Rond, or Mr. Allround.

I am a supporter of Equal Opportunity Rights. And yet, I built a powerful discrimination algorithm based on names. NamSor is a piece of name recognition software which applies onomastics to analyse global flows of money, ideas and people. Like any powerful new technology, it carries potential risks of abuse, but I believe there is a positive use for it.

One classical application where onomastics plays a significant role is called geo-demographics: it consists in analysing the sociology of a particular territory (including the cultural and ethnic origins of its inhabitants) as inferred from open sources and census data. Geo-demographics can be a useful tool to ensure, for example, that all populations have equitable access to public services, such as hospitals. The company Experian is one of the leaders in that field, especially strong in the UK.

The effective use of Big Data and Open Data is widely considered to be a critical enabler for future SmartCities: it enables dynamic allocation of resources, more efficient use of energy, prompt response to a crisis and so on. The combination of social networks and mobile applications with geo-localized devices opens new possibilities. Recognizing the diversity of populations that cohabit across space and time can help design more inclusive cities and transportation systems. Sensors that discriminate populations (in the sense of perceiving) can draw the clear picture needed to prevent discrimination (in the sense of favouring) and help defuse some of the time bombs ticking here and there.

Targeting diasporas: a game changer for development?

But the most promising use of software such as NamSor could be elsewhere – though it still deals with territorial equality. It is quite common for regions of the world that are less economically developed to use their own weakness (poorer people) as a strength (cheaper labour) to attract investments. The idea is to trigger a virtuous circle of job creation, infrastructure development, better education, migration flow reversal, etc. commonly known as the FDI Magnet effect. The region becomes more attractive and gradually moves up in the global value chain. As it loses competitiveness in terms of cheap labour because of the new wealth of its population, it develops a different economy based on innovation, services, tourism, consumption.

Most countries implement some kind of policy to direct flows of investment towards poorer regions, as a means to preserve their territorial cohesion and integrity. Those policies are most effective when they combine with successful private initiatives. So the objective of many Investment Promotion Agencies (IPAs) is not so much to attract big money as to attract a great business that will employ their people and help them grow. The global competition to attract such investments is fierce.

Poorer regions have another weakness, which can be turned into a strength. Emigration is generally an opportunity loss, but after some years it generates a Diaspora which can be leveraged to attract investments back to the region.

For example, Ireland took decisive steps during the early 1980s to proactively reconnect with its emigrants and with successful businessmen of Irish descent. Rebekah Berry reminds us that “as recently as 1986 Ireland was one of the poorest countries in the European Union, but [in 2002] it is one of the richest. The engine of this new Irish prosperity has been Foreign Direct Investment (FDI). [Between 1986 and 2002], the Irish have done almost everything right. They have attracted huge amounts of money from America – due largely to a century of personal and familial ties – and they have used this money to build factories”.

The regions of Ningxia, Gansu and Qinghai have among the lowest numbers of millionaires in China. But if they could reconnect with the few they have, in Beijing, Shanghai or even abroad, wouldn’t it make a difference?

For that purpose, onomastics can be a useful tool and it has served the development strategy of a European country, Lithuania.

InvestLithuania is the first Investment Promotion Agency (IPA) to use name recognition to originate FDI deals. With three million people living in Lithuania and nearly one million people of Lithuanian origin living abroad, there are a good many personal and familial ties to be leveraged to attract new investment projects to the country. NamSor name recognition software helped discover those ties. Another method to accelerate the origination of new investment leads is to better understand and leverage the existing network of foreign businessmen in the country itself. Domas Girtavicius, a Senior Consultant at Invest Lithuania, said “we were impressed by the accuracy of the name recognition software: it reliably predicts the country of origin and the number of false positives is fully manageable”.

This project with InvestLithuania was very successful and consequently I was invited to participate in the World Lithuanian Economic Forum (WLEF), which took place in Vilnius this year, on the 3rd of June. This Forum is organized by Global Lithuanian Leaders (GLL), a non-profit association whose mission is to reconnect with Lithuanians and friends of Lithuania abroad. I found the GLL to be a great initiative, providing the country with a wealth of expertise from different parts of the world, across all domains (politics, education, culture, business…), and also bridging some of the cultural gaps that necessarily exist in such a matrix (place / domain). Specifically, the GLL helps bring elements of culture from the US and UK, such as entrepreneurship and business networking.

While some diasporas, especially those originating from the Mediterranean, have a millennia-old culture of business and personal networking, other countries struggle to adjust to their new situation. What is the value of a social network such as LinkedIn to the Lebanese Diaspora? Low. What better communication tool in Marseilles than “word of mouth” to launch Massilia Mundi, which aims to become the social network of that city’s international Diaspora? But for many Investment Promotion Agencies (IPAs), LinkedIn is an essential tool. For example, in traditional Lithuanian culture, people treasure strong family ties and personal links with close friends, but do not nurture a wide network of professional connections or casual contacts. I believe many countries are in a similar situation, where a dedicated organisation could help reconnect people: for them, tools such as social networks, professional databases and onomastics can make a difference.

Could that also work for regions in China? In 2005-2009, while I was working for a global consulting firm, I had the opportunity of managing a project in banking with a mix of Chinese and French teams: a team in Paris, which included several young ParisTech graduates of Chinese origin, and a team in Shanghai. I remember the excitement and the pleasure of the entire team – including myself – at doing a project connected with China, with the opportunity of travelling to Shanghai, tasting the food of different regions of China and being introduced to Chinese culture. Several people from that team, both French and Chinese, are now in China. Jing, now a dear friend, went back to Shanghai in 2009 and I remember how she still felt sentimentally bound to her original city of Xiangtan, Hunan – ready to help in any way she could. From this experience, I understood that if there existed such an organization as ‘Global Ningxia, Gansu and Qinghai Leaders’, it would not often encounter rebuffs when reaching out for help, money or expertise. Such an organization could be very helpful in closing the economic gap with other regions.

Technically, Chinese names are clearly recognizable amongst other nationalities or origins. So, by querying a professional database, we can produce an onomastics mapping of Chinese company directors. For example, the following map represents the density of Chinese and Japanese business communities in Southern Latin America, relative to each other.


Source: Factiva DF. Copyright 2013 NamSorts.com, NomTri™ NamSor™ – All rights reserved

How many of those successful Chinese businessmen (or businessmen of Chinese origin) come from Ningxia, Gansu or Qinghai? This is where applied onomastics can be a game changer. Not that all questions are solved. At the present time, the available software allows us to detect phenomena, not to understand them perfectly. For instance, I would like to share two data visualizations produced as part of this effort, which I found beautiful and promising.



What do we see here? Something – something that still needs to be analysed and understood, but something that may be of great value for someone trying to locate and identify potential investors or decision makers. Chinese last names actually raise specific challenges: they have been in use for many centuries and, with rare or less common names disappearing over time, about one hundred names now cover the great majority of the population. But first names still carry regional differences, poetry and other semantics. Roots may be almost invisible, yet onomastics can still track them. And the more difficult the tracking, the more valuable the findings.

Logo Paris Tech Review

This content is licensed under a Creative Commons Attribution 3.0 License

You are free to share, copy, distribute and transmit this content


Download documents : Onomastics for Business.pdf (English version) Onomastique et Big Data.pdf (French version) Mirrors: [Harvard.edu] [arXiv]


Filed under FDI Magnet, General

Chinese names, colorful onomastics

Chinese names

Some more work in progress. No comment – yet.


Filed under EthnoViz

Chinese, Japanese business communities in Latin America

To continue our tour of global business communities, this week we present the onomastics mapping of Chinese and Japanese company directors in Latin America (except Mexico). Our first map represents the overall number of Chinese and Japanese company directors, by country.


This same information can also be represented as a pie chart:


Our second map shows where Chinese and Japanese businessmen are likely to settle down.



About FDIMagnet

FDI Magnet is NamSor™’s offering for Investment Promotion. We use our unique data mining software to offer differentiated Foreign Direct Investment (FDI) services:

–   Diaspora Direct Investments (DDI)
–   Smart Investors Targeting & CRM
–   FDI Targeted Communication

Follow @FDIMagnet, join the LinkedIn group or email us at contact@fdimagnet.com


Filed under EthnoViz, FDI Magnet

SmartCity : Geodemography, Onomastics and Megacities

Can Big Data help make cities Smart AND Inclusive?

DataTuesday (Paris): translation of a presentation given on the 26th of March at IPSOS


PDF download : Smart City : GEODEMOGRAPHY, Onomastics & Megacities




Filed under General

Making sense of three Kanji : Chinese names and Japanese names

Three Kanji Names : Chinese or Japanese


A hint at some work in progress : enjoying the beauty of Kanji and trying to make sense of every bit of information to recognize Chinese names and Japanese names.

Wikipedia tells us that « Japanese names have distinct differences from Chinese names through the selection of characters in a name and pronunciation. A Japanese person can distinguish a Japanese name from a Chinese name by looking at it. Akie Tomozawa, […] said that this was equivalent to how “Europeans can easily tell that the name ‘Smith’ is English and ‘Schmidt’ is German or ‘Victor’ is English or French and ‘Vittorio’ is Italian”. »

Keep in touch and follow us on Twitter @NomTri or join the Onomastics group on Facebook. 


Filed under General

Indian, Chinese, Russian and Japanese directors in European Big Business – an onomastics view [2/2]

Today, with this map of Japanese and Russian business communities in Europe, we complete an earlier post about Indian and Chinese presence in European economic affairs (*).

Japanese and Russian Business Communities in EU plus Switzerland (vF)

The map would look different if we filtered information according to certain sectors (industry, trade, energy,…) but as it is, what does the picture tell us?

Besides showing the obvious and the well-known (a strong Russian business community in Russia’s traditional zone of influence, for example in the Baltic states of Lithuania, Latvia and Estonia; a predominance of Russian businessmen in Cyprus’s international “offshore” financial holdings), it reveals several less expected features.

Firstly, one would expect Germany to be a stronghold of Russian business in Europe, due to the high level of trade between Russia and Germany. It may be so, but while there are many company directors with Russian names, there are even more Chinese, Indian and Japanese businessmen in Germany.

Secondly, there is a clear Japanese preference for Belgium and the Netherlands, both for Foreign Direct Investments (FDIs) and as a gateway for trade with Europe.

Thirdly, while Indian and Chinese directors show a similar profile in their choice of European target countries for FDIs and trading, Russian and Japanese businessmen demonstrate more polarization: they generally make different choices.

Stay tuned! Follow us on Twitter or join the Onomastics group on Facebook

©2012 NamSor™ – All rights reserved.

(* We show the density of Japanese and Russian company directors expressed relative to the total density of Japanese, Russian, Indian and Chinese presence, measured using the onomastics of about half a million company directors of the largest companies of all sectors, in the European Union plus Switzerland. Accuracy of name classification software is typically in the range 75%-95%)
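
To make the relative measure in the note above concrete, here is a minimal Python sketch. It is not NamSor's method or data: the sample rows, the `GROUPS` tuple and the `relative_shares` function are all hypothetical, and it assumes each director has already been assigned an onomastic class by some name classifier. It also uses raw counts per country as a stand-in for "density".

```python
from collections import Counter

# Hypothetical input: one (country, onomastic class) pair per company director.
# The class labels are assumed to come from a name classifier; these rows are
# invented for illustration and do not reproduce any figure from the post.
directors = [
    ("DE", "Japanese"), ("DE", "Chinese"), ("DE", "Indian"), ("DE", "Russian"),
    ("NL", "Japanese"), ("NL", "Japanese"), ("NL", "Russian"), ("NL", "Indian"),
    ("CY", "Russian"), ("CY", "Russian"), ("CY", "Russian"), ("CY", "Chinese"),
]

# The four communities that form the denominator of the relative measure.
GROUPS = ("Japanese", "Russian", "Indian", "Chinese")


def relative_shares(rows, focus):
    """For each country, return the share of each `focus` class among all
    directors belonging to one of the four tracked classes."""
    totals = Counter()        # country -> directors in any tracked class
    focus_counts = Counter()  # (country, class) -> directors in a focus class
    for country, cls in rows:
        if cls not in GROUPS:
            continue  # names outside the four classes are ignored
        totals[country] += 1
        if cls in focus:
            focus_counts[(country, cls)] += 1
    return {
        country: {cls: focus_counts[(country, cls)] / total for cls in focus}
        for country, total in totals.items()
    }


shares = relative_shares(directors, ("Japanese", "Russian"))
for country, by_class in sorted(shares.items()):
    print(country, by_class)
```

Because each share is taken relative to the four tracked communities only, a country with few directors overall can still show a high share. A map built this way shows which community dominates locally, not how large it is in absolute terms.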


Filed under EthnoViz, FDI Magnet