Overthinking fields of study

Triple Spider-Man meme representing how a field of study can take different forms.

Are all researchers in a computer science department doing computer science? Do researchers need to publish in an established sociology venue to be doing sociology? In a computational, complex-systems world, the relationship between authors' departments, venues, and article content is becoming increasingly fuzzy. This is unfortunate, as one of our main goals is to show how fields of study are becoming computational. If a field of study (FoS) changes label as it becomes computational, our problem gets harder.

Physicists are well known for applying their toolbox to other disciplines, such as the social sciences and epidemiology. Researchers in computer science departments might end up doing social media studies, even though the subject matter very much belongs to the social sciences. Arguably, the mapping between a researcher's background and their host department is not as consistent as it used to be.

Journals evolve over time. Disciplines evolve over time. They can evolve in different ways. First, there is the multiscale nature of disciplines. Disciplines can branch out to become more or less fine-grained. What used to be the natural sciences is now a collection of subfields, related by a common ancestor that was naturalism (or even philosophy if you go back far enough). Statistics is now machine learning. Machine learning is now many things, as measured by conferences and venues.

Topics can fall out of fashion, but the ideas stay alive under different forms. Eugenics used to have its own journal. Although eugenics was rejected as a field, we can still see some of its underlying ideas floating around in different fields of study.

We will introduce the main taxonomies used to classify papers: OpenAlex, Semantic Scholar, Web of Science, and Google Scholar. Each has a different heritage and degree of openness, and each plays a different role in the research ecosystem.

We distinguish between bottom-up and top-down approaches. The top-down approach uses known labels from the top (authors' departments or venues' labels) to classify works at the bottom. It used to be the most common way to classify papers: a paper is sociology when it is published in a known sociology journal. The bottom-up approach is data-driven; it clusters papers into fields using their content and metadata.
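As a minimal sketch of the top-down approach, the snippet below assigns each paper the field attached to its venue. The lookup table, the paper titles, and the function name `classify_top_down` are illustrative assumptions, not part of any of the taxonomies discussed here.

```python
# Hypothetical venue-to-field lookup; a real one would come from a curated source.
VENUE_FIELD = {
    "American Sociological Review": "Sociology",
    "Physical Review Letters": "Physics",
}

def classify_top_down(paper, venue_field=VENUE_FIELD):
    """Return the field attached to the paper's venue, or None if the venue is unknown."""
    return venue_field.get(paper.get("venue"))

# Made-up papers for illustration only.
papers = [
    {"title": "Networks of status", "venue": "American Sociological Review"},
    {"title": "Opinion dynamics on lattices", "venue": "Physical Review Letters"},
    {"title": "A preprint", "venue": None},
]
labels = [classify_top_down(p) for p in papers]
print(labels)
```

Note how the second paper, whose subject matter is arguably social, inherits the label of its physics venue, and how the paper without a venue gets no label at all. Both are the kinds of cases a bottom-up approach is meant to catch.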

To quantify the computational turn, we want to classify works in terms of topics rather than methods. If a paper is about the history of the House of Habsburg, or the Republic of Letters, in such a way that it is identifiable as 'History', we want to call it a work of history regardless of how the subject has been studied. Needless to say, this is tricky. FoS are tied to their methodology; anthropologists do interviews and fieldwork, historians do archival work, sociologists do big surveys, etc. In some ways, there is no such thing as a 'pure work of history', although there are seminal works. This is why we seek to combine multiple perspectives.

OpenAlex taxonomy

tldr; Each paper has a topic, which is then mapped onto Scopus' taxonomy. Topics are clusters on the co-citation graph, which have been labeled using GPT-3.5 Turbo (they provided top papers based on a representative sample of papers).

OpenAlex worked with people at Leiden University to build a list of topics for classifying papers. The approach draws on years of experimentation in bibliometrics and, perhaps more importantly, it is open and transparent. By open, we mean that we know the algorithm underlying the classification (as opposed to Web of Science and Google Scholar). Their taxonomy borrows from Scopus's ASJC structure, but works at the level of papers instead of being derived at the journal level. OpenAlex's contribution in that regard was to connect topics to the ASJC structure.

By virtue of being open, they allow us to tinker with their ideas; we know, for instance, that their topics really come from link clustering of the co-citation graph. We know that they do not use content to derive their classification. This matters. If we define a field in terms of who cites whom, then two researchers using completely different methods might end up in different disciplines, even though they work on the same topic. This is most likely the case for linguistics. In OpenAlex, linguistics is found both in the subcategory of Arts and Humanities and in the Social Sciences. But if you take a venue-first approach (where the venues dictate the classification), as with Google Scholar, we find that computational linguistics is also a field in Engineering & Computer Science.
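To make "who cites whom" concrete, here is a small sketch of how co-citation counts can be computed from reference lists: two papers are co-cited each time a third paper cites both. This is not the OpenAlex pipeline, only an illustration of the kind of graph such clustering starts from; the function name and the toy identifiers are made up.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    """Count how often each pair of papers appears together in a reference list."""
    counts = Counter()
    for refs in reference_lists.values():
        # Sort so each unordered pair is always stored under the same key.
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts

# Toy data: citing paper -> papers it cites (all identifiers are made up).
reference_lists = {
    "citing_1": ["A", "B", "C"],
    "citing_2": ["A", "B"],
    "citing_3": ["C", "D"],
}
counts = cocitation_counts(reference_lists)
print(counts[("A", "B")], counts[("A", "D")])
```

The pair counts act as edge weights. Nothing about the text of A, B, C, or D enters the computation, which is why two papers on the same topic can land in different clusters if they are cited by different communities.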

Semantic Scholar fields of study

tldr; Each paper has up to three FoS. The model uses a paper's title and abstract as input. Folks at Semantic Scholar annotated 500 papers to validate their classification scheme. They favor recall, but still achieve 0.9 precision for papers with abstracts, and 0.8 for title-only papers.

S2ORC builds on the original Microsoft Academic Graph taxonomy, but they built a machine learning model (a linear SVM running on character n-gram TF-IDF representations) to keep classifying papers on their end. Their model uses a paper's title and abstract as input. They favored a flat hierarchy, but each paper can be assigned up to three fields of study. They added the fields of Linguistics, Law, Education, and Agricultural and Food Sciences to their taxonomy.
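The sketch below illustrates two ingredients of this description in plain Python: character n-gram TF-IDF features, and keeping at most three fields per paper. It does not reproduce the Semantic Scholar model. The SVM itself is left out, the smoothed IDF formula is one common convention chosen here as an assumption, and the per-field scores are hypothetical stand-ins for what a linear classifier would output.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf(docs, n=3):
    """Return one {ngram: weight} dict per document (raw tf times smoothed idf)."""
    grams = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(g for doc in grams for g in doc)
    n_docs = len(docs)
    return [
        {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1) for g, tf in doc.items()}
        for doc in grams
    ]

def top_fields(scores, max_fields=3, threshold=0.0):
    """Keep at most `max_fields` fields whose score exceeds `threshold`."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [field for field, s in ranked if s > threshold][:max_fields]

docs = ["Kinship and ritual", "Kinship networks"]  # made-up titles
vectors = tfidf(docs)

# Hypothetical per-field decision scores, as a linear classifier might produce.
scores = {"Sociology": 1.2, "History": 0.4, "Physics": -0.8, "Linguistics": 0.1, "Law": 0.05}
fields = top_fields(scores)
print(fields)
```

Character n-grams shared by both titles (such as "kin") receive a lower weight than n-grams unique to one title (such as "rit"). Lowering `threshold` lets more fields through, which trades precision for recall, the direction the text says they favored.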

semantic scholar publishing partners

Embeddings

tldr; Embeddings are magical. We can cluster papers to create bottom-up fields of study.

Embeddings are magical, but hard to work with in a principled way. They are magical because they seem to work. Each point in the figure on the left corresponds to a paper, or more precisely to a combination of the paper's title, abstract, and citations. If you hover over a point, you can look at the title and abstract and see that neighboring papers are kind of related. They are hard to work with because nothing guarantees that we are not just reading tea leaves.

We can aggregate researchers' contributions to better understand their field of study. Because this space is mostly topical, we could in principle find researchers who are doing, say, sociology even if they do not tend to publish in sociology journals. I say mostly because, if computational social scientists keep talking about how they use computational methods, they might end up closer to each other than to researchers in their 'true field of study'.
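One simple way to aggregate a researcher's contributions, sketched below under stated assumptions, is to average the embeddings of their papers and compare that profile with a reference vector per field using cosine similarity. The averaging rule, the reference vectors in `field_prototypes`, and the 2-d toy numbers are all illustrative choices, not the method used for the figure.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def closest_field(author_papers, field_prototypes):
    """Return the field whose reference vector is most similar to the author's mean paper."""
    profile = mean_vector(author_papers)
    return max(field_prototypes, key=lambda f: cosine(profile, field_prototypes[f]))

# Toy 2-d embeddings; real paper embeddings have hundreds of dimensions.
field_prototypes = {"Sociology": [1.0, 0.0], "Computer Science": [0.0, 1.0]}
author_papers = [[0.9, 0.2], [0.8, 0.4], [0.3, 0.7]]
field = closest_field(author_papers, field_prototypes)
print(field)
```

In this toy case the researcher lands closest to Sociology even though one of their papers sits nearer to Computer Science, whatever venue each paper appeared in.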

One nice feature of the embedding space is that when we use density-based methods to cluster papers, we end up with some prototypical coordinates for a given field of study...
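As a hedged illustration, one reading of "prototypical coordinates" is the centroid of each cluster, computed after a density-based method has labeled the points. The sketch assumes cluster labels are already available and that unclustered points carry the label -1, a convention used by DBSCAN-style implementations. The centroid choice and the function name are assumptions.

```python
from collections import defaultdict

NOISE = -1  # label commonly given to points that belong to no dense region

def cluster_prototypes(points, labels):
    """Return {cluster_label: centroid}, skipping points labeled as noise."""
    members = defaultdict(list)
    for point, label in zip(points, labels):
        if label != NOISE:
            members[label].append(point)
    return {
        label: [sum(col) / len(pts) for col in zip(*pts)]
        for label, pts in members.items()
    }

# Toy 2-d points with made-up cluster labels.
points = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0], [9.0, 9.0]]
labels = [0, 0, 1, 1, NOISE]
prototypes = cluster_prototypes(points, labels)
print(prototypes)
```

Leaving noise points out matters: papers that sit between fields would otherwise drag the prototype away from the dense core of the cluster.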

Web of Science

Talking about FoS

Google Scholar (FoS → Venues)

Here we look at a different approach: papers inherit the categories of their venues. The rationale is that if a paper is published in a well-known venue in, say, linguistics, then it is a linguistics paper. It is unclear where Google Scholar's categories come from.

[Figure: Level 1 and Level 2 categories for OpenAlex, Semantic Scholar, WoS, and Google Scholar]

Final taxonomy

With all that, we construct our own taxonomy, which seeks to bring forward the parts of science that might be transitioning from qualitative to quantitative science. We proceed as follows:

◆ Start from OpenAlex's level 1 hierarchy
◆ Break down large categories, using S2ORC as a replacement
  ◆ Split Arts and Humanities and Social Sciences
  ◆ Replace Agricultural and Biological Sciences with Agricultural and Food Sciences
  ◆ Split Environmental Science into `ecology` and `Pharmacology, Toxicology and Pharmaceutics`
  ◆ Add `Earth and Planetary Sciences`
  ◆ Split Statistics, Probability and Uncertainty from Mathematics
◆ Aggregate some categories that are not the focus of this study
  ◆ Aggregate Health Professions and Medicine into Health and Medical Sciences
  ◆ Aggregate Veterinary and Dentistry into Health and Medical Sciences
  ◆ Aggregate Energy into Physics and Astronomy or Materials Science
◆ Add Literature and Literary Theory, Information Systems and Management
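The replacement and aggregation steps in the list above can be sketched as a simple relabeling table. This covers only the one-to-one rules. The splits need paper-level information, and the Energy rule needs a choice between two targets, so both are left out. The exact field spellings and the name `relabel` are assumptions.

```python
# Relabeling rules taken from the list above; field spellings are assumed.
RELABEL = {
    "Agricultural and Biological Sciences": "Agricultural and Food Sciences",
    "Health Professions": "Health and Medical Sciences",
    "Medicine": "Health and Medical Sciences",
    "Veterinary": "Health and Medical Sciences",
    "Dentistry": "Health and Medical Sciences",
}

def relabel(field):
    """Map a source field to its label in the final taxonomy; unlisted fields pass through."""
    return RELABEL.get(field, field)

source_fields = ["Medicine", "Dentistry", "Agricultural and Biological Sciences", "Mathematics"]
final_fields = sorted({relabel(f) for f in source_fields})
print(final_fields)
```

Keeping the rules in one table makes the taxonomy easy to audit and to revise when a category turns out to deserve finer treatment.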