Classifying computational works

As we saw in the previous section, 'computational' has different meanings for different communities. For some communities, computational is really about numerical work, as in computational physics. In other communities, such as the social sciences, computational, as in 'computational methods', refers to a set of practices that require programming in some way.

We define computational works with the following heuristic:

Did the authors have to run code to produce the paper, that is, perform simulations, produce visualizations, or run statistical analyses?

This is hard to answer. We adopt the following methods, which range from a full-blown quantification of how much computation a paper involves to a simple binary classification.

Computational complexity

This is the best answer we can give. Given a set of parsed PDFs, we extract sentences that mention practices that might involve computer programming, e.g. simulation, data or code availability statements, computational X, and so on. For each mention, we seek to determine the authors' intent (similar to the Software Mention Recognition task at NSLP 2024, and related more generally to classifying citation intent in papers). That is, we seek to determine whether the authors mentioned simulation/data/code/... because they used or created the mentioned thing, or whether it was simply a mention of others' work. To do that, we need to be able to extract all mentions from a paper.
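The extraction step can be sketched as follows. This is a minimal illustration, not the actual pipeline: the keyword patterns, the naive sentence splitter, and the names `MENTION_PATTERNS` and `extract_mentions` are all assumptions made for the example. It only finds candidate sentences; deciding whether a mention reflects usage, creation, or a reference to others' work is a separate classification step that is not shown here.

```python
import re

# Hypothetical keyword patterns, one per practice named in the text.
# A real system would need a much broader, validated vocabulary.
MENTION_PATTERNS = {
    "simulation": re.compile(r"\bsimulat\w*", re.IGNORECASE),
    "availability_statement": re.compile(
        r"\b(?:data|code)\b.{0,40}?\bavailab\w*", re.IGNORECASE
    ),
    "computational_x": re.compile(r"\bcomputational\s+\w+", re.IGNORECASE),
}


def extract_mentions(text):
    """Return (label, sentence) pairs for sentences matching a pattern."""
    # Naive splitter: breaks after ., ! or ? followed by whitespace.
    # It will mis-split abbreviations such as "et al.".
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    mentions = []
    for sentence in sentences:
        for label, pattern in MENTION_PATTERNS.items():
            if pattern.search(sentence):
                mentions.append((label, sentence))
    return mentions


example_text = (
    "We ran numerical simulations of the model. "
    "The weather was nice. "
    "Code is available at our repository. "
    "Prior work relied on computational methods."
)

for label, sentence in extract_mentions(example_text):
    print(f"{label}: {sentence}")
```

Note that the last example sentence is matched even though it describes others' work, which is exactly why the intent classification step is needed after extraction.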

With that in hand, we can create a score on a "computational complexity" scale, where 0 is not computational at all and 9 is highly computational. We understand computational complexity the way anthropologists understand technological complexity. We assume that we are able to count the parts of a paper that involve computation. A paper with only numerical simulations contains less computation than one involving simulations as well as a full-blown data analysis pipeline. Similarly, a paper whose authors had to scrape their own data contains more computation than one where the data was provided, and so on. In a very naive sense, this notion of computational complexity resembles Kolmogorov complexity, where complexity is measured as the length of the shortest computer program that could produce the object in question. The only difference is that we seek to measure the length of the programs underlying the paper's ideas.
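As a rough sketch of how such a score could be computed, the example below counts the distinct practices that the authors actually used or created and caps the result at 9. The one-point-per-practice weighting, the `Mention` type, and the practice labels are assumptions for illustration only; the text does not fix how the parts of a paper are weighted.

```python
from dataclasses import dataclass

MAX_SCORE = 9  # upper end of the scale described in the text


@dataclass(frozen=True)
class Mention:
    """One classified mention (hypothetical structure)."""
    practice: str          # e.g. "simulation", "data_pipeline", "scraping"
    used_or_created: bool  # False when it only refers to others' work


def complexity_score(mentions):
    """Count distinct practices used or created by the authors, capped at 9."""
    # Assumption: each distinct practice contributes exactly one point.
    practices = {m.practice for m in mentions if m.used_or_created}
    return min(len(practices), MAX_SCORE)


simulation_only = [Mention("simulation", True)]
simulation_and_pipeline = [
    Mention("simulation", True),
    Mention("data_pipeline", True),
    Mention("scraping", False),  # mention of others' work, not counted
]

print(complexity_score(simulation_only))
print(complexity_score(simulation_and_pipeline))
```

Under these assumptions, a paper with only simulations scores lower than one that also includes a data analysis pipeline, which matches the ordering described above.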








Distilling computational complexity to binary classification

In many cases, our main database (Semantic Scholar) gives us only a paper's title, abstract, and figures. Based on that, we can seek to estimate the probability that a paper is computational, but we cannot fully quantify its computational complexity. We play the following question game to label computational papers from the abstract, title, and figures alone:

Here are some pitfalls when doing so: