Summary database
Here is the summary of our 4 tables, and the information we wrangle from them:
Unique Name in Paper table
Coauthor table
Author table
Training table
People:Department to Category mapping
Validity assessment of our department to college mapping:
Balancing annotations
To study group size across fields of science, we would need as many faculty members with research groups as without in each domain. At the moment, we have oversampled faculty with research groups because we were interested in characterizing group size.
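A minimal sketch of how the balance could be checked, assuming each annotation is a record with a `domain` and a boolean `has_research_group`. Both field names and the toy records are hypothetical, chosen only for illustration.

```python
from collections import Counter


def group_balance(annotations):
    """Count faculty with and without a research group in each domain.

    `annotations` is an iterable of dicts with the (hypothetical) keys
    `domain` and `has_research_group`.
    """
    counts = Counter((a["domain"], a["has_research_group"]) for a in annotations)
    domains = sorted({domain for domain, _ in counts})
    return {
        domain: {
            "with_group": counts[(domain, True)],
            "without_group": counts[(domain, False)],
        }
        for domain in domains
    }


# Toy records, not real data.
example_annotations = [
    {"domain": "Biology", "has_research_group": True},
    {"domain": "Biology", "has_research_group": True},
    {"domain": "Biology", "has_research_group": False},
    {"domain": "History", "has_research_group": False},
]

for domain, tally in group_balance(example_annotations).items():
    print(domain, tally)
```

A domain where `with_group` is much larger than `without_group` is one where more annotations of faculty without research groups would be needed.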
Data quality check
Author age
We check author age. Most often, an age above 50 years is a glitch in the min paper year that needs to be fixed. Right now we fix these manually.
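The check can be sketched as a simple filter that lists the authors to review by hand. This is only an illustration: the field names `author_age` and `min_paper_year` and the toy records are assumptions, not the actual schema.

```python
MAX_PLAUSIBLE_AGE = 50  # threshold used in the text


def flag_suspect_authors(authors, max_age=MAX_PLAUSIBLE_AGE):
    """Return the author records whose age exceeds `max_age`.

    Records with a missing age (None) are skipped rather than flagged.
    """
    return [
        a for a in authors
        if a["author_age"] is not None and a["author_age"] > max_age
    ]


# Toy records, not real data.
example_authors = [
    {"name": "author_a", "author_age": 12, "min_paper_year": 2008},
    {"name": "author_b", "author_age": 74, "min_paper_year": 1946},
    {"name": "author_c", "author_age": None, "min_paper_year": None},
]

for author in flag_suspect_authors(example_authors):
    # These are the records whose min paper year should be reviewed manually.
    print(author["name"], author["min_paper_year"])
```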
Coauthor age
The same check applies to coauthors. Since coauthors are numerous, we do nothing to fix them right now. If a coauthor's age is above 50, we set it to NA in the model so that age_diff between author and coauthor is also NA. This might not be the best solution, because it means we artificially cap age_diff at 50 years.
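A minimal sketch of this rule, using `None` to stand in for NA. The function name and the sign convention (coauthor age minus author age) are assumptions made for illustration; the text does not specify them.

```python
MAX_PLAUSIBLE_AGE = 50  # threshold used in the text


def age_diff(author_age, coauthor_age, max_age=MAX_PLAUSIBLE_AGE):
    """Return coauthor_age - author_age, or None (NA) when it cannot be trusted.

    A coauthor age above `max_age` is treated as missing, so the
    difference is missing too. The sign convention is an assumption.
    """
    if author_age is None or coauthor_age is None:
        return None
    if coauthor_age > max_age:
        return None
    return coauthor_age - author_age


print(age_diff(10, 30))  # plausible coauthor age, difference is kept
print(age_diff(10, 62))  # coauthor age above the threshold, difference is NA
```

Because every coauthor age above the threshold is dropped, no retained difference can exceed the threshold itself, which is the artificial cutoff described above.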