
Workshop on scientific writing for QuEST, April 2021

A Skeptic’s Guide to Scientific Writing

Stefano Allesina, QuEST Workshop, U. Vermont, April 2021

History of the paper

The way research is published has been evolving over the past few centuries:

Common advice on writing papers

The advice given to scientific writers follows closely that given to journalists:

Advice on how to write from Boyle, who was the main proponent of scientific communication based on short essays (rather than books): “a Philosopher need not be sollicitous that his style should delight its Reader with his Floridnesse, yet I think he may very well be allow’d to take a Care that it disgust not his Reader by its Flatness, especially when he does not so much deliver Experiments or explicate them, as make Reflections or Discourses on them; for on such Occasions he may be allow’d the liberty of recreating his Reader and himself, and manifesting that he declin’d the Ornaments of Language, not out of Necessity, but Discretion, which forbids them to be us’d where they may darken as well as adorn the Subject they are appli’d to.”

Just looking at the “Ten Simple Rules” collection in PLoS Computational Biology:

What works for me

Pet peeves and best practices

Things that I value as a reviewer:

Things that I value as an editor:

Being a skeptic

Weinberger, Evans & Allesina, 2015, Ten Simple (Empirical) Rules for Writing Science

Is the advice given to scientists any good?

Effects on citations of abstract features

Have some fun with real data

The file data/plos_compbio.csv contains data on all the documents published in PLoS Computational Biology (as of 3/3/2021). The file is comma-separated, with headers specifying the content of each column:

The file data/plos_compbio_details.csv contains:

Notes on data

Obligatory disclaimer: data are never perfect.

Missing data (e.g., for documents without an abstract) are reported as NA.

You can read a document by opening, for example:
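As a minimal sketch of loading the data in R: the column names below (doi, year, num_citations, abstract) are assumptions based on the variables discussed in this document, not necessarily the actual headers of data/plos_compbio.csv; a tiny example file is written first so the snippet is self-contained.

```r
# Hypothetical columns; the real file may have different headers.
# Write a toy CSV so this runs without the data directory:
example_csv <- tempfile(fileext = ".csv")
writeLines(c("doi,year,num_citations,abstract",
             "10.1371/xxx,2010,15,Some abstract text",
             "10.1371/yyy,2012,NA,NA"),
           example_csv)

# Reading the real file would work the same way:
# plos <- read.csv("data/plos_compbio.csv", stringsAsFactors = FALSE)
plos <- read.csv(example_csv, stringsAsFactors = FALSE)

# Missing values (e.g., documents without an abstract) come in as NA:
sum(is.na(plos$num_citations))  # 1 missing citation count in the toy file
```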

Types of documents

Assignment of gender is based on first name. The reported probability is computed by counting the number of newborns that were assigned at birth a certain name and sex combination (as of today, SSA reports only male/female). The data, provided by the Social Security Administration, covers US newborns from about 1880 to today. Clearly, the assignment is going to be most accurate for authors residing in English-speaking countries, though the large immigrant population in the US allows some resolution of names that originated in other areas of the world.

Taking a peek

This rich data set allows us to explore several aspects of scientific writing. Just a few basic visualizations:

# the code for the visualizations below is here

Distribution of citations:


Note the very broad, skewed distribution. To model citations, it is therefore convenient to transform the data. In particular, plotting log(num_citations + 1), we obtain:
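A sketch of why the log(x + 1) transformation helps, using simulated citation counts (log-normal rates, so the tail is broad and skewed like the real distribution):

```r
# Simulated citation counts with a heavy right tail:
set.seed(1)
num_citations <- rpois(1000, lambda = exp(rnorm(1000, 2, 1)))

# Raw counts are strongly right-skewed (mean pulled above the median)...
mean(num_citations)
median(num_citations)

# ...while log(num_citations + 1) is far more symmetric;
# the "+ 1" keeps zero-citation documents in the data set:
log_cit <- log(num_citations + 1)
```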


Note the many documents with zero citations: these are mostly editorials, errata, etc. Considering only documents of type Article, we find Gaussian-looking distributions (note that 2020 and 2021 are excluded):


Similarly, the number of views is highly skewed:


and the transformation has a similar effect:


The typical number of authors per document increased slightly over the years (record holder):


And the proportion of articles authored by women about doubled over 15 years:


International collaborations became more frequent (record holder):


The number of words in the abstract has remained about constant (record holder):


The proportion of simple words in the abstract has also remained quite constant:


The number of references and figures has been constant as well (record holder):



The proportion of articles containing several equations, on the other hand, has been growing steadily (record holder):


Choose your own adventure

A Choice of Weapons


Does a certain feature correlate (in a qualitative/ranking sense, e.g., Kendall’s tau) with citations/views?
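A rank-correlation test of this kind can be sketched with cor.test(); the data here are simulated (a weak positive link between number of authors and citations), standing in for the real columns:

```r
# Simulated feature and response; in the real data these would be
# columns of plos_compbio.csv (names assumed, not verified).
set.seed(42)
num_authors   <- sample(1:15, 500, replace = TRUE)
num_citations <- rpois(500, lambda = 5 + num_authors)  # built-in positive link

# Kendall's tau: robust to the skew of citation counts,
# since it only uses the ranking of the observations.
res <- cor.test(num_authors, num_citations, method = "kendall")
res$estimate  # tau
res$p.value
```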

Generalized Linear Models

Take y_i to be the (possibly log-transformed) number of citations (or views) for article i. We can model this “response variable” as a linear combination of the predictors plus an error term. Examples (in R):

log(num_citations + 1) ~ year (assumes that log-citations increase at a constant rate per year)

log(num_citations + 1) ~ as.factor(year) + num_authors (fits each year independently, and assumes citations change in an orderly, exponential [i.e., log-linear] way with the number of authors; you might want to bin the number of authors and use it as a factor)

The model can be made as complex as needed.
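The two formulas above can be fit with lm(); this sketch uses simulated data in place of the real columns (year, num_authors, num_citations are generated here, with effects built in by construction):

```r
# Simulated stand-in for the real data set:
set.seed(7)
n <- 1000
year        <- sample(2005:2019, n, replace = TRUE)
num_authors <- sample(1:12, n, replace = TRUE)
# Log-citations grow with year and number of authors by construction:
mu <- 0.1 * (year - 2005) + 0.05 * num_authors
num_citations <- round(exp(rnorm(n, mean = 1 + mu, sd = 0.5)))

# Constant yearly rate on the log scale:
m1 <- lm(log(num_citations + 1) ~ year)
# Each year fit independently, plus a linear effect of authors:
m2 <- lm(log(num_citations + 1) ~ as.factor(year) + num_authors)

coef(m1)["year"]           # estimated yearly rate (log scale)
coef(m2)["num_authors"]    # estimated per-author effect
```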


Say that we want to probe whether papers with a female first author accrue more citations than those with a male first author. We can compute the average (or median, etc.) number of citations for papers with a female first author; call this the observed average. Then, we re-compute the value after shuffling the imputed sex of the first author in the data. By repeatedly shuffling and measuring, we build a distribution of expected averages. If the observed value lies in the tails of this distribution (e.g., as quantified by a p-value), we conclude that the effect of having a female first author is significant.
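The shuffling procedure above can be sketched as follows; the data are simulated with no true sex effect, and first_author_female is a hypothetical column name:

```r
# Simulated data; first_author_female is a hypothetical label, not a
# column guaranteed to exist in the real file.
set.seed(123)
n <- 2000
first_author_female <- sample(c(TRUE, FALSE), n, replace = TRUE)
num_citations <- rpois(n, lambda = 20)  # no true effect built in

# Observed average for papers with a female first author:
observed <- mean(num_citations[first_author_female])

# Null distribution: re-compute after shuffling the labels:
shuffled_means <- replicate(1000, {
  s <- sample(first_author_female)  # permute the imputed sex
  mean(num_citations[s])
})

# Two-sided p-value: fraction of shuffles at least as far from the
# overall mean as the observed value:
p_value <- mean(abs(shuffled_means - mean(num_citations)) >=
                abs(observed - mean(num_citations)))
```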


Analyze the provided data, guided by a hypothesis on what makes for an impactful paper. If you share an Rmd file with me, I can publish it on the website for next week.

An example, looking at homophily in co-authorship.