Handle with Care: pitfalls in analysing book usage data

A recent report by SpringerNature demonstrates quite how careful one needs to be in the process of collecting and analysing book usage data to ensure that reliable insights are obtained.

In November SpringerNature released a white paper entitled “The OA effect: How does open access affect the usage of scholarly books?”


The first part of the report compares download and citation data between OA and non-OA titles published by Springer over the past 4 years. The headline findings are that download rates from the SpringerLink website are 7 times higher for the OA titles than the non-OA titles, and that citation rates are 50% higher for OA titles. The immediate conclusions proposed by the report are that books will attract 7 times more downloads and be cited 50% more if they are published as OA titles rather than non-OA. Having advocated the advantages of OA book publishing for some time I was naturally pleased to see these findings.

But correlation is not causation, and observing the former does not establish the latter. Alas, on closer inspection of the data and analysis, it is apparent that the study has a number of serious data processing weaknesses which mean that neither of the main conclusions can be verified (or falsified) by the data or analysis presented in the report.

I will consider the download and citation results in turn.

1. Downloads – aggregation issues

The actual data used in the analysis is not provided, so it is difficult to accurately assess the validity of the data directly. The downloads data refers to downloads from the SpringerLink website – where “downloads are recorded for individual chapters rather than as full book downloads.” (p. 7) Of course the report itself studies book usage, rather than individual chapters – and so to obtain book-level download data it appears that they simply add together all the individual chapter downloads for each book (certainly that is how the book-level downloads data displayed on the SpringerLink website is generated).

The immediate difficulty of this process is that if somebody wishes to download the entire volume they need to download every chapter – so the total number of “chapter” downloads recorded for each such user will equal the total number of chapters in the book, and books with more chapters will naturally be rewarded by having more chapter downloads.

This problem is magnified, however, by the fact that for both OA and non-OA titles the entire volume can be easily downloaded from the SpringerLink website as a single file without the need to download each chapter individually (albeit at a cost for the non-OA titles). When this happens it appears that each chapter is recorded as having received a download, so a user downloading the entire title as a single PDF or EPUB file generates as many ‘chapter’ downloads as there are chapters in the book.

Problem 1: When downloaded as a complete book, titles are recorded as receiving as many downloads as there are chapters in the book – so those titles with more chapters will be recorded as having received more downloads.

This makes the comparison of download figures between titles difficult to interpret. Charts 2-4 of the report show downloads for Science titles to be about twice as high as for Humanities titles. This may have nothing to do with usage rates at all, but may simply reflect Science titles typically having twice as many chapters as Humanities titles. Clearly, to draw any conclusions about comparative usage between disciplines (or titles) one first needs to correct for the number of chapters in the book.
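A minimal sketch of such a correction, dividing recorded chapter downloads by the number of chapters. The figures and title names are hypothetical and are not taken from the report. Note that this division is exact only when every download is a whole-book download, so it should be read as a rough first adjustment rather than a complete fix.

```python
def downloads_per_chapter(chapter_downloads: int, chapters: int) -> float:
    """Recorded chapter downloads divided by the number of chapters in the title."""
    if chapters <= 0:
        raise ValueError("a title needs at least one chapter")
    return chapter_downloads / chapters


# Hypothetical figures, not taken from the report:
# title -> (recorded chapter downloads, number of chapters)
titles = {
    "science_title": (4000, 20),
    "humanities_title": (2000, 10),
}

normalised = {
    name: downloads_per_chapter(downloads, chapters)
    for name, (downloads, chapters) in titles.items()
}
print(normalised)
```

In this made-up case the Science title records twice as many raw downloads, yet both titles come out at the same figure per chapter.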

Of course the main result claimed in this report is relative, not absolute. It is reported that OA titles receive 7 times more downloads than non-OA titles. If, on average, there is no difference in the number of chapters between OA and non-OA titles in a specific discipline, then this ratio might not be affected – right?

Alas – not necessarily. For the ratio to remain unchanged, the proportion of people downloading the whole book (rather than an individual chapter) must also be the same for OA and non-OA titles – and there is no reason to expect that to be the case. Given that users of non-OA titles must pay more to download the entire book than an individual chapter, but that both are free for OA titles, it is not unreasonable to expect the proportion of ‘whole book’ downloads to be higher for OA titles. Because each whole-book download is multiplied by the number of chapters in the volume, a small proportional increase in users downloading the entire volume will increase the absolute number of recorded downloads by a much greater proportion.
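The mechanism can be written down under a deliberately simple assumption that is mine rather than the report's: each of N readers either downloads the whole book (a share p of them, each recorded as C chapter downloads) or downloads exactly one chapter. The symbols N, p, C and D (recorded chapter downloads) are introduced here purely for illustration.

```latex
\[
D = N\,\bigl[\,pC + (1-p)\,\bigr] = N\,\bigl[\,1 + p\,(C-1)\,\bigr]
\]
\[
\frac{D_{\mathrm{OA}}}{D_{\mathrm{non}}}
  = \frac{N_{\mathrm{OA}}}{N_{\mathrm{non}}}
    \cdot
    \frac{1 + p_{\mathrm{OA}}\,(C-1)}{1 + p_{\mathrm{non}}\,(C-1)}
\]
```

With identical readership (N_OA = N_non), C = 20 and a whole-book share rising from 10% to 20%, the second factor is 4.8 / 2.9, or roughly 1.66. Recorded downloads would rise by about two thirds without a single additional reader.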

An example may be useful. Let’s consider a book with 20 chapters, and start with the hypothesis that 1000 readers are interested in this title and will download a component of this book whether it is OA or non-OA. Further, let’s assume that when readers have to pay for access (non-OA) they download only a single chapter of the book, while they download the entire work when it is free to do so (OA). In this situation the OA work will be recorded as having 20 times as many downloads as the non-OA version. As a non-OA title the book will receive 1000 chapter downloads, but as an OA publication it will be credited with 20,000 downloads. While this is clearly an extreme example, it usefully demonstrates the problem.
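The arithmetic of this example can be reproduced in a few lines. The function below is a hypothetical sketch: it assumes that every reader downloads either the whole book or exactly one chapter, and the intermediate whole-book shares are illustrative values, not estimates from any data.

```python
def recorded_downloads(readers: int, chapters: int, whole_book_share: float) -> float:
    """Chapter downloads recorded when a given share of readers takes the whole book.

    Assumes whole-book readers are counted once per chapter and all
    other readers download exactly one chapter.
    """
    if not 0.0 <= whole_book_share <= 1.0:
        raise ValueError("whole_book_share must be between 0 and 1")
    return readers * (whole_book_share * chapters + (1.0 - whole_book_share))


# The two extremes from the example, plus some illustrative shares in between.
for share in (0.0, 0.25, 0.5, 1.0):
    print(share, recorded_downloads(readers=1000, chapters=20, whole_book_share=share))
```

The number of readers is fixed at 1000 throughout. Only their download behaviour changes, yet the recorded total moves between 1000 and 20,000.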

Problem 2: The larger number of downloads reported for OA titles in this report may actually reflect a change in the download behaviour by existing readers when the work is published OA, rather than any increase in the number of people accessing the work.

Of course, it would be remarkable if a freely downloadable title did not actually attract more people to download the work. It’s just that the data presented in this report doesn’t allow us to conclude that, or quantify any increase.

One solution that would go some way to mitigating both problems would be to use ‘user session’ data (at the book level) rather than chapter downloads. Session data would record the number of users downloading any part of the title as a single ‘download’ – irrespective of how many chapters they downloaded during the session.
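As a sketch of the difference between the two measures, the snippet below counts the same download log both ways. The log layout (session id, book id, chapter number) and all values are hypothetical; the format of the actual SpringerLink logs is not described in the report.

```python
from collections import Counter

# Hypothetical log rows: (session_id, book_id, chapter_number)
events = [
    ("s1", "book_a", 1), ("s1", "book_a", 2), ("s1", "book_a", 3),  # whole book
    ("s2", "book_a", 2),                                            # one chapter
    ("s3", "book_b", 1),                                            # one chapter
]


def chapter_downloads(events):
    """Count every chapter download separately, per book."""
    return Counter(book for _session, book, _chapter in events)


def session_downloads(events):
    """Count each (session, book) pair once, however many chapters were taken."""
    unique_pairs = {(session, book) for session, book, _chapter in events}
    return Counter(book for _session, book in unique_pairs)


print(chapter_downloads(events))
print(session_downloads(events))
```

Here book_a receives four chapter downloads but only two sessions, so the whole-book reader no longer counts three times.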

Taking the base unit of analysis as a ‘session’ might also allow the analysis of differences in behaviour within a ‘session’ between OA and non-OA titles. It is important to understand how users interact with the content – and it would be interesting to assess if users are interacting with more content, or with the content in different ways, when the title is OA.

2. Citations – selection bias

There have been numerous studies assessing possible citation advantages for journal articles when published OA. Early studies reported very large citation advantages for OA articles – but subsequently these studies were shown to have substantially overstated the advantage due to selection bias in the articles published OA. (E.g. McCabe, Mark J., and Snyder, Christopher M., (2015) “Does Online Availability Increase Citations? Theory and Evidence from a Panel of Economics and Business Journals.” Review of Economics and Statistics 97:1, 144-165. See also the Open Access Citation Impact Bibliography created by A. Ben Wagner.)

Put simply, articles published OA are – on average – better quality than those published non-OA, and so they are cited more because of that. Within the scholarly articles literature there have been an increasing number of studies trying to correct for various alternative selection biases – by conditioning on such things as the author’s institution (articles with US-based authors receive more citations than others, and are also relatively more likely to be published OA), the previous publishing record of the author (senior/prestigious authors are more likely to both be cited and publish OA) etc.

This report does note the possible difficulty associated with selection bias – but unfortunately does nothing to try to correct for it. In consequence the authors are reporting only a correlation, and we must expect a similar outcome to that of articles – that any OA citation advantage is likely to be much lower than reported here (or potentially not even exist at all) when the selection biases in the quality and other characteristics of OA book publications are corrected for.

Problem 3: We can’t tell from this report whether OA books are cited more because they are published OA or because the authors of more citable books are also more likely to have chosen to publish them OA.

The solution, of course, is to make concerted efforts to correct for possible selection biases. How best to do that will depend on the dataset actually available. An important component is likely to be conditioning results on a rich set of title-level metadata (much of which will be available from CrossRef, itself the source for the citation data used in this analysis). An alternative is the careful assessment of the behaviour of users who are not charged for access to either OA or non-OA titles (through whole-collection deals or the like), data which SpringerNature may have available.
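One very simple form of conditioning is to compare OA and non-OA titles within groups that share a characteristic, instead of across the whole catalogue. The sketch below uses invented data and a hypothetical ‘seniority’ field to show how a raw citation advantage can vanish once such a comparison is made. It is not the method used in the report, and it can only correct for characteristics that are actually observed.

```python
from collections import defaultdict
from statistics import mean

# Invented titles: the 'seniority' field and all citation counts are hypothetical.
titles = [
    {"oa": True,  "seniority": "senior", "citations": 10},
    {"oa": True,  "seniority": "senior", "citations": 10},
    {"oa": True,  "seniority": "senior", "citations": 10},
    {"oa": False, "seniority": "senior", "citations": 10},
    {"oa": True,  "seniority": "junior", "citations": 2},
    {"oa": False, "seniority": "junior", "citations": 2},
    {"oa": False, "seniority": "junior", "citations": 2},
    {"oa": False, "seniority": "junior", "citations": 2},
]


def naive_ratio(titles):
    """Mean OA citations divided by mean non-OA citations, ignoring everything else."""
    oa = [t["citations"] for t in titles if t["oa"]]
    non_oa = [t["citations"] for t in titles if not t["oa"]]
    return mean(oa) / mean(non_oa)


def within_group_ratios(titles, key):
    """The same ratio computed separately inside each group defined by `key`."""
    groups = defaultdict(lambda: {True: [], False: []})
    for t in titles:
        groups[t[key]][t["oa"]].append(t["citations"])
    # Only groups containing both OA and non-OA titles can be compared.
    return {
        name: mean(g[True]) / mean(g[False])
        for name, g in groups.items()
        if g[True] and g[False]
    }


print(naive_ratio(titles))
print(within_group_ratios(titles, "seniority"))
```

In this constructed case OA titles appear to be cited twice as often overall, but within each group the ratio is exactly one: the whole apparent advantage comes from which authors chose OA.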

It should also be noted that selection bias problems will affect the analysis of downloads in precisely the same way – so similar corrections will be required for the analysis of both citations and downloads data for any causal relationship to be tested.


This report by SpringerNature has asked some key questions about the impact of OA publishing on citation rates and download behaviour, and would seem to have at its disposal an extensive dataset with which to undertake a rigorous and insightful analysis. Alas, due to weaknesses in the data processing and analysis, this particular study doesn’t provide us with any useful insights into the important questions posed. However, a more careful analysis of the data is still possible, and I strongly encourage SpringerNature to now involve somebody with rigorous data processing skills directly in their ongoing analysis of this potentially informative dataset.

It should also be noted that data in this study is taken solely from the SpringerLink website. One important aspect of OA publications is that they can also be hosted and downloaded from many other platforms not under the control or ownership of the publisher, so one might expect any analysis based on downloads from the SpringerLink platform alone to significantly underestimate total usage/download levels for OA titles.

An aside: At Open Book Publishers (an OA book publisher) we have been making concerted efforts to collect and collate usage data from multiple platforms, and to present this data online. As part of the EU-funded HIRMEOS project we are also working to extend our existing processes and develop an open source system for collecting and collating usage data for Open Access books.

