Visualizing Book Usage Statistics with Metabase

Vincent W.J. van Gerven Oei

doi:10.21428/ae6a44a6.e290a663

Open Access and Usage Statistics

There is an inherent contradiction between publishing open access books and gathering usage statistics. Open access books are meant to be copied, shared, and spread without any limit, and the absence of any Digital Rights Management (DRM) technology in our PDFs indeed makes it impossible to track that circulation. Nevertheless, we can gather an approximate impression of book usage among certain communities, such as hardcopy readers and those connected to academic infrastructures, by gathering data from various platforms and correlating them. These data are useful for both our authors and supporting libraries to gain insight into the usage of punctum publications.1

As there exists no ready-made open-source solution that we know of to accomplish this, for many years we struggled to import these data from various sources into ever-growing spreadsheets, with ever more complicated formulas to extract meaningful data and visualize them. This year, we decided to split up the database and correlation/visualization aspects, by moving the data into a MySQL database managed via phpMyAdmin, while using Metabase for the correlation and visualization part. This allows us to expose our usage data publicly, while also keeping them secure.

Book Sales

The compilation of book sales data is relatively easy. punctum books sells its publications through the KDP (formerly CreateSpace) platform of Amazon, through which they are available in all major online book stores. The Uitgeverij imprint, which was merged into the punctum catalog, is sold through Ingram’s Lightning Source, a comparable platform. (Together, Amazon and Ingram basically form a duopoly in print-on-demand book printing and distribution.) Sales figures can be downloaded from their respective backends, although KDP makes detailed sales figures difficult to retrieve after a three-month period, basically necessitating manual data input. We were unable to retrieve any book sales data from KDP or CreateSpace from before June 2012, even though our first publication dates to November 2011. So some sales data from our early years are missing.

Furthermore, punctum provides books directly to authors, bookshops, and the library vendor GOBI. For invoicing, we use the open-source platform InvoiceNinja, which can output detailed monthly overviews.

There are of course legacy data, for example from PayPal (which we used in the past for invoicing). PayPal’s data formats have changed over the years in haphazard and inexplicable ways, necessitating quite a bit of manual processing to keep them usable. Fortunately, we no longer have to rely on them.

Downloads/“Interactions”

There is considerable variation in what is counted as a “download” of or “interaction” with an ebook. In the past, we’ve used different WordPress plugins that recorded whenever someone clicked on the link to the PDF. (We’ve experimented with both free and paid downloads.) Since 2019, all the PDFs referenced on our website link directly to the OAPEN repository, which provides COUNTER-compliant usage statistics to us on a regular basis. However, OAPEN has transitioned its infrastructure to DSpace this year, and we decided to postpone ingesting their 2020 usage data until they have finalized their new usage stats export format.

On OAPEN, only a single PDF file is hosted per publication, which allows for a relatively straightforward equation between download and interaction; every single interaction with the file, either by opening it in a browser or saving it to a hard drive, can be counted once.

This equivalence is not so straightforward with other platforms, such as Project MUSE and JSTOR. Both platforms feature a selection of publications from the punctum catalog, but because of their background in hosting journal articles, monographs and edited collections are by default cut up into chapters. For example, this means that in order to access our entire book Anthropocene Unseen from JSTOR, you’d have to download 92 separate files, which are presumably counted as separate interactions. A single reader saving the whole book would thus count once on OAPEN but presumably 92 times on JSTOR, so this practice of cutting up books inflates the number of interactions.

Although OAPEN provides usage stats that comply with the standards of COUNTER 5, this is not the case for JSTOR and Project MUSE. Responding to a query from our side, JSTOR informed us that although their reports regarding books are built on the same reporting system as their COUNTER 5 reports (for journal articles), they are technically not COUNTER-compliant because the report types that JSTOR offers are not mandated by the COUNTER code of practice. Project MUSE responded to a similar query that whereas their usage stats provided to libraries are COUNTER-compliant, the stats provided to publishers are strictly speaking not so.

This then creates a problem in terms of comparison between different data sources: the proverbial apples and pears (or perhaps apples and quinces). This appears to be a problem that cannot be solved by an individual publisher.

Finally, there is also an enormous practical obstacle to meaningfully collating and comparing ebook usage data from these different sources: they don’t use unified standards. None of them uses ISO 8601 encoding for dates, though books are usually referred to by their ISBN. When it comes to comparing country- and institution-level usage data, things become more complicated.
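To illustrate the date issue, here is a minimal Python sketch of normalizing platform-specific dates to ISO 8601 before loading them into the database. The source names and date formats in it are assumptions made for illustration; the actual export formats of each platform are not specified here and would have to be filled in per data source.

```python
from datetime import datetime

# Hypothetical source names and date formats, chosen only for illustration.
# Each real export needs its own entry after inspecting the file.
SOURCE_DATE_FORMATS = {
    "platform_1": "%m/%d/%Y",  # e.g. 03/15/2020
    "platform_2": "%m-%Y",     # e.g. 03-2020 (monthly report)
}


def to_iso_8601(source: str, raw: str) -> str:
    """Return the date as an ISO 8601 string (YYYY-MM-DD).

    Monthly values carry no day, so strptime defaults them to the
    first day of the month.
    """
    fmt = SOURCE_DATE_FORMATS[source]
    return datetime.strptime(raw.strip(), fmt).date().isoformat()
```

Declaring one format per source, rather than guessing among several, avoids silently swapping day and month in ambiguous values such as 03/04/2020.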

Note: OAPEN data for 2020 are not yet integrated.

Per Country

Despite the fact that there exists a universal standard for encoding countries, the ISO 3166-1 alpha-2 standard, which provides each country with its own two-letter code, none of the platforms actually use this in their usage stats output. This leads to discrepancies such as the following2:

ISO 3166-1 alpha-2 code	ISO 3166-1 alpha-2 name	Platform 1	Platform 2
PS	Palestine, State of	Palestinian Territory, Occupied	Palestine
US	United States of America	USA	United States
TW	Taiwan, Province of China	Taiwan, Province of China	Taiwan
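The rows of the table above can serve as entries in a correlation table that is joined against the raw usage data. The following Python sketch shows the idea under several assumptions: sqlite3 stands in for the MySQL database, all table and column names are hypothetical, and the interaction counts are made up to exercise the query.

```python
import sqlite3

# sqlite3 stands in for MySQL; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country_correlation (
    platform      TEXT NOT NULL,
    platform_name TEXT NOT NULL,
    iso_code      TEXT NOT NULL,
    PRIMARY KEY (platform, platform_name)
);
CREATE TABLE platform_usage (
    platform     TEXT NOT NULL,
    country_name TEXT NOT NULL,
    interactions INTEGER NOT NULL
);
""")

# Correlation entries taken from the table above.
conn.executemany(
    "INSERT INTO country_correlation VALUES (?, ?, ?)",
    [
        ("Platform 1", "Palestinian Territory, Occupied", "PS"),
        ("Platform 2", "Palestine", "PS"),
        ("Platform 1", "USA", "US"),
        ("Platform 2", "United States", "US"),
        ("Platform 1", "Taiwan, Province of China", "TW"),
        ("Platform 2", "Taiwan", "TW"),
    ],
)

# Made-up usage rows, including one name with no correlation entry.
conn.executemany(
    "INSERT INTO platform_usage VALUES (?, ?, ?)",
    [
        ("Platform 1", "USA", 10),
        ("Platform 2", "United States", 5),
        ("Platform 2", "Taiwan", 3),
        ("Platform 2", "(not set)", 1),
    ],
)


def totals_by_iso_code(conn):
    """Sum interactions per ISO code across all platforms."""
    rows = conn.execute("""
        SELECT c.iso_code, SUM(u.interactions)
        FROM platform_usage AS u
        JOIN country_correlation AS c
          ON c.platform = u.platform AND c.platform_name = u.country_name
        GROUP BY c.iso_code
        ORDER BY c.iso_code
    """)
    return dict(rows.fetchall())


def unmatched_names(conn):
    """List platform names that still lack a correlation entry."""
    rows = conn.execute("""
        SELECT DISTINCT u.platform, u.country_name
        FROM platform_usage AS u
        LEFT JOIN country_correlation AS c
          ON c.platform = u.platform AND c.platform_name = u.country_name
        WHERE c.iso_code IS NULL
    """)
    return rows.fetchall()
```

The second query matters as much as the first: names without a correlation entry drop out of an inner join silently, so listing them is one way to see which rows still need manual review.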

By correlating the idiosyncratic naming practices of the different data sources to a single standard, we are then able to visualize them in a similar fashion:

Please note here that the radical inequality in terms of access between countries illustrated by these maps may have two main causes. First of all, it may be that Project MUSE and JSTOR track country data only through institutional access to their platforms, thus not counting access to our open access books by users not logged in via institutional networks. Second, it may simply be that the open access collections of neither platform reach enough users in South America, Africa, Eastern Europe, the Middle East, and Central and South-East Asia. It would be interesting to compare these data with those of OAPEN, which solely hosts open access books. In any case, what these data show rather painfully is that open access itself is not enough to ensure access by readers worldwide.

Per Institution

Both Project MUSE and JSTOR provide usage statistics for publications from our catalog by institution. These institution-level data, too, are presumably gathered by tracking interactions of users who are logged in through an institutional account, such as students accessing JSTOR through their university library network.

As with country-level usage data, neither platform has adopted an open, standardized set of institutional identifiers such as GRID, ROR, or Crossref’s Funder Registry. For our correlation table in Metabase, we decided to use GRID as standard, because their database is openly accessible and appears to be the largest, containing nearly 100,000 institutions.

As a list of institutions, despite its finitude, is considerably larger than a list of countries, the work in correlating the data from Project MUSE and JSTOR with GRID identifiers is considerably more cumbersome and currently still in progress. The lack of standardization leads to the following types of correlation inconsistencies:

GRID ID	GRID name	Platform 1	Platform 2
grid.253553.7	California State University, Bakersfield	Calif State Univ @ Bakersfield	California State University, Bakersfield
grid.253554.0	California State University, Channel Islands	Cal State Univ @ Channel Islands	California State University, Channel Islands
grid.253555.1	California State University, Chico	California State Univ @ Chico	California State University, Chico
grid.253556.2	California State University, Dominguez Hills	Cal St Univ @ Dominguez Hills	California State University, Dominguez Hills

Here we see that Platform 2 follows the GRID name, while Platform 1 has an internally inconsistent nomenclature. With institutions not based in an Anglophone country, language and transliteration become an additional issue:

GRID ID	GRID name	Platform 1	Platform 2
grid.10548.38	Stockholm University	Stockholms universitet	Stockholm University
grid.441965.b	University of Saint Francis Xavier	Universidad Mayor y Pontificia de San Francisco Xavier de Chuquisaca	N/A
grid.6935.9	Middle East Technical University	Orta Dogu Teknik University Library	ORTA DOĞU TEKNİK ÜNİVERSİTESİ: Middle East Technical University

For grid.10548.38, Platform 2 conforms to the GRID name in English, whereas Platform 1 prefers to use the Swedish name. In the case of grid.441965.b, it was possible to figure out that this referred to the University of Saint Francis Xavier in Sucre, Bolivia, and not the Saint Francis Xavier University in Antigonish, Canada, because GRID contains information on the location of each institution. As Platform 2 was unable to generate a master list of all the institutions in their system (despite my repeated attempts to explain the issue), we don’t know whether students at grid.441965.b never accessed our books through Platform 2, or simply don’t have access to Platform 2. In the case of grid.6935.9, GRID uses the English name, Platform 1 a hybrid Turkish–English name without diacritics, while Platform 2 uses both languages, with the Turkish name (with correct diacritics) in all caps. This is just a tiny sample of the issues caused by a lack of standardization among platforms and the obstacles for publishers to release reliable and usable usage statistics.
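A small Python sketch may make this correlation work more concrete. It is an illustration under stated assumptions, not our actual workflow: the helper names normalize_name and correlate are hypothetical, and the lookup entries are copied from the tables above. Normalizing case, whitespace, and diacritics catches only trivial variants; abbreviations, translations, and ambiguous names still require a manually curated entry.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Strip diacritics, fold case, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.casefold().split())


# Manually curated correlation entries, copied from the tables above.
GRID_CORRELATION = {
    normalize_name(name): grid_id
    for name, grid_id in [
        ("Stockholm University", "grid.10548.38"),
        ("Stockholms universitet", "grid.10548.38"),
        ("California State University, Bakersfield", "grid.253553.7"),
        ("Calif State Univ @ Bakersfield", "grid.253553.7"),
        ("Middle East Technical University", "grid.6935.9"),
        ("Orta Dogu Teknik University Library", "grid.6935.9"),
        (
            "ORTA DOĞU TEKNİK ÜNİVERSİTESİ: Middle East Technical University",
            "grid.6935.9",
        ),
    ]
}


def correlate(names):
    """Split platform names into matched GRID IDs and names needing review."""
    matched, unmatched = {}, []
    for name in names:
        grid_id = GRID_CORRELATION.get(normalize_name(name))
        if grid_id is None:
            unmatched.append(name)
        else:
            matched[name] = grid_id
    return matched, unmatched
```

Names returned in the unmatched list are the ones that need a human decision, for instance checking an institution's location in GRID before assigning an identifier.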

For the sake of the sanity of other (open-access) publishers who may want to meaningfully compare usage data from different platforms and distributors, it can only be hoped that standardization will become the norm. Until then, punctum books is happy to share its correlation tables so others will not have to repeat our work.