Providing open and trustworthy usage data on open access (OA) publications remains a challenge, even as the scholarly publication market continues its general shift toward forms of publishing that keep publicly funded knowledge in the public domain, though the economic models behind these forms often remain questionable.
At punctum books, we believe that every aspect of the scholarly book production chain should be open — built on open infrastructures, following open standards, made available in open formats, and, most importantly, co-governed by a community of authors, institutions, funders, and publishers. The existence of any proprietary link in this chain constitutes a potential point of profit extraction and is as such vulnerable to corporate capture. This has become very clear from the way in which giant commercial publishing-turned-data-analytics companies embrace OA only to move their shareholder profit extraction from the public purse further upstream to research environments and datasets. Downstream usage data provide another such potential point of capture and monetization.
There are currently two large-scale projects involved in developing platforms for providing harmonized usage data for open access books.
In 2021, the stakeholders in the OA Book Usage Data Trust, funded by the Mellon Foundation, developed their initial governance principles and are now working toward the launch of a technical pilot for an “international data space for open access book usage data between 2022 and 2025.” While the results of this technical pilot are not yet public, the organization mentions in its prospectus that “there is not currently a trusted book usage data exchange infrastructure neutrally operated by a consortium of book-publishing stakeholders to facilitate the aggregation and benchmarking of both open and controlled usage data for specified uses by trusted partners” (my emphasis). Considering the commercial partners in this project and the contradiction between the professed openness of the data trust and the gatekeeping signal words “specified” and “trusted,” it appears unlikely that the data trust itself will eventually be truly, fully open, even if built on open infrastructures.
Another pilot is the Book Analytics Dashboard (BAD) Project, also running from 2022 to 2025, coordinated by the Curtin Open Knowledge Initiative (COKI) and again funded by the Mellon Foundation. BAD has already launched an initial proof of concept dashboard showing data from the University of Michigan Press (which is also involved in the OA Book Usage Data Trust project). Importantly, these data are indeed open to anyone and directly downloadable from the dashboard under an open license. COKI’s initial findings from focus groups run in the first quarter of 2023 are also publicly available. This in itself is a hopeful sign that the project is moving in a fully open direction.
But what jumps out to me from these findings is the consistent negative attitude of those participating in the focus groups toward author access to open usage data:
NO - Concern–this is great and useful for publisher, I’ll have to do explaining to authors, this is what it means don’t panic.
NO - Extra output for non-publisher users. In what way can I convey to admin and authors and in what form? They cannot access these kinds of dashboards I presume.
NO - authors have diff expectations, sometimes not very realistic. Therefore, I think it would cause problems to give full access to authors, compare their book with another book with completely other circumstances for example. So having one sight (view) for one book, or we will restrict to just ourselves and then we’ll communicate numbers to authors with explanation.
NO - we would massage before handing to authors, this would just be for us.[1]
Again, making book usage data open means open to everyone, including authors. Yes, this means that we as publishers have to provide contextual information. Yes, this means that we as publishers will have to explain decisions around marketing and publicity and the intricacies of book distribution. But all of this is part of our care for the author.
Before both the OA Book Usage Data Trust and COKI BAD projects properly commenced, punctum launched its own open usage data system in 2020[2], based on the open infrastructures and standards then available. This system has expanded over the years as more data sources have been added, most recently usage data from Google Books and the Internet Archive[3], and event data from CrossRef. Below I give a brief overview of the open ebook usage data that we currently provide on our website under an open license, and of the platforms and standards we use, to indicate the challenges faced by the OA Book Usage Data Trust, COKI BAD, and indeed any initiative serious about providing open ebook usage data at an industry-wide scale.
Infrastructures
All infrastructures we use to organize and visualize the usage data of our open access ebooks are open source. While Metabase, which we use for visualization, still has some issues with regard to combining composite tables and labeling download files, it serves most of our purposes well. We hope these pending issues will be resolved in future releases.
Data Sources
WordPress plugins (legacy data)
OAPEN (COUNTER 5 compliant)
JSTOR (book and chapter data combined)
Project MUSE
Internet Archive
Google Books
CrossRef Events
Except for the WordPress plugin data from the early days of punctum, all platforms to which we supply our ebooks deliver monthly usage data. Only one of these, OAPEN, provides data that are strictly COUNTER 5 compliant. A further complexity is posed by the JSTOR data, which combine book and chapter usage data. CrossRef events are keyed to DOIs, which we assign both to books and to chapters in edited collections; all other sources only provide data on interactions with whole books.
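This heterogeneity means each source needs its own mapping onto a shared schema before the data can be combined. A minimal sketch in Python, with invented field names standing in for the platforms' actual export columns:

```python
def normalize(platform: str, raw: dict) -> dict:
    """Map one platform-specific usage row onto a shared schema.

    The raw field names below are hypothetical stand-ins, not the
    platforms' real export columns.
    """
    if platform == "OAPEN":
        # COUNTER 5-style report, keyed by DOI
        return {"doi": raw["DOI"], "month": raw["Month"],
                "downloads": int(raw["Total_Item_Requests"])}
    if platform == "JSTOR":
        # book and chapter rows arrive combined, so keep the level explicit
        return {"doi": raw["jstor_doi"], "month": raw["period"],
                "downloads": int(raw["requests"]),
                "level": "chapter" if raw.get("chapter") else "book"}
    raise ValueError(f"no mapping defined for {platform}")
```

Every additional platform requires another hand-written mapping of this kind, which is precisely the maintenance burden a shared standard would remove.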
Identifiers
There is currently anything but uniform agreement on which (open) persistent identifiers to use. This poses, in my opinion, the most substantial obstacle to a harmonized, open ebook usage data platform, alongside broader challenges concerning costs and access within the scholarly communications landscape.[4]
Works
PBN (WordPress plugins)
ISBN
Thoth Work ID
DOI
For referencing unique books (and chapters), publisher-assigned DOIs would be the most practical, but only OAPEN and CrossRef currently provide data consistently identified by DOI. The downside of using DOIs is that not all works, globally or historically, have been assigned one. Moreover, assigning a DOI is not free, which sets up an additional barrier to open publishing. There is also no global consensus on DOI assignment; as a case in point, JSTOR only provides data with the DOIs it has assigned itself. Internet Archive data are identified only, and minimally, by the URL slug, which in our case is identical to the identification code assigned to works in Thoth (as the Internet Archive automatically ingests our book metadata from Thoth on a regular basis[5]). Our legacy WordPress plugin data were manually enriched with punctum books ID numbers (PBNs), which we use internally to identify books and which form part of our DOIs. All other platforms use ISBNs to identify individual books in the usage data they provide. The obvious problem with ISBNs is that they ignore chapters (and are assigned by national monopolies, sometimes against very high fees).
In our system, these various work identifiers are manually correlated in a table in phpMyAdmin to make sure all data are interoperable on this level.
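That crosswalk can be pictured as a single table keyed by our internal PBN, with one column per external identifier. The sketch below uses SQLite for illustration (the actual table lives in MySQL, managed through phpMyAdmin), and all identifiers shown are invented:

```python
import sqlite3

# Illustrative identifier crosswalk; column names and values are
# assumptions, not punctum's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE work_ids (
        pbn      TEXT PRIMARY KEY,  -- internal punctum books ID
        isbn     TEXT,              -- ISBN used by most platforms
        doi      TEXT,              -- publisher-assigned DOI
        thoth_id TEXT               -- Thoth work ID / IA URL slug
    )
""")
conn.execute("INSERT INTO work_ids VALUES (?, ?, ?, ?)",
             ("0123", "9781000000001", "10.xxxx/example", "example-book"))

# Usage rows keyed by any one identifier can then be joined back to the
# canonical internal ID:
row = conn.execute(
    "SELECT pbn FROM work_ids WHERE isbn = ?", ("9781000000001",)
).fetchone()
```

The manual step is filling this table; once it exists, every per-platform report can be resolved to the same work regardless of which identifier the platform happens to use.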
Contributors
ORCID
Idiosyncratic formats
For contributors, there is still no globally established usage of ORCID persistent identifiers, which makes it impossible to collate usage data at the contributor level at this point. Contributors may change names during their careers (marriage, divorce, transition) or add and drop initials, and different platforms follow different conventions for including author names in their usage data; a persistent identifier standard, implemented globally, would mitigate these ambiguities. It appears, however, that even ORCID has been unable to fully avoid such “anomalies” and “misapplications,”[6] while also raising ethical concerns.[7] And even if contributors were to become uniquely identifiable and correlatable through a persistent identifier, there is no globally established and accepted open standard for contributor roles (author, editor, illustrator, foreword, translator, etc.). Without such standards, making ebook usage data interoperable on the contributor level will remain problematic.
While we advocate the use of ORCID among our contributors, we do not make it compulsory. It should be the purview of each contributor to weigh the advantages of discoverability against the invasion of their privacy.
Institutions
GRID (superseded by ROR)
Proprietary lists
Currently, only JSTOR and Project MUSE provide us with institution-level usage data, but neither platform uses the open ROR standard, which has now, tentatively, established itself as the global standard for institutional persistent identifiers.[8]
Internally, we use ROR's predecessor GRID, and we have manually correlated JSTOR and Project MUSE institutional designations with GRID numbers. As neither the JSTOR nor the MUSE tables are publicly available, our institution-level data cannot currently be trusted: our institution table has not been updated since 2020.
Countries
Only JSTOR, Project MUSE, and the Internet Archive provide country-level usage data. Of these, only the Internet Archive uses the ISO 3166-1 alpha-2 standard. Considering the limited size of the set of countries and its relative invariability over time, we have correlated JSTOR and MUSE country designations with the ISO standard in a table in phpMyAdmin to ensure the data are interoperable.
Dates
None of the usage data sources agree on a date format, again despite a well-established standard, ISO 8601. Dates are manually added to the data received every month as an integral part of the data prepping before ingest into phpMyAdmin.
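As an illustration, a small normalizer that tries a few month formats of the kind one might receive (the format list is hypothetical) and emits an ISO 8601 "YYYY-MM" string:

```python
from datetime import datetime

# Hypothetical examples of per-platform month formats:
# "Jan-23", "January 2023", "2023-01", "01/2023"
CANDIDATE_FORMATS = ["%b-%y", "%B %Y", "%Y-%m", "%m/%Y"]

def normalize_month(raw: str) -> str:
    """Parse an assorted month string into ISO 8601 'YYYY-MM'."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m")
        except ValueError:
            continue
    raise ValueError(f"unrecognized month format: {raw!r}")
```

Raising on unrecognized input, rather than guessing, keeps silent date corruption out of the ingested tables.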
Subjects
None of the platforms we use for the dissemination of our publications currently provides us with usage data based on subject. Though some platforms (such as OAPEN and JSTOR) require a BIC subject code, none has yet switched to Thema, the open standard that has superseded BIC.
We add BIC, BISAC, and Thema subject codes, as well as keywords, to our book metadata. In principle, it would be possible to add subject data to our book table in phpMyAdmin, which would produce usage data categorizable by subject code.
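Once subject codes sit in the book table, per-subject usage would be a simple join-and-sum. A sketch with invented ISBNs and Thema-style codes:

```python
# Hypothetical book table: ISBN -> Thema-style subject code
book_subjects = {"9781000000001": "DSK", "9781000000002": "JBCC1"}

# Hypothetical usage rows: (ISBN, downloads)
usage = [("9781000000001", 120), ("9781000000002", 80),
         ("9781000000001", 40)]

# Aggregate downloads per subject code
by_subject: dict[str, int] = {}
for isbn, downloads in usage:
    code = book_subjects.get(isbn, "unknown")
    by_subject[code] = by_subject.get(code, 0) + downloads
```

The hard part is not the aggregation but agreeing, across platforms, on which subject vocabulary the book table should carry.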
Conclusion
As a result of the multiple and varied obstacles sketched above, punctum is currently able to provide only monthly download data per title per platform, with country data provided separately for each platform that supplies them. CrossRef Events are provided separately.
This is comparable to what the COKI Book Analytics Dashboard (BAD) currently provides: access by country (based on availability), monthly access from a variety of platforms, and institutional access (based on availability). The Dashboard also provides breakdowns by author and subject, but these will be impossible to scale up across multiple publishers and publisher platforms without achieving universal adoption of something like ORCID and Thema subject codes.
This also suggests that one important stakeholder remains unrepresented, or at least underrepresented, at both the OA Book Usage Data Trust and BAD, namely the contributors: the authors, editors, translators, and so on who actually provide the content for ebooks. First of all, as universal ORCID coverage can currently only be driven by voluntary author adoption, they are essential stakeholders in the development of any open book usage data platform with broad interoperability and the data granularity currently featured in the BAD Dashboard. This adoption of course can be stimulated by offering the contributors something in return: access to the usage data of their books. But from the currently available documentation of both the Data Trust and BAD projects, this appears to be an obstacle for at least some of the publishers involved, reluctant as they may be to explain their business model to their authors.
Once the authors of open books become involved, questions concerning the accessibility of such a platform will also come into more distinct relief: why would the book be open to them but not its usage data? This fundamental question remains unaddressed by both projects, and that silence is ethically suspect. Without authors there is no content; authors must therefore be considered primary stakeholders in any open usage data project we construct, and any attempt to withhold, mask, or massage those data on their behalf smacks of paternalism.
The ideal of fully open, interoperable open access book usage data can only be achieved by the collective adoption of open and community-governed standards for the multiple persistent identifiers attached to a book: DOI, ORCID, ROR, and Thema, as well as the relevant ISO standards. Only these can be integrated into code to facilitate automated ingest, error checks, export, and dissemination, and only making this code open source will allow for continued community-driven improvement, the potential avoidance of corporate capture, and continued availability to presses that are not part of a group of “trusted partners.”