Reproducible research | Daniel Antal

Building Public-Private Data Partnerships

Tue, 20 Feb 2024 15:48:00 +0000

While open data infrastructures are growing fast, we must bear in mind that private infrastructure is growing even faster. Last weekend, Europeana had 12,416 sound recordings that you could use without restrictions, and altogether 206,280 music sound recordings were made available over ten years. The same amount is made available on Spotify every two days, with excellent intelligent recommenders and free listening options. If we can find ways to connect private and public data infrastructures, the benefits are enormous for all cycles of data curation: in some cases we can save a great deal of cost, and in other cases we can enrich the data and build fantastic new public applications, as we plan to do in music libraries.
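The difference in scale can be made concrete with a little arithmetic. The sketch below is only an illustration built from the two figures quoted above; the length of "ten years" in days is an assumption, and the variable names are ours.

```python
# Figures quoted in the text above.
EUROPEANA_RECORDINGS = 206_280  # music sound recordings made available over ten years
SPOTIFY_DAYS = 2                # the same amount appears on Spotify every two days

# Assumption: ten calendar years, with leap years averaged in.
EUROPEANA_DAYS = 10 * 365.25


def daily_rate(recordings: float, days: float) -> float:
    """Average number of recordings made available per day."""
    return recordings / days


europeana_per_day = daily_rate(EUROPEANA_RECORDINGS, EUROPEANA_DAYS)
spotify_per_day = daily_rate(EUROPEANA_RECORDINGS, SPOTIFY_DAYS)
ratio = spotify_per_day / europeana_per_day

print(f"Europeana: about {europeana_per_day:.1f} recordings per day")
print(f"Spotify:   about {spotify_per_day:.0f} recordings per day")
print(f"Ratio:     about {ratio:.0f} times faster")
```

Under these assumptions the private platform adds music at a rate three orders of magnitude higher than the public one, which is the gap that a public-private partnership would try to bridge.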

To put the scale of the advantages of public-private partnerships (PPP) in perspective: if we placed the music we handle in Slovakia alone on Europeana in our project, we would double Europeana’s music collection, even though the Slovak Republic accounts for only 1% of the European Union’s population and music creation.

Comparing Data to Oil is a Cliché: Crude Oil Has to Go Through a Number of Steps and Pipes Before it Becomes Useful

Mon, 07 Jun 2021 10:00:00 +0000

As a developer at rOpenGov, and as an economic sociologist, what type of data do you usually use in your work?

Generally speaking, what interests me is people’s access to (or inequalities in accessing) different types of resources and their ability to transform these resources into other types of resources. The data I usually work with is the kind of data that is nicely covered by existing rOpenGov tools: data about population demographics and administrative units from Statistics Finland, statistical information on welfare and health from Sotkanet, and also data from Eurostat. Aside from these, a lot of the information of course comes from surveys and from texts scraped from the internet.

We are placing the growing number of rOpenGov tools in a modern application with a user-friendly service and a modern data API.

In your ideal data world, what would be the ultimate dataset, or datasets that you would like to see in the Music Data Observatory?

Late spring and early summer is, at least for me, defined by the Eurovision Song Contest. Every year, watching the contest makes me ponder the state of the music industry in my home country, Finland, as well as in Europe. Was the song produced by homegrown talent or was it imported? Was it better received by the professional jury or the public? How well does the domestic appeal of an artist translate to the international stage? Many interesting phenomena are difficult to quantify in a meaningful way, and writing a catchy song with international appeal is probably more an art than a science. Nevertheless, that should not deter us from trying, as music, too, is bound by certain rules and regularities that can be researched.

Music, too, is bound by certain rules and regularities that can be researched. Our Digital Music Observatory and its experimental Listen Local app do exactly this, and we would love to create Eurovision musicology datasets. Photo: Eurovision Song Contest 2021 press photo by Jordy Brada.

Why did you decide to join the EU Datathon challenge team and why do you think that this would be a game changer for researchers and policymakers?

The challenge has, in my opinion, great potential to lead by example when it comes to open data access and reproducible research. Comparing data to oil is a common phrase, but it is fitting in the sense that crude oil has to go through a number of steps and pipes before it becomes useful. Most users, and especially policymakers, appreciate the ease of use of the finished product, but the quality of the product and the process must also be guaranteed somehow. Openness and peer-review practices are the best guarantors in the field of data, just as industrial standards and regulations are in the oil industry.

We provide many layers of fully transparent quality control for the data that we place in our data APIs and provide to our end-users.

Join us

Join our open collaboration Economy Data Observatory team as a data curator, developer or business developer. More interested in environmental impact analysis? Try our Green Deal Data Observatory team! Or does your interest lie more in data governance, trustworthy AI and other digital market problems? Check out our Digital Music Observatory team!

Creating Algorithmic Tools to Interpret and Communicate Open Data Efficiently

Fri, 04 Jun 2021 10:00:00 +0000

As a developer at rOpenGov, what type of data do you usually use in your work?

As an academic data scientist whose research focuses on the development of general-purpose algorithmic methods, I work with a range of applications from life sciences to humanities. Population studies play a big role in our research, and often the information that we can draw from public sources - geospatial, demographic, environmental - provides invaluable support. We typically use open data in combination with sensitive research data but some of the research questions can be readily addressed based on open data from statistical authorities such as Statistics Finland or Eurostat.

In your ideal data world, what would be the ultimate dataset, or datasets that you would like to see in the Music Data Observatory?

One line of our research analyses the historical trends and spread of knowledge production, in particular book printing based on large-scale metadata collections. It would be interesting to extend this research to music, to understand the contemporary trends as well as the broader historical developments. Gaining access to a large systematic collection of music and composition data from different countries across long periods of time would make this possible.

Why did you decide to join the challenge and why do you think that this would be a game changer for researchers and policymakers?

Joining the challenge was a natural development based on our overall activities in this area; the rOpenGov project has been around for a decade now, since the early days of the broader open data movement. This has also created an active international developer network and we felt well equipped for picking up the challenge. The game changer for researchers is that the project highlights the importance of data quality, even when dealing with official statistics, and provides new methods to solve these issues efficiently through the open collaboration model. For policymakers, this provides access to new high-quality curated data and case studies that can support evidence-based decision-making.

Do you have a favorite, or most used open governmental or open science data source? What do you think about it? Could it be improved?

Regarding open government data, one of my favorites is not a single data source but a data representation standard. The px format is widely used by statistical authorities in various countries, and this has allowed us to create R tools that allow the retrieval and analysis of official statistics from many countries across Europe, spanning dozens of statistical institutions. Standardization of open data formats allows us to build robust algorithmic tools for downstream data analysis and visualization. Open government data is still too often shared in obscure, non-standard or closed-source file formats and this is creating significant bottlenecks for the development of scalable and interoperable AI and machine learning methods that can harness the full potential of open data.
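To illustrate why a standardized representation makes tooling easier, here is a minimal sketch of reading a simplified, PX-like text. It is not the rOpenGov R tooling mentioned above, and it is not a complete parser for the real format: the sample content, the function names `parse_px` and `to_rows`, and the assumption that quoted values contain no semicolons are all ours.

```python
import re

# Hypothetical, heavily simplified sample in the spirit of the px format:
# semicolon-terminated KEYWORD=value statements followed by a DATA block.
SAMPLE = '''CHARSET="ANSI";
TITLE="Population by region and year";
STUB="region";
HEADING="year";
VALUES("region")="North","South";
VALUES("year")="2019","2020";
DATA=
100 110
200 210;
'''


def parse_px(text):
    """Split the text into a metadata dict and a flat list of data values.

    Assumption: no semicolons appear inside quoted values.
    """
    meta, data = {}, []
    for statement in text.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        key, _, value = statement.partition("=")
        key = key.strip()
        if key == "DATA":
            data = [float(x) for x in value.split()]
        else:
            # Every quoted string on the right-hand side becomes one list item.
            meta[key] = re.findall(r'"([^"]*)"', value)
    return meta, data


def to_rows(meta, data, stub, heading):
    """Arrange the flat data into {row label: {column label: value}}."""
    row_labels = meta[f'VALUES("{stub}")']
    col_labels = meta[f'VALUES("{heading}")']
    n = len(col_labels)
    return {
        row: dict(zip(col_labels, data[i * n:(i + 1) * n]))
        for i, row in enumerate(row_labels)
    }


meta, data = parse_px(SAMPLE)
table = to_rows(meta, data, meta["STUB"][0], meta["HEADING"][0])
print(table)
```

Because every publisher that follows the same convention can be read by the same few lines, one tool can serve many statistical institutions; with ad hoc formats, each source needs its own parser.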

Regarding open government data, one of my favorites is not a single data source but a data representation standard, the Px format.

From your perspective, what do you see being the greatest problem with open data in 2021?

Although there are a variety of open data sources available (and the numbers continue to increase), the availability of open algorithmic tools to interpret and communicate open data efficiently is lagging behind. One of the greatest challenges for open data in 2021 is to demonstrate how we can maximize the potential of open data by designing smart tools for open data analytics.

What can our automated data observatories do to make open data more credible in the European economic policy community and be accepted as verified information?

The role of the professional network backing up the project, and the possibility of getting critical feedback and later adoption by the academic communities will support the efforts. Transparency of the data harmonization operations is the key to credibility, and will be further supported by concrete benchmarks that highlight the critical differences in drawing conclusions based on original sources versus the harmonized high-quality data sets.

We need to get critical feedback and later adoption by the academic communities.

How can we ensure the long-term sustainability of the efforts?

The extent of the open data space is such that no single individual or institution can address all the emerging needs in this area. Open developer networks play a huge role in the development of algorithmic methods, and strong communities have developed around specific open data analytical environments such as R, Python, and Julia. These communities support networked collaboration and provide services such as software peer review. Long-term sustainability will depend on the support that such developer communities can receive, both from individual contributors and from institutions and governments.

Reproducible research in practice: empirical study on the structural conditions of book piracy in global and European academia

Sat, 05 Dec 2020 08:10:00 +0200

PLOS One is the fourth most influential multidisciplinary journal after Nature, Science, and Proceedings of the National Academy of Sciences of the United States of America (based on H-index). On December 3, 2020 it published a paper co-authored by Dr. Balazs Bodo, associate professor at the Institute for Information Law (IViR); Daniel Antal (Reprex, Demo Music Observatory), a data scientist interested in reproducible research, as an independent researcher; and Zoltan Puha, a Data Science PhD at Tilburg University, JADS. PLOS (Public Library of Science) is a nonprofit Open Access publisher, empowering researchers to accelerate progress in science and medicine by leading a transformation in research communication.

The article uses our reproducible datasets created with our regions package, and builds on many years of expertise in empirical research in the field of music and audiovisual piracy, home copying and private copying compensation (see, for example, Private Copying in Croatia). Our aim is to provide reliable, high-quality indicators for the creative industries not only at the national level, but also at the provincial, state, regional and metropolitan area levels, because these levels are often more relevant for creators, performers and policy-makers.

The topic of the paper is Library Genesis (LG), the biggest piratical scholarly library on the internet, which provides copyright-infringing access to more than 2.5 million scientific monographs, edited volumes, and textbooks. The paper uses advanced statistical methods to explain why researchers around the globe use copyright-infringing knowledge resources. The analysis is based on a huge usage dataset from LG, as well as data from the World Bank, Eurostat, and Eurobarometer, to identify the role of macroeconomic factors, such as R&D and higher education spending, GDP, and researcher density, in scholarly copyright-infringing activities.

We created a global and a far more detailed European model for pirate book downloads.

The main finding of the paper is that open access, even if it is radical, is not a panacea. The hypothesis of the research was that researchers in low-income regions use piratical open knowledge resources relatively more, to compensate for the limitations of their legal access infrastructures. The authors found evidence to the contrary. Researchers in high-income countries and European regions with access to high-quality knowledge infrastructures and high levels of funding use radical open access resources more intensively than researchers in lower-income countries and regions with less well-resourced libraries. This means that while open knowledge is an important resource for closing the knowledge gap between centre and periphery, equality in access does not translate into equality in use. Structural knowledge inequalities are both present and being reproduced in the context of open access resources.

The paper is unique not just because of the data it is based on. It also sets new standards in interdisciplinary legal research by publishing the paper, the data and the software code at the same time in open access repositories, following reproducible research best practices. These are the practices that we want to promote in our Demo Music Observatory (later renamed Digital Music Observatory) and in further data observatories that serve business, evidence-based policy and scientific research.

Feasibility Study For The Establishment Of A European Music Observatory & The Demo Observatory

Mon, 16 Nov 2020 07:03:00 +0200

The Feasibility study for the establishment of a European Music Observatory was published on 13 November. Our private observatory, CEEMID, was consulted in the creation of the Feasibility Study, and some of our recommendations found their way into the consultant’s document. We created a Demo Music Observatory to provide practical guidance on the decisions facing the European stakeholders, and to answer the questions that were left open in the Feasibility Study, particularly on data integration and the institutional model, where a wrong choice can lead to very long delivery times and to problems with quality control and budgeting.

We have been developing our Demo Music Observatory in the world’s second-ranked university-backed incubator program, the Yes!Delft AI Validation Lab, since 15 September 2020. Our aim is to show a better organizational model, along with examples of research automation and other data integration innovations, that can reduce the budgetary needs of the European Music Observatory by 80-90% and provide a far more timely, accurate, and relevant service than most data observatories in Europe.

CEEMID has been creating a data observatory similar to the foreseen European Music Observatory, based solely on the contributions of about 60 European stakeholders. As the Feasibility Study suggests, we would be happy to transfer much of CEEMID’s content to the European Music Observatory, which could potentially fill about 50-70% of the envisioned observatory. We are building our Demo Music Observatory on the 2000 pan-European indicators collected by CEEMID since 2014.

Challenge Our Demo Observatory: Check out the Music Diversity & Circulation Pillar of our Demo Music Observatory. If you do not find what you are looking for, contact us, and we will try to put the data there from our repositories.

Illusory data gap: data on active and music participation is available at EU level both for gender groups and for ethnic minorities. This is regularly featured in various European CAP surveys and in our national CAP surveys, too.

The Feasibility Study is based on perceived gaps between the data needs of the European stakeholders and data availability. Earlier this year we showed the European stakeholders that many of these data gaps are illusory. We would like to give about 50 indicators, with full documentation and automated weekly, monthly, quarterly, or annual refreshes, for free to all music industry users. We would like to challenge the stakeholders to formulate data requests to us and to think together about how the European music industry could build a better observatory faster and at lower cost.

Challenge Our Demo Observatory: Check out the Music Economy Pillar of our Demo Music Observatory. If you do not find what you are looking for, contact us, and we will try to put the data there from our repositories.

The Feasibility Study concludes that a “European Music Observatory would require a very significant allocation of funds, beyond what could be currently expected from the possible budget of the future Creative Europe programme”. While the Feasibility Study does not provide cost options or any cost-benefit analysis, we are certain that this is an exaggeration. Most European data observatories operate with an annual subsidy of 20,000-200,000 euros. We want to show with our Demo Music Observatory what can be achieved with an annual budget of 20,000 euros, 50,000 euros, 100,000 euros or 200,000 euros.

Challenge Our Demo Observatory: Check out the Music, Society and Citizenship Pillar of our Demo Music Observatory. If you do not find what you are looking for, contact us, and we will try to put the data there from our repositories.

Product/Market Fit Validation in Yes!Delft

Fri, 25 Sep 2020 15:31:39 +0000

We would like to validate our product/market fit in two segments, business/policy research and scientific research, with a supporting role given to data journalism. Because we want to follow a bootstrapping strategy, we must focus on those clients where we find the highest value proposition, which is of course easier said than done. We see much interest in our offering from other continents, so we truly welcome the opportunity to do this on a truly global business canvas in one of the world’s top five incubators, the number 2 university-backed incubator in the world and second to none in Europe: the Yes!Delft AI+Blockchain Validation Lab.

In Europe, hundreds of thousands of microenterprises, such as record labels, video producers or book publishers, are facing data and AI giants like Google’s YouTube, Apple Music, Spotify, Netflix or Amazon. If the recommendation engines of these giants do not recommend their songs, films or books, then their investments are doomed to fail, because about half of global sales are driven by AI algorithms. When they make a claim for the missing money, they immediately find themselves in a dispute involving gigabytes of data that they can only handle with a data scientist, even though they do not even have an IT professional or an HR professional to make the hire.

An awful lot of money, creativity and real value is at stake, and we want to be on the creators’ side, their technicians’ side and their managers’ side when they want to get a fair share of the pie and want to help these industry leaders make the pie grow.

UNESCO and the EU have been promoting so-called data observatories as an organizational solution to the fragmentation problem. These observatories pool the business, policy, and scientific research needs of various domains, like music. This is an idea that we really like, and we believe that our research automation solutions can help these observatories grow faster as ecosystems and create better-quality, more timely data and research products at a far lower cost.

We define ourselves as a reproducible research company inspired by the philosophy of open collaboration, based on open-source software and open data. We want to explore various revenue models around these ideas.

We are not committed to open source licensing if more permissive licensing policies provide us with better opportunities.
We would like to explore various data-as-service models, because we do not want to be locked into the position of cheap open data vendors.
We want to deploy AI applications that really help to earn money in these sectors with playlisting, recommendation engines, forecasting applications, or royalty valuations, because our open collaboration approach brings up enough data sooner than its alternatives, as it manages inherent conflicts of interest, fragmentation, and decentralization better than hierarchical solutions.

Timeline

In January CEEMID reached its peak: in Brussels we introduced a 12-country reproducible research project made with only freelancers, presented as a best use case of evidence-based policy design.
In February Daniel visited the Yes!Delft Co-Lab to find out who would be the best co-founder to re-launch CEEMID as an enterprise.
In April we started to release our data as open data for validation.
One month ago we started up.
Then we launched the music.dataobservatory.eu project.
A few other data observatories.

Bonus:

Palato in The Hague, where we took our selfie and had an absolutely amazing dinner after the pitch. Check them out!