open-science | Daniel Antal

Create Datasets that are Easy to Combine and Reuse

Tue, 22 Nov 2022 09:09:00 +0100

The latest Reprex R package, dataset, was released today on the Comprehensive R Archive Network (CRAN). It is a very early, conceptual package that aims to make scientific achievements more open, make governmental data easier to find, and store information in a form that is easier to combine.

Data interoperability is almost a buzzword, yet we see very few comprehensive, good solutions for applying it. Try to find information on open government portals or in big open science repositories: apart from a few good examples, most datasets are as disorganized as the hard disk of a PC collecting dust in a shed.

The dataset package aims to bring together the best practices of data semantics, data organization, and the use of standard metadata to make sure that whatever you store in a data table, it will be immediately available for data analysis, activation, or combination in any new database.

Ambitious? It is, and dataset 0.1.9 is a very experimental product. While our other packages are aimed at intermediate users with a clear use case in mind, dataset at this point is aimed at package developers. Casual or even heavy R users are unlikely to download it as a standalone product. Instead, dataset aims to be a stable developer basis for our existing products, rOpenGov packages, and many new uses.

Download dataset

The metadata aim of dataset is to add standardized metadata to R data.frames, tibbles, data.tables and other similarly structured, tabular objects. The organizational and semantic objectives are to bring the tidy data concept closer to the datacube model, which is the basis of all statistical data exchanges, and to W3C standards, which foster machine-to-machine data communication on traditional web APIs and the semantic web.

Makes data importing easier and less error-prone;
Leaves plenty of room for documentation automation, resulting in far better reusability and reproducibility;
Makes the publication of results from R following the FAIR principles far easier, making the work of the R user more findable, more accessible, more interoperable and more reusable by other users;
Makes placement into relational databases, semantic web applications, archives, and repositories possible without time-consuming and costly data wrangling (see From dataset To RDF).
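To make the idea behind this list a little more concrete, here is a minimal, language-agnostic sketch in Python. It is not the API of the dataset R package. The class name `DescribedTable`, the `combine` helper, the metadata fields (loosely inspired by common descriptive terms such as title, creator, publisher and identifier) and all example values are assumptions made purely for illustration. The point is only that a table which carries its own description can be checked before it is combined with another one.

```python
from dataclasses import dataclass, field


@dataclass
class DescribedTable:
    """A table (list of row dicts) that carries its own descriptive metadata."""
    rows: list
    title: str
    creator: str
    publisher: str = ""
    identifier: str = ""  # e.g. a DOI; left empty when unknown
    units: dict = field(default_factory=dict)  # column name -> unit of measure

    def describe(self) -> dict:
        """Return the metadata as a plain dict that can travel with the data."""
        return {
            "title": self.title,
            "creator": self.creator,
            "publisher": self.publisher,
            "identifier": self.identifier,
            "columns": sorted({key for row in self.rows for key in row}),
            "units": dict(self.units),
        }


def combine(a: DescribedTable, b: DescribedTable, title: str) -> DescribedTable:
    """Stack two tables, but only if their columns and units agree."""
    if a.describe()["columns"] != b.describe()["columns"] or a.units != b.units:
        raise ValueError("tables are not compatible: columns or units differ")
    return DescribedTable(
        rows=a.rows + b.rows,
        title=title,
        creator="; ".join(sorted({a.creator, b.creator})),
        units=dict(a.units),
    )


# Hypothetical example data, invented for this sketch.
uk = DescribedTable(
    rows=[{"country": "UK", "streams": 120}],
    title="Streams, UK (hypothetical)",
    creator="Curator A",
    units={"streams": "million plays"},
)
nl = DescribedTable(
    rows=[{"country": "NL", "streams": 30}, {"country": "BE", "streams": 20}],
    title="Streams, Benelux (hypothetical)",
    creator="Curator B",
    units={"streams": "million plays"},
)

combined = combine(uk, nl, title="Streams, combined (hypothetical)")
print(combined.describe())
```

If the second table recorded streams in a different unit, `combine` would refuse to stack the two instead of silently mixing them, which is the kind of error that attached metadata is meant to prevent.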

The first official release offers few immediate benefits. However, if you are an R package developer, we can bring you a few steps nearer to releasing your data products in a way that conforms to the FAIR metadata principles. We can take a few steps to streamline your data wrangling, make integration with relational databases easier, and move a step closer to the semantic web.

Research & Analysis: Music Creators’ Earnings in the Digital Era

Thu, 23 Sep 2021 08:00:00 +0000

Reprex, with its Digital Music Observatory team, was commissioned to prepare an analysis of the justified and unjustified differences in music creators’ earnings. We posted our most important findings in an earlier blogpost (Music Creators’ Earnings in the Streaming Era. United Kingdom Research Cooperation With the Digital Music Observatory).

The UK Intellectual Property Office has published the entire report on the music creators’ earnings, and we have made our detailed analysis available in a side-publication. Reprex also signed an agreement with the researchers of the Music Creators’ Earnings project to deposit all data published in the report in the Digital Music Observatory, and to promote the building of the observatory further.

The research questions asked in this report are related to the Music Creators’ Earnings Project (MCE), exploring issues concerning equitable remuneration and earnings distributions. We were tasked with providing a longitudinal analysis of earnings development and relating our findings to equitable remuneration. The starting point of our work was a very broadly defined problem: how much money music creators (rightsholders) earn from streaming, how these earnings are distributed, and how the earnings and their distribution have developed during the last decade.

The highly globalized music industry generates two important international reports, as well as several national reports, but these are not suitable for the analysis of the typical or average rightsholder, nor for small labels and publishers who do not represent a large and internationally diversified portfolio of music works or recordings. Copyright and neighboring rights revenues are collected in national jurisdictions. Because British artists are almost never constrained by their use of language, and the UK music industry is highly competitive in the global music markets, even relatively less known rightsholders earn revenues from dozens of national markets. The lack of market information on music sales volumes and prices for each jurisdiction, together with the unaccounted-for national, domestic, and foreign revenues, makes the analysis of rightsholders’ earnings, or of the economics of a certain distribution channel like music streaming or media platforms, impossible.

The Effect of International Diversification on Revenues - a combination of international price differences and exchange rate fluctuations.

While total earnings are reported by international and national organizations, they hide five important economic variables: changes in sales volumes, changes in prices, market share in various national jurisdictions (which have their own volume and price movements), the exchange rates applied, and the share of the repertoire exploited. Even worse, the global music industry has no comprehensive database of rightsholders, music works, and recordings: this is the data gap that we would like to fill with the Digital Music Observatory.
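A small worked example shows why a single total can be misleading. The sketch below is not taken from the report. It assumes a deliberately simplified model in which revenue in the home currency equals volume × local price × exchange rate, and it covers only three of the five variables listed above (market share and repertoire share are left out). All numbers are hypothetical. With this multiplicative model, the log change in revenue splits exactly into the sum of the log changes of its factors.

```python
import math


def revenue(market: dict) -> float:
    """Home-currency revenue under the simplified model: volume x price x fx."""
    return market["volume"] * market["price"] * market["fx"]


def log_decomposition(before: dict, after: dict) -> dict:
    """Split the log change in revenue into volume, price and exchange-rate parts."""
    parts = {
        key: math.log(after[key] / before[key])
        for key in ("volume", "price", "fx")
    }
    # Because revenue is a product, the three log changes add up to its log change.
    parts["revenue"] = sum(parts.values())
    return parts


# Hypothetical figures for one foreign market, invented for this sketch.
before = {"volume": 1_000_000, "price": 0.0040, "fx": 0.85}
after = {"volume": 1_200_000, "price": 0.0035, "fx": 0.80}

parts = log_decomposition(before, after)
for name, value in parts.items():
    print(f"{name:>8}: {value:+.4f}")
```

In this made-up case the number of streams grows by a fifth, yet home-currency revenue falls slightly (from 3,400 to 3,360), because the per-stream price and the exchange rate both move against the rightsholder. A report that publishes only the total cannot tell these stories apart.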

Our report highlights some important lessons. First, we show that in the era of global music sales platforms it is impossible to understand the economics of music streaming without international data harmonization and advanced surveying and sampling. Paradoxically, without careful adjustments for accruals, market shares in jurisdictions, and disaggregation of price and volume changes, the British industry cannot analyze its own economics because of its high level of integration into the global music economy. Furthermore, the replacement of former public performance, mechanical licensing, and private copying remunerations (which have been available to British rightsholders in their European markets for decades) with less valuable streaming licenses has left many rightsholders poorer. Making adjustments to the distribution system without modifying the definition of equitable remuneration rights or the pro-rata distribution scheme of streaming platforms opens up many conflicts while solving too few fundamental problems. Therefore, we suggest participation in international data harmonization and policy coordination to help regain the historical value of music.
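For readers unfamiliar with the term, a pro-rata scheme in its simplest form splits one royalty pool among rightsholders in proportion to their share of all streams. The sketch below illustrates only that basic rule. It is not the computer simulation prepared for the MCE Project, and the function name, the rightsholder labels and the numbers are hypothetical.

```python
def pro_rata_payouts(royalty_pool: float, streams_by_rightsholder: dict) -> dict:
    """Split a royalty pool in proportion to each rightsholder's share of streams."""
    total_streams = sum(streams_by_rightsholder.values())
    if total_streams == 0:
        # Nothing was played, so nothing is distributed.
        return {name: 0.0 for name in streams_by_rightsholder}
    return {
        name: royalty_pool * streams / total_streams
        for name, streams in streams_by_rightsholder.items()
    }


# Hypothetical pool and stream counts, invented for this sketch.
payouts = pro_rata_payouts(
    1000.0,
    {"large_catalogue": 900_000, "small_label": 90_000, "solo_artist": 10_000},
)
print(payouts)
```

Under this rule, a rightsholder's payout depends on everyone else's stream counts as much as on their own, which is one reason why earnings cannot be analyzed without platform-wide volume data.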

Context

The idea of our Digital Music Observatory was brought to the UK policy debate on music streaming by the written evidence submitted by The state51 Music Group to the Economics of music streaming review of the UK Parliament’s DCMS Committee¹.

The music industry requires a permanent market monitoring facility to win fights in competition tribunals, because it is increasingly disputing revenues with the world’s biggest data owners. This was precisely the role of the former CEEMID² program, which was initiated by a group of collective management societies. Starting with three relatively data-poor countries, where data pooling allowed rightsholders to increase revenues, the CEEMID data collection program was extended in 2019 to 12 countries. The final regional report, which followed the release of the detailed Hungarian, Slovak and Croatian reports of CEEMID, was sponsored by Consolidated Independent (of the state51 Music Group).

CEEMID was eventually formed into the Demo Music Observatory in 2020³, following the planned structure of the European Music Observatory, and validated in the world’s second-ranked university-backed incubator, the Yes!Delft AI+Blockchain Validation Lab. In 2021, under the final name Digital Music Observatory, it became open to any rightsholder or stakeholder organization or music research institute, and it is being launched with the help of the JUMP European Music Market Accelerator Programme, which is co-funded by the Creative Europe Programme of the European Union.

In December 2020, we started investigating how the music observatory concept could be introduced in the UK, and how our data and analytical skills could be used in the Music Creators’ Earnings in the Streaming Era (in short: MCE) project, which is taking place in parallel with the heated political debates around the DCMS inquiry. After the state51 Music Group gave permission for the UK Intellectual Property Office to reuse the data that was originally published as the experimental CEEMID-CI Streaming Volume and Revenue Indexes, we reached a cooperation agreement between the MCE Project and the Digital Music Observatory. We provided a detailed historical analysis and computer simulation for the MCE Project, and we will host all the data of the Music Creators’ Earnings Report in our observatory, hopefully no later than early July 2021.

The Digital Music Observatory contributes to the Music Creators’ Earnings in the Streaming Era project by helping to understand the level of justified and unjustified differences in rightsholder earnings, and by putting them into a broader music economy context.

We started our cooperation with the two principal investigators of the project, Prof David Hesmondhalgh and Dr Hyojung Sun, back in April and will start releasing the findings and the data in July 2021.

Join us

Do you need high-quality data for your music business or institution? Are you a music researcher? Join our open collaboration Digital Music Observatory team as a data curator, developer or business developer.

Footnote References

¹ state51 Music Group. 2020. “Written Evidence Submitted by The state51 Music Group. Economics of Music Streaming Review. Response to Call for Evidence.” UK Parliament website. https://committees.parliament.uk/writtenevidence/15422/html/.
² Artisjus, HDS, SOZA, and Candole Partners. 2014. “Measuring and Reporting Regional Economic Value Added, National Income and Employment by the Music Industry in a Creative Industries Perspective. Memorandum of Understanding to Create a Regional Music Database to Support Professional National Reporting, Economic Valuation and a Regional Music Study.”
³ Antal, Daniel. 2021. “Launching Our Demo Music Observatory.” Data & Lyrics. Reprex. https://dataandlyrics.com/post/2020-09-15-music-observatory-launch/.

Comparing Data to Oil is a Cliché: Crude Oil Has to Go Through a Number of Steps and Pipes Before it Becomes Useful

Mon, 07 Jun 2021 10:00:00 +0000

As a developer at rOpenGov, and as an economic sociologist, what type of data do you usually use in your work?

Generally speaking, what interests me is people’s access to (or inequalities in accessing) different types of resources and their ability to transform these resources into other types of resources. The data I usually work with is the kind of data that is nicely covered by existing rOpenGov tools: data about population demographics and administrative units from Statistics Finland, statistical information on welfare and health from Sotkanet, and also data from Eurostat. Aside from these, a lot of my information of course comes from surveys and from texts scraped from the internet.

We are placing the growing number of rOpenGov tools in a modern application with a user-friendly service and a modern data API.

In your ideal data world, what would be the ultimate dataset, or datasets that you would like to see in the Music Data Observatory?

Late spring and early summer time is, at least for me, defined by the Eurovision Song Contest. Every year watching the contest makes me ponder the state of the music industry in my home country Finland as well as in Europe. Was the song produced by homegrown talent or was it imported? Was it better received by the professional jury or the public? How well does the domestic appeal of an artist translate to the international stage? Many interesting phenomena are difficult to quantify in a meaningful way and writing a catchy song with international appeal is probably more an art than a science. Nevertheless that should not deter us from trying as music, too, is bound by certain rules and regularities that can be researched.

Music, too, is bound by certain rules and regularities that can be researched. Our Digital Music Observatory and its experimental Listen Local app do exactly this, and we would love to create Eurovision musicology datasets. Photo: Eurovision Song Contest 2021 press photo by Jordy Brada

Why did you decide to join the EU Datathon challenge team and why do you think that this would be a game changer for researchers and policymakers?

The challenge has, in my opinion, great potential in leading by example when it comes to open data access and reproducible research. Comparing data to oil is a common phrase but fitting in the sense that crude oil has to go through a number of steps and pipes before it becomes useful. Most users and especially policymakers appreciate ease-of-use of the finished product, but the quality of the product and the process must also be guaranteed somehow. Openness and peer-review practices are the best guarantors in the field of data, just as industrial standards and regulations are in the oil industry.

We provide many layers of fully transparent quality control for the data we place in our data APIs and provide to our end-users.

Join us

Join our open collaboration Economy Data Observatory team as a data curator, developer or business developer. More interested in environmental impact analysis? Try our Green Deal Data Observatory team! Or does your interest lie more in data governance, trustworthy AI and other digital market problems? Check out our Digital Music Observatory team!

Creating Algorithmic Tools to Interpret and Communicate Open Data Efficiently

Fri, 04 Jun 2021 10:00:00 +0000

As a developer at rOpenGov, what type of data do you usually use in your work?

As an academic data scientist whose research focuses on the development of general-purpose algorithmic methods, I work with a range of applications from life sciences to humanities. Population studies play a big role in our research, and often the information that we can draw from public sources - geospatial, demographic, environmental - provides invaluable support. We typically use open data in combination with sensitive research data but some of the research questions can be readily addressed based on open data from statistical authorities such as Statistics Finland or Eurostat.

In your ideal data world, what would be the ultimate dataset, or datasets that you would like to see in the Music Data Observatory?

One line of our research analyses the historical trends and spread of knowledge production, in particular book printing based on large-scale metadata collections. It would be interesting to extend this research to music, to understand the contemporary trends as well as the broader historical developments. Gaining access to a large systematic collection of music and composition data from different countries across long periods of time would make this possible.

Why did you decide to join the challenge and why do you think that this would be a game changer for researchers and policymakers?

Joining the challenge was a natural development based on our overall activities in this area; the rOpenGov project has been around for a decade now, since the early days of the broader open data movement. This has also created an active international developer network and we felt well equipped for picking up the challenge. The game changer for researchers is that the project highlights the importance of data quality, even when dealing with official statistics, and provides new methods to solve these issues efficiently through the open collaboration model. For policymakers, this provides access to new high-quality curated data and case studies that can support evidence-based decision-making.

Do you have a favorite, or most used open governmental or open science data source? What do you think about it? Could it be improved?

Regarding open government data, one of my favorites is not a single data source but a data representation standard. The px format is widely used by statistical authorities in various countries, and this has allowed us to create R tools that allow the retrieval and analysis of official statistics from many countries across Europe, spanning dozens of statistical institutions. Standardization of open data formats allows us to build robust algorithmic tools for downstream data analysis and visualization. Open government data is still too often shared in obscure, non-standard or closed-source file formats and this is creating significant bottlenecks for the development of scalable and interoperable AI and machine learning methods that can harness the full potential of open data.

Regarding open government data, one of my favorites is not a single data source but a data representation standard, the px format.

From your perspective, what do you see being the greatest problem with open data in 2021?

Although there are a variety of open data sources available (and the numbers continue to increase), the availability of open algorithmic tools to interpret and communicate open data efficiently is lagging behind. One of the greatest challenges for open data in 2021 is to demonstrate how we can maximize the potential of open data by designing smart tools for open data analytics.

What can our automated data observatories do to make open data more credible in the European economic policy community and be accepted as verified information?

The role of the professional network backing up the project, and the possibility of getting critical feedback and later adoption by the academic communities will support the efforts. Transparency of the data harmonization operations is the key to credibility, and will be further supported by concrete benchmarks that highlight the critical differences in drawing conclusions based on original sources versus the harmonized high-quality data sets.

We need to get critical feedback and later adoption by the academic communities.

How can we ensure the long-term sustainability of the efforts?

The extent of open data space is such that no single individual or institution can address all the emerging needs in this area. The open developer networks play a huge role in the development of algorithmic methods, and strong communities have developed around specific open data analytical environments such as R, Python, and Julia. These communities support networked collaboration and provide services such as software peer review. The long-term sustainability will depend on the support that such developer communities can receive, both from individual contributors as well as from institutions and governments.

New Indicators for Computational Antitrust

Wed, 02 Jun 2021 17:00:00 +0000

As someone who’s worked in data for almost 20 years, what type of data do you usually use in your research?

In my field (industrial organisation, competition policy), company level financial data, and product price and sales data have been the conventional building blocks of research papers. Ideally this has been the sort of data that I would seek out for my work. Of course as academic researchers we often get knocked back by the reality of data access and availability. I would think that industrial organisation is one of those fields where researchers have to be quite innovative in terms of answering interesting and relevant policy questions, whilst having to operate in an environment where most relevant data is proprietary and very expensive. Against this backdrop, I have worked with neatly organised proprietary datasets, self-assembled data collections, and also textual data.

From your experience working with various data sets, models, and frameworks, what would be the ultimate dataset, or datasets that you would like to see from the Economy Data Observatory?

There seems to be an emerging consensus that market concentration and markups have been continuously increasing across the economy. But most of these works use industry classification to define markets. One of the things I’d really like to see coming out of the Economy Data Observatory is a mapping of what we call antitrust markets.

Mapping NACE to Antitrust Markets.

Available datasets use standard industry classification (such as NACE in the EU), which is often very different from what we call a product market in microeconomics. Product markets are defined by demand-side and supply-side substitutability, which is a dynamically evolving feature and difficult to capture systematically on a wider scale. But with the recent proliferation of data and the growth (and fall in price) of computing power, I am positive that we could attempt to map out the European economy along these product market boundaries. Of course, this is not without challenges. For example, in digital markets, traditional ways of defining markets have posed serious challenges for competition authorities around the world.
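A toy calculation illustrates why the choice of market boundary matters. The interview does not say which concentration measure is meant. The sketch below assumes the Herfindahl-Hirschman Index (HHI, the sum of squared percentage market shares), a widely used measure, and applies it to four hypothetical firms that share one industry code but sell in two separate product markets. The firm names, the placeholder industry code "X", the market labels and the sales figures are all invented for illustration.

```python
from collections import defaultdict


def hhi(sales_by_firm: dict) -> float:
    """Herfindahl-Hirschman Index: sum of squared market shares in percent."""
    total = sum(sales_by_firm.values())
    return sum((100 * sales / total) ** 2 for sales in sales_by_firm.values())


def hhi_by(records: list, key_index: int) -> dict:
    """Group sales by the chosen market definition and compute HHI per group."""
    groups = defaultdict(lambda: defaultdict(float))
    for record in records:
        firm, sales = record[0], record[3]
        groups[record[key_index]][firm] += sales
    return {market: hhi(firms) for market, firms in groups.items()}


# Hypothetical records: (firm, industry_code, product_market, sales).
records = [
    ("A", "X", "m1", 60),
    ("B", "X", "m1", 40),
    ("C", "X", "m2", 90),
    ("D", "X", "m2", 10),
]

by_industry = hhi_by(records, key_index=1)  # market = industry classification
by_product = hhi_by(records, key_index=2)   # market = product market

print("HHI by industry code: ", by_industry)
print("HHI by product market:", by_product)
```

With these made-up numbers the industry-level index is 3,350, while the two product markets score 5,200 and 8,200. The same firms look far less concentrated when the market is drawn along the industry classification than when it is drawn along the boundaries within which customers can actually substitute.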

I believe that there is an immensely rich, and largely unexplored source of information in unstructured textual data that would be hugely useful for applied microeconomic works, including my own area of IO and competition policy. This includes a large corpus of administrative and court decisions that relate to businesses, such as merger control decisions of the European Commission. To give two examples from my experience, we’ve used a large corpus of news reports related to various firms to gauge the reputational impact of European Commission cartel investigations, or we’ve trained an algorithm to be able to classify US legislative bills and predict whether they have been lobbied or not. Finding a way to collect and convert this unstructured data into a format that is relevant and useful for users is not a trivial challenge, but is one of the most exciting parts of our Economy Data Observatory plans (see related project plan).

Finding a way to collect and convert this unstructured data into a format that is relevant and useful for users is not a trivial challenge, but is one of the most exciting parts of our Economy Data Observatory plans.

What is an idea that you consider will be a game changer for researchers and/or policymakers?

Partly talking in the past tense, the use of data-driven approaches, automation in research, and machine learning have been increasingly influential, and I think this trend will extend to all areas of social science. 10 years ago, to do machine learning, you had to build your models from scratch, typically requiring a solid understanding of programming and linear algebra. Today, there are readily available deep learning frameworks like TensorFlow, Keras, and PyTorch for designing a neural network for your own application. 10 years ago, natural language processing would have only been relevant for a small group of computational linguists. Today we have massive word embedding models trained on an enormous corpus of texts, at the fingertips of any researcher. 10 years ago, the cost of computing power would have made it prohibitive for most researchers to run even relatively shallow neural networks. Today, I can run complex deep learning models from my laptop using cloud computing servers. As a result of these developments, whereas 10 years ago one would have needed a small (or large) research team to explore certain research questions, much of this can now be automated and be done by a single researcher. For researchers without access to large research grants and without the ability to hire a research team, this has truly been an amazing victory for the democratisation of research.

You can already try out our API.

Do you have a favorite, or most used open governmental or open science data source? What do you think about it? Could it be improved?

As a competition economist, I tend to need very specific data for each research question I’m working on, which has to be collected from scratch. On the other hand, most works do require us to use data that has already been collected and made available. For example, access to census data has been immensely useful in ensuring that we can control for local demographic features, in papers where local competition plays a role. Census data is made readily available by most governments, but I particularly liked the Australian data, partly because they run a census every 5 years, but also because they have made the data available through a great table making tool.

Is there a number that recently surprised you? What was it?

I have these moments of surprise fairly frequently. To give one example from something I’m currently working on, looking at the distributional impact of increasing market concentration, we’ve found that low income households experience a larger increase in the petrol retail margin when market concentration increases than high income households. This fits nicely with theoretical works on search in homogeneous costs, i.e. low income households are less good at engaging with the market, and, as a result, if suppliers can price discriminate, they will charge a higher margin to these households.

The figure below shows our raw data (18 years of petrol station level daily price data from Western Australia) for low and high income areas, and the increase in the margin following an increase in market concentration (vertical dotted line). The left hand side (low income areas) displays a large increase in the margin when compared to a control group, whereas the right hand side (high income areas) shows no change. In our paper, of course, we build a fairly data-intensive quasi-experiment to identify the treatment effect of changing market concentration on the price margin applied to various demographic groups.

Surprising findings: market concentration and margin changes for petrol stations.

Do you have a good example of really good, or really bad use of data science /data curation?

Out of professional courtesy I really wouldn’t like to mention names from academic research as examples of bad use of data. But there are ample examples from newspaper coverage of data related work, or simply the misuse of data by newspapers. This may be intentional but is often a result of journalists not having the necessary training in using and analysing data.

When the press finds a piece of academic research interesting, often bad things come out of it. This is often because not all journalists are well equipped to interpret scientific findings. As a result, sometimes conclusions are drawn from a misinterpretation of good data analysis. Correlation interpreted as causation is a frequently recurring example. Equally bad is when press coverage changes the incentives for producing good research: scientists work too hard to get their work noticed by the press, and sacrifice scientific rigour in data analysis for the sake of media attention. There can also be less discernible but equally damaging errors.

In many instances, requiring researchers to pre-disclose the tests they are going to run on the data helps maintain credibility. Moreover, I am always a bit suspicious if the authors do not give access to their data for reproduction.

Our Economy Data Observatory places all new indicators on Zenodo with a DOI, and asks future individual contributors to deposit their data for replication there.

What do you see as the greatest challenge with open data in 2021?

The things I mentioned above about the democratisation of research driven by automation and access to big data do raise serious challenges as well. The obvious one has to do with the fact that there are enormous economies of scale in the use of data. As such, larger players will always be better positioned to outdo their smaller competitors, simply as a result of their superior data and infrastructure (for example, having more granular consumer data allows them to offer a better designed, customised experience for the consumer). Like many others, I see this as the biggest challenge for open data: to level the playing field for smaller players. This is not a trivial task at all; and even if, miraculously, small businesses could access the same data as the biggest players, they still would not have the capacity or the ability to use this data. So allowing access to data alone is unlikely to solve any of these problems. I would say that fostering engagement with open data is probably as big a challenge as creating the open data in the first place.

How do you envision the Economy Data Observatory making open data more credible in the European economic policy community and accepted as verified information?

I think starting with a focused agenda is a good idea. For example, linking up with the Centre for Competition Policy means that we have an initial focus on economic data relevant to competition policy. This is still a large domain, but it is one where we have ample expertise. Starting with specific research questions, such as linking competition enforcement and merger decisions to related information on innovation and ownership data, puts the Economy Data Observatory at the heart of some of the most topical policy questions, such as the role of killer acquisitions (acquisitions with the intent to kill off sources of rival innovation), or common ownership, both of which are increasingly discussed in policy and practitioner circles. Once we have established ourselves as a credible source of data in the competition policy community, we can look into joining this up with other policy areas, and also with our other Data Observatories (Music and Green Deal).

Reprex is Contesting all Three Challenges of the EU Datathon 2021 Prize

Fri, 21 May 2021 20:00:00 +0000

Reprex, a Dutch start-up enterprise formed to utilize open source software and open data, is looking for partners in an agile, open collaboration to win at least one of the three EU Datathon Prizes. We are looking for policy partners, academic partners and a consultancy partner. Our project is based on agile, open collaboration with three types of contributors.

With our competing prototypes we want to show that we have a research automation technology that can find open data, process it and validate it into high-quality business, policy or scientific indicators, and release it with daily updates in a modern API.

We are looking for institutions to challenge us with their data problems, and sponsors to increase our capacity. Over the next 5 months, we need to find a sustainable business model for a high-quality and open alternative to other public data programs.

The EU Datathon 2021 Challenge

“To take part, you should propose the development of an application that links and uses open datasets.” - our data curator team
“Your application … is also expected to find suitable new approaches and solutions to help Europe achieve important goals set by the European Commission through the use of open data.” - this application is developed by our technology contributors
“Your application should showcase opportunities for concrete business models or social enterprises.” - our service development team is working to make this happen!
We use open source software and open data. The applications are hosted on the cloud resources of Reprex, an early-stage technology startup currently building a viable, open-source, open-data business model to create reproducible research products.
We are working together with experts in the domain as curators (check out our guidelines if you want to join: Data Curators: Get Inspired!).
Our development team works on an open collaboration basis. Our indicator R packages and our services are developed together with rOpenGov.

Mission statement

We want to win an EU Datathon prize by processing the vast, already-available governmental and scientific open data and making it usable for policy-makers, scientific researchers, and business researchers.

“To take part, you should propose the development of an application that links and uses open datasets. Your application should showcase opportunities for concrete business models or social enterprises. It is also expected to find suitable new approaches and solutions to help Europe achieve important goals set by the European Commission through the use of open data.”

We aim to win at least one first prize in the EU Datathon 2021. We are contesting all three challenges, which are related to the EU’s official strategic policies for the coming decade.

Challenge 1: A European Green Deal

Our Green Deal Data Observatory connects socio-economic and environmental data to help understand and combat climate change.

Challenge 1: A European Green Deal, with a particular focus on the European Climate Pact, the Organic Action Plan, and the New European Bauhaus, i.e., mitigation strategies.

Climate change and environmental degradation are an existential threat to Europe and the world. To overcome these challenges, the European Union created the European Green Deal strategic plan, which aims to make the EU’s economy sustainable by turning climate and environmental challenges into opportunities and making the transition just and inclusive for all.

Our Green Deal Data Observatory is a modern reimagination of existing ‘data observatories’; currently, there are over 70 permanent international data collection and dissemination points. One of our objectives is to understand why the dozens of the EU’s observatories do not use open data and reproducible research. We want to show that open governmental data, open science, and reproducible research can lead to a higher quality and faster data ecosystem that fosters growth for policy, business, and academic data users.

We provide high-quality, tidy data through a modern API which enables data flows between public and proprietary databases. We believe that introducing Open Policy Analysis standards with open data, open-source software, and research automation can help the Green Deal policymaking process. Our collaboration is open to individuals, citizen scientists, research institutes, NGOs, and companies.

Challenge 2: An economy that works for people

Our Economy Data Observatory will focus on competition, small and medium-sized enterprises, and robotization.

Challenge 2: An economy that works for people, with a particular focus on the Single market strategy, and particular attention to the strategy’s goals of 1. Modernising our standards system, 2. Consolidating Europe’s intellectual property framework, and 3. Enabling the balanced development of the collaborative economy.

Big data and automation create new inequalities and injustices and have the potential to create a jobless growth economy. Our Economy Data Observatory is a fully automated, open source, open data observatory that produces new indicators from open data sources and experimental big data sources, with authoritative copies and a modern API.

Our observatory monitors the European economy to protect consumers and small companies from unfair competition arising from both data and knowledge monopolization and robotization. We take a critical view of automation, robotization, and the AI revolution in the service-oriented European social market economy, from the perspectives of small and medium-sized enterprises (SMEs), intellectual property, and competition policy.

We would like to create early-warning, risk, economic effect, and impact indicators that can be used in scientific, business, and policy contexts for professionals who are working on re-setting the European economy after a devastating pandemic in the age of AI. We are particularly interested in designing indicators that can be early warnings for killer acquisitions, algorithmic and offline discrimination against consumers based on nationality or place of residence, and signs of undermining key economic and competition policy goals. Our goal is to help small and medium-sized enterprises and start-ups to grow, and to furnish data that encourages the financial sector to provide loans and equity funds for their growth.

Challenge 3: A Europe fit for the digital age

Our Digital Music Observatory is not only a demo of the European Music Observatory, but also a testing ground for data governance, Digital Services Act, and trustworthy AI problems.

Challenge 3: A Europe fit for the digital age, with a particular focus on Artificial Intelligence, the European Data Strategy, the Digital Services Act, Digital Skills and Connectivity.

The Digital Music Observatory (DMO) is a fully automated, open source, open data observatory that creates public datasets to provide a comprehensive view of the European music industry. It provides high-quality and timely indicators in all four pillars of the planned official European Music Observatory as a modern, open source and largely open data-based, automated, API-supported alternative solution for this planned observatory. The insight and methodologies we are refining in the DMO are applicable and transferable to about 60 other data observatories funded by the EU which do not currently employ governmental or scientific open data.

Music is one of the most data-driven service industries where most sales are currently executed by AI-driven autonomous systems that influence market shares and intellectual property remuneration. We provide a template that enables making these AI-driven systems accountable and trustworthy, with the goal of re-balancing the legitimate interests of creators, distributors, and consumers. Within Europe, this new balance will be an important use case of the European Data Strategy and the Digital Services Act.

The DMO is a fully functional service that can serve as a testing ground of the European Data Strategy. It can showcase the ways in which the music industry is affected by the problems that the Digital Services Act and European Trustworthy AI initiatives attempt to regulate. It is being built in open collaboration with national music stakeholders, NGOs, academic institutions, and industry groups.

Our Product/Market Fit was validated in the world’s 2nd ranked university-backed incubator program, the Yes!Delft AI Validation Lab. We are currently developing this project with the help of the JUMP European Music Market Accelerator program.

Problem Statement

The EU has an 18-year-old open data regime that makes public taxpayer-funded data worth tens of billions of euros per year; the Eurostat program alone handles 20,000 international data products, including at least 5,000 pan-European environmental indicators.

As open science principles gain increased acceptance, scientific researchers are making hundreds of thousands of valuable datasets public and available for replication every year.

The EU, the OECD, and UN institutions run around 100 data collection programs, so-called ‘data observatories’, that more or less avoid touching this data and buy proprietary data instead. Annually, each observatory spends between 50 thousand and 3 million EUR on collecting untidy and proprietary data of inconsistent quality, while never even considering open data.

Our automated data observatories are modern reimaginations of the existing observatories that do not use open data and research automation.

The problem with the current EU data strategy is that while it produces enormous quantities of valuable open data, in the absence of common basic data science and documentation principles, it often seems cheaper to create new data than to put the existing open data into shape.

This is an absolute waste of resources and efforts. With a few R packages and our deep understanding of advanced data science techniques, we can create valuable datasets from unprocessed open data. In most domains, we are able to repurpose data originally created for other purposes at a historical cost of several billions of euros, converting these unused data assets into valuable datasets that can replace tens of millions’ worth of proprietary data.

What we want to achieve with this project (and we believe such an accomplishment would merit one of the first prizes) is to add value to a significant portion of pre-existing EU open data (for example, the data available on data.europa.eu/data) by re-processing and integrating it into a modern, tidy database with API access, and to find a business model that emphasises a triangular use of data in 1. business, 2. science and 3. policy-making. Our mission is to modernize the concept of data observatories.