No one doubts that data is currently one of the most important resources. Companies use data to segment customers and increase the likelihood of sales. Administrations use data to improve decision making. There are public repositories of data for multiple uses, for example, for machine learning.
However, this apparent situation does not match reality, at least not in Software Engineering. The big problems of the past, such as the lack of motivation to perform replications, remain unsolved. It is not easy to identify interesting data sets for meta-analysis. There is not even a standardized vocabulary that would allow efficient indexing and searching of data.
In this talk, I will review data-related problems in empirical software engineering research. From a personal, and roughly chronological, perspective, I will unpack the developments of the last twenty years, which I have experienced first-hand. Fifty is a good age for this exercise. I will also describe what I would have liked to do but have not yet achieved, such as creating multi-area repositories or conducting multi-center experiments.
I do not intend to predict the future; I could not. But I do intend to propose a working agenda, admittedly personal and data-centric, that could also be useful for Software Engineering as a community.