The RAMP is a versatile management and software tool for connecting data science to domain sciences, which is the main mission of the Paris-Saclay Center for Data Science (CDS). It grew organically out of our experience with data challenges, and evolved through the dozen iterations that we carried out in our research and training activities. The RAMP is developed as an in-house tool at the CDS, in collaboration with the Center for Scientific Management (CGS) at Ecole des Mines. It was originally designed as a collaborative prototyping tool that makes efficient use of data scientists' time in solving the data analytics segment of high-impact domain science problems. We then realized that it is equally valuable for training novice data scientists, for networking, for communication, and as a social science observatory. It is rapidly becoming a standard educational tool, used in three UPSaclay data science masters, as well as in other programs in Paris and Lille. It has been used six times at Saclay, and in four hackathons outside Saclay (Paris School of Economics; French National Museum of Natural History; NCAR, Colorado; Epidemium, Paris).
The RAMP is used in the following operational context. As in a data challenge, the data provider arrives with a prediction problem and a corresponding data set. An experienced data scientist then cleans and curates the data and formalizes the problem. This process can take from two weeks to six months, and results in a starting kit, typically an IPython notebook that introduces the domain science problem, describes the data, and shows a first untuned solution (benchmark). The problem is then set up using the RAMP software, and a RAMP event is organized with 30-50 data scientists and domain scientists. The RAMP event usually lasts a single day, in order to attract data scientists who do not wish to commit to a longer period of learning the domain problem. We have also been experimenting with other formats: data challenges usually take several months, and course projects can take several weeks. When the data science problem requires mastering a specific tool, the RAMP event can be preceded by a Training Sprint that explains the tool to the participants. Part of the Training Sprint can also be devoted to introducing the domain science problem; otherwise, this introduction takes place at the beginning of the RAMP.
During the RAMP, the participants submit predictive solutions in the form of code, as opposed to data challenges, where only predictions are submitted. The models are trained on our back-end, and the scores are displayed on a leaderboard. All participants have access to all code, and they are encouraged to look at and to reuse each other's solutions. This accelerates the development process (compared to challenges) since good ideas spread fast. The original single-day setup was tested on the HiggsML challenge (particle physics), on mortality prediction (health care), variable star classification (astrophysics), El Niño prediction (climate science), insect recognition (ecology), and replacing agent-based simulations (macroeconomy). Each of these events led to a significant improvement over the baseline. Since the organizers have access to all the code, the result of the day is a fully functioning near-optimal prototype.
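The submit, train, and score loop described above can be illustrated with a minimal sketch. This is not the RAMP software or its API: the class names, the `run_backend` function, the RMSE score, and the synthetic data are all assumptions made for illustration. The sketch only shows the general idea that participants hand in code (here, classes with `fit` and `predict` methods), that a back-end trains each submission on the same data, and that the resulting scores are ranked on a leaderboard.

```python
import random


class MeanRegressor:
    """Hypothetical baseline submission: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)

    def predict(self, X):
        return [self.mean_ for _ in X]


class LinearRegressor:
    """Hypothetical improved submission: one-variable least squares."""

    def fit(self, X, y):
        n = len(X)
        mean_x = sum(X) / n
        mean_y = sum(y) / n
        sxx = sum((x - mean_x) ** 2 for x in X)
        sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(X, y))
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


def rmse(y_true, y_pred):
    """Root mean squared error (lower is better); an assumed score."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def run_backend(submissions, train, test):
    """Train every submitted model on the same data and rank by test score."""
    X_train, y_train = train
    X_test, y_test = test
    board = []
    for name, model_cls in submissions.items():
        model = model_cls()          # submissions are code, not predictions
        model.fit(X_train, y_train)  # training happens on the back-end
        board.append((name, rmse(y_test, model.predict(X_test))))
    return sorted(board, key=lambda row: row[1])


# Synthetic stand-in for the data provider's data set (an assumption).
rng = random.Random(0)
X = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in X]
train, test = (X[:150], y[:150]), (X[150:], y[150:])

leaderboard = run_backend(
    {"baseline_mean": MeanRegressor, "linear_fit": LinearRegressor},
    train,
    test,
)
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: RMSE = {score:.3f}")
```

Because every submission is code with a common interface, any participant could copy `LinearRegressor`, modify it, and resubmit, which is the reuse mechanism the paragraph above describes.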
Each RAMP attracts about 30-50 participants from different backgrounds and career stages, most of whom meet for the first time. They develop a working relationship in a relaxed environment, and sometimes keep working together after the event.
The RAMPs generate a significant amount of quantitative data on the way data scientists work and collaborate with each other, which allows social scientists to study the dynamics of collaborative work. The results of these analyses can then be used to optimize certain aspects of the RAMP format.