Data, Data, Data; So Much Data

Claudia Neuhauser, Vice President, Data Analytics

High-throughput data acquisition in the sciences is now most frequently outsourced to core research facilities, where highly skilled technicians increasingly work side by side with robots to churn out very high volumes of data. Each core facility specializes in a specific type of data: for instance, a genomics facility generates RNA or DNA data on its sequencing machines, and an imaging facility takes high-resolution images of thin slices of tissue. The large volumes of data now routinely generated by most investigators have outpaced the ability of single-investigator labs to manage the data through their life cycle.

In addition, researchers increasingly take advantage of multiple core research facilities to generate the data needed for their research. Each core facility follows its established workflows to generate raw data, which are then delivered to the researcher through a data platform unique to that facility. This leaves the difficult task of integrating data from multiple facilities to the investigator. The problem of curating, integrating, analyzing, and disseminating data is particularly acute for investigators in the life and health sciences, where the investigator’s competencies in managing data life cycle issues have not caught up with the realities of today’s research projects.

A recent study (1) confirms the difficulties faced by investigators. In 2016, the authors of the study surveyed investigators who had received funding from the NSF Biological Sciences Directorate. They found that 87% of those surveyed use big data sets in their research, with DNA/RNA/protein sequences being the most common type. With the rapid change in analysis methods and the episodic nature of analysis needs, labs face the additional challenge of having to be familiar with whichever analysis method is most appropriate at the time they are ready to analyze their data. Very likely, the most appropriate method is different from the one they used previously.

The same article revealed that well over 75% of PIs expect to have data analysis needs consistent with large and heterogeneous data sets and that their analysis needs are largely unmet. While in the past computer infrastructure might have been the barrier, nine out of ten respondents now think that training in data integration, data management, or scaling analyses for HPC is most critical to the success of their science. The article makes the case for supporting computational training in biology, and we fully agree with that recommendation.

However, we feel that a piece critical to accelerating the data lifecycle is still missing, and with it an opportunity to accelerate research. In the same way that high-throughput data acquisition has moved out of individual investigator labs and into core facilities, certain portions of the data analysis can be standardized (commoditized) and moved out of individual laboratories. At the University of Minnesota, we have experimented with a new model of data analysis and have had excellent success, in particular in genomics. We developed standardized workflows that move the raw genomics data through quality control and basic analysis. These workflows turn data analysis into a commodity: data are consistently pre-processed and analyzed, without much human intervention. Since all the data a facility produces run through the same workflows, quality issues in data acquisition are noticed quickly, and the quality of the analysis is consistent and uniform. Reproducibility is no longer an issue. The data product is then handed over to the investigator in a usable form, ready to be interpreted and analyzed further. Some of our workflows even produce publication-ready figures. Analysis steps that would take a lab months to complete are done in days or a couple of weeks, and investigators are assured that the best-practice versions of analysis tools are used.
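
To make the idea of a commoditized workflow concrete, here is a minimal Python sketch. It is an illustration only, not the University of Minnesota implementation. The `Dataset` class, the step functions, and the quality threshold of 30 are all hypothetical names and values chosen for the example. The point it shows is that every dataset passes through the same ordered steps, so quality problems surface in a uniform way and each result carries a record of what was done to it.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List, Sequence

# Assumed threshold for illustration; a real facility would set its own.
MIN_MEAN_QUALITY = 30.0


@dataclass
class Dataset:
    """Hypothetical stand-in for one sample's raw data from a core facility."""
    sample_id: str
    read_qualities: List[float]               # illustrative per-read quality scores
    log: List[str] = field(default_factory=list)
    passed_qc: bool = True


Step = Callable[[Dataset], Dataset]


def quality_control(ds: Dataset) -> Dataset:
    """Flag the dataset if it is empty or its mean quality is below threshold."""
    ds.passed_qc = bool(ds.read_qualities) and mean(ds.read_qualities) >= MIN_MEAN_QUALITY
    ds.log.append("qc: pass" if ds.passed_qc else "qc: fail")
    return ds


def basic_analysis(ds: Dataset) -> Dataset:
    """Placeholder analysis step; it only runs on data that passed QC."""
    if ds.passed_qc:
        ds.log.append(f"analysis: {len(ds.read_qualities)} reads summarized")
    else:
        ds.log.append("analysis: skipped")
    return ds


# The same ordered steps are applied to every dataset the facility produces.
WORKFLOW: Sequence[Step] = (quality_control, basic_analysis)


def run_workflow(ds: Dataset, steps: Sequence[Step] = WORKFLOW) -> Dataset:
    for step in steps:
        ds = step(ds)
    return ds


if __name__ == "__main__":
    batch = [Dataset("s1", [35.0, 32.0, 31.0]), Dataset("s2", [10.0, 12.0])]
    for result in map(run_workflow, batch):
        print(result.sample_id, result.log)
```

Because the step list is fixed and the log travels with the data, two runs over the same input produce the same record, which is the property the standardized model relies on for consistency.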

The next step in the evolution of this model is to build a data acquisition platform that can be used across numerous core research facilities, providing the investigator with a single entry point to access their data. Such a platform will allow us to deliver the data to the appropriate data storage unit, thus further reducing data management challenges.

Instead of trying to centralize and coordinate core research facilities, which is very difficult to achieve in many institutions, we pushed the integration downstream, to the point after the data leave the facility. The facilities continue to act as “nodes” in this research ecosystem. No coordination effort is needed on their part. Instead, the coordination happens at the point of analysis. Adding value to raw data through commoditized analysis accelerates and facilitates research while, at the same time, improving quality.

HSNA principals have led the developments described above and excel in all aspects of data lifecycle issues. In particular, we have firsthand experience with managing large institutional research cores and developing strategies and operational units to aid investigators in the acquisition, analysis, and storage of large data sets.

(1) Barone L, Williams J, Micklos D (2017) Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators. PLoS Comput Biol 13(10): e1005755. https://doi.org/10.1371/journal.pcbi.1005755