By Ellen Liberman
It took $3 billion, 13 years and a score of universities and research centers in six countries to sequence nearly the entire human genome. In April 2003, the international consortium piloting the effort announced that it had successfully accounted for 92% of the 3.2 billion base pairs that make up a human’s genetic blueprint. At the time, it was called one of the greatest scientific feats in history. Nonetheless, there were gaps and errors—8% of the human genome could not be sequenced with the tools available at the time. It would be nearly two more decades before methods and computing power had advanced sufficiently to finish the job. In April 2022, scientists from the National Human Genome Research Institute, the University of California, Santa Cruz, and the University of Washington, Seattle, announced that they had sequenced a final, complete human genome.
It takes GSO graduate student Ian Bishop about three months to extract a sample of DNA from an algae culture, send it to a core facility for genomic sequencing and receive a massive dataset. A biological oceanographer, Bishop studies the ways in which the individual genome and genetic diversity within a population of phytoplankton can help a species adapt to rapid environmental change.
Given phytoplankton’s critical role in the ocean food web and in carbon sequestration, answering Bishop’s research question has huge implications for the planet. But he couldn’t even ask it without new technologies that not only provide big data rapidly, but allow him to make sense of it at a scale that was previously out of reach.
“Bioinformatics in sequencing data analysis is kind of like taking an individual, which has a book, its DNA, ripping it into pages of 10 to 30 words, and giving someone a pile of those. I have to reconstruct the book again,” he says. “Previously, you would spend a lot of time and effort developing assays to look at a very small part of the genome, which is valuable, but a lot more work up front and not computationally intensive. Now I use the signal from the entire genome to look at similar features, at higher resolution, in the population and genetic differentiation between individuals.”
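To make the metaphor concrete, here is a minimal sketch, in Python, of the reassembly problem Bishop describes. It is a toy, not his actual pipeline: it greedily merges the pair of fragments with the longest overlap, the same basic move genome assemblers apply to billions of short DNA reads.

    def overlap(a, b):
        """Length of the longest suffix of a that is a prefix of b."""
        for size in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:size]):
                return size
        return 0

    def assemble(fragments):
        """Greedily merge the two fragments with the largest overlap until one remains."""
        frags = list(fragments)
        while len(frags) > 1:
            best = (0, 0, 1)  # (overlap length, index i, index j)
            for i, a in enumerate(frags):
                for j, b in enumerate(frags):
                    if i != j:
                        olen = overlap(a, b)
                        if olen > best[0]:
                            best = (olen, i, j)
            olen, i, j = best
            merged = frags[i] + frags[j][olen:]
            frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
        return frags[0]

    # "Pages" ripped from a one-sentence book, reconstructed from their overlaps
    reads = ["the quick brown", "brown fox jumps", "jumps over the lazy dog"]
    print(assemble(reads))  # -> the quick brown fox jumps over the lazy dog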
“Big Data is still so new that we are almost in its infancy in figuring out how to handle it. It’s going to take a long time to really come to grips with that.” – Megan Cromwell, Assistant Chief Data Officer, National Ocean Service
Embracing the Tsunami
In 2018, the International Data Corporation predicted that by 2025 there would be 175 zettabytes—175 billion terabytes—of data in the world. According to the IDC’s Data Age 2025 report, storing the entire Global Datasphere on DVDs could build a stack that would reach the moon 23 times, or circle Earth 222 times. There is no estimate for how much of it is generated by scientists, but suffice it to say that it is a lot. The National Centers for Environmental Information, the National Oceanic and Atmospheric Administration’s (NOAA) archive, for example, currently holds about 47 petabytes of data, says Megan Cromwell, assistant chief data officer for the National Ocean Service, who, in her last job, managed all the data for NOAA’s research vessel, Okeanos Explorer. She and her colleagues recently estimated that a single research project alone would collect half a petabyte.
“Just for video data from one ship we had over 300 terabytes of data. It’s growing so exponentially,” she says. “Big Data is still so new that we are almost in its infancy in figuring out how to handle it. It’s going to take a long time to really come to grips with that.”
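For a rough sense of those scales, a back-of-the-envelope conversion puts the figures above side by side, using the decimal definitions of the units (one zettabyte is a billion terabytes; one petabyte is a thousand terabytes):

    ZETTABYTE_TB = 1_000_000_000          # terabytes per zettabyte
    PETABYTE_TB = 1_000                   # terabytes per petabyte

    datasphere_tb = 175 * ZETTABYTE_TB    # IDC's projected 2025 Global Datasphere
    ncei_archive_tb = 47 * PETABYTE_TB    # NOAA's NCEI holdings
    one_ship_video_tb = 300               # video from a single research vessel

    print(f"NCEI archive: {ncei_archive_tb:,} TB")
    print(f"One ship's video: {one_ship_video_tb / ncei_archive_tb:.2%} of the archive")
    print(f"Archive's share of the projected datasphere: {ncei_archive_tb / datasphere_tb:.6%}")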
In many ways, this tsunami of raw observations, samples and imagery, plus experimental, simulated and derived data represents an unprecedented scientific opportunity—“open doors” is an oft-used metaphor. Big Data, voluminous and accessible, can democratize science, reduce research costs by spreading them among multiple users, and speed the pace of discovery. At the same time, it presents enormous organizational and storage challenges. Will there be an easily navigated library or a pile of ripped pages on the other side of that door?
Open data has been the official policy of U.S. science’s largest funders for a decade or more. In 2011, the National Science Foundation (NSF) updated the implementation of its data sharing policy, requiring investigators to include a two-page supplement to their proposals, describing a management plan for any data generated as a result of NSF-granted work. Two years later, the White House’s Office of Science and Technology Policy (OSTP) issued a guidance document ordering all agencies with more than $100 million in annual research and development expenditures to develop a plan to increase public access to the results—including peer-reviewed publications and digital data—generated from federally funded research. Last year, the OSTP updated the policy, giving federal agencies four years to make all federally funded data available in a machine-readable format in an agency-designated repository “without delay,” which removed the embargoes on access to the underlying data of a peer-reviewed publication.
“…having ready access to a large trove of data lets you scale anything you may want to do up to a planetary scale.” – Katherine Kelley, Professor of Oceanography, GSO
Extracting Meaning
For GSO professor Katherine Kelley, who investigates how volatile elements help to fuel magma formation and volcanic eruptions underwater, using open-access data is “pretty routine.” She worked on her Ph.D. thesis in the late 1990s and early 2000s using previously generated data, with the help of PetDB, a searchable database of published geochemical data for igneous and metamorphic rocks hosted by the Lamont-Doherty Earth Observatory at Columbia University. The old way of aggregating existing data, combing the scientific literature for studies of interest, pulling the data from a certain region and typing it all in to make your own dataset “is just not practical if you need to bring together 1,000 or 10,000 peoples’ analyses related to the globe,” she says. “It definitely changes the way you ask your questions, because having ready access to a large trove of data lets you scale anything you may want to do up to a planetary scale.”
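In practice, that shift can be as simple as stacking already-compiled tables instead of retyping them. A minimal sketch, assuming a handful of hypothetical regional exports (the file and column names here are illustrative, not PetDB’s actual format):

    import pandas as pd

    # Hypothetical exports of published analyses, one file per region of interest
    regional_files = ["mariana_arc.csv", "tonga_arc.csv", "aleutian_arc.csv"]

    # Stack them into a single table that can be queried at planetary scale
    frames = [pd.read_csv(path) for path in regional_files]
    global_table = pd.concat(frames, ignore_index=True)

    print(len(global_table), "published analyses,",
          global_table["region"].nunique(), "regions")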
GSO professor Susanne Menden-Deuer was another early adopter, launching a seminar in 2011 with her colleague, professor Bethany Jenkins, on big data and marine science, with a focus on how to best visualize and analyze existing data.
“I realized as we increased observational capability, one of the big bottlenecks was going to be how to understand those data and extract some meaning,” she says.
And she saw how data-savvy students could take advantage of this paradigm in a summer course on climate change and marine ecosystems that she taught in Hong Kong to postdocs and early-career scientists from the region.
“The Malaysian students were working with [National Aeronautics and Space Administration] data. They said: ‘We don’t have a research ship, but here’s data we can use to do oceanographic exploration.’ And we can use it to open doors to broader participation in ocean science,” she says. “People are trying to navigate on a rotating roller-coaster, because data are coming in faster and faster. But there’s already been a great deal of progress, when you have students in Malaysia accessing data in a useful way.”
“Is there a singular place where you can search for all the data that is available, be able to download it and utilize it?” – Deborah Smith, Data Governance Manager, OECI
To be FAIR
And yet, there is a great deal of progress to be made. In 2016, the journal Scientific Data laid out four guiding principles for data producers and publishers that should govern data-sharing: Findability, Accessibility, Interoperability, and Reusability (FAIR). And of the four, accessibility seems to be the furthest along. Literally thousands of data repositories exist, both generalist, such as Dataverse and Figshare, and domain-specific, such as the Advanced Global Atmospheric Gases Experiment database, which contains all the calibrated measurements of the composition of the global atmosphere from June 1978 to the present. Their data may cover the globe, such as those held by the World Ocean Database, described as the “world’s largest collection of uniformly formatted, quality controlled, publicly available ocean profile data.” Or repositories may hold information on one of its small corners. GSO offers access to a variety of real-time data collected in Narragansett Bay. The Rolling Deck to Repository (R2R) program preserves and provides access to—as of mid-April—47,936 datasets with more than 16 million downloadable files of underway data from 49 research vessels. The search engine re3data.org, a registry of more than 2,400 repositories worldwide, brings up 164 current and defunct databases with the term “ocean” in their titles.
FAIR has gained widespread acceptance in the scientific community as an aspiration, but adhering to its many provisions presents a number of hurdles, says Deborah Smith, data governance manager for the Ocean Exploration Cooperative Institute. For example, findability—“it’s not so much Big Data, as lots of data that lives in lots of different places and getting that data into a central library,” she says. “Is there a singular place where you can search for all the data that is available, be able to download it and utilize it? Collecting the metadata of the data is just as important because it’s the primary thing that people search: what was the cruise, who was on it, what type of data was collected.”
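What that searchable “card catalog” layer can look like is sketched below, with illustrative field names rather than any repository’s actual schema: a cruise-level metadata record small enough to search even when the underlying files are not.

    from dataclasses import dataclass, field

    @dataclass
    class CruiseMetadata:
        cruise_id: str                 # e.g. an expedition identifier
        vessel: str
        chief_scientist: str
        start_date: str                # ISO 8601 dates, e.g. "2023-06-01"
        end_date: str
        data_types: list = field(default_factory=list)
        repository_url: str = ""       # where the files actually live

    # A one-entry catalog; values below are placeholders, not a real cruise record
    catalog = [
        CruiseMetadata("EX-EXAMPLE", "Okeanos Explorer", "Chief Scientist TBD",
                       "2023-06-01", "2023-06-21",
                       ["ROV video", "CTD", "multibeam"],
                       "https://example.org/ex-example"),
    ]

    # Findability in practice: search the metadata, not the terabytes of files
    hits = [c for c in catalog if any("video" in d.lower() for d in c.data_types)]
    for c in hits:
        print(c.cruise_id, c.vessel, "->", c.repository_url)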
There is also a large volume of high-resolution data, like research vessel videos that are not online, but on “hard magnetic LTO tapes sitting on a shelf, so you have to request it, it has to get transferred to a hard drive and sent to somebody,” she says.
Some of the most basic elements of data collection and characterization are not yet standardized. Different repositories may use different metadata. Cromwell would like to see data producers standardize the way they express basic individual elements such as time—some researchers use local time, while others use Coordinated Universal Time. The lack of common frameworks makes it harder for data managers to build useful, automated information pipelines, and it’s one of the factors that can increase the costs of data availability and management.
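The time-stamp problem is easy to see in code. A minimal sketch, assuming one team logs in local ship time and another in UTC (the zone name is illustrative): until both are normalized, the same measurement looks like two different moments.

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # The same sampling instant, recorded two ways
    local_reading = datetime(2024, 7, 4, 14, 30, tzinfo=ZoneInfo("America/New_York"))
    utc_reading = datetime(2024, 7, 4, 18, 30, tzinfo=ZoneInfo("UTC"))

    # Normalize to UTC before archiving, and say so in the metadata
    print(local_reading.astimezone(ZoneInfo("UTC")).isoformat())  # 2024-07-04T18:30:00+00:00
    print(utc_reading.isoformat())                                # 2024-07-04T18:30:00+00:00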
“One of the intrinsic challenges for any data management in the sciences is that we are inherently about change and innovation and growth.” – Raleigh Martin, Geosciences Program Director, National Science Foundation
“We put most of the data in the cloud, and there’s still a very big cost associated with that. If nobody touches it then it may not cost a lot. If you have 400 users trying to hit the data all the time, download and work with it, what is the cost for that in perpetuity going forward?” asks Smith.
“One of the big things we have been leaning into is machine learning and artificial intelligence to help us find out what’s here and where the anomalies are to guide the data and the users,” says Cromwell.
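At its simplest, that kind of guidance can start with nothing fancier than flagging readings that sit far from the rest of a series. The sketch below is a toy stand-in for NOAA’s actual tools, using a plain statistical threshold rather than a trained model:

    import statistics

    # Hypothetical temperature series with one suspect sensor reading
    temps_c = [12.1, 12.3, 12.2, 12.4, 25.0, 12.2, 12.3]

    mean = statistics.mean(temps_c)
    spread = statistics.stdev(temps_c)

    # Flag anything more than two standard deviations from the mean for review
    anomalies = [(i, t) for i, t in enumerate(temps_c) if abs(t - mean) > 2 * spread]
    print(anomalies)  # [(4, 25.0)] -- the reading a human (or a model) should check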
Data managers are working to resolve these logistical issues, but “we aren’t yet there with technology across the oceanographic field,” Smith says. And that is before they tackle more conceptual questions, such as deciding what data to retain and what can be safely discarded.
Raleigh Martin, an NSF geosciences program director, says the technology side will never be done. “One of the intrinsic challenges for any data management in the sciences is that we are about change, innovation and growth. We want change, new technologies, computing capabilities and new ways of looking at the data. So, I think that is one of the inherent tensions with these repositories is they need to have enough standards to make the data FAIR, but they have to accommodate the fact that things do change.”
“The whole trajectory toward interdisciplinary and multi-disciplinary research is greatly benefitted by these networks of computational resources.” – Ian Bishop, Ph.D. Candidate, GSO
Enter High-Performance Computing
And even if a researcher can find and access the data they need, some datasets are so large and complex that they require other tools to make sense of them. Bishop manages his genetic data using URI’s new Center for Computational Research (CCR), launched in January on the Kingston campus. The CCR’s “intent is to give URI researchers access to cutting edge computing,” says its director, physics professor Gaurav Khanna. “The broader mission is to grow, promote and support all sorts of computational research, which includes high-performance computing, artificial intelligence and quantum computing.”
Khanna had founded a similar entity a decade ago at UMass, which is a partner in the Massachusetts Green High Performance Computing Center, along with Harvard, MIT, and Northeastern and Boston Universities. It took two years for URI to establish the CCR: hiring Khanna, modernizing the computing infrastructure, establishing relationships with other regional universities with high-performance computing centers, and moving it through the academic approval process. The pandemic helped to speed things up, as COVID shut down lab-based research, leaving computing and computational analysis as the only research—accessible and conducted remotely—that was moving forward. The CCR’s physical hub consists of 50 interlinked servers with 3,000 processor cores, along with specialized hardware for artificial intelligence, housed in a hydro-powered, zero-carbon facility in Holyoke, Mass.
Developed over the last several decades, high-performance computing has become one of science’s most significant tools, moving hypothesis generation and experimentation beyond the lab bench into simulations and modeling that can allow you to “do more at significantly reduced costs,” Khanna says. “It’s quite a radical shift.”
Simulations allow astrophysicists to conduct physically impossible experiments, like smashing two black holes together; they can also be applied to more immediate, earthly concerns. Khanna points to GSO professor Isaac Ginis’s complex hurricane modeling, which is being migrated from a North Carolina facility to URI’s CCR:
“He is able to do real-time simulations and inform emergency management services in real time about storms that are coming so that they can make appropriate preparations for flooding, on a street-by-street and building-by-building level of prediction,” he says. “When it is ready for next hurricane season, it will be very impactful.”
Early career scientists like Bishop see open data plus new data mining tools expanding science’s frontiers.
“We are getting to this scale of scientific inquiry that requires really diverse skill sets. The whole trajectory toward interdisciplinary and multidisciplinary research is greatly benefitted by these networks of computational resources—people are trying very hard to get people working together and working on bigger, previously intractable issues.”
Some Things Never Change
There’s a joke in graduate student circles that a candidate can write an entire dissertation using other people’s data. To be sure, field work continues. Engineers develop new devices to collect more and better data, and scientists are still boarding research vessels to investigate hypotheses—someone has to collect all those zettabytes. But Menden-Deuer, a sea-going oceanographer and director of her own plankton ecology lab, sees no irony at all.
“There are a lot of very profound lessons in acknowledging others’ data, verification of data sources, and respecting data sources,” she says. “It also demonstrates the iterative and collaborative nature of science. We are all just building upon knowledge generated prior.”