{"id":180879,"date":"2023-05-15T16:45:21","date_gmt":"2023-05-15T20:45:21","guid":{"rendered":"https:\/\/web.uri.edu\/gso\/?p=180879"},"modified":"2023-05-15T16:45:21","modified_gmt":"2023-05-15T20:45:21","slug":"oceanography-meet-big-data","status":"publish","type":"post","link":"https:\/\/web.uri.edu\/gso\/publications\/aboard-gso\/oceanography-meet-big-data\/","title":{"rendered":"Oceanography, Meet Big Data"},"content":{"rendered":"<section class=\"cl-wrapper cl-hero-wrapper\"><div class=\"cl-hero super\"><div class=\"cl-hero-proper\"><div class=\"overlay\"><div class=\"block\"><h1>Oceanography, Meet Big Data<\/h1><p>As epochs go, is the Age of Discovery yielding to the Age of Information?<\/p><\/div><\/div><div class=\"still\" style=\"background-image:url(https:\/\/web.uri.edu\/gso\/wp-content\/uploads\/sites\/916\/AGSO_S23-Front-Cover2.jpg);\"><\/div><\/div><\/div><\/section>\n<h4>By Ellen Liberman<\/h4>\n<div class=\"type-intro\">\n<hr>\n<p>It took $3 billion, 13 years and a score of universities and research centers in six countries to sequence nearly the entire human genome. In April 2003, the international consortium piloting the effort announced that it had successfully accounted for 92% of the 3.2 billion base pairs that make up a human\u2019s genetic blueprint. At the time, it was called one of the greatest scientific feats in history. Nonetheless, there were gaps and errors\u20148% of the human genome could not be sequenced with the tools available at the time. It would be another 15 years before methods and computing power had advanced sufficiently to finish the job. In April 2022, scientists from the National Human Genome Research Institute, the University of California, Santa Cruz, and the University of Washington, Seattle, announced that they had sequenced a final, complete human genome. <\/p><\/div>\n<hr>\n<p>It takes GSO graduate student Ian Bishop about three months to extract a sample of DNA from an algae culture, send it to a core facility for genomic sequencing and receive a massive dataset. 
A biological oceanographer, Bishop studies the ways in which the individual genome and genetic diversity within a population of phytoplankton can help a species adapt to rapid environmental change. <\/p>\n<p>Given phytoplankton\u2019s critical role in the ocean food web and in carbon sequestration, answering Bishop\u2019s research question has huge implications for the planet. But he couldn\u2019t even ask it without new technologies that not only provide big data rapidly, but also allow him to make sense of it at a scale that was previously out of reach.<\/p>\n<p>\u201cBioinformatics in sequencing data analysis is kind of like taking an individual, which has a book, its DNA, ripping it into pages of 10 to 30 words, and giving someone a pile of those. I have to reconstruct the book again,\u201d he says. \u201cPreviously, you would spend a lot of time and effort developing assays to look at a very small part of the genome, which is valuable, but a lot more work up front and not computationally intensive. Now I use the signal from the entire genome to look at similar features, at higher resolution, in the population and genetic differentiation between individuals.\u201d<\/p>\n<section class=\"cl-wrapper cl-quote-wrapper\"><div class=\"cl-quote  \"><div class=\"cl-quote-image\" style=\"background-image:url(https:\/\/web.uri.edu\/gso\/wp-content\/uploads\/sites\/916\/AGSO-S23_cromwell-125.jpg)\" title=\"\"><\/div><blockquote>\u201cBig Data is still so new that we are almost in its infancy in figuring out how to handle it. It\u2019s going to take a long time to really come to grips with that.\u201d<\/blockquote><cite>Megan Cromwell, Assistant Chief Data Officer, National Ocean Service<\/cite><\/div><\/section>\n<h3>Embracing the Tsunami<\/h3>\n<p>In 2018, the International Data Corporation predicted that by 2025 there would be 175 zettabytes\u2014175 billion terabytes\u2014of data in the world. 
According to the IDC\u2019s Data Age 2025 report, storing the entire Global Datasphere on DVDs could build a stack that would reach the moon 23 times, or circle Earth 222 times. There is no estimate for how much of it is generated by scientists, but suffice it to say that it is a lot. The National Centers for Environmental Information, home to the National Oceanic and Atmospheric Administration (NOAA) archives, currently hold about 47 petabytes of data, for example, says Megan Cromwell, assistant chief data officer for the National Ocean Service, who, in her last job, managed all the data for NOAA\u2019s research vessel, Okeanos Explorer. She and her colleagues recently estimated that a single research project alone would collect half a petabyte. <\/p>\n<p>\u201cJust for video data from one ship we had over 300 terabytes of data. It\u2019s growing so exponentially,\u201d she says. \u201cBig Data is still so new that we are almost in its infancy in figuring out how to handle it. It\u2019s going to take a long time to really come to grips with that.\u201d<\/p>\n<p>In many ways, this tsunami of raw observations, samples and imagery, plus experimental, simulated and derived data represents an unprecedented scientific opportunity\u2014\u201copen doors\u201d is an oft-used metaphor. Big Data, voluminous and accessible, can democratize science, reduce research costs by spreading them among multiple users, and speed the pace of discovery. At the same time, it presents enormous organizational and storage challenges. Will there be an easily navigated library or a pile of ripped pages on the other side of that door?<\/p>\n<p>Open data has been the official policy of U.S. science\u2019s largest funders for a decade or more. In 2011, the National Science Foundation (NSF) updated the implementation of its data sharing policy, requiring investigators to include a two-page supplement to their proposals, describing a management plan for any data generated as a result of NSF-granted work. 
Two years later, the White House\u2019s Office of Science and Technology Policy (OSTP) issued a guidance document ordering all agencies with more than $100 million in annual research and development expenditures to develop a plan to increase public access to all research results\u2014including peer-reviewed publications and digital data\u2014generated from federally funded research. Last year, the OSTP updated the policy, giving federal agencies four years to make all federally funded data available in a machine-readable format in an agency-designated repository \u201cwithout delay,\u201d removing the embargoes on access to the underlying data of peer-reviewed publications. <\/p>\n<section class=\"cl-wrapper cl-quote-wrapper\"><div class=\"cl-quote  \"><div class=\"cl-quote-image\" style=\"background-image:url(https:\/\/web.uri.edu\/gso\/wp-content\/uploads\/sites\/916\/AGSO-S23_Katie_Kelley_2-500.jpg)\" title=\"\"><\/div><blockquote>\u201c\u2026having ready access to a large trove of data lets you scale anything you may want to do up to a planetary scale.\u201d<\/blockquote><cite>Katherine Kelley, Professor of Oceanography, GSO<\/cite><\/div><\/section>\n<h3>Extracting Meaning<\/h3>\n<p>For GSO professor Katherine Kelley, who investigates how volatile elements help to fuel magma formation and volcanic eruptions underwater, using open-access data is \u201cpretty routine.\u201d She worked on her Ph.D. thesis in the late 1990s and early 2000s using previously generated data, with the help of PetDB, a searchable database of published geochemical data for igneous and metamorphic rocks hosted by the Lamont-Doherty Earth Observatory at Columbia University. The old way of aggregating existing data, combing the scientific literature for studies of interest, pulling the data from a certain region and typing it all in to make your own dataset \u201cis just not practical if you need to bring together 1,000 or 10,000 peoples\u2019 analyses related to the globe,\u201d she says. 
\u201cIt definitely changes the way you ask your questions, because having ready access to a large trove of data lets you scale anything you may want to do up to a planetary scale.\u201d <\/p>\n<p>GSO professor Susanne Menden-Deuer was another early adopter, launching a seminar in 2011 with her colleague, professor Bethany Jenkins, on big data and marine science, with a focus on how best to visualize and analyze existing data. <\/p>\n<p>\u201cI realized as we increased observational capability, one of the big bottlenecks was going to be how to understand those data and extract some meaning.\u201d<\/p>\n<p>And she saw how data-savvy students can take advantage of this paradigm in a summer course on climate change and marine ecosystems she taught in Hong Kong to post-docs and early career scientists from the region.<\/p>\n<p>\u201cThe Malaysian students were working with [National Aeronautics and Space Administration] data. They said: \u2018we don\u2019t have a research ship, but here\u2019s data we can use to do oceanographic exploration.\u2019 And we can use it to open doors to broader participation in ocean science,\u201d she says. \u201cPeople are trying to navigate on a rotating roller-coaster, because data are coming in faster and faster. But there\u2019s already been a great deal of progress, when you have students in Malaysia accessing data in a useful way.\u201d<\/p>\n<section class=\"cl-wrapper cl-quote-wrapper\"><div class=\"cl-quote  \"><div class=\"cl-quote-image\" style=\"background-image:url(https:\/\/web.uri.edu\/gso\/wp-content\/uploads\/sites\/916\/AGSO-S23_Deborah-Smith.jpg)\" title=\"\"><\/div><blockquote>\u201cIs there a singular place where you can search for all the data that is available, be able to download it and utilize it?\u201d<\/blockquote><cite>Deborah Smith, Data Governance Manager, OECI<\/cite><\/div><\/section>\n<h3>To be FAIR<\/h3>\n<p>And yet, there is a great deal of progress to be made. 
In 2016, the journal Scientific Data laid out four guiding principles for data producers and publishers that should govern data-sharing: Findability, Accessibility, Interoperability, and Reusability (FAIR). And of the four, accessibility seems to be the furthest along. Literally thousands of data repositories exist, both generalist, such as Dataverse and Figshare, and domain-specific, such as the Advanced Global Atmospheric Gases Experiment database, which contains all the calibrated measurements of the composition of the global atmosphere from June 1978 to the present. Their data may cover the globe, such as those held by the World Ocean Database, described as the \u201cworld\u2019s largest collection of uniformly formatted, quality controlled, publicly available ocean profile data.\u201d Or repositories may hold information on one of its small corners. GSO offers access to a variety of real-time data collected in Narragansett Bay. The Rolling Deck to Repository (R2R) program preserves and provides access to\u2014as of mid-April\u201447,936 datasets with more than 16 million downloadable files of underway data from 49 research vessels. The search engine re3data.org, a registry of more than 2,400 repositories worldwide, brings up 164 current and defunct databases with the term \u201cocean\u201d in their titles. <\/p>\n<p>FAIR has gained widespread acceptance in the scientific community as an aspiration, but adhering to its many provisions presents a number of hurdles, says Deborah Smith, data governance manager for the Ocean Exploration Cooperative Institute. For example, findability\u2014\u201cit\u2019s not so much Big Data, as lots of data that lives in lots of different places and getting that data into a central library,\u201d she says. \u201cIs there a singular place where you can search for all the data that is available, be able to download it and utilize it? 
Collecting the metadata of the data is just as important because it\u2019s the primary thing that people search: what was the cruise, who was on it, what type of data was collected.\u201d<\/p>\n<p>There is also a large volume of high-resolution data, like research vessel videos, that is not online but on \u201chard magnetic LTO tapes sitting on a shelf, so you have to request it, it has to get transferred to a hard drive and sent to somebody,\u201d she says.<\/p>\n<p>Some of the most basic elements of data collection and characterization are not yet standardized. Different repositories may use different metadata. Cromwell would like to see data producers standardize the way they express basic individual elements such as time\u2014some researchers use local time, while others use Coordinated Universal Time. The lack of common frameworks makes it harder for data managers to build useful, automated information pipelines, and it\u2019s one of the factors that can increase the costs of data availability and management. <\/p>\n<section class=\"cl-wrapper cl-quote-wrapper\"><div class=\"cl-quote  \"><div class=\"cl-quote-image\" style=\"background-image:url(https:\/\/web.uri.edu\/gso\/wp-content\/uploads\/sites\/916\/AGSO-S23_RMartin.jpg)\" title=\"\"><\/div><blockquote>\u201cOne of the intrinsic challenges for any data management in the sciences is that we are inherently about change and innovation and growth.\u201d<\/blockquote><cite>Raleigh Martin, Geosciences Program Director, National Science Foundation<\/cite><\/div><\/section>\n<p>\u201cWe put most of the data in the cloud, and there\u2019s still a very big cost associated with that. If nobody touches it then it may not cost a lot. 
If you have 400 users trying to hit the data all the time, download and work with it, what is the cost for that in perpetuity going forward?\u201d asks Smith.<\/p>\n<p>\u201cOne of the big things we have been leaning into is machine learning and artificial intelligence to help us find out what\u2019s here and where the anomalies are to guide the data and the users,\u201d says Cromwell.<\/p>\n<p>Data managers are working to resolve these logistical issues, but \u201cwe aren\u2019t yet there with technology across the oceanographic field,\u201d Smith says. And that is before they tackle more conceptual questions, such as deciding what data to retain and what can be safely discarded.<\/p>\n<p>Raleigh Martin, an NSF geosciences program director, says the technology side will never be done. \u201cOne of the intrinsic challenges for any data management in the sciences is that we are about change, innovation and growth. We want change, new technologies, computing capabilities and new ways of looking at the data. So, one of the inherent tensions with these repositories is they need to have enough standards to make the data FAIR, but they have to accommodate the fact that things do change.\u201d<\/p>\n<section class=\"cl-wrapper cl-quote-wrapper\"><div class=\"cl-quote  \"><div class=\"cl-quote-image\" style=\"background-image:url(https:\/\/web.uri.edu\/gso\/wp-content\/uploads\/sites\/916\/AGSO-S23_IBishop.jpg)\" title=\"\"><\/div><blockquote>\u201cThe whole trajectory toward interdisciplinary and multi-disciplinary research is greatly benefitted by these networks of computational resources.\u201d<\/blockquote><cite>Ian Bishop, Ph.D. Candidate, GSO<\/cite><\/div><\/section>\n<h3>Enter High-Performance Computing<\/h3>\n<p>And even if a researcher can find and access the data they need, some datasets are so large and complex that they require other tools to make sense of them. 
Bishop manages his genetic data using URI\u2019s new Center for Computational Research (CCR). Launched in January on the Kingston campus, the CCR\u2019s \u201cintent is to give URI researchers access to cutting edge computing,\u201d says its director, physics professor Gaurav Khanna. \u201cThe broader mission is to grow, promote and support all sorts of computational research, which includes high-performance computing, artificial intelligence and quantum computing.\u201d<\/p>\n<p>Khanna founded a similar entity a decade ago at UMass, which is a partner in the Massachusetts Green High Performance Computing Center, along with Harvard, MIT, and Northeastern and Boston Universities. It took two years for URI to establish the CCR: hiring Khanna, modernizing the computing infrastructure, establishing relationships with other regional universities with high-performance computing centers, and moving it through the academic approval process. The pandemic helped to speed things up, as COVID shut down lab-based research, leaving computing and computational analysis as the only research\u2014accessible and conducted remotely\u2014that was moving forward. The CCR\u2019s physical hub consists of 50 interlinked servers with 3,000 processor cores, along with specialized hardware for artificial intelligence, housed in a hydro-powered, zero-carbon facility in Holyoke, Mass. <\/p>\n<p>Developed over the last several decades, high-performance computing has become one of science\u2019s most significant tools, moving hypothesis-generation and experimentation beyond the lab bench into simulations and modeling that can allow you to \u201cdo more at significantly reduced costs,\u201d Khanna says. \u201cIt\u2019s quite a radical shift.\u201d <\/p>\n<p>Simulations allow astrophysicists to conduct physically impossible experiments, like smashing two black holes together, or can be applied to more immediate, earthly concerns. 
Khanna points to GSO professor Isaac Ginis\u2019s complex hurricane modeling, which he has been migrating from a North Carolina facility to URI\u2019s CCR:<\/p>\n<p>\u201cHe is able to do real-time simulations and inform emergency management services in real time about storms that are coming so that they can make appropriate preparations for flooding, on a street-by-street and building-by-building level of prediction,\u201d he says. \u201cWhen it is ready for next hurricane season, it will be very impactful.\u201d<\/p>\n<p>Early career scientists like Bishop see open data plus new data mining tools expanding science\u2019s frontiers.<\/p>\n<p>\u201cWe are getting to this scale of scientific inquiry that requires really diverse skill sets. The whole trajectory toward interdisciplinary and multidisciplinary research is greatly benefitted by these networks of computational resources\u2014people are trying very hard to get people working together and working on bigger, previously intractable issues.\u201d<\/p>\n<h3>Some Things Never Change<\/h3>\n<p>There\u2019s a joke in graduate student circles that a candidate can write an entire dissertation using other people\u2019s data. To be sure, field work continues. Engineers develop new devices to collect more and better data, and scientists are still boarding research vessels to investigate hypotheses\u2014someone has to collect all those zettabytes. But Menden-Deuer, a sea-going oceanographer and director of her own plankton ecology lab, sees no irony at all.<\/p>\n<p>\u201cThere are a lot of very profound lessons in acknowledging others\u2019 data, verification of data sources, and respecting data sources,\u201d she says. \u201cIt also demonstrates the iterative and collaborative nature of science. 
We are all just building upon knowledge generated prior.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Ellen Liberman It took $3 billion dollars, 13 years and a score of universities and research centers in six countries to sequence nearly the entire human genome. In April 2003, the international consortium piloting the effort announced that it had successfully accounted for 92% of the 3.2 billion base pairs that make up a [&hellip;]<\/p>\n","protected":false},"author":2120,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[7,1987],"tags":[],"class_list":["post-180879","post","type-post","status-publish","format-standard","hentry","category-aboard-gso","category-publications"],"acf":[],"_links":{"self":[{"href":"https:\/\/web.uri.edu\/gso\/wp-json\/wp\/v2\/posts\/180879","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/web.uri.edu\/gso\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/web.uri.edu\/gso\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/web.uri.edu\/gso\/wp-json\/wp\/v2\/users\/2120"}],"replies":[{"embeddable":true,"href":"https:\/\/web.uri.edu\/gso\/wp-json\/wp\/v2\/comments?post=180879"}],"version-history":[{"count":5,"href":"https:\/\/web.uri.edu\/gso\/wp-json\/wp\/v2\/posts\/180879\/revisions"}],"predecessor-version":[{"id":180897,"href":"https:\/\/web.uri.edu\/gso\/wp-json\/wp\/v2\/posts\/180879\/revisions\/180897"}],"wp:attachment":[{"href":"https:\/\/web.uri.edu\/gso\/wp-json\/wp\/v2\/media?parent=180879"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/web.uri.edu\/gso\/wp-json\/wp\/v2\/categories?post=180879"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/web.uri.edu\/gso\/wp-json\/wp\/v2\/tags?post=180879"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":tr
ue}]}}