Published on Mar 30, 2011
The big data revolution arguably hit science before it hit other institutions. Powerful scientific instruments and pervasive computing have driven quantum leaps in the amount of data available to scientists, raising new challenges for researchers who have had to develop new methods, tools and institutions for managing and exploring massive datasets. Thankfully, their efforts are surfacing valuable lessons for open data innovators in other fields such as public administration, journalism and health care.
In the data-poor world, the devices scientists used to capture and process data were sparsely distributed and intermittently connected. The result was an incomplete, and often outdated, snapshot of the real world.
But distribute billions and perhaps trillions of connected sensors around the planet—just as we are doing today—and virtually every animate and inanimate object on Earth could be generating and transmitting data, including our homes, our cars, our natural and man-made environments, and yes, even our bodies.
Although our bodies are rarely connected to the Internet today, they will be, as sensors embedded in clothing and medical devices enable patients with chronic conditions to report their vitals back to a central database monitored remotely by physicians. Our cars aren’t sharing their data yet, but they soon will, transmitting their speed, fuel performance, location and road conditions with a high degree of accuracy. And while our physical environment (both natural and man-made) may be sparsely connected now, soon we’ll have global insight into air and water quality, vegetation, temperature variations, wind speeds and much, much more, at the click of a button.
It’s not just that the sensors are getting smaller and faster, as Moore’s Law predicts they will. The absolute number of sensors is also exploding as more and more applications emerge. In other words, total data volume is compounding along two exponential curves at once: more sensors, each generating more data.
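The compounding at work here can be sketched numerically. The starting figures and growth rates below are illustrative assumptions, not measurements:

```python
# Illustrative sketch: total data volume is the product of two growing
# factors -- the number of sensors and the output of each sensor.
sensor_count = 1_000_000   # assumed baseline sensor population
mb_per_sensor = 1.0        # assumed MB/day produced by each sensor

for year in range(5):
    total_tb = sensor_count * mb_per_sensor / 1e6  # 1e6 MB = 1 TB
    print(f"year {year}: {total_tb:,.1f} TB/day")
    sensor_count *= 2        # assumed: sensor population doubles yearly
    mb_per_sensor *= 1.6     # assumed: per-sensor output grows ~60%/year
```

Even with these modest assumptions, the combined growth factor (3.2× per year in this sketch) quickly outpaces either trend on its own.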
With the right tools and the right training, scientists can harness this vast cloud of data to revolutionize our ability to model the world and all of its systems, giving us new insights into social and natural phenomena and the ability to forecast trends like climate change with greater accuracy. At the same time, it will revolutionize the practice of science and even alter the basic skill set required to enter the field.
Data granularity is already ushering in some big changes, including a growing reliance on computers. It is also stirring up new controversies as societies wrestle with the social implications of a world with ubiquitous connectivity, where every minute movement or trivial utterance could be detected and recorded for subsequent analysis.
The real challenge for scientists is not collecting the data, but analyzing and making sense of it. And not just each individual data stream in isolation, but the larger emergent patterns arising out of the cacophony of information we are constantly assembling. “We already have orders of magnitude more data than before,” says Euan Adie, who works in the online division of Nature Publishing Group. “It’s not like one person can collect the data, analyze it and then exhaust all the possibilities with it.”
The new data-rich reality has already started driving some fundamental changes in scientific practices and even the structure of scientific communities. Take astronomy, where increasingly powerful instruments like the Sloan Digital Sky Survey and the data they generate have fundamentally changed the discipline. A decade ago astronomy was still largely about small and mostly isolated groups of researchers keeping observational data proprietary and publishing individual results. Now it is organized around large data sets, with data being shared, coded and made accessible to the whole community. In the process, astronomers went from cataloguing dozens, then hundreds, then thousands of galaxies to handling hundreds of thousands and now millions.
The challenges in analyzing large data sets have prompted scientists to pursue some intriguing experiments. In one instance, a team at Oxford and Yale launched Galaxy Zoo, a clever online citizen science project where anyone interested can peer at the wonders of outer space, while simultaneously helping scientists classify the millions of galactic images they have stored up in their databases. At first, researcher Kevin Schawinski assumed “that there may be a couple of dozen hardcore amateur astronomers who might possibly be interested in this.” Three years later, the Galaxy Zoo community is thriving, with more than 275,000 users who have made nearly 75 million classifications of one million different images — far beyond the project’s original goal of classifying 50,000 galaxies. If the scientists behind the project were still laboring on their own, it would have taken them 124 years to classify that many images!
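The 124-year figure is easy to sanity-check. The daily classification rate assumed for the original research team below is hypothetical, chosen only to show the arithmetic:

```python
# Back-of-envelope check of the "124 years" estimate (rate is an assumption).
total_classifications = 75_000_000  # made by Galaxy Zoo's ~275,000 volunteers
team_rate_per_day = 1_660           # hypothetical images/day for a small team

years = total_classifications / (team_rate_per_day * 365)
print(f"about {years:.0f} years")   # roughly matches the article's figure
```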
The most data-savvy scientists are arguably in the biomedical sciences, where researchers have become accustomed to readily accessible “big genomics” databases and networks at a cost no higher than that of connecting to the Internet. The rapid pace of scientific discovery, especially following the sequencing of the human genome, has led many to believe that a pharmaceutical renaissance is just around the corner. And yet, as Gigi Hirsch, executive director of MIT’s Center for Biomedical Innovation, has argued, “it is perhaps the most frustrating fact of the industry that despite an enormous increase in R&D investment, and historical advances in technology through genomics, automation and computation, the number of new drugs produced each year remains at the same level that existed over 40 years ago (about 20 per year).”
You can point the finger at the industry’s extremely long product development times and the high cost of R&D. Both have arguably fostered a fiercely competitive industry culture, with little interest in cooperative ventures and a bias toward zealously protecting intellectual property. Over decades, a highly risk-averse and legalistic corporate culture has entrenched these inefficient practices, often at the expense of opportunities to co-develop early-stage technology tools, establish data standards, share disease-target information or pursue other forms of collaboration that could lift the productivity of the entire industry.
Thankfully this is beginning to change with an infusion of new thinking about intellectual property throughout the life sciences industry. Scientists at Sage Bionetworks in Seattle have argued that human disease biology is so complex, interconnected, and expensive to research that the existing dominant business strategies of building and patenting unique models need to be replaced by an open source alternative.
“Human disease biology has no common languages, no accessible communal repositories and no government, corporate or foundation investment in generating an inclusive resource,” they argue. “Disease biology is characterized by many intelligent academic and commercial researchers in fragmented public and proprietary efforts. As a result, data are often stored as specialized and insulated collections and even when accessible there are barriers to integrating it into complex disease models required to guide research or trials in a meaningful way.”
Sage Bionetworks’ answer is to build a digital commons for biomedical data and predictive disease modeling. They suggest that the best way to evolve necessarily crude initial models of human disease is to have them nurtured by an open contributor network. They hope the network itself will evolve into an engine of human disease model building—one that bridges public and private research institutions around the world. Pooling data and talent from across the life sciences community would in turn enable researchers and drug manufacturers to launch a more coordinated and comprehensive attack on the intractable diseases that have so far stymied the industry.
So what are some of the lessons for data innovators in other fields? I’d like to highlight four that stand out:
1. Prepare for exponential change. More powerful instruments and sensor networks have led to exponential increases in the amount of data available to scientists. This is not only true of data-intensive disciplines like biomedicine and astrophysics but also of other fields like oceanography, where sensor networks are providing researchers with an astonishing wealth of undersea data once accessible only through costly marine expeditions. By comparison, open government initiatives around the world have focused on liberating the mountains of public data that lie dormant inside government agencies, a strategy that has proven in numerous cases to increase transparency and stoke public service innovation when third parties develop their own services using the data. But things are bound to get much more interesting when open government innovators start to harvest new data from the real world, as scientists are doing today.
2. Data literacy. Basic data literacy is assumed amongst scientists, but the general population has nowhere near the level of data literacy that will be required in most professions in the near future. Even in science, recent developments have upped the ante. To do path-breaking science, scientists need to be fluent in large-scale data analytics or partner with someone who is. Skills in managing, presenting and extracting insight from data will be increasingly valuable in other professions too, including marketing, public policy making and journalism, to name a few.
3. Rethinking intellectual property. Data is a valuable asset. It can be costly to gather and manage. For marketers, politicians and other professionals, possession of high-quality data can yield lucrative insights and strategic advantages. It’s not surprising that many organizations – and indeed many scientists – prefer to keep their data proprietary. And yet, the advantages of pooling data through shared repositories and open standards are compelling as well. Less redundancy. Lower costs. Increasingly comprehensive data sets. And a more diverse and capable network for generating new insights. Governments are sensibly concluding that public data created with public dollars should be treated as public assets that anyone can access and use. But data ownership and control issues will certainly become a fault line over which many institutions battle.
4. Breaking down institutional walls. Perhaps the biggest benefit of the big data revolution in science is that the research community increasingly recognizes that no one scientist, team or organization has the scale to create and curate the deluge of data on its own. Research organizations have little choice but to pool the financial and human resources necessary to undertake these large-scale projects. In the process, social media has become an increasingly important tool for breaking down institutional barriers. Researchers using Neptune’s Oceans 2.0 platform, for example, can tag everything from images to data feeds to video streams from undersea cameras, identifying sightings of little-known organisms or examples of rare phenomena. Wikis provide a shared space for group learning, discussion and collaboration, while a Facebook-like social networking application helps connect researchers working on similar problems.
The same kind of cross-institutional collaboration will be key to making effective use of open data in other fields too. Public servants and citizens will need to collaborate across agency walls and jurisdictional boundaries. Journalists will need to join forces to interpret and report on breaking stories that contain a significant data component, like the recent Wikileaks disclosures. Health care providers will use medical data to collaborate around patient health needs, and so on.
A data rich world will generate many new opportunities, but there will be some difficult adjustments and issues such as privacy, intellectual property and national security to confront along the way. “We’re going from a data poor to a data rich world,” says Larry Smarr. “And there’s a lag whenever an exponential change like this transforms the impossible into the routine.” People aren’t necessarily good at thinking about exponential changes, he argues, and as a result, it seems scientists are under-investing in the analysis and visualization tools we need to handle it.
Fortunately there are trailblazers to show us the way in a world where we have data about anything and everything. And in scientific pioneers like Neptune, Galaxy Zoo and Calit2 we are seeing a new kind of analysis, a new kind of science, and a whole new kind of organization come into being. The question now is whether the rest of the world is ready.