Frontline


Data Mining in the Information Age
by Jon Bashor

The information age is generating virtual mountains of useful data, including huge libraries of electronic information stored in computers around the world. The problem now is how to get at it.

While no one doubts that stored data is a valuable resource, its sheer volume has made finding and retrieving small files of valuable information, a process known as "data mining," increasingly difficult. At a July workshop organized by Berkeley Lab computer scientists, experts from various fields met to discuss standards and practices to improve what they call "metadata," the descriptive information that helps users make sense of libraries of stored data.

"Metadata facilitates access, use and sharing of stored data across cyberspace and time by systematically describing the content, structure and semantics of data residing in databases or files," says Berkeley Lab's Frank Olken, program committee chairman.

The main sponsor of the three-day workshop was the U.S. Environmental Protection Agency, which has amassed volumes of environmental data. To make and defend policies today, the EPA needs to access data from many sources, ensure its validity, and integrate many perspectives, such as air quality, land use, water quality and chemical toxicity. The agency plans to open its databases to the public via the World Wide Web. This will make environmental information available to decision-makers in government and private enterprise, and address the general public's right to know about conditions in their communities.

The workshop was held under the auspices of the International Organization for Standardization's (ISO) Joint Technical Committee on Information Standards. The goal of the workshop, said Bruce Bargmeyer, manager of EPA's Information and Data Management Program, was to bring together metadata experts from a variety of fields and try to find common ways to share data.

"There are not only mountains of data to be conquered, but those mountains come in different varieties," said John McCarthy, a Berkeley Lab computer scientist and chairman of the workshop. "Some data, like those transmitted by satellites, have very large numbers of observations for relatively few variables. Other data libraries, like those on the genetic makeup of humans and other organisms, have many, many related and complex variables." McCarthy was one of the first researchers to coin the term "metadata" based on work performed at Berkeley Lab 25 years ago.

The workshop produced the following recommendations: establishing a web site for exchanging information about metadata efforts; standardizing the operation of metadata registries (the systems defining the structure and format of metadata collections); and creating metadata standards derived from ISO guidelines.
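As a rough sketch of what a metadata registry does, the toy example below records the required structure of a metadata collection and checks records against it. It is purely illustrative; the class and method names are assumptions and are not taken from the ISO guidelines or the workshop's recommendations.

```python
# Illustrative only: a toy in-memory "metadata registry" that records
# the structure required of a metadata collection and validates
# records against it. Names and behavior are assumptions, not the
# registry standards recommended by the workshop.


class MetadataRegistry:
    """Maps a collection name to the fields its metadata must contain."""

    def __init__(self):
        self._schemas = {}

    def register(self, collection, required_fields):
        """Record which fields metadata for this collection must supply."""
        self._schemas[collection] = set(required_fields)

    def validate(self, collection, record):
        """Return the required fields missing from a metadata record."""
        required = self._schemas.get(collection, set())
        return sorted(required - record.keys())


if __name__ == "__main__":
    registry = MetadataRegistry()
    registry.register("air_quality", ["title", "fields", "missing_value"])

    record = {"title": "Hourly ozone readings, 1996", "fields": []}
    print(registry.validate("air_quality", record))  # -> ['missing_value']
```

A shared registry of this kind is what lets different agencies agree on what a complete, usable description of a dataset looks like before they exchange it.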

For workshop information, contact John McCarthy at (510) 486-5307.
