Archive for December, 2010

6th International Data Curation Conference

Posted on December 14, 2010 in Uncategorized

This posting documents my experiences at the 6th International Data Curation Conference, December 6-8, 2010 in Chicago (Illinois). In a sentence, my understanding of the breadth and depth of data curation was reinforced, and the issues of data curation seem to be very similar to the issues surrounding open access publishing.

Day #1

After a few pre-conference workshops, which seemed to be popular, and after the reception the night before, the Conference began in earnest on Tuesday, December 7. The day's presentations were akin to overviews of data curation, mostly from people who were data creators.

One of the keynote addresses was entitled “Working the crowd: Lessons from Galaxy Zoo” by Chris Lintott (University of Oxford & Adler Planetarium). In it he described how images of galaxies taken as a part of the Sloan Digital Sky Survey were classified through crowdsourcing techniques — the Galaxy Zoo. Wildly popular for a limited period of time, its success was attributed to convincing people their task was useful, treating them as collaborators (not subjects), and not wasting their time. He called the whole process “citizen science”, and he has recently launched Zooniverse in the same vein.

“Curation centres, curation services: How many are enough?” by Kevin Ashley (Digital Curation Centre) was the second talk, and in a tongue-in-cheek way, he said the answer was three. He went on to outline the whys and wherefores of curation centers. Different players: publishers, governments, and subject centers. Different motivations: institutional value, reuse, presentation of the data behind the graph, obligation, aggregation, and education. Different debates on who should do the work: libraries, archives, computer centers, institutions, disciplines, nations, localities. He summarized by noting how data is “living”, we have a duty to promote it, it is about more than scholarly research, and finally, three centers are not really enough.

Like Lintott, Antony Williams (Royal Society of Chemistry) described a crowdsourcing project in “ChemSpider as a platform for crowd participation”. He began by demonstrating the myriad ways Viagra has been chemically described on the ‘Net. “Chemical information on the Internet is a mess.” ChemSpider brings together many links from chemistry-related sites and provides a means for editing them in an online environment.

Barend Mons outlined one of the common challenges of metadata. Namely, the computer’s need for structured information and most individuals’ lack of desire to create it. In “The curation challenge for the next decade: Digital overlap strategy or collective brains?” Mons advocated the creation of “nano publications” in the form of RDF statements — assertions — as a possible solution. “We need computers to create ‘reasonable’ formats.”
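
To make the idea concrete, here is a tiny sketch of what a “nano publication” might look like as a single RDF assertion, written in Python with the rdflib library. The URIs and vocabulary are entirely hypothetical; Mons did not prescribe any particular implementation:

    # A minimal sketch of a "nano publication" as one RDF assertion,
    # plus a bit of provenance; the vocabulary and URIs are made up.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS

    EX = Namespace("http://example.org/concepts/")

    g = Graph()
    g.bind("ex", EX)
    g.bind("dcterms", DCTERMS)

    # The assertion itself: subject, predicate, object.
    g.add((EX.malaria, EX.isTransmittedBy, EX.mosquitoes))

    # Minimal provenance so the statement can be attributed and cited.
    g.add((EX.malaria, DCTERMS.creator, Literal("A. Researcher")))

    print(g.serialize(format="turtle"))

Each such assertion is small and structured enough for a computer to reason over, which is exactly the sort of “reasonable” format Mons had in mind.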

“Idiosyncrasy at scale: Data curation in the humanities” by John Unsworth (University of Illinois at Urbana-Champaign) was the fourth presentation of the day. Unsworth began with an interesting set of statements. “Retrieval is a precondition for use, and normalization is a precondition for retrieval, but humanities’ texts are messy and difficult to normalize.” He went on to enumerate types of textual normalization: spelling, vocabulary, punctuation, “chunking”, mark-up, and metadata. He described MONK as a normalization project. He also mentioned a recent discussion on the Alliance of Digital Humanities Organizations site where humanists debated whether or not texts ought to be marked up prior to analysis. In short, idiosyncrasies abound.

The Best Student Paper Award was won by Youngseek Kim (Syracuse University) for “Education for eScience professionals: Integrating data curation and cyberinfrastructure”. In it he described the use of focus group interviews and an analysis of job postings to articulate the most common skills a person needs to be an “eScience professional”. In the end he outlined three sets of skills: 1) the ability to work with data, 2) the ability to collaborate with others, and 3) the ability to work with cyberinfrastructure. The eScience professional needs to have domain knowledge, a collaborative nature, and know how to work with computers. “The eScience professional needs to have a range of capabilities and play a bridging role between scientists and information professionals.”

After Kim’s presentation there was a discussion surrounding the role of the librarian in data curation. While I do not feel very much came out of the discussion, I was impressed with one person’s comment. “If a university’s research data were closely tied to the institution’s teaching efforts, then much of the angst surrounding data curation would suddenly go away, and a strategic path would become clear.” I thought that comment, especially coming from a United States Government librarian, was quite insightful.

The day’s events were (more or less) summarized by Clifford Lynch (Coalition for Networked Information) with some of the following quotes. “The NSF mandate is the elephant in the room… The NSF plans are not using the language of longevity… The whole thing may be a ‘wonderful experiment’… It might be a good idea for someone to create a list of the existing data plans and their characteristics in order to see which ones play out… Citizen science is not only about analysis but also about data collection.”

Day #2

The second day’s presentations were more practical in nature and seemingly geared toward librarians and archivists.

In my opinion, “Managing research data at MIT: Growing the curation community one institution at a time” by MacKenzie Smith (Massachusetts Institute of Technology Libraries) was the best presentation of the conference. In it she described data curation as a “meta-discipline” as defined in Media Ecology by Marshall McLuhan, where information can be described in terms of format, magnitude, velocity, direction, and access. She articulated how data is tricky once a person travels beyond one’s own silo, and she described curation as being about reproducing data, aggregating data, and re-using data. Specific examples include: finding data, publishing data, preserving data, referencing data, making sense of data, and working with data. Like many of the presenters, she thought data curation was not the purview of any one institution or group, but rather a combination. She compared them to layers of storage, management, linking, discovery, delivery, and society. All of these things are done by different groups: researchers, subject disciplines, data centers, libraries & archives, businesses, colleges & universities, and funders. She then presented an interesting pair of case studies comparing & contrasting data curation activities at the University of Chicago and MIT. Finally she described a library’s role as one of providing services and collaboration. In the jargon of Media Ecology, “Libraries are a ‘keystone’ species.”

The Best Paper Award was given to Laura Wynholds (University of California, Los Angeles) for “Linking to scientific data: Identity problems of unruly and poorly bounded digital objects”. In it she pointed out how one particular data set was referenced, accessed, and formatted in three different ways across three different publications. She went on to outline the challenges of identifying which data set to curate and how.

In “Making digital curation a systematic institutional function” Christopher Prom (University of Illinois at Urbana-Champaign) answered the question, “How can we be more systematic about bringing materials into the archives?” Using time granted via a leave of absence, Prom wrote Practical E-Records, which “aims to evaluate software and conceptual models that archivists and records managers might use to identify, preserve, and provide access to electronic records.” He defined trust as an essential component of records management, and outlined the following process for building it: assess resources, write a program statement, engage records producers, implement policies, implement a repository, develop action plans, tailor workflows, and provide access.

James A. J. Wilson (University of Oxford) shared some of his experiences with data curation in “An institutional approach to developing research data management infrastructure”. According to Wilson, the Computing Services center is taking the coordinating role at Oxford when it comes to data curation, but he, like everybody else, emphasized the process is not about a single department or entity. He outlined a number of processes: planning, creation, local storage, documentation, institutional storage, discovery, retrieval, and training. He divided these processes between researchers, computing centers, and libraries. I thought one of the more interesting ideas Wilson described was DaaS (database as a service) where databases are created on demand for researchers to use.
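
The DaaS idea intrigued me enough to sketch it out. The following toy, written in Python against SQLite, is only my guess at the flavor of such a service (a fresh database provisioned on demand) and has nothing to do with Oxford’s actual implementation:

    # A toy illustration of "database as a service"; the storage
    # location, function name, and schema are all hypothetical.
    import os
    import sqlite3

    DATA_DIR = "research_databases"

    def provision_database(project, schema_sql):
        """Create a new database for a research project; return its path."""
        os.makedirs(DATA_DIR, exist_ok=True)
        path = os.path.join(DATA_DIR, project + ".db")
        conn = sqlite3.connect(path)
        conn.executescript(schema_sql)  # set up the researcher's tables
        conn.commit()
        conn.close()
        return path

    # A researcher asks for a simple observations database.
    path = provision_database(
        "primate_observations",
        "CREATE TABLE observations "
        "(id INTEGER PRIMARY KEY, observed_at TEXT, species TEXT, notes TEXT)",
    )
    print("Provisioned", path)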

Patricia Hswe (Penn State University) described how she and a team of other people at the University have broken down information silos to create a data repository. Her presentation, “Responding to the call to curate: Digital curation in practice at Penn State University”, outlined the use of microservices in their implementation, and she explained the successes of CurateCamps. She emphasized how the organizational context of the implementation is probably the most difficult part of the work.

Huda Khan (Cornell University) described an application to create, reuse, stage, and share research data in a presentation called “DataStaR: Using the Semantic Web approach for data curation”. The use of RDF was core to the system’s underlying data structure.
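
Again, this is only a guess at the flavor of such a system and not DataStaR’s actual schema, but describing a staged data set in RDF might look something like the following, with hypothetical URIs and Dublin Core terms:

    # A guess at describing a staged data set with Dublin Core terms;
    # the data set and its URIs are invented for illustration.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS

    DATA = Namespace("http://example.org/datasets/")

    g = Graph()
    g.bind("dcterms", DCTERMS)

    ds = DATA["soil-samples-2010"]
    g.add((ds, DCTERMS.title, Literal("Soil samples, summer 2010")))
    g.add((ds, DCTERMS.creator, Literal("A. Researcher")))
    g.add((ds, DCTERMS.format, Literal("text/csv")))

    print(g.serialize(format="turtle"))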

Since this was the last session in a particular concurrent track, a discussion followed Khan’s presentation. It revolved around errors in metadata, and the discussed solutions fell into three categories: 1) write better documentation and/or descriptions of data, 2) write computer programs to statistically identify errors and then fix them, or 3) have humans do the work. In the end, the solution is probably a combination of all three.
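
As an illustration of the second category, a statistical check can be as simple as flagging values that sit far from the rest of the data and queuing them for human review. The records and threshold below are made up for the sake of the example:

    # A minimal sketch of statistically flagging likely metadata errors
    # for human review; records and threshold are invented.
    from statistics import median

    records = [
        {"id": 1, "year": 1998},
        {"id": 2, "year": 2003},
        {"id": 3, "year": 9021},  # probable typo
        {"id": 4, "year": 2001},
        {"id": 5, "year": 1995},
    ]

    years = [r["year"] for r in records]
    med = median(years)
    mad = median(abs(y - med) for y in years)  # median absolute deviation

    # Flag records whose year is wildly far from the median.
    for r in records:
        if abs(r["year"] - med) > 10 * mad:
            print("Record", r["id"], "has a suspicious year:", r["year"])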

Sometime during the conference I got the idea of creating a word cloud made up of Twitter “tweets” with the conference’s hashtag — #idcc10. In a fit of creativity, I wrote the hack upon my return home, and the following illustration is the result:

[Word cloud illustrating the tweets tagged with #idcc10]
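
In essence, the hack boils down to counting words. A simplified version, assuming the tweets have already been harvested into a plain-text file (one tweet per line), might look like this:

    # Count word frequencies in harvested #idcc10 tweets; a word cloud
    # tool can then scale each word by its count. The input file name
    # and stop word list are arbitrary.
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "at",
                 "for", "rt", "idcc10"}

    with open("idcc10-tweets.txt", encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())

    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)

    for word, n in counts.most_common(50):
        print(word, n)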

Summary

The Conference was attended by approximately 250 people, apparently a record. The attendees were mostly from the United States (obviously), but it was not uncommon to see people from abroad. The Conference was truly international in scope. I was surprised at the number of people I knew but had not seen for a while, since I have not recently been participating in Digital Library Federation-like circles. It was nice to rekindle old acquaintances and make some new ones.

As is to be expected, the presentations outlined apparent successes based on experience. From my perspective, Notre Dame’s experience is just beginning. We ought to learn from this experience, and some of my take-aways include:

  • data curation is not the job of any one university department; there are many stakeholders
  • data curation is a process involving computer technology, significant resources of all types, and policy; all three are needed to make the process functional
  • data curation is a lot like open access publishing but without a lot of the moral high ground

Two more data creator interviews

Posted on December 11, 2010 in Uncategorized

Michelle Hudson and I have had a couple more data creator interviews, and here is a list of themes from them:

  • data types – data types include various delimited text files, narrative texts, geographic information system (GIS) files, images, and videos; the size of the data sets is about 20 GB
  • subject content – the subjects represented by the content include observations of primates and longitudinal studies of families
  • output – the output resulting from these various data sets include scholarly articles and simulations
  • data management – the information is saved on any number of servers located in the Center for Research Computing, under one’s desk, or in a departmental space; some back up and curation happens, but not a lot; there is little if any metadata assigned to the data; migrating data from older versions of software to new versions is sometimes problematic
  • ongoing dissemination – everybody interviewed believes there needs to be a more formalized method for the ongoing management and dissemination of locally created data; some thought the Libraries ought to play a leadership role; others have considered offering the service to the campus community for a fee

Three data webinars

Posted on December 10, 2010 in Uncategorized

Between Monday, November 8 and Thursday, November 11 I participated in three data webinars — a subset of a larger number of webinars facilitated by the ICPSR, and this posting outlines what I learned from them.

Data Management Plans

The first was called “Data Management Plans” and presented by Katherine McNeill (MIT). She gave the briefest of histories of data sharing and noted the ICPSR has been doing this since 1962. With the advent of the recent National Science Foundation announcement requiring data management plans, interest in curation has become keen, especially in the sciences. The National Institutes of Health has had a similar mandate for grants over $250,000. Many of these mandates only specify the need for a “what” when it comes to a plan, and not necessarily the “how”. This is slightly different from the United Kingdom’s way of doing things.

After evaluating a number of plans from a number of places, McNeill identified a set of core issues common to many of them:

  • a description of the project and data
  • standards to be applied
  • short-term storage specifications
  • legal and ethical issues
  • access policies and provisions
  • long-term archiving stipulations
  • funder-specific requirements

How do and/or will libraries support data curation? She answered this question by listing a number of possibilities:

  • instituting an interdisciplinary librarian model
  • creating a dedicated data center
  • getting any (all) librarians up to speed
  • having the scholarly communications librarian lead the efforts
  • creating partnerships with other campus departments
  • participating in a national data service
  • getting funder support
  • coordinating activities through the local office of research
  • doing more inter-university collaborations
  • providing services through professional societies

Somewhere along the line McNeill advocated reading ICPSR’s “Guidelines for Effective Data Management Plans”, which outlines elements of data plans as well as a number of examples.

America’s Most Wanted

The second webinar was “America’s Most Wanted: Top US Government Data Resources” presented by Lynda Kellam (The University of North Carolina at Greensboro). Kellam is a data librarian, and this session was akin to a bibliographic instruction session where a number of government data sources were described:

  • Data.gov – has a lot of data from the Environmental Protection Agency; works a lot like ICPSR; includes “chatter” around data; includes a “cool” preview function
  • Geospatial One Stop – a geographic information system portal with a lot of metadata; good for tracking down sources with a geographic interface
  • FactFinder – a demographic portal for commerce and census data; will soon include a more interactive interface
  • United States Bureau of Labor Statistics – lot o’ labor statistics
  • National Center for Education Statistics – includes demographics for school statistics and provides analysis online
  • DataFerrett – provides you with an applet to download, run, and use to analyze data

Students Analyzing Data

The final webinar I listened to was “Students Analyzing Data in the Large Lecture Class: Active Learning with SDA Online Analysis” by Jim Oberly (University of Wisconsin-Eau Claire). As a historian, Oberly is interested in making history come alive for his students. To do this, he used ICPSR’s Analyze Data Online service, and this webinar demonstrated how. He began by asking questions about the Civil War such as “For economic reasons, would the institution of slavery have died out naturally, and therefore the Civil War would have been unnecessary?” He then identified a data set (New Orleans Slave Sale Sample, 1804-1862) from the ICPSR containing information on the sale of slaves. Finally, he used ICPSR’s online interface to query the data looking for trends in prices. In the end, I believe he was not so sure the War could have been avoided because the prices of slaves seemed unaffected by the political environment. The demonstration was fascinating, and the interface seemingly easy to use.
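
Out of curiosity, I sketched how a similar trend question might be asked offline. The sketch assumes the data set has been exported to a CSV file with “year” and “price” columns (the column names are hypothetical; Oberly himself worked entirely through ICPSR’s online interface):

    # A rough offline analogue of the price-trend query, assuming a
    # CSV export with hypothetical "year" and "price" columns.
    import csv
    from collections import defaultdict
    from statistics import mean

    prices_by_year = defaultdict(list)

    with open("slave-sales.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            prices_by_year[int(row["year"])].append(float(row["price"]))

    # Mean sale price per year; a flat series would suggest prices
    # were unaffected by the political environment.
    for year in sorted(prices_by_year):
        print(year, round(mean(prices_by_year[year]), 2))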

Summary

Based on these webinars it is an understatement to say the area of data is wide, varied, broad, and deep. Much of Library Land is steeped in books, but in the current environment books are only one of many manifestations of data, information, and knowledge. The profession is still grappling with every aspect of raw data. From its definition to its curation. From its organization to its use. From its politics to its economics.

I especially enjoyed seeing how data is being used online. Such use is a growing trend, I believe, and represents an opportunity for the profession. The finding and acquisition of data sets is somewhat of a problem now, but such a thing will become less of a problem later. The bigger problem is learning how to use and understand the data. If the profession were to integrate functions for data’s use and understanding into its systems, then libraries would have a growing role to play. If the profession only seeks to enable find and access, then the opportunities are limited and short-lived. Find and access are things we know how to do. Use and understanding require an adjustment of our skills, resources, and expertise. Are we up to the challenge?