Archive for June, 2014

Visualising Data: A Travelogue

Posted on June 17, 2014 in Uncategorized


Last month a number of us from the Hesburgh Libraries attended a day-long workshop on data visualisation facilitated by Andy Kirk of Visualising Data. This posting documents some of the things I learned.

First and foremost, we were told there are five steps to creating data visualisations. From the handouts and supplemented with my own understanding, they include:

  1. establishing purpose – This is where you ask yourself, “Why is a visualisation important here? What is the context of the visualisation?”
  2. acquiring, preparing and familiarising yourself with the data – Here we reviewed the different data types (open, nominal, ordinal, interval, and ratio), and we were introduced to the hidden costs of massaging and enhancing data, something I do with text mining and others do in statistical analysis.
  3. establishing editorial focus – This is about asking and answering questions regarding the visualisation’s audience. What is their education level? How much time will they have to absorb the content? Which medium (or media) is best suited to the message?
  4. conceiving the design – Using just paper and pencil, draw, brainstorm, and outline the appearance of the visualisation.
  5. constructing the visualisation – Finally, do the work of making the visualisation a reality. Increasingly this work is done by exploiting the functionality of computers, specifically for the Web.

Here are a few meaty quotes:

  • Context is king.
  • Data preparation is a hidden cost in visualization.
  • Data visualisation is a tool for understanding, not fancy ways of showing numbers.
  • Data visualisation is about analysis and communication.

One of my biggest take-aways was the juxtaposition of two spectra: reading to feeling, and explaining to exploring. In other words, to what degree is the visualisation expected to be read or felt, and to what degree does it offer the possibility of explaining or exploring the data? Kirk illustrated the idea like this:

             read
               ^
               |
               |
explain <------+------> explore
               |
               |
               v
             feel

The reading/feeling spectrum reminded me of the usability book entitled Don’t Make Me Think. The explaining/exploring spectrum made me consider interactivity in visualisations.

I learned two other things along the way: 1) creating visualisations is a team effort requiring a constellation of skilled people (graphic designers, statisticians, content specialists, computer technologists, etc.), and 2) it is entirely plausible to combine more than one graphic (data set illustration) into a single visualisation.

Now I just need to figure out how to put these visualisation techniques into practice.

ORCID Outreach Meeting (May 21 & 22, 2014)

Posted on June 13, 2014 in Uncategorized

This posting documents some of my experiences at the ORCID Outreach Meeting in Chicago (May 21 & 22, 2014).

As you may or may not know, ORCID is an acronym for “Open Researcher and Contributor ID”.* It is also the name of a non-profit organization whose purpose is to facilitate the creation and maintenance of identifiers for scholars, researchers, and academics. From ORCID’s mission statement:

ORCID aims to solve the name ambiguity problem in research and scholarly communications by creating a central registry of unique identifiers for individual researchers and an open and transparent linking mechanism between ORCID and other current researcher ID schemes. These identifiers, and the relationships among them, can be linked to the researcher’s output to enhance the scientific discovery process and to improve the efficiency of research funding and collaboration within the research community.

A few weeks ago the ORCID folks facilitated a users’ group meeting. It was attended by approximately 125 people (mostly librarians or people who work in and around libraries), and some of the attendees came from as far away as Japan. The purpose of the meeting was to build community and provide an opportunity to share experiences.

The meeting itself was divided into a number of panel discussions and a “codefest”. The panel discussions described successes (and failures) in creating, maintaining, enhancing, and integrating ORCID identifiers into workflows, institutional repositories, grant application processes, and information systems. Presenters described poster sessions, marketing materials, information sessions, computerized systems, policies, and politics, all surrounding the implementation of ORCID identifiers. Quite frankly, nobody seemed to have a hugely successful story to tell, because too few researchers seem to think there is a need for identifiers. As a librarian and information professional, I understand the problem (as well as the solution), but outside the profession there does not seem to be much of a perceived problem to solve.

That said, the primary purpose of my attendance was to participate in the codefest. There were fewer than a dozen of us coders, and we all wanted to use the various ORCID APIs to create new and useful applications. I was most interested in the possibility of exploiting the RDF output obtainable through content negotiation against an ORCID identifier, for example with the command line application curl:

curl -L -H "Accept: application/rdf+xml" http://orcid.org/0000-0002-9952-7800

Unfortunately, the RDF output only included the merest of FOAF-based information, and I was interested in bibliographic citations.

Consequently I shifted gears, took advantage of the ORCID-specific API, and decided to do some text mining. Specifically, I wrote a Perl program (orcid.pl) that takes an ORCID identifier as input (i.e. 0000-0002-9952-7800) and then:

  1. queries ORCID for all the works associated with the identifier**
  2. extracts the DOIs from the resulting XML
  3. feeds the DOIs to a program called Tika for the purposes of extracting the full text from documents
  4. concatenates the result into a single stream of text, and sends the whole thing to standard output

For example, the following command will create a “bag of words” containing the content of all the writings that are associated with my ORCID identifier and that have DOIs:

$ ./orcid.pl 0000-0002-9952-7800 > morgan.txt
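
For the curious, here is a minimal sketch of how such a script might be put together. It is only an approximation of orcid.pl; the Accept header, the naive DOI-matching regular expression, and the location of the Tika application jar are all assumptions of mine.

#!/usr/bin/env perl
# orcid-sketch.pl - a rough approximation of the workflow described above;
# the Accept header, the DOI regular expression, and the Tika jar location
# are assumptions
use strict;
use warnings;
use LWP::UserAgent;

my $orcid = shift or die "Usage: $0 <orcid-identifier>\n";
my $tika  = './tika-app.jar';    # assumed location of the Tika application

# step #1: query ORCID for all the works associated with the identifier
my $ua       = LWP::UserAgent->new;
my $response = $ua->get( "http://pub.orcid.org/$orcid/orcid-works",
    'Accept' => 'application/orcid+xml' );
die $response->status_line, "\n" unless $response->is_success;

# step #2: extract the DOIs from the resulting XML; a naive regular
# expression is used here instead of a proper XML parser
my %dois = map { $_ => 1 }
    ( $response->decoded_content =~ m{(10\.\d{4,}/[^<"\s]+)}g );

# steps #3 and #4: feed each DOI to Tika, and concatenate the extracted
# plain text to standard output
foreach my $doi ( sort keys %dois ) {
    warn "processing $doi\n";
    print `java -jar $tika --text http://dx.doi.org/$doi`;
}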

Using orcid.pl I proceeded to create a corpus of files based on the ORCID identifiers of eleven Outreach Meeting attendees. I then used my “tiny text mining tools” to do analysis against the corpus. The results were somewhat surprising:

  • The most significant key words shared across the corpus of eleven people included: information, system, site, and orcid.
  • The authors Haak and Paglione wrote the most similar articles. (They both wrote about ORCID.) Morgan and Havert were a very close second. (We both wrote about “information” and “sites”.)
  • The DOIs often point to splash pages, and consequently my “bags of words” included lots of content about cookies and publishers as opposed to meaty journal article content. ***
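
The “tiny text mining tools” themselves are not reproduced here, but the gist of the keyword counting can be illustrated with a naive word-frequency script run against a directory of bag-of-words files. The four-letter minimum and the twenty-five word limit are arbitrary choices of mine, not features of the tools.

#!/usr/bin/env perl
# frequency-sketch.pl - a naive illustration of counting words across a
# directory of bag-of-words files; not the "tiny text mining tools" themselves
use strict;
use warnings;

my $directory = shift or die "Usage: $0 <directory-of-text-files>\n";
my %count;

# tally each word (four letters or longer) found in each file
foreach my $file ( glob "$directory/*.txt" ) {
    open my $fh, '<', $file or die "Can't open $file: $!\n";
    while ( my $line = <$fh> ) {
        $count{ lc $1 }++ while $line =~ /([A-Za-z]{4,})/g;
    }
    close $fh;
}

# list the most frequently used words and their counts
my @words = sort { $count{$b} <=> $count{$a} } keys %count;
my $limit = $#words < 24 ? $#words : 24;
print map { "$_\t$count{$_}\n" } @words[ 0 .. $limit ];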

Ideally, the hack I wrote would allow a person to feed one or more identifiers to a system and get back a report summarizing and analyzing the journal article content at a glance, a quick and easy “distant reading” tool.

I finished my “hack” in one sitting, which gave me time to attend the presentations on the second day.

All of the hacks were added to a pile and judged by a vendor on their utility. I’m proud to say that the hack by Jeremy Friesen, a colleague here at Notre Dame, won a prize. His application followed the links to people’s publications, created a screen dump of each publication’s root page, and made a montage of the result. It was a visual version of orcid.pl. Congratulations, Jeremy!

I’m very glad I attended the Meeting. I reconnected with a number of professional colleagues, and my awareness of researcher identifiers increased. More specifically, there seems to be a growing number of these identifiers. Examples for myself include:

  • ORCID – http://orcid.org/0000-0002-9952-7800
  • ISNI – http://isni.org/isni/0000000035290715
  • VIAF – http://viaf.org/viaf/26290254
  • Library of Congress – http://id.loc.gov/authorities/names/n94036700
  • ResearcherID – http://www.researcherid.com/rid/F-2062-2014
  • Scopus – http://www.scopus.com/authid/detail.url?authorId=25944695600

And for a really geeky good time, I learned to create the following set of RDF triples with the use of these identifiers:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
  <http://dx.doi.org/10.1108/07378831211213201> dc:creator
  "http://isni.org/isni/0000000035290715" ,
  "http://id.loc.gov/authorities/names/n94036700" ,
  "http://orcid.org/0000-0002-9952-7800" ,
  "http://viaf.org/viaf/26290254" ,
  "http://www.researcherid.com/rid/F-2062-2014" ,
  "http://www.scopus.com/authid/detail.url?authorId=25944695600" .

I learned about the (subtle) difference between an identifier and an authority control record. I learned of the advantages and disadvantages of the various identifiers. And through a number of serendipitous email exchanges, I learned about ISNIs, which are an ISO standard for identifiers, seemingly popular in Europe but relatively unknown here in the United States. For more detail, see the short discussion of these things in the Code4Lib mailing list archives.

Now might be a good time for some of my own grassroots efforts to promote the use of ORCID identifiers.

* Thanks, Pam Masamitsu!

** For a good time, try http://pub.orcid.org/0000-0002-9952-7800/orcid-works, or substitute your identifier to see a list of your publications.

*** The problem with splash pages is exactly what the very recent CrossRef Text And Data Mining API is designed to address.

CrossRef’s Text and Data Mining (TDM) API

Posted on June 11, 2014 in Uncategorized

A few weeks ago I learned that CrossRef’s Text And Data Mining (TDM) API had gone to version 1.0, and this blog posting describes my tertiary experience with it.

A number of months ago I learned about Prospect, a fledgling API being developed by CrossRef. Its purpose was to facilitate direct access to full text journal content without going through the hassle of screen scraping journal article splash pages. Since then the API has been upgraded to version 1.0 and renamed the Text And Data Mining API. This is how the API is expected to be used:

  1. Given a (CrossRef) DOI, resolve the DOI using HTTP content negotiation. Specifically, request text/turtle output.
  2. From the response, capture the HTTP header called “links”.
  3. Parse the links header to extract URIs denoting full text, licenses, and people.
  4. Make choices based on the values of the URIs.

What sorts of choices is one expected to make? Good question. First and foremost, a person is supposed to evaluate the license URI. If the URI points to a palatable license, then you may want to download the full text, which seems to come in PDF and/or XML flavors. With version 1.0 of the API, I have discovered that ORCID identifiers are included in the header. I believe these denote the authors/contributors of the articles.

Again, all of this is based on the content of the HTTP links header. Here is an example header, with carriage returns added for readability:

<http://downloads.hindawi.com/journals/isrn.neurology/2013/908317.pdf>;
rel="http://id.crossref.org/schema/fulltext"; type="application/pdf"; version="vor",
<http://downloads.hindawi.com/journals/isrn.neurology/2013/908317.xml>;
rel="http://id.crossref.org/schema/fulltext"; type="application/xml"; version="vor",
<http://creativecommons.org/licenses/by/3.0/>; rel="http://id.crossref.org/schema/license";
version="vor", <http://orcid.org/0000-0002-8443-5196>; rel="http://id.crossref.org/schema/person",
<http://orcid.org/0000-0002-0987-9651>; rel="http://id.crossref.org/schema/person",
<http://orcid.org/0000-0003-4669-8769>; rel="http://id.crossref.org/schema/person"
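
To give a flavor of what steps #1 through #3 involve, here is a minimal sketch in Perl. It is not the extractor.pl library described below, and both the name of the header and the rel/type parsing are assumptions of mine based on the example above.

#!/usr/bin/env perl
# links-sketch.pl - a minimal sketch of steps #1 through #3; not the
# extractor.pl library described below, and the header name plus the
# rel/type parsing are assumptions based on the example header above
use strict;
use warnings;
use LWP::UserAgent;

my $doi = shift or die "Usage: $0 <doi>\n";

# step #1: resolve the DOI via content negotiation, asking for text/turtle
my $ua       = LWP::UserAgent->new;
my $response = $ua->get( "http://dx.doi.org/$doi", 'Accept' => 'text/turtle' );
die $response->status_line, "\n" unless $response->is_success;

# step #2: capture the links header from the response
my $links = $response->header( 'Link' )
    || $response->header( 'Links' )
    || die "No links header found.\n";

# step #3: parse the header into its URIs and their rel/type attributes
foreach my $entry ( split /,\s*(?=<)/, $links ) {
    my ( $uri )  = $entry =~ /<([^>]+)>/;
    my ( $rel )  = $entry =~ /rel="([^"]+)"/;
    my ( $type ) = $entry =~ /type="([^"]+)"/;
    print join( "\t", $rel || '', $type || '', $uri || '' ), "\n";
}

Fed the DOI of an article whose publisher participates, a script along these lines prints the full text, license, and person URIs, one per line; fed most other DOIs, it will find few or no full text links.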

I wrote a tiny Perl library — extractor.pl — used to do steps #1 through #3, above. It returns a reference to a hash containing the values in the links header. I then wrote three Perl scripts which exploit the library:

  1. resolver.cgi – a Web-based application taking a DOI as input and returning the URIs in the links header, if they exist. Your mileage with the script will vary, because most DOIs are not associated with full text URIs.
  2. search.cgi – given a simple query, use CrossRef’s Metadata API to find no more than five articles associated with full text content, and then resolve the links to the full text.
  3. search.pl – a command-line version of search.cgi

Here are a few comments:

  • As a person who increasingly wants direct access to full text articles, I believe the Text And Data Mining API is a step in the right direction. Now all that needs to happen is for publishers to get on board and feed CrossRef the URIs of full text content along with the associated licensing terms.
  • I found the links header to be a bit convoluted, but this is what programming libraries are for. I could not find a comprehensive description of the name/value combinations that can exist in the links header; for example, the documentation alludes to beginning and ending dates.
  • CrossRef seems to have a growing number of interesting applications and APIs which are probably going unnoticed, and there is an opportunity of some sort lurking in there. Specifically, somebody ought to do something with the text/turtle (RDF) output of the DOI resolutions.

More fun with HTTP and bibliographics.