Archive for March, 2012

Summarizing the state of the Catholic Youth Literature Project

Posted on March 30, 2012 in Uncategorized

This posting summarizes the purpose, process, and technical infrastructure behind the Catholic Youth Literature Project. In a few sentences, the purpose was two-fold: 1) to enable students to learn what it meant to be Catholic during the 19th century, and 2) to teach students the value of reading “closely” as well as from a “distance”. The process of implementing the Project required the time and skills of a diverse set of individuals. The technical infrastructure is built on a large set of open source software, and the interface is far from perfect.

Purpose

The purpose of the project was two-fold: 1) to enable students to learn what it meant to be Catholic during the 19th century, and 2) to teach students the value of reading “closely” as well as from a “distance”. To accomplish this goal a faculty member here at the University of Notre Dame (Sean O’Brien) sought to amass a corpus of materials written for Catholic youth during the 19th century. This corpus was expected to be accessible via tablet-based devices and provide a means for “reading” the texts in the traditional manner as well as through various text mining interfaces.

During the Spring Semester students in a survey class were lent Android-based tablet computers. For a few weeks of the semester these same students were expected to select one or two texts from the amassed corpus for study. Specifically, they were expected to read the texts in the traditional manner (but on the tablet computer), and they were expected to “read” the texts through a set of text mining interfaces. In the end the students were to outline three things: 1) what did you learn by reading the text in the traditional way, 2) what did you learn by reading the text through text mining, and 3) what did you learn by using both interfaces at the same time.

Alas, the Spring semester has yet to be completed, and consequently what the students learned has yet to be determined.

Process

The process of implementing the Project required the time and skills of a diverse set of individuals. These individuals included the instructor (Sean O’Brien), two collection development librarians (Aedin Clements and Jean McManus), and a librarian who could write computer programs (myself, Eric Lease Morgan).

As described above, O’Brien outlined the overall scope of the Project.

Clements and McManus provided the means of amassing the Project’s corpus. A couple of bibliographies of Catholic youth literature were identified. Searches were done against the University of Notre Dame’s library catalog. O’Brien suggested a few titles. From these lists items were selected for inclusion: some for purchase, some from the University library’s collection, and some from the Internet Archive. The items for purchase were acquired. The items from the local collection were retrieved. Both sets of items were then sent off for digitization and optical character recognition. The results of the digitization process were saved on a local Web server. At the same time, the items identified from the Internet Archive were mirrored locally and saved in the same Web space. About one hundred items were selected in all, and they can be seen as a set of PDF files. This process took about two months to complete.

Technical infrastructure

The Project’s technical infrastructure enables “close” and “distant” reading, but the interface is far from perfect.

From the reader’s (I don’t use the word “user” anymore) point of view, the Project is implemented through a set of Web pages. Behind the scenes, the Project is implemented with an almost dizzying array of free and open source software. The most significant processes implementing the Project are listed and briefly described below:

  • mirroring – Many of the text mining services require extensive analysis of the original item. To accomplish this, copies of the texts were mirrored locally. By feeding the venerable wget program with a list of URLs based on Internet Archive unique identifiers, mirroring content locally is trivial.
  • named-entity extraction – There was a desire to list the underlying names, places, and organizations from each text. These things can put a text into a context for the reader. Are there a lot of Irish names? Is there a preponderance of place names from the United States? To accomplish this task and assist in answering these sorts of questions, a Perl script was written around the Stanford Named Entity Recognizer. This script (txt2ner.pl) extracts the entities, looks them up in DBpedia, and saves metadata (abstracts, URLs to images, as well as latitudes & longitudes) describing the entities to a locally defined XML file for later processing. (See an example.) A CGI script (ner.cgi) was then written to provide a reader-interface to these files.
  • parts-of-speech extraction – Just as lists of named entities can be enlightening, so can lists of a text’s parts-of-speech. Are the pronouns generally speaking masculine or feminine? Overall, are the verbs active or passive? To what degree are color words used in the text? To begin to answer these sorts of questions, a Perl script exploited a Perl module called Lingua::TreeTagger. The script (pos-tag.pl) extracts parts-of-speech from a text file and saves the result as a simple tab-delimited file for later use. (See an example.)
  • word/phrase tabulation and concordancing – To support rudimentary word and phrase tabulations, as well as a concordance interface, an Apache module (Concordance.pm) was written around two more Perl modules. The first, Lingua::EN::Ngram, extracts word and phrase occurrences. The second, Lingua::Concordance, provides an object-oriented keyword-in-context interface.
  • metadata enhancement and storage – A rudimentary catalog listing the items in the Project’s corpus was implemented using a Perl module called MyLibrary. The MARC records describing each item in the corpus were first parsed. Desired metadata elements were mapped to MyLibrary fields, facets, and terms. Each item in the corpus was then analyzed in terms of word length as well as readability score through the use of yet another Perl module called Lingua::EN::Fathom. These additional metadata elements were then added to the underlying “catalog”. To accomplish this set of tasks two additional Perl scripts were written (add-author-title.pl and add-size-readability.pl).
  • HTML creation – A final Perl script was written to bring all the parts together. By looping through the “catalog” this script (make-catalog.pl) generates HTML files designed for display on tablet devices. These HTML files make heavy use of jQuery Mobile, and since no graphic designer was a part of the Project, jQuery Mobile was a godsend.
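To make the word/phrase tabulation and concordancing step more concrete, here is a minimal sketch in Python. It is not the Project’s code, which was written in Perl around Lingua::EN::Ngram and Lingua::Concordance. The function names (tokenize, count_ngrams, concordance) and the sample sentence are made up for illustration, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(tokens, n=1):
    """Tabulate every run of n consecutive tokens (n=1 counts single words)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on either side of each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


if __name__ == "__main__":
    # Hypothetical sample text, standing in for the plain text of a digitized item.
    sample = "God is good. The good child loves God, and God loves the good child."
    words = tokenize(sample)
    print(count_ngrams(words, 1).most_common(3))  # most frequent words
    print(count_ngrams(words, 2).most_common(3))  # most frequent two-word phrases
    for line in concordance(words, "god"):
        print(line)
```

Counting phrases and listing keywords in context both start from the same token list, which is why the two services sit comfortably behind a single interface.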

The result — the Catholic Youth Literature Project — is a system that enables the reader to view the texts online as well as do some analysis against them. The system functions in that it does not output invalid data, and it does provide enhanced access to the texts.


The home page is simply a list of covers and associated titles.


The Internet Archive online reader is one option for “close” reading.


The list of parts-of-speech provides the reader with some context. Notice how the word “good” is the most frequently used adjective.


The histogram feature of the concordance allows the reader to see where selected words appear in the text. For example, in this text the word “god” is used rather consistently.
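A histogram like this can be approximated by cutting the text into equal slices and counting the hits in each slice. The following Python sketch shows one way to do it. The function names (dispersion, render), the number of slices, and the sample text are assumptions made for illustration and do not describe how the Project’s concordance actually draws its histogram.

```python
import re


def dispersion(text, word, bins=10):
    """Count occurrences of `word` in each of `bins` equal slices of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = [0] * bins
    for position, token in enumerate(tokens):
        if token == word:
            # Map the token's position to the slice of the text it falls in.
            counts[position * bins // len(tokens)] += 1
    return counts


def render(counts):
    """Draw one row of asterisks per slice of the text."""
    return "\n".join(f"{i + 1:>2} | {'*' * count}" for i, count in enumerate(counts))


if __name__ == "__main__":
    # Hypothetical sample: a word that clusters at the start and appears once later.
    sample = "god " * 3 + "child " * 6 + "god " + "child " * 10
    print(render(dispersion(sample, "god", bins=4)))
```

Rows of roughly equal length suggest a word used consistently throughout the text, while a few long rows suggest the word is concentrated in particular passages.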


A network diagram allows the reader to see what words are used “in the same breath” as a given word. Here the word “god” is frequently used in conjunction with “good”, “holy”, “give”, and “love”.

Summary

To summarize, the Catholic Youth Literature Project is far from complete. First, it has yet to be determined whether or not the implementation has enabled students to accomplish the Project’s stated goals. Does it really enhance the use and understanding of a text? Second, the process of selecting, acquiring, digitizing, and integrating the texts into the library’s collection is not streamlined. Finally, usability of the implementation is still in question. On the other hand, the implementation is more than a prototype and does exemplify how the process of reading is evolving over time.

Summary of the Catholic Pamphlets Project

Posted on March 27, 2012 in Uncategorized

This posting summarizes the Catholic Pamphlets Project — a process to digitize sets of materials from the Hesburgh Libraries collection, add the result to a repository, provide access to the materials through the catalog and “discovery system” as well as provide enhanced access to the materials through a set of text mining interfaces. In a sentence, the Project has accomplished most of its initial goals both on time and under budget.

The Project’s original conception

The Catholic Pamphlets Project began early in 2011 with the writing of a President’s Circle Award proposal. The proposal detailed how sets of Catholic Americana would be digitized in conjunction with the University Archives. The Libraries was to digitize the 5,000 Catholic pamphlets located in Special Collections, and the Archives was to digitize its set of Orestes Brownson papers. In addition, a graduate student was to be hired to evaluate both collections, write introductory essays describing why they are significant research opportunities, and do an environmental scan regarding the use of digital humanities computing techniques applied against digitized content. In the end, both the Libraries and the Archives would have provided digital access to the materials through things like the library catalog, its “discovery” system, and the “Catholic Portal”, as well as laid the groundwork for further digitization efforts.

Getting started

By late Spring a Project leader was identified, and their responsibilities were to coordinate the Libraries’ side of the Project in conjunction with a number of library departments including Special Collections, Cataloging, Electronic Resources, Preservation, and Systems. By this time it was also decided not to digitize the entire collection of 5,000 items, but instead hire someone for the summer to digitize as many items as possible and process them accordingly – a workflow test. In the meantime, in-house and vendor-supplied digitization costs would be compared.

By this time a list of specific people had also been identified to work on the Project, and these people became affectionately known as Team Catholic Pamphlets:

Aaron Bales • Eric Lease Morgan (leader) • Jean McManus • Julie Arnott • Lisa Stienbarger • Louis Jordan • Mark Dehmlow • Mary McKeown • Natasha Lyandres • Rajesh Balekai • Rick Johnson • Robert Fox • Sherri Jones

Work commences

Throughout the summer a lot of manual labor was applied against the Project. A recent graduate from St. Mary’s (Eileen Laskowski) was hired to scan pamphlets. After one or two weeks of work, she was relocated from the Hesburgh Library to the Art Slide Library where others were doing similar work. She used equipment borrowed from Desktop Computing and Network Services (DCNS) and the Slide Library. Both DCNS and the Slide Library were gracious about offering their resources. By the end of the summer Ms. Laskowski had digitized just under 400 pamphlets. The covers were digitized in 24-bit color. The inside pages were gray-scale. Everything was digitized at 600 dots per inch. These pamphlets generated close to 92 GB of data in the form of TIFF and PDF files.

Because the Pamphlets Project was going to include links to concordance (text mining) interfaces from within the library’s catalog, Sherri Jones facilitated two hour-long workshops for interested library faculty and staff in order to explain and describe the interfaces. The first of these workshops took place in the early summer. The second took place in late summer.

In the meantime, two of Jean McManus’s summer students determined the copyright status of each of the 5,000 pamphlets. They used a decision-making flowchart as the basis of their work. This flowchart has since been reviewed by the University’s General Counsel and deemed a valid tool for determining copyright. Of the sum of pamphlets, approximately 4,000 (80%) have been determined to be in the public domain.

Starting around June Team Catholic Pamphlets decided to practice with the technical services aspect of the Project. Mary McKeown, Natasha Lyandres, and Lisa Stienbarger wrote a cataloging policy for the soon-to-be created MARC records representing the digital versions of the pamphlets. Aaron Bales exported MARC records representing the print versions of the pamphlets. PDF versions of approximately thirty-five pamphlets were placed on a Libraries’ Web server by Rajesh Balekai and Rob Fox. Plain text versions of the same pamphlets were placed on a different Web server, and a concordance application was configured against them. Using the content of the copyright database being maintained by Jean McManus’s students, Eric Lease Morgan updated the MARC records representing the print versions to include links to the PDF and concordance versions of the pamphlets. The records were passed along to Lisa Stienbarger who updated them according to the newly created policy. The records were then loaded into a pre-production version of the catalog for verification. Upon examination the Team learned that users of Internet Explorer were not able to consistently view the PDF versions. After some troubleshooting, Rob Fox wrote a work-around to the problem, and the MARC records were changed to reflect new URLs of the PDF versions. Once this work was done the thirty-five records were loaded into the production version of the catalog, and from there they seamlessly flowed into the library’s “discovery system” – Primo. Throughout this time Julie Arnott and Dorothy Snyder applied quality control measures against the digitized content and wrote a report documenting their findings. Team Catholic Pamphlets had successfully digitized and processed thirty-five pamphlets.

With these successes under our belts, and with the academic year commencing, Team Catholic Pamphlets celebrated with a pot-luck lunch and rested for a few weeks.

The workflow test concludes

In early October the Team got together again and unanimously decided to process the balance of the digitized pamphlets in order to put them into production. Everybody wanted to continue practicing with their established workflows. The PDF and plain text versions of the pamphlets were saved on their respective Web servers. The TIFF versions of the pamphlets were saved to the same file system as the library’s digital repository. URLs were generated. The MARC records were updated and saved to pre-production. After verification, they were moved to production and flowed to Primo. What took at least three months earlier in the year now took only a few weeks. By Halloween Team Catholic Pamphlets finished its workflow test processing the totality of the digitized pamphlets.

Access to the collection

There is no single home page for the collection of digitized pamphlets. Instead, each of the pamphlets has been cataloged, and through the use of a command-line search strategy one can pull up all the pamphlets in the library’s catalog: http://bit.ly/sw1JH8

From the results list it is best to view the records’ detail in order to see all of the options associated with the pamphlet.

command-line search results page

From the details page one can download and read the pamphlet in the form of a PDF document or the reader can use a concordance to apply “distant reading” techniques against the content.

details of a specific Catholic pamphlets record

50 most frequently used words in a selected pamphlet

Conclusions and next steps

The Team accomplished most of its goals, and we learned many things, but not everything was accomplished. No graduate student was hired, and therefore no overarching description of the pamphlets (nor content from the Archives) was evaluated. Similarly, no environmental scan regarding use of digital humanities against the collections was done. While 400 of our pamphlets are accessible from the catalog as well as the “discovery system”, no testing has been done to determine their ultimate usability.

The fledgling workflow can still be refined. For example, the process of identifying content to digitize, removing it from Special Collections, digitizing it, returning it to Special Collections, doing quality control, adding the content to the institutional repository, establishing the text mining interfaces, updating the MARC records (with copyright information, URLs, etc.), and ultimately putting the lot into the catalog is a bit disjointed. Each part works well unto itself, but the process as a whole does not run like a well-oiled machine, yet. Like any new workflow, more practice is required.

This Project provided Team members with the opportunity to apply traditional library skills against a new initiative, and it was relished by everybody involved. The Project required the expertise of faculty and staff. It required the expertise of people in Collection Management, Preservation, Technical Services, Public Services, and Systems. Everybody applied their highly developed professional knowledge to a new and challenging problem. The Project was a cross-departmental holistic process, and it even generated interest in participation from people outside the Team. There are many people across the Libraries who would like to get involved with wider digitization efforts because they thought this Project was exciting and had the potential for future growth. They too see it as an opportunity for professional development.

While there are 5,000 pamphlets in the collection, only 4,000 of them are deemed in the public domain (legally digitizable). Four hundred (400) pamphlets were scanned by a single person at a resolution of 600 dots/inch over a period of three months for a total cost of approximately $3,400. This is a digitization rate of approximately 1,200 pamphlets per year at a cost of $13,600. At this pace it would take the Libraries close to 3 1/3 years to digitize the 4,000 pamphlets for an approximate out-of-pocket labor cost of $44,880. If the dots/inch qualification were reduced by half – which still exceeds the needs for quality printing purposes – then it would take a single person approximately 1.7 years to do the digitization at a total cost of approximately $22,440. The time spent doing digitization could be reduced even further if the dots/inch qualification were reduced some more. One hundred fifty dots/inch is usually good enough for printing purposes. Based on our knowledge, it would cost less than $3,000 to purchase three or four computer/scanning set-ups similar to the ones used during the Project. If the Libraries were to hire as many as four students to do digitization, then we estimate the public domain pamphlets could be digitized in less than two years at a cost of approximately $25,000.

There are approximately 184,996 pages of Catholic pamphlet content, and approximately 80% of these pages (4,000 pamphlets of the total 5,000) are legally digitizable – 147,997 pages. A reputable digitization vendor will charge around $0.25/page to do digitization. Consequently, the total out-of-pocket cost of using the vendor is close to $37,000.
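The in-house and vendor estimates above are simple arithmetic, and the following Python sketch reproduces them from the figures stated in this posting. The constant and function names are illustrative only. The in-house rate of 1,200 pamphlets per year is taken as stated, and the unrounded in-house total comes out slightly higher than the $44,880 quoted above, which corresponds to 3.3 years at $13,600 per year.

```python
# Figures taken from the posting; the names are illustrative.
PUBLIC_DOMAIN_PAMPHLETS = 4_000
PAMPHLETS_PER_YEAR = 1_200     # stated in-house rate at 600 dots/inch
COST_PER_YEAR = 13_600         # dollars, i.e. four quarters at about $3,400 each
TOTAL_PAGES = 184_996
PUBLIC_DOMAIN_SHARE = 0.80
VENDOR_PRICE_PER_PAGE = 0.25   # dollars


def in_house_estimate(pamphlets, per_year, cost_per_year):
    """Return (years, dollars) for one person digitizing at a fixed yearly rate."""
    years = pamphlets / per_year
    return years, years * cost_per_year


def vendor_estimate(pages, share, price_per_page):
    """Return (pages, dollars) for outsourcing the legally digitizable pages."""
    digitizable = round(pages * share)
    return digitizable, digitizable * price_per_page


if __name__ == "__main__":
    years, dollars = in_house_estimate(
        PUBLIC_DOMAIN_PAMPHLETS, PAMPHLETS_PER_YEAR, COST_PER_YEAR
    )
    print(f"in-house: {years:.2f} years, ${dollars:,.0f}")

    pages, cost = vendor_estimate(
        TOTAL_PAGES, PUBLIC_DOMAIN_SHARE, VENDOR_PRICE_PER_PAGE
    )
    print(f"vendor: {pages:,} pages, ${cost:,.2f}")
```

Changing PAMPHLETS_PER_YEAR is a quick way to see how a lower scanning resolution, and therefore a faster rate, changes the in-house estimate.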

Team Catholic Pamphlets recommends going forward with the Project using an in-house digitization process. Despite the administrative overhead associated with hiring and managing sets of digitizers, the in-house process affords the Libraries a means to learn and practice with digitization. The results will make the Libraries more informed and better educated and thus empower us to make higher quality decisions in the future.

Patron-Driven Acquisitions: A Symposium at the University of Notre Dame

Posted on March 19, 2012 in Uncategorized

The Professional Development Committee at the Hesburgh Libraries of the University of Notre Dame is sponsoring a symposium on the topic of patron-driven acquisitions:

  • Who – Anybody and everybody is invited
  • What – A symposium
  • When – Monday, May 21, 2012 from 9 o’clock to 1 o’clock, then lunch (included), and then informal roundtable discussions
  • Where – Hesburgh Library Auditorium, University of Notre Dame
  • Cost – free

After lunch and given enough interest, we will also be facilitating roundtable discussions on the topic of the day. To register, simply send your name to Eric Lease Morgan, and you will be registered. Easy!

Need a map? Download a campus map highlighting where to park and the location of the library.

Presentations

Here is a list of the presentations to get the discussion going:

  • Silent Partners in Collection Development: Patron-Driven Acquisitions at Purdue (Judith M. Nixon, Robert S. Freeman, and Suzanne M. Ward) – The Purdue University Libraries was an early implementer of patron-driven acquisitions (PDA). In 2000, interlibrary loan began buying rather than borrowing books that patrons requested. Following a brief review of the origin and reasons for this service, we will report on the results of an analysis of the 10,000 books purchased during the program’s first ten years. We examined data on the users’ status and department affiliations; most frequent publishers; and bibliographers’ analysis of the books in the top six subjects assessing whether the purchases were relevant to the collection. In addition, we will summarize the highlights of a comparative circulation study of PDA books vs. normally acquired books: do patron-selected books or librarian-selected books circulate at a higher rate? The conclusions of these PDA print book investigations encouraged the Libraries to begin an e-book PDA pilot program. We will report some early insights and surprises with this pilot. A librarian with selecting responsibilities in several subject areas will discuss his perspective of the value that PDA programs bring to collection building.
  • The Long Tail of PDA (Dracine Hodges) – Patron-driven acquisitions (PDA) titles are known to generate usage at least once at the moment a short-term loan or purchase is triggered. Despite the current PDA buzz, many remain unconvinced of the potential for ongoing circulation. There is a palpable level of skepticism over the sustainability of this buffet model with regard to user interest and the validity of shrinking librarian mediation in the selection process. To discuss these issues, data for content purchased during Ohio State’s 2009/2010 e-book PDA pilot will be examined. Several years of usage activity will be charted and analyzed for budgetary implications, including cost per use. In addition, key issues surrounding academic library patron-driven collection development philosophies will be explored, particularly those of this period in which traditional methods of collection development must be maintained while concurrently moving toward what appears to be the future with patron-driven collection development.
  • Acquisitions and User Services: responsive and responsible ways to build the collection (Lynn Wiley) – Patron-driven acquisitions (PDA) or purchase on demand programs are a natural extension of what libraries already do, which is to build programs to allow users to gain access to research materials. PDA programs provide for direct accountability on purchase decisions, especially relevant in the present economic situation. The Association of College and Research Libraries 2010 top ten trends in academic libraries (ACRL, 2010) listed PDA as a new force in collection development explaining: “Academic library collection growth is driven by patron demand and will include new resource types.” ACRL noted how this change was facilitated by vendor tools that provide controls for custom-made purchase on demand programs. In consortia settings, a PDA model can broaden access across the collective collection. This presentation describes the evolution of purchase on demand programs at the University of Illinois at Urbana-Champaign (UIUC) and includes a detailed description of several programs recently implemented at UIUC as well as a PDA program within a statewide academic library consortium that tested and analyzed purchase on demand mechanisms for print purchases. These programs describe a natural progression of models used to expand PDAs from ILL requesting to the discovery and selection model where bibliographic records were preselected and then made available in the online catalog for ordering. Statistics on use and user comments will be shared as well as comments on future applications.
  • Demand Driven Acquisitions: University of Notre Dame Experience (Fall 2011 – Spring 2012) (Laura A. Sill and Natasha Lyandres) – Using one time special funding, the Hesburgh Libraries of Notre Dame launched a DDA pilot project for ebooks in conjunction with YBP and Ebrary in September 2011. The implementation date followed several months of planning. The goal of the project was to test patron-driven acquisitions as the method for adding ebook titles of high interest to the library collection. Up until that point, ebooks had been acquired primarily through the purchase of large-scale vendor packages. One such package acquired in July of 2011 was Academic Complete on subscription, which provided access to 70,000 ebooks through the Ebrary platform. Also available to bibliographers and selectors was the ability to place firm orders through YBP for Ebrary titles. Our presentation will provide an overview of the pilot project and our thoughts on the effectiveness of this method vis-à-vis other ebook acquisitions methods currently utilized by the Libraries. We will discuss the particular challenges of running the pilot with Ebrary in conjunction with Academic Complete, as well as future possibilities for expanding our use of DDA to include additional use options such as short-term loans, greater integration with approval plans, and DDA for print.

Speakers

Here is a list of the speakers, their titles, and the briefest of bios:

  • Robert S. Freeman (Associate Professor of Library Science, Reference, Languages and Literatures Librarian) – Robert S. Freeman has worked at Purdue University since 1997, where he is a reference librarian and the liaison to the Department of English as well as the School of Languages and Cultures. He has an M.A. in German from UNC-Chapel Hill and an M.S. in Library and Information Science from University of Illinois at Urbana-Champaign. Interested in the history of libraries, he co-edited and contributed to Libraries to the People: Histories of Outreach (McFarland, 2003). More recently, he co-edited a special issue of Collection Management on PDA.
  • Dracine Hodges (Head, Acquisitions Department) – Dracine Hodges is Head of the Acquisitions Department at The Ohio State University Libraries. Previously, she was the Monographs Librarian and the Mary P. Key Resident Librarian. She received her BA from Wesleyan College and MLIS from Florida State University. She manages the procurement of print and electronic resources for the OSU Libraries. Most of her career has focused on acquisitions, but she has also worked as a reference librarian and in access services. Dracine is active in ALCTS serving on the Membership Committee and as past chair of the Tech Services Workflow Efficiency Interest Group. She is also an editorial assistant for College & Research Libraries and a graduate of the Minnesota Institute.
  • Natasha Lyandres (Head, Acquisitions, Resources and Discovery Services Department (ARDS)) – Natasha Lyandres, MLIS from San Jose State University, began her professional career in 1993 as cataloging and special projects librarian at the Hoover Institution Library and Archives, Stanford University. From 1996 to 2001 she served as Reference and Collections Development Librarian at Joyner Library, East Carolina University. Natasha joined the Hesburgh Libraries of Notre Dame in 2001. She has held positions in the areas of serials, cataloging, acquisitions and electronic resources. Natasha is currently Head of Acquisitions, Resources and Discovery Services Department, and Russian and East European Studies bibliographer.
  • Judith M. Nixon (Professor of Library Science and Education Librarian) – Judith M. Nixon holds degrees from Valparaiso University and University of Iowa. She has worked at Purdue University since 1984 as head of the Consumer & Family Sciences Library, the Management & Economics Library, and the Humanities & Social Science Library. Currently, as Education Librarian, she develops the education collections. Her publishing record includes over 35 articles and books. Her interest in patron-driven acquisitions led to co-editing a special issue of Collection Management that focuses on this topic and a presentation at La Biblioteca Del Futuro in Mexico City in October of 2001.
  • Laura A. Sill (Supervisor, Monographic Acquisitions Unit, ARDS) – Laura A. Sill, MA from the University of Wisconsin-Madison, has been a member of the Hesburgh Libraries of Notre Dame library faculty since 1989. She has held positions in the areas of acquisitions, serials, and systems. Laura is currently Visiting Associate Librarian, supervising Monographic Acquisitions in the Acquisitions, Resources and Discovery Services Department.
  • Suzanne M. Ward (Professor of Library Science and Head, Collection Management) – Suzanne (Sue) Ward holds degrees from UCLA, the University of Michigan, and Memphis State University. She has worked at the Purdue University Libraries since 1987 in several different positions. Her current role is Head, Collection Management. Professional interests include patron-driven acquisitions (PDA) and print retention issues. Sue has published one book and over 25 articles on various aspects of librarianship. She recently co-edited a special issue of Collection Management that focuses on PDA, and her book Guide to Patron-Driven Acquisitions is in press at the American Library Association.
  • Lynn Wiley (Head of Acquisitions and Associate Professor of Library Administration) – Lynn Wiley has been a librarian for over thirty years, working for academic libraries on the East Coast and, since 1995, at the University of Illinois. Lynn worked in public service roles until 2005 when she switched to acquisitions. She has written and presented widely on meeting user needs and provided analysis on how library partnerships can best achieve this. She is active in state, regional and national professional associations and is also on the editorial board of LRTS. Her overall goal is to meet the needs of users easily and seamlessly.