Tiny Text Mining Tools

I have posted to GitHub the very beginnings of a Perl library to support simple, introductory text mining analysis: tiny text mining tools.

Presently the library is implemented as a set of subroutines stored in a single file, supporting:

  • simple in-memory indexing and single-term searching
  • relevancy ranking through term frequency-inverse document frequency (TF-IDF) for searching and classification
  • cosine similarity for clustering and “finding more items like this one”
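
To give a flavor of what these three features involve, here is a minimal sketch. It is written in Python for brevity and is not the code from the library; the function names (build_index, search, tfidf, cosine), the simple tokenizer, and the particular TF-IDF formulation (term count divided by document length, times the natural log of corpus size over document frequency) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(corpus):
    """Build an in-memory index: document name -> Counter of term frequencies.

    `corpus` is assumed to be a dict mapping document names to raw text.
    """
    return {name: Counter(tokenize(text)) for name, text in corpus.items()}


def search(index, term):
    """Single-term search: sorted names of the documents containing `term`."""
    return sorted(name for name, counts in index.items() if term in counts)


def tfidf(index, name, term):
    """Score `term` in document `name` as tf * idf (one common formulation)."""
    counts = index[name]
    total = sum(counts.values())
    if total == 0 or term not in counts:
        return 0.0
    tf = counts[term] / total                            # relative frequency in this document
    df = sum(1 for c in index.values() if term in c)     # number of documents with the term
    idf = math.log(len(index) / df)                      # rarer terms weigh more
    return tf * idf


def cosine(index, a, b):
    """Cosine similarity between the raw term-frequency vectors of two documents."""
    va, vb = index[a], index[b]
    dot = sum(va[t] * vb[t] for t in va if t in vb)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0
```

Ranking search results then amounts to sorting the output of search by tfidf, and "finding more items like this one" amounts to sorting the other documents by their cosine score against the one in hand.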

I use these subroutines and the associated Perl scripts to do quick-and-dirty analysis against corpora of journal articles, books, and websites.

I know, I know. It would be better to implement these things as a set of Perl modules, but I’m practicing what I preach: “Give it away even if it is not ready.” The ultimate idea is to package these things into a single distribution, so that researchers have them at their fingertips instead of having to go through a Web-based application.

3 Responses to “Tiny Text Mining Tools”

  1. Saqib Ali says:

    Hi Eric,

    Not exactly relevant to your blogpost, but I thought you might have some ideas on this.

    I am looking for a mining tool that returns two-word phrases that appear in more than one document in a corpus. For example, take the phrase “Data Ninja”. Since it appears in more than one document in our corpus, the mining tool should return that phrase. The tool should find all such phrases by mining every combination of two adjacent words (forming a phrase) across the documents in the corpus.

    The corpus is simply a long list of journal articles.

    Any ideas, thoughts, suggestions? I have explored doing this using Google Big Query and Apache Solr. But I am looking for something simpler, easier, and scriptable.

    Thanks! 🙂

    • Saqib, I am not familiar with any specific programs that do this work, but I have written applications that have done this work. Make sense? My applications have been based on a set of Perl modules (http://search.cpan.org/dist/Lingua-EN-Ngram/). But I don’t think this really answers your question. You want to: 1) identify a phrase, 2) search for that phrase over a corpus, and 3) return a list of documents that include that phrase. Solr will do that work, but it seems like overkill. –ELM

      • Saqib Ali says:

        Thanks for the response Eric.

        I am experimenting with Apache Solr’s shingle factory right now. I am able to mine the phrases, but it is a little cumbersome. I just think there has to be a better / easier way to do this. Hope springs eternal… 🙂
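
For illustration, the phrase-mining idea discussed in this thread (list every two-word phrase that appears in more than one document) can be sketched in a few lines. This is only a sketch under stated assumptions: it is written in Python, it does not use Lingua::EN::Ngram or Solr, the function names (bigrams, shared_bigrams) are made up for the example, and it treats the corpus as a dict of document names to plain text. It also pairs words across sentence boundaries, which a more careful tool would probably avoid.

```python
import re
from collections import defaultdict


def bigrams(text):
    """Return the set of adjacent word pairs in `text`, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return set(zip(words, words[1:]))


def shared_bigrams(corpus, min_docs=2):
    """Map each two-word phrase found in at least `min_docs` documents
    to the sorted list of document names that contain it."""
    seen = defaultdict(set)  # (word1, word2) -> names of documents containing the pair
    for name, text in corpus.items():
        for pair in bigrams(text):
            seen[pair].add(name)
    return {
        " ".join(pair): sorted(names)
        for pair, names in seen.items()
        if len(names) >= min_docs
    }
```

Given a handful of articles read into a dict, calling shared_bigrams on it returns each shared phrase together with the documents it was found in, which covers both the "identify" and the "list the documents" steps in one pass.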