Shooting the Trouble

Yesterday, Conductor users began reporting errors with the Paste functionality of CKEditor, the rich text editor used by Conductor. The problem manifested on some browsers, but not all – Chrome 17 appeared immune, but there were reports of problems from Firefox 3.6, IE 9, IE 8, and Chrome 16.

<Quasi-Religious-Diatribe>

I loathe, despise, and detest Paste from Word for HTML content. It just doesn’t work well.

You would never rely on Google Translate or Babelfish to convert your English brochure to Spanish and then give that brochure to native Spanish speakers. Did you translate that to Castilian, Cuban, Mexican, or some other Spanish dialect? So why would you write your content in Word, then paste it into a browser and expect a high-fidelity copy?

This is a complicated problem that I wish were resolved.  I understand that HTML can be intimidating…especially when you bring CSS and browser compatibility into the mix…but even the “best translation service” is fairly terrible to a native speaker.

So roll up your sleeves and get comfortable with the HTML.  If you are writing anything for the web, you need to understand it.

And hats off to those few people around the world who maintain the Rosetta Stone for Paste from Word functions. It is a thankless job that requires working with some truly horrific, ever-shifting data and attempting, as best you can, to map that to an ever-shifting landscape of browser implementations of HTML.

</Quasi-Religious-Diatribe>

deep breaths…count down from 10, 9, 8, 7, 6, 5, 4, 3, 2, 1…

Okay, I’m back.

The Initial Problem

The problem I was trying to solve was a pernicious Chrome and Safari paste from Word bug. First, let’s follow the steps below to see the initial problem.

Step 1: Copy some simple text from Word

Nothing fancy, just a paragraph with a bulleted list.

Copy from Microsoft Word 2008 for Mac


Step 2: Paste Text into Chrome

Things look reasonable, though I don’t like the copied bullet. At this point, most people save their page and go about their business. It’s a reasonable thing to do. But step 3 reveals the problem.

Paste to Chrome Step 1


Step 3: Toggle into Source View Mode

In the source view, there is a <p>&nbsp;</p> that “just appears”.  As it turns out, that little bit of HTML is present in Step 2, but is for some reason invisible in the CKEditor.

Paste to Chrome Step 2


Step 4: Toggle out of Source View Mode

And the paragraph appears.

Paste to Chrome Step 3


The Initial Yet Problematic Solution

As always, go to Google when you encounter a coding problem.  And as it turns out the CKEditor paste error is a known problem…without an elegant solution.  What was happening is the Paste action in CKEditor was wrapping the content in a P-tag.

If I were to copy “Simple Paragraph” from Word, the pasted content would be something like “<p><p>Simple Paragraph</p></p>”. Chrome would resolve that by creating “<p>&nbsp;</p><p>Simple Paragraph</p>”, but not before CKEditor had rendered Step 2 from above.

CKEditor provides a hook for the Paste event.  So I implemented a solution, but unfortunately it wasn’t adequate for all browsers.  So, I spent a good portion of yesterday investigating what was going on.

Painful Discovery

While I was working on the solution, I discovered something truly horrific. Each browser receives different pasted values from Word, and each browser handles the paste differently.

Chrome on OSX

…60+ lines of XML declarations then…
<!--StartFragment-->

<p class="MsoNormal">This is a paragraph</p>

<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><!--[if !supportLists]--><span style="font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:
Symbol">·<span style="font-family: Times New Roman; font-size: 7pt; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span></span><!--[endif]-->Item one</p>

<p class="MsoListParagraphCxSpLast" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><!--[if !supportLists]--><span style="font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:
Symbol">·<span style="font-family: Times New Roman; font-size: 7pt; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span></span><!--[endif]-->Item two</p>

<!--EndFragment-->

Firefox on OSX

…170 lines of style declarations then…
<p class="MsoNormal">
  This is a paragraph
</p>
<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in; mso-list:l0 level1 lfo1">
  <span style="font-family:Symbol; mso-fareast-font-family:Symbol; mso-bidi-font-family: Symbol">
    <span style="mso-list:Ignore">·
      <span style="font:7.0pt &quot; Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>
    </span>
  </span>
  Item one
</p>
<p class="MsoListParagraphCxSpLast" style="text-indent:-.25in; mso-list:l0 level1 lfo1">
  <span style="font-family:Symbol; mso-fareast-font-family:Symbol; mso-bidi-font-family: Symbol">
    <span style="mso-list:Ignore">·
      <span style="font:7.0pt &quot; Times New Roman&quot;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</span>
    </span>
  </span>
  Item two
</p>

And all of the other browsers receive comparably different information from Microsoft Word.

Final Pasted Value That Is Saved in Conductor

And here we have the “final” pasted value of several different browsers.

Chrome 17 on OS X

<p>This is a paragraph</p><p>·&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>Item one</p><p>·&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>Item two</p>

Firefox 10 on OS X

<p>This is a paragraph</p><ul><li >Item one</li><li >Item two</li></ul>

Safari 5.1.1 on OS X

<p>This is a paragraph</p><ul><li   >Item one</li><li  >Item two</li></ul>

Opera 11.6 on OS X

<p>
  &nbsp;</p>
<p class="MsoNormal">
  &nbsp;</p>
<p class="MsoNormal">
  This is a paragraph</p>
<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in;mso-list:l0 level1 lfo1">
  <span style="font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:
Symbol"><span style="mso-list:Ignore">&middot;<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span>Item one</p>
<p class="MsoListParagraphCxSpLast" style="text-indent:-.25in;mso-list:l0 level1 lfo1">
<span style="font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:
Symbol"><span style="mso-list:Ignore">&middot;<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span>Item two</p>

Chrome on Windows

<div>This is a paragraph</div><div>•<span class="Apple-tab-span" >  </span>Item one</div><div>•<span class="Apple-tab-span" > </span>Item two</div><div><br></div>

IE8 on Windows

This is a paragraph<BR>•&nbsp;Item one<BR>•&nbsp;Item two<BR>

As you can see, the above content varies wildly.  It is the result of three different programs negotiating the Copy/Paste behavior: Microsoft Word, the browser, and CKEditor.  And in many cases, the pasted HTML is quite terrible – I’d say only Firefox and Safari are proper HTML given the source Word document.

The Current Solution to the Paste Issue

Below is the code that I tried to use to fix the original paste from Word problems for Chrome and Safari, but it failed. Keep reading after the code for the approach I eventually settled on:

CKEDITOR.on("instanceReady", function (ev) {
  ev.editor.on("paste", function (e) {
    if (e.data["html"]) {
      // Strip lang, style, size, face, and bizarro Word tags
      var input = e.data["html"].replace(/<([^>]*)(?:lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|"[^"]*"|[^\s>]+)([^>]*)>/gi, "<$1$2>");
      var output = "";

      // The Paste action in CKEditor was wrapping the content in a p-tag;
      // By only using the innerHTML of the first element, the auto wrapping
      // of a p-tag instead wraps the first element in a p-tag.
      // So pasting: <p>Hello</p><p>World</p>
      //   Was Pasted as <p><p>Hello</p><p>World</p></p>
      //   Resolves as <p>&nbsp;</p><p>Hello</p><p>World</p> AND
      //     the <p>&nbsp;</p> was invisible
      //   Paste as Hello<p>World</p>
      //   Resolves as <p>Hello</p><p>World</p>
      // I have trepidations about this, but it appears to work in a
      // relatively general case.

      // Internet Explorer may not paste well-formed HTML, but instead
      // paste innerHTML
      if ($(input).html() == "" ) {
        output = input;
      } else {

        // Iterate over the top-level DOM elements
        $(input).each(function(key,value){

          // For the first top-level DOM element, we want the innerHTML, so
          // that it can be wrapped by a P-tag…either in the Browser or in
          // the CKEditor
          if (key == 0) {
            output += value.innerHTML;
          } else {
            // outerHTML exists in some browsers as a native property
            // It is likely more reliable than the html method (in fact
            // in Chrome, $(<div>Bob</div>).html() returned Bob)
            if (value.outerHTML == undefined) {
              output += $(value).html();
            } else {
              // Likely more reliably than html(), as it is a native browser
              // method in some "modern" browser
              output += value.outerHTML;
            }
          }
        });
      }
      e.data["html"] = output;
    }
  });
});

Eventually, I opted to clean the HTML on the server side using the following regular expression:

text.sub!(/\A(\<p[^\>]*\>[\t\s]*\&nbsp\;[\t\s]*\<\/p\>)*/m,"")
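
To make the server-side cleanup concrete, here is a minimal sketch of the same idea as a small method. The method name and constant are hypothetical (they are not from Conductor), and the pattern is the one above with the unnecessary escapes removed. It only strips empty paragraphs at the very start of the text.

```ruby
# Hypothetical names; the pattern mirrors the regular expression above.
LEADING_EMPTY_PARAGRAPHS = /\A(<p[^>]*>\s*&nbsp;\s*<\/p>)*/m

# Returns a copy of html without the leading <p>&nbsp;</p> paragraphs.
def strip_leading_empty_paragraphs(html)
  html.sub(LEADING_EMPTY_PARAGRAPHS, "")
end

puts strip_leading_empty_paragraphs("<p>&nbsp;</p><p>Simple Paragraph</p>")
```

An empty paragraph later in the text is left alone, since the pattern is anchored to the start of the string with \A.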

HTML Emails Inserting Spaces in Odd Locations

About two weeks ago, my team received a report of a problem in one of our system generated emails.  A small handful of the words in longer paragraphs were being split.

For example, there was a long paragraph (200 words or so), and the word “condition” was split into “condit ion” – a strange problem but one related to a previously discovered limitation in the venerable yet pervasive sendmail program which we used for delivering the emails.

The Challenge

Sendmail splits long lines after the 998th character. It does this by adding a carriage return (like hitting Return on your keyboard). What was happening is the “t” in “condition” was at the 998th character. Further muddying the water was the fact that we were dealing with escaped HTML, so a quote (“) is actually represented as &quot;, and there were also tags, which are invisible to a human.

The Fumbling

I was aware of the 998th character issue of sendmail, but didn’t know of a good work around.  I started chatting with Jaron, a fellow Notre Dame programmer and good friend of mine, about the problem.

We both initially understood HTML emails to be something that simply worked, which clearly was not the case.

Important Sidebar

Instead of starting the chat by stating the root cause, I asked for help with my proposed solution – a regular expression to add a carriage return after every period, so long as it wasn’t part of an attribute of an HTML tag.

Thankfully, I only spent four minutes going down that path before I stated the root cause – carriage returns were being injected into an HTML email and breaking words.

When looking for help with a problem, don’t ask for help on a problem related to your proposed solution. Instead clearly state your understanding of the initial problem.  Then state your proposed solution for correcting the problem.

The Solution

After a lot of trial and error, we eventually settled on setting the Content-Transfer-Encoding of the email’s HTML part to base64 and encoding the HTML part in base64.

Below is our Rails 3.0.11 solution. It hasn’t been “cleaned up”, but it highlights the key take-aways:

# Rails.root/app/models/notifier.rb
class Notifier < ActionMailer::Base
  def general_email
    # important configuration stuff
    # setting @object for template access
    mail { |format|
      format.text
      format.html(:content_transfer_encoding => 'base64')
    }.deliver
  end
end

# Rails.root/app/views/notifier/general_email.html.erb
<%= Base64.encode64(%(<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>#{@object.subject}</title>
</head>
<body>
  #{ @object.body.html_safe }
</body>
</html>))%>
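
To see why base64 sidesteps the 998-character limit, here is a small standalone sketch using Ruby’s standard Base64 library. The sample paragraph is made up for illustration. Base64.encode64 inserts a line feed after every 60 encoded characters, so no encoded line comes anywhere near the point where sendmail would split it.

```ruby
require "base64"

# A made-up paragraph on a single line, well over 998 characters long.
long_line = "<p>" + ("condition " * 120).strip + "</p>"

encoded = Base64.encode64(long_line)

# Length of the longest line in the encoded body (without the line feed).
longest_encoded_line = encoded.lines.map { |line| line.chomp.length }.max

# The mail client decodes the part, so the reader sees the original text.
round_trip = Base64.decode64(encoded)

puts "original line length: #{long_line.length}"
puts "longest encoded line: #{longest_encoded_line}"
puts "round trip intact:    #{round_trip == long_line}"
```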

Content is King and the King needs to Move

It appears that Marketing Communication’s message concerning “Content is King” is taking hold. Our copywriter, Mike Roe, is completely booked for the July 2011 to July 2012 fiscal year. (In fact I suspect he’s overbooked.)

Let’s Talk about Chess for a Moment

I’ve played quite a bit of chess, though strangely never in a tournament.  My 9th grade son has been involved in chess since kindergarten so I’ve been to my fair share of tournaments.

One of the unique moves in chess is castling your king; it serves two purposes: one, as a protective measure to get your king out of the volatile middle columns of the board, and two, to bring your powerful rook into play.

Now Back to Our Regular Programming

Creating content takes time and that is one precious resource you are not getting more of. Your website is a window into your world and your message.  Email, Twitter, Facebook, LinkedIn, and at some point Google+ are other windows into your world.

The message you are conveying is completely and entirely dependent on the content pieces you produce.

Think of these pieces of content as pieces on a chess board.  As those pieces are developed and pushed into the field of play, you should also be thinking about castling your king…content.

But be wary as your defensive positioning can easily become a liability.  If you fail to advance one of your pawns that is guarding the king, an enemy rook or enemy queen on your first rank can quickly spell checkmate.

So as you are working with your content, think about its portability, in part because you don’t want to write it again, but also because the technology platform of today may be gone tomorrow (I know my VCR doesn’t work anymore, how about yours?).

Blogging platforms, such as WordPress and Blogger, provide easy tools for piping data out of your blog and into another blog. Google+ and Facebook are also behind the idea of data liberation, providing a means of downloading everything you’ve done (Scary, right?).

Conductor, Marketing & Communications CMS, also provides a means of getting the information out of your website (the documentation is a little sparse, but it’s one of my goals for the year to flesh it out). And up until now, getting content into Conductor was not nearly as easy.

The Big Reveal

For the past few weeks, along with launching Notre Dame Philosophical Review’s new site, I’ve been working on a data migration tool to move webpages from outside of Conductor into Conductor. The tool is intended for developers (those with an understanding of HTML markup) and greatly assists in moving content into Conductor.

This data migration tool is in its alpha stage, having only successfully moved content onto my local machine. But it will evolve over time.

My Testing Philosophy

In order to verify something is working correctly you must first verify that it was not working.

The above philosophy is a version of the old proverb “If it isn’t broke don’t fix it.” Note the implied act of testing that it is broken. In the various software that I’ve worked on over the past 6 years, I have worked to include and maintain automated test suites. Each application’s test suite includes different kinds of tests as well as tests that were created for a variety of reasons.

The Kinds of Tests

I started writing tests as part of working on Ruby on Rails applications, so my testing experience follows its testing guidelines.

  • Unit tests – testing a small chunk of code (e.g. Determine a Page’s URL upon creation)
  • Functional tests – testing a single action (e.g. Create a Page)
  • Integration tests – testing the interaction of multiple actions (Login → Create a Page → View the new Page → Logout)

Together these kinds of tests make up the vast majority of my automated tests; there are other tests that verify the response of external systems: search.nd.edu and eds.nd.edu are two examples.

The Reasons for a Given Test

In each of these cases, the code is changing. By taking the time to write a handful of tests, I can better understand the problem as well as work at exposing any other underlying issues.

Just this past week, by working on one test, I discovered the solution to a problem that I hadn’t been able to solve.

Why Automated Tests

The big payoff is that if I have a robust test suite, I can run it at any time, over and over, and verify that all the tests pass, which in turn raises my confidence that the tested system is working properly. It does not, however, guarantee that it is working, only that what I’m testing is working.

As an added perk, the tests I write convey what I am expecting the system to do. Which means taking time to understand the tests may help me understand the nuances and interactions of a more complicated software system. The tests also help my fellow programmers understand what is going on.

However, a test suite that runs successfully need not indicate that the system works. It only verifies that the tests work. Which leads to…

Problems with Automated Tests

  • Did I think of all of the possible scenarios?
  • Did I properly configure my test environment?
  • Did I account for differences between my test environment and the production environment?
  • Can I make the test environment as close to the production environment as possible?

And the real big kicker…

Do I Have the Support to Write a Test Suite?

It takes time to write tests, and in some cases people may balk at taking that time, but how are you going to test “the whole system” after you’ve made a “small change” and another…and another…and how about that first change by the new developer…you know the one that came in after your entire programming team got hit by the proverbial bus.

After all…

It Doesn’t Work if it isn’t Tested

I was going to test my software anyway, so why not take the time to have the machine do what it’s best at doing: repetitive tasks. I took time to learn how to write tests and then wrote explicit instructions for my computer to do the things that I should be doing with each code update.

Is it foolproof? No, but so long as I’m learning and “taking notes”, my tests are learning as well. Ultimately these tests reflect my understanding of the software application.

Example of Fixing a Bug with Testing

  • A software bug is reported.
  • I run the test suite and verify all the tests pass.
  • I write a test to duplicate the failure. This test must initially fail; after all, a failing test verifies that something is broken.
  • I update the code until the test passes; I have verified that it is working. I run the test suite and verify all the tests pass.

Now, imagine if your initial test suite had zero tests, and in fixing the bug, you created one test. Run that test after each update to make sure you don’t regress.
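
The steps above can be sketched in a few lines of plain Ruby. Everything here is hypothetical: the method name, the sample markup, and the bug itself (a trailing space that should be stripped) are made up for illustration. In a Rails project, the check would live in a unit test rather than a bare method.

```ruby
# Hypothetical method under test. Before the fix it returned `line`
# unchanged, so the check below failed; `rstrip` is the fix.
def normalize_markup(line)
  line.rstrip
end

# The check written to duplicate the reported failure. It failed first,
# then passed once normalize_markup was updated.
def trailing_space_removed?
  normalize_markup("!/assets/22/photo.jpg! ") == "!/assets/22/photo.jpg!"
end

puts trailing_space_removed? ? "pass" : "FAIL"
```

Once it passes, the check stays in the suite, so any later change that reintroduces the trailing space is caught on the next run.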

Resources

Automated Testing Institute

Kent Beck: Test Driven Development

Jay Fields: How We Test

The Extra Mile is Full of Surprises

Yesterday, a colleague of mine was having problems with an image. It seems that the code she was using, and had always used, to insert an image was not working. After a quick look, I found an extra space in the code that was preventing the IMG-tag from properly rendering. (Note: the site uses Textile to manage its content, so an extra space means the Textile to HTML parser may encounter problems.)

I poked around in the code for a bit, wanting to ensure that Conductor was not “helping” her by inserting an extra space… This did not appear to be the case; and according to my colleague, she had in fact hit the space bar. As a general rule, the system should prevent bad data from being entered. So I updated the system to remove any trailing spaces. As is typical, when I make a change to the code, I write a test that first fails. Then I update the code, and get the test to pass. After all, I can’t verify something is fixed unless I can first verify it is broken.

This fix was proving to be a bit cantankerous. The test was failing in a slightly unexpected way.  And then it struck me… While I was looking and updating the code to prevent the extra spacing from causing a problem, I had stumbled upon another problem. So, I dug in, and found, to my elation, the root cause of a long-standing, erratic bug that I had been chasing around, but never successfully squashing.

I cleaned up the first test that was failing, and wrote a second test to duplicate the behavior that was causing the long standing error. This second test didn’t initially pass, but I knew I was on the right path. So I tweaked my test, and was ecstatic when the test failed. After all, I had to verify that it was failing before I could fix it. I then stepped into the code, and fixed the problem. And the test succeeded! Which meant that my test now verified the expected behavior.

What started out as ensuring proper data, even accidentally entered, became a fix for a long-standing problem. Incidentally, my colleague was also a semi-regular “victim” of the long-standing problem.

Carol, thanks for hanging in there with me as we muddled through this problem.

Pining for a little less change

This past week, one of the problems I’ve worked to solve was why our custom interface to the Google Search Appliance at Notre Dame was choking on the multi-site search of Italian Studies. What was happening was that the syntax for querying both The Devers Program in Dante Studies and Italian Studies was to include a query parameter of as_oq=site:www.dante.nd.edu+site:italianstudies.nd.edu. The resulting URL looked something like this:

http://italianstudies.nd.edu/search?as_oq=site%3Aitalianstudies.nd.edu+site%3Awww.dante.nd.edu&q=cachey

Changes

"mayan changing table" courtesy of smcgee

Note the colon between site and italianstudies.nd.edu has been replaced with %3A; the URL has been encoded. The browser and server correctly handle it; however, a problem arises. If the URL has already been encoded (replacing : with %3A) and it gets re-encoded, the %3A becomes %253A, because, wait for it…

The % is another character that needs to be encoded.

Instead of URI encoding being idempotent, a URI encoded two times is different from the same URI encoded three times. Do you see the potential problem? If you are working with a series of functions or even separate libraries, one of those functions or libraries may decide that it needs to encode a URI before it does something with the URI. But what if the URI was encoded by another process somewhere else in the program chain? There are certainly “best practices” that can be applied for URI encoding, but I believe URI encoding should be idempotent. Hopefully there is a sound reason for it not being idempotent…but at this point, I’m unaware of why.
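
The double-encoding effect is easy to reproduce with Ruby’s standard URI library. This is only a sketch of the behavior described above, using one of the query values from the search URL. It is not the code from our Google Search Appliance interface.

```ruby
require "uri"

raw = "site:italianstudies.nd.edu"

# First pass: the colon becomes %3A.
once = URI.encode_www_form_component(raw)

# Second pass: the percent sign itself is encoded, so %3A becomes %253A.
twice = URI.encode_www_form_component(once)

puts once   # site%3Aitalianstudies.nd.edu
puts twice  # site%253Aitalianstudies.nd.edu

# Decoding once only undoes one layer of encoding.
puts URI.decode_www_form_component(twice) == once
```

One consequence is that every encode has to be paired with exactly one decode: a literal “%3A” in the data and an encoded colon are different values, and the percent sign is what keeps them apart.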

Breaking down those steps

Previously I wrote about Baby stepping into a solution to find URL-based references to assets for Conductor. There were actually a lot of small steps involved in producing the solution. I figure what might help is to break down the steps of my most recent work.

“Natural Language” Date Ranges

tower of babel

"De toren van Babel, Pieter Bruegel de Oude" courtesy of janke

I had received a request for creating custom date ranges: Could you only show events from the 2010 admissions calendar year? Or could you group events by 2010 academic year? Could you group events by academic semester? Could you break down events by quarter? Nothing too complicated… Except where do you store those rules? And do we really mean 2010, or does that mean “current year”?

Possible Solutions

After a bit of brain-storming with Chris, another developer of AgencyND, we came up with a couple of options.

Hard-Coding

stone house

"The true Stone House" courtesy of Feliciano Guimarães

Rarely will you hear someone extol the virtues of hard-coding; in fact it is considered an anti-pattern, something that may be commonly used but is likely detrimental to the long-term health of the system. In layman’s terms, hard-coding is indicative of exhausted design options, short-term solutions, and/or buzzer-beating patches to a system. It won’t hurt you now, but the you of four years from now might very well want to find a DeLorean, get up to 88 miles per hour, and have a few choice words with your present self.

Needless to say, I personally didn’t want to implement a hard-coded solution without first exploring a Domain-Specific solution.

Domain-Specific solution

The domain specific solution is to look for a common means of expression.  Is there a sentence/phrase structure that can help convey the meaning to people and be formal enough that a computer can parse the phrase?  My team explored a bit, and came up with a couple of different phrases that we could use:

  • “starting on 9/14 for 1 year as of 2 years ago”
  • “starting no later than 1 year after 10/1 of 2 years ago”
  • “as of 2 years ago on 10/1 and continuing until 1 year later”
  • “on 10/2 for 1 year as of 2 years ago”

These phrases are not necessarily intended for “public” consumption, but would instead be used primarily by web developers. However, we don’t want the syntax to be overly arcane. Regardless of which phrasing we chose, the above sentences lead to the next step.

Define the Domain

domain

courtesy of rubyblossom

If you squint your eyes there are 6 variables used in calculating the custom date range:

  • starting_month
  • starting_day
  • starting_time_quantity
  • starting_time_unit
  • duration_quantity
  • duration_unit

With a little substitution on the first option, we have the following:

“starting on starting_month/starting_day for duration_quantity duration_unit as of starting_time_quantity starting_time_unit ago”

Define the Expected Results

At this point, I went down a path of short-lived frustration. I immediately jumped into trying to fetch events whose dates fell within the date range while excluding events that didn’t. Ultimately I was trying to do too many things in one function and test. Fortunately, when unit testing, one of the senses I’ve developed is “if I have to do too much to set up a test, then I clearly am not doing something right.” So I decided to break both the function and the test into smaller parts. There would be one test for retrieving by date range, and one test for creating the date range based on the natural language. The challenge wasn’t retrieving events with a given date range; Ruby on Rails’ Object Relational Mapper, ActiveRecord, makes that trivial. The challenge was parsing the sentence correctly. To do this, I spec-ed out a test that I wanted to build towards:

should_parse_date_range(
  "starting on 9/14 for 1 year as of today",
   :as_of => '2010/09/15',
   :start_date => "2010/09/14",
   :end_date => "2011/09/14"
)

Breaking the above down into its components:

should_parse_date_range
This is the method that will verify the result.
"starting on 9/14 for 1 year as of today"
This is the test case I’m wanting to verify.
:as_of => "2010/09/15"
This is the date (i.e. “today”) for which the range is calculated.
:start_date => "2010/09/14"
This is the expected start date of the date range, relative to the :as_of date.
:end_date => "2011/09/14"
This is the expected end date of the date range, relative to the :as_of date.

Then I began to plug away at the implementation. And settled on the following Regular Expression to parse the sentence:

/^starting on (\d{1,2})\/(\d{1,2}) for (\d+) (year|month|day)s? as of (today|now)$/

or in other terms

“starting on starting_month/starting_day for duration_quantity duration_unit as of starting_time_quantity starting_time_unit ago”

Then, armed with Ruby on Rails’ ActiveSupport Time library, I wrote the following code.

# starting_month set from 1st matched Regular Expression group
# starting_day set from 2nd matched Regular Expression group
# duration_quantity set from 3rd matched Regular Expression group
# duration_unit set from 4th matched Regular Expression group

today = Date.today
starting_time_quantity = 0
starting_time_unit = 'years'

# 1.year.ago(10/10/2010).beginning_of_day.in_time_zone
date_range_begin_date = starting_time_quantity.
  send(starting_time_unit).
  ago(today).
  beginning_of_day.in_time_zone

begin_date = Date.civil(
  date_range_begin_date.year, 
  starting_month, 
  starting_day
).beginning_of_day.
  in_time_zone

if begin_date > date_range_begin_date
  begin_date = 1.year.ago(begin_date)
end

# 1.year.since(10/10/2010).beginning_of_day.in_time_zone
end_date = duration_quantity.
  send(duration_unit).
  since(begin_date).
  beginning_of_day.
  in_time_zone

Range.new(begin_date,end_date)

It took a bit to get to the above algorithm, but it was done by incrementally adding test cases; each added test case stated the desired output based on the actual input. And those tests are now run daily as part of the larger test suite. So I can forget about it and move on…until I need to remember what was happening; then I simply refer to the tests to see what is actually happening.
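
For readers who want to try the idea without Rails, here is a minimal sketch that uses only Ruby’s standard Date library instead of ActiveSupport. The method name parse_date_range and the constant are hypothetical, and the sketch assumes the sentence puts a forward-slash between month and day, as in the test case (“9/14”). It handles only the “as of today|now” form, with the “as of” date passed in as an argument.

```ruby
require "date"

# Hypothetical constant: the sentence pattern, with a slash between
# month and day.
DATE_RANGE_PATTERN =
  %r{\Astarting on (\d{1,2})/(\d{1,2}) for (\d+) (year|month|day)s? as of (today|now)\z}

# Returns a Range of Dates, or nil if the sentence is not understood.
def parse_date_range(sentence, as_of = Date.today)
  match = DATE_RANGE_PATTERN.match(sentence)
  return nil unless match

  month    = match[1].to_i
  day      = match[2].to_i
  quantity = match[3].to_i
  unit     = match[4]

  begin_date = Date.civil(as_of.year, month, day)
  # If this year's start date has not arrived yet, use last year's.
  begin_date = begin_date.prev_year if begin_date > as_of

  end_date =
    case unit
    when "year"  then begin_date.next_year(quantity)
    when "month" then begin_date.next_month(quantity)
    else              begin_date + quantity
    end

  begin_date..end_date
end

range = parse_date_range("starting on 9/14 for 1 year as of today", Date.new(2010, 9, 15))
puts "#{range.begin} to #{range.end}"
```

Passing the “as of” date as an argument, rather than calling Date.today inside the method, is what makes the expected start and end dates in a test stable from one day to the next.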

Baby stepping into a solution

shoes

"First steps" courtesy of florence luong

This August I rolled out a new feature for Conductor: Asset References. In short, Conductor now provides a means of tracking which Pages, News, and/or Events make use of a given Asset. This feature is certainly helpful in both trouble-shooting your site and to help you understand where the content is being used.

I’m going to step through the system development that went into this feature.

Data Structures

Conductor is backed by a Relational Database responsible for storing the content of the site. As part of the database, we have a table for Pages, Events, and News. Each of those tables can have one or more of the following fields: content, excerpt, “parts”, and “meta-fields”. A reference to an Asset can be found in any of the preceding attributes as an HTML link, but not all of the tables have the same attributes (see below).

             Events  News  Pages
content      yes     yes   yes
excerpt      yes     yes   no
parts        no      no    yes
meta-fields  yes     yes   yes

As a result there is a bit of a challenge: I don’t want to write three separate functions that look for references to Assets (i.e. one for Events, one for News, etc.). So let’s do a little bit of abstraction, blur our eyes if you will, and say that the heterogeneous Pages, Events, and News are all “ThingsWithContent”; this is sometimes referred to as Duck-Typing.

Calculating an asset reference is not the primary role of a “ThingWithContent”; managing content is. So I’m going to create another table called AssetReference whose sole purpose is to record the relation between ThingsWithContent and Assets by using a many-to-many relationship.

An AssetReference is created by taking a “ThingWithContent” and then checking each of its attributes that could have references to an Asset. This process is run each time a “ThingWithContent” is created or updated.

Observers

"Observadores / Observers" courtesy of Jardín Botánico

Since the primary role of a Page, News, or Event is not checking for AssetReferences, I wanted to loosely couple the AssetReferences calculation; the goal of loose coupling is to insulate the coupled objects from any changes in the other. By implementing the Observer Pattern, I’m able to loosely couple a ThingWithContent to an AssetReference. I created a ReferenceObserver which is notified when a ThingWithContent is created or updated. The ReferenceObserver then builds the AssetReferences by inspecting the observed ThingWithContent and only processing the attributes (i.e. content, excerpt, parts, meta-fields) that the ThingWithContent has implemented. In short, the ReferenceObserver can take any type of ThingWithContent and successfully check for AssetReferences.
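
Here is a minimal plain-Ruby sketch of that inspection step. It is not Conductor’s actual code: the Struct definitions, the after_save method name, and the returned list of asset ids are all assumptions made for illustration, and the real observer records AssetReference rows rather than returning ids.

```ruby
# Hypothetical stand-ins for two kinds of ThingWithContent. Note that
# they do not share the same set of attributes.
Page     = Struct.new(:content, :parts, :meta_fields)
NewsItem = Struct.new(:content, :excerpt, :meta_fields)

class ReferenceObserver
  CANDIDATE_ATTRIBUTES = [:content, :excerpt, :parts, :meta_fields]
  ASSET_LINK = %r{/assets/(?:a-conductor-site\.nd\.edu/)?(\d+)/}

  # Called (by assumption) whenever a ThingWithContent is created or
  # updated. Returns the ids of the assets it links to.
  def after_save(thing)
    CANDIDATE_ATTRIBUTES
      .select { |name| thing.respond_to?(name) }         # only what it implements
      .flat_map { |name| Array(thing.public_send(name)) } # one string or many
      .flat_map { |text| text.to_s.scan(ASSET_LINK).flatten }
      .map(&:to_i)
      .uniq
  end
end

page = Page.new("<a href='/assets/22/a.pdf'>A</a>", ["<img src='/assets/7/b.png'>"], nil)
p ReferenceObserver.new.after_save(page)
```

The observer never asks what class it was handed; it only asks whether the object responds to a given attribute, which is what lets one observer serve Pages, News, and Events.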

Pattern Matching

Each of the implemented fields that might have an AssetReference is then checked by using a Regular Expression to find all of the links to an Asset. Regular expressions are a powerful tool for finding and even replacing text. If you haven’t seen them before, they are extremely daunting. And if you’ve seen them before, they are merely daunting. The regular expression I use is as follows:

/\/assets\/(a-conductor-site\.nd\.edu\/)?(\d+)\//


"Pattern Matching" courtesy of Kent Landerholm

I’m going to break this down so you can understand what’s going on. To simplify, I’m going to remove most of the back-slashes (\). They are a technical necessity but in this instance a learning obstacle, so I’ll remove them.

//assets/(a-conductor-site.nd.edu/)?(\d+)//

Then I’m going to remove the leading and trailing forward-slash (/) as those are used to indicate a regular expression.

/assets/(a-conductor-site.nd.edu/)?(\d+)/

This is a bit more legible but still needs explanation. The chunk “/assets/(a-conductor-site.nd.edu/)?” says to find text that includes “/assets/” and optionally includes “a-conductor-site.nd.edu/”; the parentheses indicate a Group of text and the question mark means the Group is optional. The next chunk, (\d+)/, means a Group of one or more digits followed by a forward-slash; the \d stands for any digit and the plus-sign (+) means one or more of the preceding character.

For example: “/assets/22/” would match the pattern, but “/assets/2a/” would not.
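
To try the pattern out, here is a short sketch that pulls asset ids out of a chunk of HTML with String#scan. The method name asset_ids_in and the sample markup are hypothetical; the pattern is the one above, written with %r{} so the forward-slashes do not need back-slashes.

```ruby
ASSET_LINK_PATTERN = %r{/assets/(a-conductor-site\.nd\.edu/)?(\d+)/}

# Returns the numeric id of every asset link found in the text.
# scan yields one [optional_host, id] pair per match.
def asset_ids_in(html)
  html.scan(ASSET_LINK_PATTERN).map { |_host, id| id.to_i }
end

html = '<a href="/assets/22/report.pdf">Report</a> ' \
       '<img src="/assets/a-conductor-site.nd.edu/315/photo.jpg">'

p asset_ids_in(html)                # [22, 315]
p asset_ids_in("/assets/2a/x.pdf")  # []
```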

Once I have the asset URLs, it is a matter of connecting the Asset to the ThingWithContent.

Conclusion

By keeping the responsibility of determining AssetReferences away from the ThingWithContent, I’m able to keep the core responsibility of a ThingWithContent intact (i.e. provide content). By use of object inspection, I’m able to handle a heterogeneous set of ThingWithContent. And by use of Regular Expressions, I’m able to find and register links to multiple assets.