Truncating HTML in Ruby

Update: You should read the comments, then check out the extended version of truncate_html that uses Tidy.

For the blogging Web site we wanted snippets of blog posts for the front page and search results. This wasn't a problem at all for our test blog posts, which we made by pasting lorem ipsum into a text area:

<%= post.body.first(250) %>…

But this did cause a problem for real-world blog posts, where the user would, for some reason beyond my understanding, write the post in Microsoft Word first, then paste in the, er, HTML. I sanitized and removed various bits of HTML, but there came a point where it just didn't make sense to have a snippet end with

<img src="…
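To make the failure concrete, here is a minimal plain-Ruby sketch of the naive cut. The sample markup and the 30-character limit are made up for illustration, and `body[0, 30]` stands in for `first(30)`, which comes from Rails' ActiveSupport rather than core Ruby.

```ruby
# Hypothetical post body; any markup with a tag near the cut-off shows the same problem.
body = '<p>Holiday photos: <img src="/uploads/beach.jpg" alt="Beach"></p>'

# Naive truncation: count raw characters, tags included.
snippet = body[0, 30]

puts snippet
# prints: <p>Holiday photos: <img src="/
```

The cut lands inside the `<img>` tag and leaves `<p>` unclosed, so the browser has to guess at the rest of the page.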

So, I set out to truncate XML properly. To this end I extended String with a truncate_html method. I just stuck it in lib/ in my Rails project and require'd it in the Post model. Here, have it:

require 'rexml/parsers/pullparser'

class String
  def truncate_html(len = 30)
    p = REXML::Parsers::PullParser.new(self)
    tags = []
    new_len = len
    results = ''
    while p.has_next? && new_len > 0
      p_e = p.pull
      case p_e.event_type
      when :start_element
        tags.push p_e[0]
        results << "<#{tags.last} #{attrs_to_s(p_e[1])}>"
      when :end_element
        results << "</#{tags.pop}>"
      when :text
        results << p_e[0].first(new_len)
        new_len -= p_e[0].length
      else
        results << "<!-- #{p_e.inspect} -->"
      end
    end
    tags.reverse.each do |tag|
      results << "</#{tag}>"
    end
    results
  end

  private

  def attrs_to_s(attrs)
    if attrs.empty?
      ''
    else
      attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
    end
  end
end
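
For trying the idea outside Rails, the following is a minimal standalone sketch of the same pull-parser approach, not a replacement for the code above. The name `truncate_markup` is hypothetical. It slices with `text[0, remaining]` instead of ActiveSupport's `first`, skips every event other than tags and text, and assumes the input is well-formed XML with a single root element, since REXML may reject anything else.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper: keep at most `len` characters of text,
# then close whatever tags are still open.
def truncate_markup(html, len = 30)
  parser = REXML::Parsers::PullParser.new(html)
  open_tags = []   # stack of currently open element names
  remaining = len  # text characters we may still emit
  out = +''

  while parser.has_next? && remaining > 0
    event = parser.pull
    case event.event_type
    when :start_element
      open_tags.push(event[0])
      # Each attribute carries its own leading space, so a tag
      # without attributes gets no stray space before ">".
      attrs = event[1].map { |name, value| %( #{name}="#{value}") }.join
      out << "<#{event[0]}#{attrs}>"
    when :end_element
      out << "</#{open_tags.pop}>"
    when :text
      text = event[0]
      out << text[0, remaining]
      remaining -= text.length
    end
  end

  # Close the tags left open at the cut, innermost first.
  open_tags.reverse_each { |tag| out << "</#{tag}>" }
  out
end

puts truncate_markup('<p>Hello <b>brave new</b> world</p>', 9)
# prints: <p>Hello <b>bra</b></p>
```

Only text counts toward the limit, so tags and attributes never eat into the snippet length. Entity references are still counted and cut as raw characters here.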

13 Comments

  1. Posted January 11, 2007 at 9:24 am | Permalink

    I _think_ it should be:

    tags.reverse.each do |tag|

    (I think)

  2. Posted January 23, 2007 at 9:02 pm | Permalink

    Jason,

    You’re right. I’ve fixed it in the post (and my code) accordingly.

  3. Tony Payne
    Posted February 8, 2007 at 12:26 am | Permalink

    Very nice. I made a slight modification so that it won’t truncate in the middle of an entity reference:
    require 'rexml/parsers/pullparser'
    require 'htmlentities'

    class String
      def truncate_html(len = 30, ellipsis = '…')
        p = REXML::Parsers::PullParser.new(self)
        tags = []
        new_len = len
        results = ''
        while p.has_next? && new_len > 0
          p_e = p.pull
          case p_e.event_type
          when :start_element
            tags.push p_e[0]
            results << "<#{tags.last} #{attrs_to_s(p_e[1])}>"
          when :end_element
            results << "</#{tags.pop}>"
          when :text
            text = HTMLEntities.decode_entities(p_e[0])
            results << HTMLEntities.encode_entities(text.first(new_len), :basic, :named)
            new_len -= text.length
            if new_len < 0
              results << ellipsis
            end
          end
        end
        tags.reverse.each do |tag|
          results << "</#{tag}>"
        end
        results
      end

      private

      def attrs_to_s(attrs)
        if attrs.empty?
          ''
        else
          attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
        end
      end
    end

  4. Tony Payne
    Posted February 8, 2007 at 12:28 am | Permalink

    Sorry about that formatting. Here’s the contents of the when :text without the broken HTML:

    text = HTMLEntities.decode_entities(p_e[0])
    results << HTMLEntities.encode_entities(text.first(new_len), :basic, :named)
    new_len -= text.length
    if new_len < 0
      results << ellipsis
    end

  5. Posted February 8, 2007 at 12:33 am | Permalink

    Oh, good call Tony. I’ll update the post with that fix (and credit to you and Jason), soon.

  6. Colin K
    Posted November 30, 2007 at 5:01 pm | Permalink

    Heck, as long as we’re at it, why not replace that line in when :start_element with this?

    results << “”

    Now your truncated html looks EXACTLY like you would have done it by hand!

    BTW, thanks a lot for the function, we use it like crazy

  7. Colin K
    Posted November 30, 2007 at 5:02 pm | Permalink

    *sigh*

    results << "<#{tags.last}#{p_e[1].empty? ? '' : ' '}#{attrs_to_s(p_e[1])}>"

  8. Posted January 30, 2008 at 1:08 pm | Permalink

    Made a similar helper in Hpricot.

  9. Posted February 20, 2008 at 12:36 pm | Permalink

    This is excellent. Thanks guys :)

  11. Posted December 28, 2008 at 12:20 am | Permalink

    Thanks for the code! I’ve made a few small changes:

    - Tags without attributes were created with a stray space, like `<tag >` (so I made the `attrs_to_s` method return a space plus the attributes, and removed the space from the `start_element` case)

    - I added the at_end parameter, so you can add a string before the closing tags, most likely “…”

    For example:

    Instead of..

    >> puts "Something".truncate_html(5) + "…"
    =>
    Someth

    You can do..

    >> puts "Something".truncate_html(5, at_end = "…")
    =>
    Someth…

    - I also changed `p_e[0].first(new_len)` to `p_e[0][0..new_len]` as p_e[0] didn’t seem to have a `.first` method?

    Oh, the code:

    http://pastie.org/347690

  12. sandip
    Posted June 23, 2009 at 9:02 am | Permalink

    Hi,

    Thanks for the great post!
    It saved me a lot of time.
    One quick question: I wanted to escape image and embed tags in my HTML text.

    What changes do I need?

    Thanks,

    Sandip R~

  13. Kiyoshi
    Posted November 12, 2009 at 6:25 pm | Permalink

    I aggregated all changes made by everyone in the comments (taking care of HTML entity chars and adding an option to append a tail text to the end of the resulting string) and also added an option to not cut words in half when doing the truncate.

    Code: http://gist.github.com/233362

    Some examples:

    >> s = 'read what he said: <a href="http://google.com">&ldquo;One must believe his fate.&rdquo;</a>'
    => "read what he said: <a href=\"http://google.com\">&ldquo;One must believe his fate.&rdquo;</a>"

    >> # 28 truncates "One" into "On", options: word_cut = true, tail = " …"
    >> s.truncate_html(28)
    => "read what he said: <a href=\"http://google.com\">&ldquo;On</a> …"
    >> # 28, word_cut false
    >> s.truncate_html(28, :word_cut=>false)
    => "read what he said: <a href=\"http://google.com\">&ldquo;One</a> …"
    >> # 25 hits exactly inside the entity ref., but it's taken care of
    >> s.truncate_html(25)
    => "read what he said: <a href=\"http://google.com\">&ldquo;</a> …"
    >> # but if you use word_cut = false, it will fetch text until the next space
    >> s.truncate_html(25, :word_cut=>false)
    => "read what he said: <a href=\"http://google.com\">&ldquo;One</a> …"
    >> # and finally the tail text
    >> s.truncate_html(25, :word_cut=>false, :tail => '<a href="#">Read more …</a>')
    => "read what he said: <a href=\"http://google.com\">&ldquo;One</a><a href=\"#\">Read more …</a>"


3 Trackbacks/Pingbacks

  1. [...] the given amount of characters. Luckily for us Mick Burns has already written a function for Truncating HTML in Ruby. [...]

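  2. [...] A better approach is the one described in this article: use REXML and a queue to do the job. The comments below it mention that HTMLEntities can improve the truncation, but in testing I ran into a UTF-8 invalid error (3 bytes needed, only 2 passed; I don't know whether my data is to blame), so I gave up on HTMLEntities. That leaves a small bug: after truncation there is a garbled character at the end, most likely because a UTF-8 character was cut off. [...]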

  3. [...] Truncating HTML in Ruby « Mike Burns, Coder (tags: truncate html ruby) [...]
