Update: You should read the comments, then check out the extended version of truncate_html that uses Tidy.
For the blogging Web site we wanted snippets of blog posts for the front page and search results. This wasn’t a problem at all for our test blog posts, which we made by pasting lorem ipsum into a text area:
<%= post.body.first(250) %>…
But this did cause a problem for real-world blog posts, where the user would, for some reason beyond my understanding, write the post in Microsoft Word first, then paste in the, er, HTML. I sanitized and removed various bits of HTML, but there came a point where it just didn’t make sense to have
<img src="…
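To make the failure concrete, here is a small sketch in plain Ruby. The sample markup and the image file name are made up for illustration, and `[0, 20]` stands in for the `first(250)` call above so the example runs without Rails:

```ruby
# Hypothetical post body; the image file name is made up.
html = '<p>Hello <img src="cat.png" alt="cat"> world</p>'

# Naive truncation counts markup characters like any others,
# so the cut can land in the middle of a tag.
snippet = html[0, 20]
puts snippet
# prints: <p>Hello <img src="c
```

The snippet ends inside an attribute value, and neither the `img` tag nor the `p` element is ever closed.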
So, I set out to truncate XML properly. To this end I extended String with a truncate_html method. I just stuck it in lib/ in my Rails project and require'd it in the Post model. Here, have it:
require 'rexml/parsers/pullparser'

class String
  # Keep roughly +len+ characters of text while leaving the markup well-formed.
  def truncate_html(len = 30)
    p = REXML::Parsers::PullParser.new(self)
    tags = []       # stack of elements that are currently open
    new_len = len   # characters of text we may still emit
    results = ''
    while p.has_next? && new_len > 0
      p_e = p.pull
      case p_e.event_type
      when :start_element
        tags.push p_e[0]
        results << "<#{tags.last} #{attrs_to_s(p_e[1])}>"
      when :end_element
        results << "</#{tags.pop}>"
      when :text
        # String#first comes from ActiveSupport (Rails)
        results << p_e[0].first(new_len)
        new_len -= p_e[0].length
      else
        # Any other event is dumped into the output as an HTML comment
        results << "<!-- #{p_e.inspect} -->"
      end
    end
    # Close whatever is still open, innermost element first
    tags.reverse.each do |tag|
      results << "</#{tag}>"
    end
    results
  end

  private

  def attrs_to_s(attrs)
    if attrs.empty?
      ''
    else
      attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
    end
  end
end
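For anyone who wants to try the idea outside Rails, here is a standalone sketch of the same pull-parser loop. It is not the code above verbatim: the name `truncate_html_sketch` is made up, it is a plain method rather than a String extension, it uses `String#[]` instead of ActiveSupport's `first`, it only emits a space before attributes when there are any, and it silently drops events other than elements and text. The sample markup is also invented.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical standalone variant of truncate_html (name is an assumption).
def truncate_html_sketch(html, len = 30)
  parser = REXML::Parsers::PullParser.new(html)
  open_tags = []    # stack of elements opened but not yet closed
  remaining = len   # characters of text we may still emit
  out = +''
  while parser.has_next? && remaining > 0
    event = parser.pull
    case event.event_type
    when :start_element
      open_tags.push(event[0])
      # event[1] is a Hash of attribute name => value
      attrs = event[1].map { |name, value| %( #{name}="#{value}") }.join
      out << "<#{event[0]}#{attrs}>"
    when :end_element
      out << "</#{open_tags.pop}>"
    when :text
      out << event[0][0, remaining]
      remaining -= event[0].length
    end
  end
  # Close whatever is still open, innermost element first.
  open_tags.reverse_each { |tag| out << "</#{tag}>" }
  out
end

puts truncate_html_sketch('<p>Hello <b>brave new</b> world</p>', 10)
# prints: <p>Hello <b>brav</b></p>
```

Only text counts toward the limit: "Hello " uses six of the ten characters, "brav" uses the remaining four, and the two open elements are closed in reverse order.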
12 Comments
I _think_ it should be:
tags.reverse.each do |tag|
(I think)
Jason,
You’re right. I’ve fixed it in the post (and my code) accordingly.
Very nice. I made a slight modification so that it won’t truncate in the middle of an entity reference:
require 'rexml/parsers/pullparser'
require 'htmlentities'

class String
  def truncate_html(len = 30, ellipsis = '…')
    p = REXML::Parsers::PullParser.new(self)
    tags = []
    new_len = len
    results = ''
    while p.has_next? && new_len > 0
      p_e = p.pull
      case p_e.event_type
      when :start_element
        tags.push p_e[0]
        results << "<#{tags.last} #{attrs_to_s(p_e[1])}>"
      when :end_element
        results << "</#{tags.pop}>"
      when :text
        text = HTMLEntities.decode_entities(p_e[0])
        results << HTMLEntities.encode_entities(text.first(new_len), :basic, :named)
        new_len -= text.length
        if new_len < 0
          results << ellipsis
        end
      end
    end
    tags.reverse.each do |tag|
      results << "</#{tag}>"
    end
    results
  end

  private

  def attrs_to_s(attrs)
    if attrs.empty?
      ''
    else
      attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
    end
  end
end
Sorry about that formatting. Here’s the contents of the when :text without the broken HTML:
text = HTMLEntities.decode_entities(p_e[0])
results << HTMLEntities.encode_entities(text.first(new_len), :basic, :named)
new_len -= text.length
if new_len < 0
  results << ellipsis
end
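To illustrate the problem this fix addresses, here is a rough sketch that needs no gems. It does not use HTMLEntities; instead it swaps in a different technique, splitting the text with a regular expression so that each entity reference counts as a single token. The constant name, the method name, and the sample text are all made up.

```ruby
# An entity reference such as &amp; or &#8230;, or else any single character.
ENTITY_OR_CHAR = /&#?\w+;|./m

# Hypothetical helper: keep the first +len+ tokens, never splitting an entity.
def truncate_text_keeping_entities(text, len)
  text.scan(ENTITY_OR_CHAR).first(len).join
end

raw = 'Fish &amp; chips'

puts raw[0, 8]
# prints: Fish &am

puts truncate_text_keeping_entities(raw, 8)
# prints: Fish &amp; c
```

The plain slice leaves a broken `&am` at the end, while the token-based cut keeps `&amp;` whole and counts it as one character.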
Oh, good call Tony. I’ll update the post with that fix (and credit to you and Jason), soon.
Heck, as long as we’re at it, why not replace that line in when :start_element with this?
results << “”
Now your truncated html looks EXACTLY like you would have done it by hand!
BTW, thanks a lot for the function, we use it like crazy
*sigh*
results << "<#{tags.last}#{p_e[1].empty? ? '' : ' '}#{attrs_to_s(p_e[1])}>"
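The same effect can be had by letting each attribute carry its own leading space, so a tag with no attributes gets none. A minimal sketch, where `start_tag` is a hypothetical helper name and the attribute values are invented:

```ruby
# Hypothetical helper: render an opening tag from a name and a Hash of attributes.
def start_tag(name, attrs)
  # Each attribute brings its own leading space, so an empty Hash adds nothing.
  rendered = attrs.map { |key, value| %( #{key}="#{value}") }.join
  "<#{name}#{rendered}>"
end

puts start_tag('p', {})
# prints: <p>

puts start_tag('a', { 'href' => '/posts/1', 'title' => 'First' })
# prints: <a href="/posts/1" title="First">
```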
Made a similar helper in Hpricot.
This is excellent. Thanks guys :)
Thanks for the code! I’ve made a few small changes:
- Tags without attributes were created with a stray space, like `<p >` (so I made the `attrs_to_s` method return a space + attrs, and removed the space from the `start_element` case)
- I added the at_end parameter, so you can add a string before the closing tags, most likely “…”
For example:
Instead of..
>> puts "Something".truncate_html(5) + "…"
=> Someth…
You can do..
>> puts "Something".truncate_html(5, at_end = "…")
=> Someth…
- I also changed `p_e[0].first(new_len)` to `p_e[0][0..new_len]` as p_e[0] didn’t seem to have a `.first` method?
Oh, the code:
http://pastie.org/347690
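The point of an `at_end` string is that it lands inside the still-open elements rather than after them. Here is a small sketch of just that closing step, assuming the truncation loop has already produced the partial markup and the stack of open tags. The helper name `close_open_tags` and the sample values are made up, and this is not the code from the pastie.

```ruby
# Hypothetical helper: append at_end, then close open elements innermost first.
def close_open_tags(partial_html, open_tags, at_end = '')
  closers = open_tags.reverse.map { |tag| "</#{tag}>" }.join
  partial_html + at_end + closers
end

puts close_open_tags('<p>Hello <b>brav', %w[p b], '…')
# prints: <p>Hello <b>brav…</b></p>

puts close_open_tags('<p>Hello <b>brav', %w[p b]) + '…'
# prints: <p>Hello <b>brav</b></p>…
```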
Hi,
Thanks for the great post!
It saved me a lot of time.
One quick question: I want to escape image and embed tags in my HTML text.
What changes do I need?
Thanks,
Sandip R~
3 Trackbacks/Pingbacks
[...] the given amount of characters. Luckily for us Mick Burns has already written a function for Truncating HTML in Ruby. [...]
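[...] A better approach is the one described in this article: use REXML and a queue to do it. The comments below it mention that HTMLEntities can improve the truncation, but in testing I ran into an invalid UTF-8 problem (3 bytes needed, only 2 passed; I don’t know whether my data was at fault), so I gave up on HTMLEntities. That leaves a small bug: after truncation there is a garbled character at the end, most likely a UTF-8 character that got cut off. [...]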
[...] Truncating HTML in Ruby « Mike Burns, Coder (tags: truncate html ruby) [...]
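The stray garbled character described in the second trackback is what one would expect when text is cut by byte count rather than by character, which is how string slicing behaved in Ruby 1.8. A small sketch contrasting the two kinds of cut in a current Ruby; the sample text is made up and the diagnosis of that particular bug is only a guess:

```ruby
# Three characters, each three bytes long in UTF-8.
text = "\u65E5\u672C\u8A9E"

by_bytes = text.byteslice(0, 4)   # one whole character plus one stray byte
by_chars = text[0, 2]             # two whole characters

puts by_bytes.valid_encoding?   # prints: false
puts by_chars.valid_encoding?   # prints: true
puts by_chars.bytesize          # prints: 6
```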