Writing, publishing, geekdom, and errata.

Using SED: So You Want to Make an eBook? (Bonus Tips)

This post is part of So You Want to Make an eBook?. I'm releasing this book in sections on my blog, but when it's all finished I will offer the whole thing as a single eBook. Everyone who donates toward its production (use the coffee cups to the right, note that it's because of this effort) will get a free copy of this eBook. You can find all the posts here.

Pressing hard up against some deadlines today, but this tool saved me hours of work.

SED (Stream EDitor) is one of those wonderful tools that has a huge learning curve, but is fantastically powerful. You can get SED for all the major operating systems (though the versions differ in places; the GNU sed on Linux and the BSD sed on Macs don't accept exactly the same syntax). This isn't a required tool, but a damn useful one.

Sed (stream editor) isn't really a true text editor or text processor. Instead, it is used to filter text, i.e., it takes text input, performs some operation (or set of operations) on it, and outputs the modified text.
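A tiny sketch of that filter behaviour (the sample sentence and the variable name are made up for illustration): text goes in on standard input, sed applies one substitution, and the changed text comes out the other side.

```shell
# Feed one line of text into sed and capture whatever comes out.
# s/old/new/ replaces the first match of "old" on each line with "new".
filtered=$(printf 'I love print books\n' | sed 's/print books/eBooks/')

# Show the filtered text; the input itself is never modified.
echo "$filtered"
```

Nothing gets saved anywhere unless you redirect the output to a file yourself.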

Given all the "search for this and replace this" bits in the last section, you might already have an idea of how useful this can be. If you're on Linux or a Mac, you probably already have SED installed. A port to Windows can be found at http://gnuwin32.sourceforge.net/packages/sed.htm.

Rather than explain SED to you, I'm going to point you toward two explanations, the latter of which includes a link to the "SED one-liners". Because the tool is cross-platform, most of what you learn from either tutorial carries over to the other platforms. There's a Mac-specific one at http://face.centosprime.com/macosxw/sed-an-introduction-and-tutorial/, and one that explains all the "One-Liners" at http://www.catonmat.net/blog/sed-one-liners-explained-part-one. The One-Liners are a series of examples of using SED to accomplish specific goals. I'm going to give you two more (watch out for line wrapping!):

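This first one strips all FONT tags out of the text of the file named FOO.BAR. It doesn't matter how complex the font tag is, it's gone. (Strictly speaking, the pattern catches any tag whose name starts with F or f, so something like a FORM tag would go too.)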
sed -e 's/[<][/][Ff][^>]*[>]//g' -e 's/[<][Ff][^>]*[>]//g' foo.bar

This one does several things, and I'm sure there's an easier way to do it, but this is what I figured out. Each pipe character ( | ) separates one step from the next. Only the very first step should be different on Windows boxes, where type FOO.BAR does the same thing as cat FOO.BAR.
1. lists the file
2. gets rid of word-break dashes (the ones at the end of a line, like when categor-
ically print-oriented people do it), but leaves the ones in the middle of lines
3. substitutes @ for each empty line (a double return, i.e. a paragraph break)
4. adds an @ on a new line after every line that starts with a <p tag
5. removes all the line breaks, joining everything into one long line
6. turns the @ symbols back into returns
cat foo.bar | sed 's/-*$//' | sed 's/^$/@/' | sed '/^<p/a @' | sed ':a;N;$!ba;s/\n//g' | sed 's/@/\n/g' > 1.txt
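Here's a hedged walk-through of that pipeline on a small sample, so you can see what it does before pointing it at a real book. The sample text and the throwaway directory are made up for the example, and the sketch assumes GNU sed (the one-line "a @" form and "\n" in a replacement are GNU features, so the sed on a Mac may need tweaks).

```shell
# Work in a throwaway directory so nothing real gets overwritten.
workdir=$(mktemp -d)

# Made-up sample: one word split across lines, one hyphen in mid-line,
# a blank line between paragraphs, and a line starting with a <p tag.
printf '%s\n' \
  'A print-oriented layout is categor-' \
  'ically awkward.' \
  '' \
  '<p>Next paragraph.' > "$workdir/foo.bar"

# The same six steps as the one-liner above, one per line.
cat "$workdir/foo.bar" |
  sed 's/-*$//' |
  sed 's/^$/@/' |
  sed '/^<p/a @' |
  sed ':a;N;$!ba;s/\n//g' |
  sed 's/@/\n/g' > "$workdir/1.txt"

cat "$workdir/1.txt"
```

On this sample the split word should come back as "categorically" on a single line, "print-oriented" keeps its hyphen, and the <p line ends up on a line of its own. One thing to watch: step 5 glues lines together with nothing in between, so ordinary line breaks only survive as word gaps if the source lines end in a space.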

Since I was converting a book from PDF, with hyphenated words as well as hyphens where a word was split between lines, and entirely too many hard carriage returns, this easily saved me four hours of removing all of those by hand.

Sure, the syntax is complicated, and it took me about forty-five minutes of tweaking to get that last one working the way I wanted. But not only did it save me four hours that time, it's going to save me time every time I have to convert a book like that in the future.

Awesome.

Edit: David Levine e-mailed me to let me know some better ways to do this, but Blogger wasn't liking him at the time (possibly because of the code). His comments are below, unaltered. And as for his critique of my first example, he's right - I forgot to put the redirection element " > OUTPUT.BAR " (without the quotes, naturally) at the end of the line.

This first one removes all FONT tags from the file named FOO.BAR. ... sed -e 's/[<][/][Ff][^>]*[>]//g' -e 's/[<][Ff][^>]*[>]//g' foo.bar

No, that will splat the contents of FOO.BAR, minus the FONT tags, to the user's screen. Actually removing the tags from the file requires a different command line and probably the use of an intermediate file.
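A minimal sketch of the intermediate-file approach, added for illustration. The sample content, the throwaway directory, and the name foo.tmp are all assumptions, not anything from the original command line.

```shell
# Throwaway directory and made-up sample content, so no real file is touched.
workdir=$(mktemp -d)
printf '%s\n' '<p><font face="Arial" size="2">Hello</font> world</p>' > "$workdir/foo.bar"

# Write the cleaned text to an intermediate file first. Redirecting straight
# back onto foo.bar would empty the file before sed got a chance to read it.
sed -e 's/[<][/][Ff][^>]*[>]//g' -e 's/[<][Ff][^>]*[>]//g' "$workdir/foo.bar" > "$workdir/foo.tmp" &&
  mv "$workdir/foo.tmp" "$workdir/foo.bar"

# foo.bar now holds the text without the FONT tags.
cat "$workdir/foo.bar"
```

The && means the original is only replaced if sed finished without an error.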

Also, I believe you can simplify the two edits into a single command as follows:

sed 's;</*[Ff][^>]*>;;g'

A list of characters in brackets (e.g. [xyz]) matches any single character in the list. So putting a single character in brackets (e.g. [x]) matches any single character as long as it's that character... in other words, you might as well just write "x".

The expression "/*" matches zero or more slashes, so this will also match the invalid tag "<//FONT...>", but I don't think that matters.
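To see the two versions side by side, here is a small sketch added for illustration; the sample line and variable names are made up. It runs the original two-expression command and the single-expression one over the same text, including the invalid "<//FONT>" tag mentioned above.

```shell
# Made-up sample with an opening tag, a closing tag, and the odd "<//FONT>".
sample='<FONT size="3">big</FONT> and <font color="red">red<//FONT>'

# Original version: one expression for closing tags, one for opening tags.
two_step=$(printf '%s\n' "$sample" | sed -e 's/[<][/][Ff][^>]*[>]//g' -e 's/[<][Ff][^>]*[>]//g')

# Simplified version: "/*" lets a single expression cover both kinds of tag.
one_step=$(printf '%s\n' "$sample" | sed 's;</*[Ff][^>]*>;;g')

echo "two-step: $two_step"
echo "one-step: $one_step"
```

The only difference on this sample should be the "<//FONT>" at the end, which the two-step version leaves alone and the one-step version removes.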

Note, though, that this command (both your version and mine) will fail if the < and > of a FONT tag are not on the same line of the file.
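One possible workaround, sketched here as an addition and not as part of the advice above, is to have sed read the whole file into its buffer before substituting, so a match can run across a line break. The sample text and variable names are made up, and the sketch assumes GNU sed behaviour, where [^>] also matches a newline inside the buffer.

```shell
# Made-up sample where one FONT tag is broken across two lines.
broken=$(printf '%s\n' 'Some <font face="Arial"' 'size="2">text</font> here')

# Line by line: only the closing tag, which sits on a single line, is removed.
per_line=$(printf '%s\n' "$broken" | sed 's;</*[Ff][^>]*>;;g')

# Whole file at once: the loop (:a, N, $!ba) appends every line to the
# buffer, then the substitution runs over all of it in one go.
whole_file=$(printf '%s\n' "$broken" | sed -e ':a' -e 'N' -e '$!ba' -e 's;</*[Ff][^>]*>;;g')

echo "$per_line"
echo "$whole_file"
```

This holds the entire file in memory, which is unlikely to matter for a book-sized text file.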

One more optimization: in your second example, the "cat foo.bar" can be replaced by adding "foo.bar" as the last command-line argument to "sed" as in your first example.
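A short sketch of that change, added for illustration with a made-up sample file: only the first stage of the pipeline is shown, and the rest of the steps would follow it unchanged.

```shell
# Throwaway directory and made-up sample content.
workdir=$(mktemp -d)
printf '%s\n' 'An exam-' 'ple line' > "$workdir/foo.bar"

# First stage as originally written: cat feeds the file to sed through a pipe.
with_cat=$(cat "$workdir/foo.bar" | sed 's/-*$//')

# Same stage with the file name given to sed as its last argument.
without_cat=$(sed 's/-*$//' "$workdir/foo.bar")

echo "$with_cat"
echo "$without_cat"
```

Both forms should print the same two lines with the trailing dash gone.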

Hope this is useful to you.



4 comments :

Edward said...

SED has a huge learning curve partly because the documentation is so astoundingly unhelpful and confusing. SourceForge clearly didn't write it with non-techies in mind. So far, I can't even get SED to run, and I'm not even sure I've got it installed correctly.

Do you have to install GnuWin first or something? Is there a help page somewhere that explains all this in "plain English"?

Sorry, just frustrated... been at this for an hour now...

Edward said...

Okay, never mind... I got it running, sort of. It's a command-line program; I hadn't realized that at first.

Steve Saus said...

Oh, yes. My mistake in not making that clear.

Steve Saus said...

Also see David D. Levine's comment, added into the body of the post above.