
Converting The Text: So You Want to Make an eBook?

This post is part of So You Want to Make an eBook?. I'm releasing this book in sections on my blog, but when it's all finished I will offer the whole thing as a single eBook. Everyone who donates toward its production (use the coffee cups to the right, note that it's because of this effort) will get a free copy of this eBook. You can find all the posts here.

Converting the Text

Right. Got all those programs ready to go? Okay. This is the biggest part in creating an eBook. It's the most boring and attention-to-detail oriented part. It's also the part that makes you look like a professional.

First: If you're going to create a PDF, do that now. Whether you use a built-in "Save as PDF" or a "Print to PDF" solution, do it before you start mucking around with your file. Really. You'll also be able to use this PDF if you need to create screen captures of tables for your ePub, and for when you make a PDF offering. I have yet to see anything that can convert decently from ePub to PDF.

Next, create a new directory (folder, remember?) for this project. For example, I have C:\ePub\Author\Book (which is Windows format, or /home/user/Documents/ePub/Author/Book on *nix... you get the idea), where I substitute the author's name and book title. In the "Book" directory, I put the PDF I just created, and the original RTF. I also create several sub-directories. Here's the directory list from when I converted Jim C. Hines' book Goldfish Dreams:


Note that I didn't use any spaces. If you feel the need to use some kind of spacer, use_the_underscore_character_. Some programs don't like spaces (regardless of operating system), so it's just easier to avoid them altogether.

The "Base" directory is where we will be editing the files for our eBook. You can get a lot of these base files from the sample pack; it has the directory structure already in there. (We'll be handling those more in the next section.)
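If you'd rather script the folder setup than click it together by hand, here's a minimal sketch in Python. It only assumes the Author/Book layout and the "Base" sub-directory described above; the function names safe_name and make_project are my own inventions for illustration, and the sample pack may already give you everything this creates.

```python
from pathlib import Path


def safe_name(name: str) -> str:
    """Replace every run of whitespace with a single underscore."""
    return "_".join(name.split())


def make_project(root: str, author: str, title: str) -> Path:
    """Create root/Author/Book/Base (no spaces in names) and return the Book folder."""
    book_dir = Path(root) / safe_name(author) / safe_name(title)
    # "Base" is the only sub-directory named in the text; add your others here.
    (book_dir / "Base").mkdir(parents=True, exist_ok=True)
    return book_dir
```

Calling make_project("C:/ePub", "Jim C Hines", "Goldfish Dreams") would give you a Jim_C_Hines/Goldfish_Dreams/Base tree under C:/ePub. Running it twice is harmless, since existing folders are left alone.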

Open the RTF file in your word processor, select all, and make it all the same size and font. Be careful that you do not lose heading, bold, underline, and italics when you do this! (Yes, the same size and font. Remember our bit from the philosophy of eBooks; it's content, not style.) If you have a DOC, DOCX, ODT, or other funky format, this is the time that you "save as" to make it into an RTF file. As mentioned above, you'll be converting the original RTF into an HTML document by e-mailing yourself via Gmail. If you use another solution, your steps might be slightly different.

If you are trying to convert from PDF, you've got a lot of work ahead of you. A. Lot. Follow along here, and we'll hit that in an appendix.

When the document shows up in your inbox, use the web interface to "View" the document. Then save that page as "Web Page, HTML Only". Name the file so that you know which one is the original, for example, "goldfishdreams_original.html". It doesn't matter what it actually is, just that you know it's the original.

Once you've saved the HTML file, you will make another copy that you start editing. For example, "goldfishdreams_editing.html". This is also the point where I start my text file where I keep track of what steps I've taken (and notes of things to do later if I'm not at that step yet). You can use paper for this if you like, but I like having a separate todo.txt file with each project.
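The copy-and-start-a-log step can be scripted too. This is just a sketch under my own assumptions: the helper name start_editing is hypothetical, and it assumes you want todo.txt to sit in the same folder as the editing copy.

```python
import shutil
from datetime import date
from pathlib import Path


def start_editing(original: str, editing: str, notes_name: str = "todo.txt") -> Path:
    """Copy the untouched original to a working file and log the step."""
    src, dst = Path(original), Path(editing)
    if dst.exists():
        # Refuse to clobber a file you may already have edited.
        raise FileExistsError(f"{dst} already exists; not overwriting it")
    shutil.copyfile(src, dst)
    notes = dst.parent / notes_name
    # Append, so the notes file keeps a running history of steps taken.
    with notes.open("a", encoding="utf-8") as log:
        log.write(f"{date.today()}: copied {src.name} to {dst.name}\n")
    return notes
```

For example, start_editing("goldfishdreams_original.html", "goldfishdreams_editing.html") leaves the original alone and adds one line to todo.txt.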

HTML is essentially a text file with "tags". The tags are inside of brackets, like this: <b>bold</b>. The tags are like stage directions for your web browser - they tell it what font, size, formatting, bold, and italics to use (along with a lot more). We don't need most of them; we'll tell it to use our own formatting later.

Again, I'm showing you from a Gmail conversion - if you use another utility, there might be tags beyond the ones that I discuss here. By and large, you can rip out extra tags with little harm. It's sometimes worthwhile to look up what those terms are so you don't accidentally delete something that you need. Edward, one of the beta readers, recommended W3Schools http://www.w3schools.com as a good resource, and I have to agree with him.

Stripping out the original formatting is vital - which is why it's darkly amusing when I hear people tell me how they make sure to put things in a special format before conversion.

So open the copy to edit, and we'll start performing search and replace functions. Here are the ones I had to do with the eBook conversion of Goldfish Dreams, by Jim C. Hines.

Replaced color="#0000FF" with a blank.
Replaced <font size="3" face="Times New Roman"> with a blank.
Replaced <font size="6" face="Times New Roman"> with a blank.
Replaced </font> with a blank.

Why: We will determine font size, color, and face using CSS, so we need to strip out all references to them beforehand. You want to search for color= and <font to turn up any other strange color or font faces. Check them as you go through - they might be important, or leftover editing notes.
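If your editor's search-and-replace feels clumsy, the same font and color cleanup can be sketched in Python with regular expressions. Note that this is broader than the exact strings I replaced above: it removes every font tag and every color attribute, whatever their values. The function names are made up for this example, and it assumes attribute values are in double quotes, as they were in my Gmail output. Run the listing function first so you can check for leftover editing notes before anything is stripped.

```python
import re


def list_font_and_color(html: str) -> list[str]:
    """Return each distinct <font ...> tag or stray color="..." attribute for review."""
    found = re.findall(r'<font\b[^>]*>|\bcolor="[^"]*"', html, flags=re.IGNORECASE)
    return sorted(set(found))


def strip_font_and_color(html: str) -> str:
    """Remove all font tags and color attributes; CSS will handle these later."""
    # Opening <font ...> tags (any attributes) and closing </font> tags.
    html = re.sub(r"</?font\b[^>]*>", "", html, flags=re.IGNORECASE)
    # color="..." attributes left on other tags, plus the whitespace before them.
    html = re.sub(r'\s+color="[^"]*"', "", html, flags=re.IGNORECASE)
    return html
```

Given `<p><font size="3" face="Times New Roman">Hello</font></p>`, the second function leaves just `<p>Hello</p>`.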

Replaced <br> with <p> </p>.
Replaced <br /> with <p> </p>.
Why: This properly creates an extra blank line between lines of text. Use it sparingly.

Replaced <p>      (six spaces) with <p>
Why: Gmail tried to preserve indenting with multiple spaces. We will render it with CSS, so it does not need to be here.
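The line-break and indent replacements above could be sketched in Python like this. The function names are hypothetical. I've assumed any run of two or more plain spaces right after an opening paragraph tag is leftover indenting (my file always had six), so adjust the pattern if yours differs.

```python
import re


def replace_breaks(html: str) -> str:
    """Turn <br>, <br/>, and <br /> into a spacer paragraph."""
    return re.sub(r"<br\s*/?>", "<p> </p>", html, flags=re.IGNORECASE)


def strip_paragraph_indent(html: str) -> str:
    """Drop two or more spaces that directly follow an opening <p ...> tag."""
    # A single space is left alone so the "<p> </p>" spacers survive.
    return re.sub(r"(<p\b[^>]*>) {2,}", r"\1", html, flags=re.IGNORECASE)
```

Because a lone space is never touched, it doesn't matter which of the two you run first.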

Replaced two spaces with a space and &nbsp;.
Why: Otherwise the double spacing between sentences may disappear. This forces double spacing. There are arguments about whether or not you should have double spacing between sentences. Check out http://uxmovement.com/content/6-surprising-bad-practices-that-hurt-dyslexic-users for details and make your design decisions appropriately.
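Whichever side of the double-spacing argument you land on, both choices are one-liners. Here's a rough Python sketch of each; the function names are mine, and it assumes the paragraph indents have already been stripped so the only doubled spaces left are the ones between sentences.

```python
import re


def preserve_double_spaces(html: str) -> str:
    """Turn each pair of spaces into a space plus a non-breaking space."""
    return html.replace("  ", " &nbsp;")


def collapse_double_spaces(html: str) -> str:
    """The other design choice: reduce any run of spaces to a single space."""
    return re.sub(r" {2,}", " ", html)
```

Pick one and use it on the whole file; mixing them gives you inconsistent spacing.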

You can surround each chapter title (e.g. "Chapter One") with header tags. That is, it would end up looking like this: <h1>Chapter One</h1>. You could also just put them in bold and it would work just fine. I actually have two specific tags for titles and bylines that work well - we'll see them when we get to the CSS sheet next week.
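If your chapter titles follow a predictable pattern, a script can add the header tags for you. This sketch assumes (and it is only an assumption) that every title sits alone in its own paragraph, optionally in bold, and starts with the word "Chapter". The name mark_chapter_titles is hypothetical. Check the results by eye afterward.

```python
import re

# A paragraph holding nothing but "Chapter <word>", optionally wrapped in <b>.
CHAPTER_PARAGRAPH = re.compile(
    r"<p\b[^>]*>\s*(?:<b>)?\s*(Chapter\s+[\w-]+)\s*(?:</b>)?\s*</p>",
    re.IGNORECASE,
)


def mark_chapter_titles(html: str) -> str:
    """Rewrite stand-alone chapter-title paragraphs as <h1> headings."""
    return CHAPTER_PARAGRAPH.sub(r"<h1>\1</h1>", html)
```

A paragraph that merely begins with "Chapter one of my life..." is left alone, because the pattern requires the paragraph to end right after the title.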

If you had tables in your document, make a note of where they are located and delete them entirely. We will substitute a picture of the table for the table itself. Not all eReaders handle tables in the same way, and the results can be disappointing.
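For a long book, a short script can find the tables so you don't miss one. The sketch below is hedged in a couple of ways: the function names are made up, the placeholder comment is my own idea for "making a note" inside the file itself, and the simple pattern will not cope with a table nested inside another table.

```python
import re


def find_tables(html: str) -> list[int]:
    """Return the line number of every opening <table> tag."""
    return [
        html.count("\n", 0, match.start()) + 1
        for match in re.finditer(r"<table\b", html, flags=re.IGNORECASE)
    ]


def replace_tables_with_marker(
    html: str, marker: str = "<!-- TABLE IMAGE GOES HERE -->"
) -> str:
    """Delete each table and leave a comment where its picture will go."""
    return re.sub(
        r"<table\b.*?</table>", marker, html, flags=re.IGNORECASE | re.DOTALL
    )
```

Copy the line numbers from find_tables into your notes file before you run the replacement, so you know which screen capture belongs where.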

And ignore everything above the first <div> tag. Realistically, at this point all you should largely have are the text, some <p>, <u>, and <i> elements (some might be <p align="center"> or something like that), and the closing of all those elements. The closing is when the element (<p> or <u>) is "closed" by having another instance with a forward slash, like </p> or </u>.

At this point, there should not be any <span> tags, but do a search for <span just like you did for font and color, and for the same reason. Likewise, go through and just look for anything… odd. This is where humans are still needed. Make a note of any strangeness you find before we start the next round of search & replace. Save the file, and save a backup copy of the file.
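One way to hunt for anything odd is to count which tags are still in the file. This Python sketch does that, skipping everything above the first div tag. The set of "expected" tags is my own guess at a reasonable list, and the function names are hypothetical. It doesn't replace reading through the file yourself.

```python
import re
from collections import Counter

# Assumption: the tags you expect to survive the cleanup. Edit to taste.
EXPECTED = {"div", "p", "u", "i", "b", "h1"}


def tag_inventory(html: str) -> Counter:
    """Count opening tags by name, starting from the first <div>."""
    start = html.lower().find("<div")
    if start != -1:
        html = html[start:]  # ignore the header material above the first <div>
    names = re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)
    return Counter(name.lower() for name in names)


def unexpected_tags(html: str, expected: set = EXPECTED) -> dict:
    """Return only the tags that are not on the expected list, with counts."""
    return {tag: n for tag, n in tag_inventory(html).items() if tag not in expected}
```

If unexpected_tags comes back with something like a span or font entry, search for that tag and decide whether it matters before moving on.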

A last note: HTML entities. “Curly quote’s cuteness”, ellipses (…), bullets (•), en and em dashes (– and —), as well as a lot of other symbols, are supported. However, you have to use special codes for them (as opposed to "this kind of quotes which are Steve's favorites...", which are standard symbols). You can find tables of the symbols here: http://www.w3schools.com/tags/ref_symbols.asp (scroll down to "other symbols"). You can use the entity name, but if you want to have the highest level of compatibility, use the entity number instead. (Some stores prefer it.) This is most easily accomplished by doing search-and-replace as well. Hopefully, if it's your own book you'll know if there are ellipses or special characters in it. Otherwise you might just have to go through it line by line and find them all.
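Rather than hunting for special characters one at a time, you could let a script swap every non-ASCII character for its entity number. This is a minimal Python sketch with a function name of my own choosing. It assumes the text has been read in correctly as Unicode (for example, opened with encoding="utf-8"), and it is not a substitute for proofreading the result.

```python
def to_numeric_entities(text: str) -> str:
    """Replace each non-ASCII character with its numeric entity, e.g. &#8230;."""
    # Plain ASCII (letters, tags, existing &nbsp; codes) passes through untouched.
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)
```

For example, an ellipsis (…) becomes &#8230; and a bullet (•) becomes &#8226;, while straight quotes and ordinary punctuation are left exactly as they were.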



arch2ngel said...

Okay, dumb question: are foreign language characters accepted? Specifically, I'm referring to Spanish characters like the ñ in niña and señorita...

Steve Saus said...

It's not a dumb question! They're HTML entities as well. For example, ñ is &ntilde;. You can find a list of them at http://www.w3schools.com/tags/ref_entities.asp