One of the first things that confronts a writer wishing to e-publish their work is the confusing array of file formats and meta-data, and the seeming lack of any standardization. In this article I will explore what an e-book really is, the different formats, the tools for converting between them, and how minor changes made as you write will make things much easier when you are ready to publish.
Why can’t I just save my doc file as an e-book?
When I was young, most writers used a typewriter to create a manuscript. There is some nostalgia surrounding the typewriter: the snapping of the keys, the sound of the bell at the end of each line. This was the sound of progress being made. When people were using typewriters, content was separated from format, layout, and design. You wrote a manuscript, double spaced, with whatever typeface was in your typewriter, typically in 10 or 12 point font. If you wanted to indicate special formatting, you would add hand-drawn markup to the document, or use some simple character-based markup, such as asterisks to indicate *bold*, underscores for _underline_, and slashes for /italics/. At the publishing house, they would also add markup to the hard copy, indicating margins, fonts, page breaks, vertical spacing, table layout, images, etc. Authors worried about content, and publishers, for the most part, handled the presentation.
Nowadays, almost everyone uses a word processor of one type or another, with Microsoft’s Word being used by the majority. Publishers still want manuscripts in the same format (double spaced, 1 inch margins, etc.), but with the advent of word processing, the markup can be embedded in the file. So when you italicize a phrase, as I did in the previous sentence, there is a code embedded in the text stream marking the start and end of the italic text. When viewed or printed, the phrase is shown in italics. This is known as presentational markup, and it is what word processors use most often. With presentational markup, you can change type family, size, weight, style, decorations, etc. You can go nuts and do crazy things. This allows writers to make bold, large titles and chapter headings, or put the telepathic robot conversation in some odd font, style, or weight to differentiate it from normal dialog.
The problem with presentational markup is that it is often used where descriptive or semantic markup should be used. Semantic markup differs from presentational markup in that it labels the individual parts of the document, such as the title, a paragraph, an image caption, or a heading, without defining presentation. For instance, the title is distinguished from the rest of the text by surrounding it with the appropriate markup codes, or tags. In HTML, the markup tags are human readable. The tag name is surrounded by angle brackets to open an element (<tag>), and a slash is placed before the tag name to close the element (</tag>), as follows.
<title>This is a Title</title>
With semantic markup, the presentation is defined elsewhere, either in a separate file (known as a style sheet), or at the beginning of the document. In this way, equivalent parts of the document will have the same styling throughout. It is therefore easy to define and change the styling of every piece of the document that shares the markup, such as paragraphs, or chapter headers. It also becomes easy to generate a table of contents for a book by creating links to each chapter heading. This is important because the concept of a page is generally no longer meaningful due to variations in reading device sizes and capabilities.
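To make the table of contents idea concrete, here is a minimal sketch in Python using only the standard library. It assumes that each chapter heading is marked with an <h1> tag carrying an id attribute. That convention, along with the names HeadingCollector, build_toc, and the sample BOOK text, is my own choice for illustration and not something any e-book format requires.

```python
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the id and text of every <h1> element (assumed to mark a chapter)."""

    def __init__(self):
        super().__init__()
        self.headings = []        # (id, text) pairs in document order
        self._current_id = None
        self._in_heading = False
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_heading = True
            self._current_id = dict(attrs).get("id")
            self._text_parts = []

    def handle_data(self, data):
        if self._in_heading:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_heading:
            text = "".join(self._text_parts).strip()
            self.headings.append((self._current_id, text))
            self._in_heading = False


def build_toc(html_text):
    """Return an HTML list that links to each chapter heading that has an id."""
    collector = HeadingCollector()
    collector.feed(html_text)
    items = [
        f'<li><a href="#{anchor}">{escape(text)}</a></li>'
        for anchor, text in collector.headings
        if anchor  # headings without an id cannot be linked to
    ]
    return "<ol>\n" + "\n".join(items) + "\n</ol>"


# Hypothetical sample document, for illustration only.
BOOK = """
<h1 id="ch1">Chapter 1: The Typewriter</h1>
<p>It was a dark and stormy night.</p>
<h1 id="ch2">Chapter 2: The Word Processor</h1>
<p>The cursor blinked.</p>
"""

if __name__ == "__main__":
    print(build_toc(BOOK))
```

Because the headings are labeled by what they are rather than by how they look, the program never has to guess which large, bold lines were meant to be chapter titles.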
This discussion of markup is necessary because all e-book formats require the document to have some sort of semantic markup. If you are self-publishing or want to understand why you can’t simply publish your MS-Word doc file as an e-book, you need to understand a little of what’s going on inside the e-book files themselves. The “e-book” is a container that supplies the document text, styling information, cover art, and meta-data to the reading application or hardware. A *.doc file is a document with hardly any semantic markup, containing mostly proprietary presentational markup.
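To illustrate the container idea, the sketch below assumes a ZIP-based e-book (an ePUB file is a ZIP archive) and sorts the files inside it by the role they play. The file names, the ROLES table, and the helper functions are hypothetical, and the toy archive built here is not a valid e-book. It only mimics the mix of parts a real one carries.

```python
import io
import os
import zipfile

# Rough, illustrative mapping from file extension to role inside the container.
ROLES = {
    ".xhtml": "document text",
    ".html": "document text",
    ".css": "styling",
    ".jpg": "image",
    ".png": "image",
    ".opf": "meta-data",
}


def describe_container(archive_bytes):
    """Return {filename: role} for every file in a ZIP-based e-book."""
    roles = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            suffix = os.path.splitext(name)[1].lower()
            roles[name] = ROLES.get(suffix, "other")
    return roles


def make_toy_book():
    """Build a tiny archive in memory (hypothetical contents, not a valid ePUB)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("content.opf", "<package/>")
        archive.writestr("chapter1.xhtml", "<h1>Chapter 1</h1>")
        archive.writestr("style.css", "h1 { font-weight: bold; }")
        archive.writestr("cover.jpg", b"")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, role in describe_container(make_toy_book()).items():
        print(f"{name}: {role}")
```

To look inside a real file, you would read its bytes from disk and pass them to describe_container in place of the toy archive.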
File formats: the big 3
There are three major e-book formats that are supported on the majority of reading systems: ePUB, MOBI, and PDF. ePUB is an open format defined by the International Digital Publishing Forum (IDPF). It is the primary format used on the iPad, Sony Reader, and the Barnes & Noble NOOK, and it can be read by any PC or internet based e-book reading software (e.g. Calibre, Stanza, Bookworm, Ibis). Basically, all e-readers except the Kindle can read ePUB files without fuss.
PDF isn’t really an e-book format at all; it is a document format based on PostScript (PS). PDF is useful when you need to keep the “page” concept, and positioning on the page is important. It also supports scalable vector graphics, so it is good for rendering technical drawings and diagrams. This really isn’t a good format for e-readers. Most will read PDF files, but doing so often requires horizontal panning, which is no fun. It is useful if you need the e-reader version to match the printed version, or you need scalable graphics and mathematical formatting.
There are two other formats worth mentioning at this point: plain text (*.txt) and HTML (*.html, *.htm). Plain text has the advantage that it is readable on all e-readers. There are several formatting issues that need attention with respect to line wrapping, and there is no support for images, links, or a table of contents, but for a simple document, it works well. HTML is important because not only is it the basis for web display, it is also the underlying format for both ePUB and MOBI! Plain old HTML files can be viewed by the majority of e-readers without modification.
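One common line wrapping problem, and I am assuming here that it is the one you are facing, is a text file with a hard line break at the end of every line, which leaves ragged short lines on a narrow screen. A minimal sketch of a fix in Python joins the lines of each paragraph and lets the reading device do the wrapping. It assumes paragraphs are separated by blank lines, and the function name unwrap_paragraphs is my own.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines so each paragraph becomes a single line.

    Assumes paragraphs are separated by one or more blank lines.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Drop empty lines and surrounding whitespace, then rejoin with spaces.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "It was a dark\nand stormy night.\n\nThe cursor\nblinked.\n"
    print(unwrap_paragraphs(sample))
```

Text that depends on its line breaks, such as poetry or code listings, would need to be excluded from this treatment.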
many other e-book formats