Understanding e-Book File Formats

One of the first things that confronts a writer wishing to e-publish their work is the confusing array of file formats and metadata, and the seeming lack of any standardization. In this article I will explore what an e-book really is, how the different formats compare, which tools convert between them, and how minor changes made as you write will make things much easier when you are ready to publish.

Why can’t I just save my doc file as an e-book?

When I was young, most writers used a typewriter to create a manuscript. There is some nostalgia surrounding typewriters: the snapping of the keys, the sound of the bell for each finished line; this was the sound of progress being made. When people were using typewriters, content was separated from format, layout, and design. You wrote a manuscript, double spaced, with whatever typeface was in your typewriter, typically in 10 or 12 point font. If you wanted to indicate special formatting, you would add hand-drawn markup to the document, or use some simple character-based markup such as asterisks to indicate *bold*, underscores for _underline_, and slashes for /italics/. At the publishing house, they would also add markup to the hard copy, indicating margins, fonts, page breaks, vertical spacing, table layout, images, etc. Authors worried about content, and publishers, for the most part, handled the presentation.

Nowadays, almost everyone uses a word processor of one type or another, with Microsoft’s Word being used by the majority. Publishers still want manuscripts in the same format (double spaced, 1 inch margins, etc.), but with the advent of word-processing, the markup can be embedded in the file. So when you italicize a phrase, as I did in the previous sentence, there is a code embedded in the text stream marking the start and end of the italic text. When viewed or printed, the phrase is shown in italics. This is known as presentational markup, and is what is used most often on word-processors. With presentational markup, you can change type family, size, weight, style, decorations, etc. You can go nuts and dO crazy things. This allows writers to make bold, large titles, and chapter headings, or put the telepathic robot conversation in some odd font / style / weight to differentiate it from normal dialog.

The problem with presentational markup is that it is often used where descriptive or semantic markup should be used. Semantic markup differs from presentational markup in that it labels the individual parts of the document, such as the title, a paragraph, an image caption, or a heading, without defining presentation. For instance, the title is distinguished from the rest of the text by surrounding it with the appropriate markup codes, or tags. In HTML, the markup tags are human readable: a tag name surrounded by angle brackets, <tag>, opens an element, and the same tag name with a leading slash, </tag>, closes it, as follows.

<title>This is a Title</title>

With semantic markup, the presentation is defined elsewhere, either in a separate file (known as a style sheet), or at the beginning of the document. In this way, equivalent parts of the document will have the same styling throughout. It is therefore easy to define and change the styling of every piece of the document that shares the markup, such as paragraphs, or chapter headers. It also becomes easy to generate a table of contents for a book by creating links to each chapter heading. This is important because the concept of a page is generally no longer meaningful due to variations in reading device sizes and capabilities.
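To illustrate why semantic markup makes a table of contents easy to generate, here is a minimal sketch in Python. It assumes chapter headings are marked with <h1> tags that carry an id attribute; the sample text, the ids, and the names HeadingCollector and build_toc are made up for this example and are not part of any e-book tool.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the id and text of every <h1> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []        # list of (id, text) tuples, in document order
        self._in_heading = False
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_heading = True
            self._current_id = dict(attrs).get("id")
            self._text = []

    def handle_data(self, data):
        if self._in_heading:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_heading:
            text = "".join(self._text).strip()
            self.headings.append((self._current_id, text))
            self._in_heading = False


def build_toc(html):
    """Return one link per chapter heading found in the HTML text."""
    collector = HeadingCollector()
    collector.feed(html)
    return [f'<a href="#{hid}">{text}</a>' for hid, text in collector.headings]


# Hypothetical two-chapter document used only for illustration.
sample = """
<h1 id="ch1">Chapter 1: Arrival</h1>
<p>It was a dark and stormy night.</p>
<h1 id="ch2">Chapter 2: Departure</h1>
<p>The storm had passed.</p>
"""

toc = build_toc(sample)
for entry in toc:
    print(entry)
```

Because every chapter heading carries the same tag, the program never has to guess which bold, 24 point line is a heading and which is just emphasis.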

This discussion of markup is necessary because all e-book formats require the document to have some sort of semantic markup. If you are self-publishing or want to understand why you can’t simply publish your MS-Word doc file as an e-book, you need to understand a little of what’s going on inside the e-book files themselves. The “e-book” is a container that supplies the document text, styling information, cover art, and meta-data to the reading application or hardware. A *.doc file is a document with hardly any semantic markup, containing mostly proprietary presentational markup.

File formats: the big 3

There are three major e-book formats that are supported on the majority of reading systems: ePUB, MOBI, and PDF. ePUB is an open format defined by the International Digital Publishing Forum (IDPF). It is the primary format used on the iPad, Sony Reader, and the Barnes & Noble NOOK, and can be read by any PC or internet-based e-book reading software (e.g., Calibre, Stanza, Bookworm, Ibis). Basically, all e-readers except the Kindle can read ePUB files without fuss.

MOBI, the Mobipocket reader file format now owned by Amazon, can have the *.azw, *.prc, or *.mobi extension. AZW is Amazon’s version of the MOBI format that can be read on the Kindle. It is essentially the MOBI file structure with its own DRM scheme and no JavaScript support. The Kindle can also read unprotected *.prc or *.mobi files directly. MOBI is based on the Open eBook standard, the predecessor of ePUB, so the two formats share many of the same conventions.

PDF isn’t really an e-book format at all; it is a document format based on PostScript (PS). PDF is useful when you need to keep the “page” concept and positioning on the page is important. It also supports scalable vector graphics, so it is good for rendering technical drawings and diagrams. It really isn’t a good format for e-readers: most will display PDF files, but reading them often requires horizontal panning, which is no fun. It is useful if you need the e-reader version to match the printed version, or you need scalable graphics and mathematical formatting.

There are two other formats worth mentioning at this point: plain text (*.txt) and HTML (*.html, *.htm). Plain text has the advantage that it is readable on all e-readers. There are several formatting issues that need attention with respect to line wrapping, and there is no support for images, links, or a table of contents (TOC), but for a simple document, it works well. HTML is important because not only is it the basis for web display, it is the underlying format for both ePUB and MOBI! Plain old HTML files can be viewed by the majority of e-readers without modification.

There are many other e-book formats, but with ePUB and MOBI, you have a book that can be read without conversion on any device currently available. There are methods for reading ePUB on the Kindle utilizing the built-in browser, but they rely on an active internet connection. Converting an ePUB to the MOBI format is simple enough that there is really no reason not to publish in both formats. Publishing only in PDF should be avoided where possible, since it is a fixed-layout format and is poorly supported on many devices.

Format      | File extensions        | Devices that CAN read                | Devices that CAN NOT read
Text        | *.txt                  | All                                  | None
HTML, XHTML | *.htm, *.html, *.xhtml | Kindle, iOS, Android, Nook           | Sony, iRex, Kobo
ePUB        | *.epub                 | Android, iOS, Nook, iRex, Sony, Kobo | Kindle
Mobipocket  | *.mobi, *.prc, *.azw   | Kindle, Android, iOS, iRex           | Nook, Kobo, Sony
PDF         | *.pdf                  | All                                  | None

All except the Kindle 1.0, and WISEreader. Reading experience varies greatly, typically much worse than other eBook formats.

E-book files: What’s inside?

The two e-book formats discussed above, MOBI and ePUB, are really collections of files that are rolled into a single file for distribution. ePUB files are actually standard *.zip archives and can be opened by changing the file name extension to “zip”, or by using 7-Zip, which can open ePUB files without changing the extension. MOBI files use a proprietary compression scheme, but are essentially the same concept, so I’ll limit the remaining discussion to ePUB.

Open an ePUB and you will see that the document is a bunch of HTML files instead of one. Alongside them sit a few small supporting files (content.opf, toc.ncx, and mimetype) that describe the book to the reading system. Each HTML file is a section or a chapter, and each section ends with a page break. You normally don’t want text from chapter 2 to reflow into the bottom of the last page of chapter 1; you want it to start on a new page. Putting each chapter into a separate file forces the e-reader to do this.
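If you would rather peek inside an ePUB with a script than rename it, Python's standard zipfile module can list the contents. This is only a sketch: it builds a tiny stand-in archive in memory so that it runs without a real book, and the OEBPS folder and chapter file names are a common convention assumed here, not something every ePUB uses.

```python
import io
import zipfile


def list_epub_contents(epub_file):
    """Return the names of the files stored inside an ePUB (a ZIP archive)."""
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()


# Build a small stand-in archive in memory. The file names and the
# placeholder contents are assumptions made for this illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", "<container/>")
    archive.writestr("OEBPS/content.opf", "<package/>")
    archive.writestr("OEBPS/toc.ncx", "<ncx/>")
    archive.writestr("OEBPS/chapter1.html", "<html/>")
    archive.writestr("OEBPS/chapter2.html", "<html/>")

buffer.seek(0)
names = list_epub_contents(buffer)
for name in names:
    print(name)
```

To inspect a real book, pass its path instead of the in-memory buffer, for example list_epub_contents("mybook.epub"), where the file name is whatever your book is called.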

Writing for the Web

Since the two big formats are containers for HTML documents, it makes sense to keep that in mind while you are writing. Converting your document to HTML might take a lot of effort if you aren’t planning for the conversion as you write. For instance, if you write an entire novel in one file, separating chapters by inserting page breaks, and typing chapter headings by changing the font size to 24pt and making it bold, you will likely spend a lot of time trying to get your HTML to look right in the e-book.

Here are some tips to make it easier:

  1. Create a separate file for each chapter as you write, or at least make it easy to find chapter boundaries (i.e., include “Chapter” in the heading, or some other indication; don’t rely on style alone). This also makes it easy to “give away” a sample chapter by posting it on the internet.
  2. Use styles instead of formats. In other words, use the built-in title, heading 1, text body, caption, etc. styles. If you don’t know how already, learn how to modify the styles to adjust the display, rather than formatting each piece of text by changing its typeface, font size, color, etc.
  3. Try to save and edit the document as HTML. You can switch from document view to source view using Word or OpenOffice. This allows you to edit in the WYSIWYG editor, or edit the HTML directly.
  4. If you need to be able to have the manuscript as a single DOC file (to send to a publisher for instance), you can use “master documents” to combine the front matter, chapter files, index and whatever else you have into one big file at the end. This is a feature in most major word-processors and makes it easier to work on large documents in general.
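The first tip pays off when it is time to split a manuscript. As a rough sketch, assuming a plain-text manuscript in which every chapter heading sits on its own line and starts with the word “Chapter” followed by a number, a few lines of Python can cut it into one HTML file per chapter. The function names, the file naming scheme, and the sample text are invented for this example.

```python
import re
from html import escape


def split_chapters(manuscript):
    """Split plain text into (heading, body) pairs at 'Chapter N' lines."""
    # The capturing group makes re.split keep each heading line in the result.
    parts = re.split(r"(?m)^(Chapter \d+.*)$", manuscript)
    # parts[0] is whatever precedes the first heading (front matter);
    # after that the list alternates heading, body, heading, body, ...
    headings = parts[1::2]
    bodies = parts[2::2]
    return [(h.strip(), b.strip()) for h, b in zip(headings, bodies)]


def chapters_to_files(manuscript):
    """Map a file name to simple HTML content for each chapter."""
    files = {}
    for number, (heading, body) in enumerate(split_chapters(manuscript), start=1):
        # Blank lines separate paragraphs in the plain-text manuscript.
        paragraphs = "".join(
            f"<p>{escape(p.strip())}</p>" for p in body.split("\n\n") if p.strip()
        )
        files[f"chapter{number:02d}.html"] = f"<h1>{escape(heading)}</h1>{paragraphs}"
    return files


# Hypothetical manuscript used only for illustration.
manuscript = (
    "Title Page\n\n"
    "Chapter 1: Arrival\n\n"
    "It was a dark night.\n\n"
    "The rain fell.\n\n"
    "Chapter 2: Departure\n\n"
    "The storm had passed.\n"
)

files = chapters_to_files(manuscript)
for name, content in files.items():
    print(name, content)
```

If headings were identified only by font size, nothing in the text would tell the script where one chapter ends and the next begins.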

Putting it all together

Once you know what’s in there, making the e-book is straightforward. First convert your manuscript into a set of HTML files, one for each chapter. Then add the content.opf, toc.ncx, and mimetype files, along with a META-INF/container.xml file that tells the reader where to find content.opf, and zip everything up into a single *.zip file. The mimetype file must be the first entry in the archive and must be stored uncompressed. Change the “zip” extension to “epub” and you’re done. The challenge is in creating the opf and ncx files. Luckily there are tools to help do this.
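The zipping step can be sketched in Python as follows. This covers only the packaging: the content.opf and toc.ncx strings are placeholders, so the result is not a valid book until real ones are supplied. The function name write_epub and the flat file layout are assumptions for this example. The sketch also writes META-INF/container.xml, the small file the ePUB container format uses to point at content.opf, and it stores the mimetype entry first and uncompressed as the format requires.

```python
import io
import zipfile

# Tells the reading system where the package file (content.opf) lives.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def write_epub(target, content_opf, toc_ncx, chapters):
    """Zip already-prepared files into an ePUB container.

    target is a file path or a writable binary file object;
    chapters maps file names (e.g. "chapter01.html") to HTML text.
    """
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as epub:
        # The mimetype entry goes in first and is stored without compression.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        epub.writestr("META-INF/container.xml", CONTAINER_XML)
        epub.writestr("content.opf", content_opf)
        epub.writestr("toc.ncx", toc_ncx)
        for name, html in chapters.items():
            epub.writestr(name, html)


# Placeholder inputs: a real book needs a complete content.opf and toc.ncx.
buffer = io.BytesIO()
write_epub(
    buffer,
    content_opf="<package/>",
    toc_ncx="<ncx/>",
    chapters={"chapter01.html": "<html><body><h1>Chapter 1</h1></body></html>"},
)

buffer.seek(0)
with zipfile.ZipFile(buffer) as epub:
    entries = epub.infolist()
    print([entry.filename for entry in entries])
```

Writing to a path such as "mybook.epub" instead of the in-memory buffer produces the file directly, with no renaming step.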

The most popular free tool for creating and converting e-book files is Calibre. Calibre can convert your manuscript into an ePUB file with little effort. The opf and ncx files get created automatically. However, there are a lot of options, and any wrong setting could lead to an incomplete book with missing metadata, cover image problems, or a broken TOC. Some of these problems may only become apparent when you try to convert the file to another format, such as MOBI.
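Calibre also ships a command-line converter, ebook-convert, which is handy once you have settled on options that work. The sketch below only assembles the command and runs it if Calibre is installed and the input file exists. The file names, title, and author are placeholders, and the helper name calibre_convert_command is made up for this example.

```python
import os
import shutil
import subprocess


def calibre_convert_command(source, target, title=None, authors=None, cover=None):
    """Build an ebook-convert command line; the output format follows the
    extension of the target file (for example .epub or .mobi)."""
    command = ["ebook-convert", source, target]
    if title:
        command += ["--title", title]
    if authors:
        command += ["--authors", authors]
    if cover:
        command += ["--cover", cover]
    return command


# Placeholder file names and metadata, for illustration only.
command = calibre_convert_command(
    "manuscript.html", "book.epub", title="My Novel", authors="Jane Doe"
)
print(" ".join(command))

# Run the conversion only when Calibre is installed and the input exists.
if shutil.which("ebook-convert") and os.path.exists("manuscript.html"):
    subprocess.run(command, check=True)
```

Setting the title, author, and cover explicitly is one way to avoid the missing metadata problems mentioned above.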

Sigil is currently the only tool that allows you to create and edit ePUB files directly, without writing in one format and then converting. It is free, open source software and can work well once you are more comfortable with the ePUB format.

If you use OpenOffice or LibreOffice, you can use an OpenOffice extension that lets you save your manuscript directly as an ePUB. I chose to link directly to the file since the site is in Italian. If you want to see the original site and give kudos to the author, you can find it at http://lukesblog.it/ebooks/ebook-tools/writer2epub/ where there is information on installation and use. Google Translate does a good job with the translation.

Adobe InDesign is another option, but only if you have $700 to spare. Adobe now offers a monthly subscription license for $35/month if that better fits in your budget.

Once you have a well-formatted, publication-quality ePUB, you can convert it to MOBI using Calibre, or use an online service such as 2epub.com or Convert.Files. Both services use the Calibre engine to do the conversion, but 2epub prompts for some metadata and overall is a better experience in my opinion. In order to create a Kindle e-book for distribution on Amazon, you need to convert the ePUB to the AZW format using Amazon’s KindleGen application.

If you plan to have a publication quality document, you must make sure you do a thorough quality check by using an ePUB validator in addition to viewing the document in a reading application or better yet, on the target device. If you are not very tech savvy, you may want to delegate this e-book creation and conversion to a professional. It is my hope that you can find the help you need here on WritelyDone.
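Before running a full validator such as EPUBCheck, a quick script can catch the most basic packaging mistakes. The following is a minimal sketch, not a substitute for real validation: it checks only the mimetype entry and the presence of META-INF/container.xml, and the function name and the two in-memory test archives are made up for this example.

```python
import io
import zipfile


def quick_epub_check(epub_file):
    """Return a list of basic container problems (empty if none were found)."""
    problems = []
    with zipfile.ZipFile(epub_file) as epub:
        entries = epub.infolist()
        if not entries or entries[0].filename != "mimetype":
            problems.append("mimetype is not the first entry")
        else:
            if entries[0].compress_type != zipfile.ZIP_STORED:
                problems.append("mimetype is compressed")
            if epub.read("mimetype") != b"application/epub+zip":
                problems.append("mimetype has the wrong content")
        if "META-INF/container.xml" not in epub.namelist():
            problems.append("META-INF/container.xml is missing")
    return problems


# A correctly ordered stand-in archive (placeholder contents).
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", "<container/>")
    archive.writestr("chapter01.html", "<html/>")

# A badly packaged archive: wrong order and no container.xml.
bad = io.BytesIO()
with zipfile.ZipFile(bad, "w") as archive:
    archive.writestr("chapter01.html", "<html/>")
    archive.writestr("mimetype", "application/epub+zip")

good.seek(0)
bad.seek(0)
good_problems = quick_epub_check(good)
bad_problems = quick_epub_check(bad)
print(good_problems)
print(bad_problems)
```

An empty list here means only that the container looks sane; the content.opf, toc.ncx, and chapter files still need a proper validator and a read-through on a real device.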