Juraj's Blog

21 Dec 2024

Calculating ebook pages

I wanted to calculate the number of pages in each of my ebooks, for statistical purposes. Most of my ebooks are in the EPUB format. This file type is easy to parse, because an EPUB book is technically a ZIP archive with an .epub extension that contains XHTML files as its content.
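
To illustrate the "ZIP file with XHTML inside" point, here is a minimal sketch that lists the content files of a book using only Python's standard zipfile module. The function name list_content_files and the set of file extensions are my own assumptions for illustration, not taken from the project's source code.

```python
import zipfile

# Extensions treated as book content here. This list is an assumption
# made for the sketch, not a rule taken from the EPUB format itself.
CONTENT_EXTENSIONS = (".xhtml", ".html", ".htm")


def list_content_files(epub_path):
    """Return the names of the (X)HTML files stored inside an epub archive.

    epub_path may be a filesystem path or a binary file-like object,
    since zipfile.ZipFile accepts both.
    """
    with zipfile.ZipFile(epub_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.lower().endswith(CONTENT_EXTENSIONS)
        ]
```

Calling list_content_files("book.epub") on a real book would return entries such as chapter files, in the order they are stored in the archive. That order is not necessarily the reading order, but for counting characters the order does not matter.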

The length in pages can be estimated by unzipping the entire book, extracting the text from each (X)HTML file, and counting the characters. To make my job easier, I decided to use the BeautifulSoup library and its get_text() method. Once I have the number of characters in the entire book, I divide it by 1,800, which is roughly the number of characters per page of a book.
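
The estimate itself can be sketched roughly as follows. To keep the example free of third-party installs, it swaps in the standard library's html.parser as a stand-in for BeautifulSoup, so the character counts may differ slightly from what get_text() gives. The names count_characters and estimate_pages are hypothetical, and rounding up to a whole page is my assumption; the only number taken from the post is 1,800.

```python
import math
from html.parser import HTMLParser

CHARS_PER_PAGE = 1800  # rough characters-per-page figure used in the post


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def count_characters(markup):
    """Count the characters of visible text in one (X)HTML document.

    Whitespace between tags is counted too, as it is part of the text.
    """
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return len("".join(collector.parts))


def estimate_pages(documents, chars_per_page=CHARS_PER_PAGE):
    """Estimate the page count from an iterable of (X)HTML strings.

    Rounding up is an assumption of this sketch: a partly filled
    last page is counted as a full page.
    """
    total = sum(count_characters(doc) for doc in documents)
    return math.ceil(total / chars_per_page)
```

With BeautifulSoup installed, the body of count_characters could be replaced by something like len(BeautifulSoup(markup, "html.parser").get_text()), which is closer to what the post describes. The documents passed to estimate_pages would be the decoded contents of the (X)HTML files read from the epub archive.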

The project is written in Python, and the source code is on GitHub.