What you missed out on: AI integration.
- AI does 99.9% correct homograph phonetic spelling replacement.
- AI converts numbers to spoken formats. "December 1982" is read as "December Nineteen Eighty Two", and "1100 feet high" is read as "eleven hundred feet", not "one one zero zero" or "one thousand one hundred". Abbreviations are expanded, so "Dr." is read as "doctor" rather than "D R", "Mrs." as "Misses", and so on.
- Ridiculous capitalizations are removed: DVD stays the same, but ALL HANDS REPORT is changed to "All hands report" so weird shit doesn't happen.
- Complete integration into DNXS-Spokenword: input text is now preprocessed before being converted to speech.
Bookfix is a comprehensive GUI-based text processing application designed specifically for cleaning and formatting ebook text files for text-to-speech (TTS) and other applications. It provides both automated and interactive tools to transform raw ebook text into polished, readable content.
- File Support: Currently works with .txt, .html, and .xhtml files
- Platform: Written in Python and tested on Linux OS
- Ebook Conversion: For best results, convert EPUB files to TXT using Calibre before processing
- EPUB Processing: While it is possible to process EPUB files directly by decompressing them and handling the HTML markup, TXT conversion is recommended for stability and ease of use
- Interactive Word Choices - Manually select replacements for specific words with keyboard shortcuts
- Automatic Replacements - Apply predefined find/replace rules from configuration
- Pagination Removal - Remove page numbers and pagination elements from HTML/TXT files
- Roman Numeral Conversion - Convert Roman numerals to Arabic numbers (II → 2, XIV → 14)
- All-Caps Sequence Processing - Interactive handling of uppercase sequences with auto-lowercase options
- Abbreviation Protection - Prevents conversion of common abbreviations (I.D., Ph.D., etc.)
- Numbered Line Editing - Manual editing interface for lines containing 3+ digit numbers
- Blank Line Removal - Clean up excessive whitespace and empty lines
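To make the Roman numeral conversion listed above concrete, here is a minimal sketch. It is not taken from bookfix.py: the name `roman_to_int`, the 1 to 3999 range, and the validation regex are assumptions made for illustration.

```python
import re

# Strict pattern for Roman numerals from 1 to 3999 (an assumption for this sketch).
ROMAN_PATTERN = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(roman):
    """Return the integer value of a Roman numeral, or None if it is malformed."""
    if not roman or not ROMAN_PATTERN.match(roman):
        return None
    total = 0
    for i, ch in enumerate(roman):
        value = ROMAN_VALUES[ch]
        # Subtract when a smaller numeral precedes a larger one (IV, IX, XL, ...).
        if i + 1 < len(roman) and value < ROMAN_VALUES[roman[i + 1]]:
            total -= value
        else:
            total += value
    return total


print(roman_to_int("II"))    # 2
print(roman_to_int("XIV"))   # 14
print(roman_to_int("IIII"))  # None (rejected by the strict pattern)
```

Validating before converting matters in ebook text, where a capital "I" or the word "MIX" can look like a numeral; the real program may use different rules to tell them apart.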
- Input formats: .txt, .html, .xhtml
- Output format: .txt (with _output suffix)
- BeautifulSoup integration for HTML/XHTML processing
- Checkbox controls for enabling/disabling processing steps
- Real-time text preview with syntax highlighting
- Progress tracking with visual progress bars
- Keyboard shortcuts for faster operation
- Status updates throughout processing
- Python 3.6 or higher
- Required packages:
  - tkinter (usually included with Python)
  - beautifulsoup4
# Install required package
pip install beautifulsoup4
# Run the program
python3 bookfix.py
# Check Python version
python3 --version
# Install dependencies
pip3 install beautifulsoup4
# If tkinter is missing (rare)
brew install python-tk
# Run
python3 bookfix.py
- Launch the application
  python3 bookfix.py
- Set default directory (first run only)
  - Select your ebook library or text files folder
  - This setting is saved for future use
- Select input file
  - Choose a .txt, .html, or .xhtml file to process
- Configure processing options
  - Check/uncheck desired processing steps
  - All options are enabled by default
- Start processing
  - Click the "Start Processing" button
  - Follow interactive prompts as needed
- Save results
  - Click "Save" when processing is complete
  - Output is saved as filename_output.txt
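The output naming convention can be sketched in a few lines. The helper name `output_path_for` is hypothetical, and writing the result next to the input file is an assumption; the source only states that the saved file carries the `_output.txt` suffix.

```python
from pathlib import Path


def output_path_for(input_path):
    """Build the output path: same folder, same stem, `_output.txt` suffix."""
    source = Path(input_path)
    # The original extension (.txt, .html, .xhtml) is dropped in favour of .txt.
    return source.with_name(source.stem + "_output.txt")


print(output_path_for("library/moby_dick.html").name)
```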
The bookfix.py script is a GUI tool built with the Tkinter library. Its main goal is to help users clean and standardize text from input files by providing a way to handle inconsistent wording and apply automatic cleanup rules.
The program guides the user through making decisions for specific words and then performs a series of automatic text transformations based on rules read from a separate data file.
The program processes text in the following order:
- Apply Automatic Replacements - Bulk find/replace operations
- Insert Periods into Abbreviations - Add periods to specified abbreviations
- Remove Pagination - Clean up page numbers and pagination elements
- Interactive Choices - Manual word-by-word replacement decisions
- Process All-Caps Sequences - Handle uppercase text interactively
- Convert Roman Numerals - Transform Roman numerals to Arabic numbers
- Convert to Lowercase - Optional full text lowercasing
- Remove Blank Lines - Clean up excessive whitespace
- Numbered Line Editing - Manual editing of lines with numbers
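One way to picture this ordered, checkbox-driven workflow is a list of named steps applied in sequence. The sketch below is an illustration only: the three step functions are hypothetical stand-ins covering a subset of the list, and `run_pipeline` is not the actual `run_processing()` from bookfix.py.

```python
# Hypothetical stand-ins for the real steps; each takes text and returns text.
def apply_replacements(text):
    return text.replace("--", "\u2014")


def to_lowercase(text):
    return text.lower()


def strip_blank_lines(text):
    return "\n".join(line for line in text.splitlines() if line.strip())


# Order matters: this mirrors a subset of the sequence listed above.
STEPS = [
    ("replacements", apply_replacements),
    ("lowercase", to_lowercase),
    ("blank_lines", strip_blank_lines),
]


def run_pipeline(text, enabled):
    """Apply each enabled step in order; `enabled` maps step name to bool."""
    for name, step in STEPS:
        if enabled.get(name, True):  # steps default to on, like the checkboxes
            text = step(text)
    return text


sample = "Chapter One\n\nHe paused -- then spoke."
print(run_pipeline(sample, {"lowercase": False}))
```

Keeping the order in one list makes it easy to see why, for example, replacements run before lowercasing: a rule written for "Mr" would no longer match once the text is lowercased.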
- Press number keys (1-9) to select replacement options
- View highlighted matches in context
- Progress tracking shows completion status
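The number-key selection above can be sketched without the GUI. The function names here are hypothetical, and whole-word, case-sensitive matching is an assumption; the source does not say how bookfix.py locates matches.

```python
import re


def find_matches(text, word):
    """Return (start, end) spans of whole-word, case-sensitive matches."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [m.span() for m in pattern.finditer(text)]


def apply_choice(text, span, options, key):
    """Replace the match at `span` with the option chosen by number key 1-9."""
    index = int(key) - 1
    if not 0 <= index < len(options):
        raise ValueError("no option bound to key " + str(key))
    start, end = span
    return text[:start] + options[index] + text[end:]


text = "The colour of the sky; the colour of the sea."
options = ["color", "colour"]
# Work from the last match backwards so earlier spans stay valid.
for span in reversed(find_matches(text, "colour")):
    text = apply_choice(text, span, options, "1")
print(text)  # The color of the sky; the color of the sea.
```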
- Y/Yes - Lowercase this instance and all remaining instances
- N/No - Keep uppercase, skip for this session
- A/Add - Add to ignore list permanently
- I/Auto - Add to auto-lowercase list permanently
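The four decisions above can be modelled as an automatic pass driven by two persistent sets, followed by per-sequence decisions. This is a sketch under assumptions: it treats each run of two or more capital letters as one sequence, and the names `auto_pass` and `apply_decision` are hypothetical rather than the functions in bookfix.py.

```python
import re

# Two or more consecutive capital letters form a candidate (assumed rule).
CAPS_PATTERN = re.compile(r"\b[A-Z]{2,}\b")


def auto_pass(text, ignore, auto_lower):
    """Lowercase sequences on the auto list; collect the rest for prompting."""
    pending = []

    def handle(match):
        word = match.group(0)
        if word in auto_lower:
            return word.lower()
        if word not in ignore and word not in pending:
            pending.append(word)
        return word

    return CAPS_PATTERN.sub(handle, text), pending


def apply_decision(text, word, choice, ignore, auto_lower):
    """Apply one Y/N/A/I decision; A and I also update the persistent sets."""
    choice = choice.lower()
    if choice == "a":
        ignore.add(word)
    elif choice == "i":
        auto_lower.add(word)
    if choice in ("y", "i"):
        text = re.sub(r"\b" + re.escape(word) + r"\b", word.lower(), text)
    return text


ignore, auto_lower = {"NASA"}, {"CHAPTER"}
text, pending = auto_pass("CHAPTER ONE: NASA sent the DVD.", ignore, auto_lower)
print(text)     # chapter ONE: NASA sent the DVD.
print(pending)  # ['ONE', 'DVD']
text = apply_decision(text, "ONE", "y", ignore, auto_lower)
text = apply_decision(text, "DVD", "a", ignore, auto_lower)
print(text)     # chapter one: NASA sent the DVD.
```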
- Edit lines containing 3+ digit numbers
- Navigate with Previous/Next buttons
- Roman numeral reference guide included
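Finding the lines offered for manual editing comes down to a digit-run test. The sketch below assumes "3+ digit numbers" means three or more consecutive digits; the function name is hypothetical.

```python
import re

# A run of three or more digits anywhere in the line.
LONG_NUMBER = re.compile(r"\d{3,}")


def lines_with_long_numbers(text):
    """Return (line_number, line) pairs for lines holding a 3+ digit number."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if LONG_NUMBER.search(line)
    ]


sample = "Born in December 1982.\nHe was 42.\nThe tower is 1100 feet high."
for number, line in lines_with_long_numbers(sample):
    print(number, line)
```

With this sample, lines 1 and 3 are selected and the line containing only "42" is skipped.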
The program uses a .data.txt file for configuration with the following sections:
# CHOICE
word -> option1;option2;option3
# REPLACE
old_text -> new_text
# PERIODS
abbreviation_without_periods
# CAP_IGNORE
SEQUENCE_TO_IGNORE
# UPPER_TO_LOWER
SEQUENCE_TO_LOWERCASE
# DEFAULT_FILE_DIR
/path/to/your/ebook/folder
# CHOICE
colour -> color;colour
realise -> realize;realise
# REPLACE
-- -> —
... -> …
# PERIODS
Mr
Dr
St
# CAP_IGNORE
NASA
FBI
# UPPER_TO_LOWER
CHAPTER
BOOK
# DEFAULT_FILE_DIR
/Users/username/Documents/Ebooks
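A configuration file in this shape can be read with a small hand-written parser. The sketch below is an assumption about how such parsing might look, not the actual `load_data_file()`: it treats any line starting with `# ` as a section header and strips surrounding whitespace, which would lose rules that depend on leading or trailing spaces.

```python
def parse_data_file(text):
    """Split .data.txt content into a dict keyed by section name."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# "):
            current = line[2:].strip()
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


def parse_rules(lines):
    """Turn `left -> right` lines into a dict, skipping malformed lines."""
    rules = {}
    for line in lines:
        left, sep, right = line.partition(" -> ")
        if sep:
            rules[left.strip()] = right.strip()
    return rules


sample = """# CHOICE
colour -> color;colour
# REPLACE
-- -> \u2014
# CAP_IGNORE
NASA
FBI
"""
config = parse_data_file(sample)
# CHOICE values hold several options separated by semicolons.
choices = {w: o.split(";") for w, o in parse_rules(config["CHOICE"]).items()}
print(choices)
print(config["CAP_IGNORE"])
```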
- filename_output.txt - Main processed output
- debug.txt - Choice replacement log
- matches.txt - Detailed match processing log
- roman_conversions.log - Roman numeral conversion log
- pagination_debug.txt - Pagination removal log
- bookfix_execution.log - Complete execution log
Detailed, timestamped log entries are written to both stderr and the execution log file for debugging and verification.
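Dual logging of this kind can be set up with Python's standard `logging` module. This is a sketch of the general technique, not the program's own `log_message()`; the logger name, the line format, and the temporary directory used for the demo are assumptions.

```python
import logging
import os
import sys
import tempfile


def build_logger(log_path):
    """Create a logger that writes timestamped lines to stderr and a file."""
    logger = logging.getLogger("bookfix_sketch")
    logger.setLevel(logging.DEBUG)
    # Drop handlers from any earlier call so lines are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s")
    for handler in (
        logging.StreamHandler(sys.stderr),
        logging.FileHandler(log_path, mode="w", encoding="utf-8"),
    ):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Demo: write to a temporary folder rather than the working directory.
log_file = os.path.join(tempfile.mkdtemp(), "bookfix_execution.log")
logger = build_logger(log_file)
logger.info("Processing started")
```

Both handler types flush after each record, so the log file stays current even if the program exits unexpectedly.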
The application provides a clean, intuitive interface with:
- File selection dialog at startup for choosing input files
- Main processing window with checkbox options for each feature
- Interactive dialogs for word choice selection and processing decisions
- 1-9 - Select replacement option
- Numpad 1-9 - Select replacement option (alternative)
- Y - Yes (lowercase this and all remaining)
- N - No (keep uppercase)
- A - Add to ignore list
- I - Auto-lowercase (add to permanent list)
- File not found errors
  - Check file path and permissions
  - Ensure file is not open in another program
- Missing dependencies
  pip install beautifulsoup4
- GUI not appearing
  - Verify tkinter installation
  - Check Python version compatibility
- Slow processing
  - Large files may take time
  - Monitor progress bar for status
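For the dependency and GUI checks above, a quick script can report which modules Python cannot locate. This is a generic sketch, not part of bookfix.py. Note that `find_spec` only confirms a module can be found, so a broken tkinter install could still pass this check.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# bs4 is the import name of the beautifulsoup4 package.
missing = missing_modules(["tkinter", "bs4"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies found")
```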
- Use forward slashes in paths: C:/Users/name/Documents
- May require additional permissions for file access
- Grant file access permissions when prompted
- Use python3 command explicitly
- Ensure display server is running for GUI
- Install tkinter if not included:
  sudo apt-get install python3-tk
- bookfix.py - Main application file
- Global variables for state management
- Modular functions for each processing step
- Tkinter GUI with event-driven architecture
- Add new processing steps to run_processing()
- Extend .data.txt sections for new configuration options
- Implement additional file format support
Below is an enumeration of the main functions in the application, with brief descriptions of their responsibilities.
- center_window(win)
Centers a given Tk window on screen.
- log_message(message, level)
Writes timestamped log entries to stderr and a log file, flushing immediately.
- load_data_file()
Manually parses .data.txt into sections: choices, replacements, periods, default directory, ignore, uppercase-to-lowercase.
- save_default_directory_to_data_file(dir)
Updates or creates the # DEFAULT_FILE_DIR section in .data.txt.
- save_caps_data_file(ignore, lowercase)
Updates # CAP_IGNORE and # UPPER_TO_LOWER sections in .data.txt.
- select_file()
Opens a file dialog for selecting an input file, respecting the default directory.
- process_choices()
Interactive find-and-replace according to choices rules, with progress bar.
- highlight_current_match()
Highlights the next match in the text area for user confirmation.
- handle_caps_choice(choice)
Handles user input (y/n/a/i) for all-caps sequences: lowercase now, ignore, or auto-lowercase across the document and persist rules.
- process_all_caps_sequences_gui()
Two-pass processing of all-caps sequences: automatic pass based on persistent rules, then interactive pass with buttons and keyboard shortcuts.
- apply_automatic_replacements()
Performs simple string replacements defined under # REPLACE.
- insert_periods_into_abbreviations()
Inserts dots into abbreviations defined under # PERIODS.
- convert_to_lowercase()
Converts the entire text buffer to lowercase.
- roman_to_arabic(roman)
Converts a single Roman numeral string to its integer equivalent, validating format.
- convert_roman_numerals()
Finds and replaces Roman numerals in the text with Arabic numbers, line by line.
- remove_pagination()
Detects and removes pagination elements in TXT and HTML files, logs removed items.
- remove_blank_lines(text)
Returns text with empty or whitespace-only lines removed.
- run_processing()
Orchestrates the full workflow based on checkbox states, including interactive and automatic steps, and displays the Save button.
- start_processing_button_command()
Disables the Start button, resets UI, clears old logs, and invokes run_processing().
- update_text_area()
Refreshes the displayed text to match the in-memory text variable.
- update_status_label(msg)
Updates the status label text in the GUI.
- save_file()
Saves the processed text to a new file with an _output.txt suffix.
- display_save_button()
Makes the Save button visible after processing.
- quit_program()
Exits the application cleanly.
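Several of the functions above update or create a single section of .data.txt while leaving the others untouched. A pure-string sketch of that idea follows. The name `replace_section` is hypothetical, and the real `save_default_directory_to_data_file()` and `save_caps_data_file()` may work differently.

```python
def replace_section(text, section, new_lines):
    """Return `text` with the body of `# section` replaced, or appended if absent."""
    header = "# " + section
    output = []
    skipping = False
    found = False
    for line in text.splitlines():
        if line.strip() == header:
            # Emit the header and the new body, then skip the old body.
            output.append(header)
            output.extend(new_lines)
            skipping = True
            found = True
        elif line.startswith("# "):
            # Any other header ends the section being skipped.
            skipping = False
            output.append(line)
        elif not skipping:
            output.append(line)
    if not found:
        output.append(header)
        output.extend(new_lines)
    return "\n".join(output) + "\n"


original = "# REPLACE\n-- -> -\n# DEFAULT_FILE_DIR\n/old/path\n# CAP_IGNORE\nNASA\n"
updated = replace_section(original, "DEFAULT_FILE_DIR", ["/new/path"])
print(updated)
```

Reading the file, transforming the text, and writing it back in one step keeps the other sections byte-for-byte intact apart from line-ending normalization.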
This project is open source. Feel free to modify and distribute according to your needs.
For issues or questions:
- Check the log files for detailed error information
- Verify configuration file format
- Ensure all dependencies are installed
- Test with smaller files first
Last updated: 2025-07-20