# Bookfix - TTS Ebook Preprocessing Tool
Bookfix is a GUI-based text processing application for cleaning and formatting ebook text files for text-to-speech (TTS) and other uses. It combines automated and interactive tools to turn raw ebook text into clean, readable content.
## Important Notes
- **File Support**: Currently works with `.txt`, `.html`, and `.xhtml` files
- **Platform**: Written in Python and tested on Linux OS
- **Ebook Conversion**: For best results, convert EPUB files to TXT using Calibre before processing
- **EPUB Processing**: EPUB files can be processed directly by decompressing them and handling the HTML markup, but converting to TXT first is recommended for stability and ease of use
## Features
### 🔧 **Text Processing Capabilities**
- **Interactive Word Choices** - Manually select replacements for specific words with keyboard shortcuts
- **Automatic Replacements** - Apply predefined find/replace rules from configuration
- **Pagination Removal** - Remove page numbers and pagination elements from HTML/TXT files
- **Roman Numeral Conversion** - Convert Roman numerals to Arabic numbers (II → 2, XIV → 14)
- **All-Caps Sequence Processing** - Interactive handling of uppercase sequences with auto-lowercase options
- **Abbreviation Protection** - Prevents conversion of common abbreviations (I.D., Ph.D., etc.)
- **Numbered Line Editing** - Manual editing interface for lines containing 3+ digit numbers
- **Blank Line Removal** - Clean up excessive whitespace and empty lines
### 📁 **File Support**
- **Input formats**: `.txt`, `.html`, `.xhtml`
- **Output format**: `.txt` (with `_output` suffix)
- **BeautifulSoup integration** for HTML/XHTML processing
### 🎛️ **User Interface**
- **Checkbox controls** for enabling/disabling processing steps
- **Real-time text preview** with syntax highlighting
- **Progress tracking** with visual progress bars
- **Keyboard shortcuts** for faster operation
- **Status updates** throughout processing
## Installation
### Requirements
- Python 3.6 or higher
- Required packages:
  - `tkinter` (usually included with Python)
  - `beautifulsoup4`
### Setup
```bash
# Install required package
pip install beautifulsoup4
# Run the program
python3 bookfix.py
```
### macOS Installation
```bash
# Check Python version
python3 --version
# Install dependencies
pip3 install beautifulsoup4
# If tkinter is missing (rare)
brew install python-tk
# Run
python3 bookfix.py
```
## Usage
### Quick Start
1. **Launch the application**
   ```bash
   python3 bookfix.py
   ```
2. **Set default directory** (first run only)
   - Select your ebook library or text files folder
   - This setting is saved for future use
3. **Select input file**
   - Choose a `.txt`, `.html`, or `.xhtml` file to process
4. **Configure processing options**
   - Check/uncheck desired processing steps
   - All options are enabled by default
5. **Start processing**
   - Click the "Start Processing" button
   - Follow interactive prompts as needed
6. **Save results**
   - Click "Save" when processing is complete
   - Output is saved as `filename_output.txt`
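The output naming in step 6 can be sketched with `pathlib`. This is a minimal sketch, not the code in `bookfix.py`: the helper name `output_path_for` is hypothetical, and it assumes the output file is written next to the input file.

```python
from pathlib import Path


def output_path_for(input_path):
    """Return <stem>_output.txt in the same folder as the input file."""
    src = Path(input_path)
    # Drop the original extension and append the _output.txt suffix.
    return src.with_name(src.stem + "_output.txt")


print(output_path_for("library/novel.html").name)
```

The same rule applies to all three input formats, since only the file stem is kept.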
## Synopsis
The `bookfix.py` script is a GUI tool built with the Tkinter library. Its main goal is to help users clean and standardize text from input files by providing a way to handle inconsistent wording and apply automatic cleanup rules.
The program guides the user through making decisions for specific words and then performs a series of automatic text transformations based on rules read from a separate data file.
### Processing Steps
The program processes text in the following order:
1. **Apply Automatic Replacements** - Bulk find/replace operations
2. **Insert Periods into Abbreviations** - Add periods to specified abbreviations
3. **Remove Pagination** - Clean up page numbers and pagination elements
4. **Interactive Choices** - Manual word-by-word replacement decisions
5. **Process All-Caps Sequences** - Handle uppercase text interactively
6. **Convert Roman Numerals** - Transform Roman numerals to Arabic numbers
7. **Convert to Lowercase** - Optional full text lowercasing
8. **Remove Blank Lines** - Clean up excessive whitespace
9. **Numbered Line Editing** - Manual editing of lines with numbers
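The ordering above can be pictured as a list of steps, each applied only when its checkbox is ticked. The sketch below is illustrative rather than the actual `run_processing()` code: `STEPS`, `run_steps`, and the `enabled` dictionary are assumed names, the step functions take the text as an argument for simplicity, and only two of the nine steps are shown.

```python
def convert_to_lowercase(text):
    """Lowercase the whole text."""
    return text.lower()


def remove_blank_lines(text):
    """Drop empty and whitespace-only lines."""
    return "\n".join(line for line in text.splitlines() if line.strip())


# (label, function) pairs in processing order. Only two steps are shown;
# the remaining steps would slot into the same list.
STEPS = [
    ("Convert to Lowercase", convert_to_lowercase),
    ("Remove Blank Lines", remove_blank_lines),
]


def run_steps(text, enabled):
    """Apply each enabled step in list order and return the final text."""
    for label, step in STEPS:
        if enabled.get(label, False):
            text = step(text)
    return text


sample = "CHAPTER One\n\n   \nIt began.\n"
print(run_steps(sample, {"Remove Blank Lines": True}))
```

Keeping the order in one list means that disabling a step never changes the relative order of the others.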
### Interactive Features
#### Word Choices
- Press number keys (1-9) to select replacement options
- View highlighted matches in context
- Progress tracking shows completion status
#### All-Caps Processing
- **Y/Yes** - Lowercase this instance and all remaining instances
- **N/No** - Keep uppercase, skip for this session
- **A/Add** - Add to ignore list permanently
- **I/Auto** - Add to auto-lowercase list permanently
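Sequences already saved under `# CAP_IGNORE` or `# UPPER_TO_LOWER` do not need a prompt, so an automatic pass can handle them first and leave only undecided sequences for the Y/N/A/I choice. The sketch below shows one way to do that. The function name `auto_caps_pass` is hypothetical, and the definition of an all-caps sequence (two or more consecutive capital letters) is an assumption, not taken from `bookfix.py`.

```python
import re

# Assumption: an all-caps sequence is a run of two or more capital letters.
CAPS_RE = re.compile(r"\b[A-Z]{2,}\b")


def auto_caps_pass(text, ignore, auto_lower):
    """Lowercase sequences in auto_lower; collect undecided ones for review."""
    pending = []

    def replace(match):
        word = match.group(0)
        if word in auto_lower:
            return word.lower()
        # Anything not ignored and not yet queued still needs a decision.
        if word not in ignore and word not in pending:
            pending.append(word)
        return word

    return CAPS_RE.sub(replace, text), pending


text, pending = auto_caps_pass(
    "CHAPTER one: the FBI met NASA and the CIA.",
    ignore={"NASA", "FBI"},
    auto_lower={"CHAPTER"},
)
print(text)
print(pending)
```

In this example only `CIA` is left for the interactive prompt.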
#### Numbered Line Editing
- Edit lines containing 3+ digit numbers
- Navigate with Previous/Next buttons
- Roman numeral reference guide included
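Finding the lines offered for editing can be sketched with a regular expression. This is a minimal sketch, assuming that "3+ digit numbers" means three or more consecutive digits; the helper name `find_numbered_lines` is hypothetical.

```python
import re

# Assumption: a "3+ digit number" is three or more consecutive digits.
LONG_NUMBER_RE = re.compile(r"\d{3,}")


def find_numbered_lines(text):
    """Return (line_index, line) pairs for lines holding a 3+ digit number."""
    return [
        (index, line)
        for index, line in enumerate(text.splitlines())
        if LONG_NUMBER_RE.search(line)
    ]


sample = "In 1984 he left.\nRoom 12 was empty.\nCall 555 now."
for index, line in find_numbered_lines(sample):
    print(index, line)
```

The returned indices would let Previous/Next navigation move through the matches in document order.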
## Configuration
### .data.txt File
The program uses a `.data.txt` file for configuration with the following sections:
```
# CHOICE
word -> option1;option2;option3
# REPLACE
old_text -> new_text
# PERIODS
abbreviation_without_periods
# CAP_IGNORE
SEQUENCE_TO_IGNORE
# UPPER_TO_LOWER
SEQUENCE_TO_LOWERCASE
# DEFAULT_FILE_DIR
/path/to/your/ebook/folder
```
### Example Configuration
```
# CHOICE
colour -> color;colour
realise -> realize;realise
# REPLACE
-- -> —
... -> …
# PERIODS
Mr
Dr
St
# CAP_IGNORE
NASA
FBI
# UPPER_TO_LOWER
CHAPTER
BOOK
# DEFAULT_FILE_DIR
/Users/username/Documents/Ebooks
```
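The section layout above is simple enough to parse by hand. The following is a minimal sketch of one way to read it, not the actual `load_data_file()` implementation: `parse_data_file` and `split_rule` are hypothetical names, and the sketch assumes that every section header starts with `# ` and that rules use ` -> ` as the separator.

```python
def parse_data_file(text):
    """Group non-empty lines under the '# SECTION' header that precedes them."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# "):
            current = line[2:].strip()
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


def split_rule(line):
    """Split 'left -> right' on the first ' -> ' separator."""
    left, _, right = line.partition(" -> ")
    return left.strip(), right.strip()


sample = """# CHOICE
colour -> color;colour
# REPLACE
... -> …
# PERIODS
Mr
Dr
"""

sections = parse_data_file(sample)
choices = {word: options.split(";") for word, options in map(split_rule, sections["CHOICE"])}
replacements = dict(map(split_rule, sections["REPLACE"]))

# Apply the REPLACE rules as plain string replacements.
text = "Mr Smith paused..."
for old, new in replacements.items():
    text = text.replace(old, new)

print(choices)
print(text)
```

Because each line is stripped, this sketch cannot express rules whose text begins or ends with whitespace.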
## Output Files
### Generated Files
- `filename_output.txt` - Main processed output
- `debug.txt` - Choice replacement log
- `matches.txt` - Detailed match processing log
- `roman_conversions.log` - Roman numeral conversion log
- `pagination_debug.txt` - Pagination removal log
- `bookfix_execution.log` - Complete execution log
### Logging
The program writes detailed, timestamped log entries to both stderr and the execution log file, for debugging and for verifying what was changed.
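A logger of this kind can be sketched in a few lines. The version below is an illustration only: the `log_path` parameter, the default level, and the line format are assumptions, and the message text is a made-up example.

```python
import os
import sys
import tempfile
from datetime import datetime


def log_message(message, level="INFO", log_path="bookfix_execution.log"):
    """Write one timestamped line to stderr and append it to log_path."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    line = f"{stamp} [{level}] {message}"
    # flush=True pushes the line out immediately, even if the program crashes later.
    print(line, file=sys.stderr, flush=True)
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line


# Usage: write to a throwaway folder so the demo leaves nothing in the project.
demo_path = os.path.join(tempfile.mkdtemp(), "bookfix_execution.log")
entry = log_message("example message", "INFO", demo_path)
```

Opening the file in append mode for each entry keeps earlier entries intact if processing stops partway through.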
## User Interface
The application provides a clean, intuitive interface with:
- File selection dialog at startup for choosing input files
- Main processing window with checkbox options for each feature
- Interactive dialogs for word choice selection and processing decisions
## Keyboard Shortcuts
### Interactive Choices
- `1-9` - Select replacement option
- `Numpad 1-9` - Select replacement option (alternative)
### All-Caps Processing
- `Y` - Yes (lowercase this and all remaining)
- `N` - No (keep uppercase)
- `A` - Add to ignore list
- `I` - Auto-lowercase (add to permanent list)
## Troubleshooting
### Common Issues
1. **File not found errors**
   - Check file path and permissions
   - Ensure file is not open in another program
2. **Missing dependencies**
   ```bash
   pip install beautifulsoup4
   ```
3. **GUI not appearing**
   - Verify tkinter installation
   - Check Python version compatibility
4. **Slow processing**
   - Large files may take time
   - Monitor progress bar for status
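For the dependency and tkinter checks above, a quick import test narrows down what is missing. This is a small standalone sketch, not part of `bookfix.py`; the helper name `missing_modules` is hypothetical.

```python
import importlib


def missing_modules(names):
    """Return the names from `names` that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# bs4 is the import name of the beautifulsoup4 package.
missing = missing_modules(["tkinter", "bs4"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules can be imported.")
```

A successful import of `tkinter` does not prove that a display is available, so a GUI that still fails to appear may point to the display server instead.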
### Platform-Specific Notes
#### Windows
- Use forward slashes in paths: `C:/Users/name/Documents`
- May require additional permissions for file access
#### macOS
- Grant file access permissions when prompted
- Use `python3` command explicitly
#### Linux
- Ensure display server is running for GUI
- Install tkinter if not included: `sudo apt-get install python3-tk`
## Development
### Code Structure
- `bookfix.py` - Main application file
- Global variables for state management
- Modular functions for each processing step
- Tkinter GUI with event-driven architecture
### Extending Functionality
- Add new processing steps to `run_processing()`
- Extend `.data.txt` sections for new configuration options
- Implement additional file format support
## Function Reference
The main functions in the application and their responsibilities are listed below.
* `center_window(win)`: Centers a given Tk window on screen.
* `log_message(message, level)`: Writes timestamped log entries to stderr and a log file, flushing immediately.
* `load_data_file()`: Manually parses `.data.txt` into sections: choices, replacements, periods, default directory, ignore, uppercase-to-lowercase.
* `save_default_directory_to_data_file(dir)`: Updates or creates the `# DEFAULT_FILE_DIR` section in `.data.txt`.
* `save_caps_data_file(ignore, lowercase)`: Updates `# CAP_IGNORE` and `# UPPER_TO_LOWER` sections in `.data.txt`.
* `select_file()`: Opens a file dialog for selecting an input file, respecting the default directory.
* `process_choices()`: Interactive find-and-replace according to choices rules, with progress bar.
* `highlight_current_match()`: Highlights the next match in the text area for user confirmation.
* `handle_caps_choice(choice)`: Handles user input (y/n/a/i) for all-caps sequences: lowercase now, ignore, or auto-lowercase across the document and persist rules.
* `process_all_caps_sequences_gui()`: Two-pass processing of all-caps sequences: automatic pass based on persistent rules, then interactive pass with buttons and keyboard shortcuts.
* `apply_automatic_replacements()`: Performs simple string replacements defined under `# REPLACE`.
* `insert_periods_into_abbreviations()`: Inserts dots into abbreviations defined under `# PERIODS`.
* `convert_to_lowercase()`: Converts the entire text buffer to lowercase.
* `roman_to_arabic(roman)`: Converts a single Roman numeral string to its integer equivalent, validating format.
* `convert_roman_numerals()`: Finds and replaces Roman numerals in the text with Arabic numbers, line by line.
* `remove_pagination()`: Detects and removes pagination elements in TXT and HTML files, logs removed items.
* `remove_blank_lines(text)`: Returns text with empty or whitespace-only lines removed.
* `run_processing()`: Orchestrates the full workflow based on checkbox states, including interactive and automatic steps, and displays the Save button.
* `start_processing_button_command()`: Disables the Start button, resets UI, clears old logs, and invokes `run_processing()`.
* `update_text_area()`: Refreshes the displayed text to match the in-memory text variable.
* `update_status_label(msg)`: Updates the status label text in the GUI.
* `save_file()`: Saves the processed text to a new file with an `_output.txt` suffix.
* `display_save_button()`: Makes the Save button visible after processing.
* `quit_program()`: Exits the application cleanly.
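As an illustration of what a function like `roman_to_arabic(roman)` has to do, the sketch below validates the numeral against the standard form and then sums the symbol values, subtracting any symbol that precedes a larger one. It is not the code from `bookfix.py`: accepting uppercase input only, limiting the range to 1 through 3999, and returning `None` for malformed input are all assumptions.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Standard-form numerals from 1 to 3999 (assumption: uppercase only).
ROMAN_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def roman_to_arabic(roman):
    """Return the integer value of a Roman numeral, or None if malformed."""
    # The pattern also matches the empty string, so reject that explicitly.
    if not roman or not ROMAN_RE.fullmatch(roman):
        return None
    total = 0
    for index, char in enumerate(roman):
        value = ROMAN_VALUES[char]
        following = ROMAN_VALUES[roman[index + 1]] if index + 1 < len(roman) else 0
        # A smaller symbol before a larger one is subtracted (IV = 4, XC = 90).
        total += -value if value < following else value
    return total


for numeral in ("II", "XIV", "MCMXCIV", "IIII"):
    print(numeral, roman_to_arabic(numeral))
```

Strict validation matters here because ordinary words such as "MIX" or "DIM" are also built from Roman numeral letters. "DIM" is rejected by the pattern, but "MIX" is a valid numeral, so the surrounding text still has to be considered before replacing.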
## License
This project is open source. Feel free to modify and distribute according to your needs.
## Support
For issues or questions:
1. Check the log files for detailed error information
2. Verify configuration file format
3. Ensure all dependencies are installed
4. Test with smaller files first
---
*Last updated: 2025-07-20*