
Normalizewhitespace #228

Open
aaxis-em wants to merge 1 commit into edgi-govdata-archiving:main from aaxis-em:normalizewhitespace


@aaxis-em
Contributor

fixes #198

Member

@Mr0grog Mr0grog left a comment


Hi @aaxis-em, thanks for working on this! I appreciate the effort. 🙇

However, I’m not sure about the approach here. From the original issue:

a good way to handle this is to give tokens a diffable text representation (in which non-breaking spaces are replaced by spaces) and a literal text representation (where these types of characters are unchanged). The former is used for comparisons, but the latter is used when stitching the actual diff back together.

This PR normalizes every kind of space to a plain " ", so that is what appears in the diff output, even when the source used a different kind of space.
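
To make the distinction from the issue concrete, here is a rough sketch (not code from this repository; `Token`, `diffable`, `literal`, and `unchanged_text` are hypothetical names, and it assumes only non-breaking spaces need normalizing). Matching runs on the diffable text, while the output is stitched together from the literal text:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

NBSP = "\u00a0"  # non-breaking space


@dataclass(frozen=True)
class Token:
    literal: str  # text exactly as it appeared in the source

    @property
    def diffable(self):
        # Assumption: only non-breaking spaces are normalized here.
        return self.literal.replace(NBSP, " ")


def unchanged_text(old, new):
    """Literal text of the tokens in `new` that match `old` on diffable text."""
    matcher = SequenceMatcher(
        None,
        [t.diffable for t in old],
        [t.diffable for t in new],
        autojunk=False,
    )
    kept = []
    for block in matcher.get_matching_blocks():
        # Compare on `diffable`, but emit `literal`.
        kept.extend(t.literal for t in new[block.b:block.b + block.size])
    return kept


old = [Token("10 kg"), Token("of"), Token("rice")]
new = [Token("10\u00a0kg"), Token("of"), Token("rice")]
print(unchanged_text(old, new))
```

Here the first token is treated as unchanged, yet the non-breaking space from `new` survives in the output instead of being rewritten to a plain space.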

A better approach here is probably to customize the __eq__() method in DiffToken (and some of its subclasses) to use normalized text, while the html() method continues to return the original text:

class DiffToken(str):
    """ Represents a diffable token, generally a word that is displayed to
    the user. Opening tags are attached to this token when they are
    adjacent (pre_tags) and closing tags that follow the word
    (post_tags). Some exceptions occur when there are empty tags
    adjacent to a word, so there may be close tags in pre_tags, or
    open tags in post_tags.
    We also keep track of whether the word was originally followed by
    whitespace, even though we do not want to treat the word as
    equivalent to a similar word that does not have a trailing
    space."""

    # When this is true, the token will be eliminated from the
    # displayed diff if no change has occurred:
    hide_when_equal = False

    def __new__(cls, text, pre_tags=None, post_tags=None, trailing_whitespace=""):
        obj = str.__new__(cls, text)

        if pre_tags is not None:
            obj.pre_tags = pre_tags
        else:
            obj.pre_tags = []

        if post_tags is not None:
            obj.post_tags = post_tags
        else:
            obj.post_tags = []

        obj.trailing_whitespace = trailing_whitespace

        return obj

    def __repr__(self):
        return 'DiffToken(%s, %r, %r, %r)' % (str.__repr__(self), self.pre_tags,
                                              self.post_tags, self.trailing_whitespace)

    def html(self):
        return str(self)

A good example of this kind of thing is href_token, which uses a special comparator function in its __eq__() method rather than comparing the actual URL that it renders in html():

class href_token(DiffToken):
    """ Represents the href in an anchor tag. Unlike other words, we only
    show the href when it changes. """

    hide_when_equal = True

    def __new__(cls, href, comparator, pre_tags=None,
                post_tags=None, trailing_whitespace=""):
        obj = DiffToken.__new__(cls, text=href,
                                pre_tags=pre_tags,
                                post_tags=post_tags,
                                trailing_whitespace=trailing_whitespace)
        obj.comparator = comparator
        return obj

    def __eq__(self, other):
        # This equality check aims to apply specific rules to the contents of
        # the href element solving false positive cases
        if not isinstance(other, href_token):
            return False
        if self.comparator:
            return self.comparator.compare(str(self), str(other))
        return super().__eq__(other)

    def __hash__(self):
        return super().__hash__()

    def html(self):
        return ' Link: %s' % self
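
Applied to whitespace, the same pattern might look roughly like the sketch below. This is only an illustration under stated assumptions: `NormalizedToken` and `normalize_spaces` are hypothetical names standing in for `DiffToken` and whatever helper the project would actually use, and "space-like" is assumed to mean anything Python's regex `\s` matches (which includes U+00A0).

```python
import re

# Assumption: "space-like" means anything \s matches for str patterns,
# which includes U+00A0 (no-break space) and other Unicode spaces.
_SPACE_RUN = re.compile(r"\s+")


def normalize_spaces(text):
    """Collapse each run of whitespace into a single ASCII space."""
    # str() yields a plain str, so comparing the results below cannot
    # recurse back into NormalizedToken.__eq__.
    return _SPACE_RUN.sub(" ", str(text))


class NormalizedToken(str):
    """Hypothetical stand-in for DiffToken, for illustration only."""

    def __eq__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        return normalize_spaces(self) == normalize_spaces(other)

    def __ne__(self, other):
        # str defines its own __ne__, so it has to be overridden as well.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Tokens that compare equal must also hash equally, otherwise
        # dict- or set-based matching will not treat them as the same.
        return hash(normalize_spaces(self))

    def html(self):
        # The rendered output keeps the original characters.
        return str(self)


a = NormalizedToken("10\u00a0kg")
b = NormalizedToken("10 kg")
print(a == b, a.html() == b.html())  # True False
```

The two tokens compare equal for diffing purposes, but each still renders its own original text. Keeping `__hash__` consistent with `__eq__` matters if the matcher looks tokens up in a dictionary, as `difflib.SequenceMatcher` does.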


Also, please separate these changes from the OCI label changes you made in #227, and please also remove the style changes. They make the diff here much harder to review.

@Mr0grog
Member

Mr0grog commented Mar 19, 2026

Hi @aaxis-em, sorry I have not been able to get to this. I’m at a conference this week, and will try to take a look this weekend or early next week.


Development

Successfully merging this pull request may close these issues.

Normalize whitespace in HTML token diff

2 participants