Regex character class range generator with certain characters excluded

While I was working on a personal project, I found I needed to exclude certain characters in a regex range. So I thought, why not implement a custom range exclusion check for greater efficiency, it'll only take 5 lines or so. 60+ lines later, I produced this:

regex_supplement.py:

"""Supplementary regex modifying methods."""



from itertools import tee



def pairwise(iterable):

    """s -> (s0,s1), (s1,s2), (s2, s3), ...



    From the itertools recipes.

    """

    a, b = tee(iterable)

    next(b, None)

    return zip(a, b)



def _offset_byte_char(char, offset):

    """Offset a single byte char"""

    return (ord(char) + offset).to_bytes(1, byteorder='big')



def escape_byte_in_character_class(char):

    """Escapes characters as necessary within the character class."""

    return b'\' + char if char in b'-]^' else char



def byte_range(startchar, endchar, excluded):

    """Returns a pattern matching characters in the range startchar-endchar.

    Characters in excluded are excluded from the range.

    """

    excluded = sorted(char for char in excluded if startchar <= char <= endchar)



    char_ranges = 

    if len(excluded) >= 1:

        first_exclude = excluded[0]

        if startchar != first_exclude:

            # Another possibility = + 1

            char_ranges.append(

                (startchar, _offset_byte_char(first_exclude, -1))

            )

        for start, end in pairwise(excluded):

            # Adjacent or equal

            if ord(end) - 1 <= ord(start):

                continue

            char_ranges.append(

                (_offset_byte_char(start, 1), _offset_byte_char(end, -1))

            )

        last_exclude = excluded[-1]

        if endchar != last_exclude:

            char_ranges.append(

                (_offset_byte_char(last_exclude, 1), endchar)

            )

    else:

        char_ranges = [(startchar, endchar)]



    char_output = b''

    escape = escape_byte_in_character_class

    for char_range in char_ranges:

        # Doesn't minimize all '-', but that quickly gets complicated.

        # (Whether '-' needs to be escaped within a regex range is context dependent.)

        # '^' has even more potential easy optimization.

        start, end = char_range

        if start == end:

            char_output += escape(start)

        elif ord(start) == ord(end) - 1:

            char_output += escape(start) + escape(end)

        else:

            char_output += escape(start) + b'-' + escape(end)

    return b'[' + char_output + b']'

test_regex_supplement.py:

"""Regex supplement tests"""



import regex_supplement as re_supp



def test_byte_regex_range_empty():

    """Test that empty exclusions do not affect the range"""

    assert re_supp.byte_range(b'a', b'c', ) == b'[a-c]'



def test_byte_regex_range_exclusion_outside():

    """An exclusion outside of the regex range should have no effect."""

    assert re_supp.byte_range(b'a', b'c', [b'e']) == b'[a-c]'



offset = re_supp._offset_byte_char



def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'



def test_byte_regex_range_escaped_rbrac():

    """Test that ']' is escaped"""

    compare_escaped_within_range(b']')



def test_byte_regex_range_escaped_hyphen():

    """Test that '-' is escaped"""

    compare_escaped_within_range(b'-')



def test_byte_regex_range_escaped_caret():

    """Test that '^' is escaped"""

    compare_escaped_within_range(b'^')



def test_byte_regex_range_standard_1():

    """Test that a standard range behaves as expected"""

    assert re_supp.byte_range(b'a', b'g', [b'd']) == b'[a-ce-g]'



def test_byte_regex_range_standard_2():

    """Test that a standard range with multiple exclusions behaves as expected"""

    assert re_supp.byte_range(b'a', b'k', [b'd', b'h']) == b'[a-ce-gi-k]'



def test_byte_regex_range_optimized_1():

    """Test that ranges of 1 char are optimized to single characters."""

    assert re_supp.byte_range(b'a', b'c', [b'b']) == b'[ac]'



def test_byte_regex_range_optimized_2():

    """Test that multiple ranges of 1 chars are optimized to single characters."""

    assert re_supp.byte_range(b'a', b'e', [b'b', b'd']) == b'[ace]'

This is only implemented for bytestrings, because that's what I needed for my project. I actually learned that Python regular expressions can handle bytestrings during this project. The tests are intended to be run by pytest. I was originally considering adding in an optimization for single character ranges, but I decided not to because it would lead to more complicated escape-handling code (and the possibility for subtle bugs like double-escaping ]), and it wasn't needed for my purposes.

I'm mostly concerned with efficiency (mostly of the resultant regex, but also of the program) and accuracy-checking, but stylistic and readability improvements are also appreciated.

Also, in hindsight, I might have considered implementing a lookahead with an exclusion character check preceding the range, but my current approach does have the advantage of discarding excluded characters that are outside of the range, and requiring less escaping.

edited Jan 2 at 21:12

asked Jan 2 at 20:46

Graham

970113

add a comment |

regex_supplement.py:

"""Supplementary regex modifying methods."""



from itertools import tee



def pairwise(iterable):

    """s -> (s0,s1), (s1,s2), (s2, s3), ...



    From the itertools recipes.

    """

    a, b = tee(iterable)

    next(b, None)

    return zip(a, b)



def _offset_byte_char(char, offset):

    """Offset a single byte char"""

    return (ord(char) + offset).to_bytes(1, byteorder='big')



def escape_byte_in_character_class(char):

    """Escapes characters as necessary within the character class."""

    return b'\' + char if char in b'-]^' else char



def byte_range(startchar, endchar, excluded):

    """Returns a pattern matching characters in the range startchar-endchar.

    Characters in excluded are excluded from the range.

    """

    excluded = sorted(char for char in excluded if startchar <= char <= endchar)



    char_ranges = 

    if len(excluded) >= 1:

        first_exclude = excluded[0]

        if startchar != first_exclude:

            # Another possibility = + 1

            char_ranges.append(

                (startchar, _offset_byte_char(first_exclude, -1))

            )

        for start, end in pairwise(excluded):

            # Adjacent or equal

            if ord(end) - 1 <= ord(start):

                continue

            char_ranges.append(

                (_offset_byte_char(start, 1), _offset_byte_char(end, -1))

            )

        last_exclude = excluded[-1]

        if endchar != last_exclude:

            char_ranges.append(

                (_offset_byte_char(last_exclude, 1), endchar)

            )

    else:

        char_ranges = [(startchar, endchar)]



    char_output = b''

    escape = escape_byte_in_character_class

    for char_range in char_ranges:

        # Doesn't minimize all '-', but that quickly gets complicated.

        # (Whether '-' needs to be escaped within a regex range is context dependent.)

        # '^' has even more potential easy optimization.

        start, end = char_range

        if start == end:

            char_output += escape(start)

        elif ord(start) == ord(end) - 1:

            char_output += escape(start) + escape(end)

        else:

            char_output += escape(start) + b'-' + escape(end)

    return b'[' + char_output + b']'

test_regex_supplement.py:

"""Regex supplement tests"""



import regex_supplement as re_supp



def test_byte_regex_range_empty():

    """Test that empty exclusions do not affect the range"""

    assert re_supp.byte_range(b'a', b'c', ) == b'[a-c]'



def test_byte_regex_range_exclusion_outside():

    """An exclusion outside of the regex range should have no effect."""

    assert re_supp.byte_range(b'a', b'c', [b'e']) == b'[a-c]'



offset = re_supp._offset_byte_char



def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'



def test_byte_regex_range_escaped_rbrac():

    """Test that ']' is escaped"""

    compare_escaped_within_range(b']')



def test_byte_regex_range_escaped_hyphen():

    """Test that '-' is escaped"""

    compare_escaped_within_range(b'-')



def test_byte_regex_range_escaped_caret():

    """Test that '^' is escaped"""

    compare_escaped_within_range(b'^')



def test_byte_regex_range_standard_1():

    """Test that a standard range behaves as expected"""

    assert re_supp.byte_range(b'a', b'g', [b'd']) == b'[a-ce-g]'



def test_byte_regex_range_standard_2():

    """Test that a standard range with multiple exclusions behaves as expected"""

    assert re_supp.byte_range(b'a', b'k', [b'd', b'h']) == b'[a-ce-gi-k]'



def test_byte_regex_range_optimized_1():

    """Test that ranges of 1 char are optimized to single characters."""

    assert re_supp.byte_range(b'a', b'c', [b'b']) == b'[ac]'



def test_byte_regex_range_optimized_2():

    """Test that multiple ranges of 1 chars are optimized to single characters."""

    assert re_supp.byte_range(b'a', b'e', [b'b', b'd']) == b'[ace]'

I'm mostly concerned with efficiency (mostly of the resultant regex, but also of the program) and accuracy-checking, but stylistic and readability improvements are also appreciated.

edited Jan 2 at 21:12

asked Jan 2 at 20:46

Graham

970113

add a comment |

regex_supplement.py:

"""Supplementary regex modifying methods."""



from itertools import tee



def pairwise(iterable):

    """s -> (s0,s1), (s1,s2), (s2, s3), ...



    From the itertools recipes.

    """

    a, b = tee(iterable)

    next(b, None)

    return zip(a, b)



def _offset_byte_char(char, offset):

    """Offset a single byte char"""

    return (ord(char) + offset).to_bytes(1, byteorder='big')



def escape_byte_in_character_class(char):

    """Escapes characters as necessary within the character class."""

    return b'\' + char if char in b'-]^' else char



def byte_range(startchar, endchar, excluded):

    """Returns a pattern matching characters in the range startchar-endchar.

    Characters in excluded are excluded from the range.

    """

    excluded = sorted(char for char in excluded if startchar <= char <= endchar)



    char_ranges = 

    if len(excluded) >= 1:

        first_exclude = excluded[0]

        if startchar != first_exclude:

            # Another possibility = + 1

            char_ranges.append(

                (startchar, _offset_byte_char(first_exclude, -1))

            )

        for start, end in pairwise(excluded):

            # Adjacent or equal

            if ord(end) - 1 <= ord(start):

                continue

            char_ranges.append(

                (_offset_byte_char(start, 1), _offset_byte_char(end, -1))

            )

        last_exclude = excluded[-1]

        if endchar != last_exclude:

            char_ranges.append(

                (_offset_byte_char(last_exclude, 1), endchar)

            )

    else:

        char_ranges = [(startchar, endchar)]



    char_output = b''

    escape = escape_byte_in_character_class

    for char_range in char_ranges:

        # Doesn't minimize all '-', but that quickly gets complicated.

        # (Whether '-' needs to be escaped within a regex range is context dependent.)

        # '^' has even more potential easy optimization.

        start, end = char_range

        if start == end:

            char_output += escape(start)

        elif ord(start) == ord(end) - 1:

            char_output += escape(start) + escape(end)

        else:

            char_output += escape(start) + b'-' + escape(end)

    return b'[' + char_output + b']'

test_regex_supplement.py:

"""Regex supplement tests"""



import regex_supplement as re_supp



def test_byte_regex_range_empty():

    """Test that empty exclusions do not affect the range"""

    assert re_supp.byte_range(b'a', b'c', ) == b'[a-c]'



def test_byte_regex_range_exclusion_outside():

    """An exclusion outside of the regex range should have no effect."""

    assert re_supp.byte_range(b'a', b'c', [b'e']) == b'[a-c]'



offset = re_supp._offset_byte_char



def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'



def test_byte_regex_range_escaped_rbrac():

    """Test that ']' is escaped"""

    compare_escaped_within_range(b']')



def test_byte_regex_range_escaped_hyphen():

    """Test that '-' is escaped"""

    compare_escaped_within_range(b'-')



def test_byte_regex_range_escaped_caret():

    """Test that '^' is escaped"""

    compare_escaped_within_range(b'^')



def test_byte_regex_range_standard_1():

    """Test that a standard range behaves as expected"""

    assert re_supp.byte_range(b'a', b'g', [b'd']) == b'[a-ce-g]'



def test_byte_regex_range_standard_2():

    """Test that a standard range with multiple exclusions behaves as expected"""

    assert re_supp.byte_range(b'a', b'k', [b'd', b'h']) == b'[a-ce-gi-k]'



def test_byte_regex_range_optimized_1():

    """Test that ranges of 1 char are optimized to single characters."""

    assert re_supp.byte_range(b'a', b'c', [b'b']) == b'[ac]'



def test_byte_regex_range_optimized_2():

    """Test that multiple ranges of 1 chars are optimized to single characters."""

    assert re_supp.byte_range(b'a', b'e', [b'b', b'd']) == b'[ace]'

I'm mostly concerned with efficiency (mostly of the resultant regex, but also of the program) and accuracy-checking, but stylistic and readability improvements are also appreciated.

edited Jan 2 at 21:12

asked Jan 2 at 20:46

Graham

970113

regex_supplement.py:

"""Supplementary regex modifying methods."""



from itertools import tee



def pairwise(iterable):

    """s -> (s0,s1), (s1,s2), (s2, s3), ...



    From the itertools recipes.

    """

    a, b = tee(iterable)

    next(b, None)

    return zip(a, b)



def _offset_byte_char(char, offset):

    """Offset a single byte char"""

    return (ord(char) + offset).to_bytes(1, byteorder='big')



def escape_byte_in_character_class(char):

    """Escapes characters as necessary within the character class."""

    return b'\' + char if char in b'-]^' else char



def byte_range(startchar, endchar, excluded):

    """Returns a pattern matching characters in the range startchar-endchar.

    Characters in excluded are excluded from the range.

    """

    excluded = sorted(char for char in excluded if startchar <= char <= endchar)



    char_ranges = 

    if len(excluded) >= 1:

        first_exclude = excluded[0]

        if startchar != first_exclude:

            # Another possibility = + 1

            char_ranges.append(

                (startchar, _offset_byte_char(first_exclude, -1))

            )

        for start, end in pairwise(excluded):

            # Adjacent or equal

            if ord(end) - 1 <= ord(start):

                continue

            char_ranges.append(

                (_offset_byte_char(start, 1), _offset_byte_char(end, -1))

            )

        last_exclude = excluded[-1]

        if endchar != last_exclude:

            char_ranges.append(

                (_offset_byte_char(last_exclude, 1), endchar)

            )

    else:

        char_ranges = [(startchar, endchar)]



    char_output = b''

    escape = escape_byte_in_character_class

    for char_range in char_ranges:

        # Doesn't minimize all '-', but that quickly gets complicated.

        # (Whether '-' needs to be escaped within a regex range is context dependent.)

        # '^' has even more potential easy optimization.

        start, end = char_range

        if start == end:

            char_output += escape(start)

        elif ord(start) == ord(end) - 1:

            char_output += escape(start) + escape(end)

        else:

            char_output += escape(start) + b'-' + escape(end)

    return b'[' + char_output + b']'

test_regex_supplement.py:

"""Regex supplement tests"""



import regex_supplement as re_supp



def test_byte_regex_range_empty():

    """Test that empty exclusions do not affect the range"""

    assert re_supp.byte_range(b'a', b'c', ) == b'[a-c]'



def test_byte_regex_range_exclusion_outside():

    """An exclusion outside of the regex range should have no effect."""

    assert re_supp.byte_range(b'a', b'c', [b'e']) == b'[a-c]'



offset = re_supp._offset_byte_char



def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'



def test_byte_regex_range_escaped_rbrac():

    """Test that ']' is escaped"""

    compare_escaped_within_range(b']')



def test_byte_regex_range_escaped_hyphen():

    """Test that '-' is escaped"""

    compare_escaped_within_range(b'-')



def test_byte_regex_range_escaped_caret():

    """Test that '^' is escaped"""

    compare_escaped_within_range(b'^')



def test_byte_regex_range_standard_1():

    """Test that a standard range behaves as expected"""

    assert re_supp.byte_range(b'a', b'g', [b'd']) == b'[a-ce-g]'



def test_byte_regex_range_standard_2():

    """Test that a standard range with multiple exclusions behaves as expected"""

    assert re_supp.byte_range(b'a', b'k', [b'd', b'h']) == b'[a-ce-gi-k]'



def test_byte_regex_range_optimized_1():

    """Test that ranges of 1 char are optimized to single characters."""

    assert re_supp.byte_range(b'a', b'c', [b'b']) == b'[ac]'



def test_byte_regex_range_optimized_2():

    """Test that multiple ranges of 1 chars are optimized to single characters."""

    assert re_supp.byte_range(b'a', b'e', [b'b', b'd']) == b'[ace]'

I'm mostly concerned with efficiency (mostly of the resultant regex, but also of the program) and accuracy-checking, but stylistic and readability improvements are also appreciated.

python python-3.x regex

edited Jan 2 at 21:12

asked Jan 2 at 20:46

Graham

970113

edited Jan 2 at 21:12

asked Jan 2 at 20:46

Graham

970113

edited Jan 2 at 21:12

asked Jan 2 at 20:46

Graham

970113

asked Jan 2 at 20:46

Graham

970113

asked Jan 2 at 20:46

Graham

970113

add a comment |

1 Answer
1

active

oldest

votes

offset = re_supp._offset_byte_char



def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'

Two weird things going on here. First, your unit test depends on an internal implementation detail of re_supp (namely the _offset_byte_char method). Second, you're actually storing that method in a global variable, which makes your test not standalone. You couldn't extract just the compare_escaped_within_range test into a different test file, for example, unless you brought the global offset along with it. It would at least be more maintainable to say

def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    offset = re_supp._offset_byte_char

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'

But it would be even better not to depend on that detail at all! Either make _offset_byte_char a public member of the API (lose the underscore), or just reimplement it yourself where you need it.

When you reimplement it, consider that Python offers the chr builtin to do exactly what you're currently doing with all that byteorder='big' nonsense on ints. Although I guess Python3 does force you to know about latin_1 encoding:

offset = lambda x, y: chr(ord(x) + y).encode('latin_1')

IIUC, your public API really just consists of the function byte_range(b, e, xs), where b and e are single bytes and xs is an iterable of bytes — which generates a regex character class matching any of the bytes in b to e inclusive, but excluding all of the bytes in xs.

In other words, it produces a regex which is the moral equivalent of (?=[b-e])(?![xs])., or more succinctly (?![xs])[b-e].

I guess you knew this, but it does feel like your 60-line thing is a bit overengineered. ;)

Personally I would write ch everywhere you wrote char, but that's probably partly due to my being primarily a C and C++ programmer. ;) ch would still be shorter and arguably harder to confuse with chr, though...

Again on the testing side: your compare_escaped_within_range seems overdetermined and overcomplicated. If I were writing tests for your API, I would write one like this:

def test_inputs(b, e, xs):

    needle = re_supp.byte_range(b, e, xs)

    for i in range(256):

        haystack = chr(i).encode('latin_1')

        expected = (ord(b) <= i <= ord(e)) and (haystack not in xs)

        actual = bool(re.match(needle, haystack))

        assert actual == expected

And then I'd make sure all of my tests at least verified that test_inputs(b, e, xs) passed for their inputs. Even if they went on to verify other properties.

def test_byte_regex_range_escaped_caret():

    """Test that '^' is escaped"""

    test_inputs(b'^', b'a', [b'a'])

    assert re_supp.byte_range(b'^', b'a', [b'a']) == b'[\^-`]')  # should we even verify this, or are we micromanaging?

Incidentally, you got very lucky that out of ], ^, and -, no two of them are exactly 2 positions apart in the ASCII table! Did you know that? Do you expect your maintainer to know that 3 is a magic number in this respect — or should you maybe document it for posterity?

I ran my test_inputs on some random data and discovered some bugs in your code. Most blatantly, you forgot that needs to be escaped. So test_inputs(b'\', b'A', ) fails.

Fix the trivial bug by adding two characters in escape_byte_in_character_class. Now add a regression test: compare_escaped_within_range(b'\'). The test still fails! Why? :)

And you don't handle the completely empty case: test_inputs(b'A', b'A', [b'A']) fails.

And you don't handle test_inputs(b'B', b'A', ) either, but that's more of an input validation issue, which is an issue of taste. Maybe you want to return b'(?!.).' in that case; maybe you want to assert-fail; or maybe you're happy just saying "don't do that."

The moral of the story is that you should always write a test harness that permits fuzzing! It's never wasted effort. ...Well, it's never effort that fails to find bugs, anyway. :)

answered Jan 3 at 5:12

Quuxplusone

11.5k11959

Very useful answer all around! Regarding switching to chr().encode('latin1'), I feel like that's relying on an implementation detail of bytestrings that latin1 happens to map to bytestrings perfectly. I guess that's unlikely to change, but I still feel it makes the code less clear. I think Python just has a rough edge by always requiring byteorder in to_bytes. For 1 byte conversions, there are no bytes to order, so it doesn't have relevance! And yes, my solution is admittedly over-engineered, but it does have some benefits, which I mentioned in the last paragraph of my question.
– Graham
2 days ago

1

I could have also done bytes([i]) for the int to bytestring conversion, which might be the most succinct and best solution. I'm not arguing that what I've done with offset is good style (it's clearly bad style), but I'm not certain what you mean when say "You couldn't extract just the compare_escaped_within_range test into a different test file, for example, unless you brought the global offset along with it." If you extracted compare_escaped_within_range, it would keep the reference to the global offset within the original module, unless I misunderstand what you mean by "extract".
– Graham
2 days ago

+1 for bytes([i]). :)
– Quuxplusone
2 days ago

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f210765%2fregex-character-class-range-generator-with-certain-characters-excluded%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

offset = re_supp._offset_byte_char



def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'

def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    offset = re_supp._offset_byte_char

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'

But it would be even better not to depend on that detail at all! Either make _offset_byte_char a public member of the API (lose the underscore), or just reimplement it yourself where you need it.

offset = lambda x, y: chr(ord(x) + y).encode('latin_1')

In other words, it produces a regex which is the moral equivalent of (?=[b-e])(?![xs])., or more succinctly (?![xs])[b-e].

I guess you knew this, but it does feel like your 60-line thing is a bit overengineered. ;)

Again on the testing side: your compare_escaped_within_range seems overdetermined and overcomplicated. If I were writing tests for your API, I would write one like this:

def test_inputs(b, e, xs):

    needle = re_supp.byte_range(b, e, xs)

    for i in range(256):

        haystack = chr(i).encode('latin_1')

        expected = (ord(b) <= i <= ord(e)) and (haystack not in xs)

        actual = bool(re.match(needle, haystack))

        assert actual == expected

And then I'd make sure all of my tests at least verified that test_inputs(b, e, xs) passed for their inputs. Even if they went on to verify other properties.

def test_byte_regex_range_escaped_caret():

    """Test that '^' is escaped"""

    test_inputs(b'^', b'a', [b'a'])

    assert re_supp.byte_range(b'^', b'a', [b'a']) == b'[\^-`]')  # should we even verify this, or are we micromanaging?

I ran my test_inputs on some random data and discovered some bugs in your code. Most blatantly, you forgot that needs to be escaped. So test_inputs(b'\', b'A', ) fails.

Fix the trivial bug by adding two characters in escape_byte_in_character_class. Now add a regression test: compare_escaped_within_range(b'\'). The test still fails! Why? :)

And you don't handle the completely empty case: test_inputs(b'A', b'A', [b'A']) fails.

The moral of the story is that you should always write a test harness that permits fuzzing! It's never wasted effort. ...Well, it's never effort that fails to find bugs, anyway. :)

answered Jan 3 at 5:12

Quuxplusone

11.5k11959

Very useful answer all around! Regarding switching to chr().encode('latin1'), I feel like that's relying on an implementation detail of bytestrings that latin1 happens to map to bytestrings perfectly. I guess that's unlikely to change, but I still feel it makes the code less clear. I think Python just has a rough edge by always requiring byteorder in to_bytes. For 1 byte conversions, there are no bytes to order, so it doesn't have relevance! And yes, my solution is admittedly over-engineered, but it does have some benefits, which I mentioned in the last paragraph of my question.
– Graham
2 days ago

1

I could have also done bytes([i]) for the int to bytestring conversion, which might be the most succinct and best solution. I'm not arguing that what I've done with offset is good style (it's clearly bad style), but I'm not certain what you mean when say "You couldn't extract just the compare_escaped_within_range test into a different test file, for example, unless you brought the global offset along with it." If you extracted compare_escaped_within_range, it would keep the reference to the global offset within the original module, unless I misunderstand what you mean by "extract".
– Graham
2 days ago

+1 for bytes([i]). :)
– Quuxplusone
2 days ago

add a comment |

offset = re_supp._offset_byte_char



def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'

def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    offset = re_supp._offset_byte_char

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'

But it would be even better not to depend on that detail at all! Either make _offset_byte_char a public member of the API (lose the underscore), or just reimplement it yourself where you need it.

offset = lambda x, y: chr(ord(x) + y).encode('latin_1')

In other words, it produces a regex which is the moral equivalent of (?=[b-e])(?![xs])., or more succinctly (?![xs])[b-e].

I guess you knew this, but it does feel like your 60-line thing is a bit overengineered. ;)

Again on the testing side: your compare_escaped_within_range seems overdetermined and overcomplicated. If I were writing tests for your API, I would write one like this:

def test_inputs(b, e, xs):

    needle = re_supp.byte_range(b, e, xs)

    for i in range(256):

        haystack = chr(i).encode('latin_1')

        expected = (ord(b) <= i <= ord(e)) and (haystack not in xs)

        actual = bool(re.match(needle, haystack))

        assert actual == expected

And then I'd make sure all of my tests at least verified that test_inputs(b, e, xs) passed for their inputs. Even if they went on to verify other properties.

def test_byte_regex_range_escaped_caret():

    """Test that '^' is escaped"""

    test_inputs(b'^', b'a', [b'a'])

    assert re_supp.byte_range(b'^', b'a', [b'a']) == b'[\^-`]')  # should we even verify this, or are we micromanaging?

I ran my test_inputs on some random data and discovered some bugs in your code. Most blatantly, you forgot that needs to be escaped. So test_inputs(b'\', b'A', ) fails.

Fix the trivial bug by adding two characters in escape_byte_in_character_class. Now add a regression test: compare_escaped_within_range(b'\'). The test still fails! Why? :)

And you don't handle the completely empty case: test_inputs(b'A', b'A', [b'A']) fails.

The moral of the story is that you should always write a test harness that permits fuzzing! It's never wasted effort. ...Well, it's never effort that fails to find bugs, anyway. :)

answered Jan 3 at 5:12

Quuxplusone

11.5k11959

Very useful answer all around! Regarding switching to chr().encode('latin1'), I feel like that's relying on an implementation detail of bytestrings that latin1 happens to map to bytestrings perfectly. I guess that's unlikely to change, but I still feel it makes the code less clear. I think Python just has a rough edge by always requiring byteorder in to_bytes. For 1 byte conversions, there are no bytes to order, so it doesn't have relevance! And yes, my solution is admittedly over-engineered, but it does have some benefits, which I mentioned in the last paragraph of my question.
– Graham
2 days ago

1

I could have also done bytes([i]) for the int to bytestring conversion, which might be the most succinct and best solution. I'm not arguing that what I've done with offset is good style (it's clearly bad style), but I'm not certain what you mean when say "You couldn't extract just the compare_escaped_within_range test into a different test file, for example, unless you brought the global offset along with it." If you extracted compare_escaped_within_range, it would keep the reference to the global offset within the original module, unless I misunderstand what you mean by "extract".
– Graham
2 days ago

+1 for bytes([i]). :)
– Quuxplusone
2 days ago

add a comment |

offset = re_supp._offset_byte_char



def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'

def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    offset = re_supp._offset_byte_char

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'

But it would be even better not to depend on that detail at all! Either make _offset_byte_char a public member of the API (lose the underscore), or just reimplement it yourself where you need it.

offset = lambda x, y: chr(ord(x) + y).encode('latin_1')

In other words, it produces a regex which is the moral equivalent of (?=[b-e])(?![xs])., or more succinctly (?![xs])[b-e].

I guess you knew this, but it does feel like your 60-line thing is a bit overengineered. ;)

Again on the testing side: your compare_escaped_within_range seems overdetermined and overcomplicated. If I were writing tests for your API, I would write one like this:

def test_inputs(b, e, xs):

    needle = re_supp.byte_range(b, e, xs)

    for i in range(256):

        haystack = chr(i).encode('latin_1')

        expected = (ord(b) <= i <= ord(e)) and (haystack not in xs)

        actual = bool(re.match(needle, haystack))

        assert actual == expected

And then I'd make sure all of my tests at least verified that test_inputs(b, e, xs) passed for their inputs. Even if they went on to verify other properties.

def test_byte_regex_range_escaped_caret():

    """Test that '^' is escaped"""

    test_inputs(b'^', b'a', [b'a'])

    assert re_supp.byte_range(b'^', b'a', [b'a']) == b'[\^-`]')  # should we even verify this, or are we micromanaging?

I ran my test_inputs on some random data and discovered some bugs in your code. Most blatantly, you forgot that needs to be escaped. So test_inputs(b'\', b'A', ) fails.

Fix the trivial bug by adding two characters in escape_byte_in_character_class. Now add a regression test: compare_escaped_within_range(b'\'). The test still fails! Why? :)

And you don't handle the completely empty case: test_inputs(b'A', b'A', [b'A']) fails.

The moral of the story is that you should always write a test harness that permits fuzzing! It's never wasted effort. ...Well, it's never effort that fails to find bugs, anyway. :)

answered Jan 3 at 5:12

Quuxplusone

11.5k11959

offset = re_supp._offset_byte_char



def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'

def compare_escaped_within_range(char):

    """Test that the character is escaped at the beginning of a range."""

    offset = re_supp._offset_byte_char

    end = offset(char, 3)

    assert re_supp.byte_range(char, end, [end]) == b'[\' + char + b'-' + offset(end, -1) + b']'

But it would be even better not to depend on that detail at all! Either make _offset_byte_char a public member of the API (lose the underscore), or just reimplement it yourself where you need it.

offset = lambda x, y: chr(ord(x) + y).encode('latin_1')

In other words, it produces a regex which is the moral equivalent of (?=[b-e])(?![xs])., or more succinctly (?![xs])[b-e].

I guess you knew this, but it does feel like your 60-line thing is a bit overengineered. ;)

Again on the testing side: your compare_escaped_within_range seems overdetermined and overcomplicated. If I were writing tests for your API, I would write one like this:

def test_inputs(b, e, xs):

    needle = re_supp.byte_range(b, e, xs)

    for i in range(256):

        haystack = chr(i).encode('latin_1')

        expected = (ord(b) <= i <= ord(e)) and (haystack not in xs)

        actual = bool(re.match(needle, haystack))

        assert actual == expected

And then I'd make sure all of my tests at least verified that test_inputs(b, e, xs) passed for their inputs. Even if they went on to verify other properties.

def test_byte_regex_range_escaped_caret():

    """Test that '^' is escaped"""

    test_inputs(b'^', b'a', [b'a'])

    assert re_supp.byte_range(b'^', b'a', [b'a']) == b'[\^-`]')  # should we even verify this, or are we micromanaging?

I ran my test_inputs on some random data and discovered some bugs in your code. Most blatantly, you forgot that needs to be escaped. So test_inputs(b'\', b'A', ) fails.

Fix the trivial bug by adding two characters in escape_byte_in_character_class. Now add a regression test: compare_escaped_within_range(b'\'). The test still fails! Why? :)

And you don't handle the completely empty case: test_inputs(b'A', b'A', [b'A']) fails.

The moral of the story is that you should always write a test harness that permits fuzzing! It's never wasted effort. ...Well, it's never effort that fails to find bugs, anyway. :)

answered Jan 3 at 5:12

Quuxplusone

11.5k11959

answered Jan 3 at 5:12

Quuxplusone

11.5k11959

answered Jan 3 at 5:12

Quuxplusone

11.5k11959

answered Jan 3 at 5:12

Quuxplusone

11.5k11959

Very useful answer all around! Regarding switching to chr().encode('latin1'), I feel like that's relying on an implementation detail of bytestrings that latin1 happens to map to bytestrings perfectly. I guess that's unlikely to change, but I still feel it makes the code less clear. I think Python just has a rough edge by always requiring byteorder in to_bytes. For 1 byte conversions, there are no bytes to order, so it doesn't have relevance! And yes, my solution is admittedly over-engineered, but it does have some benefits, which I mentioned in the last paragraph of my question.
– Graham
2 days ago

1

I could have also done bytes([i]) for the int to bytestring conversion, which might be the most succinct and best solution. I'm not arguing that what I've done with offset is good style (it's clearly bad style), but I'm not certain what you mean when say "You couldn't extract just the compare_escaped_within_range test into a different test file, for example, unless you brought the global offset along with it." If you extracted compare_escaped_within_range, it would keep the reference to the global offset within the original module, unless I misunderstand what you mean by "extract".
– Graham
2 days ago

+1 for bytes([i]). :)
– Quuxplusone
2 days ago

add a comment |

Very useful answer all around! Regarding switching to chr().encode('latin1'), I feel like that's relying on an implementation detail of bytestrings that latin1 happens to map to bytestrings perfectly. I guess that's unlikely to change, but I still feel it makes the code less clear. I think Python just has a rough edge by always requiring byteorder in to_bytes. For 1 byte conversions, there are no bytes to order, so it doesn't have relevance! And yes, my solution is admittedly over-engineered, but it does have some benefits, which I mentioned in the last paragraph of my question.
– Graham
2 days ago

1

I could have also done bytes([i]) for the int to bytestring conversion, which might be the most succinct and best solution. I'm not arguing that what I've done with offset is good style (it's clearly bad style), but I'm not certain what you mean when say "You couldn't extract just the compare_escaped_within_range test into a different test file, for example, unless you brought the global offset along with it." If you extracted compare_escaped_within_range, it would keep the reference to the global offset within the original module, unless I misunderstand what you mean by "extract".
– Graham
2 days ago

+1 for bytes([i]). :)
– Quuxplusone
2 days ago

Very useful answer all around! Regarding switching to chr().encode('latin1'), I feel like that's relying on an implementation detail of bytestrings that latin1 happens to map to bytestrings perfectly. I guess that's unlikely to change, but I still feel it makes the code less clear. I think Python just has a rough edge by always requiring byteorder in to_bytes. For 1 byte conversions, there are no bytes to order, so it doesn't have relevance! And yes, my solution is admittedly over-engineered, but it does have some benefits, which I mentioned in the last paragraph of my question.
– Graham
2 days ago

I could have also done bytes([i]) for the int to bytestring conversion, which might be the most succinct and best solution. I'm not arguing that what I've done with offset is good style (it's clearly bad style), but I'm not certain what you mean when say "You couldn't extract just the compare_escaped_within_range test into a different test file, for example, unless you brought the global offset along with it." If you extracted compare_escaped_within_range, it would keep the reference to the global offset within the original module, unless I misunderstand what you mean by "extract".
– Graham
2 days ago

+1 for bytes([i]). :)
– Quuxplusone
2 days ago

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Gfrktyl