Convert Sql LIKE to Regex

up vote
2
down vote

favorite

I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like extension method. For some time I have been using this conversion.

This conversion replaces all "%" with ".*?" which works for contains patterns but is over matching for starts with or ends with patterns. So the conversion of %abc to .*?abc is capturing both "abcdef" and "123abcdef".

I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.

Here is my code:

internal static string ConvertLikeToRegex(string pattern)

{

    // Turn "off" all regular expression related syntax in the pattern string. 

    StringBuilder builder = new StringBuilder(Regex.Escape(pattern));



    // these are needed because the .*? replacement below at the begining or end of the string is not

    // accounting for cases such as LIKE '%abc' or LIKE 'abc%'

    bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");

    bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");



    // this is a little tricky

    // ends with in like is '%abc'

    // in regex it's 'abc$'

    // so need to tanspose

    if (startsWith)

    {

        builder.Replace("%", "", 0, 1);

        builder.Append("$");

    }



    // same but inverse here

    if (endsWith)

    {

        builder.Replace("%", "", pattern.Length - 1, 1);

        builder.Insert(0, "^");

    }



    /* Replace the SQL LIKE wildcard metacharacters with the

    * equivalent regular expression metacharacters. */

    builder.Replace("%", ".*?").Replace("_", ".");



    /* The previous call to Regex.Escape actually turned off

    * too many metacharacters, i.e. those which are recognized by

    * both the regular expression engine and the SQL LIKE

    * statement ([...] and [^...]). Those metacharacters have

    * to be manually unescaped here. */

    builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");



    return builder.ToString();

}

My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.

edited Dec 7 '13 at 16:11

asked Dec 7 '13 at 15:56

dkackman

116127

What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16

LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26

add a comment |

up vote
2
down vote

favorite

I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like extension method. For some time I have been using this conversion.

I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.

Here is my code:

internal static string ConvertLikeToRegex(string pattern)

{

    // Turn "off" all regular expression related syntax in the pattern string. 

    StringBuilder builder = new StringBuilder(Regex.Escape(pattern));



    // these are needed because the .*? replacement below at the begining or end of the string is not

    // accounting for cases such as LIKE '%abc' or LIKE 'abc%'

    bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");

    bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");



    // this is a little tricky

    // ends with in like is '%abc'

    // in regex it's 'abc$'

    // so need to tanspose

    if (startsWith)

    {

        builder.Replace("%", "", 0, 1);

        builder.Append("$");

    }



    // same but inverse here

    if (endsWith)

    {

        builder.Replace("%", "", pattern.Length - 1, 1);

        builder.Insert(0, "^");

    }



    /* Replace the SQL LIKE wildcard metacharacters with the

    * equivalent regular expression metacharacters. */

    builder.Replace("%", ".*?").Replace("_", ".");



    /* The previous call to Regex.Escape actually turned off

    * too many metacharacters, i.e. those which are recognized by

    * both the regular expression engine and the SQL LIKE

    * statement ([...] and [^...]). Those metacharacters have

    * to be manually unescaped here. */

    builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");



    return builder.ToString();

}

My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.

edited Dec 7 '13 at 16:11

asked Dec 7 '13 at 15:56

dkackman

116127

What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16

LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26

add a comment |

up vote
2
down vote

favorite

I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like extension method. For some time I have been using this conversion.

I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.

Here is my code:

internal static string ConvertLikeToRegex(string pattern)

{

    // Turn "off" all regular expression related syntax in the pattern string. 

    StringBuilder builder = new StringBuilder(Regex.Escape(pattern));



    // these are needed because the .*? replacement below at the begining or end of the string is not

    // accounting for cases such as LIKE '%abc' or LIKE 'abc%'

    bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");

    bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");



    // this is a little tricky

    // ends with in like is '%abc'

    // in regex it's 'abc$'

    // so need to tanspose

    if (startsWith)

    {

        builder.Replace("%", "", 0, 1);

        builder.Append("$");

    }



    // same but inverse here

    if (endsWith)

    {

        builder.Replace("%", "", pattern.Length - 1, 1);

        builder.Insert(0, "^");

    }



    /* Replace the SQL LIKE wildcard metacharacters with the

    * equivalent regular expression metacharacters. */

    builder.Replace("%", ".*?").Replace("_", ".");



    /* The previous call to Regex.Escape actually turned off

    * too many metacharacters, i.e. those which are recognized by

    * both the regular expression engine and the SQL LIKE

    * statement ([...] and [^...]). Those metacharacters have

    * to be manually unescaped here. */

    builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");



    return builder.ToString();

}

My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.

edited Dec 7 '13 at 16:11

asked Dec 7 '13 at 15:56

dkackman

116127

I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like extension method. For some time I have been using this conversion.

I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.

Here is my code:

internal static string ConvertLikeToRegex(string pattern)

{

    // Turn "off" all regular expression related syntax in the pattern string. 

    StringBuilder builder = new StringBuilder(Regex.Escape(pattern));



    // these are needed because the .*? replacement below at the begining or end of the string is not

    // accounting for cases such as LIKE '%abc' or LIKE 'abc%'

    bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");

    bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");



    // this is a little tricky

    // ends with in like is '%abc'

    // in regex it's 'abc$'

    // so need to tanspose

    if (startsWith)

    {

        builder.Replace("%", "", 0, 1);

        builder.Append("$");

    }



    // same but inverse here

    if (endsWith)

    {

        builder.Replace("%", "", pattern.Length - 1, 1);

        builder.Insert(0, "^");

    }



    /* Replace the SQL LIKE wildcard metacharacters with the

    * equivalent regular expression metacharacters. */

    builder.Replace("%", ".*?").Replace("_", ".");



    /* The previous call to Regex.Escape actually turned off

    * too many metacharacters, i.e. those which are recognized by

    * both the regular expression engine and the SQL LIKE

    * statement ([...] and [^...]). Those metacharacters have

    * to be manually unescaped here. */

    builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");



    return builder.ToString();

}

My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.

c# sql linq regex converting

edited Dec 7 '13 at 16:11

asked Dec 7 '13 at 15:56

dkackman

116127

edited Dec 7 '13 at 16:11

asked Dec 7 '13 at 15:56

dkackman

116127

edited Dec 7 '13 at 16:11

asked Dec 7 '13 at 15:56

dkackman

116127

asked Dec 7 '13 at 15:56

dkackman

116127

asked Dec 7 '13 at 15:56

dkackman

116127

What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16

LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26

add a comment |

What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16

LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26

What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16

LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26

add a comment |

4 Answers
4

active

oldest

votes

up vote
7
down vote

accepted

I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd shouldn't match abcde, but it does with your code.

What you should do is to always add ^ at the start and $ at the end.

This means the following conversions:

bcd → ^bcd$

%bcd → ^.*?bcd$

bcd% → ^bcd.*?$

%bcd% → ^.*?bcd.*?$

In the cases where the pattern starts with %, the ^ is not necessary (and similarly for $ and % at the end), but it also doesn't do any harm.

answered Dec 7 '13 at 17:10

svick

22.6k43778

Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12

Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22

nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54

@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10

add a comment |

up vote
4
down vote

Because the LIKE clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE clause. A sample grammar could look like:

expr := wild-card + expr

      | wild-char + expr

      | escape + expr

      | string + expr

      | ""



wild-card := %  

wild-char := _  

escape := [%|_]  

string := [^%_]+ (One or > more characters that are not wild-card or wild-char)

NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE character (see T-SQL)

The steps to accomplish the LIKE syntax conversion are as follows:

Define your Token classes:

public abstract class Token {

    private final String value;



    public Token(String value) {

        this.value = value;

    }



    public abstract String convert();



    public String getValue() {

        return value;

    }

}



public class EscapeToken extends Token {

    public EscapeToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return getValue();

    }

}



public class WildcardToken extends Token {

    public WildcardToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return ".*";

    }

}



public class WildcharToken extends Token {

    public WildcharToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return ".";

    }

}



public class StringToken extends Token {

    public StringToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return Pattern.quote(getValue());

    }

}

Create a Lexer (or Tokenizer):

public class Tokenizer {



   private Collection<Tuple> patterns = new LinkedList<>();



   public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {

       this.patterns.add(Tuple.of(Pattern.compile(regex), creator));

       return this;

    }



    public Collection<Token> tokenize(String clause) throws ParseException {

        Collection<Token> tokens = new ArrayList<>();

        String copy = String.copyValueOf(clause.toCharArray());



        int position = 0;

        while (!copy.equals("")) {

            boolean found = false;

            for (Tuple tuple : this.patterns) {

                Pattern pattern = tuple.get(0, Pattern.class);

                Matcher m = pattern.matcher(copy);

                if (m.find()) {

                    found = true;

                    String token = m.group(1);

                    Function<String, Token> fn = (Function<String, Token>) tuple.get(1);

                    tokens.add(fn.apply(token));

                    copy = m.replaceFirst("");

                    position += token.length();

                    break;

                }

            }



            if (!found) {

                throw new ParseException("Unexpected sequence found in input string.", ++position);

            }

        }



        return tokens;



    }

}

Create SQL LIKE to RegEx Transpiler:

public class SqlLikeTranspiler {

    private static Tokenizer TOKENIZER = new Tokenizer()

            .add("^(\[[^]]*])", ConstantToken::new)

            .add("^(%)", WildcardToken::new)

            .add("^(_)", WildcharToken::new)

            .add("^([^\[\]%_]+)", StringToken::new);



    public static String toRegEx(String pattern) throws ParseException {

        StringBuilder sb = new StringBuilder().append("^");

        for (Token token : TOKENIZER.tokenize(pattern)) {

            sb.append(token.convert());

        }



        return sb.append("$").toString();

    }

}

NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^ and $ respectively).

By creating a lexer and converting using this methodology, we can prevent LIKE clauses like %abc[%]%, which should match any string with the sub-string abc% in it, from being converted to a regular expression like .*abc[.*].* which will match any string with either the sub-string abc. or abc*.

The provided code is Java.

edited Nov 13 at 13:04

Toby Speight

21.9k536108

answered Nov 12 at 16:31

Damon Sutherland

411

New contributor

add a comment |

up vote
3
down vote

Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:

bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");

bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");



if (startsWith)

{

    builder.Replace("%", "", 0, 1);

    builder.Append("$");

}

if (endsWith)

{

    builder.Replace("%", "", pattern.Length - 1, 1);

    builder.Insert(0, "^");

}

to this:

bool leadingLiteral = !pattern.StartsWith("%");

if (leadingLiteral)

{

    builder.Insert(0, "^");

}



bool trailingLiteral = !pattern.EndsWith("%");

if (trailingLiteral)

{

    builder.Append("$");

}

You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.

edited Dec 11 '13 at 21:15

answered Dec 7 '13 at 17:10

John Deters

83659

add a comment |

up vote
2
down vote

In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE (under Using Wildcard Characters as Literals). LIKE '5[%]' should be translated as regex '^5%$'.

answered Dec 7 '13 at 17:25

200_success

127k15148410

I think a more natural (and equivalent) translation would be ^5[%]$.
– svick
Dec 7 '13 at 18:14

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f36861%2fconvert-sql-like-to-regex%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

up vote
7
down vote

accepted

I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd shouldn't match abcde, but it does with your code.

What you should do is to always add ^ at the start and $ at the end.

This means the following conversions:

bcd → ^bcd$

%bcd → ^.*?bcd$

bcd% → ^bcd.*?$

%bcd% → ^.*?bcd.*?$

In the cases where the pattern starts with %, the ^ is not necessary (and similarly for $ and % at the end), but it also doesn't do any harm.

answered Dec 7 '13 at 17:10

svick

22.6k43778

Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12

Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22

nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54

@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10

add a comment |

up vote
7
down vote

accepted

I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd shouldn't match abcde, but it does with your code.

What you should do is to always add ^ at the start and $ at the end.

This means the following conversions:

bcd → ^bcd$

%bcd → ^.*?bcd$

bcd% → ^bcd.*?$

%bcd% → ^.*?bcd.*?$

In the cases where the pattern starts with %, the ^ is not necessary (and similarly for $ and % at the end), but it also doesn't do any harm.

answered Dec 7 '13 at 17:10

svick

22.6k43778

Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12

Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22

nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54

@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10

add a comment |

up vote
7
down vote

accepted

I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd shouldn't match abcde, but it does with your code.

What you should do is to always add ^ at the start and $ at the end.

This means the following conversions:

bcd → ^bcd$

%bcd → ^.*?bcd$

bcd% → ^bcd.*?$

%bcd% → ^.*?bcd.*?$

In the cases where the pattern starts with %, the ^ is not necessary (and similarly for $ and % at the end), but it also doesn't do any harm.

answered Dec 7 '13 at 17:10

svick

22.6k43778

I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd shouldn't match abcde, but it does with your code.

What you should do is to always add ^ at the start and $ at the end.

This means the following conversions:

bcd → ^bcd$

%bcd → ^.*?bcd$

bcd% → ^bcd.*?$

%bcd% → ^.*?bcd.*?$

In the cases where the pattern starts with %, the ^ is not necessary (and similarly for $ and % at the end), but it also doesn't do any harm.

answered Dec 7 '13 at 17:10

svick

22.6k43778

answered Dec 7 '13 at 17:10

svick

22.6k43778

answered Dec 7 '13 at 17:10

svick

22.6k43778

answered Dec 7 '13 at 17:10

svick

22.6k43778

Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12

Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22

nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54

@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10

add a comment |

Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12

Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22

nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54

@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10

Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12

Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22

nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54

@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10

add a comment |

up vote
4
down vote

expr := wild-card + expr

      | wild-char + expr

      | escape + expr

      | string + expr

      | ""



wild-card := %  

wild-char := _  

escape := [%|_]  

string := [^%_]+ (One or > more characters that are not wild-card or wild-char)

NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE character (see T-SQL)

The steps to accomplish the LIKE syntax conversion are as follows:

Define your Token classes:

public abstract class Token {

    private final String value;



    public Token(String value) {

        this.value = value;

    }



    public abstract String convert();



    public String getValue() {

        return value;

    }

}



public class EscapeToken extends Token {

    public EscapeToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return getValue();

    }

}



public class WildcardToken extends Token {

    public WildcardToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return ".*";

    }

}



public class WildcharToken extends Token {

    public WildcharToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return ".";

    }

}



public class StringToken extends Token {

    public StringToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return Pattern.quote(getValue());

    }

}

Create a Lexer (or Tokenizer):

public class Tokenizer {



   private Collection<Tuple> patterns = new LinkedList<>();



   public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {

       this.patterns.add(Tuple.of(Pattern.compile(regex), creator));

       return this;

    }



    public Collection<Token> tokenize(String clause) throws ParseException {

        Collection<Token> tokens = new ArrayList<>();

        String copy = String.copyValueOf(clause.toCharArray());



        int position = 0;

        while (!copy.equals("")) {

            boolean found = false;

            for (Tuple tuple : this.patterns) {

                Pattern pattern = tuple.get(0, Pattern.class);

                Matcher m = pattern.matcher(copy);

                if (m.find()) {

                    found = true;

                    String token = m.group(1);

                    Function<String, Token> fn = (Function<String, Token>) tuple.get(1);

                    tokens.add(fn.apply(token));

                    copy = m.replaceFirst("");

                    position += token.length();

                    break;

                }

            }



            if (!found) {

                throw new ParseException("Unexpected sequence found in input string.", ++position);

            }

        }



        return tokens;



    }

}

Create SQL LIKE to RegEx Transpiler:

public class SqlLikeTranspiler {

    private static Tokenizer TOKENIZER = new Tokenizer()

            .add("^(\[[^]]*])", ConstantToken::new)

            .add("^(%)", WildcardToken::new)

            .add("^(_)", WildcharToken::new)

            .add("^([^\[\]%_]+)", StringToken::new);



    public static String toRegEx(String pattern) throws ParseException {

        StringBuilder sb = new StringBuilder().append("^");

        for (Token token : TOKENIZER.tokenize(pattern)) {

            sb.append(token.convert());

        }



        return sb.append("$").toString();

    }

}

NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^ and $ respectively).

The provided code is Java.

edited Nov 13 at 13:04

Toby Speight

21.9k536108

answered Nov 12 at 16:31

Damon Sutherland

411

New contributor

add a comment |

up vote
4
down vote

expr := wild-card + expr

      | wild-char + expr

      | escape + expr

      | string + expr

      | ""



wild-card := %  

wild-char := _  

escape := [%|_]  

string := [^%_]+ (One or > more characters that are not wild-card or wild-char)

NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE character (see T-SQL)

The steps to accomplish the LIKE syntax conversion are as follows:

Define your Token classes:

public abstract class Token {

    private final String value;



    public Token(String value) {

        this.value = value;

    }



    public abstract String convert();



    public String getValue() {

        return value;

    }

}



public class EscapeToken extends Token {

    public EscapeToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return getValue();

    }

}



public class WildcardToken extends Token {

    public WildcardToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return ".*";

    }

}



public class WildcharToken extends Token {

    public WildcharToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return ".";

    }

}



public class StringToken extends Token {

    public StringToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return Pattern.quote(getValue());

    }

}

Create a Lexer (or Tokenizer):

public class Tokenizer {



   private Collection<Tuple> patterns = new LinkedList<>();



   public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {

       this.patterns.add(Tuple.of(Pattern.compile(regex), creator));

       return this;

    }



    public Collection<Token> tokenize(String clause) throws ParseException {

        Collection<Token> tokens = new ArrayList<>();

        String copy = String.copyValueOf(clause.toCharArray());



        int position = 0;

        while (!copy.equals("")) {

            boolean found = false;

            for (Tuple tuple : this.patterns) {

                Pattern pattern = tuple.get(0, Pattern.class);

                Matcher m = pattern.matcher(copy);

                if (m.find()) {

                    found = true;

                    String token = m.group(1);

                    Function<String, Token> fn = (Function<String, Token>) tuple.get(1);

                    tokens.add(fn.apply(token));

                    copy = m.replaceFirst("");

                    position += token.length();

                    break;

                }

            }



            if (!found) {

                throw new ParseException("Unexpected sequence found in input string.", ++position);

            }

        }



        return tokens;



    }

}

Create SQL LIKE to RegEx Transpiler:

public class SqlLikeTranspiler {

    private static Tokenizer TOKENIZER = new Tokenizer()

            .add("^(\[[^]]*])", ConstantToken::new)

            .add("^(%)", WildcardToken::new)

            .add("^(_)", WildcharToken::new)

            .add("^([^\[\]%_]+)", StringToken::new);



    public static String toRegEx(String pattern) throws ParseException {

        StringBuilder sb = new StringBuilder().append("^");

        for (Token token : TOKENIZER.tokenize(pattern)) {

            sb.append(token.convert());

        }



        return sb.append("$").toString();

    }

}

NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^ and $ respectively).

The provided code is Java.

edited Nov 13 at 13:04

Toby Speight

21.9k536108

answered Nov 12 at 16:31

Damon Sutherland

411

New contributor

add a comment |

up vote
4
down vote

expr := wild-card + expr

      | wild-char + expr

      | escape + expr

      | string + expr

      | ""



wild-card := %  

wild-char := _  

escape := [%|_]  

string := [^%_]+ (One or > more characters that are not wild-card or wild-char)

NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE character (see T-SQL)

The steps to accomplish the LIKE syntax conversion are as follows:

Define your Token classes:

public abstract class Token {

    private final String value;



    public Token(String value) {

        this.value = value;

    }



    public abstract String convert();



    public String getValue() {

        return value;

    }

}



public class EscapeToken extends Token {

    public EscapeToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return getValue();

    }

}



public class WildcardToken extends Token {

    public WildcardToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return ".*";

    }

}



public class WildcharToken extends Token {

    public WildcharToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return ".";

    }

}



public class StringToken extends Token {

    public StringToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return Pattern.quote(getValue());

    }

}

Create a Lexer (or Tokenizer):

public class Tokenizer {



   private Collection<Tuple> patterns = new LinkedList<>();



   public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {

       this.patterns.add(Tuple.of(Pattern.compile(regex), creator));

       return this;

    }



    public Collection<Token> tokenize(String clause) throws ParseException {

        Collection<Token> tokens = new ArrayList<>();

        String copy = String.copyValueOf(clause.toCharArray());



        int position = 0;

        while (!copy.equals("")) {

            boolean found = false;

            for (Tuple tuple : this.patterns) {

                Pattern pattern = tuple.get(0, Pattern.class);

                Matcher m = pattern.matcher(copy);

                if (m.find()) {

                    found = true;

                    String token = m.group(1);

                    Function<String, Token> fn = (Function<String, Token>) tuple.get(1);

                    tokens.add(fn.apply(token));

                    copy = m.replaceFirst("");

                    position += token.length();

                    break;

                }

            }



            if (!found) {

                throw new ParseException("Unexpected sequence found in input string.", ++position);

            }

        }



        return tokens;



    }

}

Create SQL LIKE to RegEx Transpiler:

public class SqlLikeTranspiler {

    private static Tokenizer TOKENIZER = new Tokenizer()

            .add("^(\[[^]]*])", ConstantToken::new)

            .add("^(%)", WildcardToken::new)

            .add("^(_)", WildcharToken::new)

            .add("^([^\[\]%_]+)", StringToken::new);



    public static String toRegEx(String pattern) throws ParseException {

        StringBuilder sb = new StringBuilder().append("^");

        for (Token token : TOKENIZER.tokenize(pattern)) {

            sb.append(token.convert());

        }



        return sb.append("$").toString();

    }

}

NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^ and $ respectively).

The provided code is Java.

edited Nov 13 at 13:04

Toby Speight

21.9k536108

answered Nov 12 at 16:31

Damon Sutherland

411

New contributor

expr := wild-card + expr

      | wild-char + expr

      | escape + expr

      | string + expr

      | ""



wild-card := %  

wild-char := _  

escape := [%|_]  

string := [^%_]+ (One or > more characters that are not wild-card or wild-char)

NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE character (see T-SQL)

The steps to accomplish the LIKE syntax conversion are as follows:

Define your Token classes:

public abstract class Token {

    private final String value;



    public Token(String value) {

        this.value = value;

    }



    public abstract String convert();



    public String getValue() {

        return value;

    }

}



public class EscapeToken extends Token {

    public EscapeToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return getValue();

    }

}



public class WildcardToken extends Token {

    public WildcardToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return ".*";

    }

}



public class WildcharToken extends Token {

    public WildcharToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return ".";

    }

}



public class StringToken extends Token {

    public StringToken(String value) {

        super(value);

    }



    @Override

    public String convert() {

        return Pattern.quote(getValue());

    }

}

Create a Lexer (or Tokenizer):

public class Tokenizer {



   private Collection<Tuple> patterns = new LinkedList<>();



   public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {

       this.patterns.add(Tuple.of(Pattern.compile(regex), creator));

       return this;

    }



    public Collection<Token> tokenize(String clause) throws ParseException {

        Collection<Token> tokens = new ArrayList<>();

        String copy = String.copyValueOf(clause.toCharArray());



        int position = 0;

        while (!copy.equals("")) {

            boolean found = false;

            for (Tuple tuple : this.patterns) {

                Pattern pattern = tuple.get(0, Pattern.class);

                Matcher m = pattern.matcher(copy);

                if (m.find()) {

                    found = true;

                    String token = m.group(1);

                    Function<String, Token> fn = (Function<String, Token>) tuple.get(1);

                    tokens.add(fn.apply(token));

                    copy = m.replaceFirst("");

                    position += token.length();

                    break;

                }

            }



            if (!found) {

                throw new ParseException("Unexpected sequence found in input string.", ++position);

            }

        }



        return tokens;



    }

}

Create SQL LIKE to RegEx Transpiler:

public class SqlLikeTranspiler {

    private static Tokenizer TOKENIZER = new Tokenizer()

            .add("^(\[[^]]*])", ConstantToken::new)

            .add("^(%)", WildcardToken::new)

            .add("^(_)", WildcharToken::new)

            .add("^([^\[\]%_]+)", StringToken::new);



    public static String toRegEx(String pattern) throws ParseException {

        StringBuilder sb = new StringBuilder().append("^");

        for (Token token : TOKENIZER.tokenize(pattern)) {

            sb.append(token.convert());

        }



        return sb.append("$").toString();

    }

}

NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^ and $ respectively).

The provided code is Java.

edited Nov 13 at 13:04

Toby Speight

21.9k536108

answered Nov 12 at 16:31

Damon Sutherland

411

New contributor

edited Nov 13 at 13:04

Toby Speight

21.9k536108

edited Nov 13 at 13:04

Toby Speight

21.9k536108

edited Nov 13 at 13:04

Toby Speight

21.9k536108

answered Nov 12 at 16:31

Damon Sutherland

411

New contributor

answered Nov 12 at 16:31

Damon Sutherland

411

answered Nov 12 at 16:31

Damon Sutherland

411

New contributor

Damon Sutherland is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

add a comment |

up vote
3
down vote

Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:

bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");

bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");



if (startsWith)

{

    builder.Replace("%", "", 0, 1);

    builder.Append("$");

}

if (endsWith)

{

    builder.Replace("%", "", pattern.Length - 1, 1);

    builder.Insert(0, "^");

}

to this:

bool leadingLiteral = !pattern.StartsWith("%");

if (leadingLiteral)

{

    builder.Insert(0, "^");

}



bool trailingLiteral = !pattern.EndsWith("%");

if (trailingLiteral)

{

    builder.Append("$");

}

edited Dec 11 '13 at 21:15

answered Dec 7 '13 at 17:10

John Deters

83659

add a comment |

up vote
3
down vote

Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:

bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");

bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");



if (startsWith)

{

    builder.Replace("%", "", 0, 1);

    builder.Append("$");

}

if (endsWith)

{

    builder.Replace("%", "", pattern.Length - 1, 1);

    builder.Insert(0, "^");

}

to this:

bool leadingLiteral = !pattern.StartsWith("%");

if (leadingLiteral)

{

    builder.Insert(0, "^");

}



bool trailingLiteral = !pattern.EndsWith("%");

if (trailingLiteral)

{

    builder.Append("$");

}

edited Dec 11 '13 at 21:15

answered Dec 7 '13 at 17:10

John Deters

83659

add a comment |

up vote
3
down vote

Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:

bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");

bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");



if (startsWith)

{

    builder.Replace("%", "", 0, 1);

    builder.Append("$");

}

if (endsWith)

{

    builder.Replace("%", "", pattern.Length - 1, 1);

    builder.Insert(0, "^");

}

to this:

bool leadingLiteral = !pattern.StartsWith("%");

if (leadingLiteral)

{

    builder.Insert(0, "^");

}



bool trailingLiteral = !pattern.EndsWith("%");

if (trailingLiteral)

{

    builder.Append("$");

}

edited Dec 11 '13 at 21:15

answered Dec 7 '13 at 17:10

John Deters

83659

Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:

bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");

bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");



if (startsWith)

{

    builder.Replace("%", "", 0, 1);

    builder.Append("$");

}

if (endsWith)

{

    builder.Replace("%", "", pattern.Length - 1, 1);

    builder.Insert(0, "^");

}

to this:

bool leadingLiteral = !pattern.StartsWith("%");

if (leadingLiteral)

{

    builder.Insert(0, "^");

}



bool trailingLiteral = !pattern.EndsWith("%");

if (trailingLiteral)

{

    builder.Append("$");

}

edited Dec 11 '13 at 21:15

answered Dec 7 '13 at 17:10

John Deters

83659

edited Dec 11 '13 at 21:15

answered Dec 7 '13 at 17:10

John Deters

83659

answered Dec 7 '13 at 17:10

John Deters

83659

answered Dec 7 '13 at 17:10

John Deters

83659

add a comment |

up vote
2
down vote

answered Dec 7 '13 at 17:25

200_success

127k15148410

I think a more natural (and equivalent) translation would be ^5[%]$.
– svick
Dec 7 '13 at 18:14

add a comment |

up vote
2
down vote

answered Dec 7 '13 at 17:25

200_success

127k15148410

I think a more natural (and equivalent) translation would be ^5[%]$.
– svick
Dec 7 '13 at 18:14

add a comment |

up vote
2
down vote

answered Dec 7 '13 at 17:25

200_success

127k15148410

answered Dec 7 '13 at 17:25

200_success

127k15148410

answered Dec 7 '13 at 17:25

200_success

127k15148410

answered Dec 7 '13 at 17:25

200_success

127k15148410

answered Dec 7 '13 at 17:25

200_success

127k15148410

I think a more natural (and equivalent) translation would be ^5[%]$.
– svick
Dec 7 '13 at 18:14

add a comment |

I think a more natural (and equivalent) translation would be ^5[%]$.
– svick
Dec 7 '13 at 18:14

I think a more natural (and equivalent) translation would be ^5[%]$.
– svick
Dec 7 '13 at 18:14

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Gfrktyl