Convert Sql LIKE to Regex











up vote
2
down vote

favorite
1












I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like extension method. For some time I have been using this conversion.



This conversion replaces all "%" with ".*?" which works for contains patterns but is over matching for starts with or ends with patterns. So the conversion of %abc to .*?abc is capturing both "abcdef" and "123abcdef".



I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.



Here is my code:



internal static string ConvertLikeToRegex(string pattern)
{
// Turn "off" all regular expression related syntax in the pattern string.
StringBuilder builder = new StringBuilder(Regex.Escape(pattern));

// these are needed because the .*? replacement below at the begining or end of the string is not
// accounting for cases such as LIKE '%abc' or LIKE 'abc%'
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");

// this is a little tricky
// ends with in like is '%abc'
// in regex it's 'abc$'
// so need to tanspose
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}

// same but inverse here
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}

/* Replace the SQL LIKE wildcard metacharacters with the
* equivalent regular expression metacharacters. */
builder.Replace("%", ".*?").Replace("_", ".");

/* The previous call to Regex.Escape actually turned off
* too many metacharacters, i.e. those which are recognized by
* both the regular expression engine and the SQL LIKE
* statement ([...] and [^...]). Those metacharacters have
* to be manually unescaped here. */
builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");

return builder.ToString();
}


My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.










share|improve this question
























  • What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
    – Cruncher
    Dec 7 '13 at 16:16












  • LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
    – dkackman
    Dec 7 '13 at 16:26

















up vote
2
down vote

favorite
1












I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like extension method. For some time I have been using this conversion.



This conversion replaces all "%" with ".*?" which works for contains patterns but is over matching for starts with or ends with patterns. So the conversion of %abc to .*?abc is capturing both "abcdef" and "123abcdef".



I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.



Here is my code:



internal static string ConvertLikeToRegex(string pattern)
{
// Turn "off" all regular expression related syntax in the pattern string.
StringBuilder builder = new StringBuilder(Regex.Escape(pattern));

// these are needed because the .*? replacement below at the begining or end of the string is not
// accounting for cases such as LIKE '%abc' or LIKE 'abc%'
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");

// this is a little tricky
// ends with in like is '%abc'
// in regex it's 'abc$'
// so need to tanspose
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}

// same but inverse here
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}

/* Replace the SQL LIKE wildcard metacharacters with the
* equivalent regular expression metacharacters. */
builder.Replace("%", ".*?").Replace("_", ".");

/* The previous call to Regex.Escape actually turned off
* too many metacharacters, i.e. those which are recognized by
* both the regular expression engine and the SQL LIKE
* statement ([...] and [^...]). Those metacharacters have
* to be manually unescaped here. */
builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");

return builder.ToString();
}


My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.










share|improve this question
























  • What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
    – Cruncher
    Dec 7 '13 at 16:16












  • LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
    – dkackman
    Dec 7 '13 at 16:26















up vote
2
down vote

favorite
1









up vote
2
down vote

favorite
1






1





I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like extension method. For some time I have been using this conversion.



This conversion replaces all "%" with ".*?" which works for contains patterns but is over matching for starts with or ends with patterns. So the conversion of %abc to .*?abc is capturing both "abcdef" and "123abcdef".



I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.



Here is my code:



internal static string ConvertLikeToRegex(string pattern)
{
// Turn "off" all regular expression related syntax in the pattern string.
StringBuilder builder = new StringBuilder(Regex.Escape(pattern));

// these are needed because the .*? replacement below at the begining or end of the string is not
// accounting for cases such as LIKE '%abc' or LIKE 'abc%'
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");

// this is a little tricky
// ends with in like is '%abc'
// in regex it's 'abc$'
// so need to tanspose
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}

// same but inverse here
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}

/* Replace the SQL LIKE wildcard metacharacters with the
* equivalent regular expression metacharacters. */
builder.Replace("%", ".*?").Replace("_", ".");

/* The previous call to Regex.Escape actually turned off
* too many metacharacters, i.e. those which are recognized by
* both the regular expression engine and the SQL LIKE
* statement ([...] and [^...]). Those metacharacters have
* to be manually unescaped here. */
builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");

return builder.ToString();
}


My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.










share|improve this question















I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like extension method. For some time I have been using this conversion.



This conversion replaces all "%" with ".*?" which works for contains patterns but is over matching for starts with or ends with patterns. So the conversion of %abc to .*?abc is capturing both "abcdef" and "123abcdef".



I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.



Here is my code:



internal static string ConvertLikeToRegex(string pattern)
{
// Turn "off" all regular expression related syntax in the pattern string.
StringBuilder builder = new StringBuilder(Regex.Escape(pattern));

// these are needed because the .*? replacement below at the begining or end of the string is not
// accounting for cases such as LIKE '%abc' or LIKE 'abc%'
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");

// this is a little tricky
// ends with in like is '%abc'
// in regex it's 'abc$'
// so need to tanspose
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}

// same but inverse here
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}

/* Replace the SQL LIKE wildcard metacharacters with the
* equivalent regular expression metacharacters. */
builder.Replace("%", ".*?").Replace("_", ".");

/* The previous call to Regex.Escape actually turned off
* too many metacharacters, i.e. those which are recognized by
* both the regular expression engine and the SQL LIKE
* statement ([...] and [^...]). Those metacharacters have
* to be manually unescaped here. */
builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");

return builder.ToString();
}


My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.







c# sql linq regex converting






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 7 '13 at 16:11

























asked Dec 7 '13 at 15:56









dkackman

116127




116127












  • What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
    – Cruncher
    Dec 7 '13 at 16:16












  • LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
    – dkackman
    Dec 7 '13 at 16:26




















  • What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
    – Cruncher
    Dec 7 '13 at 16:16












  • LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
    – dkackman
    Dec 7 '13 at 16:26


















What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16






What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16














LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26






LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26












4 Answers
4






active

oldest

votes

















up vote
7
down vote



accepted










I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd shouldn't match abcde, but it does with your code.



What you should do is to always add ^ at the start and $ at the end.



This means the following conversions:





  • bcd^bcd$


  • %bcd^.*?bcd$


  • bcd%^bcd.*?$


  • %bcd%^.*?bcd.*?$


In the cases where the pattern starts with %, the ^ is not necessary (and similarly for $ and % at the end), but it also doesn't do any harm.






share|improve this answer





















  • Even simpler than my suggestion. Nice!
    – John Deters
    Dec 7 '13 at 17:12










  • Perfect. Works like a charm. I figured I was missing something much more straightforward.
    – dkackman
    Dec 7 '13 at 17:22










  • nb: "?" is not needed since LIKE will always match the entire string anyway.
    – Sylverdrag
    Dec 7 '13 at 19:54










  • @Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
    – svick
    Dec 7 '13 at 20:10




















up vote
4
down vote













Because the LIKE clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE clause. A sample grammar could look like:



expr := wild-card + expr
| wild-char + expr
| escape + expr
| string + expr
| ""

wild-card := %
wild-char := _
escape := [%|_]
string := [^%_]+ (One or > more characters that are not wild-card or wild-char)


NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE character (see T-SQL)



The steps to accomplish the LIKE syntax conversion are as follows:







  1. Define your Token classes:



    public abstract class Token {
    private final String value;

    public Token(String value) {
    this.value = value;
    }

    public abstract String convert();

    public String getValue() {
    return value;
    }
    }

    public class EscapeToken extends Token {
    public EscapeToken(String value) {
    super(value);
    }

    @Override
    public String convert() {
    return getValue();
    }
    }

    public class WildcardToken extends Token {
    public WildcardToken(String value) {
    super(value);
    }

    @Override
    public String convert() {
    return ".*";
    }
    }

    public class WildcharToken extends Token {
    public WildcharToken(String value) {
    super(value);
    }

    @Override
    public String convert() {
    return ".";
    }
    }

    public class StringToken extends Token {
    public StringToken(String value) {
    super(value);
    }

    @Override
    public String convert() {
    return Pattern.quote(getValue());
    }
    }



  2. Create a Lexer (or Tokenizer):



    public class Tokenizer {

    private Collection<Tuple> patterns = new LinkedList<>();

    public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {
    this.patterns.add(Tuple.of(Pattern.compile(regex), creator));
    return this;
    }

    public Collection<Token> tokenize(String clause) throws ParseException {
    Collection<Token> tokens = new ArrayList<>();
    String copy = String.copyValueOf(clause.toCharArray());

    int position = 0;
    while (!copy.equals("")) {
    boolean found = false;
    for (Tuple tuple : this.patterns) {
    Pattern pattern = tuple.get(0, Pattern.class);
    Matcher m = pattern.matcher(copy);
    if (m.find()) {
    found = true;
    String token = m.group(1);
    Function<String, Token> fn = (Function<String, Token>) tuple.get(1);
    tokens.add(fn.apply(token));
    copy = m.replaceFirst("");
    position += token.length();
    break;
    }
    }

    if (!found) {
    throw new ParseException("Unexpected sequence found in input string.", ++position);
    }
    }

    return tokens;

    }
    }



  3. Create SQL LIKE to RegEx Transpiler:



    public class SqlLikeTranspiler {
    private static Tokenizer TOKENIZER = new Tokenizer()
    .add("^(\[[^]]*])", ConstantToken::new)
    .add("^(%)", WildcardToken::new)
    .add("^(_)", WildcharToken::new)
    .add("^([^\[\]%_]+)", StringToken::new);

    public static String toRegEx(String pattern) throws ParseException {
    StringBuilder sb = new StringBuilder().append("^");
    for (Token token : TOKENIZER.tokenize(pattern)) {
    sb.append(token.convert());
    }

    return sb.append("$").toString();
    }
    }



NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^ and $ respectively).



By creating a lexer and converting using this methodology, we can prevent LIKE clauses like %abc[%]%, which should match any string with the sub-string abc% in it, from being converted to a regular expression like .*abc[.*].* which will match any string with either the sub-string abc. or abc*.



The provided code is Java.






share|improve this answer










New contributor




Damon Sutherland is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

























    up vote
    3
    down vote













    Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:



    bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
    bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");

    if (startsWith)
    {
    builder.Replace("%", "", 0, 1);
    builder.Append("$");
    }
    if (endsWith)
    {
    builder.Replace("%", "", pattern.Length - 1, 1);
    builder.Insert(0, "^");
    }


    to this:



    bool leadingLiteral = !pattern.StartsWith("%");
    if (leadingLiteral)
    {
    builder.Insert(0, "^");
    }

    bool trailingLiteral = !pattern.EndsWith("%");
    if (trailingLiteral)
    {
    builder.Append("$");
    }


    You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.






    share|improve this answer






























      up vote
      2
      down vote













      In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE (under Using Wildcard Characters as Literals). LIKE '5[%]' should be translated as regex '^5%$'.






      share|improve this answer





















      • I think a more natural (and equivalent) translation would be ^5[%]$.
        – svick
        Dec 7 '13 at 18:14











      Your Answer





      StackExchange.ifUsing("editor", function () {
      return StackExchange.using("mathjaxEditing", function () {
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
      });
      });
      }, "mathjax-editing");

      StackExchange.ifUsing("editor", function () {
      StackExchange.using("externalEditor", function () {
      StackExchange.using("snippets", function () {
      StackExchange.snippets.init();
      });
      });
      }, "code-snippets");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "196"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














       

      draft saved


      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f36861%2fconvert-sql-like-to-regex%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      4 Answers
      4






      active

      oldest

      votes








      4 Answers
      4






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes








      up vote
      7
      down vote



      accepted










      I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd shouldn't match abcde, but it does with your code.



      What you should do is to always add ^ at the start and $ at the end.



      This means the following conversions:





      • bcd^bcd$


      • %bcd^.*?bcd$


      • bcd%^bcd.*?$


      • %bcd%^.*?bcd.*?$


      In the cases where the pattern starts with %, the ^ is not necessary (and similarly for $ and % at the end), but it also doesn't do any harm.






      share|improve this answer





















      • Even simpler than my suggestion. Nice!
        – John Deters
        Dec 7 '13 at 17:12










      • Perfect. Works like a charm. I figured I was missing something much more straightforward.
        – dkackman
        Dec 7 '13 at 17:22










      • nb: "?" is not needed since LIKE will always match the entire string anyway.
        – Sylverdrag
        Dec 7 '13 at 19:54










      • @Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
        – svick
        Dec 7 '13 at 20:10

















      up vote
      7
      down vote



      accepted










      I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd shouldn't match abcde, but it does with your code.



      What you should do is to always add ^ at the start and $ at the end.



      This means the following conversions:





      • bcd^bcd$


      • %bcd^.*?bcd$


      • bcd%^bcd.*?$


      • %bcd%^.*?bcd.*?$


      In the cases where the pattern starts with %, the ^ is not necessary (and similarly for $ and % at the end), but it also doesn't do any harm.






      share|improve this answer





















      • Even simpler than my suggestion. Nice!
        – John Deters
        Dec 7 '13 at 17:12










      • Perfect. Works like a charm. I figured I was missing something much more straightforward.
        – dkackman
        Dec 7 '13 at 17:22










      • nb: "?" is not needed since LIKE will always match the entire string anyway.
        – Sylverdrag
        Dec 7 '13 at 19:54










      • @Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
        – svick
        Dec 7 '13 at 20:10















      up vote
      7
      down vote



      accepted







      up vote
      7
      down vote



      accepted






      I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd shouldn't match abcde, but it does with your code.



      What you should do is to always add ^ at the start and $ at the end.



      This means the following conversions:





      • bcd^bcd$


      • %bcd^.*?bcd$


      • bcd%^bcd.*?$


      • %bcd%^.*?bcd.*?$


      In the cases where the pattern starts with %, the ^ is not necessary (and similarly for $ and % at the end), but it also doesn't do any harm.






      share|improve this answer












      I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd shouldn't match abcde, but it does with your code.



      What you should do is to always add ^ at the start and $ at the end.



      This means the following conversions:





      • bcd^bcd$


      • %bcd^.*?bcd$


      • bcd%^bcd.*?$


      • %bcd%^.*?bcd.*?$


      In the cases where the pattern starts with %, the ^ is not necessary (and similarly for $ and % at the end), but it also doesn't do any harm.







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Dec 7 '13 at 17:10









      svick

      22.6k43778




      22.6k43778












      • Even simpler than my suggestion. Nice!
        – John Deters
        Dec 7 '13 at 17:12










      • Perfect. Works like a charm. I figured I was missing something much more straightforward.
        – dkackman
        Dec 7 '13 at 17:22










      • nb: "?" is not needed since LIKE will always match the entire string anyway.
        – Sylverdrag
        Dec 7 '13 at 19:54










      • @Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
        – svick
        Dec 7 '13 at 20:10




















      • Even simpler than my suggestion. Nice!
        – John Deters
        Dec 7 '13 at 17:12










      • Perfect. Works like a charm. I figured I was missing something much more straightforward.
        – dkackman
        Dec 7 '13 at 17:22










      • nb: "?" is not needed since LIKE will always match the entire string anyway.
        – Sylverdrag
        Dec 7 '13 at 19:54










      • @Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
        – svick
        Dec 7 '13 at 20:10


















      Even simpler than my suggestion. Nice!
      – John Deters
      Dec 7 '13 at 17:12




      Even simpler than my suggestion. Nice!
      – John Deters
      Dec 7 '13 at 17:12












      Perfect. Works like a charm. I figured I was missing something much more straightforward.
      – dkackman
      Dec 7 '13 at 17:22




      Perfect. Works like a charm. I figured I was missing something much more straightforward.
      – dkackman
      Dec 7 '13 at 17:22












      nb: "?" is not needed since LIKE will always match the entire string anyway.
      – Sylverdrag
      Dec 7 '13 at 19:54




      nb: "?" is not needed since LIKE will always match the entire string anyway.
      – Sylverdrag
      Dec 7 '13 at 19:54












      @Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
      – svick
      Dec 7 '13 at 20:10






      @Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
      – svick
      Dec 7 '13 at 20:10














      up vote
      4
      down vote













      Because the LIKE clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE clause. A sample grammar could look like:



      expr := wild-card + expr
      | wild-char + expr
      | escape + expr
      | string + expr
      | ""

      wild-card := %
      wild-char := _
      escape := [%|_]
      string := [^%_]+ (One or > more characters that are not wild-card or wild-char)


      NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE character (see T-SQL)



      The steps to accomplish the LIKE syntax conversion are as follows:







      1. Define your Token classes:



        public abstract class Token {
        private final String value;

        public Token(String value) {
        this.value = value;
        }

        public abstract String convert();

        public String getValue() {
        return value;
        }
        }

        public class EscapeToken extends Token {
        public EscapeToken(String value) {
        super(value);
        }

        @Override
        public String convert() {
        return getValue();
        }
        }

        public class WildcardToken extends Token {
        public WildcardToken(String value) {
        super(value);
        }

        @Override
        public String convert() {
        return ".*";
        }
        }

        public class WildcharToken extends Token {
        public WildcharToken(String value) {
        super(value);
        }

        @Override
        public String convert() {
        return ".";
        }
        }

        public class StringToken extends Token {
        public StringToken(String value) {
        super(value);
        }

        @Override
        public String convert() {
        return Pattern.quote(getValue());
        }
        }



      2. Create a Lexer (or Tokenizer):



        public class Tokenizer {

        private Collection<Tuple> patterns = new LinkedList<>();

        public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {
        this.patterns.add(Tuple.of(Pattern.compile(regex), creator));
        return this;
        }

        public Collection<Token> tokenize(String clause) throws ParseException {
        Collection<Token> tokens = new ArrayList<>();
        String copy = String.copyValueOf(clause.toCharArray());

        int position = 0;
        while (!copy.equals("")) {
        boolean found = false;
        for (Tuple tuple : this.patterns) {
        Pattern pattern = tuple.get(0, Pattern.class);
        Matcher m = pattern.matcher(copy);
        if (m.find()) {
        found = true;
        String token = m.group(1);
        Function<String, Token> fn = (Function<String, Token>) tuple.get(1);
        tokens.add(fn.apply(token));
        copy = m.replaceFirst("");
        position += token.length();
        break;
        }
        }

        if (!found) {
        throw new ParseException("Unexpected sequence found in input string.", ++position);
        }
        }

        return tokens;

        }
        }



      3. Create SQL LIKE to RegEx Transpiler:



        public class SqlLikeTranspiler {
        private static Tokenizer TOKENIZER = new Tokenizer()
        .add("^(\[[^]]*])", ConstantToken::new)
        .add("^(%)", WildcardToken::new)
        .add("^(_)", WildcharToken::new)
        .add("^([^\[\]%_]+)", StringToken::new);

        public static String toRegEx(String pattern) throws ParseException {
        StringBuilder sb = new StringBuilder().append("^");
        for (Token token : TOKENIZER.tokenize(pattern)) {
        sb.append(token.convert());
        }

        return sb.append("$").toString();
        }
        }



      NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^ and $ respectively).



      By creating a lexer and converting using this methodology, we can prevent LIKE clauses like %abc[%]%, which should match any string with the sub-string abc% in it, from being converted to a regular expression like .*abc[.*].* which will match any string with either the sub-string abc. or abc*.



      The provided code is Java.






      share|improve this answer










      New contributor




      Damon Sutherland is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






















        up vote
        4
        down vote













        Because the LIKE clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE clause. A sample grammar could look like:



        expr := wild-card + expr
        | wild-char + expr
        | escape + expr
        | string + expr
        | ""

        wild-card := %
        wild-char := _
        escape := [%|_]
        string := [^%_]+ (One or > more characters that are not wild-card or wild-char)


        NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE character (see T-SQL)



        The steps to accomplish the LIKE syntax conversion are as follows:







        1. Define your Token classes:



          public abstract class Token {
          private final String value;

          public Token(String value) {
          this.value = value;
          }

          public abstract String convert();

          public String getValue() {
          return value;
          }
          }

          public class EscapeToken extends Token {
          public EscapeToken(String value) {
          super(value);
          }

          @Override
          public String convert() {
          return getValue();
          }
          }

          public class WildcardToken extends Token {
          public WildcardToken(String value) {
          super(value);
          }

          @Override
          public String convert() {
          return ".*";
          }
          }

          public class WildcharToken extends Token {
          public WildcharToken(String value) {
          super(value);
          }

          @Override
          public String convert() {
          return ".";
          }
          }

          public class StringToken extends Token {
          public StringToken(String value) {
          super(value);
          }

          @Override
          public String convert() {
          return Pattern.quote(getValue());
          }
          }



        2. Create a Lexer (or Tokenizer):



          public class Tokenizer {

          private Collection<Tuple> patterns = new LinkedList<>();

          public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {
          this.patterns.add(Tuple.of(Pattern.compile(regex), creator));
          return this;
          }

          public Collection<Token> tokenize(String clause) throws ParseException {
          Collection<Token> tokens = new ArrayList<>();
          String copy = String.copyValueOf(clause.toCharArray());

          int position = 0;
          while (!copy.equals("")) {
          boolean found = false;
          for (Tuple tuple : this.patterns) {
          Pattern pattern = tuple.get(0, Pattern.class);
          Matcher m = pattern.matcher(copy);
          if (m.find()) {
          found = true;
          String token = m.group(1);
          Function<String, Token> fn = (Function<String, Token>) tuple.get(1);
          tokens.add(fn.apply(token));
          copy = m.replaceFirst("");
          position += token.length();
          break;
          }
          }

          if (!found) {
          throw new ParseException("Unexpected sequence found in input string.", ++position);
          }
          }

          return tokens;

          }
          }



        3. Create SQL LIKE to RegEx Transpiler:



          public class SqlLikeTranspiler {
          private static Tokenizer TOKENIZER = new Tokenizer()
          .add("^(\[[^]]*])", ConstantToken::new)
          .add("^(%)", WildcardToken::new)
          .add("^(_)", WildcharToken::new)
          .add("^([^\[\]%_]+)", StringToken::new);

          public static String toRegEx(String pattern) throws ParseException {
          StringBuilder sb = new StringBuilder().append("^");
          for (Token token : TOKENIZER.tokenize(pattern)) {
          sb.append(token.convert());
          }

          return sb.append("$").toString();
          }
          }



        NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^ and $ respectively).



        By creating a lexer and converting using this methodology, we can prevent LIKE clauses like %abc[%]%, which should match any string with the sub-string abc% in it, from being converted to a regular expression like .*abc[.*].* which will match any string with either the sub-string abc. or abc*.



        The provided code is Java.






        share|improve this answer










        New contributor




        Damon Sutherland is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
        Check out our Code of Conduct.




















          up vote
          4
          down vote










          up vote
          4
          down vote









          Because the LIKE clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE clause. A sample grammar could look like:



          expr := wild-card + expr
          | wild-char + expr
          | escape + expr
          | string + expr
          | ""

          wild-card := %
          wild-char := _
          escape := [%|_]
          string := [^%_]+ (One or > more characters that are not wild-card or wild-char)


          NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE character (see T-SQL)



          The steps to accomplish the LIKE syntax conversion are as follows:







          1. Define your Token classes:



            public abstract class Token {
            private final String value;

            public Token(String value) {
            this.value = value;
            }

            public abstract String convert();

            public String getValue() {
            return value;
            }
            }

            public class EscapeToken extends Token {
            public EscapeToken(String value) {
            super(value);
            }

            @Override
            public String convert() {
            return getValue();
            }
            }

            public class WildcardToken extends Token {
            public WildcardToken(String value) {
            super(value);
            }

            @Override
            public String convert() {
            return ".*";
            }
            }

            public class WildcharToken extends Token {
            public WildcharToken(String value) {
            super(value);
            }

            @Override
            public String convert() {
            return ".";
            }
            }

            public class StringToken extends Token {
            public StringToken(String value) {
            super(value);
            }

            @Override
            public String convert() {
            return Pattern.quote(getValue());
            }
            }



          2. Create a Lexer (or Tokenizer):



            public class Tokenizer {

            private Collection<Tuple> patterns = new LinkedList<>();

            public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {
            this.patterns.add(Tuple.of(Pattern.compile(regex), creator));
            return this;
            }

            public Collection<Token> tokenize(String clause) throws ParseException {
            Collection<Token> tokens = new ArrayList<>();
            String copy = String.copyValueOf(clause.toCharArray());

            int position = 0;
            while (!copy.equals("")) {
            boolean found = false;
            for (Tuple tuple : this.patterns) {
            Pattern pattern = tuple.get(0, Pattern.class);
            Matcher m = pattern.matcher(copy);
            if (m.find()) {
            found = true;
            String token = m.group(1);
            Function<String, Token> fn = (Function<String, Token>) tuple.get(1);
            tokens.add(fn.apply(token));
            copy = m.replaceFirst("");
            position += token.length();
            break;
            }
            }

            if (!found) {
            throw new ParseException("Unexpected sequence found in input string.", ++position);
            }
            }

            return tokens;

            }
            }



          3. Create SQL LIKE to RegEx Transpiler:



            public class SqlLikeTranspiler {
            private static Tokenizer TOKENIZER = new Tokenizer()
            .add("^(\[[^]]*])", ConstantToken::new)
            .add("^(%)", WildcardToken::new)
            .add("^(_)", WildcharToken::new)
            .add("^([^\[\]%_]+)", StringToken::new);

            public static String toRegEx(String pattern) throws ParseException {
            StringBuilder sb = new StringBuilder().append("^");
            for (Token token : TOKENIZER.tokenize(pattern)) {
            sb.append(token.convert());
            }

            return sb.append("$").toString();
            }
            }



          NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^ and $ respectively).



          By creating a lexer and converting using this methodology, we can prevent LIKE clauses like %abc[%]%, which should match any string with the sub-string abc% in it, from being converted to a regular expression like .*abc[.*].* which will match any string with either the sub-string abc. or abc*.



          The provided code is Java.






          share|improve this answer










          New contributor




          Damon Sutherland is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.









          Because the LIKE clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE clause. A sample grammar could look like:



          expr := wild-card + expr
          | wild-char + expr
          | escape + expr
          | string + expr
          | ""

          wild-card := %
          wild-char := _
          escape := [%|_]
          string := [^%_]+ (One or > more characters that are not wild-card or wild-char)


          NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE character (see T-SQL)



          The steps to accomplish the LIKE syntax conversion are as follows:







          1. Define your Token classes:



            public abstract class Token {
            private final String value;

            public Token(String value) {
            this.value = value;
            }

            public abstract String convert();

            public String getValue() {
            return value;
            }
            }

            public class EscapeToken extends Token {
            public EscapeToken(String value) {
            super(value);
            }

            @Override
            public String convert() {
            return getValue();
            }
            }

            public class WildcardToken extends Token {
            public WildcardToken(String value) {
            super(value);
            }

            @Override
            public String convert() {
            return ".*";
            }
            }

            public class WildcharToken extends Token {
            public WildcharToken(String value) {
            super(value);
            }

            @Override
            public String convert() {
            return ".";
            }
            }

            public class StringToken extends Token {
            public StringToken(String value) {
            super(value);
            }

            @Override
            public String convert() {
            return Pattern.quote(getValue());
            }
            }



          2. Create a Lexer (or Tokenizer):



            public class Tokenizer {

            private Collection<Tuple> patterns = new LinkedList<>();

            public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {
            this.patterns.add(Tuple.of(Pattern.compile(regex), creator));
            return this;
            }

            public Collection<Token> tokenize(String clause) throws ParseException {
            Collection<Token> tokens = new ArrayList<>();
            String copy = String.copyValueOf(clause.toCharArray());

            int position = 0;
            while (!copy.equals("")) {
            boolean found = false;
            for (Tuple tuple : this.patterns) {
            Pattern pattern = tuple.get(0, Pattern.class);
            Matcher m = pattern.matcher(copy);
            if (m.find()) {
            found = true;
            String token = m.group(1);
            Function<String, Token> fn = (Function<String, Token>) tuple.get(1);
            tokens.add(fn.apply(token));
            copy = m.replaceFirst("");
            position += token.length();
            break;
            }
            }

            if (!found) {
            throw new ParseException("Unexpected sequence found in input string.", ++position);
            }
            }

            return tokens;

            }
            }



          3. Create SQL LIKE to RegEx Transpiler:



            public class SqlLikeTranspiler {
            private static Tokenizer TOKENIZER = new Tokenizer()
            .add("^(\[[^]]*])", ConstantToken::new)
            .add("^(%)", WildcardToken::new)
            .add("^(_)", WildcharToken::new)
            .add("^([^\[\]%_]+)", StringToken::new);

            public static String toRegEx(String pattern) throws ParseException {
            StringBuilder sb = new StringBuilder().append("^");
            for (Token token : TOKENIZER.tokenize(pattern)) {
            sb.append(token.convert());
            }

            return sb.append("$").toString();
            }
            }



          NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^ and $ respectively).



          By creating a lexer and converting using this methodology, we can prevent LIKE clauses like %abc[%]%, which should match any string with the sub-string abc% in it, from being converted to a regular expression like .*abc[.*].* which will match any string with either the sub-string abc. or abc*.



          The provided code is Java.







          share|improve this answer










          New contributor




          Damon Sutherland is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.









          share|improve this answer



          share|improve this answer








          edited Nov 13 at 13:04









          Toby Speight

          21.9k536108




          21.9k536108






          New contributor




          Damon Sutherland is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.









          answered Nov 12 at 16:31









          Damon Sutherland

          411




          411




          New contributor




          Damon Sutherland is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.





          New contributor





          Damon Sutherland is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.






          Damon Sutherland is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.






















              up vote
              3
              down vote













              Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:



              bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
              bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");

              if (startsWith)
              {
              builder.Replace("%", "", 0, 1);
              builder.Append("$");
              }
              if (endsWith)
              {
              builder.Replace("%", "", pattern.Length - 1, 1);
              builder.Insert(0, "^");
              }


              to this:



              bool leadingLiteral = !pattern.StartsWith("%");
              if (leadingLiteral)
              {
              builder.Insert(0, "^");
              }

              bool trailingLiteral = !pattern.EndsWith("%");
              if (trailingLiteral)
              {
              builder.Append("$");
              }


              You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.






              share|improve this answer



























                up vote
                3
                down vote













                Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:



                bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
                bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");

                if (startsWith)
                {
                builder.Replace("%", "", 0, 1);
                builder.Append("$");
                }
                if (endsWith)
                {
                builder.Replace("%", "", pattern.Length - 1, 1);
                builder.Insert(0, "^");
                }


                to this:



                bool leadingLiteral = !pattern.StartsWith("%");
                if (leadingLiteral)
                {
                builder.Insert(0, "^");
                }

                bool trailingLiteral = !pattern.EndsWith("%");
                if (trailingLiteral)
                {
                builder.Append("$");
                }


                You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.






                share|improve this answer

























                  up vote
                  3
                  down vote










                  up vote
                  3
                  down vote









                  Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:



                  bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
                  bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");

                  if (startsWith)
                  {
                  builder.Replace("%", "", 0, 1);
                  builder.Append("$");
                  }
                  if (endsWith)
                  {
                  builder.Replace("%", "", pattern.Length - 1, 1);
                  builder.Insert(0, "^");
                  }


                  to this:



                  bool leadingLiteral = !pattern.StartsWith("%");
                  if (leadingLiteral)
                  {
                  builder.Insert(0, "^");
                  }

                  bool trailingLiteral = !pattern.EndsWith("%");
                  if (trailingLiteral)
                  {
                  builder.Append("$");
                  }


                  You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.






                  share|improve this answer














                  Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:



                  bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
                  bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");

                  if (startsWith)
                  {
                  builder.Replace("%", "", 0, 1);
                  builder.Append("$");
                  }
                  if (endsWith)
                  {
                  builder.Replace("%", "", pattern.Length - 1, 1);
                  builder.Insert(0, "^");
                  }


                  to this:



                  bool leadingLiteral = !pattern.StartsWith("%");
                  if (leadingLiteral)
                  {
                  builder.Insert(0, "^");
                  }

                  bool trailingLiteral = !pattern.EndsWith("%");
                  if (trailingLiteral)
                  {
                  builder.Append("$");
                  }


                  You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Dec 11 '13 at 21:15

























                  answered Dec 7 '13 at 17:10









                  John Deters

                  83659




                  83659






















                      up vote
                      2
                      down vote













                      In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE (under Using Wildcard Characters as Literals). LIKE '5[%]' should be translated as regex '^5%$'.






                      share|improve this answer





















                      • I think a more natural (and equivalent) translation would be ^5[%]$.
                        – svick
                        Dec 7 '13 at 18:14















                      up vote
                      2
                      down vote













                      In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE (under Using Wildcard Characters as Literals). LIKE '5[%]' should be translated as regex '^5%$'.






                      share|improve this answer





















                      • I think a more natural (and equivalent) translation would be ^5[%]$.
                        – svick
                        Dec 7 '13 at 18:14













                      up vote
                      2
                      down vote










                      up vote
                      2
                      down vote









                      In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE (under Using Wildcard Characters as Literals). LIKE '5[%]' should be translated as regex '^5%$'.






                      share|improve this answer












                      In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE (under Using Wildcard Characters as Literals). LIKE '5[%]' should be translated as regex '^5%$'.







                      share|improve this answer












                      share|improve this answer



                      share|improve this answer










                      answered Dec 7 '13 at 17:25









                      200_success

                      127k15148410




                      127k15148410












                      • I think a more natural (and equivalent) translation would be ^5[%]$.
                        – svick
                        Dec 7 '13 at 18:14


















                      • I think a more natural (and equivalent) translation would be ^5[%]$.
                        – svick
                        Dec 7 '13 at 18:14
















                      I think a more natural (and equivalent) translation would be ^5[%]$.
                      – svick
                      Dec 7 '13 at 18:14




                      I think a more natural (and equivalent) translation would be ^5[%]$.
                      – svick
                      Dec 7 '13 at 18:14


















                       

                      draft saved


                      draft discarded



















































                       


                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f36861%2fconvert-sql-like-to-regex%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Сан-Квентин

                      8-я гвардейская общевойсковая армия

                      Алькесар