Convert Sql LIKE to Regex
up vote
2
down vote
favorite
I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like
extension method. For some time I have been using this conversion.
This conversion replaces all "%" with ".*?" which works for contains patterns but is over matching for starts with or ends with patterns. So the conversion of %abc
to .*?abc
is capturing both "abcdef" and "123abcdef".
I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.
Here is my code:
internal static string ConvertLikeToRegex(string pattern)
{
// Turn "off" all regular expression related syntax in the pattern string.
StringBuilder builder = new StringBuilder(Regex.Escape(pattern));
// these are needed because the .*? replacement below at the begining or end of the string is not
// accounting for cases such as LIKE '%abc' or LIKE 'abc%'
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");
// this is a little tricky
// ends with in like is '%abc'
// in regex it's 'abc$'
// so need to tanspose
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}
// same but inverse here
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}
/* Replace the SQL LIKE wildcard metacharacters with the
* equivalent regular expression metacharacters. */
builder.Replace("%", ".*?").Replace("_", ".");
/* The previous call to Regex.Escape actually turned off
* too many metacharacters, i.e. those which are recognized by
* both the regular expression engine and the SQL LIKE
* statement ([...] and [^...]). Those metacharacters have
* to be manually unescaped here. */
builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");
return builder.ToString();
}
My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.
c# sql linq regex converting
add a comment |
up vote
2
down vote
favorite
I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like
extension method. For some time I have been using this conversion.
This conversion replaces all "%" with ".*?" which works for contains patterns but is over matching for starts with or ends with patterns. So the conversion of %abc
to .*?abc
is capturing both "abcdef" and "123abcdef".
I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.
Here is my code:
internal static string ConvertLikeToRegex(string pattern)
{
// Turn "off" all regular expression related syntax in the pattern string.
StringBuilder builder = new StringBuilder(Regex.Escape(pattern));
// these are needed because the .*? replacement below at the begining or end of the string is not
// accounting for cases such as LIKE '%abc' or LIKE 'abc%'
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");
// this is a little tricky
// ends with in like is '%abc'
// in regex it's 'abc$'
// so need to tanspose
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}
// same but inverse here
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}
/* Replace the SQL LIKE wildcard metacharacters with the
* equivalent regular expression metacharacters. */
builder.Replace("%", ".*?").Replace("_", ".");
/* The previous call to Regex.Escape actually turned off
* too many metacharacters, i.e. those which are recognized by
* both the regular expression engine and the SQL LIKE
* statement ([...] and [^...]). Those metacharacters have
* to be manually unescaped here. */
builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");
return builder.ToString();
}
My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.
c# sql linq regex converting
What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16
LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26
add a comment |
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like
extension method. For some time I have been using this conversion.
This conversion replaces all "%" with ".*?" which works for contains patterns but is over matching for starts with or ends with patterns. So the conversion of %abc
to .*?abc
is capturing both "abcdef" and "123abcdef".
I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.
Here is my code:
internal static string ConvertLikeToRegex(string pattern)
{
// Turn "off" all regular expression related syntax in the pattern string.
StringBuilder builder = new StringBuilder(Regex.Escape(pattern));
// these are needed because the .*? replacement below at the begining or end of the string is not
// accounting for cases such as LIKE '%abc' or LIKE 'abc%'
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");
// this is a little tricky
// ends with in like is '%abc'
// in regex it's 'abc$'
// so need to tanspose
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}
// same but inverse here
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}
/* Replace the SQL LIKE wildcard metacharacters with the
* equivalent regular expression metacharacters. */
builder.Replace("%", ".*?").Replace("_", ".");
/* The previous call to Regex.Escape actually turned off
* too many metacharacters, i.e. those which are recognized by
* both the regular expression engine and the SQL LIKE
* statement ([...] and [^...]). Those metacharacters have
* to be manually unescaped here. */
builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");
return builder.ToString();
}
My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.
c# sql linq regex converting
I have a bit of code that converts a Sql Like expression to a regex expression for the purposes of a Linq to objects Like
extension method. For some time I have been using this conversion.
This conversion replaces all "%" with ".*?" which works for contains patterns but is over matching for starts with or ends with patterns. So the conversion of %abc
to .*?abc
is capturing both "abcdef" and "123abcdef".
I've amended the conversion algorithm to account for starts with and ends with LIKE expressions.
Here is my code:
internal static string ConvertLikeToRegex(string pattern)
{
// Turn "off" all regular expression related syntax in the pattern string.
StringBuilder builder = new StringBuilder(Regex.Escape(pattern));
// these are needed because the .*? replacement below at the begining or end of the string is not
// accounting for cases such as LIKE '%abc' or LIKE 'abc%'
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");
// this is a little tricky
// ends with in like is '%abc'
// in regex it's 'abc$'
// so need to tanspose
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}
// same but inverse here
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}
/* Replace the SQL LIKE wildcard metacharacters with the
* equivalent regular expression metacharacters. */
builder.Replace("%", ".*?").Replace("_", ".");
/* The previous call to Regex.Escape actually turned off
* too many metacharacters, i.e. those which are recognized by
* both the regular expression engine and the SQL LIKE
* statement ([...] and [^...]). Those metacharacters have
* to be manually unescaped here. */
builder.Replace(@"[", "[").Replace(@"]", "]").Replace(@"^", "^");
return builder.ToString();
}
My initial units test are passing but not being overly conversant in regex syntax I am wondering if this is the best approach or if there are gaps in the conversion I am not seeing.
c# sql linq regex converting
c# sql linq regex converting
edited Dec 7 '13 at 16:11
asked Dec 7 '13 at 15:56
dkackman
116127
116127
What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16
LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26
add a comment |
What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16
LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26
What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16
What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16
LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26
LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26
add a comment |
4 Answers
4
active
oldest
votes
up vote
7
down vote
accepted
I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd
shouldn't match abcde
, but it does with your code.
What you should do is to always add ^
at the start and $
at the end.
This means the following conversions:
bcd
→^bcd$
%bcd
→^.*?bcd$
bcd%
→^bcd.*?$
%bcd%
→^.*?bcd.*?$
In the cases where the pattern starts with %
, the ^
is not necessary (and similarly for $
and %
at the end), but it also doesn't do any harm.
Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12
Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22
nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54
@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10
add a comment |
up vote
4
down vote
Because the LIKE
clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE
clause. A sample grammar could look like:
expr := wild-card + expr
| wild-char + expr
| escape + expr
| string + expr
| ""
wild-card := %
wild-char := _
escape := [%|_]
string := [^%_]+ (One or > more characters that are not wild-card or wild-char)
NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE
character (see T-SQL)
The steps to accomplish the LIKE
syntax conversion are as follows:
Define your Token classes:
public abstract class Token {
private final String value;
public Token(String value) {
this.value = value;
}
public abstract String convert();
public String getValue() {
return value;
}
}
public class EscapeToken extends Token {
public EscapeToken(String value) {
super(value);
}
@Override
public String convert() {
return getValue();
}
}
public class WildcardToken extends Token {
public WildcardToken(String value) {
super(value);
}
@Override
public String convert() {
return ".*";
}
}
public class WildcharToken extends Token {
public WildcharToken(String value) {
super(value);
}
@Override
public String convert() {
return ".";
}
}
public class StringToken extends Token {
public StringToken(String value) {
super(value);
}
@Override
public String convert() {
return Pattern.quote(getValue());
}
}
Create a Lexer (or Tokenizer):
public class Tokenizer {
private Collection<Tuple> patterns = new LinkedList<>();
public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {
this.patterns.add(Tuple.of(Pattern.compile(regex), creator));
return this;
}
public Collection<Token> tokenize(String clause) throws ParseException {
Collection<Token> tokens = new ArrayList<>();
String copy = String.copyValueOf(clause.toCharArray());
int position = 0;
while (!copy.equals("")) {
boolean found = false;
for (Tuple tuple : this.patterns) {
Pattern pattern = tuple.get(0, Pattern.class);
Matcher m = pattern.matcher(copy);
if (m.find()) {
found = true;
String token = m.group(1);
Function<String, Token> fn = (Function<String, Token>) tuple.get(1);
tokens.add(fn.apply(token));
copy = m.replaceFirst("");
position += token.length();
break;
}
}
if (!found) {
throw new ParseException("Unexpected sequence found in input string.", ++position);
}
}
return tokens;
}
}
Create SQL
LIKE
to RegEx Transpiler:
public class SqlLikeTranspiler {
private static Tokenizer TOKENIZER = new Tokenizer()
.add("^(\[[^]]*])", ConstantToken::new)
.add("^(%)", WildcardToken::new)
.add("^(_)", WildcharToken::new)
.add("^([^\[\]%_]+)", StringToken::new);
public static String toRegEx(String pattern) throws ParseException {
StringBuilder sb = new StringBuilder().append("^");
for (Token token : TOKENIZER.tokenize(pattern)) {
sb.append(token.convert());
}
return sb.append("$").toString();
}
}
NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^
and $
respectively).
By creating a lexer and converting using this methodology, we can prevent LIKE
clauses like %abc[%]%
, which should match any string with the sub-string abc%
in it, from being converted to a regular expression like .*abc[.*].*
which will match any string with either the sub-string abc.
or abc*
.
The provided code is Java.
New contributor
add a comment |
up vote
3
down vote
Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}
to this:
bool leadingLiteral = !pattern.StartsWith("%");
if (leadingLiteral)
{
builder.Insert(0, "^");
}
bool trailingLiteral = !pattern.EndsWith("%");
if (trailingLiteral)
{
builder.Append("$");
}
You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.
add a comment |
up vote
2
down vote
In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE
(under Using Wildcard Characters as Literals). LIKE '5[%]'
should be translated as regex '^5%$'
.
I think a more natural (and equivalent) translation would be^5[%]$
.
– svick
Dec 7 '13 at 18:14
add a comment |
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
7
down vote
accepted
I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd
shouldn't match abcde
, but it does with your code.
What you should do is to always add ^
at the start and $
at the end.
This means the following conversions:
bcd
→^bcd$
%bcd
→^.*?bcd$
bcd%
→^bcd.*?$
%bcd%
→^.*?bcd.*?$
In the cases where the pattern starts with %
, the ^
is not necessary (and similarly for $
and %
at the end), but it also doesn't do any harm.
Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12
Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22
nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54
@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10
add a comment |
up vote
7
down vote
accepted
I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd
shouldn't match abcde
, but it does with your code.
What you should do is to always add ^
at the start and $
at the end.
This means the following conversions:
bcd
→^bcd$
%bcd
→^.*?bcd$
bcd%
→^bcd.*?$
%bcd%
→^.*?bcd.*?$
In the cases where the pattern starts with %
, the ^
is not necessary (and similarly for $
and %
at the end), but it also doesn't do any harm.
Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12
Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22
nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54
@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10
add a comment |
up vote
7
down vote
accepted
up vote
7
down vote
accepted
I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd
shouldn't match abcde
, but it does with your code.
What you should do is to always add ^
at the start and $
at the end.
This means the following conversions:
bcd
→^bcd$
%bcd
→^.*?bcd$
bcd%
→^bcd.*?$
%bcd%
→^.*?bcd.*?$
In the cases where the pattern starts with %
, the ^
is not necessary (and similarly for $
and %
at the end), but it also doesn't do any harm.
I think you're overcomplicating this and your code still doesn't work correctly. The LIKE pattern bcd
shouldn't match abcde
, but it does with your code.
What you should do is to always add ^
at the start and $
at the end.
This means the following conversions:
bcd
→^bcd$
%bcd
→^.*?bcd$
bcd%
→^bcd.*?$
%bcd%
→^.*?bcd.*?$
In the cases where the pattern starts with %
, the ^
is not necessary (and similarly for $
and %
at the end), but it also doesn't do any harm.
answered Dec 7 '13 at 17:10
svick
22.6k43778
22.6k43778
Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12
Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22
nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54
@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10
add a comment |
Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12
Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22
nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54
@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10
Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12
Even simpler than my suggestion. Nice!
– John Deters
Dec 7 '13 at 17:12
Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22
Perfect. Works like a charm. I figured I was missing something much more straightforward.
– dkackman
Dec 7 '13 at 17:22
nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54
nb: "?" is not needed since LIKE will always match the entire string anyway.
– Sylverdrag
Dec 7 '13 at 19:54
@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10
@Sylverdrag Yeah, I think you're right, I just copied that from the code in the question.
– svick
Dec 7 '13 at 20:10
add a comment |
up vote
4
down vote
Because the LIKE
clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE
clause. A sample grammar could look like:
expr := wild-card + expr
| wild-char + expr
| escape + expr
| string + expr
| ""
wild-card := %
wild-char := _
escape := [%|_]
string := [^%_]+ (One or > more characters that are not wild-card or wild-char)
NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE
character (see T-SQL)
The steps to accomplish the LIKE
syntax conversion are as follows:
Define your Token classes:
public abstract class Token {
private final String value;
public Token(String value) {
this.value = value;
}
public abstract String convert();
public String getValue() {
return value;
}
}
public class EscapeToken extends Token {
public EscapeToken(String value) {
super(value);
}
@Override
public String convert() {
return getValue();
}
}
public class WildcardToken extends Token {
public WildcardToken(String value) {
super(value);
}
@Override
public String convert() {
return ".*";
}
}
public class WildcharToken extends Token {
public WildcharToken(String value) {
super(value);
}
@Override
public String convert() {
return ".";
}
}
public class StringToken extends Token {
public StringToken(String value) {
super(value);
}
@Override
public String convert() {
return Pattern.quote(getValue());
}
}
Create a Lexer (or Tokenizer):
public class Tokenizer {
private Collection<Tuple> patterns = new LinkedList<>();
public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {
this.patterns.add(Tuple.of(Pattern.compile(regex), creator));
return this;
}
public Collection<Token> tokenize(String clause) throws ParseException {
Collection<Token> tokens = new ArrayList<>();
String copy = String.copyValueOf(clause.toCharArray());
int position = 0;
while (!copy.equals("")) {
boolean found = false;
for (Tuple tuple : this.patterns) {
Pattern pattern = tuple.get(0, Pattern.class);
Matcher m = pattern.matcher(copy);
if (m.find()) {
found = true;
String token = m.group(1);
Function<String, Token> fn = (Function<String, Token>) tuple.get(1);
tokens.add(fn.apply(token));
copy = m.replaceFirst("");
position += token.length();
break;
}
}
if (!found) {
throw new ParseException("Unexpected sequence found in input string.", ++position);
}
}
return tokens;
}
}
Create SQL
LIKE
to RegEx Transpiler:
public class SqlLikeTranspiler {
private static Tokenizer TOKENIZER = new Tokenizer()
.add("^(\[[^]]*])", ConstantToken::new)
.add("^(%)", WildcardToken::new)
.add("^(_)", WildcharToken::new)
.add("^([^\[\]%_]+)", StringToken::new);
public static String toRegEx(String pattern) throws ParseException {
StringBuilder sb = new StringBuilder().append("^");
for (Token token : TOKENIZER.tokenize(pattern)) {
sb.append(token.convert());
}
return sb.append("$").toString();
}
}
NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^
and $
respectively).
By creating a lexer and converting using this methodology, we can prevent LIKE
clauses like %abc[%]%
, which should match any string with the sub-string abc%
in it, from being converted to a regular expression like .*abc[.*].*
which will match any string with either the sub-string abc.
or abc*
.
The provided code is Java.
New contributor
add a comment |
up vote
4
down vote
Because the LIKE
clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE
clause. A sample grammar could look like:
expr := wild-card + expr
| wild-char + expr
| escape + expr
| string + expr
| ""
wild-card := %
wild-char := _
escape := [%|_]
string := [^%_]+ (One or > more characters that are not wild-card or wild-char)
NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE
character (see T-SQL)
The steps to accomplish the LIKE
syntax conversion are as follows:
Define your Token classes:
public abstract class Token {
private final String value;
public Token(String value) {
this.value = value;
}
public abstract String convert();
public String getValue() {
return value;
}
}
public class EscapeToken extends Token {
public EscapeToken(String value) {
super(value);
}
@Override
public String convert() {
return getValue();
}
}
public class WildcardToken extends Token {
public WildcardToken(String value) {
super(value);
}
@Override
public String convert() {
return ".*";
}
}
public class WildcharToken extends Token {
public WildcharToken(String value) {
super(value);
}
@Override
public String convert() {
return ".";
}
}
public class StringToken extends Token {
public StringToken(String value) {
super(value);
}
@Override
public String convert() {
return Pattern.quote(getValue());
}
}
Create a Lexer (or Tokenizer):
public class Tokenizer {
private Collection<Tuple> patterns = new LinkedList<>();
public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {
this.patterns.add(Tuple.of(Pattern.compile(regex), creator));
return this;
}
public Collection<Token> tokenize(String clause) throws ParseException {
Collection<Token> tokens = new ArrayList<>();
String copy = String.copyValueOf(clause.toCharArray());
int position = 0;
while (!copy.equals("")) {
boolean found = false;
for (Tuple tuple : this.patterns) {
Pattern pattern = tuple.get(0, Pattern.class);
Matcher m = pattern.matcher(copy);
if (m.find()) {
found = true;
String token = m.group(1);
Function<String, Token> fn = (Function<String, Token>) tuple.get(1);
tokens.add(fn.apply(token));
copy = m.replaceFirst("");
position += token.length();
break;
}
}
if (!found) {
throw new ParseException("Unexpected sequence found in input string.", ++position);
}
}
return tokens;
}
}
Create SQL
LIKE
to RegEx Transpiler:
public class SqlLikeTranspiler {
private static Tokenizer TOKENIZER = new Tokenizer()
.add("^(\[[^]]*])", ConstantToken::new)
.add("^(%)", WildcardToken::new)
.add("^(_)", WildcharToken::new)
.add("^([^\[\]%_]+)", StringToken::new);
public static String toRegEx(String pattern) throws ParseException {
StringBuilder sb = new StringBuilder().append("^");
for (Token token : TOKENIZER.tokenize(pattern)) {
sb.append(token.convert());
}
return sb.append("$").toString();
}
}
NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^
and $
respectively).
By creating a lexer and converting using this methodology, we can prevent LIKE
clauses like %abc[%]%
, which should match any string with the sub-string abc%
in it, from being converted to a regular expression like .*abc[.*].*
which will match any string with either the sub-string abc.
or abc*
.
The provided code is Java.
New contributor
add a comment |
up vote
4
down vote
up vote
4
down vote
Because the LIKE
clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE
clause. A sample grammar could look like:
expr := wild-card + expr
| wild-char + expr
| escape + expr
| string + expr
| ""
wild-card := %
wild-char := _
escape := [%|_]
string := [^%_]+ (One or > more characters that are not wild-card or wild-char)
NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE
character (see T-SQL)
The steps to accomplish the LIKE
syntax conversion are as follows:
Define your Token classes:
public abstract class Token {
private final String value;
public Token(String value) {
this.value = value;
}
public abstract String convert();
public String getValue() {
return value;
}
}
public class EscapeToken extends Token {
public EscapeToken(String value) {
super(value);
}
@Override
public String convert() {
return getValue();
}
}
public class WildcardToken extends Token {
public WildcardToken(String value) {
super(value);
}
@Override
public String convert() {
return ".*";
}
}
public class WildcharToken extends Token {
public WildcharToken(String value) {
super(value);
}
@Override
public String convert() {
return ".";
}
}
public class StringToken extends Token {
public StringToken(String value) {
super(value);
}
@Override
public String convert() {
return Pattern.quote(getValue());
}
}
Create a Lexer (or Tokenizer):
public class Tokenizer {
private Collection<Tuple> patterns = new LinkedList<>();
public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {
this.patterns.add(Tuple.of(Pattern.compile(regex), creator));
return this;
}
public Collection<Token> tokenize(String clause) throws ParseException {
Collection<Token> tokens = new ArrayList<>();
String copy = String.copyValueOf(clause.toCharArray());
int position = 0;
while (!copy.equals("")) {
boolean found = false;
for (Tuple tuple : this.patterns) {
Pattern pattern = tuple.get(0, Pattern.class);
Matcher m = pattern.matcher(copy);
if (m.find()) {
found = true;
String token = m.group(1);
Function<String, Token> fn = (Function<String, Token>) tuple.get(1);
tokens.add(fn.apply(token));
copy = m.replaceFirst("");
position += token.length();
break;
}
}
if (!found) {
throw new ParseException("Unexpected sequence found in input string.", ++position);
}
}
return tokens;
}
}
Create SQL
LIKE
to RegEx Transpiler:
public class SqlLikeTranspiler {
private static Tokenizer TOKENIZER = new Tokenizer()
.add("^(\[[^]]*])", ConstantToken::new)
.add("^(%)", WildcardToken::new)
.add("^(_)", WildcharToken::new)
.add("^([^\[\]%_]+)", StringToken::new);
public static String toRegEx(String pattern) throws ParseException {
StringBuilder sb = new StringBuilder().append("^");
for (Token token : TOKENIZER.tokenize(pattern)) {
sb.append(token.convert());
}
return sb.append("$").toString();
}
}
NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^
and $
respectively).
By creating a lexer and converting using this methodology, we can prevent LIKE
clauses like %abc[%]%
, which should match any string with the sub-string abc%
in it, from being converted to a regular expression like .*abc[.*].*
which will match any string with either the sub-string abc.
or abc*
.
The provided code is Java.
New contributor
Because the LIKE
clause has a defined syntax, to do this right (meaning no clause will be incorrectly converted), you will need to use (or create) a simple lexer to parse the LIKE
clause. A sample grammar could look like:
expr := wild-card + expr
| wild-char + expr
| escape + expr
| string + expr
| ""
wild-card := %
wild-char := _
escape := [%|_]
string := [^%_]+ (One or > more characters that are not wild-card or wild-char)
NOTE: Although the above grammar will work by default, note that SQL allows the user to specify a user-defined ESCAPE
character (see T-SQL)
The steps to accomplish the LIKE
syntax conversion are as follows:
Define your Token classes:
public abstract class Token {
private final String value;
public Token(String value) {
this.value = value;
}
public abstract String convert();
public String getValue() {
return value;
}
}
public class EscapeToken extends Token {
public EscapeToken(String value) {
super(value);
}
@Override
public String convert() {
return getValue();
}
}
public class WildcardToken extends Token {
public WildcardToken(String value) {
super(value);
}
@Override
public String convert() {
return ".*";
}
}
public class WildcharToken extends Token {
public WildcharToken(String value) {
super(value);
}
@Override
public String convert() {
return ".";
}
}
public class StringToken extends Token {
public StringToken(String value) {
super(value);
}
@Override
public String convert() {
return Pattern.quote(getValue());
}
}
Create a Lexer (or Tokenizer):
public class Tokenizer {
private Collection<Tuple> patterns = new LinkedList<>();
public <T extends Token> Tokenizer add(String regex, Function<String, Token> creator) {
this.patterns.add(Tuple.of(Pattern.compile(regex), creator));
return this;
}
public Collection<Token> tokenize(String clause) throws ParseException {
Collection<Token> tokens = new ArrayList<>();
String copy = String.copyValueOf(clause.toCharArray());
int position = 0;
while (!copy.equals("")) {
boolean found = false;
for (Tuple tuple : this.patterns) {
Pattern pattern = tuple.get(0, Pattern.class);
Matcher m = pattern.matcher(copy);
if (m.find()) {
found = true;
String token = m.group(1);
Function<String, Token> fn = (Function<String, Token>) tuple.get(1);
tokens.add(fn.apply(token));
copy = m.replaceFirst("");
position += token.length();
break;
}
}
if (!found) {
throw new ParseException("Unexpected sequence found in input string.", ++position);
}
}
return tokens;
}
}
Create SQL
LIKE
to RegEx Transpiler:
public class SqlLikeTranspiler {
private static Tokenizer TOKENIZER = new Tokenizer()
.add("^(\[[^]]*])", ConstantToken::new)
.add("^(%)", WildcardToken::new)
.add("^(_)", WildcharToken::new)
.add("^([^\[\]%_]+)", StringToken::new);
public static String toRegEx(String pattern) throws ParseException {
StringBuilder sb = new StringBuilder().append("^");
for (Token token : TOKENIZER.tokenize(pattern)) {
sb.append(token.convert());
}
return sb.append("$").toString();
}
}
NOTE: We ensure the match is not too generous by indicating the resulting regular expression has start and end tags (^
and $
respectively).
By creating a lexer and converting using this methodology, we can prevent LIKE
clauses like %abc[%]%
, which should match any string with the sub-string abc%
in it, from being converted to a regular expression like .*abc[.*].*
which will match any string with either the sub-string abc.
or abc*
.
The provided code is Java.
New contributor
edited Nov 13 at 13:04
Toby Speight
21.9k536108
21.9k536108
New contributor
answered Nov 12 at 16:31
Damon Sutherland
411
411
New contributor
New contributor
add a comment |
add a comment |
up vote
3
down vote
Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}
to this:
bool leadingLiteral = !pattern.StartsWith("%");
if (leadingLiteral)
{
builder.Insert(0, "^");
}
bool trailingLiteral = !pattern.EndsWith("%");
if (trailingLiteral)
{
builder.Append("$");
}
You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.
add a comment |
up vote
3
down vote
Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}
to this:
bool leadingLiteral = !pattern.StartsWith("%");
if (leadingLiteral)
{
builder.Insert(0, "^");
}
bool trailingLiteral = !pattern.EndsWith("%");
if (trailingLiteral)
{
builder.Append("$");
}
You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.
add a comment |
up vote
3
down vote
up vote
3
down vote
Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}
to this:
bool leadingLiteral = !pattern.StartsWith("%");
if (leadingLiteral)
{
builder.Insert(0, "^");
}
bool trailingLiteral = !pattern.EndsWith("%");
if (trailingLiteral)
{
builder.Append("$");
}
You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.
Your name "startsWith" is confusing because it looks at both ends, and controls the placement of the trailing $. I'd simplify this:
bool startsWith = pattern.StartsWith("%") && !pattern.EndsWith("%");
bool endsWith = !pattern.StartsWith("%") && pattern.EndsWith("%");
if (startsWith)
{
builder.Replace("%", "", 0, 1);
builder.Append("$");
}
if (endsWith)
{
builder.Replace("%", "", pattern.Length - 1, 1);
builder.Insert(0, "^");
}
to this:
bool leadingLiteral = !pattern.StartsWith("%");
if (leadingLiteral)
{
builder.Insert(0, "^");
}
bool trailingLiteral = !pattern.EndsWith("%");
if (trailingLiteral)
{
builder.Append("$");
}
You may also note that I left the named Boolean variables in the code. I like using "explanatory variables" instead of comments. The optimizer will get rid of them in a production build, so they cost nothing to the runtime. But I think they make the code more readable and therefore more maintainable. They encourage you, the developer, to think about what you're doing. And they encourage you to think of an appropriate name. If you find it hard to name a thing, that may be a sign that it's unclear, or is doing too much.
edited Dec 11 '13 at 21:15
answered Dec 7 '13 at 17:10
John Deters
83659
83659
add a comment |
add a comment |
up vote
2
down vote
In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE
(under Using Wildcard Characters as Literals). LIKE '5[%]'
should be translated as regex '^5%$'
.
I think a more natural (and equivalent) translation would be^5[%]$
.
– svick
Dec 7 '13 at 18:14
add a comment |
up vote
2
down vote
In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE
(under Using Wildcard Characters as Literals). LIKE '5[%]'
should be translated as regex '^5%$'
.
I think a more natural (and equivalent) translation would be^5[%]$
.
– svick
Dec 7 '13 at 18:14
add a comment |
up vote
2
down vote
up vote
2
down vote
In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE
(under Using Wildcard Characters as Literals). LIKE '5[%]'
should be translated as regex '^5%$'
.
In LIKE patterns, the interpretation of characters between square brackets is more literal than normal. What SQL dialect are you targeting? See, for example, the T-SQL documentation for LIKE
(under Using Wildcard Characters as Literals). LIKE '5[%]'
should be translated as regex '^5%$'
.
answered Dec 7 '13 at 17:25
200_success
127k15148410
127k15148410
I think a more natural (and equivalent) translation would be^5[%]$
.
– svick
Dec 7 '13 at 18:14
add a comment |
I think a more natural (and equivalent) translation would be^5[%]$
.
– svick
Dec 7 '13 at 18:14
I think a more natural (and equivalent) translation would be
^5[%]$
.– svick
Dec 7 '13 at 18:14
I think a more natural (and equivalent) translation would be
^5[%]$
.– svick
Dec 7 '13 at 18:14
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f36861%2fconvert-sql-like-to-regex%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
What if it starts with and endsWith? I don't understand the need for startsWith to check that it doesn't also endWith
– Cruncher
Dec 7 '13 at 16:16
LIKE '%abc%' (i.e. contains) converts to ".*?abc.*?" which captures abcdef and 123abcdef The regex ^abc$ (which is what would happen without those tests) only captures strings both starting and ending with abc.
– dkackman
Dec 7 '13 at 16:26