What does double slash (//) directory mean in robots.txt?
You will get the following output with:
curl https://www.ibm.com/robots.txt
(I have deleted many lines, keeping only part of it.)
User-agent: *
Disallow: //
Disallow: /account/registration
Disallow: /account/mypro
Disallow: /account/myint
# Added to block site mirroring
User-agent: HTTrack
Disallow: /
#
I understand that / means the root directory, but what does the double slash // mean here in robots.txt?
Tags: linux, home-directory
asked Nov 27 at 1:44 by scrapy; edited Nov 27 at 1:52 by JakeGould

It could be a typo; I can't find a single reference to a double slash in any of the official Robot Exclusion documents. – Michael Frank, Nov 27 at 1:51

@MichaelFrank Typo or a coding fluke made by an automated system generating a robots.txt on demand. – JakeGould, Nov 27 at 2:00
1 Answer
answered Nov 27 at 1:58 by JakeGould (accepted)
This seems like a mistake:

Disallow: //

The robots.txt spec—as outlined here—clearly states:

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".

But some people claim that is not the case, such as this site, which states that Google can handle pattern matching:

Pattern matching: At this time, pattern matching appears to be usable by the three majors: Google, Yahoo, and Live Search. The value of pattern matching is considerable. Let’s look first at the most basic of pattern matching, using the asterisk wildcard character.

But regardless of that, since there is no wildcard (*) globbing involved here, // would literally mean a directory with an empty name under the root directory, which just seems odd.
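For what it's worth, under the classic rules a Disallow value is compared against the URL path as a literal prefix. Here is a minimal Python sketch of that matching (my own illustration of the prefix semantics, not any particular crawler's implementation):

def is_blocked(path: str, disallow: str) -> bool:
    """Classic robots.txt semantics: a URL path is blocked when it
    starts with the Disallow value, compared as a literal prefix."""
    return path.startswith(disallow)

# "Disallow: //" only matches paths that literally begin with two
# slashes; ordinary root-level paths do not.
print(is_blocked("/account/registration", "//"))   # False
print(is_blocked("//account/registration", "//"))  # True
print(is_blocked("/account/registration", "/"))    # True ("Disallow: /" blocks everything)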
My guess is it’s a mistake of some sort. Yes, an IBM webmaster can make mistakes! But I would also guess that this robots.txt is generated automatically by some system, and that somewhere along the way a path such as /*/ was converted to // when the file was generated.
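As a purely hypothetical illustration of that failure mode (my own sketch, not IBM's actual generator), an emitter that strips wildcards—because the classic spec does not support globbing—would turn /*/ into //:

# Hypothetical generator: rules stored with wildcards, emitted
# after stripping "*" since the classic spec does not support it.
stored_rules = ["/*/", "/account/registration", "/account/mypro", "/account/myint"]
for rule in stored_rules:
    print("Disallow: " + rule.replace("*", ""))
# The first line comes out as "Disallow: //".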
Either that, or the entry is there specifically to prevent mistaken URLs with a redundant slash from being indexed. – grawity, Nov 27 at 5:51

@grawity Fair enough, but I am not too sure what the benefit would be of having a URL such as example.com//thing as some odd method of obscuring data from crawlers. – JakeGould, Nov 27 at 16:28
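For context on grawity's point: the two URLs really are distinct as far as path matching is concerned, even though many web servers happen to serve the same resource for both. A quick check with Python's urllib.parse:

from urllib.parse import urlsplit

# The redundant slash survives URL parsing, so a crawler doing
# literal prefix matching sees two distinct paths.
print(urlsplit("https://example.com/thing").path)   # -> /thing
print(urlsplit("https://example.com//thing").path)  # -> //thing

Whether a given server actually returns the same content for both forms depends on its configuration.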