How to download part of a site with httrack including assets?
I would like to download http://www.example.com/foobar and every HTML page linked from there whose URL starts with http://www.example.com/foobar. I would also like to download every non-HTML asset linked from those pages, regardless of its path. I tried:
httrack http://www.example.com/foobar -mime:text/html +http://www.example.com/foobar*
and also added:
+http://www.example.com/foobar +http://www.example.com/foobar/*
but this resulted in no pages being downloaded whatsoever.
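To make the intended selection rule concrete, here is a minimal sketch of it as a shell predicate. It is only an illustration of the rule described above; `should_fetch` and its two arguments are hypothetical and are not part of httrack or wget.

```shell
#!/bin/sh
# Hypothetical helper (not an httrack or wget feature): decide whether a URL
# belongs in the mirror.
#   $1 = URL
#   $2 = MIME type reported by the server
should_fetch() {
  url=$1
  mime=$2
  prefix="http://www.example.com/foobar"
  case "$mime" in
    text/html*)
      # HTML pages are wanted only when the URL starts with the prefix.
      case "$url" in
        "$prefix"*) return 0 ;;
        *) return 1 ;;
      esac
      ;;
    *)
      # Non-HTML assets are wanted wherever they live.
      return 0
      ;;
  esac
}

# Usage: report the decision for one URL.
if should_fetch "http://www.example.com/foobar/baz" "text/html"; then
  echo "fetch"
else
  echo "skip"
fi
```

The point of the sketch is that the decision depends on both the URL and the content type, which is why a single URL-prefix filter cannot express it.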
httrack
Why did you add -mime:text/html? This says to ignore HTML files, and without them there are no links and nothing to download.
– harrymc Dec 17 '18 at 12:49
Well, I hoped the `+example.com/foobar*` filter would override it. The problem is that http://www.example.com/foobar is an HTML page, http://www.example.com/foobar/baz is another, and so on. I do not know how to tell httrack to download these plus assets anywhere.
– chx Dec 17 '18 at 17:23
Without the mime part, do you get anything?
– harrymc Dec 17 '18 at 17:25
Well, I either don't get assets if I try -* instead (because they are not under /foobar), or it tries to download the entire site.
– chx Dec 17 '18 at 17:52
@JoseManuelGomezAlvarez: The poster already solved the problem using wget. Nothing left to do here.
– harrymc Dec 19 '18 at 20:58
edited Dec 17 '18 at 12:46 by harrymc
asked Dec 14 '18 at 11:49 by chx
1 Answer
I still have no idea how to do this with httrack. I would really like to understand how httrack filters work, but apparently that's not going to happen; everyone just repeats the same useless manual page. I was, however, able to solve my problem with wget, although not as asked. I actually know where the assets reside, so I was able to do this:
wget -rkpEI foobar/,assetpath1/,assetpath2/ https://www.example.com/foobar
This worked, more or less. To be fair, I later needed to loop over every file and redownload each one individually: when downloading a single file, wget's -k option turns every link into an absolute URL, which is really helpful for later sed work.
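For readers unfamiliar with the bundled short flags, the same command can be spelled with wget's long option names: -r is --recursive, -k is --convert-links, -p is --page-requisites, -E is --adjust-extension, and -I is --include-directories. The sketch below only prints the command by default; the `mirror_cmd` function and the `RUN` variable are assumptions made for this illustration, and foobar/, assetpath1/ and assetpath2/ are the placeholder paths from the answer.

```shell
#!/bin/sh
# Print the mirror command using long option names instead of -rkpEI.
mirror_cmd() {
  echo wget \
    --recursive \
    --convert-links \
    --page-requisites \
    --adjust-extension \
    --include-directories=foobar/,assetpath1/,assetpath2/ \
    https://www.example.com/foobar
}

# Dry run by default: show the command without touching the network.
# Set RUN=1 in the environment to execute it for real.
if [ "${RUN:-0}" = "1" ]; then
  eval "$(mirror_cmd)"
else
  mirror_cmd
fi
```

Restricting the recursion with --include-directories is what keeps wget from crawling the whole site, at the cost of having to know the asset directories in advance.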
answered Dec 20 '18 at 19:18 by chx