Creating a CSV file using Scrapy
I've created a script in Python with Scrapy to parse movie names and their release years spread across multiple pages of a torrent site. My goal is to write the parsed data to a CSV file myself rather than use Scrapy's built-in export command, because when I run:

scrapy crawl torrentdata -o outputfile.csv -t csv

I get a blank line in every alternate row of the CSV file.
So I took a slightly different approach to achieve the same thing, and now I get a properly formatted, data-laden CSV file when I run the script below. Most importantly, I used a with statement when creating the CSV file so that it is closed automatically once the writing is done, and I used CrawlerProcess to run the spider from within an IDE.
My question: isn't the approach I tried below a better idea?
This is the working script:
import scrapy
from scrapy.crawler import CrawlerProcess
import csv


class TorrentSpider(scrapy.Spider):
    name = "torrentdata"
    start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]  # get something within list
    itemlist = []

    def parse(self, response):
        for record in response.css('.browse-movie-bottom'):
            items = {}
            items["Name"] = record.css('.browse-movie-title::text').extract_first(default='')
            items["Year"] = record.css('.browse-movie-year::text').extract_first(default='')
            self.itemlist.append(items)

        with open("outputfile.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, ['Name', 'Year'])
            writer.writeheader()
            for data in self.itemlist:
                writer.writerow(data)


c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})

c.crawl(TorrentSpider)
c.start()
Tags: python, python-3.x, web-scraping, scrapy
2 Answers
By putting the CSV-exporting logic into the spider itself, you are reinventing the wheel and not taking advantage of Scrapy and its components; you also make the crawl slower, because you write to disk during the crawling stage every time the callback is triggered.
As you mentioned, the CSV exporter is built in; you just need to yield/return items from the parse() callback:
import scrapy


class TorrentSpider(scrapy.Spider):
    name = "torrentdata"
    start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]  # get something within list

    def parse(self, response):
        for record in response.css('.browse-movie-bottom'):
            yield {
                "Name": record.css('.browse-movie-title::text').extract_first(default=''),
                "Year": record.css('.browse-movie-year::text').extract_first(default='')
            }
Then, by running:

scrapy runspider spider.py -o outputfile.csv -t csv

(or the crawl command) you would have the following in outputfile.csv:
Name,Year
"Faith, Love & Chocolate",2018
Bennett's Song,2018
...
Tender Mercies,1983
You Might Be the Killer,2018
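As a side note (not part of the original answer): if you still want to launch the spider from an IDE, as in the question, you can keep the built-in exporter by passing the feed-export settings to CrawlerProcess instead of writing the CSV yourself. This is a minimal sketch; the FEED_FORMAT/FEED_URI setting names assume an older Scrapy release, while newer versions use the FEEDS dictionary instead.

import scrapy
from scrapy.crawler import CrawlerProcess


class TorrentSpider(scrapy.Spider):
    name = "torrentdata"
    start_urls = ["https://yts.am/browse-movies?page={}".format(page)
                  for page in range(2, 20)]

    def parse(self, response):
        # Yield plain dicts and let Scrapy's feed exporter handle the CSV.
        for record in response.css('.browse-movie-bottom'):
            yield {
                "Name": record.css('.browse-movie-title::text').extract_first(default=''),
                "Year": record.css('.browse-movie-year::text').extract_first(default=''),
            }


c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',          # use the built-in CSV exporter
    'FEED_URI': 'outputfile.csv',  # where the exporter writes the feed
})
c.crawl(TorrentSpider)
c.start()

This keeps the "run from an IDE" convenience while leaving serialization, encoding, and file handling to Scrapy.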
Although I'm not an expert on this, here is a solution I've been using for quite some time.
Making use of signals might be a wise approach here. When the scraping process is done, the spider_closed() method is invoked, so the DictWriter() is opened only once, and the file is closed automatically when the writing is finished because of the with statement. This way there is hardly any chance of your script being slower, since you get rid of the repeated disk I/O.
The following script shows what I described:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import signals
import csv


class TorrentSpider(scrapy.Spider):
    name = "torrentdata"
    start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 10)]  # get something within list
    itemlist = []

    @classmethod
    def from_crawler(cls, crawler):
        spider = super().from_crawler(crawler)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_closed(self):
        with open("outputfile.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, ['Name', 'Year'])
            writer.writeheader()
            for data in self.itemlist:
                writer.writerow(data)

    def parse(self, response):
        for record in response.css('.browse-movie-bottom'):
            items = {}
            items["Name"] = record.css('.browse-movie-title::text').extract_first(default='')
            items["Year"] = record.css('.browse-movie-year::text').extract_first(default='')
            self.itemlist.append(items)


c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})

c.crawl(TorrentSpider)
c.start()
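A further note (an addition, not part of the original answer): Scrapy also recognizes a closed(reason) method on the spider and connects it to the spider_closed signal for you, so the same idea can be written without the from_crawler boilerplate. A minimal sketch of that variant:

import csv

import scrapy
from scrapy.crawler import CrawlerProcess


class TorrentSpider(scrapy.Spider):
    name = "torrentdata"
    start_urls = ["https://yts.am/browse-movies?page={}".format(page)
                  for page in range(2, 10)]
    itemlist = []

    def parse(self, response):
        # Collect items in memory; nothing is written to disk here.
        for record in response.css('.browse-movie-bottom'):
            self.itemlist.append({
                "Name": record.css('.browse-movie-title::text').extract_first(default=''),
                "Year": record.css('.browse-movie-year::text').extract_first(default=''),
            })

    def closed(self, reason):
        # Scrapy calls this once when the crawl ends (shortcut for the
        # spider_closed signal), so the file is opened and written only once.
        with open("outputfile.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, ['Name', 'Year'])
            writer.writeheader()
            writer.writerows(self.itemlist)


c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
c.crawl(TorrentSpider)
c.start()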