Creating a csv file using scrapy


























I've written a Python script with Scrapy to parse movie names and their release years spread across multiple pages of a torrent site. My goal is to write the parsed data to a CSV file myself rather than using Scrapy's built-in export command, because when I run this:



scrapy crawl torrentdata -o outputfile.csv -t csv


the resulting CSV file has a blank line in every alternate row.



So I decided to take a slightly different route to achieve the same thing. With the script below I now get a correctly formatted CSV file full of data. Most importantly, I used a with statement when creating the CSV file, so the file is closed automatically once the writing is done. I used CrawlerProcess to run the script from within an IDE.




My question: isn't it a better idea to follow the approach I've tried below?




This is the working script:



import scrapy
from scrapy.crawler import CrawlerProcess
import csv

class TorrentSpider(scrapy.Spider):
    name = "torrentdata"
    # Listing pages 2-19 of the browse section
    start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]
    itemlist = []

    def parse(self, response):
        for record in response.css('.browse-movie-bottom'):
            items = {}
            items["Name"] = record.css('.browse-movie-title::text').extract_first(default='')
            items["Year"] = record.css('.browse-movie-year::text').extract_first(default='')
            self.itemlist.append(items)

        # Rewrite everything collected so far; the with statement closes the file.
        with open("outputfile.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, ['Name', 'Year'])
            writer.writeheader()
            for data in self.itemlist:
                writer.writerow(data)

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(TorrentSpider)
c.start()









Tags: python, python-3.x, web-scraping, scrapy






asked Dec 16 at 15:12 by robots.txt; edited Dec 16 at 17:44 by Reinderien






















          2 Answers
































By putting the CSV-exporting logic into the spider itself, you are re-inventing the wheel and not taking advantage of Scrapy and its components. You are also making the crawl slower, because you write to disk during the crawling stage every time the callback is triggered.



As you mentioned, the CSV exporter is built in; you just need to yield/return items from the parse() callback:



import scrapy


class TorrentSpider(scrapy.Spider):
    name = "torrentdata"
    start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]

    def parse(self, response):
        for record in response.css('.browse-movie-bottom'):
            yield {
                "Name": record.css('.browse-movie-title::text').extract_first(default=''),
                "Year": record.css('.browse-movie-year::text').extract_first(default='')
            }


          Then, by running:



          scrapy runspider spider.py -o outputfile.csv -t csv


          (or the crawl command)



you would get the following in outputfile.csv:



          Name,Year
          "Faith, Love & Chocolate",2018
          Bennett's Song,2018
          ...
          Tender Mercies,1983
          You Might Be the Killer,2018
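
If you would rather keep launching the crawl from an IDE, as in the original script, the built-in feed exporter can also be configured programmatically instead of on the command line. Here is a minimal sketch, assuming the FEED_URI/FEED_FORMAT settings of the Scrapy 1.x releases current at the time of writing (newer Scrapy versions use the FEEDS setting instead):

import scrapy
from scrapy.crawler import CrawlerProcess


class TorrentSpider(scrapy.Spider):
    name = "torrentdata"
    start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]

    def parse(self, response):
        for record in response.css('.browse-movie-bottom'):
            yield {
                "Name": record.css('.browse-movie-title::text').extract_first(default=''),
                "Year": record.css('.browse-movie-year::text').extract_first(default='')
            }


# Configure the CSV feed export in the process settings instead of on the CLI.
# FEED_URI/FEED_FORMAT are assumed here (Scrapy 1.x); newer releases would use
# the FEEDS dict instead.
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'outputfile.csv',
})
c.crawl(TorrentSpider)
c.start()

Everything CSV-related then stays inside Scrapy's feed-export machinery, and the spider itself only yields items.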





answered Dec 16 at 15:30 by alecxe











































Although I'm not an expert on this, here is a solution I've been using for quite some time. Making use of signals might be a wise choice here. When the scraping is finished, the spider_closed() method is invoked, so the file handed to DictWriter() is opened only once, and the with statement closes it automatically when the writing is done. This way there is hardly any chance of the script being slowed down by disk I/O during the crawl.

The following script demonstrates the idea:



import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import signals
import csv

class TorrentSpider(scrapy.Spider):
    name = "torrentdata"
    start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 10)]
    itemlist = []

    @classmethod
    def from_crawler(cls, crawler):
        # Connect the spider to the spider_closed signal.
        spider = super().from_crawler(crawler)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_closed(self):
        # Runs once, after the crawl has finished.
        with open("outputfile.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, ['Name', 'Year'])
            writer.writeheader()
            for data in self.itemlist:
                writer.writerow(data)

    def parse(self, response):
        for record in response.css('.browse-movie-bottom'):
            items = {}
            items["Name"] = record.css('.browse-movie-title::text').extract_first(default='')
            items["Year"] = record.css('.browse-movie-year::text').extract_first(default='')
            self.itemlist.append(items)

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(TorrentSpider)
c.start()
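
As a small simplification, Scrapy also treats a spider method named closed() as a shortcut for the spider_closed signal, so the explicit from_crawler() override isn't strictly needed. A minimal sketch of the same approach using that shortcut, which also keeps itemlist as an instance attribute instead of a shared class-level list:

import scrapy
from scrapy.crawler import CrawlerProcess
import csv


class TorrentSpider(scrapy.Spider):
    name = "torrentdata"
    start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 10)]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.itemlist = []  # per-instance list instead of a class attribute

    def parse(self, response):
        for record in response.css('.browse-movie-bottom'):
            self.itemlist.append({
                "Name": record.css('.browse-movie-title::text').extract_first(default=''),
                "Year": record.css('.browse-movie-year::text').extract_first(default='')
            })

    def closed(self, reason):
        # Shortcut for the spider_closed signal: runs once when the crawl ends.
        with open("outputfile.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, ['Name', 'Year'])
            writer.writeheader()
            writer.writerows(self.itemlist)


c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(TorrentSpider)
c.start()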





answered Dec 17 at 12:52 by asmitu; edited Dec 17 at 13:01






















