Wrapping a web scraper in a RESTful API












The problem I am looking to solve is wrapping a web scraper in a RESTful API such that it can be called programmatically from another application, frontend or microservice. The overall goal is that this piece of code will form one part of a larger application in a microservices architecture.



This program scrapes the Radio France Internationale (RFI) Journal en français facile (JEFF) website for the French transcriptions of its daily news podcast. The web scraper is built with Beautiful Soup, the API with Flask-RESTPlus, and the code is packaged in a Docker container.



The code operates as follows:




  1. The API is called with the desired program (broadcast) date, in DDMMYYYY format, as a path parameter.

  2. The web scraper is then called to scrape the webpage for that date.

  3. The API responds with the transcription of the broadcast as a list of paragraphs, along with metadata such as the title, URL, encoding and status code.
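To make the flow above concrete, here is a minimal client sketch using only the standard library. It rests on two assumptions that are not shown in the code below: that the container started with the docker run command is reachable at localhost:8000, and that the 'web' namespace is mounted at its default '/web' prefix. The helper names are hypothetical and are not part of the repository.

```python
import json
from datetime import date
from urllib.request import urlopen

# Assumptions: host and port come from the docker run command, and the
# '/web' prefix is the default mount point for a namespace named 'web'.
BASE_URL = 'http://localhost:8000/web'


def build_transcription_url(program_date, base_url=BASE_URL):
    """Returns the endpoint URL for a broadcast date (a datetime.date)."""
    # The API expects the date formatted as DDMMYYYY.
    return '{}/transcriptions/{}'.format(base_url,
                                         program_date.strftime('%d%m%Y'))


def fetch_transcription(program_date, base_url=BASE_URL, timeout=10):
    """Calls the API and returns the decoded JSON body as a dict."""
    url = build_transcription_url(program_date, base_url)
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))
```

With the container running, fetch_transcription(date(2018, 7, 23)) would request the broadcast of 23 July 2018 and return a dict with keys such as 'title' and 'article'.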


To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.



I include only the code for the web scraper and the API below. The full repository is at https://github.com/25Postcards/rfi_jeff_api.



You can clone the repo, build the container, and run it to test the program:



git clone https://github.com/25Postcards/rfi_jeff_api
sudo docker build . -t jeff_api:latest
sudo docker run -p 8000:8000 jeff_api


API



from flask_restplus import Namespace, Resource, fields

from core import jeff_scraper
from core import jeff_logger
from core import jeff_validators

api = Namespace('web', description='Operations on the RFI website.')

# This model definition is required so it can be registered to the API docs
transcriptions_model = api.model('Transcription', {
    'program_date': fields.String(required=True),
    'encoding': fields.String,
    'title': fields.String,
    'article': fields.List(fields.String),
    'url': fields.Url('trans_pd_ep'),
    'status_code': fields.String,
    'error_message': fields.String
})

@api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')
@api.param('program_date',
           'The program date for the broadcast. Accepted date format DDMMYYYY.')
@api.doc(model=transcriptions_model)
class Transcriptions(Resource):
    """A Transcriptions resource.
    """
    def get(self, program_date):
        """Gets the transcription from the scraper.

        Args:
            program_date (str): A string representing the program date

        Returns:
            validate_date_errors (dict): A dict of errors raised by the
                validator for the transcriptions schema.
            data (dict): A dict of attributes from the jeff transcriptions object
                containing the program date, title, article, etc. (see schema).
        """
        # Create validator, validate input
        ts = jeff_validators.TranscriptionsSchema()
        validate_date_errors = ts.validate({'program_date': program_date})
        if validate_date_errors:
            return validate_date_errors

        # Create scraper, scrape page
        jt = jeff_scraper.JeffTranscription(program_date)

        # Serialise JeffTranscription object to serialised (formatted) dict
        # according to Transcriptions Schema
        data, errors = ts.dump(jt)
        return data


Web scraper



import requests
import logging

from bs4 import BeautifulSoup

from core import jeff_errors
from core import jeff_logger

class JeffTranscription(object):
    """Represents a transcription from the rfi jeff website.

    Attributes:
        program_date (str): A string for the program date of the broadcast,
            accepted date format DDMMYYYY.
        title (str): A string for the title of the transcription.
        article (list(str)): A list of strings for each paragraph in the
            transcription article.
        encoding (str): A string defining the encoding of the transcription.
        is_error (bool): A boolean indicating if an error occurred.
        error_message (str): A string for error messages generated whilst requesting
            the webpage or whilst parsing the content.
        status_code (str): A string indicating the http status code for responses
            from the rfi jeff website.
        url (str): The URL for the rfi jeff webpage for the transcription.
    """

    def __init__(self, program_date):
        """Inits JeffTranscription with the program date."""
        self.program_date = program_date
        self._makeURL()

        self.title = None
        self.article = None
        self.encoding = None

        self.is_error = False
        self.error_message = None
        self.status_code = None

        self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')

        try:
            page_response = self._getPageResponse()
            page_content = page_response.content
            self._scrapePage(page_content)

        except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,
                jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:
            self._handleScraperErrors(e)

    def _handleScraperErrors(self, e):
        """Handles errors raised by the methods.

        Sets the is_error and error_message attributes.

        Args:
            e: An error object raised by the class methods
        """

        self.is_error = True
        self.error_message = e.message
        self._jeff_scraper_logger.logger.error(self.error_message)

    def _makeURL(self):
        """Makes the url for the RFI JEFF Website."""
        # Parentheses join the two string literals into a single URL prefix.
        RFI_JEFF_BASE_URL = ('https://savoirs.rfi.fr/fr/apprendre-enseigner/'
                             'langue-francaise/journal-en-francais-facile-')
        RFI_JEFF_END_URL = '-20h00-gmt'
        self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL

    def _getPageResponse(self):
        """Gets the response from the webpage.

        Returns:
            A requests.Response object.

        Raises:
            ScraperConnectionError: A connection error occurred.
            ScraperTimeoutError: A timeout error occurred.
            ScraperHTTPError: An HTTP Error occurred.
        """
        try:
            HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}
            page_response = requests.get(self.url, headers=HEADERS, timeout=5)
            self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')
            self.status_code = page_response.status_code
            self.encoding = page_response.encoding
            page_response.raise_for_status()

        except requests.exceptions.ConnectionError as e:
            raise jeff_errors.ScraperConnectionError from e

        except requests.exceptions.Timeout as e:
            raise jeff_errors.ScraperTimeoutError from e

        except requests.exceptions.HTTPError as e:
            raise jeff_errors.ScraperHTTPError from e

        return page_response

    def _scrapePage(self, page_content):
        """Parses the html content from the webpage response.

        Uses the Beautiful Soup library to parse the webpage content to extract
        the title of the broadcast and the transcription. The transcription
        is a series of paragraph elements within an article element that has a
        class attribute defined by ARTICLE_CLASS.

        Example Webpage HTML
            view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt
            Article element at line 518
            First paragraph element at line 532

        Args:
            page_content: The content of the webpage as HTML

        Raises:
            ScraperParserError: An error occurred parsing the page content.
        """

        try:
            ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"
            bs = BeautifulSoup(page_content, "html.parser")

            title_tag = bs.find('title')
            title_text = title_tag.get_text()

            # Find all the p elements within the article element that has the
            # ARTICLE_CLASS class attribute. Remove newline characters and
            # non-breaking spaces (\xa0) from the p element's text fields.
            # Create a list of strings, one list element for each paragraph.
            article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')
            article_p_text = [p_tag.get_text().replace('\n', '').replace(u'\xa0', u'')
                              for p_tag in article_p_tags]

            self._jeff_scraper_logger.logger.info('Page content parsed')

            self.title = title_text
            self.article = article_p_text

        except Exception as e:
            raise jeff_errors.ScraperParserError() from e
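The jeff_errors module is not shown here, so the following is only a sketch of the error-translation pattern the scraper relies on, not the repository's actual code. It assumes each custom error carries a `message` attribute (which `_handleScraperErrors` reads) and uses the built-in `TimeoutError` as a stand-in for the requests exceptions. All class and function names are hypothetical.

```python
class ScraperError(Exception):
    """Hypothetical base class; the real jeff_errors module is not shown."""
    message = 'A scraper error occurred.'

    def __init__(self, message=None):
        # Allow a per-instance message, else fall back to the class default.
        if message is not None:
            self.message = message
        super().__init__(self.message)


class ScraperTimeoutError(ScraperError):
    message = 'The request to the RFI JEFF website timed out.'


def get_page(fetch):
    """Translates a low-level error into a scraper error, keeping the cause."""
    try:
        return fetch()
    except TimeoutError as e:
        # 'from e' stores the original exception on __cause__.
        raise ScraperTimeoutError from e


def failing_fetch():
    """Stand-in for a request that times out."""
    raise TimeoutError('no response after 5 seconds')


try:
    get_page(failing_fetch)
except ScraperError as e:
    caught = e

print(caught.message)
print(repr(caught.__cause__))
```

The caller only needs to know about the scraper's own error types, while the original exception stays available on `__cause__` for logging.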


Notes/Questions:




  • I've stuck to the Google Python style guide where possible.

  • The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?

  • The _getPageResponse method contains a try/except to handle HTTP request errors. Is it correct to use a try/except inside a function or method to catch exceptions raised within it?

  • The _getPageResponse method is called from within another try/except block in the __init__ method. Is it good practice to nest try/except blocks like this, particularly inside __init__? The same concern applies to the _scrapePage method.

  • Should I be using Flask-RESTPlus to generate Swagger API specs/docs automatically? Or should I use a separate extension to generate them with Flask-RESTful (and if so, which extension)?
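To make the second question concrete, this is roughly the alternative I have in mind: a constructor that only records state, plus an explicit method that does the work. It is a simplified sketch, not code from the repository. The class name, the `scrape` method and the injected `fetch_page` callable are hypothetical, and the parsing step is a plain-text stand-in for the Beautiful Soup code.

```python
BASE_URL = ('https://savoirs.rfi.fr/fr/apprendre-enseigner/'
            'langue-francaise/journal-en-francais-facile-')
END_URL = '-20h00-gmt'


class LazyTranscription:
    """Hypothetical variant of JeffTranscription with an explicit scrape()."""

    def __init__(self, program_date, fetch_page):
        # The constructor only records state; it performs no I/O.
        self.program_date = program_date
        self.url = BASE_URL + program_date + END_URL
        self.article = None
        self.error_message = None
        self._fetch_page = fetch_page  # callable taking a URL, returning text

    def scrape(self):
        """Fetches and parses the page. Returns True on success."""
        try:
            text = self._fetch_page(self.url)
        except OSError as e:
            self.error_message = str(e)
            return False
        # Stand-in for the Beautiful Soup parsing: one entry per non-empty line.
        self.article = [line.strip() for line in text.splitlines()
                        if line.strip()]
        return True
```

With this shape, creating the object can never fail because of the network, and a fake `fetch_page` can be passed in to exercise the class without touching the RFI website.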










share|improve this question





























    2














    The problem I am looking to solve is wrapping a web scraper in a RESTful API such that it can be called programmatically from another application, frontend or microservice. The overall goal is that this piece of code will form one part of a larger application in a microservices architecture.



    This program scrapes the Radio Francais International Journal en Francais Facile (RFI JEFF) website for the French transcriptions of their daily news podcast. The web scraper is built using Beautiful Soup, the API is built using Flask Restplus, and the code is packaged in a Docker container.



    The code operates as follows:




    1. The API is called with the desired program (broadcast) date as the data payload

    2. The web scraper is then called to scrape the appropriate webpage for that given date

    3. The response of the API call is a transcription of the broadcast in the form of a list of sentences/paragraphs as well as other information


    To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.



    I will include a few sections of code for the web scraper and API only. The full repository can be found here.



    You can clone the repo, build the container, and run it to test the program:



    git clone https://github.com/25Postcards/rfi_jeff_api
    sudo docker build . -t jeff_api:latest
    sudo docker run -p 8000:8000 jeff_api


    API



    from flask_restplus import Namespace, Resource, fields

    from core import jeff_scraper
    from core import jeff_logger
    from core import jeff_validators

    api = Namespace('web', description='Operations on the RFI website.')

    # This model definition is required so it can be registered to the API docs
    transcriptions_model = api.model('Transcription', {
    'program_date': fields.String(required=True),
    'encoding': fields.String,
    'title': fields.String,
    'article': fields.List(fields.String),
    'url': fields.Url('trans_pd_ep'),
    'status_code': fields.String,
    'error_message': fields.String
    })

    @api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')
    @api.param('program_date',
    'The program date for the broadcast. Accepted date format DDMMYYYY.')
    @api.doc(model=transcriptions_model)
    class Transcriptions(Resource):
    """A Transcriptions resource.
    """
    def get(self, program_date):
    """Gets the transcription from the scrapper.

    Args:
    program_date (str): A string representing the program date

    Returns:
    validate_date_errors (dict): A dict of errors raised by the
    validator for the transcriptions schema.
    data (dict): A dict of attributes from the jeff transcriptions object
    containing the program date, title, article, etc. (see schema).
    """
    # Create validator, validate input
    ts = jeff_validators.TranscriptionsSchema()
    validate_date_errors = ts.validate({'program_date': program_date})
    if validate_date_errors:
    return validate_date_errors

    # Create scrapper, scrape page
    jt = jeff_scraper.JeffTranscription(program_date)

    # Serialise JeffTranscription object to serialised (formatted) dict
    # according to Transcriptions Schema
    data, errors = ts.dump(jt)
    return data


    Web scraper



    import requests
    import logging

    from bs4 import BeautifulSoup

    from core import jeff_errors
    from core import jeff_logger

    class JeffTranscription(object):
    """Represents a transcription from the rfi jeff website.

    Attributes:
    program_date (str): A string for the program date of the broadcast,
    accepted date format DDMMYYYY.
    title (str): A string for the title of the transcription.
    article (list(str)): A list of strings for each paragraph in the
    transcription article.
    encoding (str): A string defining the encoding of the transcription.
    is_error (bool): A boolean indicating if an error occurred.
    error_message (str): A string for error messages generated whilst requesting
    the webpage or whilst parsing the content.
    status_code (str): A string indicating the http status code for responses
    from the rfi jeff website.
    url (str): The URL for the rfi jeff webpage for the transcription.
    """

    def __init__(self, program_date):
    """Inits JeffTranscription with the program date."""
    self.program_date = program_date
    self._makeURL()

    self.title = None
    self.article = None
    self.encoding = None

    self.is_error = False
    self.error_message = None
    self.status_code = None

    self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')

    try:
    page_response = self._getPageResponse()
    page_content = page_response.content
    self._scrapePage(page_content)

    except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,
    jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:
    self._handleScraperErrors(e)

    def _handleScraperErrors(self, e):
    """Handles errors raised by the methods.

    Sets the is_error and error_message attributes.

    Args:
    e: An error object raised by the class methods
    """

    self.is_error = True
    self.error_message = e.message
    self._jeff_scraper_logger.logger.error(self.error_message)

    def _makeURL(self):
    """Makes the url for the RFI JEFF Website."""
    RFI_JEFF_BASE_URL = 'https://savoirs.rfi.fr/fr/apprendre-enseigner/'
    'langue-francaise/journal-en-francais-facile-'
    RFI_JEFF_END_URL = '-20h00-gmt'
    self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL

    def _getPageResponse(self):
    """Gets the response from the webpage.

    Returns:
    A requests.response object.

    Raises:
    ScraperConnectionError: A connection error occurred.
    ScraperTimeoutError: A timeout error occurred.
    ScraperHTTPError: An HTTP Error occurred.
    """
    try:
    HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}
    page_response = requests.get(self.url, headers=HEADERS, timeout=5)
    self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')
    self.status_code = page_response.status_code
    self.encoding = page_response.encoding
    page_response.raise_for_status()

    except requests.exceptions.ConnectionError as e:
    raise jeff_errors.ScraperConnectionError from e

    except requests.exceptions.Timeout as e:
    raise jeff_errors.ScraperTimeoutError from e

    except requests.exceptions.HTTPError as e:
    raise jeff_errors.ScraperHTTPError from e

    return page_response

    def _scrapePage(self, page_content):
    """Parses the html content from the webpage response.

    Uses the Beautiful Soup library to parse the webpage content to extract
    the title of the broadcast and the transcription. The transcription
    is a series of paragraph elements within an article element that has a
    class attribute defined by ARTICLE_CLASS.

    Example Webpage HTML
    view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt
    Article element at line 518
    First paragraph element at line 532

    Args:
    page_content: The content of the webpage as HTML

    Raises:
    ScraperParserError: An error occurred parsing the page content.
    """

    try:
    ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"
    bs = BeautifulSoup(page_content, "html.parser")

    title_tag = bs.find('title')
    title_text = title_tag.get_text()

    # Find all the p elements within the article element that has the
    # ARTICLE_CLASS class attribute. Remove newline characters and
    # unwanted unicode characters from the p element's text fields.
    # Create a list of strings, one list element for each paragraph.
    article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')
    article_p_text = [p_tag.get_text().replace('n', '').replace(u'xa0', u'')
    for p_tag in article_p_tags]

    self._jeff_scraper_logger.logger.info('Page content parsed')

    self.title = title_text
    self.article = article_p_text

    except Exception as e:
    raise jeff_errors.ScraperParserError() from e


    Notes/Questions:




    • I've stuck to the Google Python style guide where possible.

    • The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?

    • The _getPageResponse method contains a try/except to handle HTTP request errors. Is is correct to use a try/except with a function or method to catch exceptions within the function?

    • The _getPageResponse method is called from within another try/except block within the init method, is it good practice to have try/except within try/except that are within the init methods? The same concern arises with the _scrapePage method.

    • Should I be using Flask Restplus to generate swagger API specs/docs automatically? Or should I use a separate extension to generate swagger API specs/docs with Flask Restful (if so, what would this swagger extension be)?










    share|improve this question



























      2












      2








      2







      The problem I am looking to solve is wrapping a web scraper in a RESTful API such that it can be called programmatically from another application, frontend or microservice. The overall goal is that this piece of code will form one part of a larger application in a microservices architecture.



      This program scrapes the Radio Francais International Journal en Francais Facile (RFI JEFF) website for the French transcriptions of their daily news podcast. The web scraper is built using Beautiful Soup, the API is built using Flask Restplus, and the code is packaged in a Docker container.



      The code operates as follows:




      1. The API is called with the desired program (broadcast) date as the data payload

      2. The web scraper is then called to scrape the appropriate webpage for that given date

      3. The response of the API call is a transcription of the broadcast in the form of a list of sentences/paragraphs as well as other information


      To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.



      I will include a few sections of code for the web scraper and API only. The full repository can be found here.



      You can clone the repo, build the container, and run it to test the program:



      git clone https://github.com/25Postcards/rfi_jeff_api
      sudo docker build . -t jeff_api:latest
      sudo docker run -p 8000:8000 jeff_api


      API



      from flask_restplus import Namespace, Resource, fields

      from core import jeff_scraper
      from core import jeff_logger
      from core import jeff_validators

      api = Namespace('web', description='Operations on the RFI website.')

      # This model definition is required so it can be registered to the API docs
      transcriptions_model = api.model('Transcription', {
      'program_date': fields.String(required=True),
      'encoding': fields.String,
      'title': fields.String,
      'article': fields.List(fields.String),
      'url': fields.Url('trans_pd_ep'),
      'status_code': fields.String,
      'error_message': fields.String
      })

      @api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')
      @api.param('program_date',
      'The program date for the broadcast. Accepted date format DDMMYYYY.')
      @api.doc(model=transcriptions_model)
      class Transcriptions(Resource):
      """A Transcriptions resource.
      """
      def get(self, program_date):
      """Gets the transcription from the scrapper.

      Args:
      program_date (str): A string representing the program date

      Returns:
      validate_date_errors (dict): A dict of errors raised by the
      validator for the transcriptions schema.
      data (dict): A dict of attributes from the jeff transcriptions object
      containing the program date, title, article, etc. (see schema).
      """
      # Create validator, validate input
      ts = jeff_validators.TranscriptionsSchema()
      validate_date_errors = ts.validate({'program_date': program_date})
      if validate_date_errors:
      return validate_date_errors

      # Create scrapper, scrape page
      jt = jeff_scraper.JeffTranscription(program_date)

      # Serialise JeffTranscription object to serialised (formatted) dict
      # according to Transcriptions Schema
      data, errors = ts.dump(jt)
      return data


      Web scraper



      import requests
      import logging

      from bs4 import BeautifulSoup

      from core import jeff_errors
      from core import jeff_logger

      class JeffTranscription(object):
      """Represents a transcription from the rfi jeff website.

      Attributes:
      program_date (str): A string for the program date of the broadcast,
      accepted date format DDMMYYYY.
      title (str): A string for the title of the transcription.
      article (list(str)): A list of strings for each paragraph in the
      transcription article.
      encoding (str): A string defining the encoding of the transcription.
      is_error (bool): A boolean indicating if an error occurred.
      error_message (str): A string for error messages generated whilst requesting
      the webpage or whilst parsing the content.
      status_code (str): A string indicating the http status code for responses
      from the rfi jeff website.
      url (str): The URL for the rfi jeff webpage for the transcription.
      """

      def __init__(self, program_date):
      """Inits JeffTranscription with the program date."""
      self.program_date = program_date
      self._makeURL()

      self.title = None
      self.article = None
      self.encoding = None

      self.is_error = False
      self.error_message = None
      self.status_code = None

      self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')

      try:
      page_response = self._getPageResponse()
      page_content = page_response.content
      self._scrapePage(page_content)

      except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,
      jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:
      self._handleScraperErrors(e)

      def _handleScraperErrors(self, e):
      """Handles errors raised by the methods.

      Sets the is_error and error_message attributes.

      Args:
      e: An error object raised by the class methods
      """

      self.is_error = True
      self.error_message = e.message
      self._jeff_scraper_logger.logger.error(self.error_message)

      def _makeURL(self):
      """Makes the url for the RFI JEFF Website."""
      RFI_JEFF_BASE_URL = 'https://savoirs.rfi.fr/fr/apprendre-enseigner/'
      'langue-francaise/journal-en-francais-facile-'
      RFI_JEFF_END_URL = '-20h00-gmt'
      self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL

      def _getPageResponse(self):
      """Gets the response from the webpage.

      Returns:
      A requests.response object.

      Raises:
      ScraperConnectionError: A connection error occurred.
      ScraperTimeoutError: A timeout error occurred.
      ScraperHTTPError: An HTTP Error occurred.
      """
      try:
      HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}
      page_response = requests.get(self.url, headers=HEADERS, timeout=5)
      self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')
      self.status_code = page_response.status_code
      self.encoding = page_response.encoding
      page_response.raise_for_status()

      except requests.exceptions.ConnectionError as e:
      raise jeff_errors.ScraperConnectionError from e

      except requests.exceptions.Timeout as e:
      raise jeff_errors.ScraperTimeoutError from e

      except requests.exceptions.HTTPError as e:
      raise jeff_errors.ScraperHTTPError from e

      return page_response

      def _scrapePage(self, page_content):
      """Parses the html content from the webpage response.

      Uses the Beautiful Soup library to parse the webpage content to extract
      the title of the broadcast and the transcription. The transcription
      is a series of paragraph elements within an article element that has a
      class attribute defined by ARTICLE_CLASS.

      Example Webpage HTML
      view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt
      Article element at line 518
      First paragraph element at line 532

      Args:
      page_content: The content of the webpage as HTML

      Raises:
      ScraperParserError: An error occurred parsing the page content.
      """

      try:
      ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"
      bs = BeautifulSoup(page_content, "html.parser")

      title_tag = bs.find('title')
      title_text = title_tag.get_text()

      # Find all the p elements within the article element that has the
      # ARTICLE_CLASS class attribute. Remove newline characters and
      # unwanted unicode characters from the p element's text fields.
      # Create a list of strings, one list element for each paragraph.
      article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')
      article_p_text = [p_tag.get_text().replace('n', '').replace(u'xa0', u'')
      for p_tag in article_p_tags]

      self._jeff_scraper_logger.logger.info('Page content parsed')

      self.title = title_text
      self.article = article_p_text

      except Exception as e:
      raise jeff_errors.ScraperParserError() from e


      Notes/Questions:




      • I've stuck to the Google Python style guide where possible.

      • The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?

      • The _getPageResponse method contains a try/except to handle HTTP request errors. Is is correct to use a try/except with a function or method to catch exceptions within the function?

      • The _getPageResponse method is called from within another try/except block within the init method, is it good practice to have try/except within try/except that are within the init methods? The same concern arises with the _scrapePage method.

      • Should I be using Flask Restplus to generate swagger API specs/docs automatically? Or should I use a separate extension to generate swagger API specs/docs with Flask Restful (if so, what would this swagger extension be)?










      share|improve this question















      The problem I am looking to solve is wrapping a web scraper in a RESTful API such that it can be called programmatically from another application, frontend or microservice. The overall goal is that this piece of code will form one part of a larger application in a microservices architecture.



      This program scrapes the Radio Francais International Journal en Francais Facile (RFI JEFF) website for the French transcriptions of their daily news podcast. The web scraper is built using Beautiful Soup, the API is built using Flask Restplus, and the code is packaged in a Docker container.



      The code operates as follows:




      1. The API is called with the desired program (broadcast) date as the data payload

      2. The web scraper is then called to scrape the appropriate webpage for that given date

      3. The response of the API call is a transcription of the broadcast in the form of a list of sentences/paragraphs as well as other information


      To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.



      I will include a few sections of code for the web scraper and API only. The full repository can be found here.



      You can clone the repo, build the container, and run it to test the program:



      git clone https://github.com/25Postcards/rfi_jeff_api
      sudo docker build . -t jeff_api:latest
      sudo docker run -p 8000:8000 jeff_api


      API



      from flask_restplus import Namespace, Resource, fields

      from core import jeff_scraper
      from core import jeff_logger
      from core import jeff_validators

      api = Namespace('web', description='Operations on the RFI website.')

      # This model definition is required so it can be registered to the API docs
      transcriptions_model = api.model('Transcription', {
      'program_date': fields.String(required=True),
      'encoding': fields.String,
      'title': fields.String,
      'article': fields.List(fields.String),
      'url': fields.Url('trans_pd_ep'),
      'status_code': fields.String,
      'error_message': fields.String
      })

      @api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')
      @api.param('program_date',
      'The program date for the broadcast. Accepted date format DDMMYYYY.')
      @api.doc(model=transcriptions_model)
      class Transcriptions(Resource):
      """A Transcriptions resource.
      """
      def get(self, program_date):
      """Gets the transcription from the scrapper.

      Args:
      program_date (str): A string representing the program date

      Returns:
      validate_date_errors (dict): A dict of errors raised by the
      validator for the transcriptions schema.
      data (dict): A dict of attributes from the jeff transcriptions object
      containing the program date, title, article, etc. (see schema).
      """
      # Create validator, validate input
      ts = jeff_validators.TranscriptionsSchema()
      validate_date_errors = ts.validate({'program_date': program_date})
      if validate_date_errors:
      return validate_date_errors

      # Create scrapper, scrape page
      jt = jeff_scraper.JeffTranscription(program_date)

      # Serialise JeffTranscription object to serialised (formatted) dict
      # according to Transcriptions Schema
      data, errors = ts.dump(jt)
      return data


      Web scraper



import requests
import logging

from bs4 import BeautifulSoup

from core import jeff_errors
from core import jeff_logger

class JeffTranscription(object):
    """Represents a transcription from the rfi jeff website.

    Attributes:
        program_date (str): A string for the program date of the broadcast,
            accepted date format DDMMYYYY.
        title (str): A string for the title of the transcription.
        article (list(str)): A list of strings for each paragraph in the
            transcription article.
        encoding (str): A string defining the encoding of the transcription.
        is_error (bool): A boolean indicating if an error occurred.
        error_message (str): A string for error messages generated whilst requesting
            the webpage or whilst parsing the content.
        status_code (str): A string indicating the http status code for responses
            from the rfi jeff website.
        url (str): The URL for the rfi jeff webpage for the transcription.
    """

    def __init__(self, program_date):
        """Inits JeffTranscription with the program date."""
        self.program_date = program_date
        self._makeURL()

        self.title = None
        self.article = None
        self.encoding = None

        self.is_error = False
        self.error_message = None
        self.status_code = None

        self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')

        try:
            page_response = self._getPageResponse()
            page_content = page_response.content
            self._scrapePage(page_content)

        except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,
                jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:
            self._handleScraperErrors(e)

    def _handleScraperErrors(self, e):
        """Handles errors raised by the methods.

        Sets the is_error and error_message attributes.

        Args:
            e: An error object raised by the class methods
        """

        self.is_error = True
        self.error_message = e.message
        self._jeff_scraper_logger.logger.error(self.error_message)

    def _makeURL(self):
        """Makes the url for the RFI JEFF Website."""
        # Parentheses join the two string literals into one base URL.
        RFI_JEFF_BASE_URL = ('https://savoirs.rfi.fr/fr/apprendre-enseigner/'
                             'langue-francaise/journal-en-francais-facile-')
        RFI_JEFF_END_URL = '-20h00-gmt'
        self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL

    def _getPageResponse(self):
        """Gets the response from the webpage.

        Returns:
            A requests.Response object.

        Raises:
            ScraperConnectionError: A connection error occurred.
            ScraperTimeoutError: A timeout error occurred.
            ScraperHTTPError: An HTTP Error occurred.
        """
        try:
            HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}
            page_response = requests.get(self.url, headers=HEADERS, timeout=5)
            self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')
            self.status_code = page_response.status_code
            self.encoding = page_response.encoding
            page_response.raise_for_status()

        except requests.exceptions.ConnectionError as e:
            raise jeff_errors.ScraperConnectionError from e

        except requests.exceptions.Timeout as e:
            raise jeff_errors.ScraperTimeoutError from e

        except requests.exceptions.HTTPError as e:
            raise jeff_errors.ScraperHTTPError from e

        return page_response

    def _scrapePage(self, page_content):
        """Parses the html content from the webpage response.

        Uses the Beautiful Soup library to parse the webpage content to extract
        the title of the broadcast and the transcription. The transcription
        is a series of paragraph elements within an article element that has a
        class attribute defined by ARTICLE_CLASS.

        Example Webpage HTML
            view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt
            Article element at line 518
            First paragraph element at line 532

        Args:
            page_content: The content of the webpage as HTML

        Raises:
            ScraperParserError: An error occurred parsing the page content.
        """

        try:
            ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"
            bs = BeautifulSoup(page_content, "html.parser")

            title_tag = bs.find('title')
            title_text = title_tag.get_text()

            # Find all the p elements within the article element that has the
            # ARTICLE_CLASS class attribute. Remove newline characters and
            # unwanted unicode characters from the p element's text fields.
            # Create a list of strings, one list element for each paragraph.
            article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')
            article_p_text = [p_tag.get_text().replace('\n', '').replace(u'\xa0', u'')
                              for p_tag in article_p_tags]

            self._jeff_scraper_logger.logger.info('Page content parsed')

            self.title = title_text
            self.article = article_p_text

        except Exception as e:
            raise jeff_errors.ScraperParserError() from e
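
The cleaning step described in the comments of _scrapePage can be tried in isolation, without Beautiful Soup or a network connection. The following is a minimal sketch, assuming the intent is to drop newline characters and non-breaking spaces (U+00A0) from each paragraph. The helper name clean_paragraph and the sample strings are hypothetical.

```python
def clean_paragraph(text):
    """Removes newline characters and non-breaking spaces from a paragraph."""
    # '\n' is the newline escape and '\xa0' is the non-breaking space;
    # without the backslashes the calls would target the letter 'n'
    # and the literal text 'xa0' instead.
    return text.replace('\n', '').replace('\xa0', '')


# Hypothetical sample data standing in for the p elements' text.
raw_paragraphs = ['Bonjour\xa0!\n', 'Voici le journal.\n']
cleaned = [clean_paragraph(p) for p in raw_paragraphs]
print(cleaned)
```

Keeping this logic in a small pure function would make it straightforward to unit test separately from the HTTP request.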


      Notes/Questions:




      • I've stuck to the Google Python style guide where possible.

      • The scraping is performed when the scraper is instantiated. Is this good practice, or should I create a method to initiate or invoke the scraping code?

      • The _getPageResponse method contains a try/except to handle HTTP request errors. Is it correct to use a try/except inside a function or method to catch exceptions raised within it?

      • The _getPageResponse method is called from within another try/except block in the __init__ method. Is it good practice to nest try/except blocks like this, and to have them inside __init__? The same concern arises with the _scrapePage method.

      • Should I be using Flask Restplus to generate swagger API specs/docs automatically? Or should I use a separate extension to generate swagger API specs/docs with Flask Restful (if so, what would this swagger extension be)?
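
To make the second question concrete, this is roughly what the alternative I have in mind would look like: the constructor only stores state, and a separate method performs the scraping. It is a minimal sketch, not the code in the repository. The class name LazyTranscription, the scrape method and the injected fetch callable are all hypothetical, and OSError stands in for the scraper's own error classes.

```python
class LazyTranscription:
    """Hypothetical variant of JeffTranscription that defers the scraping."""

    BASE_URL = ('https://savoirs.rfi.fr/fr/apprendre-enseigner/'
                'langue-francaise/journal-en-francais-facile-')
    END_URL = '-20h00-gmt'

    def __init__(self, program_date, fetch):
        # No network access here: the constructor only stores state.
        self.program_date = program_date
        self.url = self.BASE_URL + program_date + self.END_URL
        self._fetch = fetch  # assumed callable: url -> HTML string
        self.content = None
        self.is_error = False
        self.error_message = None

    def scrape(self):
        """Fetches the page on demand and records any failure."""
        try:
            self.content = self._fetch(self.url)
        except OSError as e:  # stand-in for the Scraper* errors
            self.is_error = True
            self.error_message = str(e)
        return self


# A fake fetch function replaces the real HTTP request in this example.
transcription = LazyTranscription('23072018', fetch=lambda url: '<p>Bonjour</p>')
transcription.scrape()
print(transcription.url)
print(transcription.content)
```

With this shape, creating the object can never fail because of the network, and a test can pass in a fake fetch function instead of calling the real website.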







      python web-scraping api beautifulsoup flask














      edited Dec 25 at 20:40 by Jamal

      asked Oct 15 at 22:03 by 25Postcards


























