Wrapping a web scraper in a RESTful API


The problem I am trying to solve is wrapping a web scraper in a RESTful API so that it can be called programmatically from another application, frontend or microservice. The goal is for this code to form one part of a larger application built on a microservices architecture.

This program scrapes the Radio France Internationale Journal en français facile (RFI JEFF) website for the French transcriptions of its daily news podcast. The web scraper is built with Beautiful Soup, the API is built with Flask-RESTPlus, and the code is packaged in a Docker container.

The code operates as follows:

  1. The API is called with the desired program (broadcast) date, passed in the URL path

  2. The web scraper is then called to scrape the appropriate webpage for that given date

  3. The API responds with the transcription of the broadcast as a list of paragraphs, along with metadata such as the title, encoding, URL and status code

To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.

I will include a few sections of code for the web scraper and API only. The full repository can be found here.

You can clone the repo, build the container, and run it to test the program:

git clone https://github.com/25Postcards/rfi_jeff_api
sudo docker build . -t jeff_api:latest
sudo docker run -p 8000:8000 jeff_api
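
Once the container is running, the endpoint can be called from any HTTP client. Below is a minimal client sketch using only the standard library. The `/web/transcriptions/` path assumes the namespace is mounted at its default `/web` prefix, and `localhost:8000` assumes the port mapping from the `docker run` command above. The helper names `transcription_url` and `fetch_transcription` are hypothetical and not part of the repo.

```python
import json
from urllib.request import urlopen

# Assumed base URL: matches the -p 8000:8000 port mapping used above.
BASE_URL = 'http://localhost:8000'


def transcription_url(program_date, base_url=BASE_URL):
    """Builds the endpoint URL for a program date in DDMMYYYY format.

    The '/web' prefix is an assumption about how the namespace is mounted.
    """
    return '{}/web/transcriptions/{}'.format(base_url, program_date)


def fetch_transcription(program_date, base_url=BASE_URL, timeout=10):
    """Calls the API and returns the decoded JSON response as a dict.

    Requires the container to be running locally.
    """
    url = transcription_url(program_date, base_url)
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode('utf-8'))

# Example (needs the running container):
#     fetch_transcription('01012019').get('title')
```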


API

from flask_restplus import Namespace, Resource, fields

from core import jeff_scraper
from core import jeff_logger
from core import jeff_validators

api = Namespace('web', description='Operations on the RFI website.')

# This model definition is required so it can be registered to the API docs
transcriptions_model = api.model('Transcription', {
    'program_date': fields.String(required=True),
    'encoding': fields.String,
    'title': fields.String,
    'article': fields.List(fields.String),
    'url': fields.Url('trans_pd_ep'),
    'status_code': fields.String,
    'error_message': fields.String
})

@api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')
@api.param('program_date',
           'The program date for the broadcast. Accepted date format DDMMYYYY.')
class Transcriptions(Resource):
    """A Transcriptions resource."""

    def get(self, program_date):
        """Gets the transcription from the scraper.

        Args:
            program_date (str): A string representing the program date

        Returns:
            validate_date_errors (dict): A dict of errors raised by the
                validator for the transcriptions schema.
            data (dict): A dict of attributes from the jeff transcriptions
                object containing the program date, title, article, etc.
                (see schema).
        """
        # Create validator, validate input
        ts = jeff_validators.TranscriptionsSchema()
        validate_date_errors = ts.validate({'program_date': program_date})
        if validate_date_errors:
            return validate_date_errors

        # Create scraper, scrape page
        jt = jeff_scraper.JeffTranscription(program_date)

        # Serialise JeffTranscription object to serialised (formatted) dict
        # according to Transcriptions Schema
        data, errors = ts.dump(jt)
        return data
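
The `jeff_validators` module is not shown here. To give an idea of the check it performs on the DDMMYYYY date, below is a minimal sketch using only the standard library rather than the schema class used in the repo. The function name `validate_program_date` and the error messages are hypothetical.

```python
from datetime import datetime


def validate_program_date(program_date):
    """Returns a dict of errors for a DDMMYYYY string (empty if valid).

    Hypothetical stand-in for the schema validation, not the repo's code.
    """
    errors = {}
    # strptime alone would accept unpadded values such as '1012019',
    # so the length and digit checks come first.
    if len(program_date) != 8 or not program_date.isdigit():
        errors['program_date'] = ['Expected 8 digits in DDMMYYYY format.']
        return errors
    try:
        datetime.strptime(program_date, '%d%m%Y')
    except ValueError:
        errors['program_date'] = ['Not a valid calendar date.']
    return errors
```

The return value mirrors the shape used in the resource above: a dict keyed by field name that is empty when the input is valid, so `if validate_date_errors:` works the same way.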

Web scraper

import requests
import logging

from bs4 import BeautifulSoup

from core import jeff_errors
from core import jeff_logger

class JeffTranscription(object):
    """Represents a transcription from the rfi jeff website.

    Attributes:
        program_date (str): A string for the program date of the broadcast,
            accepted date format DDMMYYYY.
        title (str): A string for the title of the transcription.
        article (list(str)): A list of strings for each paragraph in the
            transcription article.
        encoding (str): A string defining the encoding of the transcription.
        is_error (bool): A boolean indicating if an error occurred.
        error_message (str): A string for error messages generated whilst
            requesting the webpage or whilst parsing the content.
        status_code (str): A string indicating the http status code for
            responses from the rfi jeff website.
        url (str): The URL for the rfi jeff webpage for the transcription.
    """

    def __init__(self, program_date):
        """Inits JeffTranscription with the program date."""
        self.program_date = program_date

        self.title = None
        self.article = None
        self.encoding = None

        self.is_error = False
        self.error_message = None
        self.status_code = None

        self._jeff_scraper_logger = jeff_logger.JeffLogger(
            'jeff_scraper_logger')

        try:
            # Build the URL, request the page, then parse its content
            self._makeURL()
            page_response = self._getPageResponse()
            page_content = page_response.content
            self._scrapePage(page_content)

        except (jeff_errors.ScraperConnectionError,
                jeff_errors.ScraperTimeoutError,
                jeff_errors.ScraperHTTPError,
                jeff_errors.ScraperParserError) as e:
            self._handleScraperErrors(e)

    def _handleScraperErrors(self, e):
        """Handles errors raised by the methods.

        Sets the is_error and error_message attributes.

        Args:
            e: An error object raised by the class methods
        """
        self.is_error = True
        self.error_message = e.message

    def _makeURL(self):
        """Makes the url for the RFI JEFF Website."""
        RFI_JEFF_BASE_URL = 'https://savoirs.rfi.fr/fr/apprendre-enseigner/'
        RFI_JEFF_END_URL = '-20h00-gmt'
        self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL

    def _getPageResponse(self):
        """Gets the response from the webpage.

        Returns:
            A requests.Response object.

        Raises:
            ScraperConnectionError: A connection error occurred.
            ScraperTimeoutError: A timeout error occurred.
            ScraperHTTPError: An HTTP Error occurred.
        """
        HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}
        try:
            page_response = requests.get(self.url, headers=HEADERS, timeout=5)
            self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')
            self.status_code = page_response.status_code
            self.encoding = page_response.encoding

        except requests.exceptions.ConnectionError as e:
            raise jeff_errors.ScraperConnectionError from e

        except requests.exceptions.Timeout as e:
            raise jeff_errors.ScraperTimeoutError from e

        except requests.exceptions.HTTPError as e:
            raise jeff_errors.ScraperHTTPError from e

        return page_response

    def _scrapePage(self, page_content):
        """Parses the html content from the webpage response.

        Uses the Beautiful Soup library to parse the webpage content to extract
        the title of the broadcast and the transcription. The transcription
        is a series of paragraph elements within an article element that has a
        class attribute defined by ARTICLE_CLASS.

        Example Webpage HTML
            Article element at line 518
            First paragraph element at line 532

        Args:
            page_content: The content of the webpage as HTML

        Raises:
            ScraperParserError: An error occurred parsing the page content.
        """
        ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"
        try:
            bs = BeautifulSoup(page_content, "html.parser")

            title_tag = bs.find('title')
            title_text = title_tag.get_text()

            # Find all the p elements within the article element that has the
            # ARTICLE_CLASS class attribute. Remove newline characters and
            # unwanted unicode characters from the p element's text fields.
            # Create a list of strings, one list element for each paragraph.
            article = bs.find('article', class_=ARTICLE_CLASS)
            article_p_tags = article.find_all('p')
            article_p_text = [
                p_tag.get_text().replace('\n', '').replace(u'\xa0', u'')
                for p_tag in article_p_tags]

            self._jeff_scraper_logger.logger.info('Page content parsed')

            self.title = title_text
            self.article = article_p_text

        except Exception as e:
            raise jeff_errors.ScraperParserError() from e
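
The `jeff_errors` module is not shown either. The scraper relies on each error exposing a `message` attribute and on being raisable without arguments (`raise ... from e`), so below is a minimal sketch of how such classes could be written. The base class `ScraperError`, the default messages and the toy `parse_title` function are assumptions for illustration; the repo's definitions may differ.

```python
class ScraperError(Exception):
    """Hypothetical base class; gives every scraper error a message."""

    message = 'An error occurred in the scraper.'

    def __init__(self, message=None):
        # Fall back to the class-level default when no message is passed,
        # so 'raise ScraperTimeoutError from e' works without arguments.
        if message is not None:
            self.message = message
        super().__init__(self.message)


class ScraperConnectionError(ScraperError):
    message = 'A connection error occurred.'


class ScraperTimeoutError(ScraperError):
    message = 'A timeout error occurred.'


class ScraperHTTPError(ScraperError):
    message = 'An HTTP error occurred.'


class ScraperParserError(ScraperError):
    message = 'An error occurred parsing the page content.'


def parse_title(page):
    """Toy parser showing exception chaining with 'raise ... from e'."""
    try:
        return page['title']
    except Exception as e:
        # The original exception stays available as __cause__.
        raise ScraperParserError() from e
```

With this shape, `_handleScraperErrors` can read `e.message` for any of the four errors, and the underlying `requests` or parsing exception is preserved on `e.__cause__` for logging.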


  • I've stuck to the Google Python style guide where possible.

  • The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?

  • The _getPageResponse method contains a try/except to handle HTTP request errors. Is it correct to use a try/except inside a function or method to catch exceptions raised within it?

  • The _getPageResponse method is called from within another try/except block in the __init__ method. Is it good practice to nest try/except blocks like this, particularly inside __init__? The same concern applies to the _scrapePage method.

  • Should I be using Flask-RESTPlus to generate Swagger API specs/docs automatically? Or should I use a separate extension to generate Swagger API specs/docs with Flask-RESTful (if so, which extension)?
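
To make the second question concrete, the alternative I have in mind would look roughly like the sketch below: the constructor only stores state and a separate method triggers the work. `LazyTranscription`, `scrape` and the injected `fetch_page` callable are hypothetical names, not code from the repo, and the parsing step is left out.

```python
class LazyTranscription(object):
    """Hypothetical variant where the constructor does no network I/O."""

    def __init__(self, program_date, fetch_page):
        # fetch_page is an injected callable (url -> HTML string), so the
        # class can be exercised without touching the network.
        self.program_date = program_date
        self.url = ('https://savoirs.rfi.fr/fr/apprendre-enseigner/'
                    + program_date + '-20h00-gmt')
        self._fetch_page = fetch_page
        self.page_content = None
        self.is_error = False
        self.error_message = None

    def scrape(self):
        """Fetches the page on demand; returns self so calls can be chained."""
        try:
            self.page_content = self._fetch_page(self.url)
        except Exception as e:  # the real code would catch the Scraper* errors
            self.is_error = True
            self.error_message = str(e)
        return self
```

The resource would then call something like `LazyTranscription(program_date, fetch).scrape()` in place of instantiating the scraper directly.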

