Node.js Crawler - How can I make it more scalable and maintainable?












Here is my attempt at a crawler made in Node.js with cheerio. I made it with the idea of using it in a future project I want to build. Here is the Git link: https://github.com/Just4lol/CookCrawler



If you look at index.js, you will see how to use it:



const cookCrawler = require('./cookCrawler.js')

cookCrawler.getRecipeData(recipeUrl).then(data => {
  console.log(data)
})


I think this part is OK (I'll still take your feedback :) ); my problem is with the structure behind it. For each website I want to parse data from, I need to create a new parser script. To reduce code duplication and add structure to the project, I created the RecipeParser class, which each of those parsers extends.



const cheerio = require('cheerio')
const requestP = require('request-promise') // promise-based HTTP client (assumed; use whatever the repo actually imports)
// whiteSpaceRemReg (used below) is defined elsewhere in the module

class RecipeParser {
  async loadHtml(url) {
    this.recipeUrl = url;

    try {
      const recipeHtml = await requestP(url);
      // Load the virtual DOM
      this.$ = cheerio.load(recipeHtml);

      return this;
    }
    catch (err) {
      console.log(err);
    }
  }

  async parseHtml(url) {
    try {
      await this.loadHtml(url)
      return this.parse()
    }
    catch (err) {
      console.log(err);
    }
  }

  getTitle(selector) {
    return this.whiteSpaceRemover(this.$(selector).text())
  }

  getRecipeInfo(selector) {
    throw new Error('You have to implement the method getRecipeInfo!');
  }

  getIngredients(selector) {
    throw new Error('You have to implement the method getIngredients!');
  }

  getSteps(selector) {
    throw new Error('You have to implement the method getSteps!');
  }

  getRecipeImgUrl(selector) {
    return this.$(selector).attr('href')
  }

  /**
   * Return the parsed recipe object
   */
  parse() {
    return {
      recipeUrl: this.recipeUrl,
      title: this.getTitle(),
      recipeInfo: this.getRecipeInfo(),
      ingredients: this.getIngredients(),
      steps: this.getSteps(),
      recipeImgUrl: this.getRecipeImgUrl()
    }
  }

  getTxtArrayFromElements(selector) {
    // Collect the text of every matched element
    const array = []
    this.$(selector).each((i, element) => {
      array.push(this.$(element).text())
    })

    return array
  }

  whiteSpaceRemover(string) {
    return string.replace(whiteSpaceRemReg, '')
  }
}

module.exports = RecipeParser
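To make the contract concrete, a site parser only has to override the three abstract getters. Here is a hypothetical subclass (the class name and selectors are invented for illustration, and the base class is reduced to a stub so the sketch runs on its own):

```javascript
// Stub of the RecipeParser base class above, trimmed to what this sketch needs.
class RecipeParser {
  getTxtArrayFromElements(selector) {
    // In the real class this collects the text of matched cheerio elements
    const array = []
    this.$(selector).each((i, element) => array.push(this.$(element).text()))
    return array
  }
  getRecipeInfo(selector) {
    throw new Error('You have to implement the method getRecipeInfo!');
  }
  getIngredients(selector) {
    throw new Error('You have to implement the method getIngredients!');
  }
  getSteps(selector) {
    throw new Error('You have to implement the method getSteps!');
  }
}

// Hypothetical site parser: the selectors are invented for illustration.
class ExampleSiteParser extends RecipeParser {
  getRecipeInfo() { return this.getTxtArrayFromElements('.recipe-meta li') }
  getIngredients() { return this.getTxtArrayFromElements('.ingredients li') }
  getSteps() { return this.getTxtArrayFromElements('.steps li') }
}
```

Each subclass stays small because the fetch/parse plumbing lives in the base class; only the site-specific selectors change.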


It does its job, but it is not very scalable if I want to add additional properties, and in general I'm not really happy with it; I'm sure there is a better way of doing it. I liked the idea of mixins but, I may be wrong, I don't think they would be helpful in my case, because each website is too unique and the logic for one cannot really apply to another. Here is the function from ricardoParse.js that extracts the ingredients from one of their recipes:



getIngredients() {
  // If the form has h3 elements, the page contains more than one recipe
  if (this.$('#formIngredients > h3').length) {
    let obj = {}
    // For each recipe title, link its array of ingredients to it
    this.$('#formIngredients > h3').each((i, element) => {
      const ingredients = []
      this.$(this.$('#formIngredients > ul')[i]).find('li').each((j, ulElement) => {
        ingredients.push(this.whiteSpaceRemover(this.$(ulElement).text()))
      })

      obj[this.$(element).text()] = ingredients
    })

    return obj
  }
  else return this.getTxtArrayFromElements('#formIngredients ul > li > label > span')
}
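Structurally, that branch just pairs the i-th heading with the i-th ingredient list. With the DOM access stripped away, the grouping step reduces to the following (a cheerio-free sketch with hypothetical names, only to show the output shape):

```javascript
// Pair each section title with the ingredient list at the same index,
// mirroring how the markup lines up h3[i] with ul[i].
function groupIngredients(titles, ingredientLists) {
  return titles.reduce((obj, title, i) => {
    obj[title] = ingredientLists[i];
    return obj;
  }, {});
}
```

Seeing it this way makes clear that the immediately-invoked function in the original adds nothing: building the array first and then assigning it to the key is equivalent and easier to read.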


The last thing is that the "factory" I'm using would be a real pain if I had more than 10 parsers in it:




class CookCrawler {
  static getRecipeData(url) {
    const domain = url.match(domainMatchReg).toString()
    switch (domain) {
      case 'https://www.ricardocuisine.com': {
        const ricardoParser = new RicardoParser()

        return ricardoParser.parseHtml(url)
      }
      case 'https://www.troisfoisparjour.com': {
        const troisfoisparjourParser = new TroisfoisparjourParser()

        return troisfoisparjourParser.parseHtml(url)
      }
      default:
        console.warn('No parser exists for this domain, or wrong URL.')
        break
    }
  }
}
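One common fix for a growing switch like this is a lookup table keyed by hostname: each parser gets one registry entry, and the factory itself never grows. A sketch of that idea, assuming the parser classes are already imported (the registry shape, the `getParserFor` name, and the use of the WHATWG `URL` class are my additions, not code from the repo; the empty classes below are stand-ins so the sketch runs on its own):

```javascript
// Stand-ins for the imported parser classes in the real project.
class RicardoParser {}
class TroisfoisparjourParser {}

// One entry per supported site; adding a parser is one new line here.
const parserRegistry = new Map([
  ['www.ricardocuisine.com', RicardoParser],
  ['www.troisfoisparjour.com', TroisfoisparjourParser],
]);

function getParserFor(url) {
  // URL.hostname is sturdier than a hand-rolled domain regex.
  const Parser = parserRegistry.get(new URL(url).hostname);
  if (!Parser) throw new Error(`No parser exists for ${url}`);
  return new Parser();
}
```

With this in place, `getRecipeData` collapses to `getParserFor(url).parseHtml(url)`, and unknown domains fail loudly instead of silently returning `undefined`.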


So, as the title says: how can I make it more scalable and maintainable, and have I done a good job? ;) I hope this is a "good question" and I look forward to any feedback and comments on my project.



J.R








