Parsing a site with Node.js

Sometimes a web resource, say a site, contains useful data that you want to grab for personal use, and you need a way to get it.

Recently I was in exactly that situation. I had to get data about cat breeds (breed name and photo) and load it into my PostgreSQL database. I didn't find any open dataset, but I did find a Wikipedia page listing the breeds and decided to parse them from there.

One thing to point out: I mainly work with .NET, but for parsing I prefer Node.js, so the code will be in JavaScript. Fortunately, for almost every language you can find libraries to download and parse HTML.

First of all, we should check whether the data source is parsable at all. The main criterion is whether the web page is server-side rendered (SSR) or loads its data dynamically.

Usually, pages at the top of search results are server-side rendered, because the SEO that puts them there is almost impossible without SSR.

You can google the difference between server-side and client-side rendering, but in simple terms: with client-side rendering, the HTML you download does not yet contain the data you need, because it is loaded dynamically later. In that case we need workarounds. Fortunately, such cases are rare.

In my case, this is the Wikipedia page that I want to parse:

We can see that all the data arrives with the first request, so this page is server-side rendered.

Then let’s decide what we would like to parse from it. Actually, for my startup I needed the name of the breed and a photo.

Now let's inspect where those items live in the HTML using the Google Chrome developer tools.

Click on the selected item and examine the HTML in the right-hand panel.

As you can see, we are dealing with a standard HTML table, so to parse the names all we need to do is read the necessary <td>'s of each <tr> we iterate over.

The image sits in one of the <td>'s as well. The most important thing here is to check whether we can download the photo from that URL.

Yes, by following that link I can get the picture (don't worry, the link in the picture differs because of Wikipedia redirects).

Create a project folder and add a main.js file to it.
Then open a terminal at the root of the project and run

npm init

Answer the prompts in the terminal however you like.

Then, we need axios for HTTP requests, node-html-parser to parse the downloaded HTML, fs.promises (built into Node.js) for async file-system calls, and @vitalets/google-translate-api to translate from Ukrainian to English and Russian (this last one is optional).

To install the packages run the following:

npm install axios node-html-parser @vitalets/google-translate-api --save

Write the following code:

Inside the code, we define the necessary constants and load the necessary modules. On top of that, we create an async main function and call it; this lets us use async/await constructs later and keeps the code cleaner.

Then we should download the HTML by URL and parse the necessary items from it.

How can we check how to access items inside the HTML?
Basically, there are a few ways to do it: XPath, or native JS functions like querySelectorAll.
I chose the second option because it is easier to implement and to validate.

So, how do we write the query selector for the name? It's easy: just test it in the Chrome browser console:

Once you've confirmed your query selector is valid, just copy-paste it into the Node.js code.

As you can see, we should skip the first <tr>, because it doesn't contain data; the data starts from the second one.

Also, we should not include the last one.

We define the query selector as a constant, and later on we add a helper method that checks whether the folder exists:

So the code to select all the rows:

Let’s find out how we can access breed names and photo URLs. Go back to the page and play with query selectors.

As we can see, inside the <tr> tag we should go to the inner <td> tag and then take the <a>'s innerText.

We can access the photo by this class:

But the photo is not as easy to get as it could be. We can see that the class name of the photo element is '.image-lazy-loaded'. I suspect that the picture is not included in the response at the time we download the HTML, but is lazy-loaded while the user browses the page.
How can we check that?

We can download the actual HTML with cURL or Postman and inspect it.

And sure enough, there is no element with the '.image-lazy-loaded' class. As a workaround, we can take the '.image' element, get its <noscript> with its first child, and parse the src attribute from it.

Add to the main:

As I said, for breedNameUa we take the inner text of the first td > a element.

We create a path for the photo and check whether we already have it (we may run this application multiple times, and there is no need to download a photo we already have; the locally saved photo names are unique).

Then we check whether we have our photoColumn, because there are rows without photos.

If a row does have a photo, we take imgNode from noScript and read the src attribute from it. Later on, we build a URL with that parsed src and download the photo from it.

Now let's create the download function with the Wikipedia User-Agent.

We use a Wikipedia User-Agent because after a number of requests Wikipedia starts responding with an error. I googled that error and found out that Wikipedia requires this header.

The last thing to do is to write our breed to a specific file and translate the name to English.

Along with a newline helper function (because the newline character is platform-specific):

Let's run our solution with the following command:

node main.js

The whole code of the solution:

And the results:

For sure, the translation is not as accurate as it could be, but that is not a parsing problem; it is a data-processing problem.

Please contact me on any of my social media if you would like to get the codebase.

Thank you for your attention, and please follow me on my social media:
