python |
New posts tagged Python - Qiita |
That open-source project: did you commit sensitive information before publishing it? |
https://qiita.com/TakutoYoshikai/items/aeb5f956b8f9c313a909
|
Using esight, a tool that searches for strings such as API tokens and private keys, I developed shiba, a tool that expands the git commit history and then searches it for random-looking strings with regular expressions. |
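As a rough illustration of the approach described above (not the actual shiba or esight implementation), a short Node.js sketch could dump the commit history with git log -p and flag token-like strings with regular expressions; the patterns below are assumed examples only:

// Illustration only: dump the repo history and grep it for secret-looking strings.
const { execSync } = require('child_process');

const history = execSync('git log -p --all', { maxBuffer: 256 * 1024 * 1024 }).toString();

// Assumed example patterns; real tools use far more precise rules.
const patterns = [
  /AKIA[0-9A-Z]{16}/g,                     // AWS access key id format
  /-----BEGIN [A-Z ]*PRIVATE KEY-----/g,   // private key headers
  /[A-Za-z0-9_-]{40,}/g                    // long random-looking strings
];

for (const re of patterns) {
  for (const hit of history.match(re) || []) {
    console.log('possible secret:', hit.slice(0, 60));
  }
}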
2022-01-10 15:12:53 |
Program |
List of new questions in [all tags]|teratail |
E-commerce site ranking |
https://teratail.com/questions/377453?rss=all
|
E-commerce site ranking. Background / what I want to achieve: I am a beginner building an e-commerce site in Rails for practice. |
2022-01-10 15:55:14 |
Program |
List of new questions in [all tags]|teratail |
Python: save video to a new file at fixed time intervals |
https://teratail.com/questions/377452?rss=all
|
When the current time reaches the time obtained with dt_now = datetime.datetime.now() plus a fixed interval, I want to close the current video file and open the next one. |
2022-01-10 15:50:12 |
Program |
List of new questions in [all tags]|teratail |
Cannot call another shell script from the shell script invoked by the Vagrantfile provisioner |
https://teratail.com/questions/377451?rss=all
|
Cannot call another shell script from the shell script invoked by the Vagrantfile provisioner. Any help would be appreciated. |
2022-01-10 15:44:39 |
Program |
List of new questions in [all tags]|teratail |
[PHP] In production, the page does not redirect to the link specified with header("Location: "); |
https://teratail.com/questions/377450?rss=all
|
In my local environment the post-login redirect worked using header("Location: "), but in production it does not, so I would appreciate advice on how to resolve this. |
2022-01-10 15:42:58 |
Program |
List of new questions in [all tags]|teratail |
[C] My Make10 program cannot solve "8115" (integer type) |
https://teratail.com/questions/377449?rss=all
|
[C] My Make10 program cannot solve "8115" (integer type). I made a Make10 program in C that checks whether 10 can be made from the entered digits. |
2022-01-10 15:33:58 |
Program |
List of new questions in [all tags]|teratail |
How to use str object methods: strip() does not work as expected. |
https://teratail.com/questions/377448?rss=all
|
What am I doing wrong? I believe the error says that strip cannot be found, but I do not understand why. |
2022-01-10 15:32:18 |
Program |
List of new questions in [all tags]|teratail |
Print multiple sets of a random 4-digit integer and that integer plus 1 |
https://teratail.com/questions/377447?rss=all
|
Print multiple sets of a random 4-digit integer and that integer plus 1. I want to print, as sets, randomly chosen 4-digit integers together with the 4-digit integers obtained by adding 1 to each of them. |
2022-01-10 15:25:00 |
Program |
List of new questions in [all tags]|teratail |
[GAS] I am trying to automate a status tally that has multiple conditions; there are changes I want to make to what I posted on 2021/12/12, but I do not know how to make them, so could someone please help? |
https://teratail.com/questions/377446?rss=all
|
|
2022-01-10 15:20:56 |
Program |
List of new questions in [all tags]|teratail |
Python: the if statement reacts one step late |
https://teratail.com/questions/377445?rss=all
|
Python: the if statement reacts one step late. This is a question about Python. |
2022-01-10 15:16:48 |
Program |
List of new questions in [all tags]|teratail |
Trying to install a custom theme in WordPress, but the images do not load. |
https://teratail.com/questions/377444?rss=all
|
Trying to install a custom theme in WordPress, but the images do not load. |
2022-01-10 15:02:46 |
Ruby |
New posts tagged Ruby - Qiita |
Rails: the difference between member and collection |
https://qiita.com/ktsgmr/items/36d7e1ec41463134fcbf
|
Rails: the difference between member and collection. Conclusion: member lets you run an action on a specific resource identified by its id, while collection lets you run an action on the whole set of resources. |
2022-01-10 15:45:47 |
Ruby |
New posts tagged Ruby - Qiita |
[Ruby] The slice method of the String class |
https://qiita.com/Shi-raCanth/items/44c8426128d8e652e879
|
|
2022-01-10 15:44:05 |
Ruby |
New posts tagged Ruby - Qiita |
[Rails] What does "Ajax-ifying" mean? |
https://qiita.com/mmaumtjgj/items/ef494de60a36f2e769b1
|
By specifying remote: true, the link or form is configured to send its HTTP request via Ajax. |
2022-01-10 15:10:26 |
AWS |
New posts tagged AWS - Qiita |
My mindset, learning results, and study methods after changing careers into the IT industry [AWS engineer, from no experience through the first year] |
https://qiita.com/onipoppo_eng/items/c683f16a3261f16876e1
|
For example, understanding functions and classes in programming but never actually trying to build anything means no new knowledge comes in, and jumping straight into studying the contents of middleware configuration files while time passes without understanding them means you lack basic IT terminology and knowledge, so I think it is better to lower the difficulty of what you study. |
2022-01-10 15:22:26 |
GCP |
New posts tagged gcp - Qiita |
Google Cloud updates (1/6-1/12/2022) |
https://qiita.com/kenzkenz/items/912b8743dad6c297dc1e
|
I try to pick up as much of the release content as I can, but if any release notes seem to be missing, please let me know and I will add them. The full list is kept up to date here. |
2022-01-10 15:11:37 |
GCP |
New posts tagged gcp - Qiita |
Google Cloud updates (12/30/2021-1/5/2022) |
https://qiita.com/kenzkenz/items/4d0f606df91392811ec5
|
Google Cloud updates: Security Command Center (Dec): Security Health Analytics added a detector, now GA, that checks whether your Dataproc image version is affected by the Log4j vulnerability. Vertex AI (Jan): training with PyTorch can now be done in prebuilt containers. SAP (Jan): a new version of the Storage Manager for HANA standby nodes was released, improving disk handling during failover. |
2022-01-10 15:11:34 |
Git |
New posts tagged Git - Qiita |
That open-source project: did you commit sensitive information before publishing it? |
https://qiita.com/TakutoYoshikai/items/aeb5f956b8f9c313a909
|
Using esight, a tool that searches for strings such as API tokens and private keys, I developed shiba, a tool that expands the git commit history and then searches it for random-looking strings with regular expressions. |
2022-01-10 15:12:53 |
Ruby |
New posts tagged Rails - Qiita |
Rails: the difference between member and collection |
https://qiita.com/ktsgmr/items/36d7e1ec41463134fcbf
|
Rails: the difference between member and collection. Conclusion: member lets you run an action on a specific resource identified by its id, while collection lets you run an action on the whole set of resources. |
2022-01-10 15:45:47 |
Ruby |
New posts tagged Rails - Qiita |
[Rails] What does "Ajax-ifying" mean? |
https://qiita.com/mmaumtjgj/items/ef494de60a36f2e769b1
|
By specifying remote: true, the link or form is configured to send its HTTP request via Ajax. |
2022-01-10 15:10:26 |
Overseas TECH |
DEV Community |
Web Scraping with NodeJS: a comprehensive guide [part-3] |
https://dev.to/aviyel/web-scraping-with-nodejs-a-comprehensive-guide-part-3-1b6n
|
Web Scraping with NodeJS a comprehensive guide part Shorthand technique to extract the selectors Let s look at a simpler and faster way to gather selectors to scrape data from a website This method is useful when you need to get something done quickly without having to worry too much about it The only disadvantage is that it can be more vulnerable to errors So let s get this party started Previously when we wanted to scrape something from the IMDB scraper we went to the website used the inspector to look at the HTML structure and then built a specific selector for that type of data Let s look at an example If we wanted to extract the title we used a section with a class of ipc page section and then div with a class of TitleBlock followed by other tags elements and then we needed to get the text but this is a very long method section ipc page section gt div gt div gt hSo let s see what we can scrape for this example let s say we wanted to get the movie plot so what we can do is right click inspect as we did previously and what we can see right here is that the element that we want to scrape is a span containing some text After that what we want to do is scrape everything and this is the simplest way to do it without overcomplicating things so right click on the element that we want to scrape click on copy and then copy selector by right clicking on the selector and copying it This implies that the browser will construct the selector for you and will apply its best reasoning to provide you with the results you desire Let s try it out in our code editor as well so go to the code and create a new variable called moviePlot and then do it the same way we did before copying everything and pasting it right here We want to get the text and also trim it so now that we have the variable let s set a breakpoint right at it or simply console log it to see what the moviePlot variable is spitting out let moviePlot next gt main gt div ipc page content container ipc page content container full BaseLayout NextPageContentContainer sc qjf fWxmdE gt section ipc page background ipc page background base TitlePage StyledPageBackground wzlr dDUGgO gt section gt div nth child gt section gt section gt div Hero MediaContentContainer Video kvkd kmTkgc gt div Hero ContentContainer kvkd eaUohq gt div Hero MetaContainer Video kvkd kNqsIK gt div GenresAndPlot ContentParent cump bFvaWW Hero GenresAndPlotContainer kvkd twqaW gt p gt span GenresAndPlot TextContainerBreakpointL cump gwuUFD text trim console log moviePlot So as you can see all we had to do was copy and paste the selector generated by the Chrome browser Of course this is a lot easier than writing the selectors ourselves and before we go any further let s talk about the disadvantages of using this method versus the others we ve discussed previously There is no actual problem with using this method no one is stopping you but there is one thing to consider As you can see this selector is much larger than the others we constructed by hand This is because the browser works its magic and tries to offer you the selector for the element it feels is the finest However there are far more selectors than are required The disadvantage of having a lot of elements in your query is that the chances of it failing in the future are much higher As you can see from what we wrote before we used about one two or even three or four elements to scrape something but in this case we are using five six and seven elements to get that exact text If one of those elements from the website 
changes the query stops working So depending on the website and the circumstance it s preferable to have two or three elements in our query selector than seven or even more Always remember the simpler the better So here s the question you might want to ask yourself when would I choose one more than the other So if you need to get something done quickly for any reason or you know that a particular website doesn t change very often you may use this copy and paste selection method The beautiful thing about this method is that you don t even need to know how to write CSS selectors to build a scraper and this method may seem way easier and faster Scraping and extracting the images locallyIn this segment we ll learn how to more precisely download anything from a website and save it as a file We ll go over how to get the image posters for all the movies that you wish to scrape First a brief reminder that we ll be using the request library rather than the request promise library we ve been using up to this point So before we go any further let s implement it right at the top in the import area of the code and just import the library index jsconst requestPromise require request promise const cheerio require cheerio const fs require fs const Parser require jsoncsv const request require request Also don t forget to update the variable name from request to requestPromise The reason we re using the request library rather than the request promise library is that the request library is the original library dedicated to straightforward processing of the types of requests we make in Node The request promise library that we previously used is just a wrapper around the original request library and it allows you to use the ES syntax with async await instead of going into callback hell with the original library This is the main reason we used request promise rather than the request library itself for our scraper Now before we get started on actually developing the scrapper for obtaining and storing images posters let s perform some fast preparation So instead of an array of just strings transform it into an array of objects So instead of just a URL as a string make it an object and pass it URL prop and then close it and then URL prop again for the next one and close it const URLS url url Okay to make this work with our current code we ll simply alter the url because instead of passing in a string we ll just pass in the object and access the url of this object which will be the url that we defined Also the last section must be commented out because it is unnecessary and not required as of now Let s start actually writing some code so let s say you want to download something from a website say an image a word document music or a pdf you will basically need to create a file stream within node js so that it can put data into it and build the actual file so let s start and create a variable let s say it imageExtractFile which equals to file system and attach createWriteStream function and we only need to pass in a parameter related to the actual path of the file that we want to create so keep in mind that we are in a for loop that iterates over each of the URLs that we are using so we must use a dynamic file name we can t just use some name jpg because it will be overwritten in the second call and we only have one file let imageExtractFile fs createWriteStream some name jpg Before we go any further we need to send in an id or anything for the actual file that we re going to utilize so let s go to the top and add id 
and simply pass in the movie name of this movie or something unique name like the godfather and the godfather Now we can use these ids to create the actual file with these names so let s go back and do that right here We want to create a file with the name of the movie dot id and then we re going to put it as a jpeg We need to specify it because we don t know what type of file it is when we re actually downloading it and this will make it easier because we can find out what type of file it is from either the URL or from the request response but for now we ll presume it ll be a jpeg because we already know movie posters are jpeg and we ll just pass in dot jpg Now that we have this we can test it out However before you do that your code should look like this let imageExtractFile fs createWriteStream movieUrl id jpg and we should expect to see two files created on the system with the ids of the “the godfather jpg and “the godfather jpg so let s do this quickly to make sure everything works and then let s return to the files where we find the godfather jpg and the godfather jpeg which is an empty file because we didn t pass any data into it Now we need to take care of the request that goes to the server for the actual movie poster and then stream the data into our newly created files so let s look at a quick example of how to stream data into a file You need to make the same request as before pass in the URL and then use the pipe function to pass in the data to the newly created file that we previously did also with “createWriteStream This is very simple so let s get started First let s create a new variable called streamImage and then we ll use the request library to do the same thing we did previously in the URI passing in the movieUrl imagePoster which holds the image s actual link and then for the headers just copy paste what we did above copy all of the Copy all of the headers and paste them below We only need to remove the HOST part of the code because the host isn t the IMDB main site URL instead the posters are using a S bucket from Amazon which will cause some issues so we ll just remove it Also don t forget to add the gzip All right right now we just want to pipe it out so pipe it and then specify the actual imageExtractFile Finally here s what your code should look like now index jsconst requestPromise require request promise const cheerio require cheerio const fs require fs const Parser require jsoncsv const request require request const URLS id the godfather url id the godfather url async gt let moviesParseData for let movieUrl of URLS const response await requestPromise uri movieUrl url headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true let cheerio load response console log response let title section ipc page section gt div gt div gt h text trim let rating div ipc button text gt div gt div nth child gt div gt span text slice let imagePoster div ipc media gt img ipc image attr src let ratingAmount div ipc button text gt div last child gt div last child gt div last child 
text slice let releaseYear li ipc inline list item gt a text slice let moviesGenres let movieGenresData li data testid storyline genres a href search title genres let moviePlot next gt main gt div ipc page content container ipc page content container full BaseLayout NextPageContentContainer sc qjf fWxmdE gt section ipc page background ipc page background base TitlePage StyledPageBackground wzlr dDUGgO gt section gt div nth child gt section gt section gt div Hero MediaContentContainer Video kvkd kmTkgc gt div Hero ContentContainer kvkd eaUohq gt div Hero MetaContainer Video kvkd kNqsIK gt div GenresAndPlot ContentParent cump bFvaWW Hero GenresAndPlotContainer kvkd twqaW gt p gt span GenresAndPlot TextContainerBreakpointL cump gwuUFD text trim movieGenresData each i elm gt let genre elm text moviesGenres push genre moviesParseData push title rating imagePoster ratingAmount releaseYear moviesGenres console log Title title Movie rating rating Movie genres moviesGenres Total rating amount ratingAmount Release year releaseYear Image source imagePoster console log moviePlot let imageExtractFile fs createWriteStream movieUrl id jpg let streamImage request url movieUrl imagePoster headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true pipe imageExtractFile const csvDataKeys title rating imagePoster ratingAmount const jsoncsvParser new Parser fields csvDataKeys const csv jsoncsvParser parse moviesParseData fs writeFileSync movies csv csv utf console log csv We actually have the code written and ready to go but first comment out the second movie because we only want to run it for the first movie The reason for this is because we re in a for loop and what s happening is that it s making the first request for the first movie waiting for it to finish because we use the await syntax and then it s going to make a request for the imagePoster is going to get saved to the file but it will not wait for the entire process to complete before continuing instead it will immediately return to the top and make the second request after which it will return to the bottom and which will collide Hence run it for a single movie and then deal with the issue later Let s put it to the test and see how it goes so fire up the debugger Now that we have the actual image downloaded it appears that the godfather jpg is the actual image poster that we intended to extract It completes the task as planned Promisify callback functions Let s keep going with the image downloading part that we left off and see if we can address the problem we re having so the main problem was that the downloading process starts but it doesn t wait for it to finish before continuing So before we go any further let s deep dive into “what is promisifying so promisifying is the process of converting a non promise function based on callbacks into a promise returning function So let s start fixing this issue by default in node js you can create your promise here is a quick example let promisifyStuff await new Promise resolve reject gt let alien true if 
alien resolve true else reject false Let s imagine we re waiting for a new promise and this function only has two parameters resolve and reject Let s say we have a variable called alien and we re going to build a quick if statement that says if an alien is true resolve with true and if not reject with false Let s put everything into a variable and place a breakpoint at the end so we can observe what value that variable is spitting out Let s run it quickly and test the value We have true since we checked to see if the alien is true and if it is we resolve the promise using the function true passing a string inside resolve and then running it again this promisifyStuff variable will be the exact string Let s see if we can quickly test it for the false as well and see what we get when it rejects Let s add a random string message in there and run it again and we now receive an unhandled promise rejection error That s because we didn t catch the actual error The reject throws an error which we need to capture by wrapping it in a try catch then catching the error and console log the error and if we test again the false message should be console logged try let promisifyStuff await new Promise resolve reject gt let alien false if alien resolve true else reject false catch error console log error This was only a quick introduction to javascript promises and it is highly advised that you investigate them further Let s return to our objective and start implementing this into what we need to accomplish Basically we need to wrap around this new promise thing into our stream request and let s get started right away We ll just await a new Promise with resolve reject param and put it at the top of the streamImage and then we ll end wrap the promise Now we need to figure out when the actual stream is completed We can figure this out by adding an on listener to an event The event that we need to listen to is finish since after the request library is finished with the actual request it will throw a finished event which we need to grab onto Now we ll just open up a function and declare a console We ll log some message or whatever custom message you want and then we ll specify the resolve which we ll say is true or you can leave them empty because we don t have to catch the message and we don t have to utilize it so leaving it empty is fine This is what your code should look like index jsconst requestPromise require request promise const cheerio require cheerio const fs require fs const Parser require jsoncsv const request require request const URLS id the godfather url id the godfather url async gt let moviesParseData for let movieUrl of URLS const response await requestPromise uri movieUrl url headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true let cheerio load response console log response let title section ipc page section gt div gt div gt h text trim let rating div ipc button text gt div gt div nth child gt div gt span text slice let imagePoster div ipc media gt img ipc image attr src let 
ratingAmount div ipc button text gt div last child gt div last child gt div last child text slice let releaseYear li ipc inline list item gt a text slice let moviesGenres let movieGenresData li data testid storyline genres a href search title genres let moviePlot next gt main gt div ipc page content container ipc page content container full BaseLayout NextPageContentContainer sc qjf fWxmdE gt section ipc page background ipc page background base TitlePage StyledPageBackground wzlr dDUGgO gt section gt div nth child gt section gt section gt div Hero MediaContentContainer Video kvkd kmTkgc gt div Hero ContentContainer kvkd eaUohq gt div Hero MetaContainer Video kvkd kNqsIK gt div GenresAndPlot ContentParent cump bFvaWW Hero GenresAndPlotContainer kvkd twqaW gt p gt span GenresAndPlot TextContainerBreakpointL cump gwuUFD text trim movieGenresData each i elm gt let genre elm text moviesGenres push genre moviesParseData push title rating imagePoster ratingAmount releaseYear moviesGenres console log Title title Movie rating rating Movie genres moviesGenres Total rating amount ratingAmount Release year releaseYear Image source imagePoster console log moviePlot let imageExtractFile fs createWriteStream movieUrl id jpg await new Promise resolve reject gt let streamImage request url imagePoster headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true pipe imageExtractFile on finish gt console log Movie Poster Image downloaded resolve const csvDataKeys title rating imagePoster ratingAmount const jsoncsvParser new Parser fields csvDataKeys const csv jsoncsvParser parse moviesParseData fs writeFileSync movies csv csv utf console log csv If we execute this the scraper will go to the first movie and ask for the details parse them and then go to the streamImage and start downloading and waiting for it to finish before repeating the process for the second movie Let s run through the procedure quickly and see what happens We should see a message that says Movie Poster Image downloaded and then another message that says the same thing Finally we re almost done but we need to handle the fact that it can fail and throw an error so let s create another on listener and just listen for an error and then we ll have a function that takes an error parameter and then we ll reject the promise with the actual error that we get and now that we ve rejected the promise with the actual error that we get we just need to catch it so simply specify the catch method then we ll receive the error and then we ll console log the error to record and display the error message itself await new Promise resolve reject gt let streamImage request url imagePoster headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode 
navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true pipe imageExtractFile on finish gt console log Movie Poster Image downloaded resolve on error err gt console log err reject error catch err gt console log err Finally we can identify which specific movie an error occurred in by simply replacing the single quotes with backticks allowing us to use the javascript syntax inside it to designate the individual “movie id of the error so that future debugging becomes really very simple So this is how the final code should look like index jsconst requestPromise require request promise const cheerio require cheerio const fs require fs const Parser require jsoncsv const request require request const URLS id the godfather url id the godfather url async gt let moviesParseData for let movieUrl of URLS const response await requestPromise uri movieUrl url headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true let cheerio load response console log response let title section ipc page section gt div gt div gt h text trim let rating div ipc button text gt div gt div nth child gt div gt span text slice let imagePoster div ipc media gt img ipc image attr src let ratingAmount div ipc button text gt div last child gt div last child gt div last child text slice let releaseYear li ipc inline list item gt a text slice let moviesGenres let movieGenresData li data testid storyline genres a href search title genres let moviePlot next gt main gt div ipc page content container ipc page content container full BaseLayout NextPageContentContainer sc qjf fWxmdE gt section ipc page background ipc page background base TitlePage StyledPageBackground wzlr dDUGgO gt section gt div nth child gt section gt section gt div Hero MediaContentContainer Video kvkd kmTkgc gt div Hero ContentContainer kvkd eaUohq gt div Hero MetaContainer Video kvkd kNqsIK gt div GenresAndPlot ContentParent cump bFvaWW Hero GenresAndPlotContainer kvkd twqaW gt p gt span GenresAndPlot TextContainerBreakpointL cump gwuUFD text trim movieGenresData each i elm gt let genre elm text moviesGenres push genre moviesParseData push title rating imagePoster ratingAmount releaseYear moviesGenres console log Title title Movie rating rating Movie genres moviesGenres Total rating amount ratingAmount Release year releaseYear Image source imagePoster console log moviePlot let imageExtractFile fs createWriteStream movieUrl id jpg await new Promise resolve reject gt let streamImage request url imagePoster headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc 
upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true pipe imageExtractFile on finish gt console log Movie Poster Image of movieUrl id is downloaded resolve on error err gt console log err reject error catch err gt console log movieUrl id image download error err const csvDataKeys title rating imagePoster ratingAmount const jsoncsvParser new Parser fields csvDataKeys const csv jsoncsvParser parse moviesParseData fs writeFileSync movies csv csv utf console log csv Let s do a final check to see if we have a great and curated console error message So far everything is working fine and we ve learnt a lot and gone to the depths of scraping from the ground up The complete source code is available here Conclusion In this article we learned about scraping from the ground up including Nodejs and the fundamentals of javascript why and when to scrape a website the most common problems with scraping different scraping methods such as to request method and browser automation method and finally how to scrape the data from the IMDB website in extensive detail as well as how to export scraped data into CSV and JSON files If you wish to pursue a career in web scraping this article may be very useful Follow aviyelHQ or sign up on Aviyel for early access if you are a project maintainer contributor or just an Open Source enthusiast Join Aviyel s Discord gt Aviyel s worldTwitter gt |
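Since the code listings in the entry above were flattened during extraction, here is a condensed sketch of the promisified poster-download step the article describes, using the same request and fs libraries; the request headers from the article are omitted, and imagePoster and movieId are placeholders for values scraped earlier, so treat this as an approximation rather than the article's exact code:

const fs = require('fs');
const request = require('request'); // the (now deprecated) request library used in the article

async function downloadPoster(imagePoster, movieId) {
  // Create a write stream for the poster file, named after the movie id.
  const imageExtractFile = fs.createWriteStream(`${movieId}.jpg`);
  // Wrap the stream in a Promise so the calling loop can await each download.
  await new Promise((resolve, reject) => {
    request({ url: imagePoster, gzip: true })
      .pipe(imageExtractFile)
      .on('finish', () => {
        console.log(`Movie poster image of ${movieId} is downloaded`);
        resolve();
      })
      .on('error', err => reject(err));
  }).catch(err => console.log(`${movieId} image download error:`, err));
}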
2022-01-10 06:55:49 |
Overseas TECH |
DEV Community |
Web Scraping with NodeJS: a comprehensive guide [part-2] |
https://dev.to/aviyel/web-scraping-with-nodejs-a-comprehensive-guide-part-2-43lk
|
Web Scraping with NodeJS a comprehensive guide part Dealing with websites that have compression encoded in them let s take a closer look at what happened and why it didn t work This is a frequent mistake that rookie scrapers make when playing with requests it happens to everyone You may have also noticed the gibberish encoded response this is due to the IMDB s usage of the gzip compression algorithm This is done to reduce the size of the response data so that it may be returned faster and save a lot of bandwidth at the same time Let s get started with this We can address this by adding an extra parameter to the request function called gzip and setting it to true So this will inform the request library to decompress this actual request from IMDB using gzip compression Let s fast test it out and see if it works so run your scrapper again As you can see we already have a readable lt DOCTYPE gt HTML response as well as everything we had before and the title and ratings work perfectly Let s go back to the dev tools which we already have open and search for the content encoding on the response headers from the website which you can see here it states content encoding with the value of gzip indicating that this website indeed uses gzip compression algorithm An in depth look at jQuery selectors and data parsingLet s learn a little more about jquery selectors and where to go for more information about them It s quite normal and acceptable to not remember all of them Jquery selectors This is why you should always have the documentation on hand Let s go to the browser and search for jquery selectors As you can see there are a lot of results Let s go to the w school link It s not the official one but it s a very popular website that has all of the selectors that we need This is what we ll use to search for selectors that we need to scrape some parts of website Now let s get started scraping Go to IMDB movie page and open the dev tools Now that we have the movie open our editor running and our dev tools open we can start scraping There may be some errors but we ll fix them as we go By using this method you ll be able to see the entire thought process behind it so let s get started First we have the title and the movie rating Let s assume we want to retrieve the movie poster To do so go to the movie poster and check elements looking for anything unique about it such as an id or a class name that isn t repeated or just discover the selector So the first thing that pops up is the div with the class name of ipc poster which we can easily get and then access the element and then the image within it First let s clear everything in the console and because this website already has jquery implemented we can check the jquery code that we re going to write directly in the console If the website doesn t have jquery implemented you can inject it and also check it in the console In the later section of this article there is a demo showing how to inject the jquery library inside the browser itself but just so you know if you don t see the console tab with the console you can easily click on escape to toggle between them on and off Let s write the code for this image selector First we ll select the div with the class name ipc poster or ipc media and then we ll select the child within it which is an IMG tag which is the image We can now expand this element and see that we have an image so as you can see on the image below it appears that this selector is working perfectly fine Now that we have a proper and working 
selector we need to choose the source of the image to get the actual link of the image so for that we can use the attribute function and pass into the parenthesis the exact attribute name that we want to select in our case it will be source “src let s try it out so type attribute and then src enter and it should work Finally let s put this code to the test in our editor by copying the code from the browser console and pasting it into a new variable called imagePoster inside our code editor This should work the same so let s console log the imagePoster and remove the breakpoint because we don t need it any longer you can keep it if you want to check it So the code in index js should look like this index jsconst request require request promise const cheerio require cheerio const URL async gt const response await request uri URL headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true let cheerio load response console log response let title section ipc page section gt div gt div gt h text let rating div ipc button text gt div gt div nth child gt div gt span text slice let imagePoster div ipc media gt img ipc image attr src console log title movie has a IMDB rating of rating and the image is imagePoster Now go to the debug and start the debuggerAs you can see the title The Godfather has a rating and we also have a link to the image source Before we go any further if you have a white space issue in the movie title you can easily remove it by using the trim after the text method which will also eliminate the white space between the text let title section ipc page section gt div gt div gt h text trim Let s keep going and proceed to the next step Okay so now that we have the title the poster and the rating let s see how much amounts of rating this movie has Let s right click on the rating amount and inspect the element first What we can see is that there is a div with another tag of div with the class AggregateRatingButton TotalRatingAmount but nothing useful We need to go a little higher because we only have the ipc button text class at the moment So we can finally check this and grab onto it acquire the link and then obtain the text within it As a result let s clear the console and start writing code for it So first get the div with the class ipc button text then get the last child which is another div tag and finally the last div child so we can finally obtain the text and click enter and check if it works and as you can see from the image above it works as you can see it displays the precise rating number Now let s quickly put this into action so open the editor and create a new variable called ratingAmount then paste the following code into it let ratingAmount div ipc button text gt div last child gt div last child gt div last child text slice We can immediately test it by logging the ratingAmount in the console and hit play to the debugger As you can see it s functioning perfectly so let s continue and try to get the movie s release year as well To do so repeat the previous method and 
inspect the element once more to see what we have So there s a list tag with the class ipc inline list item and inside it we ll select a tag and quickly test it in our console to see whether it s functioning Finally it works and we can easily add it to our code editor and use it in our code Let s copy this and check that it s the text that we want but in your case you might see some white space that can be easily removed using the trim function and we ll also use slice to chuck the exact year only that we need So make a variable called releaseYear paste the code in it and slice it up let releaseYear li ipc inline list item gt a text slice Let s put it to the test right away It works perfectly and we have the title the rating the poster the number of ratings and the year of release Now that this is starting to look pretty good let s attempt something a little more difficult let s try to get all of the movie s genres in an array rather than as text we want to have an array with crime and drama Again inspect the element and check it out as you can see in the image below As you can see all of the links to certain genres such as crime have a genre link in common so this prefix search title genres is present right here in the drama and we can latch on to it and choose all of the links that have the prefix of the href tag search title genres Let s try it out and see if it works if it does that s great if it doesn t and we re receiving more links than we need then it is because the IMDB page can have numerous links with these types of prefixes So first let s empty off the terminal tab and start writing the code We want to get all the li tags with data testid storyline genres that have a tag with href equal to search title genres but we don t want them to start with the prefix genres only but with search title genres So let s head over to the jquery selectors documentation and look for the starting keyword As you can see we have something right here all elements with the title attribute value beginning with tom All we have to do now is add a sign in front of our equal sign and it will get all the elements that start with the search title genres prefix Let s add it and then close it and get all the text just for testing purposes As you can see it s not displaying the results properly To fix this we need to narrow down the search and try to select only the links within a specific parameter more specifically within these li links We can easily add this selector so that we can refine our search to get only the links within this parameter You might be wondering how we do this First let s clear the console Now right in front of our selector we need to add the li selector for the data testid of storyline genres and this says that we want to select all the href elements that start with search title genres that are contained inside this list and let s test it out Copy and paste the following code into the console li data testid storyline genres a href search title genres textAs you can see the selector works perfectly and crime drama is perfect for the time being But how do we put this into an array instead of using the text function We ll use each function which will iterate through all of the selectors that it finds and by the way you can check the documentation for cheerio on GitHub Let s try it out but this time instead of text we ll use each function which has two parameters one for the index and the other for the actual element we ll utilize close it and now that we ve done that we ll make an array 
called movieGenres and start it as an empty array and then we ll access the specific element and get their text and insert it into the movieGenres array We can do this easily by making it like genre equal access the element and then the text inside it and then we ll just push it to the array movieGenres and then let s console log movieGenres and yeah it works perfectly we have an array of crime and drama This is what your code should look like let moviesGenres li data testid storyline genres a href search title genres each function i elm let genre elm text moviesGenres push genre console log moviesGenres If your code throws an error or doesn t work you ll need to load jquery into the browser To do so open the console and paste the following code var script document createElement script script src document getElementsByTagName head appendChild script Again test the previous code in the console you should see something similar to this Finally let s add the code to our scraper and see if it works Also copy this code and paste it into the code editor then console log movieGenres and run it again As you can see it works well and we have an array of crime drama parsed precisely like in our chrome console This is what your index js source code should look like index jsconst request require request promise const cheerio require cheerio const URL async gt const response await request uri URL headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true let cheerio load response let title section ipc page section gt div gt div gt h text trim let rating div ipc button text gt div gt div nth child gt div gt span text slice let imagePoster div ipc media gt img ipc image attr src let ratingAmount div ipc button text gt div last child gt div last child gt div last child text slice let releaseYear li ipc inline list item gt a text slice let moviesGenres let movieGenresData li data testid storyline genres a href search title genres movieGenresData each i elm gt let genre elm text moviesGenres push genre console log Title title Movie rating rating Movie genres moviesGenres Total rating amount ratingAmount Release year releaseYear Image source imagePoster Scraping multiple movies and saving JSON to a fileIn this section we ll rapidly learn a new way for scraping several movies from an array of given URLs and saving the data we collect as JSON in a file so let s get started The first thing you ll want to do is turn the constant into an array So now we have multiple URLs let s open another movie link on the IMDB website copy the URL convert it to an array and input another movie and that s all there is to it const URLS So right now we have two movies that we want to scrape for information To do so simply go inside the defined asynchronous function and follow paste the code as shown below index jsconst request require request promise const cheerio require cheerio const URLS async gt for let url of URLS const response await request uri url headers accept text html application xhtml xml application xml q image avif 
image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true let cheerio load response let title section ipc page section gt div gt div gt h text trim let rating div ipc button text gt div gt div nth child gt div gt span text slice let imagePoster div ipc media gt img ipc image attr src let ratingAmount div ipc button text gt div last child gt div last child gt div last child text slice let releaseYear li ipc inline list item gt a text slice let moviesGenres let movieGenresData li data testid storyline genres a href search title genres movieGenresData each i elm gt let genre elm text moviesGenres push genre console log Title title Movie rating rating Movie genres moviesGenres Total rating amount ratingAmount Release year releaseYear Image source imagePoster This loop will iterate over each of the URLs that we created previously Now that the movie s URL is going to be URLs instead of URL we need to alter the URI and assign URLs and of course this should work We should give it a try right now and see what we have As you can see the first request is made and it outputs The Godfather the rating and everything else we ve got and then it waits for the rest of the request to finish and then it makes the second request to The Godfather Part movie and it outputs it to the console so everything is working as it should Let s prepare the data and save it as JSON into another variable Let s start by defining moviesParseData as an empty array then push in that array exactly before the console log so that we have an array with all of the scraped movie results We re using moviesParseData to add another entry to this array and we want it to be an object So let s push the title the rating the poster the number of ratings the year of release and of course the genres As a side note this is a shorthand strategy but for the time being let s keep it simple because it looks cleaner let moviesParseData moviesParseData push title rating imagePoster ratingAmount releaseYear moviesGenres Okay now let s test what we did We should anticipate this array to be populated with all of the details of both movies Let s comment out all of the console log and then console log moviesParseData to access all of the movies We need to put a breakpoint in that console log so that the debugger stops right at that console log so now let s test it out and it does indeed stops right at the breakpoint line all right then let s take a look at the console so we have an array hence expand the array and then we have the length of two objects exactly what we expected as you can see the title The Godfather and The Godfather part and its rating poster and everything else Right now we have two movies that we scraped and we ll continue by saving this data to a file First let s close the debugger and to save files to the system we ll use a native library from node js called fs file system Currently what we want to do is import the library which we can easily do without having to install anything because it s already implemented inside the node js itself const fs require fs Now that we ve got the library and the movie s data 
in an array we can create the method to save it to a file with ease We re going to use the writeFileSync fs method Three parameters are required by this function The first is the actual path to the file that you want to write the second is the content and the third is the type of information that you want to save the actual encoding Let s save it as movies json and the content inside it to be the moviesParseData in actual string form so we ll use JSON stringify and pass in the moviesParseData array This will transform the array into an actual JSON string and finally we are going to save it in utf encoding format fs writeFileSync movies json JSON stringify moviesParseData utf Before we start testing this let s look at why we used the writeFileSync function We re using this function with the sync after it because it tells node js that we want to wait for this line to finish before moving on to the next one which is similar to writing await in front of it but it doesn t work like this because the fs writeFile function doesn t return a promise so await and async only work with promises The next step is to start run the debugger and we should expect it to produce a movies json file in our directory So check it out and everything is just as we expected it to be We have the movies json file and all of the JSON material is placed in this file This is what your final code should look like index jsconst request require request promise const cheerio require cheerio const fs require fs const URLS async gt let moviesParseData for let url of URLS const response await request uri url headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true let cheerio load response console log response let title section ipc page section gt div gt div gt h text trim let rating div ipc button text gt div gt div nth child gt div gt span text slice let imagePoster div ipc media gt img ipc image attr src let ratingAmount div ipc button text gt div last child gt div last child gt div last child text slice let releaseYear li ipc inline list item gt a text slice let moviesGenres let movieGenresData li data testid storyline genres a href search title genres movieGenresData each i elm gt let genre elm text moviesGenres push genre moviesParseData push title rating imagePoster ratingAmount releaseYear moviesGenres console log Title title Movie rating rating Movie genres moviesGenres Total rating amount ratingAmount Release year releaseYear Image source imagePoster fs writeFileSync movies json JSON stringify moviesParseData utf console log moviesParseData Exporting scraped data to CSV file Let s first learn how to export the scraped data to a CSV file if you re working with them and need to use them in any way A CSV file is a comma separated value file in which the first line represents the keys and the subsequent lines represent the rows with the values for those specific keys with each value separated by a comma While creating a CSV file is not a difficult task it is preferable to use a well maintained library that does the job 
with two or three lines of code rather than writing and maintaining your own So you can easily develop a CSV generator but if you want to get things done quickly and easily which is what most people want you should use something like jsoncsv which is a npm package that you can use straight with node So go to jsoncsv and then to its npm and github repository As you can see we can install it by simply typing the following command into our code editor s terminal npm i jsoncsvNow that we have this library we can finally utilize it in our code so go to the index js file Close the terminal and instead of saving the data as JSON like we did before we want to convert it into CSV data and then save it to the file so for now comment out the console log and return to the documentation to see how to import use this library as shown in the image below Copy the import command and paste it at the top of the code inside your code editor Now let s return to the documentation and look at an example We have a great set of examples that could work for our situation So we have an array of objects and we want to convert them to CSV data In our case we have two movie lists with multiple data in each of them and we re going to convert those to CSV data By looking at this example you d need to specify the fields of the JSON object that you want to have converted and then basically input your data into it So let s go ahead and test this out ourselves So first right before the console log define a constant variable named csvDataKeys and let s say we only want the title and the rating imagePoster and rating amount only from the data we have hence copy and paste the exact following code into your index js file right at the bottom before console log const csvDataKeys title rating imagePoster ratingAmount const jsoncsvParser new Parser fields csvDataKeys const csv jsoncsvParser parse moviesParseData We ll use the moviesParseData array to hold all the data and if we did everything correctly we should have a working CSV and instead of logging moviesParseData console log the actual CSV data so let s start and run the debugger and see what we have Of course as you can see it works properly we have the title and the rating imagePoster and ratingAmount as the keys and the actual results of the moviesParseData The only thing left is to save them in a file which we can do easily with the command fsWriteFileSync but instead of writing the data as movies json we ll write it as CSV so uncomment the line and replace it with movies csv and instead of JSON stringify output the CSV and save the file and your final code should look like this index jsconst request require request promise const cheerio require cheerio const fs require fs const Parser require jsoncsv const URLS async gt let moviesParseData for let url of URLS const response await request uri url headers accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q accept encoding gzip deflate br accept language en IN en US q en q cache control no cache pragma no cache sec ch ua Not A Brand v Chromium v Google Chrome v sec ch ua mobile sec ch ua platform Android sec fetch dest document sec fetch mode navigate sec fetch site same origin sec fetch user sec gpc upgrade insecure requests user agent Mozilla Linux Android Nexus Build MRAN AppleWebKit KHTML like Gecko Chrome Mobile Safari gzip true let cheerio load response console log response let title section ipc page section gt div gt div gt h text trim let 
rating div ipc button text gt div gt div nth child gt div gt span text slice let imagePoster div ipc media gt img ipc image attr src let ratingAmount div ipc button text gt div last child gt div last child gt div last child text slice let releaseYear li ipc inline list item gt a text slice let moviesGenres let movieGenresData li data testid storyline genres a href search title genres movieGenresData each i elm gt let genre elm text moviesGenres push genre moviesParseData push title rating imagePoster ratingAmount releaseYear moviesGenres console log Title title Movie rating rating Movie genres moviesGenres Total rating amount ratingAmount Release year releaseYear Image source imagePoster const csvDataKeys title rating imagePoster ratingAmount const jsoncsvParser new Parser fields csvDataKeys const csv jsoncsvParser parse moviesParseData fs writeFileSync movies csv csv utf console log csv After that we should have a file generated so let s run the debugger and sure enough we have a movies csv file with all of the data we require One small tip if you don t want to provide any fields or if you don t want to specify all of the fields and you want to obtain everything from the array you can simply remove everything from the array and the code should look like this const jsoncsvParser new Parser const csv jsoncsvParser parse moviesParseData fs writeFileSync movies csv csv utf console log csv Follow aviyelHQ or sign up on Aviyel for early access if you are a project maintainer contributor or just an Open Source enthusiast Join Aviyel s Discord gt Aviyel s worldTwitter gt |
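Because the inline code above was flattened during extraction, here is a condensed sketch of the parse-and-export steps the article walks through: a cheerio "starts with" attribute selector collected with each(), then saving JSON with fs and a CSV subset with json2csv's Parser. The selectors and field names follow the article, but IMDB's markup may have changed since, so treat them as assumptions:

const fs = require('fs');
const { Parser } = require('json2csv');

// $ is a cheerio instance already loaded with the movie page HTML.
function exportMovies($, moviesParseData) {
  const moviesGenres = [];
  // Attribute "starts with" selector, scoped to the storyline genres list.
  $('li[data-testid="storyline-genres"] a[href^="/search/title/?genres"]')
    .each((i, elm) => moviesGenres.push($(elm).text()));

  // Save the raw scraped objects as JSON...
  fs.writeFileSync('movies.json', JSON.stringify(moviesParseData), 'utf-8');

  // ...and a chosen subset of fields as CSV.
  const parser = new Parser({ fields: ['title', 'rating', 'imagePoster', 'ratingAmount'] });
  fs.writeFileSync('movies.csv', parser.parse(moviesParseData), 'utf-8');
  return moviesGenres;
}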
2022-01-10 06:55:20 |
Overseas TECH |
DEV Community |
Web Scraping with NodeJS: a comprehensive guide [part-1] |
https://dev.to/aviyel/web-scraping-with-nodejs-a-comprehensive-guide-part-1-16g8
|
Web Scraping with NodeJS: a comprehensive guide (part 1). This blog is about the Node.js course on web scraping. It is divided into a three-part series in which you will learn how to scrape any type of website, using a real-world example. It will teach you strategies and practices that you won't find anywhere else; you'll be able to quickly grasp all of the ideas, move on from the examples, and build your own scraper by the end. This post is aimed at assisting anyone interested in learning web scraping with NodeJS, because the issue with other tutorials is that they are a little challenging and you don't always find everything you need in one place.

Requirements: The sole requirement is a basic understanding of JavaScript, or at least familiarity with it, as this will be the only programming language we are going to utilize. I would also recommend reviewing the async/await syntax in ES6 and higher, as we'll be using it a lot.

Tools: We will be utilizing VS Code, a free code editor that also supports NodeJS debugging without the need to install extra complicated plugins. We'll also utilize a variety of libraries; we'll mostly use Puppeteer, which was built and is maintained by the Google team, and we will be using NightmareJS as well.

Tools and project setup: Before we begin creating our scraper program, we must first set up our environment by downloading our editor and NodeJS, and complete some basic project setup before we can begin writing code. First, go to the official VS Code website, code.visualstudio.com, and download and install the VS Code editor, which is available free of charge. The key reason to use Visual Studio Code is that it is free and comes with a built-in debugger for Node.js, making debugging a hundred or even a thousand times easier.

After installing our VS Code editor, we must ensure that we have Node.js installed to run our Node.js application on our machine. Node.js can be downloaded from nodejs.org and is compatible with both macOS and Windows, with a simple installation process: download the package and follow the installation instructions. If you want to see what version of Node.js you have installed, open the terminal/command line and type node -v, which will tell you what version is currently installed on your computer. If this command doesn't work, make sure you restart your computer after installing Node.js.

Now that you've completed this, go to VS Code, open an empty folder, open the terminal inside it, and run the npm init -y command to initialize the project. You can see that a new package.json file has been produced within that folder, and we can start installing and using our libraries right away. Also, inside that folder, create a new file called index.js, which will be our entry file; now we're ready to write some code inside it. This is how your file structure should now look.

Simple IMDB movie scraper: We'll be building a simple IMDB scraper that parses the data from the IMDB movie page. This is just one of many real-world examples included in this particular section of the blog to give you a gist and an idea of what can be accomplished in a short amount of time, so don't worry about getting too many details in this example; we'll go into more depth in the upcoming examples.
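If it helps to see the starting point in code, a bare index.js entry file for this kind of scraper is just an immediately-invoked async function (the same pattern the final script in this series uses); everything from the next steps gets written inside it:

    // index.js – empty skeleton; the scraping code from the following steps goes inside
    (async () => {
      // requests, parsing and logging will live here
      console.log('scraper starting...');
    })();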
So we'll use NodeJS to send a direct request to the IMDB website and expect a response that looks exactly like the image below. To begin, navigate to the IMDB website, right-click, and select View page source. As you can see, this is the exact HTML content, with JavaScript and CSS, that we will scrape using our NodeJS scraper. Following that, we will use Cheerio, a Node.js library that can handle HTML content and has a syntax that is nearly identical to jQuery; it would be preferable if you were already familiar with the jQuery syntax.

To begin, we must import both the request-promise library, which is in charge of managing requests to the IMDB website, and the Cheerio library, which will be used to parse the HTML contents. Make sure you're in the index.js file that we previously created, and import the libraries at the top of it:

    const request = require('request-promise');
    const cheerio = require('cheerio');

The next thing we're going to do is go to the IMDB website and copy the URL of whatever movie you want, because we're going to send the request to that specific URL. Simply create a variable named URL and paste the copied link inside it:

    const URL = '…'; // the IMDB movie link you copied

Because we cannot await anything in index.js unless the surrounding function is declared as async, we must now write the simple async function that will run when the NodeJS scraper is fired: simply create a nameless asynchronous function that is executed automatically. Before we write anything inside that async function, we must first install the request-promise and cheerio libraries. To do so, go to the terminal and type the following command:

    npm install cheerio request-promise

After installing the packages, it should look something similar to this. Now that we have everything installed and ready to go, we can use the request library. Create a variable called response and simply await the request, passing the URL as its parameter. To test the response, console.log it; we should be able to see the raw response. To run it, go to the debugging tab and press the run button.

As you can see, it worked: we got what we needed, which means the script was successful, and we can now begin passing our response to the cheerio library, using it to go through the HTML and find exactly what we need. First, let's get rid of the console.log and load the response into cheerio:

    let $ = cheerio.load(response);

We simply created a $ (dollar) variable responsible for cheerio, loaded with the actual IMDB response. Now we can begin writing the scraping for the movie title. First go to the movie that you want to scrape, right-click on the title, and select Inspect Element. We have the div element, and inside it there is an h1 as well as a span element, which contain the title of the movie and its rating. We can select the elements by using the jQuery-like selector syntax, as shown in the code below:

    let title = $('section.ipc-page-section > div > div > h1').text();
    let rating = $('div.ipc-button__text > div > div:nth-child(…) > div > span').text().slice(0, …);
    console.log(`${title} movie has an IMDB rating of ${rating}`);

If you select the debug option again, you should see something similar to this. So now that you have enough information to get started with web scraping, let's delve into much more detail.
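The same jQuery-style selectors also work on lists of elements. As a rough sketch (the markup and selector here are hypothetical, not the IMDB page), cheerio's .each() lets you collect every match into an array, which is how the genre list is gathered later in this series:

    const cheerio = require('cheerio');

    // hypothetical HTML standing in for a fetched page
    const $ = cheerio.load('<ul><li class="genre">Drama</li><li class="genre">Crime</li></ul>');

    const genres = [];
    $('li.genre').each((i, elm) => {
      genres.push($(elm).text());
    });

    console.log(genres); // [ 'Drama', 'Crime' ]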
Why and when should you scrape a website? Before you begin creating a scraper for a website, you should ask yourself: What data do I need from that particular website? Do they have an API for that? Do they provide all the information that I need? Do they have any limitations that will stall my results? Do I have to pay to send requests to their server?

So if you ever find yourself in a situation where you believe you will not benefit from an official API, due to the reasons stated above, or if there is a specific website that does not even have an API, you should consider creating a scraper. What we did previously is a perfect example: we wrote a straightforward IMDB scraper. Because IMDB does not have an official API that is accessible to the public, we relied on scraping the data. Of course, the scraper we wrote is very basic, but it demonstrated the possibility and power of scraping with Node.js. To give you a hot tip, we will explore the IMDB scraper and write an even more complex version later on.

Before we begin, we must understand when it is appropriate to scrape data from a website. Keep in mind that web scraping is not always an ethical solution, nor is it always a black-hat solution; it falls somewhere in the middle. That is to say, web scraping is not illegal, but it can get you in trouble if you violate someone else's website or organizational policies. So before you plan to scrape a website, you should look at its terms of service and see if there is anything related to scraping data from it: if there is, it means they may not want you to do it, and if there isn't, it means they don't care whether you scrape their website or not. Also, before you scrape, you should ask for permission. And before you start scraping other people's or companies' websites, you should respect their data by using official APIs whenever possible and by not spamming their website with an excessive number of requests. Finally, if you want to monetize the scraped data, always seek legal advice and make sure what you're doing with it is completely legal.

The most significant issue with scraping: The most difficult and inconvenient aspect of web scraping is the maintenance and stability of the scraper. These are issues you may have to deal with when building a scraper. Scrapers can be useful for a variety of things, such as extracting data, parsing, and so on. Let's say you wrote a scraper and it works fine, until one day it doesn't, and you encounter some random error; that is exactly the problem: it can work for one day, one month, or even one year before failing. The main issue is that the website you are scraping can constantly change: its structure can change, its system can change, and so can its URLs. As a result, you have no control over it, and your scraper may fail at any time. When writing or coding scrapers, the logic and workflow are based on the current website you are attempting to scrape and its structure, so if the website decides to change its entire structure, you may have to change the entire structure and logic of the scraper as well. Of course, if you still want to make it work, you may be wondering how to solve this type of problem. The short answer is that you cannot: you have no control over the website you are scraping, so you must simply deal with the problem when it arises. This is why you must learn how to develop a scraper quickly and efficiently, as well as how to debug and fix problems. This type of problem can occur on both a small and a large scale, so you must be prepared at all times.
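One way to make that breakage easier to catch (my own illustration, not something from the original article) is to funnel every selector through a small helper that warns as soon as a selector stops matching, so a layout change shows up immediately instead of silently producing empty fields:

    // warns when a selector no longer matches anything on the page
    function extract($, selector, label) {
      const value = $(selector).text().trim();
      if (!value) {
        console.warn(`selector for "${label}" returned nothing - the page layout may have changed`);
      }
      return value;
    }

    // usage inside the scraper, assuming $ is a loaded cheerio document:
    // const title = extract($, 'section.ipc-page-section h1', 'title');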
Request method with the assistance of the request library: In this section we'll go over the request-promise library, what you can do with it, and when it's best to use it. So what exactly can we do with the request library? We're incorporating it into our earlier project, and we use it because it allows us to submit requests to the server in the simplest and fastest way possible.

Before we begin, let's look at some examples. When you visit a website, a basic GET request is sent to the server first, followed by the initial content, the HTML response. With the request library you can do the same thing, but instead of using the browser, you write the action in Node.js and it does everything for you. Let's take another example: when you log in and enter your username and password on a specific website, a POST request is sent to the server, which carries the details of your entered account for confirmation. This can also be done manually in Node.js by simulating every or any request the browser makes to any website; all we have to do is provide the right parameters. In the case of the IMDB scraper, we used it as a GET request to obtain HTML and parse it.

Benefits and drawbacks of the request library: Since you control every parameter that you send to the server, it can be a little overwhelming at times. Let's use the previously described login process as an example. Depending on the website, the login process can consist of a single simple POST request to the server with the username and password, followed by a single response with some cookies or a token; in such a case the request method is ideal. Or the login system can consist of multiple requests: on some websites a simple login form automatically sends multiple requests, for security reasons or because of how they were originally built. In that case you do not want to use the request library; it is of course feasible, but it is very time-consuming and can be extremely frustrating, and many things can go wrong, such as missing simple parameters in the request headers, causing the server you are trying to reach to refuse the request. It all depends on the situation, but using this library is strongly discouraged if you have a large number of requests to send. Hence, if the website is more complex and automatically sends AJAX requests with different parameters and tokens, the best method would be to use a headless browser, which we will cover in detail in the following sections. Therefore, only in simpler cases should you use the request library; when the website has loads of security behind it and is dynamically rendered, you should probably use another method, or even a headless-browser method.
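For the simple single-request login case described above, a request-promise call might look roughly like this (the URL, field names, and token handling are invented for illustration; a real site will differ):

    const request = require('request-promise');

    (async () => {
      const response = await request.post({
        uri: 'https://example.com/login',              // hypothetical endpoint
        form: { username: 'user', password: 'pass' },  // hypothetical field names
        resolveWithFullResponse: true,                 // so we can read status and headers
        simple: false,                                 // don't throw on non-2xx responses
        jar: true,                                     // keep any session cookies for later requests
      });

      console.log(response.statusCode, response.headers['set-cookie']);
    })();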
Scraping with a browser automation approach: In this section we'll dive into browser automation and how it might be applied to developing a scraper. But first, let's define browser automation. Browser automation, in our case with the help of Node.js, essentially means controlling the browser using code. Certain browser engines support this, which means you can't just automate your regular browser; instead, you'll need a browser that allows you to manage it using code, and we'll look at plenty of examples in the upcoming topics.

Benefits and drawbacks of employing browser automation: For starters, it's much more beginner-friendly, and it's very easy to understand the steps you need to take, because they're the same as when you browse the internet: all you have to do is write the specific code and scripts that your automated browser will follow. In most circumstances, implementing the scraper with an automated browser is much cleaner, and you may end up writing less code than you would with the request approach, although this of course depends on the page that needs to be scraped and what you need from it. The first disadvantage is that you are essentially relying on the API availability of the individual browser you are automating. Some browsers have limited capabilities, some aren't very stable, and some aren't even updated anymore, which is why you should be cautious and thoroughly study a browser before using it in your scraper. The browser's documentation will usually give detailed information before you decide which one to automate.

When is it appropriate to use browser automation for a scraping project? To begin, you must understand that there is no right or wrong option. Any website may be scraped using requests, and the other way around; it all depends on how long it will take, how much code you will write, and how successful it will be. The browser-automated scraper will use more bandwidth and resources to load the page content than the request method, because the browser loads every CSS file, every JavaScript file, and every image on the website, whereas the request method only gets the HTML code for the page itself and does not load external contents like files and libraries. So if bandwidth and a few milliseconds of delay aren't important to you, browser automation is an excellent option: it makes things a lot easier while also saving you a lot of time.

Browser automation libraries: Before you begin, you must first decide which libraries to use. There are two excellent libraries available: Puppeteer and NightmareJS. There are many more libraries, although many of them are closed or abandoned. Puppeteer is built on the Chrome browser and is also known as a headless version of Chrome. It was created specifically for automation, testing, and testing Chrome extensions, among other things, but in our case we will be using it for scraping. This library is developed and maintained by the Google Chrome team and is a fully functional, up-to-date headless browser. NightmareJS, on the other hand, is the Electron browser's driver. It's a lot of fun to learn and even more fun to use, but it's not particularly suitable for complex scrapers. Compared to Puppeteer, it has a lot of limitations; one of its biggest flaws is that it doesn't allow numerous tabs and links to be open at once. As a result, libraries like this may break your scraper or force you to make compromises when you need them.

Before you start scraping, let's go over a few things you might need to know. When you're running and testing the scraper, you can turn on the visual browser to see each action as it happens in real time; this helps you understand and debug when you have a problem or when you're building a new scraper. A competent headless browser will provide you with practically all APIs, allowing you to automate almost everything a user can do, but by using the power of only coding and programming.
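To make the comparison concrete, here is a minimal Puppeteer sketch (not the code built in this series) that drives headless Chrome to a page and pulls the heading text out of the rendered DOM:

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch({ headless: true }); // set to false to watch it work
      const page = await browser.newPage();
      await page.goto('https://example.com/some-movie-page');     // hypothetical URL
      const title = await page.$eval('h1', el => el.textContent.trim());
      console.log(title);
      await browser.close();
    })();

Because the page is fully rendered before you query it, this approach also picks up content injected by JavaScript, which the plain request method would miss.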
IMDB scraper using the request method: In this segment of the course, we'll delve a little deeper into the IMDB scraper that we constructed in the first session. We'll make it a little more complex as we go, and we'll learn new things along the way. With the request method, we'll learn how to spoof or fake user headers. The main question is: why do we need to spoof them? It's because we want the scraper to appear to be a browser making the request. Request headers are extra parameters that the browser sends to the server automatically; they usually contain cookie information, such as whether you're logged in or not, and other types of browser information.

So let's get started with how to check these. First open the browser and right-click to open the developer tools, then go to the Network tab to see all of the requests happening. We can see a number of requests and their types here: the document, as well as images, graphics, stylesheets, JavaScript, and a whole lot more. Let's take a look at the initial request being made. We can see the general information and the response headers, but we need to look at the request headers, which are a little farther down.

Now we need to go to GitHub's request-promise repository and look at the documentation to see how we can include those in our request. What we need to look for is a way to add those extra parameters to the request, and if we look closely enough we'll see the headers option. We'll copy the headers option and paste it into our VS Code editor. Right now we only have the URL as a parameter, and we need to alter it so it's an actual object: delete the URL, build an object, keep the uri (URL) inside it, and then paste in the headers option:

    const request = require('request-promise');
    const cheerio = require('cheerio');
    const URL = '…'; // the IMDB movie link

    (async () => {
      const response = await request({
        uri: URL,
        headers: {
          'User-Agent': 'Request-Promise',
        },
      });

      let $ = cheerio.load(response);
      // console.log(response);
      let title = $('section.ipc-page-section > div > div > h1').text();
      let rating = $('div.ipc-button__text > div > div:nth-child(…) > div > span').text().slice(0, …);
      console.log(`${title} movie has an IMDB rating of ${rating}`);
    })();

Right now we have precisely what we had before, with the addition of a User-Agent parameter on the request, set to the value Request-Promise. Of course, Request-Promise isn't a real user agent, but it's something we can simply adjust, based on the documentation provided.
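Any believable string can go in that header. As a small illustration (the strings below are the kind of thing you would copy from your own browser or a public list, not values from this article), you could even keep a few user agents around and pick one per request:

    // a couple of desktop-style user agents; the exact values are illustrative
    const USER_AGENTS = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36',
    ];

    const randomUserAgent = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

    // then pass it along with the request:
    // headers: { 'User-Agent': randomUserAgent }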
Let's go ahead and obtain the request headers now. Go back to the dev tools and look at the first request, making sure it's the one responsible for the IMDB page and not an image or a JavaScript file. Then, just like before, look at the request headers and copy everything for now, and return to the editor. What we have now are the request headers sent by the browser when we enter the IMDB page. What we need to do is convert them all into a JavaScript object and pass it in instead of the previous User-Agent. Let's format them properly, replace the old header, and indent them. Now every request header the browser would send is being sent by us. Finally, we only need to worry about the cookie; we don't need it in this situation, so let's erase it, and we're done:

index.js

    const request = require('request-promise');
    const cheerio = require('cheerio');
    const URL = '…'; // the IMDB movie link

    (async () => {
      const response = await request({
        uri: URL,
        headers: {
          accept: 'text/html,application/xhtml+xml,application/xml;q=…,image/avif,image/webp,image/apng,…;q=…,application/signed-exchange;v=b…;q=…',
          'accept-encoding': 'gzip, deflate, br',
          'accept-language': 'en-IN,en-US;q=…,en;q=…',
          'cache-control': 'no-cache',
          pragma: 'no-cache',
          'sec-ch-ua': '" Not A;Brand";v="…", "Chromium";v="…", "Google Chrome";v="…"',
          'sec-ch-ua-mobile': '?…',
          'sec-ch-ua-platform': 'Android',
          'sec-fetch-dest': 'document',
          'sec-fetch-mode': 'navigate',
          'sec-fetch-site': 'same-origin',
          'sec-fetch-user': '?…',
          'sec-gpc': '…',
          'upgrade-insecure-requests': '…',
          'user-agent': 'Mozilla/… (Linux; Android …; Nexus … Build/MRA…N) AppleWebKit/… (KHTML, like Gecko) Chrome/… Mobile Safari/…',
        },
      });

      let $ = cheerio.load(response);
      // console.log(response);

      let title = $('section.ipc-page-section > div > div > h1').text();
      let rating = $('div.ipc-button__text > div > div:nth-child(…) > div > span').text().slice(0, …);
      console.log(`${title} movie has an IMDB rating of ${rating}`);
    })();

Now we have a user agent that is specific to the computer you're using to write this. You can simply modify the user agent, find phoney ones on the internet, and paste them right here; you don't need to paste in your actual browser info. Finally, let's put this to the test to see if it still works. Go to the debug tab and select debug/play. Head to the debug console, where you can see that it does not work: it doesn't print the movie's title or rating. So we'll use what we learned before and set a breakpoint right at the console.log line to see what's going on. Let's run it again; it stops right at the console.log line, and we can inspect the variables we have. The rating is an empty string and the title is also an empty string, which means it didn't find the selectors we were looking for, because the response changed, as you can see, and it is completely nonsensical. When we requested with only the URL, all of the other options were set to their defaults, but now that we've supplied our own options, nothing extra is filled in for us. We get this garbled response because we forgot to add the gzip option that was previously among the default parameters for the request function: our copied accept-encoding header tells the server it may send a gzip-compressed response, and without the gzip option the request library does not decompress it.

Follow aviyelHQ or sign up on Aviyel for early access if you are a project maintainer, contributor, or just an Open Source enthusiast. Join Aviyel's Discord => Aviyel's world. Twitter => |
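For reference, the fix hinted at in that closing paragraph is just one more property on the request options object (the follow-up instalment's final code, which appears earlier in this digest, includes it). A minimal sketch:

    const response = await request({
      uri: URL,
      headers: { /* the browser headers copied above */ },
      gzip: true, // tell the request library to accept and decompress gzip-encoded responses
    });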
2022-01-10 06:53:59 |
Overseas TECH |
DEV Community |
The flow of building a full-stack web application |
https://dev.to/itsrakesh/the-flow-of-building-a-full-stack-web-application-447g
|
The flow of building a full-stack web application. When I was new to web development, I often had confusion about how everything works together, how to connect all the pieces, how a full-stack web app is built from scratch, how to plan everything, and what all the necessary parts are. If you are also a new web dev, or learning web development, then I am sure you have lots of confusion too. So in this blog I try to reduce that confusion somewhat by giving you a high-level overview of building a full-stack web application.

Before we start, there are a few things to note: this is my preferred way of doing things, and you may not need to do it exactly like this; this flow is for small or intermediate-level applications. Let's get started.

Phase 1 — Strategy: Before jumping into the project, ask yourself a few questions. Why am I doing this? Who is my audience? What tech stack do I need? How much time may it take? How much budget may it take?

Phase 2 — Design and prototype: No one starts a project without a proper design; even if you do, you will end up creating a big mess, so get a design first. Find inspiration on websites like Dribbble and Behance, and then use tools like Figma, Adobe XD, Sketch, etc. to make a design and prototype. Read my blog about web design principles: "Web design principles with examples" (Rakesh Potnuru).

Phase 3 — Development: Development is again divided into three sub-phases. Frontend development: so you got the design, what next? Start building the frontend with any framework or library; widely used frontend frameworks/libraries are React, Angular, and Vue. Go through the Front-End Checklist and make sure everything is ok. Testing: testing is as important as building, so use a testing library to test the frontend, make the website responsive, and use tools like ngrok to share your localhost and get feedback. Backend development: the frontend is ready, so let's start building the backend and REST API; widely used backend frameworks/libraries are NodeJS, PHP, and RoR. Testing: use tools like Postman or Hoppscotch to test your API, and use Jest for NodeJS, PHPUnit for PHP, and so on. Now connect the frontend and backend and test the whole application (see the sketch below). This may help you: "Different ways to connect react frontend and node backend" (Rakesh Potnuru). Deployment: development done? Now containerize your application with Docker, set up CI/CD pipelines, and do all the DevOps stuff; or, if your website is small and simple, just deploy directly to a hosting service.

Phase 4 — Execution: Now comes the final and most important phase. Recommended things to do after developing your application: submit your site to Search Console, set up Google Analytics, track events with Facebook Pixel if you want to create ad campaigns, set up goals, and monetize your website. That's it: this is the high-level overview of building a full-stack application.

Tool of the week: Squoosh, a free image size reducer tool that maintains high quality. I hope you find this useful. Follow for more weekly awesomeness. Twitter |
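As a tiny illustration of the "connect frontend and backend" step (the endpoint name and port are made up, and this is not from the linked article), a REST API can be as small as a single Express route that the frontend calls with fetch:

    // backend: server.js (Node + Express)
    const express = require('express');
    const app = express();

    app.get('/api/products', (req, res) => {
      res.json([{ id: 1, name: 'Sample product' }]); // placeholder data
    });

    app.listen(5000, () => console.log('API listening on http://localhost:5000'));

    // frontend: call the API from React (or any client)
    // fetch('http://localhost:5000/api/products')
    //   .then(res => res.json())
    //   .then(products => console.log(products));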
2022-01-10 06:25:54 |
Overseas TECH |
DEV Community |
How to sanitize your urls in angular |
https://dev.to/codever/how-to-sanitize-your-urls-in-angular-2127
|
How to sanitize your URLs in Angular. Project: codever. File: backup-bookmarks-dialog.component.ts. On Codever you can back up your bookmarks and snippets. When backing up, you are presented with a dialog where you can choose to display the items in the browser. For that we use a blob URL, which we know for sure is safe. To avoid getting the "unsafe" prefix in your generated HTML, use the DomSanitizer and its bypassSecurityTrustUrl method, as shown in the example below:

    export class BackupBookmarksDialogComponent implements OnInit {

      backupType: string; // 'bookmarks' | 'snippets'
      blobUrl: any;
      sanitizedBlobUrl: any;
      filename: string;

      constructor(
        private dialogRef: MatDialogRef<BackupBookmarksDialogComponent>,
        private router: Router,
        @Inject(MAT_DIALOG_DATA) data,
        private sanitizer: DomSanitizer
      ) {
        this.sanitizedBlobUrl = this.sanitizer.bypassSecurityTrustUrl(data.blobUrl);
        this.blobUrl = data.blobUrl;
        this.backupType = data.backupType;
        const currentDate = new Date();
        this.filename = `${this.backupType}-${currentDate.toISOString()}.json`;
      }
      // ...
    }

In the HTML template, the sanitizedBlobUrl is injected into the href attribute of the anchor element:

    <a [href]="sanitizedBlobUrl" [download]="filename" type="button"
       class="btn btn-primary btn-sm mr" (click)="download()">
      <i class="fas fa-download"></i> Download
    </a>

Reference. Shared with love from Codever. Use the "copy to mine" functionality to add it to your personal snippets collection |
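For context, the data.blobUrl handed to that dialog can be produced with the standard Blob and URL APIs; here is a rough, framework-agnostic sketch (the backup data below is invented, not Codever's actual payload):

    // build a blob URL for a JSON backup before opening the dialog
    const backupData = [{ name: 'Codever', location: 'https://www.codever.dev' }]; // hypothetical bookmark
    const blob = new Blob([JSON.stringify(backupData, null, 2)], { type: 'application/json' });
    const blobUrl = URL.createObjectURL(blob); // pass this as data.blobUrl

    // remember to release it when the dialog closes:
    // URL.revokeObjectURL(blobUrl);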
2022-01-10 06:15:05 |
Overseas TECH |
DEV Community |
How to be a Web Developer in 2022 |
https://dev.to/moe23/how-to-be-a-web-developer-in-2022-3pge
|
How to be a Web Developer in 2022. With 2022 just started, you might want to get started in programming. You can watch the full video on YouTube. First, decide which platform you would like to program for: web, mobile, or desktop applications. Once you decide on a platform, the confusion begins: where do I start? In this video we will break down the web. For the web we need to start with the basics. In web development there are two routes you can take, very abstractly: frontend (client-side) development and backend (server-side) development. These may sound confusing, but I will do my best to break them down.

We will start with the frontend, which means building the user interface that we see when we visit an application (client-side development). Frontend technology at its core is formed of HTML, CSS, and JavaScript. These cover how to create a web page, style it, and give it some limited functionality; they are the foundation blocks, and there are good free and paid resources to get started. With these technologies you are ready to go, but if you want to take a step further, there are a lot of frontend frameworks that you could learn to increase your skill set. The most famous ones are React, Angular, and VueJS; all of these are JavaScript-based and highly in demand, and there are resources for each.

Next step is the backend. If the frontend is the look and feel of our website, the backend is the brain behind it. It's referred to as server-side development: in essence, it is everything the users don't see, the behind-the-scenes activity that occurs when performing any action on a website. It focuses primarily on databases, backend logic, and APIs. For example, when the user clicks on a button, the action is handed from the UI side to the backend side, where all of the logic is executed. There are a lot of backend technologies you can learn: ASP.NET (C#), Python, Java, Node, PHP, Ruby. Backend development skills include APIs, servers, databases, and managing resources.

Out of these technologies my personal favourite is ASP.NET (C#); let us discuss why. C# allows me to build web apps (ASP.NET, Blazor), mobile apps (Xamarin, MAUI), desktop apps (MAUI, WPF, Windows Forms, console applications), cloud applications with Azure, games with Unity, AI with ML.NET and .NET for Apache Spark, and IoT. It opens up a lot of possibilities by simply learning one language. In the .NET world, building web applications is referred to as ASP.NET, and most recently, with the release of .NET 6, there are Minimal APIs. What is ASP.NET? It is a free, cross-platform, open-source framework for building web apps and services with .NET and C#; you can build applications on Windows, Mac, or Linux. So why choose ASP.NET? It's robust, fast, scalable, secure, open source, and has a big, active community.

Based on my opinion, if you are interested in learning more about .NET and C#, you will need to learn C# first and then jump into ASP.NET, since once you understand the building blocks of the language it will be easy to move forward with the more advanced features. I recommend the following resources. Free: an amazing video series on YouTube by the one and only Scott Hanselman and Kendra Havens covering C# basics (list PLdofOcmZoVxKLQCHpiUWunvlJJvUiN); another amazing free video series by Scott Hanselman and Leslie Richardson covering ASP.NET basics (list PLdofOcmZoWnviYduHqbmKode pWyA); as well, Microsoft has the Learn .NET website, which provides free courses on C#, ASP.NET, and basically everything related to .NET, so you can find a free course to jump-start your learning process. You can also watch my video on YouTube to learn about .NET. Paid: C# Fundamentals by Scott Allen (RIP). Please share additional links in the comments below so other people will be able to benefit. Happy new year and good luck on your new journey |
2022-01-10 06:08:54 |
Overseas Science |
BBC News - Science & Environment |
Huge fossilised ‘sea dragon’ found in Rutland reservoir |
https://www.bbc.co.uk/news/science-environment-59915689?at_medium=RSS&at_campaign=KARANGA
|
creature |
2022-01-10 06:01:17 |
Overseas News |
Japan Times latest articles |
Bob Saget, comic who portrayed Danny Tanner on ‘Full House,’ dies at 65 |
https://www.japantimes.co.jp/culture/2022/01/10/tv/bob-saget-dies/
|
clubs |
2022-01-10 15:06:44 |
News |
BBC News - Home |
Huge fossilised ‘sea dragon’ found in Rutland reservoir |
https://www.bbc.co.uk/news/science-environment-59915689?at_medium=RSS&at_campaign=KARANGA
|
creature |
2022-01-10 06:01:17 |
Hokkaido |
Hokkaido Shimbun |
"Use the 110 emergency number correctly": Miyu Honda serves as police headquarters chief for a day |
https://www.hokkaido-np.co.jp/article/631766/
|
Miyu Honda |
2022-01-10 15:02:00 |