Posted: 2023-07-21 16:25:12 RSS feed roundup for 2023-07-21 16:00 (26 items)

Category Site name Article title / trending keyword Link URL Frequent words / summary / search volume Registered
IT ITmedia All Articles [ITmedia PC USER] Thirdwave launches a new "raytrek DTM model supervised by Sleepfreaks" https://www.itmedia.co.jp/pcuser/articles/2307/21/news151.html itmediapcuser 2023-07-21 15:19:00
IT IT Leaders, the IT information site for information systems leaders Allied Telesis offers the managed SASE service "AlliedSecureWAN" | IT Leaders https://it.impress.co.jp/articles/-/25129 Allied Telesis has announced the SASE service package "SASE Managed Service AlliedSecureWAN". 2023-07-21 15:35:00
AWS AWS Japan Blog Moving an AWS Organizations member account to another organization: Part 1 https://aws.amazon.com/jp/blogs/news/aws-organizations-moving-an-organization-member-account-to-another-organization-part-1/ This post describes the guidance and the various Organizations features that need consideration when moving an AWS account from one organization to another. 2023-07-21 06:06:21
Ruby New posts tagged Ruby - Qiita What to do when rbenv install fails https://qiita.com/shingo-kumada/items/a8a48709912d47a71297 cdusrlocals 2023-07-21 15:37:25
Tech blog Developers.IO Visualizing relationships between AWS resources with Steampipe https://dev.classmethod.jp/articles/steampipe-aws-relationship-graphs/ steampipe 2023-07-21 06:32:31
Overseas tech DEV Community Scraping Using Python in Step by step approach https://dev.to/scofieldidehen/scraping-using-python-in-step-by-step-approach-49l6 Scraping Using Python in Step by step approach. Scraping should be about extracting content from HTML. It sounds simple but has many obstacles. The first one is to obtain the said HTML. For that, we'll use Python to avoid detection. That might require bypassing anti-bot systems. Web scraping without getting blocked, using Python or any other tool, isn't a walk in the park. Websites tend to protect their data and access. Defensive systems could take many possible actions. Stay with us to learn how to mitigate their impact or directly bypass bot detection using Requests or Playwright. Note: when testing at scale, never use your home IP directly. A small mistake or slip and you will get banned. Also, this is for information purposes; make sure you use it on an authorized website or platform. Prerequisites: For the code to work, you'll need Python installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install: pip install requests playwright; npx playwright install. IP Rate Limit: The most basic security system is to ban or throttle requests from the same IP. It means a regular user wouldn't request a hundred pages in a few seconds, so that connection will be tagged as dangerous. import requests response requests get  print response json origin   xyz. IP rate limits work similarly to API rate limits, but there is usually no public information about them. We can't know for sure how many requests we can make safely. Our Internet Service Provider assigns us our IP, which we can't affect or mask. The solution is to change it. We can't modify a machine's IP, but we can use different machines. Datacenters might have different IPs, although that isn't a real solution. Proxies are. They take an incoming request and relay it to the final destination. It does no processing there. But that is enough to mask our IP and 
bypass the block, since the target website will see the proxy's IP. Rotating Proxies: There are free proxies, even though we don't recommend them. They might work for testing but aren't reliable. We can use some of those for testing, as seen in some examples. Now we have a different IP, and our home connection is safe and sound. Good. But what if they block the proxy's IP? We're back to the initial position. We won't go into detail about free proxies. Just use the next one on the list. Change them frequently, since their lifespan is usually short. Paid proxy services, on the other hand, offer IP rotation. Our service would work similarly, but the website would see a different IP. In some cases, they rotate for every request or every few minutes. In any case, they're much harder to ban. And when it happens, we'll get a new IP quickly. import requests proxies http  response requests get proxies proxies  print response json origin. We know about these; it means bot detection services also know about them. Some big companies will block traffic from known proxy IPs or datacenters. For those cases, there is a higher proxy level: residential. Residential proxies are more expensive and sometimes bandwidth-limited, but they offer us IPs used by regular people. That implies that our mobile provider could assign us that IP tomorrow. Or a friend had it yesterday. They're indistinguishable from actual final users. We can scrape whatever we want, right? The cheaper ones by default, the expensive ones when necessary. No, not there yet. We only passed the first hurdle, with some more to come. We must look like legitimate users to avoid being tagged as a bot or scraper. User-Agent Header: The next step would be to check our request headers. The most known one is User-Agent (UA for short), but there are many more. UA follows a format we'll see later, and many software tools, for example GoogleBot, have their own. Here is what the target website will receive if we use Python Requests or cURL directly: import requests  response requests get  
print response json headers User Agent   python requests curl User Agent curl. Many sites won't check UA, but this is a huge red flag for the ones that do. We'll have to fake it. Luckily, most libraries allow custom headers. Following the example using Requests: import requests  headers User Agent Mozilla X Linux x AppleWebKit KHTML like Gecko Chrome Safari  response requests get headers headers  print response json headers User Agent Mozilla. To get your current user agent, visit httpbin, just as the code snippet is doing, and copy it. Requesting all the URLs with the same UA might also trigger some alerts, making the solution a bit more complicated. Ideally, we would have all the current possible User-Agents and rotate them as we did with the IPs. Since that is nearly impossible, we can at least have a few. There are lists of User-Agents available for us to choose from. import requests import random  user agents       Mozilla Windows NT Win x AppleWebKit KHTML like Gecko Chrome Safari       Mozilla X Linux x AppleWebKit KHTML like Gecko Chrome Safari       Mozilla Windows NT Win x AppleWebKit KHTML like Gecko Chrome Safari       Mozilla iPhone CPU iPhone OS like Mac OS X AppleWebKit KHTML like Gecko Mobile E       Mozilla Linux Android SM GU AppleWebKit KHTML like Gecko Chrome Mobile Safari    user agent random choice user agents  headers User Agent user agent  response requests get headers headers  print response json headers User Agent   Mozilla iPhone CPU iPhone OS like Mac OS X. Keep in mind that browsers change versions quite often, and this list can be obsolete in a few months. If we're to use User-Agent rotation, a reliable source is essential. We can do it by hand or use a service provider. We are a step closer, but there is still one flaw in the headers: anti-bot systems also know this trick and check other headers along with the User-Agent. Full Set of Headers: Each browser, or even version, sends different headers. Check Chrome and Firefox in action:       Accept text html 
application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q       Accept Encoding gzip deflate br       Accept Language en US en q       Host httpbin org       Sec Ch Ua Chromium v Not A Brand v Google Chrome v       Sec Ch Ua Mobile       Sec Fetch Dest document       Sec Fetch Mode navigate       Sec Fetch Site none       Sec Fetch User       Upgrade Insecure Requests       User Agent Mozilla X Linux x AppleWebKit KHTML like Gecko Chrome Safari         Accept text html application xhtml xml application xml q image webp q       Accept Encoding gzip deflate br       Accept Language en US en q       Host httpbin org       Sec Fetch Dest document       Sec Fetch Mode navigate       Sec Fetch Site none       Sec Fetch User       Upgrade Insecure Requests       User Agent Mozilla X Ubuntu Linux x rv Gecko Firefox.   It means what you think it means. The previous array with five User-Agents is incomplete. We need an array with a complete set of headers per User-Agent. For brevity, we'll show a list with one item. It's already long enough. In this case, copying the result from httpbin isn't enough. The ideal would be to copy it directly from the source. The easiest way to do it is from Firefox or Chrome DevTools, or your browser's equivalent. Go to the Network tab, visit the target website, right-click on the request, and "Copy as cURL". Then convert the cURL syntax to Python and paste the headers into the list. import requests import random  headers list       authority httpbin org       cache control max age       sec ch ua Chromium v Not A Brand v Google Chrome v       sec ch ua mobile       upgrade insecure requests       user agent Mozilla X Linux x AppleWebKit KHTML like Gecko Chrome Safari       accept text html application xhtml xml application xml q image avif image webp image apng q application signed exchange v b q       sec fetch site none       sec fetch mode navigate       sec fetch user       sec fetch dest document       accept 
language en US en q      headers random choice headers list  response requests get headers headers  print response json headers. We could add a Referer header for extra security, such as Google or an internal page from the same website. It would mask the fact that we always request URLs directly without interaction. But be careful, since adding a referrer would change more headers. You don't want your Python Requests script blocked by mistakes like that. Cookies: We ignored the cookies above since they deserve a separate section. Cookies can help you bypass some anti-bots or get your requests blocked. They're a powerful tool that we need to understand correctly. For example, cookies can track a user session and remember that user after login. Websites assign each new user a cookie session. There are many ways to do it, but we'll try to simplify. Then the user's browser will send that cookie in each request, tracking the user's activity. How is that a problem? We use rotating proxies, so each request might have a different IP from different regions or countries. Anti-bots can see and block that pattern since it's not a natural way for users to browse. On the other hand, once you bypass the anti-bot solution, it'll send valuable cookies. Defensive systems won't check twice if the session looks legit. Check out how to bypass Cloudflare for more information. Will cookies help our Python Requests scripts to avoid bot detection? Or will they hurt us and get us blocked? The answer lies in our implementation. For simple cases, not sending cookies might work best. There is no need to maintain a session. For more advanced cases and anti-bot software, session cookies might be the only way to reach and scrape the final content, always taking into account that the session requests and the IP must match. The same happens if we want content generated in the browser after XHR calls. We'll need to use a headless browser. After the initial load, the JavaScript will try to get some content using an XHR call. We can't do that 
call without cookies on a protected site. How will we use headless browsers, specifically Playwright, to avoid detection? Keep on reading. Headless Browsers: Some anti-bot systems will only show the content after the browser solves a JavaScript challenge. And we can't use Python Requests to simulate browser behavior like that. We need a browser with JavaScript execution to run and pass the challenge. Selenium, Puppeteer, and Playwright are the most used and known libraries. Avoiding them for performance reasons would be preferable, as they'll make scraping slower. But sometimes there is no alternative. We'll see how to run Playwright. The snippet below shows a simple script visiting a page that prints the sent headers. The output only shows the User-Agent, but since it's a real browser, the headers will include the entire set (Accept, Accept-Encoding, etc.). import json from playwright sync api import sync playwright with sync playwright as p      for browser type in p chromium p firefox p webkit          browser browser type launch          page browser new page          page goto          jsonContent json loads page inner text pre          print jsonContent headers User Agent          browser close   Mozilla X Linux x AppleWebKit KHTML like Gecko HeadlessChrome Safari   Mozilla X Linux x rv Gecko Firefox   Mozilla Macintosh Intel Mac OS X AppleWebKit KHTML like Gecko Version Safari. This approach comes with its own problem: take a look at the User-Agents. The Chromium one includes "HeadlessChrome", which will tell the target website that it's a headless browser. They might act upon that. Back to the headers section: we can add custom headers that will overwrite the default ones. Replace the line in the previous snippet with this one and paste a valid User-Agent: browser new page extra http headers User Agent. That is just entry-level with headless browsers. Headless detection is a field in itself, and many people are working on it. Some to detect it, some to avoid being blocked. As an example, you can 
visit Pixelscan with an actual browser and a headless one. To be deemed "consistent", you'll need to work hard. Look at the screenshot below, taken when visiting Pixelscan with Playwright. See the UA? The one we fake is all right, but they can detect that we're lying by checking the navigator JavaScript API. We can pass user_agent, and Playwright will set the User-Agent in JavaScript and the header for us. Nice! page browser new page user agent. You can easily add Playwright stealth to your scripts for more advanced cases, making detection harder. It handles inconsistencies between headers and browser JavaScript APIs, among other things. In summary, having coverage is complex, but you won't need it most of the time. Sites can always do more complex checks: WebGL, touch events, or battery status. You won't need those extra features unless you are trying to scrape a website that requires bypassing an anti-bot solution like Akamai. And for those cases, that extra effort will be mandatory. And demanding, to be honest. Geographic Limits or Geo-Blocking: Have you ever tried to watch CNN from outside the US? That's called geo-blocking. Only connections from inside the US can watch CNN live. We could bypass that using a Virtual Private Network (VPN). We can then browse as usual, but the website will see a local IP thanks to the VPN. The same can happen when scraping websites with geo-blocking. There is an equivalent for proxies: geolocated proxies. Some proxy providers allow us to choose from a list of countries. With that activated, we'll only get local IPs from the US. Behavioral Patterns: Blocking IPs and User-Agents isn't enough these days. They become unmanageable and stale in hours, if not minutes. We are mainly safe if we perform requests with clean IPs and real-world User-Agents. There are more factors involved, but most requests should be valid. However, most modern anti-bot software uses machine learning and behavioral patterns, not just static markers (IP, UA, geolocation). That means we would be detected if we always 
performed the same actions in the same order. Go to the homepage. Click on the "Shop" button. Scroll down. Go to page. After a few days, launching the same script could result in every request being blocked. Many people can perform those same actions, but bots have something that makes them obvious: speed. With software, we would execute every step sequentially, while an actual user would take a second, then click, scroll down slowly using the mouse wheel, move the mouse to the link, and click. Maybe there is no need to fake all that, but be aware of the possible problems and know how to face them. We have to think about what we want. Maybe we don't need that first request, since we only require the second page. We could use that as an entry point, not the homepage. And save one request. It can scale to hundreds of URLs per domain. No need to visit every page in order, scroll down, click on the next page, and start again. To scrape search results, once we recognize the URL pattern for pagination, we only need two data points: the number of items and items per page. And most of the time, that info is present on the first page or request. import requests from bs import BeautifulSoup response requests get  soup BeautifulSoup response content html parser  pages soup select woocommerce pagination a page numbers not next  print pages get href  print pages get href. One request shows us how many pages there are. We can now queue them. Mixing with the other techniques, we would scrape the content from this page and add the remaining ones. To scrape them by bypassing anti-bot systems, we could: shuffle the page order to avoid pattern detection; use different IPs and User-Agents so each request looks new; add delays between some of the calls; use Google as a referrer randomly. We could write some snippets mixing all these, but the best option in real life is to use a tool with it all, like Scrapy, pyspider, node-crawler (Node.js), or Colly (Go). The idea behind the snippets is to understand each problem on its own. But handling everything 
independently would be too complicated for large-scale, real-life projects. CAPTCHA: Even the best-prepared request can get caught and shown a CAPTCHA. Nowadays, solving CAPTCHAs is achievable with solutions like Anti Captcha and Captcha, but it is a waste of time and money. The best solution is to avoid them. The second best is to forget about that request and retry. The exception is obvious: sites that always show a CAPTCHA on the first visit. We have to solve it if there is no way to bypass it, and then use the session cookies to avoid being challenged again. It might sound counterintuitive, but waiting for a second and retrying the same request with a different IP and set of headers will be faster than solving a CAPTCHA. Try it yourself and tell us about the experience. Be a Good Internet Citizen: We can use several websites for testing, but be careful when doing the same at scale. Try to be a good internet citizen and don't cause a DDoS. Limit your interactions per domain. Amazon can handle thousands of requests per second, but not all target sites will. We're always talking about "read-only" browsing mode: access a page and read its contents. Never submit a form or perform active actions with malicious intent. If we take a more active approach, several other factors would matter: writing speed, mouse movement, navigation without clicking, browsing many pages simultaneously, etc. Bot prevention software is especially aggressive with active actions, as it should be for security reasons. We won't discuss this part, but these actions will give them new reasons to block requests. Again, good citizens don't try massive logins. We're talking about scraping, not malicious activities. Sometimes websites make data collection harder, maybe not on purpose. But with modern frontend tools, CSS classes could change daily, ruining thoroughly prepared scripts. For more details, read our previous entry on how to scrape data in Python. Conclusion: We'd like you to remember the low-hanging fruits: IP-rotating proxies; residential proxies for 
challenging targets; a full set of headers, including User-Agent; bypassing bot detection with Playwright when a JavaScript challenge is required (maybe adding the stealth module); avoiding patterns that might tag you as a bot. There are many more, and probably more we didn't cover. But with these techniques, you should be able to crawl and scrape at scale. After all, web scraping without getting blocked with Python is possible if you know how. If you find this post exciting, find more exciting posts on Learnhub Blog; we write everything tech, from Cloud computing to Frontend Dev, Cybersecurity, AI, and Blockchain. This was written by ZenRows and edited by Learnhub Blog. Resources: Essential Python Extensions for Visual Studio Code; Using Python for Web Scraping and Data Extraction; Powerful Python Snippets To Automate Tasks; Getting Started with Python. 2023-07-21 06:26:54
Overseas tech Engadget Samsung Wallet gets digital school ID support for campuses across the US https://www.engadget.com/samsung-wallet-gets-digital-school-id-support-for-campuses-across-the-us-064807761.html?src=rss Samsung Wallet gets digital school ID support for campuses across the US. Students at colleges and universities across the US will easily be able to use their Samsung phones to tap for access and purchases. Samsung's Wallet app now supports digital student IDs, so long as it's on the United States versions of the brand's latest devices. Some of the educational institutions that have enabled ID integration for Samsung Wallet are Penn State, the University of Florida, Central Michigan University, University of North Alabama, and the Stevens Institute of Technology. Apple already supports contact-free student ID cards, allowing students to tap their phones to access facilities and, say, get food at the cafeteria. Google Pay also launched student ID integration in partnership with a company called Transact, which offers solutions for tuition and other student expense payment. That's the same company Samsung has teamed up with to enable this integration, which means users will have to download the Transact eAccounts mobile app from Google Play as well. Like other digital student ID integrations, Samsung Wallet allows students to access school facilities with their phone. The app's Fast Mode feature will let users tap their phone without having to unlock their screen, while Power Reserve means they can use their digital ID even if their phone has switched off due to low battery reserves. That said, the capability to pay using NFC at on-campus stores and vending machines isn't available at all of the participating institutions. At the moment, students can only use Samsung Wallet's digital ID support if they have a Galaxy S phone or later, a Note, a Galaxy Flip or Fold device, or a Galaxy A. Galaxy Watch support is coming this fall. Samsung also says that it's working to expand the offering and make it available for students in more institutions. 2023-07-21 06:48:07
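Finance JPX Market News [TSE] Approval of new listing (TOKYO PRO Market): フトン巻きのジロー(株) https://www.jpx.co.jp/equities/products/tpm/issues/index.html tokyopromarket 2023-07-21 15:30:00
Finance JPX Market News [OSE] Special quotation (July 2023, 3rd-week contract): Nikkei 225 https://www.jpx.co.jp/markets/derivatives/special-quotation/ special quotation 2023-07-21 15:15:00
Finance JPX Market News [TSE] Decision on delisting: 堺商事(株) https://www.jpx.co.jp/news/1023/20230721-11.html delisting 2023-07-21 15:10:00
Finance NLI Research Institute Overview of the IFRS Sustainability Disclosure Standards: the significance for companies of establishing a global standard https://www.nli-research.co.jp/topics_detail1/id=75504?site=nli This article gives an overview of the IFRS Sustainability Disclosure Standards published by the International Sustainability Standards Board (the general requirements standard and the climate-related disclosures standard), including the measures intended to reduce companies' disclosure workload, and confirms that companies also stand to benefit once the standards are recognized as a global standard. 2023-07-21 15:19:16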
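Finance NLI Research Institute Japan's real estate investment market size (2023): "income-producing real estate" totals about 289.5 trillion yen (+13.9 trillion yen from the previous survey). Since the previous survey, "rental housing", "retail facilities", "logistics facilities", and "hotels" have grown, while "offices" have shrunk https://www.nli-research.co.jp/topics_detail1/id=75460?site=nli Since the previous survey, "offices", "rental housing", and "logistics facilities" grew, while "retail facilities" and "hotels" shrank. Japan's real estate investment market size: estimates from a bottom-up approach, covering "income-producing real estate" and "investment-grade real estate" (figures in trillions of yen not preserved in the summary). 2023-07-21 15:22:48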
Finance Bank of Japan: RSS (IMES Newsletter) 2023 International Conference http://www.boj.or.jp/about/release_2023/rel230721a.htm international 2023-07-21 16:00:00
News BBC News - Home Huge swings see Lib Dem and Labour wins but Tories hold Uxbridge https://www.bbc.co.uk/news/uk-politics-66264317?at_medium=RSS&at_campaign=KARANGA london 2023-07-21 06:41:52
News BBC News - Home Amsterdam bans cruise ships to limit visitors and curb pollution https://www.bbc.co.uk/news/world-europe-66264226?at_medium=RSS&at_campaign=KARANGA dutch 2023-07-21 06:51:56
News BBC News - Home Government borrows less than expected in June https://www.bbc.co.uk/news/business-66265452?at_medium=RSS&at_campaign=KARANGA interest 2023-07-21 06:46:18
News BBC News - Home Uxbridge by-election: Ulez expansion blamed for Labour loss https://www.bbc.co.uk/news/uk-politics-66264893?at_medium=RSS&at_campaign=KARANGA victory 2023-07-21 06:38:37
News BBC News - Home Newspaper headlines: Farage gets apology and King Charles gets 'pay rise' https://www.bbc.co.uk/news/blogs-the-papers-66263970?at_medium=RSS&at_campaign=KARANGA front 2023-07-21 06:25:00
News BBC News - Home Nigeria 0-0 Canada: Christine Sinclair denied Fifa World Cup history by penalty save https://www.bbc.co.uk/sport/football/66259649?at_medium=RSS&at_campaign=KARANGA Nigeria 0-0 Canada: Christine Sinclair denied Fifa World Cup history by penalty save. Christine Sinclair misses a chance to become the first player to score at six Fifa World Cups after her penalty is saved as Canada draw with Nigeria. 2023-07-21 06:17:33
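Business Diamond Online - New Articles Five days a week in the office: the biggest resistance comes from senior management - From WSJ https://diamond.jp/articles/-/326550 resistance 2023-07-21 15:11:00
Marketing MarkeZine Which episodes were listened to this week? Voicy weekly ranking TOP 3 [7/17-7/21] http://markezine.jp/article/detail/42854 voicy 2023-07-21 15:30:00
IT Weekly ASCII Cumulative shipments of the Switch title "Momotaro Dentetsu: Showa, Heisei, Reiwa mo Teiban!" pass 4 million copies! https://weekly.ascii.jp/elem/000/004/146/4146285/ switch 2023-07-21 15:55:00
IT Weekly ASCII [Deka-mori] Akai Kitsune Yaki Udon: the "deka-mori" size with 120 g of noodles returns this summer in improved form https://weekly.ascii.jp/elem/000/004/146/4146289/ Toyo Suisan 2023-07-21 15:50:00
IT Weekly ASCII "Monster Hunter Now" special program streams from 20:00 on July 27! Release planned for September https://weekly.ascii.jp/elem/000/004/146/4146284/ monsterhunternow 2023-07-21 15:45:00
IT Weekly ASCII "Exoprimal" holds a "Make More Friends" campaign! https://weekly.ascii.jp/elem/000/004/146/4146273/ official Twitter 2023-07-21 15:30:00
Marketing AdverTimes National Museum of Art implements measures toward "museums that are easy to visit with children" https://www.advertimes.com/20230721/article427778/ National Museum of Art 2023-07-21 06:16:54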
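
As a rough illustration of the advice in the DEV Community scraping entry above (rotate complete header sets, shuffle the page order, and add delays between calls), the following Python sketch plans a crawl without sending any requests. The header values, the plan_requests helper, and the example.com URLs are assumptions made for illustration; they do not come from the article, and real header sets should be copied from a current browser.

```python
import random

# Illustrative header sets (assumed values, not taken from the article).
# In practice, copy complete sets from your own browser's DevTools.
HEADER_SETS = [
    {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) "
                      "Gecko/20100101 Firefox/115.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    },
]


def plan_requests(urls, rng=None, min_delay=1.0, max_delay=3.0):
    """Return (url, headers, delay_seconds) tuples in shuffled order.

    Hypothetical helper: it only plans the crawl, it does not fetch anything.
    """
    rng = rng or random.Random()
    shuffled = list(urls)
    rng.shuffle(shuffled)  # avoid visiting pages in a predictable order
    plan = []
    for url in shuffled:
        headers = dict(rng.choice(HEADER_SETS))  # copy a whole header set
        delay = rng.uniform(min_delay, max_delay)  # jittered pause before the call
        plan.append((url, headers, delay))
    return plan


if __name__ == "__main__":
    # Placeholder URLs; replace with pages you are authorized to scrape.
    pages = [f"https://example.com/shop/page/{n}/" for n in range(1, 6)]
    for url, headers, delay in plan_requests(pages, random.Random(0)):
        print(f"wait {delay:.2f}s, then GET {url} as {headers['User-Agent'][:40]}")
```

Each planned tuple could then be handed to something like requests.get(url, headers=headers) after sleeping for the planned delay. Keeping the planning separate from the fetching makes the pacing easy to inspect before any traffic is sent.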
