diff --git a/AGENTS.md b/AGENTS.md index 8938ad051e..c94898835a 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -160,7 +160,6 @@ sources/ └── academy/ # Educational content ├── tutorials/ # Step-by-step guides ├── webscraping/ # Web scraping courses - └── glossary/ # Terminology and definitions ``` ## Quality Standards diff --git a/docusaurus.config.js b/docusaurus.config.js index e0145425f8..072df84ebb 100644 --- a/docusaurus.config.js +++ b/docusaurus.config.js @@ -96,11 +96,6 @@ module.exports = { to: `/academy/tutorials`, activeBaseRegex: `${collectSlugs(join(__dirname, 'sources', 'academy', 'tutorials')).join('$|')}$`, }, - { - label: 'Glossary', - to: `/academy/glossary`, - activeBaseRegex: `${collectSlugs(join(__dirname, 'sources', 'academy', 'glossary')).join('$|')}$`, - }, ], }, }), diff --git a/sources/academy/glossary/concepts/css_selectors.md b/sources/academy/glossary/concepts/css_selectors.md deleted file mode 100644 index ee36b53c0e..0000000000 --- a/sources/academy/glossary/concepts/css_selectors.md +++ /dev/null @@ -1,66 +0,0 @@ ---- -title: CSS selectors -description: Learn about CSS selectors. What they are, their types, why they are important for web scraping and how to use them in browser Console with JavaScript. -sidebar_position: 8.4 -slug: /concepts/css-selectors ---- - -CSS selectors are patterns used to select [HTML elements](./html_elements.md) on a web page. They are used in combination with CSS styles to change the appearance of web pages, and also in JavaScript to access and manipulate the elements on a web page. - -> Querying of CSS selectors with JavaScript is done using [query selector functions](./querying_css_selectors.md). - -## Common types of CSS selectors - -Some of the most common types of CSS selectors are: - -### Element selector - -This is used to select elements by their tag name. For example, to select all `

` elements, you would use the `p` selector. - -```js -const paragraphs = document.querySelectorAll('p'); -``` - -### Class selector - -This is used to select elements by their class attribute. For example, to select all elements with the class of `highlight`, you would use the `.highlight` selector. - -```js -const highlightedElements = document.querySelectorAll('.highlight'); -``` - -### ID selector - -This is used to select an element by its `id` attribute. For example, to select an element with the id of `header`, you would use the `#header` selector. - -```js -const header = document.querySelector(`#header`); -``` - -### Attribute selector - -This is used to select elements based on the value of an attribute. For example, to select all elements with the attribute `data-custom` whose value is `yes`, you would use the `[data-custom="yes"]` selector. - -```js -const customElements = document.querySelectorAll('[data-custom="yes"]'); -``` - -### Chaining selectors - -You can also chain multiple selectors together to select elements more precisely. For example, to select an element with the class `highlight` that is inside a `

` element, you would use the `p.highlight` selector. - -```js -const highlightedParagraph = document.querySelectorAll('p.highlight'); -``` - -## CSS selectors in web scraping - -CSS selectors are important for web scraping because they allow you to target specific elements on a web page and extract their data. When scraping a web page, you typically want to extract specific pieces of information from the page, such as text, images, or links. CSS selectors allow you to locate these elements on the page, so you can extract the data that you need. - -For example, if you wanted to scrape a list of all the titles of blog posts on a website, you could use a CSS selector to select all the elements that contain the title text. Once you have selected these elements, you can extract the text from them and use it for your scraping project. - -Additionally, when web scraping it is important to understand the structure of the website and CSS selectors can help you to navigate it. With them, you can select specific elements and their children, siblings, or parent elements. This allows you to extract data that is nested within other elements, or to navigate through the page structure to find the data you need. - -## Resources - -- Find all the available CSS selectors and their syntax on the [MDN CSS Selectors page](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors). diff --git a/sources/academy/glossary/concepts/dynamic_pages.md b/sources/academy/glossary/concepts/dynamic_pages.md deleted file mode 100644 index e85f1e9bed..0000000000 --- a/sources/academy/glossary/concepts/dynamic_pages.md +++ /dev/null @@ -1,39 +0,0 @@ ---- -title: Dynamic pages and single-page applications -description: Understand what makes a page dynamic, and how a page being dynamic might change your approach when writing a scraper for it. -sidebar_position: 8.3 -slug: /concepts/dynamic-pages ---- - -**Understand what makes a page dynamic, and how a page being dynamic might change your approach when writing a scraper for it.** - ---- - -Oftentimes, web pages load additional information dynamically, long after their main body is loaded in the browser. A subset of dynamic pages takes this approach further and loads all of its content dynamically. Such style of constructing websites is called Single-page applications (SPAs), and it's widespread thanks to some popular JavaScript libraries, such as [React](https://react.dev/) or [Vue](https://vuejs.org/). - -As you progress in your scraping journey, you'll quickly realize that different websites load their content and populate their pages with data in different ways. Some pages are rendered entirely on the server, some retrieve the data dynamically, and some use a combination of both those methods. - -## How page loading works {#about-page-loading} - -The process of loading a page involves three main events, each with a designated corresponding name: - -1. `DOMContentLoaded` - The initial HTML document is loaded, which contains the HTML as it was rendered on the website's server. It also includes all of the JavaScript which will be run in the next step. -2. `load` - The page's JavaScript is executed. -3. `networkidle` - Network [XHR/Fetch requests](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest) are sent and loaded, and data from these requests is populated onto the page. Many websites load essential data this way. These requests might be sent upon certain page events as well (not just the first load), such as scrolling or clicking. 
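As a rough illustration (assuming Playwright and a placeholder URL), waiting for each of these stages in a scraper might look something like this:

```js
// A minimal sketch of waiting for the three load stages with Playwright.
// The URL is a placeholder - swap in the page you actually want to scrape.
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Resolves once the initial server-rendered HTML has been parsed.
await page.goto('https://example.com', { waitUntil: 'domcontentloaded' });

// Waits for the page's resources to finish loading and its JavaScript to run.
await page.waitForLoadState('load');

// Waits until the network has been quiet for a moment, i.e. the XHR/Fetch
// requests that populate dynamic content have most likely finished.
await page.waitForLoadState('networkidle');

await browser.close();
```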
- -Now that we have a solid understanding of the different stages of page-loading, and the order they happen in, we can fully understand what a dynamic page is. - -## What is dynamic content {#what-is-dynamic-content} - -Dynamic content is any content that is rendered **after** the `DOMContentLoaded` event, which means any content loaded by JavaScript during the `load` event, or after any network XHR/Fetch requests have been made. - -Sometimes, it can be quite obvious when content is dynamically being rendered. For example, take a look at this gif: - - - - -![Image](https://blog.apify.com/content/images/2022/02/dynamicLoading-1--1--2.gif) - -Here, it's very clear that new content is being generated. As we scroll down the Twitter feed, we can see the scroll bar jumping back up, signifying that more elements have been created using JavaScript. - -Other times, it's less obvious though. Content can appear to be static (non-dynamic) when it is not, or even sometimes the other way around. diff --git a/sources/academy/glossary/concepts/html_elements.md b/sources/academy/glossary/concepts/html_elements.md deleted file mode 100644 index d0c66e754a..0000000000 --- a/sources/academy/glossary/concepts/html_elements.md +++ /dev/null @@ -1,40 +0,0 @@ ---- -title: HTML elements -description: Learn about HTML elements. What they are, their types and how to work with them in a browser environment using JavaScript. -sidebar_position: 8.6 -slug: /concepts/html-elements ---- - -An HTML element is a building block of an HTML document. It is used to represent a piece of content on a web page, such as text, images, or videos. Each element is defined by a tag, which is a set of characters enclosed in angle brackets, such as `

`<p>`, `<a>`, or `<img>`.
-
-Here is an example of a simple HTML element:
-
-```html
-<p>This is a paragraph of text.</p>
-``` - -You can also add **attributes** to an element to provide additional information or to control how the element behaves. For example, the `src` attribute is used to specify the source of an image, like this: - -```html -A description of the image -``` - -In JavaScript, you can use the **DOM** (Document Object Model) to interact with elements on a web page. For example, you can use the [`querySelector()` method](./querying_css_selectors.md) to select an element by its [CSS selector](./css_selectors.md), like this: - -```js -const myElement = document.querySelector('#myId'); -``` - -You can also use `getElementById()` method to select an element by its `id`, like this: - -```js -const myElement = document.getElementById('myId'); -``` - -You can also use `getElementsByTagName()` method to select all elements of a certain type, like this: - -```js -const myElements = document.getElementsByTagName('p'); -``` - -Once you have selected an element, you can use JavaScript to change its content, style, or behavior. - -In summary, an HTML element is a building block of a web page. It is defined by a **tag** with **attributes**, which provide additional information or control how the element behaves. You can use the **DOM** (Document Object Model) to interact with elements on a web page. diff --git a/sources/academy/glossary/concepts/http_cookies.md b/sources/academy/glossary/concepts/http_cookies.md deleted file mode 100644 index 472d1f0a86..0000000000 --- a/sources/academy/glossary/concepts/http_cookies.md +++ /dev/null @@ -1,20 +0,0 @@ ---- -title: HTTP cookies -description: Learn a bit about what cookies are, and how they are utilized in scrapers to appear logged-in, view specific data, or even avoid blocking. -sidebar_position: 8.2 -slug: /concepts/http-cookies ---- - -**Learn a bit about what cookies are, and how they are utilized in scrapers to appear logged-in, view specific data, or even avoid blocking.** - ---- - -HTTP cookies are small pieces of data sent by the server to the user's web browser, which are typically stored by the browser and used to send later requests to the same server. Cookies are usually represented as a string (if used together with a plain HTTP request) and sent with the request under the **Cookie** [header](./http_headers.md). - -## Most common uses of cookies in crawlers {#uses-in-crawlers} - -1. To make the website show data to you as if you were a logged-in user. -2. To make the website show location-specific data (works for websites where you could set a zip code or country directly on the page, but unfortunately doesn't work for some location-based ads). -3. To make the website less suspicious of the crawler and let the crawler's traffic blend in with regular user traffic. - -For local testing, we recommend using the [**EditThisCookie**](https://chrome.google.com/webstore/detail/fngmhnnpilhplaeedifhccceomclgfbg) Chrome extension. diff --git a/sources/academy/glossary/concepts/http_headers.md b/sources/academy/glossary/concepts/http_headers.md deleted file mode 100644 index 7b5fec6b3e..0000000000 --- a/sources/academy/glossary/concepts/http_headers.md +++ /dev/null @@ -1,48 +0,0 @@ ---- -title: HTTP headers -description: Understand what HTTP headers are, what they're used for, and three of the biggest differences between HTTP/1.1 and HTTP/2 headers. 
-sidebar_position: 8.1 -slug: /concepts/http-headers ---- - -**Understand what HTTP headers are, what they're used for, and three of the biggest differences between HTTP/1.1 and HTTP/2 headers.** - ---- - -[HTTP headers](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers) let the client and the server pass additional information with an HTTP request or response. Headers are represented by an object where the keys are header names. Headers can also contain certain authentication tokens. - -In general, there are 4 different paths you'll find yourself on when scraping a website and dealing with headers: - -## No headers {#no-headers} - -For some websites, you won't need to worry about modifying headers at all, as there are no checks or verifications in place. - -## Some default headers required {#needs-default-headers} - -Some websites will require certain default browser headers to work properly, such as **User-Agent** (though, this header is becoming more obsolete, as there are more sophisticated ways to detect and block a suspicious user). - -Another example of such a "default" header is **Referer**. Some e-commerce websites might share the same platform, and data is loaded through XMLHttpRequests to that platform, which would not know which data to return without knowing which exact website is requesting it. - -## Custom headers required {#needs-custom-headers} - -A custom header is a non-standard HTTP header used for a specific website. For example, an imaginary website of **cool-stuff.com** might have a header with the name **X_Cool_Stuff_Token** which is required for every single request to a product page. - -Dealing with cases like these usually isn't difficult, but can sometimes be tedious. - -## Very specific headers required {#needs-specific-headers} - -The most challenging websites to scrape are the ones that require a full set of site-specific headers to be included with the request. For example, not only would they potentially require proper **User-Agent** and **Referer** headers mentioned above, but also **Accept**, **Accept-Language**, **Accept-Encoding**, etc. with specific values. - -Another big one to mention is the **Cookie** header. We cover this in more detail within the [cookies](./http_cookies.md) lesson. - -You could use Chrome DevTools to inspect request headers, and [Insomnia](../tools/insomnia.md) or [Postman](../tools/postman.md) to test how the website behaves with or without specific headers. - -## HTTP/1.1 vs HTTP/2 headers {#http1-vs-http2} - -HTTP/1.1 and HTTP/2 headers have several differences. Here are the three key differences that you should be aware of: - -1. HTTP/2 headers do not include status messages. They only contain status codes. -2. Certain headers are no longer used in HTTP/2 (such as **Connection** along with a few others related to it like **Keep-Alive**). In HTTP/2, connection-specific headers are prohibited. While some browsers will ignore them, Safari and other Webkit-based browsers will outright reject any response that contains them. Easy to do by accident, and a big problem. -3. While HTTP/1.1 headers are case-insensitive and could be sent by the browsers with capitalized letters (e.g. **Accept-Encoding**, **Cache-Control**, **User-Agent**), HTTP/2 headers must be lower-cased (e.g. **accept-encoding**, **cache-control**, **user-agent**). 
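To make the above more concrete, here is a minimal sketch of sending a plain HTTP request with explicit, browser-like headers using the built-in `fetch` of Node.js 18+. The URL, the header values, and the custom token are all made-up placeholders:

```js
// Sketch: a plain HTTP request with explicit headers (all values are placeholders).
const response = await fetch('https://cool-stuff.com/product/123', {
    headers: {
        // Lower-cased header names, as required by HTTP/2.
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'accept-language': 'en-US,en;q=0.9',
        'referer': 'https://cool-stuff.com/',
        // The imaginary site-specific header mentioned above.
        'x_cool_stuff_token': 'some-token-value',
    },
});

console.log(response.status);
console.log(await response.text());
```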
- -> To learn more about the difference between HTTP/1.1 and HTTP/2 headers, check out [this](https://httptoolkit.com/blog/translating-http-2-into-http-1/) article diff --git a/sources/academy/glossary/concepts/index.md b/sources/academy/glossary/concepts/index.md deleted file mode 100644 index c61d8ed237..0000000000 --- a/sources/academy/glossary/concepts/index.md +++ /dev/null @@ -1,17 +0,0 @@ ---- -title: Concepts -description: Learn about some common yet tricky concepts and terms that are used frequently within the academy, as well as in the world of scraper development. -sidebar_position: 18 -category: glossary -slug: /concepts ---- - -**Learn about some common yet tricky concepts and terms that are used frequently within the academy, as well as in the world of scraper development.** - ---- - -You'll see some terms and concepts frequently repeated throughout various courses in the academy. Many of these concepts are common, and even fundamental in the scraping world, which makes it necessary to explain them to our course-takers; however it would be inconvenient for our readers to explain these terms each time they appear in a lesson. - -Because of this slight dilemma, and because there are no outside resources which compile all of these concepts into an educational and digestible form, we've decided to do just that. Welcome to the **Concepts** section of the Apify Academy's **Glossary**! - -> It's important to note that there is no specific order to these concepts. All of them range in their relevance and importance to your every day scraping endeavors. diff --git a/sources/academy/glossary/concepts/querying_css_selectors.md b/sources/academy/glossary/concepts/querying_css_selectors.md deleted file mode 100644 index 658554626e..0000000000 --- a/sources/academy/glossary/concepts/querying_css_selectors.md +++ /dev/null @@ -1,37 +0,0 @@ ---- -title: Querying elements -description: Learn how to query DOM elements using CSS selectors with the document.querySelector() and document.querySelectorAll() functions. -sidebar_position: 8.5 -slug: /concepts/querying-css-selectors ---- - -`document.querySelector()` and `document.querySelectorAll()` are JavaScript functions that allow you to select elements on a web page using [CSS selectors](./css_selectors.md). - -`document.querySelector()` is used to select the first element that matches the provided [CSS selector](./css_selectors.md). It returns the first matching element or null if no matching element is found. - -Here's an example of how you can use it: - -```js -const firstButton = document.querySelector('button'); -``` - -This will select the first button element on the page and store it in the variable **firstButton**. - -`document.querySelectorAll()` is used to select all elements that match the provided CSS selector. It returns a `NodeList` (a collection of elements) that can be accessed and manipulated like an array. - -Here's an example of how you can use it: - -```js -const buttons = document.querySelectorAll('button'); -``` - -This will select all button elements on the page and store them in the variable "buttons". - -Both functions can be used to access and manipulate the elements in the web page. Here's an example on how you can use it to extract the text of all buttons. 
- -```js -const buttons = document.querySelectorAll('button'); -const buttonTexts = Array.from(buttons).map((button) => button.textContent); -``` - -It's important to note that `querySelectorAll()` returns a static `NodeList`, which means it does not update automatically when the DOM changes (unlike the live collections returned by methods such as `getElementsByTagName()`). diff --git a/sources/academy/glossary/concepts/robot_process_automation.md b/sources/academy/glossary/concepts/robot_process_automation.md deleted file mode 100644 index 3671fe7cb6..0000000000 --- a/sources/academy/glossary/concepts/robot_process_automation.md +++ /dev/null @@ -1,42 +0,0 @@ ---- -title: What is robotic process automation (RPA) -description: Learn the basics of robotic process automation. Make your processes on the web and other software more efficient by automating repetitive tasks. -sidebar_position: 8.7 -slug: /concepts/robotic-process-automation ---- - -**Learn the basics of robotic process automation. Make your processes on the web and other software more efficient by automating repetitive tasks.** - ---- - -RPA allows you to create software (also known as **bots**), which can imitate your digital actions. You can program bots to perform repetitive tasks faster, more reliably and more accurately than humans. Plus, they can do these tasks all day, every day. - -## What can I use RPA for? {#what-can-i-use-rpa-for} - -You can [use](https://apify.com/use-cases/rpa) RPA to automate any repetitive task you perform using software. The tasks can range from [analyzing content](https://apify.com/jakubbalada/content-checker) to monitoring web pages for changes (such as changes in your competitors' pricing). - -Other use cases for RPA include filling forms or [uploading files](https://apify.com/lukaskrivka/google-sheets) while you get on with more important tasks. And it's not just simple tasks you can automate. How about [processing your invoices](https://apify.com/katerinahronik/toggl-invoice-download) or posting content across several marketing channels at once? - -## How does RPA work? {#how-does-rpa-work} - -In a traditional automation workflow, you: - -1. Break a repetitive process down into [manageable chunks](https://kissflow.com/workflow/workflow-automation/an-8-step-checklist-to-get-your-workflow-ready-for-automation/), e.g. open website => log into website => click button "X" => download section "Y", etc. -2. Program a bot that does each of those chunks. -3. Execute the chunks of code in the right order (or in parallel). - -With the advance of [machine learning](https://en.wikipedia.org/wiki/Machine_learning), it is becoming possible to [record](https://www.nice.com/info/rpa-guide/process-recorder-function-in-rpa/) your workflows and analyze which can be automated. However, this technology is still not perfected and at times can even be less practical than the manual process. - -## Is RPA the same as web scraping? {#is-rpa-the-same-as-web-scraping} - -While web scraping is a kind of RPA, it focuses on extracting structured data. RPA focuses on the other tasks in browsers - everything except for extracting information. - -## Additional resources {#additional-resources} - -An easy-to-follow [video](https://www.youtube.com/watch?v=9URSbTOE4YI) on what RPA is. - -To learn about RPA in plain English, check out [this](https://enterprisersproject.com/article/2019/5/rpa-robotic-process-automation-how-explain) article.
- -[This](https://www.cio.com/article/227908/what-is-rpa-robotic-process-automation-explained.html) article explains what RPA is and discusses both its advantages and disadvantages. - -You might also like to check out this article on [12 Steps to Automate Workflows](https://quandarycg.com/automating-workflows/). diff --git a/sources/academy/glossary/glossary.md b/sources/academy/glossary/glossary.md deleted file mode 100644 index 9e86844db6..0000000000 --- a/sources/academy/glossary/glossary.md +++ /dev/null @@ -1,15 +0,0 @@ ---- -title: Why a glossary? -description: Browse important web scraping concepts, tools and topics in succinct articles explaining common web development terms in a web scraping and automation context. -sidebar_position: 16 -category: glossary -slug: /glossary ---- - -**Browse important web scraping concepts, tools and topics in succinct articles explaining common web development terms in a web scraping and automation context.** - ---- - -Web scraping comes with a lot of terms that are specific to the area. Some of them are tools and libraries, like [Playwright](../webscraping/puppeteer_playwright/index.md) or Insomnia. Others are general topics that have a special place in web scraping, like headless browsers or browser fingerprints. And some topics are related to all web development, but play a special role in web scraping, such as HTTP headers and cookies. - -When writing the academy, we very early on realized that we needed a place to reference these terms, but quickly found out that the usual tutorials and guides available all over the web weren't the most ideal. The explanations were too broad and generic and did not fit the web scraping context. With the **Apify Academy** glossary, we aim to provide you with short articles and lessons that provide the necessary web scraping context for specific terms, then link to other parts of the web for further in-depth reading. diff --git a/sources/academy/glossary/tools/apify_cli.md b/sources/academy/glossary/tools/apify_cli.md deleted file mode 100644 index 8e75bb76be..0000000000 --- a/sources/academy/glossary/tools/apify_cli.md +++ /dev/null @@ -1,46 +0,0 @@ ---- -title: The Apify CLI -description: Learn about, install, and log into the Apify CLI - your best friend for interacting with the Apify platform via your terminal. -sidebar_position: 9.1 -slug: /tools/apify-cli ---- - -**Learn about, install, and log into the Apify CLI - your best friend for interacting with the Apify platform via your terminal.** - ---- - -The [Apify CLI](/cli) helps you create, develop, build and run Apify Actors, and manage the Apify cloud platform from any computer. It can be used to automatically generate the boilerplate for different types of projects, initialize projects, remotely call Actors on the platform, and run your own projects. - -## Installing {#installing} - -To install the Apify CLI, you'll first need npm, which comes preinstalled with Node.js. Additionally, make sure you've got an Apify account, as you will need to log in to the CLI to gain access to its full potential. - -Open up a terminal instance and run the following command: - -```shell -npm i -g apify-cli -``` - -This will install the CLI via npm. - -## Logging in {#logging-in} - -After the CLI has finished installing, navigate to the [Apify Console](https://console.apify.com?asrc=developers_portal) and click on **Settings**. Then, within your account settings, click **Integrations**. 
The page should look like this: - -![Integrations tab on the Apify platform](./images/settings-integrations.jpg) - -> We've censored out the **User ID** in the image because it is private information which should not be shared with anyone who is not trusted. The same goes for your **Personal API Token**. - -Copy the **Personal API Token** and return to your terminal, entering this command: - -```shell -apify login -t YOUR_TOKEN_HERE -``` - -If you see a log which looks like this, - -```text -Success: You are logged in to Apify as YOUR_USERNAME! -``` - -If you see a log which looks like **Success: You are logged in to Apify as YOUR_USERNAME!**, you're in! diff --git a/sources/academy/glossary/tools/edit_this_cookie.md b/sources/academy/glossary/tools/edit_this_cookie.md deleted file mode 100644 index 47aea1f2c5..0000000000 --- a/sources/academy/glossary/tools/edit_this_cookie.md +++ /dev/null @@ -1,48 +0,0 @@ ---- -title: EditThisCookie -description: Learn how to add, delete, and modify different cookies in your browser for testing purposes using the EditThisCookie Chrome extension. -sidebar_position: 9.7 -slug: /tools/edit-this-cookie ---- - -**Learn how to add, delete, and modify different cookies in your browser for testing purposes using the EditThisCookie Chrome extension.** - ---- - -**EditThisCookie** is a Chrome extension to manage your browser's cookies. It can be added through the [Chrome Web Store](https://chromewebstore.google.com/detail/editthiscookie-v3/ojfebgpkimhlhcblbalbfjblapadhbol). After adding it to Chrome, you'll see a button with a delicious cookie icon next to any other Chrome extensions you might have installed. Clicking on it will open a pop-up window with a list of all saved cookies associated with the currently opened page domain. - -![EditThisCookie popup](./images/edit-this-cookie-popup.png) - -## Functionalities {#functions} - -At the top of the popup, there is a row of buttons. From left to right, here is an explanation for each one: - -### Delete all cookies - -Clicking this button will remove all cookies associated with the current domain. For example, if you're logged into your Apify account and delete all the cookies, the website will ask you to log in again. - -### Reset - -A refresh button. - -### Add a new cookie - -Manually add a new cookie for the current domain. - -### Import cookies - -Allows you to add cookies in bulk. For example, if you have saved some cookies inside your crawler, or someone provided you with some cookies for the purpose of testing a certain website in your browser, they can be imported and automatically applied with this button. - -### Export cookies - -Copies an array of cookies associated with the current domain to the clipboard. The cookies can then be later inspected, added to your crawler, or imported by someone else using EditThisCookie. - -### Search - -Allows you to filter through cookies by name. - -### Options - -Will open a new browser tab with a bunch of EditThisCookie options. The options page allows you to tweak a few settings such as changing the export format, but you will most likely never need to change anything there. 
- -![EditThisCookie options](./images/edit-this-cookie-options.png) diff --git a/sources/academy/glossary/tools/images/edit-this-cookie-options.png b/sources/academy/glossary/tools/images/edit-this-cookie-options.png deleted file mode 100644 index 4d6ffe1145..0000000000 Binary files a/sources/academy/glossary/tools/images/edit-this-cookie-options.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/edit-this-cookie-popup.png b/sources/academy/glossary/tools/images/edit-this-cookie-popup.png deleted file mode 100644 index 78f20d3b9f..0000000000 Binary files a/sources/academy/glossary/tools/images/edit-this-cookie-popup.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/insomnia-cookies.png b/sources/academy/glossary/tools/images/insomnia-cookies.png deleted file mode 100644 index 4a33749978..0000000000 Binary files a/sources/academy/glossary/tools/images/insomnia-cookies.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/insomnia-interface.jpg b/sources/academy/glossary/tools/images/insomnia-interface.jpg deleted file mode 100644 index 682b79ed05..0000000000 Binary files a/sources/academy/glossary/tools/images/insomnia-interface.jpg and /dev/null differ diff --git a/sources/academy/glossary/tools/images/insomnia-manage-cookies.jpg b/sources/academy/glossary/tools/images/insomnia-manage-cookies.jpg deleted file mode 100644 index 79f877e57a..0000000000 Binary files a/sources/academy/glossary/tools/images/insomnia-manage-cookies.jpg and /dev/null differ diff --git a/sources/academy/glossary/tools/images/insomnia-proxy.png b/sources/academy/glossary/tools/images/insomnia-proxy.png deleted file mode 100644 index f1b3ef3af2..0000000000 Binary files a/sources/academy/glossary/tools/images/insomnia-proxy.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/insomnia-timeline.jpg b/sources/academy/glossary/tools/images/insomnia-timeline.jpg deleted file mode 100644 index c8af547e02..0000000000 Binary files a/sources/academy/glossary/tools/images/insomnia-timeline.jpg and /dev/null differ diff --git a/sources/academy/glossary/tools/images/js-off.png b/sources/academy/glossary/tools/images/js-off.png deleted file mode 100644 index aa191c5087..0000000000 Binary files a/sources/academy/glossary/tools/images/js-off.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/js-on.png b/sources/academy/glossary/tools/images/js-on.png deleted file mode 100644 index de5f3d98c9..0000000000 Binary files a/sources/academy/glossary/tools/images/js-on.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/modheader.jpg b/sources/academy/glossary/tools/images/modheader.jpg deleted file mode 100644 index 4948100863..0000000000 Binary files a/sources/academy/glossary/tools/images/modheader.jpg and /dev/null differ diff --git a/sources/academy/glossary/tools/images/postman-cookies-button.png b/sources/academy/glossary/tools/images/postman-cookies-button.png deleted file mode 100644 index e83c565eda..0000000000 Binary files a/sources/academy/glossary/tools/images/postman-cookies-button.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/postman-interface.png b/sources/academy/glossary/tools/images/postman-interface.png deleted file mode 100644 index 7e478fe4fd..0000000000 Binary files a/sources/academy/glossary/tools/images/postman-interface.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/postman-manage-cookies.png 
b/sources/academy/glossary/tools/images/postman-manage-cookies.png deleted file mode 100644 index 8e772efda2..0000000000 Binary files a/sources/academy/glossary/tools/images/postman-manage-cookies.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/postman-proxy.png b/sources/academy/glossary/tools/images/postman-proxy.png deleted file mode 100644 index 0f9d8903e4..0000000000 Binary files a/sources/academy/glossary/tools/images/postman-proxy.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/proxyman-apps-tab.png b/sources/academy/glossary/tools/images/proxyman-apps-tab.png deleted file mode 100644 index c4edb947ac..0000000000 Binary files a/sources/academy/glossary/tools/images/proxyman-apps-tab.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/proxyman-filter.png b/sources/academy/glossary/tools/images/proxyman-filter.png deleted file mode 100644 index 92cf4774f5..0000000000 Binary files a/sources/academy/glossary/tools/images/proxyman-filter.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/proxyman-results.jpg b/sources/academy/glossary/tools/images/proxyman-results.jpg deleted file mode 100644 index cb9fd9209b..0000000000 Binary files a/sources/academy/glossary/tools/images/proxyman-results.jpg and /dev/null differ diff --git a/sources/academy/glossary/tools/images/proxyman-view-request.jpg b/sources/academy/glossary/tools/images/proxyman-view-request.jpg deleted file mode 100644 index ea6aa1689e..0000000000 Binary files a/sources/academy/glossary/tools/images/proxyman-view-request.jpg and /dev/null differ diff --git a/sources/academy/glossary/tools/images/settings-integrations.jpg b/sources/academy/glossary/tools/images/settings-integrations.jpg deleted file mode 100644 index 26a72e6220..0000000000 Binary files a/sources/academy/glossary/tools/images/settings-integrations.jpg and /dev/null differ diff --git a/sources/academy/glossary/tools/images/switchyomega-auth.png b/sources/academy/glossary/tools/images/switchyomega-auth.png deleted file mode 100644 index 86e2b42308..0000000000 Binary files a/sources/academy/glossary/tools/images/switchyomega-auth.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/switchyomega-menu.png b/sources/academy/glossary/tools/images/switchyomega-menu.png deleted file mode 100644 index 3c873c777b..0000000000 Binary files a/sources/academy/glossary/tools/images/switchyomega-menu.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/switchyomega-options.png b/sources/academy/glossary/tools/images/switchyomega-options.png deleted file mode 100644 index abff2aadfc..0000000000 Binary files a/sources/academy/glossary/tools/images/switchyomega-options.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/switchyomega-proxy-profile.png b/sources/academy/glossary/tools/images/switchyomega-proxy-profile.png deleted file mode 100644 index 48e7b197ed..0000000000 Binary files a/sources/academy/glossary/tools/images/switchyomega-proxy-profile.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/switchyomega-proxy-settings.png b/sources/academy/glossary/tools/images/switchyomega-proxy-settings.png deleted file mode 100644 index e7e42745e6..0000000000 Binary files a/sources/academy/glossary/tools/images/switchyomega-proxy-settings.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/switchyomega.png b/sources/academy/glossary/tools/images/switchyomega.png 
deleted file mode 100644 index 35c60de17a..0000000000 Binary files a/sources/academy/glossary/tools/images/switchyomega.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/user-agent-switcher-agents.png b/sources/academy/glossary/tools/images/user-agent-switcher-agents.png deleted file mode 100644 index 4900ad2924..0000000000 Binary files a/sources/academy/glossary/tools/images/user-agent-switcher-agents.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/user-agent-switcher-config.png b/sources/academy/glossary/tools/images/user-agent-switcher-config.png deleted file mode 100644 index 390764cb77..0000000000 Binary files a/sources/academy/glossary/tools/images/user-agent-switcher-config.png and /dev/null differ diff --git a/sources/academy/glossary/tools/images/user-agent-switcher-groups.png b/sources/academy/glossary/tools/images/user-agent-switcher-groups.png deleted file mode 100644 index ddb360599f..0000000000 Binary files a/sources/academy/glossary/tools/images/user-agent-switcher-groups.png and /dev/null differ diff --git a/sources/academy/glossary/tools/index.md b/sources/academy/glossary/tools/index.md deleted file mode 100644 index 393b8dd6c5..0000000000 --- a/sources/academy/glossary/tools/index.md +++ /dev/null @@ -1,15 +0,0 @@ ---- -title: Tools -description: Discover a variety of tools that can be used to enhance the scraper development process, or even unlock doors to new scraping possibilities. -sidebar_position: 17 -category: glossary -slug: /tools ---- - -**Discover a variety of tools that can be used to enhance the scraper development process, or even unlock doors to new scraping possibilities.** - ---- - -Here at Apify, we've found many tools, some quite popular and well-known and some niche, which can aid any developer in their scraper development process. We've compiled some of our favorite developer tools into this short section. Each tool featured here serves a specific purpose, if not multiple purposes, which are directly relevant to Web Scraping and Web Automation. - -In any lesson in the academy where a tool which was not already discussed in the course is being used, a short lesson about the tool will be featured in the **Tools** section right here in the Apify Academy's **Glossary** and referenced with a link within the lesson. diff --git a/sources/academy/glossary/tools/insomnia.md b/sources/academy/glossary/tools/insomnia.md deleted file mode 100644 index f0e9058a85..0000000000 --- a/sources/academy/glossary/tools/insomnia.md +++ /dev/null @@ -1,67 +0,0 @@ ---- -title: Insomnia -description: Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers. -sidebar_position: 9.2 -slug: /tools/insomnia ---- - -**Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers.** - ---- - -Despite its name, the [Insomnia](https://insomnia.rest/download) desktop application has absolutely nothing to do with having a lack of sleep. Rather, it is a tool to build and test APIs. If you've already read about [Postman](./postman.md), you already know what Insomnia can be used for, as they both practically do the same exact things. -While Insomnia shares similarities with Postman, such as the ability to send requests with specific headers, cookies, and payloads, it has a few notable differences. One key difference is Insomnia's feature to display the entire request timeline. 
- -Insomnia can be downloaded from its [official website](https://insomnia.rest/download), and its features can be read about in the [official documentation](https://docs.insomnia.rest/). - -## The Insomnia interface {#insomnia-interface} - -After opening the app, you'll first need to create a new request. After creating the request, you'll see an interface that looks like this: - -![Insomnia interface](./images/insomnia-interface.jpg) - -Let's break down the main sections: - -### List of requests - -You can configure multiple requests with a custom payload, headers, cookies, parameters, etc. They are automatically saved in the list of requests until deleted. - -### Address bar - -The place where you select the type of request to send (**GET**, **POST**, **PUT**, **DELETE**, etc.), specify the URI of the request and send the request with the **Send** button. - -### Request options - -Here, you can add a request payload, specify authorization parameters, add query parameters, and attach headers to the request. - -### Response - -Where the response body is displayed after the request has been sent. Like in Postman, the request can be viewed in preview mode, pretty-printed, or in its raw form. This section also has the **Headers** and **Cookies** tabs, which respectively show the request headers and cookies. - -## Request timeline {#request-timeline} - -The one feature of Insomnia that separates it from Postman is the **Timeline**. - -![Request timeline](./images/insomnia-timeline.jpg) - -This feature allows you to see information about the request that is not present in the response body. - -## Using proxies in Insomnia {#using-proxies} - -In order to use a proxy, you need to specify the proxy's parameters in Insomnia's preferences. In preferences, scroll down to the **HTTP Network Proxy** section under the **General** tab and specify the full proxy URL there: - -![Configuring a proxy](./images/insomnia-proxy.png) - -## Managing the cookies cache {#managing-cookies-cache} - -Insomnia keeps the cookies for the requests you have already sent before. This might result in you receiving a different response within your scraper from what you're receiving in Insomnia, as a necessary cookie is not present in the request sent by the scraper. To check whether or not some cookies associated with a certain request have been cached, click on the **Cookies** button at the top of the list of requests: - -![Click on the "Cookies" button](./images/insomnia-cookies.png) - -This will bring up the **Manage cookies** window, where all cached cookies can be viewed, edited, or deleted. - -![The "Manage Cookies" tab](./images/insomnia-manage-cookies.jpg) - -## Postman or Insomnia {#postman-or-insomnia} - -The application you choose to use is completely up to your personal preference, and will not affect your development workflow. If viewing timelines of the requests you send is important to you, then you should go with Insomnia; however, if that doesn't matter, choose the one that has the most intuitive interface for you. diff --git a/sources/academy/glossary/tools/modheader.md b/sources/academy/glossary/tools/modheader.md deleted file mode 100644 index e8c92eac5e..0000000000 --- a/sources/academy/glossary/tools/modheader.md +++ /dev/null @@ -1,26 +0,0 @@ ---- -title: ModHeader -description: Discover a super useful Chrome extension called ModHeader, which allows you to modify your browser's HTTP request headers. 
-sidebar_position: 9.5 -slug: /tools/modheader ---- - -**Discover a super useful Chrome extension called ModHeader, which allows you to modify your browser's HTTP request headers.** - ---- - -If you read about [Postman](./postman.md), you might remember that you can use it to modify request headers before sending a request. This is great, but the main problem is that Postman can only make static requests - meaning, it is unable to load JavaScript or any [dynamic content](../concepts/dynamic_pages.md). - -[ModHeader](https://chrome.google.com/webstore/detail/idgpnmonknjnojddfkpgkljpfnnfcklj) is a Chrome extension which can be used to modify the HTTP headers of the requests you make with your browser. This means that, for example, if your scraper using a headless browser Puppeteer is being blocked due to an improper **User-Agent** header, you can use ModHeader to test the target website and quickly solve the issue. - -## The ModHeader interface {#interface} - -After you install the ModHeader extension, you should see it pinned in Chrome's task bar. When you click it, you'll see an interface like this pop up: - -![Modheader's interface](./images/modheader.jpg) - -Here, you can add headers, remove headers, and even save multiple collections of headers that you can toggle between (which are called **Profiles** within the extension itself). - -## Use cases {#use-cases} - -When scraping dynamic websites, sometimes some specific headers are required to access certain pages. The most popularly required headers are generally `User-Agent` and `referer`. ModHeader, and other tools like it, make it easy to test requests to these websites right in your browser before writing logic for your scraper. diff --git a/sources/academy/glossary/tools/postman.md b/sources/academy/glossary/tools/postman.md deleted file mode 100644 index d897a92dbb..0000000000 --- a/sources/academy/glossary/tools/postman.md +++ /dev/null @@ -1,64 +0,0 @@ ---- -title: Postman -description: Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers. -sidebar_position: 9.3 -slug: /tools/postman ---- - -**Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers.** - ---- - -[Postman](https://www.postman.com/) is a powerful collaboration platform for API development and testing. For scraping use-cases, it's mainly used to test requests and proxies (such as checking the response body of a raw request, without loading any additional resources such as JavaScript or CSS). This tool can do much more than that, but we will not be discussing all of its capabilities here. Postman allows us to test requests with cookies, headers, and payloads so that we can be entirely sure what the response looks like for a request URL we plan to eventually use in a scraper. - -The desktop app can be downloaded from its [official download page](https://www.postman.com/downloads/), or the web app can be used with a signup - no download required. If this is your first time working with a tool like Postman, we recommend checking out their [Getting Started guide](https://learning.postman.com/docs/introduction/overview/). - -## Understanding the interface {#understanding-the-interface} - -![A basic outline of Postman's interface](./images/postman-interface.png) - -Following four sections are essential to get familiar with Postman: - -### Tabs - -Multiple test endpoints/requests can be opened at one time, each of which will be held within its own tab. 
- -### Address bar - -The section in which you select the type of request to send, the URL of the request, and of course, send the request with the **Send Request** button. - -### Request options - -This is a very useful section where you can view and edit structured query parameters, as well as specify any authorization parameters, headers, or payloads. - -### Response - -After sending a request, the response's body will be found here, along with its cookies and headers. The response body can be viewed in various formats - **Pretty-Print**, **Raw**, or **Preview**. - -## Using and testing proxies {#using-proxies} - -In order to use a proxy, the proxy's server and configuration must be provided in the **Proxy** tab in Postman settings. - -![Proxy configuration in Postman settings](./images/postman-proxy.png) - -After configuring a proxy, the next request sent will attempt to use it. To switch off the proxy, its details don't need to be deleted. The **Add a custom proxy configuration** option in settings needs to be un-ticked to disable it. - -## Managing the cookies cache {#managing-cookies} - -Postman keeps a cache of the cookies from all previous responses of a certain domain, which can be a blessing, but also a curse. Sometimes, you might notice that a request is going through just fine with Postman, but that your scraper is being blocked. - -More often than not in these cases, the reason is because the endpoint being reached requires a valid `cookie` header to be present when sending the request, and because of Postman's cache, it is sending a valid cookie within each request's headers, while your scraper is not. Another reason this may happen is because you are sending Postman requests without a proxy (using your local IP address), while your scraper is using a proxy that could potentially be getting blocked. - -In order to check whether there are any cookies associated with a certain request are cached in Postman, click on the **Cookies** button in any opened request tab: - -![Button to view the cached cookies](./images/postman-cookies-button.png) - -Clicking on this button opens a **MANAGE COOKIES** window, where a list of all cached cookies per domain can be seen. If we had been previously sending multiple requests to **https://github.com/apify**, within this window we would be able to find cached cookies associated with github.com. Cookies can also be edited (to update some specific values), or deleted (to send a "clean" request without any cached data) here. - -![Managing cookies in Postman with the "MANAGE COOKIES" window](./images/postman-manage-cookies.png) - -### Some alternatives to Postman {#alternatives} - -- [Hoppscotch](https://hoppscotch.io/) -- [Insomnia](./insomnia.md) -- [Testfully](https://testfully.io/) diff --git a/sources/academy/glossary/tools/proxyman.md b/sources/academy/glossary/tools/proxyman.md deleted file mode 100644 index 3d48028dc1..0000000000 --- a/sources/academy/glossary/tools/proxyman.md +++ /dev/null @@ -1,48 +0,0 @@ ---- -title: Proxyman -description: Learn about Proxyman, a tool for viewing all network requests that are coming through your system. Filter by response type, by a keyword, or by application. -sidebar_position: 9.4 -slug: /tools/proxyman ---- - -**Learn about Proxyman, a tool for viewing all network requests that are coming through your system. 
Filter by response type, by a keyword, or by application.** - ---- - -Though the name sounds very similar to [Postman](./postman.md), [**Proxyman**](https://proxyman.io/) is used for a different purpose. Rather than for manually sending and analyzing the responses of requests, Proxyman is a tool for macOS that allows you to view and analyze the HTTP/HTTPS requests that are going through your device. This is done by routing all of your requests through a proxy, which intercepts them and allows you to view data about them. Because it's just a proxy, the HTTP/HTTPS requests going through iOS devices, Android devices, and even iOS simulators can also be viewed with Proxyman. - -If you've already gone through the [**Locating and learning** lesson](../../webscraping/api_scraping/general_api_scraping/locating_and_learning.md) in the **API scraping** section, you can think of Proxyman as an advanced Network Tab, where you can see requests that you sometimes can't see in regular browser DevTools. - -## The basics {#the-basics} - -Though the application offers a whole lot of advanced features, there are only a few main features you'll be utilizing when using Proxyman for scraper development purposes. Let's open up Proxyman and take a look at some of the basic features: - -### Apps - -The **Apps** tab allows you to both view all of the applications on your machine which are sending requests, as well as filter requests based on application. - -![Apps tab in Proxyman](./images/proxyman-apps-tab.png) - -### Results - -Let's open up Safari and visit **apify.com**, then check back in Proxyman to see all of the requests Safari has made when visiting the website. - -![Results in Proxyman](./images/proxyman-results.jpg) - -We can see all of the requests related to us visiting **apify.com**. Then, by clicking a request, we can see a whole lot of information about it. The most important information for you, however, will usually be the request and response **headers** and **body**. - -![View a request](./images/proxyman-view-request.jpg) - -### Filtering - -Sometimes, there can be hundreds (or even thousands) of requests that appear in the list. Rather than spending your time rooting through all of them, you can use the plethora of filtering methods that Proxyman offers to find exactly what you are looking for. - -![Filter requests with the filter options](./images/proxyman-filter.png) - -## Alternatives {#alternatives} - -Since Proxyman is only available for macOS, it's only appropriate to list some alternatives to it that are accessible to our Windows and Linux friends: - -- [Burp Suite](https://portswigger.net/burp) -- [Charles Proxy](https://www.charlesproxy.com/documentation/installation/) -- [Fiddler](https://www.telerik.com/fiddler) diff --git a/sources/academy/glossary/tools/quick_javascript_switcher.md b/sources/academy/glossary/tools/quick_javascript_switcher.md deleted file mode 100644 index 543771697e..0000000000 --- a/sources/academy/glossary/tools/quick_javascript_switcher.md +++ /dev/null @@ -1,18 +0,0 @@ ---- -title: Quick JavaScript Switcher -description: Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs. -sidebar_position: 9.9 -slug: /tools/quick-javascript-switcher ---- - -**Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. 
Great for detecting SPAs.** - ---- - -**Quick JavaScript Switcher** is a Chrome extension that allows you to switch on/off the JavaScript for the current page with one click. It can be added to your browser via the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see its respective button next to any other Chrome extensions you might have installed. - -If JavaScript is enabled - clicking the button will switch it off and reload the page. The next click will re-enable JavaScript and refresh the page. This extension is useful for checking whether a certain website will work without JavaScript (and thus could be parsed without using a browser with a plain HTTP request) or not. - -![JavaScript toggled on (enabled)](./images/js-on.png) - -![JavaScript toggled off (disabled)](./images/js-off.png) diff --git a/sources/academy/glossary/tools/switchyomega.md b/sources/academy/glossary/tools/switchyomega.md deleted file mode 100644 index 60c72afdce..0000000000 --- a/sources/academy/glossary/tools/switchyomega.md +++ /dev/null @@ -1,49 +0,0 @@ ---- -title: SwitchyOmega -description: Discover SwitchyOmega, a Chrome extension to manage and switch between proxies, which is extremely useful when testing proxies for a scraper. -sidebar_position: 9.6 -slug: /tools/switchyomega ---- - -**Discover SwitchyOmega, a Chrome extension to manage and switch between proxies, which is extremely useful when testing proxies for a scraper.** - ---- - -SwitchyOmega is a Chrome extension for managing and switching between proxies which can be added in the [Chrome Webstore](https://chrome.google.com/webstore/detail/padekgcemlokbadohgkifijomclgjgif). - -After adding it to Chrome, you can see the SwitchyOmega icon somewhere amongst all your other Chrome extension icons. Clicking on it will display a menu, where you can select various different connection profiles, as well as open the extension's options. - -![The SwitchyOmega interface](./images/switchyomega.png) - -## Options {#options} - -The options page has the following: - -- General settings/interface settings (which you can keep to their default values). -- A list of proxy profiles (separate profiles can be added for different proxy groups, or for different countries for the residential proxy group, etc). -- The **New profile** button -- The main section, which shows the selected settings sub-section or selected proxy profile connection settings. - -![SwitchyOmega options page](./images/switchyomega-options.png) - -## Adding a new proxy {#adding-a-new-proxy} - -After clicking on **New profile**, you'll be greeted with a **New profile** popup, where you can give the profile a name and select the type of profile you'd like to create. To add a proxy profile, select the respective option and click **Create**. - -![Adding a proxy profile](./images/switchyomega-proxy-profile.png) - -Then, you need to fill in the proxy settings: - -![Adding proxy settings](./images/switchyomega-proxy-settings.png) - -If the proxy requires authentication, click on the lock icon and fill in the details within the popup. - -![Authenticating a proxy](./images/switchyomega-auth.png) - -Don't forget to click on **Apply changes** within the left-hand side menu under **Actions**! - -## Selecting proxy profiles {#selecting-profiles} - -And that's it! All of your proxy profiles will appear in the menu. When one is chosen, the page you are currently on will be reloaded using the selected proxy profile. 
- -![SwitchyOmega menu](./images/switchyomega-menu.png) diff --git a/sources/academy/glossary/tools/user_agent_switcher.md b/sources/academy/glossary/tools/user_agent_switcher.md deleted file mode 100644 index 3fa3211bcc..0000000000 --- a/sources/academy/glossary/tools/user_agent_switcher.md +++ /dev/null @@ -1,26 +0,0 @@ ---- -title: User-Agent Switcher -description: Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes. -sidebar_position: 9.8 -slug: /tools/user-agent-switcher ---- - -**Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.** - ---- - -**User-Agent Switcher** is a Chrome extension that allows you to quickly change your **User-Agent** and see how a certain website would behave with different user agents. After adding it to Chrome, you'll see a **Chrome UA Spoofer** button in the extension icons area. Clicking on it will open up a list of various **User-Agent** groups. - -![User-Agent Switcher groups](./images/user-agent-switcher-groups.png) - -Clicking on a group will display a list of possible User-Agents to set. - -![Default available Internet Explorer agents](./images/user-agent-switcher-agents.png) - -After setting the **User-Agent**, the page will be refreshed. - -## Configuration - -The extension configuration page allows you to edit the **User-Agent** list in case you want to add a specific User-Agent that isn't already provided. You can find some other options, but most likely you will never need to modify those. - -![User-Agent Switcher configuration page](./images/user-agent-switcher-config.png) diff --git a/sources/academy/platform/deploying_your_code/deploying.md b/sources/academy/platform/deploying_your_code/deploying.md index 25d530c19c..2f3185affa 100644 --- a/sources/academy/platform/deploying_your_code/deploying.md +++ b/sources/academy/platform/deploying_your_code/deploying.md @@ -51,7 +51,7 @@ That's it! The Actor should now pull its source code from the repository and aut :::info CLI prerequisite -If you don't yet have the Apify CLI, learn how to install it and log in by following along with [this brief lesson](../../glossary/tools/apify_cli.md) about it. +If you don't yet have the Apify CLI, learn how to install it and use it in [Apify CLI documentation](/cli/docs/installation). ::: diff --git a/sources/academy/platform/expert_scraping_with_apify/index.md b/sources/academy/platform/expert_scraping_with_apify/index.md index 87bb4a7178..452737ce32 100644 --- a/sources/academy/platform/expert_scraping_with_apify/index.md +++ b/sources/academy/platform/expert_scraping_with_apify/index.md @@ -22,7 +22,7 @@ Before developing a pro-level Apify scraper, there are some important things you If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5–10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to the [Using a scraping framework with Node.js](../../webscraping/scraping_basics_javascript/12_framework.md) lesson of the **Web scraping basics for JavaScript devs** course. To familiarize yourself with the Apify SDK, you can refer to the [Apify Platform](../apify_platform.md) category. -The Apify CLI will play a core role in the running and testing of the Actor you will build, so if you haven't gotten it installed already, please refer to [this short lesson](../../glossary/tools/apify_cli.md). 
+The Apify CLI will play a core role in the running and testing of the Actor you will build, so if you haven't gotten it installed already, please refer to the [Apify CLI documentation](/cli/docs/installation). ### Git {#git} diff --git a/sources/academy/platform/getting_started/apify_api.md b/sources/academy/platform/getting_started/apify_api.md index 5038f9f23e..03af085146 100644 --- a/sources/academy/platform/getting_started/apify_api.md +++ b/sources/academy/platform/getting_started/apify_api.md @@ -31,7 +31,7 @@ In this lesson, we'll only be focusing on this one endpoint, as it is the most p ::: -Now, let's move over to our favorite HTTP client (in this lesson we'll use [Insomnia](../../glossary/tools/insomnia.md) in order to prepare and send the request). +Now, let's move over to our favorite HTTP client (in this lesson we'll use Insomnia in order to prepare and send the request). ## Providing input @@ -51,7 +51,7 @@ Additional parameters can be passed to this endpoint. You can learn about them i :::caution Token security -Network components can record visited URLs, so it's more secure to send the token as a HTTP header, not as a parameter. The header should look like `Authorization: Bearer YOUR_TOKEN`. Popular HTTP clients, such as [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md), provide a convenient way to configure the Authorization header for all your API requests. +Network components can record visited URLs, so it's more secure to send the token as an HTTP header, not as a parameter. The header should look like `Authorization: Bearer YOUR_TOKEN`. Popular HTTP clients, such as Postman or Insomnia, provide a convenient way to configure the Authorization header for all your API requests. ::: diff --git a/sources/academy/sidebars.js b/sources/academy/sidebars.js index d274d0af15..5ee76a9f03 100644 --- a/sources/academy/sidebars.js +++ b/sources/academy/sidebars.js @@ -51,18 +51,4 @@ module.exports = { ], }, ], - glossary: [ - { - type: 'category', - label: 'Glossary', - collapsible: false, - className: 'section-header', - items: [ - { - type: 'autogenerated', - dirName: 'glossary', - }, - ], - }, - ], }; diff --git a/sources/academy/tutorials/node_js/choosing_the_right_scraper.md b/sources/academy/tutorials/node_js/choosing_the_right_scraper.md index 40bec3fa39..296bf338e5 100644 --- a/sources/academy/tutorials/node_js/choosing_the_right_scraper.md +++ b/sources/academy/tutorials/node_js/choosing_the_right_scraper.md @@ -28,7 +28,7 @@ Some websites do not load any data without a browser, as they need to execute so ## Making the choice {#making-the-choice} -When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the [Quick JavaScript Switcher](../../glossary/tools/quick_javascript_switcher.md) extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md) or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go. +When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not.
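One way to run that check programmatically is to fetch the page with a plain HTTP client and look for the data in the raw HTML. A minimal sketch using got-scraping and Cheerio - the URL and selector are placeholders, so adjust them to the site you're inspecting:

```js
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';

// Placeholder URL - use the page you actually want to scrape.
const { body } = await gotScraping({ url: 'https://example.com/on-sale' });
const $ = cheerio.load(body);

// If this prints the data you're after, the page likely works without JavaScript.
console.log($('.product-title').first().text());
```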
Probably the easiest way to do so is to use the Quick JavaScript Switcher extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using Postman or Insomnia or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go. It also depends of course on whether you need to fill in some data (like a username and password) or select a location (such as entering a zip code manually). Tasks where interacting with the page is absolutely necessary cannot be done using plain HTTP scraping, and require headless browsers. In some cases, you might also decide to use a browser-based solution in order to better blend in with the rest of the "regular" traffic coming from real users. diff --git a/sources/academy/webscraping/anti_scraping/index.md b/sources/academy/webscraping/anti_scraping/index.md index 2c3bf31698..45b081b50e 100644 --- a/sources/academy/webscraping/anti_scraping/index.md +++ b/sources/academy/webscraping/anti_scraping/index.md @@ -32,7 +32,7 @@ In the vast majority of cases, this configuration should lead to success. Succes If the above tips didn't help, you can try to fiddle with the following: - Try different browsers. Crawlee & Playwright support Chromium, Firefox and WebKit out of the box. You can also try the [Brave browser](https://brave.com) which [can be configured for Playwright](https://blog.apify.com/unlocking-the-potential-of-brave-and-playwright-for-browser-automation/). -- Don't use browsers at all. Sometimes the anti-scraping protections are extremely sensitive to browser behavior but will allow plain HTTP requests (with the right headers) just fine. Don't forget to match the specific [HTTP headers](/academy/concepts/http-headers) for each request. +- Don't use browsers at all. Sometimes the anti-scraping protections are extremely sensitive to browser behavior but will allow plain HTTP requests (with the right headers) just fine. Don't forget to match the specific [HTTP headers](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers) for each request. - Decrease concurrency. Slower scraping means you can blend in better with the rest of the traffic. - Add human-like behavior. Don't traverse the website like a bot (paginating quickly from 1 to 100). Instead, visit various types of pages, add time randomizations and you can even introduce some mouse movements and clicks. - Try Puppeteer with the [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth) plugin. Generally, Crawlee's default configuration should have stronger bypassing but some features might land first in the stealth plugin. @@ -115,7 +115,7 @@ This is the most straightforward and standard protection, which is mainly implem ### Header checking -This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific [header](../../glossary/concepts/http_headers.md) sets which they send along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. 
Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers. +This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific header sets which they send along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers. ### URL analysis diff --git a/sources/academy/webscraping/anti_scraping/techniques/captchas.md b/sources/academy/webscraping/anti_scraping/techniques/captchas.md index 466f947a89..1f444c7530 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/captchas.md +++ b/sources/academy/webscraping/anti_scraping/techniques/captchas.md @@ -21,7 +21,7 @@ When you've hit a captcha, your first thought should not be how to programmatica Have you expended all of the possible options to make your scraper appear more human-like? Are you: - Using [proxies](../mitigation/proxies.md)? -- Making the request with the proper [headers](../../../glossary/concepts/http_headers.md) and [cookies](../../../glossary/concepts/http_cookies.md)? +- Making the request with the proper headers and cookies? - Generating and using a custom [browser fingerprint](./fingerprinting.md)? - Trying different general scraping methods (HTTP scraping, browser scraping)? If you are using browser scraping, have you tried using a different browser? diff --git a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md index 788de4d0e8..835ba22376 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md @@ -21,7 +21,7 @@ To collect a good fingerprint, websites must collect them from various places. ### From HTTP headers {#from-http-headers} -Several [HTTP headers](../../../glossary/concepts/http_headers.md) can be used to create a fingerprint about a user. Here are some of the main ones: +Several HTTP headers can be used to create a fingerprint about a user. Here are some of the main ones: 1. **User-Agent** provides information about the browser and its operating system (including its versions). 2. **Accept** tells the server what content types the browser can render and send, and **Content-Encoding** provides data about the content compression. diff --git a/sources/academy/webscraping/anti_scraping/techniques/firewalls.md b/sources/academy/webscraping/anti_scraping/techniques/firewalls.md index cf190817ab..bc33230ad7 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/firewalls.md +++ b/sources/academy/webscraping/anti_scraping/techniques/firewalls.md @@ -22,7 +22,7 @@ WAFs work on a similar premise as regular firewalls. Web admins define the rules 1. The visitor sends a request to the webpage. 2. The request is intercepted by the firewall. 3. The firewall decides if presenting a challenge (captcha) is necessary. 
If the user already solved a captcha in the past or nothing is suspicious, it will immediately forward the request to the application's server. -4. A captcha is presented which must be solved. Once it is solved, a [cookie](../../../glossary/concepts/http_cookies.md) is stored in the visitor's browser. +4. A captcha is presented which must be solved. Once it is solved, a cookie is stored in the visitor's browser. 5. The request is forwarded to the application's server. ![Cloudflare WAP workflow](./images/cloudflare-graphic.jpg) @@ -32,9 +32,9 @@ Since there are multiple providers, it is essential to say that the challenges a ## Bypassing web-application firewalls {#bypassing-firewalls} - Using [proxies](../mitigation/proxies.md). -- Mocking [headers](../../../glossary/concepts/http_headers.md). +- Mocking headers. - Overriding the browser's [fingerprint](./fingerprinting.md) (most effective). -- Farming the [cookies](../../../glossary/concepts/http_cookies.md) from a website with a headless browser, then using the farmed cookies to do HTTP based scraping (most performant). +- Farming the cookies from a website with a headless browser, then using the farmed cookies to do HTTP based scraping (most performant). As you likely already know, there is no solution that fits all. If you are struggling to get past a WAF provider, you can try using Firefox with Playwright. diff --git a/sources/academy/webscraping/anti_scraping/techniques/geolocation.md b/sources/academy/webscraping/anti_scraping/techniques/geolocation.md index 1364ba58dd..ee5f886445 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/geolocation.md +++ b/sources/academy/webscraping/anti_scraping/techniques/geolocation.md @@ -13,7 +13,7 @@ Geolocation is yet another way websites can detect and block access or show limi ## Cookies & headers {#cookies-headers} -Certain websites might use certain location-specific/language-specific [headers](../../../glossary/concepts/http_headers.md)/[cookies](../../../glossary/concepts/http_cookies.md) to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)). +Certain websites might use certain location-specific/language-specific headers/cookies to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)). On targets which are utilizing just cookies and headers to identify the location from which a request is coming from, it is pretty straightforward to make requests which appear like they are coming from somewhere else. @@ -21,7 +21,7 @@ On targets which are utilizing just cookies and headers to identify the location The oldest (and still most common) way of geolocating is based on the IP address used to make the request. Sometimes, country-specific sites block themselves from being accessed from any other country (some Chinese, Indian, Israeli, and Japanese websites do this). -[Proxies](../mitigation/proxies.md) can be used in a scraper to bypass restrictions and to make requests from a different location. Oftentimes, proxies need to be used in combination with location-specific [cookies](../../../glossary/concepts/http_cookies.md)/[headers](../../../glossary/concepts/http_headers.md). 
+[Proxies](../mitigation/proxies.md) can be used in a scraper to bypass restrictions and to make requests from a different location. Oftentimes, proxies need to be used in combination with location-specific cookies/headers. ## Override/emulate geolocation when using a browser-based scraper {#override-emulate-geolocation} diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md b/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md index 8afd602af8..864c8158f8 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md @@ -88,7 +88,7 @@ const response = await gotScraping({ ## Tokens {#tokens} -For our SoundCloud example, testing the endpoint from the previous section in a tool like [Postman](../../../glossary/tools/postman.md) works perfectly, and returns the data we want; however, when the `client_id` parameter is removed, we receive a **401 Unauthorized** error. Luckily, the Client ID is the same for every user, which means that it is not tied to a session or an IP address (this is based on our own observations and tests). The big downfall is that the token being used by SoundCloud changes every few weeks, so it shouldn't be hardcoded. This case is actually quite common, and is not only seen with SoundCloud. +For our SoundCloud example, testing the endpoint from the previous section in a tool like Postman works perfectly, and returns the data we want; however, when the `client_id` parameter is removed, we receive a **401 Unauthorized** error. Luckily, the Client ID is the same for every user, which means that it is not tied to a session or an IP address (this is based on our own observations and tests). The big downfall is that the token being used by SoundCloud changes every few weeks, so it shouldn't be hardcoded. This case is actually quite common, and is not only seen with SoundCloud. Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programmatically instead. diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md b/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md index 2f4638ba71..df66bacb82 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md @@ -23,7 +23,7 @@ Cheddar website was changed and the below example no longer works there. 
Nonethe ::: -In order to perform introspection on our [target website](https://www.cheddar.com), we need to make a request to their GraphQL API with this introspection query using [Insomnia](../../../glossary/tools/insomnia.md) or another HTTP client that supports GraphQL: +In order to perform introspection on our [target website](https://www.cheddar.com), we need to make a request to their GraphQL API with this introspection query using Insomnia or another HTTP client that supports GraphQL: > To make a GraphQL query in Insomnia, make sure you've set the HTTP method to **POST** and the request body type to **GraphQL Query**. @@ -201,7 +201,7 @@ If the target website is smart, they will have introspection disabled. One of th ![Introspection disabled](./images/introspection-disabled.png) -In these cases, it is still possible to get some information about the API when using [Insomnia](../../../glossary/tools/insomnia.md) or [Postman](../../../glossary/tools/postman.md), due to the autocomplete that they provide. If we remember from the [Building a query](#building-a-query) section of this lesson, we were able to receive autocomplete suggestions when we entered a non-existent field into the query. Though this is not as great as seeing an entire visualization of the API in GraphQL Voyager, it can still be quite helpful. +In these cases, it is still possible to get some information about the API when using Insomnia or Postman, due to the autocomplete that they provide. If we remember from the [Building a query](#building-a-query) section of this lesson, we were able to receive autocomplete suggestions when we entered a non-existent field into the query. Though this is not as great as seeing an entire visualization of the API in GraphQL Voyager, it can still be quite helpful. ## Next up {#next} diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md index a3fad312e0..ebada4e1b3 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md @@ -122,7 +122,7 @@ const emailsToSend = [ ]; ``` -What we could do is log in 3 different times, then automate the sending of each email; however, this is extremely inefficient. When you log into a website, one of the main things that allows you to stay logged in and perform actions on your account is the [cookies](../../../glossary/concepts/http_cookies.md) stored in your browser. These cookies tell the website that you have been authenticated, and that you have the permissions required to modify your account. +What we could do is log in 3 different times, then automate the sending of each email; however, this is extremely inefficient. When you log into a website, one of the main things that allows you to stay logged in and perform actions on your account is the cookies stored in your browser. These cookies tell the website that you have been authenticated, and that you have the permissions required to modify your account. With this knowledge of cookies, it can be concluded that we can pass the cookies generated by the code above right into each new browser context that we use to send each email. That way, we won't have to run the login flow each time. 
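A rough Playwright-flavored sketch of that hand-off is below - it assumes `browser` is the running browser and `context` is the context where the login flow above has already completed, and the account URL is a placeholder:

```js
// Grab the session cookies from the context that is already logged in.
const cookies = await context.cookies();

// Reuse them in a fresh context so the new page starts out authenticated.
const newContext = await browser.newContext();
await newContext.addCookies(cookies);

const page = await newContext.newPage();
await page.goto('https://example.com/account'); // placeholder URL
```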
diff --git a/sources/academy/webscraping/puppeteer_playwright/index.md b/sources/academy/webscraping/puppeteer_playwright/index.md index 57c62f560f..4471871294 100644 --- a/sources/academy/webscraping/puppeteer_playwright/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/index.md @@ -23,7 +23,7 @@ Both packages were developed by the same team and are very similar, which is why When automating a headless browser, you can do a whole lot more in comparison to making HTTP requests for static content. In fact, you can programmatically do pretty much anything a human could do with a browser, such as clicking elements, taking screenshots, typing into text areas, etc. -Additionally, since the requests aren't static, [dynamic content](../../glossary/concepts/dynamic_pages.md) can be rendered and interacted with (or, data from the dynamic content can be scraped). Turn on the [headful mode](https://playwright.dev/docs/api/class-testoptions#test-options-headless) (`headless: false`) to see exactly what the browser is doing. +Additionally, since the requests aren't static, dynamic content can be rendered and interacted with (or, data from the dynamic content can be scraped). Turn on the [headful mode](https://playwright.dev/docs/api/class-testoptions#test-options-headless) (`headless: false`) to see exactly what the browser is doing. Browsers can also be effective for [overcoming anti-scraping measures](../anti_scraping/index.md), especially if the website is running [JavaScript browser challenges](../anti_scraping/techniques/browser_challenges.md). diff --git a/sources/academy/webscraping/puppeteer_playwright/page/waiting.md b/sources/academy/webscraping/puppeteer_playwright/page/waiting.md index fee881f213..98ca24bc39 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/waiting.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/waiting.md @@ -12,11 +12,11 @@ import TabItem from '@theme/TabItem'; --- -In a perfect world, every piece of content served on a website would be loaded instantaneously. We don't live in a perfect world though, and often times it can take anywhere between 1/10th of a second to a few seconds to load some content onto a page. Certain elements are also [generated dynamically](../../../glossary/concepts/dynamic_pages.md), which means that they are not present in the initial HTML and that they are created by scripts or data from API calls. +In a perfect world, every piece of content served on a website would be loaded instantaneously. We don't live in a perfect world though, and often times it can take anywhere between 1/10th of a second to a few seconds to load some content onto a page. Certain elements are also generated dynamically, which means that they are not present in the initial HTML and that they are created by scripts or data from API calls. Puppeteer and Playwright don't sit around waiting for a page (or specific elements) to load though - if we tell it to do something with an element that hasn't been rendered yet, it'll start trying to do it (which will result in nasty errors). We've got to tell it to wait. -> For a thorough explanation on how dynamic rendering works, give [**Dynamic pages**](../../../glossary/concepts/dynamic_pages.md) a quick readover, and check out the examples. +> For a thorough explanation on how dynamic rendering works, give **Dynamic pages** a quick readover, and check out the examples. Different events and elements can be waited for using the various `waitFor...` methods offered. 
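For instance, a small sketch of waiting for a dynamically rendered element before clicking it - `page` is assumed to already exist, and `.search-result` is a placeholder selector:

```js
// Wait until the element is actually present in the DOM, then interact with it.
await page.waitForSelector('.search-result');
await page.click('.search-result');
```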
@@ -56,7 +56,7 @@ Now, we won't see the error message anymore, and the first result will be succes If we remember properly, after clicking the first result, we want to console log the title of the result's page and save a screenshot into the filesystem. In order to grab a solid screenshot of the loaded page though, we should **wait for navigation** before snapping the image. This can be done with [`page.waitForNavigation()`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-pagewaitfornavigationoptions). -> A navigation is when a new [page load](../../../glossary/concepts/dynamic_pages.md) happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire. +> A navigation is when a new page load happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire. Naively, you might immediately think that this is the way we should wait for navigation after clicking the first result: diff --git a/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md b/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md index 69902e83b3..94545b3e7d 100644 --- a/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md +++ b/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md @@ -31,7 +31,7 @@ router.addHandler(labels.PRODUCT, async ({ $ }) => { ``` -Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up [Proxyman](../../../glossary/tools/proxyman.md) to analyze requests which we might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers: +Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up Proxyman to analyze requests which we might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers: ![View offers button](./images/view-offers-button.jpg) diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/browser_devtools.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/browser_devtools.md index 805d101fe5..68239e8282 100644 --- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/browser_devtools.md +++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/browser_devtools.md @@ -20,7 +20,7 @@ Even though DevTools stands for developer tools, everyone can use them to inspec ## Elements tab {#elements-tab} -When you first open Chrome DevTools on Wikipedia, you will start on the Elements tab (In Firefox it's called the **Inspector**). You can use this tab to inspect the page's HTML on the left hand side, and its CSS on the right. The items in the HTML view are called [**elements**](../../../glossary/concepts/html_elements.md). +When you first open Chrome DevTools on Wikipedia, you will start on the Elements tab (In Firefox it's called the **Inspector**). You can use this tab to inspect the page's HTML on the left hand side, and its CSS on the right. The items in the HTML view are called **elements**. 
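As a small aside, once an element is selected in the Elements tab, the browser Console lets you reference it as `$0`, which is handy for quick experiments:

```js
// `$0` refers to the element currently selected in the Elements tab.
console.log($0.textContent);
```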
![Elements tab in Chrome DevTools](./images/browser-devtools-elements-tab.png) diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/using_devtools.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/using_devtools.md index d3d56c1c28..f29b8d64ac 100644 --- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/using_devtools.md +++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/using_devtools.md @@ -14,7 +14,7 @@ import LegacyAdmonition from '../../scraping_basics/_legacy.mdx'; --- -With the knowledge of the basics of DevTools we can finally try doing something more practical - extracting data from a website. Let's try collecting the on-sale products from the [Warehouse store](https://warehouse-theme-metal.myshopify.com/). We will use [CSS selectors](../../../glossary/concepts/css_selectors.md), JavaScript, and DevTools to achieve this task. +With the knowledge of the basics of DevTools we can finally try doing something more practical - extracting data from a website. Let's try collecting the on-sale products from the [Warehouse store](https://warehouse-theme-metal.myshopify.com/). We will use CSS selectors, JavaScript, and DevTools to achieve this task. > **Why use a Shopify demo and not a real e-commerce store like Amazon?** Because real websites are usually bulkier, littered with promotions, and they change very often. Many have multiple versions of pages, and you never know in advance which one you will get. It will be important to learn how to deal with these challenges in the future, but for this beginner course, we want to have a light and stable environment. > @@ -44,7 +44,7 @@ Now that we know how the parent element looks, we can extract its data, includin ## Selecting elements in Console {#selecting-elements} -We know how to find an element manually using the DevTools, but that's not very useful for automated scraping. We need to tell the computer how to find it as well. We can do that using JavaScript and CSS selectors. The function to do that is called [`document.querySelector()`](../../../glossary/concepts/querying_css_selectors.md) and it will find the first element in the page's HTML matching the provided [CSS selector](../../../glossary/concepts/css_selectors.md). +We know how to find an element manually using the DevTools, but that's not very useful for automated scraping. We need to tell the computer how to find it as well. We can do that using JavaScript and CSS selectors. The function to do that is called `document.querySelector()` and it will find the first element in the page's HTML matching the provided CSS selector. For example `document.querySelector('div')` will find the first `<div>` element. And `document.querySelector('.my-class')` (notice the period `.`) will find the first element with the class `my-class`, such as `<div class="my-class">` or `<p class="my-class">`. @@ -70,7 +70,7 @@ When we look more closely by hovering over the result in the Console, we find th ![Hover over a query result](./images/devtools-collection-query-hover.png) -We need a different function: [`document.querySelectorAll()`](../../../glossary/concepts/querying_css_selectors.md) (notice the `All` at the end). This function does not find only the first element, but all the elements that match the provided selector. +We need a different function: `document.querySelectorAll()` (notice the `All` at the end). This function does not find only the first element, but all the elements that match the provided selector. Run the following function in the Console: