Turning A Webpage Into Markdown (For AI Analysis) → via Apify
This Apify actor scrapes a single webpage and parses it to markdown. It includes browser-based scraping, smart retrying, anti-scrape block circumvention (e.g. Cloudflare), and smart proxy support to ensure a high success rate.
It also includes 2 modes of operation so that you can optimize for either cost (as cheap as possible) or yield (as many successful results as possible).
🤔 When To Use It
Whenever you want to reliably get a webpage’s content and parse it into markdown.
(I personally mostly use it for feeding data into ChatGPT for freelance cold outreach personalization & automation tasks)
😰 Why We Made It:
Getting ChatGPT to interpret a webpage can be surprisingly difficult with current tooling.
- 😭 ChatGPT’s API isn’t currently web-connected
- 😿 If you try to get a page’s content via a Make automation and parse it to text/markdown, it’s unreliable and produces a lot of soft failures and rendering errors
- 🤢 If you try to use standalone tools for webpage scraping to markdown conversion, they’re expensive and also have a lot of soft failures & markdown rendering errors
- 😣 If you use the other website-crawling-to-markdown scrapers on Apify, they tend to be expensive and unreliable.
That’s why we made this Actor…
💪 Why This Actor is Nifty:
😍 This actor allows you to simply plop in a big ole list of domain names and get a huge spreadsheet of markdown content back, to do whatever you want with.
(e.g. upload to Google Sheets and have ChatGPT iterate through it via a Make automation)
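If you'd rather drive it from code than the Apify console, here's a minimal Python sketch using the official apify-client package. The actor ID and the input/output field names below are illustrative placeholders; check this actor's input schema for the real names.

```python
# Minimal sketch: run the actor over a list of domains and read back markdown.
# ASSUMPTIONS: the actor ID and the "startUrls"/"markdown" field names are
# placeholders; check this actor's input schema for the real names.
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

domains = ["example.com", "example.org"]
run_input = {
    # Most Apify scrapers accept a list of start URLs in this shape.
    "startUrls": [{"url": f"https://{d}"} for d in domains],
}

# Start the actor and block until the run finishes.
run = client.actor("YOUR_USERNAME/webpage-to-markdown").call(run_input=run_input)

# Each dataset item should be one page's result (URL, markdown, status).
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), len(item.get("markdown") or ""))
```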
🤘 Features:
- ✅ Anti-Scrape Circumvention — if you use the “Get Data Using Browser” option, we’ll be able to circumvent many blocks
- ✅ Soft-Failure Reporting — e.g. if a webpage comes back blank, we’ll mark it as a failure (not a lot of other solutions do this)
- ✅ Smart Proxy Support — we’ll run on Datacenter proxies by default, and only switch to Residential proxies when actually necessary
- ✅ Smart Retrying — we’ll auto retry on failures and rotate proxies and IPs to get you the most successful results possible
💭 Example Use Cases:
If you’re a $200k Freelancer course student, be sure to check the course training area for guidance on the below use cases and more.
Website Language Detection:
- Run this actor
- Put results into a Google Sheet
- Filter out the fails
- Add the formula `=DETECTLANGUAGE(E2)` (assuming E is the markdown column) to a new column
- Extend that formula to all rows in the column
- Filter the results to hide the languages you don’t want (e.g. filter to only show `en` for English-only websites; a scripted alternative appears after this list)
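If you'd rather skip the spreadsheet formula entirely, the same filtering takes a few lines of Python. This is a sketch assuming the pandas and langdetect packages, and a markdown column literally named "markdown" (adjust to match your export):

```python
# Sketch: keep only English-language pages from an exported results file.
# ASSUMPTION: the export has a "markdown" column; rename to match yours.
import pandas as pd
from langdetect import detect, LangDetectException

df = pd.read_excel("results.xlsx")

def lang_of(text: str) -> str:
    try:
        return detect(str(text))
    except LangDetectException:  # blank or undetectable content
        return "unknown"

df["language"] = df["markdown"].apply(lang_of)
df[df["language"] == "en"].to_excel("results_en.xlsx", index=False)
```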
Cold Outreach Personalization:
(e.g. find out what kinds of products a company sells, who their audience avatar is, etc.)
- Run this actor
- Put results into a Google Sheet
- Filter out the fails
- Create a Make automation that feeds the markdown into ChatGPT for analysis
- Have ChatGPT give you its analyses back as JSON if you want multiple fields / analyses back (e.g. “type_of_products_sold,” “random_product_name,” etc.)
- Parse the JSON and add each field to a column in the Google Sheet
- You can now feed this data into a line-writer ChatGPT prompt to have it rewrite a template line with that personalization data (a sketch of the ChatGPT step appears after this list)
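If you'd rather prototype the ChatGPT step in plain Python before wiring up Make, here's a rough sketch. The model name and the JSON field names are assumptions; JSON mode only guarantees parseable JSON, so spell out the keys you want in the prompt.

```python
# Sketch: ask the model for personalization fields back as JSON.
# ASSUMPTIONS: the model name is illustrative; the API key is read from
# the OPENAI_API_KEY environment variable; field names match your prompt.
import json
from openai import OpenAI

client = OpenAI()
markdown = "..."  # one row's markdown content from your sheet

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any JSON-mode-capable chat model
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": (
            "From this webpage markdown, return a JSON object with keys "
            '"type_of_products_sold" and "random_product_name":\n\n' + markdown
        ),
    }],
)

fields = json.loads(resp.choices[0].message.content)
print(fields["type_of_products_sold"], fields["random_product_name"])
```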
Modes of Operation
Regardless of which mode you use it in, if you’re exporting to a spreadsheet, be sure to choose MS Excel format, not CSV. (Markdown will often mess up the CSV file)
“Low-Hanging Fruit” Mode
The following settings are efficient and the cheapest path to data, but won’t work for a lot of websites:
- “Get Data Using Browser” option disabled
- 1GB of RAM
- Residential proxies (we use datacenter by default in our code and will only use residential if actually necessary)
Estimated Costs for “Low-Hanging Fruit” Mode:
- Est. cost per result in “Low-Hanging Fruit” Mode: $0.00025
- Est. yield on results: 84.12%
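In apify-client terms, that mode looks roughly like this. Note that memory_mbytes is a real client option, but the "useBrowser" input field name is a guess standing in for the “Get Data Using Browser” toggle:

```python
# Sketch: a "Low-Hanging Fruit" run. "useBrowser" is a GUESSED field name
# for the "Get Data Using Browser" toggle; check the input schema.
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("YOUR_USERNAME/webpage-to-markdown").call(
    run_input={
        "startUrls": [{"url": "https://example.com"}],  # placeholder field
        "useBrowser": False,  # browser-based scraping off = cheapest path
    },
    memory_mbytes=1024,  # 1 GB of RAM
)
```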
“All The Damned Fruit” Mode
The following settings have very high reliability, but are more expensive:
- “Get Data Using Browser” option enabled
- 4GB of RAM (You can often get away with 2GB – or even 1GB – of RAM, which will make it much cheaper.)
- Residential proxies (we use datacenter by default in our code and will only use residential if actually necessary)
Estimated Costs for “All The Damned Fruit” Mode:
- Est. cost per result in “All The Damned Fruit” Mode: $0.0069 CPL for residential proxies ($0.0012 CPL for datacenter)
- Est. yield on results: 93.38% for residential (91.64% datacenter)
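To make the cost trade-off concrete, here's the back-of-the-envelope math for a 10,000-URL batch using the estimates above (rough numbers, not a quote):

```python
# Rough cost math for 10,000 URLs, using the per-result estimates above.
urls = 10_000

# Single pass, "All The Damned Fruit" mode on residential proxies:
one_pass_cost = urls * 0.0069      # $69.00
one_pass_hits = urls * 0.9338      # ~9,338 successful results

# Two passes: cheap mode first, then retry only the failures:
cheap_cost = urls * 0.00025        # $2.50
cheap_hits = urls * 0.8412         # ~8,412 successful results
retry_cost = (urls - cheap_hits) * 0.0069  # ~1,588 failures -> ~$10.96

print(f"one pass: ${one_pass_cost:.2f}")           # $69.00
print(f"two pass: ${cheap_cost + retry_cost:.2f}")  # ~$13.46
```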
Suggested Usage
Depending on your priorities, there are a couple ways to use this scraper. What’s your priority?
“My Priority is EASE”
(“…And I don’t care if it costs more.”)
👉 Run it with the “All The Damned Fruit” Mode settings from the “Modes of Operation” section right from the start.
“My Priority is COST”
(“…And I don’t care if it means there are a couple extra steps for me.”)
👉 You’ll do two separate runs — first you’ll get all the cheap Low-Hanging Fruit results you can, then you’ll re-run all the failures in the “All The Damned Fruit” Mode.
Instructions:
- Run your full set of URLs with the “Low-Hanging Fruit” Mode settings (you can find them in the Modes of Operation section at the top of this page)
- After the run is finished, export the results to Excel format and filter the list to only show the failures
- Re-run those failures with the “All The Damned Fruit” Mode settings (also in the Modes of Operation section)
- Export the results from both runs and merge the data manually into one sheet (a scripted version of the filtering and merging appears after this list)
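Steps 2 through 4 can be scripted if you'd rather not click through spreadsheets. A pandas sketch, assuming the export flags failures in a boolean "success" column and keys rows by "url" (rename both to match the actual export):

```python
# Sketch: pull failures out of run 1, then merge both runs into one sheet.
# ASSUMPTION: the export has "success" and "url" columns; rename to match.
import pandas as pd

run1 = pd.read_excel("run1_low_hanging_fruit.xlsx")

# URLs that failed in the cheap pass; feed these to the second run.
run1.loc[~run1["success"], "url"].to_csv("failures_to_rerun.csv", index=False)

# After the second run finishes, merge run 1's successes with run 2.
run2 = pd.read_excel("run2_all_the_damned_fruit.xlsx")
merged = pd.concat([run1[run1["success"]], run2], ignore_index=True)
merged.to_excel("merged_results.xlsx", index=False)
```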
All Config Options
- Maximum Content Length (Characters) — This will trim each record’s markdown output before we add it to the result set. Cuts down on spreadsheet filesize. (Our hard-set internal trim maximum is 10,000 characters)
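If you need a tighter cap than the actor's built-in 10,000-character maximum, you can always trim further after export. A trivial sketch (column name assumed):

```python
# Sketch: trim exported markdown further, past the actor's 10,000-char cap.
# ASSUMPTION: the markdown column is named "markdown"; rename to match.
import pandas as pd

df = pd.read_excel("results.xlsx")
df["markdown"] = df["markdown"].astype(str).str.slice(0, 3000)  # your own cap
df.to_excel("results_trimmed.xlsx", index=False)
```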