Turning A Webpage Into Markdown (For AI Analysis) → via Apify
This Apify actor scrapes a single webpage and parses it to markdown. It includes browser-based scraping, smart retrying, anti-scrape block circumvention (e.g. Cloudflare), and smart proxy support to ensure a high success rate.
It also includes 2 modes of operation so that you can optimize for either cost (as cheap as possible) or yield (as many successful results as possible).
🤔 When To Use It
Whenever you want to reliably get a webpage’s content and parse it into markdown.
(I personally mostly use it for feeding data into ChatGPT for freelance cold outreach personalization & automation tasks)
😰 Why We Made It:
Getting ChatGPT to interpret a webpage can be surprisingly difficult with current tooling.
- 😭 ChatGPT’s API isn’t currently web-connected
- 😿 If you try to get a page’s content via a Make automation and parse it to text/markdown, it’s unreliable and produces a lot of soft failures and rendering errors
- 🤢 If you try to use standalone tools for webpage scraping to markdown conversion, they’re expensive and also have a lot of soft failures & markdown rendering errors
- 😣 If you use the other website-crawling-to-markdown scrapers on Apify, they tend to be expensive and unreliable.
That’s why we made this Actor…
💪 Why This Actor is Nifty:
😍 This actor allows you to simply plop in a big ole list of domain names and get a huge spreadsheet of markdown content back, to do whatever you want with.
(e.g. upload to Google Sheets and have ChatGPT iterate through it via a Make automation)
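If you'd rather drive it from code than the Apify console, here's a minimal Python sketch using the official apify-client package. The actor ID and the input/output field names below are illustrative placeholders; check this actor's input schema for the real names.

```python
# Minimal sketch: run the actor over a list of domains and read back markdown.
# ASSUMPTIONS: the actor ID and the "startUrls"/"markdown" field names are
# placeholders; check this actor's input schema for the real names.
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

domains = ["example.com", "example.org"]
run_input = {
    # Most Apify scrapers accept a list of start URLs in this shape.
    "startUrls": [{"url": f"https://{d}"} for d in domains],
}

# Start the actor and block until the run finishes.
run = client.actor("YOUR_USERNAME/webpage-to-markdown").call(run_input=run_input)

# Each dataset item should be one page's result (URL, markdown, status).
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), len(item.get("markdown") or ""))
```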
🤘 Features:
- ✅ Anti-Scrape Circumvention — if you use the “Get Data Using Browser” option, we’ll be able to circumvent many blocks
- ✅ Soft-Failure Reporting — e.g. if a webpage comes back blank, we’ll mark it as a failure (not a lot of other solutions do this)
- ✅ Smart Proxy Support — we’ll run on Datacenter proxies by default, and only switch to Residential proxies when actually necessary
- ✅ Smart Retrying — we’ll auto retry on failures and rotate proxies and IPs to get you the most successful results possible
💭 Example Use Cases:
If you’re a $200k Freelancer course student, be sure to check the course training area for guidance on the below use cases and more.
Website Language Detection:
- Run this actor
- Put results into a Google Sheet
- Filter out the fails
- Add the formula `=DETECTLANGUAGE(E2)` (assuming E is the markdown column) to a new column
- Extend that formula to all rows in the column
- Filter the results to hide the languages you don’t want (e.g. filter to only show `en` for English-only websites; a scripted alternative appears after this list)
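If you'd rather skip the spreadsheet formula entirely, the same filtering takes a few lines of Python. This is a sketch assuming the pandas and langdetect packages, and a markdown column literally named "markdown" (adjust to match your export):

```python
# Sketch: keep only English-language pages from an exported results file.
# ASSUMPTION: the export has a "markdown" column; rename to match yours.
import pandas as pd
from langdetect import detect, LangDetectException

df = pd.read_excel("results.xlsx")

def lang_of(text: str) -> str:
    try:
        return detect(str(text))
    except LangDetectException:  # blank or undetectable content
        return "unknown"

df["language"] = df["markdown"].apply(lang_of)
df[df["language"] == "en"].to_excel("results_en.xlsx", index=False)
```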
Cold Outreach Personalization:
(e.g. find out what kinds of products a company sells, who their audience avatar is, etc.)
- Run this actor
- Put results into a Google Sheet
- Filter out the fails
- Create a Make automation that feeds the markdown into ChatGPT for analysis
- Have ChatGPT give you its analyses back as JSON if you want multiple fields / analyses back (e.g. “type_of_products_sold,” “random_product_name,” etc.)
- Parse the JSON and add each field to a column in the Google Sheet
- You can now feed this data into a line-writer ChatGPT prompt to have it rewrite a template line with that personalization data (a sketch of the ChatGPT step appears after this list)
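If you'd rather prototype the ChatGPT step in plain Python before wiring up Make, here's a rough sketch. The model name and the JSON field names are assumptions; JSON mode only guarantees parseable JSON, so spell out the keys you want in the prompt.

```python
# Sketch: ask the model for personalization fields back as JSON.
# ASSUMPTIONS: the model name is illustrative; the API key is read from
# the OPENAI_API_KEY environment variable; field names match your prompt.
import json
from openai import OpenAI

client = OpenAI()
markdown = "..."  # one row's markdown content from your sheet

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any JSON-mode-capable chat model
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": (
            "From this webpage markdown, return a JSON object with keys "
            '"type_of_products_sold" and "random_product_name":\n\n' + markdown
        ),
    }],
)

fields = json.loads(resp.choices[0].message.content)
print(fields["type_of_products_sold"], fields["random_product_name"])
```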
Modes of Operation
Regardless of which mode you use it in, if you’re exporting to a spreadsheet, be sure to choose MS Excel format, not CSV. (Markdown will often mess up the CSV file)
“Low-Hanging Fruit” Mode
The following settings are efficient and the cheapest path to data, but won’t work for a lot of websites:
- “Get Data Using Browser” option disabled
- 1GB of RAM
- Residential proxies (we use datacenter by default in our code and will only use residential if actually necessary)
Estimated Costs for “Low-Hanging Fruit” Mode:
- Est. cost per result in “Low-Hanging Fruit” Mode: $0.00025
- Est. yield on results: 84.12%
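In apify-client terms, that mode looks roughly like this. Note that memory_mbytes is a real client option, but the "useBrowser" input field name is a guess standing in for the “Get Data Using Browser” toggle:

```python
# Sketch: a "Low-Hanging Fruit" run. "useBrowser" is a GUESSED field name
# for the "Get Data Using Browser" toggle; check the input schema.
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("YOUR_USERNAME/webpage-to-markdown").call(
    run_input={
        "startUrls": [{"url": "https://example.com"}],  # placeholder field
        "useBrowser": False,  # browser-based scraping off = cheapest path
    },
    memory_mbytes=1024,  # 1 GB of RAM
)
```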
“All The Damned Fruit” Mode
The following settings have very high reliability, but are more expensive:
- “Get Data Using Browser” option enabled
- 4GB of RAM (You can often get away with 2GB – or even 1GB – of RAM, which will make it much cheaper.)
- Residential proxies (we use datacenter by default in our code and will only use residential if actually necessary)
Estimated Costs for “All The Damned Fruit” Mode:
- Est. cost per result in “All The Damned Fruit” Mode: $0.0069 CPL for residential proxies ($0.0012 CPL for datacenter)
- Est. yield on results: 93.38% for residential (91.64% datacenter)
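To make the cost trade-off concrete, here's the back-of-the-envelope math for a 10,000-URL batch using the estimates above (rough numbers, not a quote):

```python
# Rough cost math for 10,000 URLs, using the per-result estimates above.
urls = 10_000

# Single pass, "All The Damned Fruit" mode on residential proxies:
one_pass_cost = urls * 0.0069      # $69.00
one_pass_hits = urls * 0.9338      # ~9,338 successful results

# Two passes: cheap mode first, then retry only the failures:
cheap_cost = urls * 0.00025        # $2.50
cheap_hits = urls * 0.8412         # ~8,412 successful results
retry_cost = (urls - cheap_hits) * 0.0069  # ~1,588 failures -> ~$10.96

print(f"one pass: ${one_pass_cost:.2f}")           # $69.00
print(f"two pass: ${cheap_cost + retry_cost:.2f}")  # ~$13.46
```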
Suggested Usage
Depending on your priorities, there are a couple ways to use this scraper. What’s your priority?
“My Priority is EASE”
(“…And I don’t care if it costs more.”)
👉 Run it with the “All The Damned Fruit” Mode settings from the “Modes of Operation” section right from the start.
“My Priority is COST”
(“…And I don’t care if it means there are a couple extra steps for me.”)
👉 You’ll do two separate runs — first you’ll get all the cheap Low-Hanging Fruit results you can, then you’ll re-run all the failures in the “All The Damned Fruit” Mode.
Instructions:
- Run your full set of URLs with the “Low-Hanging Fruit” Mode settings (you can find them in the Modes of Operation section at the top of this page)
- After the run is finished, export the results to Excel format and filter the list to only show the failures
- Re-run those failures with the “All The Damned Fruit” Mode settings (also in the Modes of Operation section)
- Export the results from both runs and merge the data manually into one sheet (a scripted version of the filtering and merging appears after this list)
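Steps 2 through 4 can be scripted if you'd rather not click through spreadsheets. A pandas sketch, assuming the export flags failures in a boolean "success" column and keys rows by "url" (rename both to match the actual export):

```python
# Sketch: pull failures out of run 1, then merge both runs into one sheet.
# ASSUMPTION: the export has "success" and "url" columns; rename to match.
import pandas as pd

run1 = pd.read_excel("run1_low_hanging_fruit.xlsx")

# URLs that failed in the cheap pass; feed these to the second run.
run1.loc[~run1["success"], "url"].to_csv("failures_to_rerun.csv", index=False)

# After the second run finishes, merge run 1's successes with run 2.
run2 = pd.read_excel("run2_all_the_damned_fruit.xlsx")
merged = pd.concat([run1[run1["success"]], run2], ignore_index=True)
merged.to_excel("merged_results.xlsx", index=False)
```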
All Config Options
- Maximum Content Length (Characters) — This will trim each record’s markdown output before we add it to the result set. Cuts down on spreadsheet filesize. (Our hard-set internal trim maximum is 10,000 characters)
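If you need a tighter cap than the actor's built-in 10,000-character maximum, you can always trim further after export. A trivial sketch (column name assumed):

```python
# Sketch: trim exported markdown further, past the actor's 10,000-char cap.
# ASSUMPTION: the markdown column is named "markdown"; rename to match.
import pandas as pd

df = pd.read_excel("results.xlsx")
df["markdown"] = df["markdown"].astype(str).str.slice(0, 3000)  # your own cap
df.to_excel("results_trimmed.xlsx", index=False)
```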