
Back in 2016, I spent an entire year building programmatic websites using Selenium.
The system quickly scaled to 50,000+ pages, scraping public databases and marketplaces, and it worked beautifully for eight months. Then some of those websites were updated, and my scripts broke.
You see, Selenium can be incredibly powerful and quite fragile at the same time, and sometimes maintaining it takes more time than the automation is worth.
So, let’s look at web scraping with Selenium and Python, where Selenium scrapers can fail, and some better, more robust solutions to replace Selenium.
TL;DR:
- Selenium 4.39+ includes automatic driver management, so no manual ChromeDriver downloads
- Use WebDriverWait instead of time.sleep() for reliable dynamic content scraping
- Headless mode (--headless=new) runs browsers without visible windows for production
- CSS selectors and XPath locate elements; find_element() gets one, find_elements() gets all
- Firecrawl offers AI-powered extraction using schemas instead of brittle CSS selectors
- Use Selenium for learning scraping fundamentals and Firecrawl for production at scale
What is Selenium web scraping?
Selenium web scraping is a browser automation technique that extracts data from JavaScript-heavy websites where simple HTTP requests fail. Selenium launches a real browser, waits for dynamic content to load, and even interacts with pages exactly like a human user would.
This makes Selenium an important library for scraping modern single-page applications, infinite-scroll feeds, and sites that load data asynchronously after the initial page load.
3 page rendering methods and libraries that work best for them
Modern websites use one of three rendering approaches:
- Server-side rendering (SSR): The server generates complete HTML on the backend and delivers the ready page to your browser. When you view source (Ctrl+U), you see all the content. For these sites, simpler libraries like requests and BeautifulSoup are enough (see the short example after this list).
- Client-side rendering (CSR): These websites send JavaScript that builds the DOM after the page loads in your browser. Scraping libraries that only capture the initial response cannot see the actual HTML content. CSR is where you need Selenium to "view" the page, wait a few seconds for it to load, and then pull the source code.
- Hybrid rendering: A mix of both approaches: the server delivers a partial page and JavaScript fills in the rest.
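For instance, here's a minimal sketch of the SSR case using requests and BeautifulSoup. Hacker News (which we'll scrape again later) is server-rendered, so no browser is needed:

```python
import requests
from bs4 import BeautifulSoup

# The server returns complete HTML, so a plain HTTP request is enough
response = requests.get("https://news.ycombinator.com/", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# The story rows are already present in the raw HTML
for row in soup.find_all("tr", class_="athing")[:5]:
    link = row.find("span", class_="titleline").find("a")
    print(link.get_text(), "->", link.get("href"))
```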
How to install Selenium and the Selenium WebDriver
The biggest improvement in recent Selenium releases is automatic driver management. You no longer download ChromeDriver manually or manage PATH variables. Selenium Manager handles everything.
Setting up your Selenium web scraping project
There are two clean approaches for the setup. Pick whichever matches your workflow.
Option A: Classic pip setup
# Create virtual environment
python3.13 -m venv selenium_env
# Activate it
source selenium_env/bin/activate # macOS/Linux
selenium_env\Scripts\activate # Windows
# Install Selenium
pip install --upgrade pip
pip install selenium==4.39.0
Option B: Modern uv setup
# Initialize project
uv init selenium-scraper
cd selenium-scraper
# Add Selenium
uv add selenium
Both work perfectly. Create a scraper.py file in your project directory. That's where your code goes.
What about Selenium browser drivers?
Managing browser drivers used to be a struggle in the early days of Selenium. Fortunately, you don't need to download these drivers manually anymore.
Selenium 4.6+ includes Selenium Manager. It automatically detects your Chrome or Firefox version, downloads the matching driver to ~/.cache/selenium, and manages updates as well.
Just download Chrome or Firefox normally from their official sites, as Selenium uses the browser on your device.
Verify everything works
Here’s a simple Selenium test file.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://firecrawl.dev")
print(driver.title)
driver.quit()
Run it:
python3 scraper.py
Firecrawl - The Web Data API for AI
Selenium basics: Working with the web browser
Now that Selenium's installed, let's actually open a browser and load a page (isn’t that what we’re here for?)
Basic browser launch using Selenium
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://firecrawl.dev")
print(driver.title)
driver.quit()
What's happening:
- webdriver.Chrome() launches Chrome. A browser window pops up.
- driver.get() loads the URL. Just like typing it into your address bar.
- driver.title returns the page's <title> tag content.
- driver.quit() closes the browser completely.
Run this. You'll see Chrome open, load the page, then close immediately. That's your first successful Selenium session.
Headless mode in Selenium
Headless mode runs the browser without a visible window, which makes it perfect for:
- Servers and CI pipelines (no display)
- Background jobs and cron tasks
- Running multiple browsers in parallel
- Avoiding screen clutter during development
Here's how to enable headless mode in Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
driver.get("https://firecrawl.dev")
print(driver.title)
driver.quit()
The --headless=new flag uses Chrome's modern headless mode, introduced in Chrome 109. It runs the same rendering engine as regular Chrome (just without a window), so what you scrape matches what a real browser renders.
The old implementation had a distinct user agent, inconsistent font rendering, and quirky CSS behavior that caused scraping failures on many modern websites that use custom fonts and layouts (which is pretty much all of them).
Taking screenshots with Selenium
Sometimes you just want proof that the page loaded correctly. One line:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("--headless=new")
opts.add_argument("--window-size=1920,1080") # Set viewport dimensions
driver = webdriver.Chrome(options=opts)
driver.get("https://firecrawl.dev")
driver.save_screenshot("firecrawl-homepage.png") # Creates 1920x1080 screenshot
driver.quit()
Selenium captures exactly what the browser sees: rendered content, applied CSS, and loaded images. This feature can be quite handy for debugging dynamic pages or verifying if the login flows worked.
How to find elements on a page using Selenium
Scraping doesn’t work unless you can locate the data you want. Selenium gives you multiple ways to find elements, including CSS selectors, XPath, by ID, by HTML tag name, and more.
Here’s a simple script that allows you to find elements that match a specific ID or HTML tag.
from selenium import webdriver
from selenium.webdriver.common.by import By
# Configure Chrome options
opts = webdriver.ChromeOptions()
opts.add_argument("--headless=new")
opts.add_argument("--window-size=1920,1080")
driver = webdriver.Chrome(options=opts)
driver.get("https://firecrawl.dev")
# Print the main heading
print(driver.find_element(By.TAG_NAME, "h1").text)
# Print the first 3 navigation links
for link in driver.find_elements(By.CSS_SELECTOR, "nav a")[:3]:
print(f"{link.text}: {link.get_attribute('href')}")
# Print the CTA button text
print("CTA text:", driver.find_element(By.CSS_SELECTOR, "a[href*='playground']").text)
driver.quit()
Here’s what you’d get:
Turn websites into
LLM-ready data
CTA text: Playground
Key difference:
- find_element() returns the first matching element. If nothing matches, it raises an exception (NoSuchElementException).
- find_elements() returns a list of all matching elements. If nothing matches, it returns an empty list (no exception).
When to use which
- Use find_element() when the element must exist (for example, a main heading, login button, or search field).
- Use find_elements() for optional or repeating elements (for example, navigation links, product cards, or menu items); see the sketch after this list.
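Here's a short sketch of both patterns side by side; the selectors match the Firecrawl homepage example above and are only illustrative:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://firecrawl.dev")

# Required element: handle the exception if it's missing
try:
    heading = driver.find_element(By.TAG_NAME, "h1")
    print("Heading:", heading.text)
except NoSuchElementException:
    print("No <h1> found on this page")

# Optional/repeating elements: an empty list simply means none were found
nav_links = driver.find_elements(By.CSS_SELECTOR, "nav a")
if not nav_links:
    print("No navigation links found")
for link in nav_links[:3]:
    print(link.text, link.get_attribute("href"))

driver.quit()
```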
Available locator types in Selenium
| Strategy | Use Case | Example |
|---|---|---|
| By.ID | Fast, unique elements | driver.find_element(By.ID, "login-btn") |
| By.CSS_SELECTOR | Flexible, readable | driver.find_element(By.CSS_SELECTOR, "nav a[href='/pricing']") |
| By.XPATH | Complex queries, text matching | driver.find_element(By.XPATH, "//a[text()='Login']") |
| By.CLASS_NAME | Elements with class | driver.find_element(By.CLASS_NAME, "product-card") |
| By.TAG_NAME | All elements of type | driver.find_elements(By.TAG_NAME, "a") |
| By.NAME | Form fields | driver.find_element(By.NAME, "email") |
Best practices:
- Prefer IDs when available. They’re quick to find and usually unique, so you don’t have to write long nested selectors.
- Use CSS selectors for attributes and structure. They’re readable and widely supported.
- Use XPath when you need text matching or complex DOM traversal.
- Avoid generated class names like css-19kzrtuor_a1b2c3. These are often dynamic and can change between page loads.
Testing selectors in DevTools
Chrome DevTools lets you test selectors directly in the browser to confirm you’re targeting the right elements.
For CSS:
$$('nav a[href="/pricing"]');
For XPath:
$x('//nav//a[text()="Pricing"]');
If the selector returns elements in DevTools, it'll work in Selenium. If it's flaky there, it'll break in your scraper.
Using Selenium to extract data from elements
By now, you know the types of element selectors you can use in Selenium. But identifying the selector is just one part. You need to get the content from the selected element.
Getting text content
The .text property returns visible text:
element = driver.find_element(By.CLASS_NAME, "product-name")
print(element.text)
This property gives you exactly what a user sees. Perfect for titles, prices, descriptions.
Getting attributes
Use .get_attribute() for URLs, image sources, or any HTML attribute:
link = driver.find_element(By.TAG_NAME, "a")
print(link.get_attribute("href")) # URL
print(link.get_attribute("title")) # Tooltip text
print(link.get_attribute("class")) # CSS classesClicking elements
To trigger interactions like pagination, "Load More" buttons, or a "Buy now" button:
button = driver.find_element(By.CLASS_NAME, "load-more")
button.click()
Selenium executes the actual JavaScript click event, and the page responds exactly like it would for a real user.
Filling forms
To interact with forms using Selenium:
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
driver.get("https://www.firecrawl.dev/signin")
# Enter credentials
driver.find_element(By.ID, "username").send_keys("your_email")
driver.find_element(By.ID, "password").send_keys("your_password")
# Click the login button
driver.find_element(By.ID, "login-btn").click()
# Verify success (look for logout link)
try:
    driver.find_element(By.LINK_TEXT, "Logout")
    print("Login successful")
except NoSuchElementException:
    print("Login failed")
After submitting a form, cookies persist throughout the session for subsequent page interactions.
Handling dynamic content and JavaScript
Everything we've discussed so far works best with server-rendered websites. But modern sites load content asynchronously: the page loads first, then JavaScript fills in the data moments later.
If you try scraping immediately, you'll get empty results. So here’s how to go about scraping dynamic content:
The wrong way: time.sleep()
import time
driver = webdriver.Chrome(options=options)
driver.get("https://firecrawl.dev")
time.sleep(5) # Hope it loads in 5 seconds
products = driver.find_elements(By.CLASS_NAME, "product")
time.sleep() is unreliable. Sometimes, 5 seconds is too short. Sometimes it's way too long. Network speed, server load, and JavaScript complexity all vary.
The right way: WebDriverWait
Wait dynamically until a specific condition is met:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(options=options)
driver.get("https://firecrawl.dev")
wait = WebDriverWait(driver, 10) # Max 10 seconds
products = wait.until(
EC.presence_of_all_elements_located((By.CLASS_NAME, "product"))
)
for product in products:
print(product.text)What happens:
- Selenium checks every 500ms whether products exist.
- Returns immediately when they appear (could be 1 second, could be 9).
- Raises an exception after 10 seconds if they never show up.
This pattern prevents race conditions and keeps your scraper fast.
If you need something more specific, expected_conditions provides many other wait conditions besides EC.presence_of_all_elements_located, depending on your use case.
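For example, here's a quick sketch of a few other commonly used conditions, reusing the driver from above (the selectors and expected values are illustrative):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)

# Wait until a button is present, visible, and clickable
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".load-more")))

# Wait until an element is actually visible, not just attached to the DOM
results = wait.until(EC.visibility_of_element_located((By.ID, "results")))

# Wait until the title or URL confirms navigation finished
wait.until(EC.title_contains("Firecrawl"))
wait.until(EC.url_contains("/pricing"))
```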
Handling infinite scroll
If you’re scraping social feeds or any website that has infinite scroll, here’s how you handle that.
import time
last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
# Scroll to bottom
driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
# Wait for new content to load
time.sleep(2)
# Check if page height changed
new_height = driver.execute_script("return document.documentElement.scrollHeight")
if new_height == last_height:
break # No more content
last_height = new_height
This script scrolls to the bottom of the page, waits for new content to load, loops while the page height keeps increasing, and stops once the height no longer changes (meaning no new content is being loaded).
Adding anti-detection options to Selenium scripts
Advanced websites detect Selenium through browser fingerprints like the navigator.webdriver property, inconsistent window dimensions, and automation-specific headers.
These sites may block requests, serve CAPTCHAs, or return incomplete data when they detect automated traffic.
opts = Options()
opts.add_argument("--headless=new")
opts.add_argument("--no-sandbox")
opts.add_argument("--disable-dev-shm-usage")
opts.add_argument("--disable-blink-features=AutomationControlled")Then remove the navigator.webdriver property that screams "I'm a bot":
driver = webdriver.Chrome(options=opts)
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})
driver.get("https://firecrawl.dev")This Chrome DevTools Protocol command (the driver.execute_cdp_cmd line) runs before page load, hiding Selenium's fingerprint.
Detecting and avoiding honeypots
Honeypots are invisible traps embedded in web pages: hidden form fields, fake links, or off-screen buttons that only bots interact with. If your Selenium scraper clicks a hidden link or fills a hidden field, the site flags you immediately.
The first line of defense is is_displayed(). Before interacting with any element, check whether it's actually visible to a human user:
from selenium.webdriver.common.by import By
def safe_interact(driver, selector):
elements = driver.find_elements(By.CSS_SELECTOR, selector)
for el in elements:
# Skip invisible elements (likely honeypots)
if not el.is_displayed():
continue
# Skip zero-size elements
if el.size.get("width", 0) == 0 or el.size.get("height", 0) == 0:
continue
# Skip hidden inputs
if el.get_attribute("type") == "hidden":
continue
# Skip aria-hidden elements
if el.get_attribute("aria-hidden") in ("true", "True"):
continue
# Safe to interact
yield el
Rules of thumb for avoiding honeypots:
- Never fill every form field blindly. Only interact with fields a real user would see.
- Check is_displayed() before every click or send_keys() call.
- Watch for suspicious field names like qwerty_123 or trap_field. Legitimate fields have descriptive names like email, password, or search.
- Occasionally test in non-headless mode. Some elements that pass DOM checks are hidden via CSS tricks that only show up visually.
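To put the safe_interact helper above to use, iterate it and only touch the elements it yields. The selector is illustrative:

```python
# Fill only the fields a real user would see; hidden "trap" inputs are skipped
for field in safe_interact(driver, "form input[type='text']"):
    field.send_keys("example value")
```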
Handling CAPTCHAs
CAPTCHAs are designed to stop bots, so there's no clean universal bypass. But there are practical strategies depending on your scale:
For low-volume scraping: Pause the scraper, display the CAPTCHA, and solve it manually. Simple and free, but doesn't scale.
For automated pipelines: Third-party solver services like 2Captcha or Anti-Captcha accept the challenge, solve it (usually via human workers), and return a token. This adds cost and latency but works at scale.
Prevention is better than solving. The best CAPTCHA is one you never trigger (a short sketch of the first few points follows this list):
- Slow down requests. Add random delays (0.5-2 seconds) between actions.
- Rotate proxies so requests don't all come from the same IP.
- Use realistic user-agent strings and window sizes.
- Reuse sessions and cookies instead of starting fresh each time.
- Don't make 100 identical requests per second.
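Here's a minimal sketch of the first few points. The user-agent string is just an example of a realistic desktop browser; update it to match a current Chrome release:

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless=new")
opts.add_argument("--window-size=1920,1080")
# Example of a realistic desktop user agent
opts.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

driver = webdriver.Chrome(options=opts)

for url in ["https://firecrawl.dev", "https://firecrawl.dev/pricing"]:
    driver.get(url)
    print(driver.title)
    # Random pause between actions so the traffic pattern looks human
    time.sleep(random.uniform(0.5, 2.0))

driver.quit()
```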
Combining Selenium with BeautifulSoup
Selenium handles JavaScript rendering, but it's slow at parsing HTML. BeautifulSoup is the opposite: it can't render JavaScript, but it parses HTML fast.
The combination is straightforward: let Selenium render the page, then hand the HTML to BeautifulSoup for extraction.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
driver.get("https://news.ycombinator.com/")
# Selenium renders the page, BeautifulSoup parses it
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
# Fast extraction with BeautifulSoup
stories = soup.find_all("tr", class_="athing")
for story in stories[:5]:
link = story.find("span", class_="titleline").find("a")
print(f"{link.get_text()}: {link.get('href')}")When to use which:
- Use Selenium alone when you need to interact with the page (clicks, form fills, scrolling) during extraction.
- Use BeautifulSoup alone when the page is server-rendered and doesn't require JavaScript.
- Use both together when you need JavaScript rendering but want fast, clean parsing afterward.
The combined approach is particularly useful when scraping lists of items. Selenium loads the page once, and BeautifulSoup can parse hundreds of elements from the rendered HTML in milliseconds.
Exporting scraped data to CSV and JSON
Scraping data is only useful if you can store it. Here's how to export your extracted data to the two most common formats.
Exporting to CSV
import csv
# Assume `scraped_data` is a list of dictionaries
scraped_data = [
{"title": "Story 1", "url": "https://example.com/1", "score": "142 points"},
{"title": "Story 2", "url": "https://example.com/2", "score": "89 points"},
]
with open("output.csv", "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=["title", "url", "score"])
writer.writeheader()
writer.writerows(scraped_data)
Exporting to JSON
import json
with open("output.json", "w", encoding="utf-8") as f:
json.dump(scraped_data, f, indent=2, ensure_ascii=False)
Cleaning data with pandas
For larger datasets, pandas helps with deduplication and missing values before export:
import pandas as pd
df = pd.DataFrame(scraped_data)
# Remove duplicates
df = df.drop_duplicates(subset=["url"])
# Drop rows with missing titles
df = df.dropna(subset=["title"])
# Export
df.to_csv("clean_output.csv", index=False)Scraping HTML tables with pagination
Many data-rich websites display information in HTML tables with paginated navigation. Here's a pattern for scraping through all pages of a table:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
# Example: a paginated data table
driver.get("https://datatables.net/examples/basic_init/zero_configuration.html")
all_rows = []
while True:
# Wait for the table to be present
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, "table#example tbody tr"))
)
# Extract rows from the current page
rows = driver.find_elements(By.CSS_SELECTOR, "table#example tbody tr")
for row in rows:
cells = row.find_elements(By.TAG_NAME, "td")
all_rows.append([cell.text for cell in cells])
# Check if "Next" button is available and not disabled
next_btn = driver.find_element(By.CSS_SELECTOR, ".dt-paging-button.next")
if "disabled" in next_btn.get_attribute("class"):
break # No more pages
next_btn.click()
# Export to CSV
with open("table_data.csv", "w", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerows(all_rows)
driver.quit()
The key pattern here is the while True loop that clicks "Next" until the button becomes disabled. This works with most paginated tables that use a standard next/previous pattern.
Using proxies with Selenium
Once your scraping hits any meaningful volume, proxies become essential. Without them, the target site sees all requests coming from a single IP and blocks you quickly.
Why use proxies?
- Avoid IP bans — Distribute requests across many IPs.
- Access geo-restricted content — Route through IPs in specific countries.
- Increase throughput — Multiple IPs mean more concurrent requests without triggering rate limits.
Setting up a proxy in Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
opts = Options()
opts.add_argument("--headless=new")
opts.add_argument("--proxy-server=http://your-proxy-address:port")
driver = webdriver.Chrome(options=opts)
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text) # Should show the proxy IP
driver.quit()
Authenticated proxies with Selenium Wire
Standard Selenium doesn't support authenticated proxies (username/password) natively. For that, use Selenium Wire:
pip install selenium-wire
from seleniumwire import webdriver
proxy_options = {
"proxy": {
"http": "http://user:pass@proxy-host:port",
"https": "https://user:pass@proxy-host:port",
}
}
driver = webdriver.Chrome(seleniumwire_options=proxy_options)
driver.get("https://httpbin.org/ip")
print(driver.page_source)
driver.quit()
For production workloads, rotating proxy services handle IP management automatically. You get a single endpoint that routes each request through a different IP from a pool of residential or datacenter addresses.
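If you manage your own pool instead, a simple pattern is to pick a different proxy for each driver session. A minimal sketch, with placeholder proxy addresses:

```python
import random

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Placeholder pool; swap in real endpoints from your proxy provider
PROXIES = [
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
]

def new_driver_with_random_proxy():
    opts = Options()
    opts.add_argument("--headless=new")
    opts.add_argument(f"--proxy-server={random.choice(PROXIES)}")
    return webdriver.Chrome(options=opts)

driver = new_driver_with_random_proxy()
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text)  # Should show the proxy's IP
driver.quit()
```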
Optimizing Selenium scraper performance
A vanilla Selenium setup loads everything: images, fonts, analytics scripts, ads. For scraping, most of that is wasted bandwidth. Here are practical optimizations:
1. Block unnecessary resources
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("--headless=new")
# Block images and optional JavaScript
prefs = {
"profile.managed_default_content_settings.images": 2, # Block images
# "profile.managed_default_content_settings.javascript": 2, # Block JS (careful!)
}
opts.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=opts)
Blocking images alone can reduce page load times significantly. Only block JavaScript if you're certain the data you need is in the initial HTML — most modern sites require JS to render content.
2. Use faster locators
Not all locator strategies are equal in speed:
- By.ID — Fastest. IDs are indexed by the browser.
- By.CLASS_NAME — Fast. Direct class lookup.
- By.CSS_SELECTOR — Fast and flexible. Good default choice.
- By.XPATH — Slowest. The browser has to traverse the DOM tree.
For high-volume scraping, prefer By.ID and By.CSS_SELECTOR over By.XPATH where possible.
3. Disable unnecessary browser features
opts.add_argument("--disable-extensions")
opts.add_argument("--disable-gpu")
opts.add_argument("--disable-popup-blocking")
opts.add_argument("--disable-notifications")
opts.add_argument("--no-sandbox")
opts.add_argument("--disable-dev-shm-usage") # Important for Docker/CI4. Reuse browser sessions
Creating and destroying browser instances is expensive. If you're scraping multiple pages from the same site, reuse the same driver instance and navigate between pages with driver.get() instead of creating a new browser each time.
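A minimal sketch of the reuse pattern (the URL list is illustrative):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless=new")

# One browser for the whole batch instead of one per page
driver = webdriver.Chrome(options=opts)

urls = [
    "https://firecrawl.dev",
    "https://firecrawl.dev/pricing",
    "https://firecrawl.dev/blog",
]

try:
    for url in urls:
        driver.get(url)  # Navigate within the same session
        print(url, "->", driver.title)
finally:
    driver.quit()  # Tear down the browser once at the end
```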
Scaling Selenium with Selenium Grid and Docker
A single Selenium instance scraping pages sequentially can only go so fast. When you need to scrape thousands of pages, you need parallelism.
Selenium Grid
Selenium Grid runs multiple browser sessions across machines. The architecture has two parts:
- Hub — The central coordinator that receives requests and distributes them.
- Nodes — Worker machines that run actual browser instances.
Quick start with Docker
The fastest way to get Selenium Grid running is Docker:
# Standalone Chrome (simplest setup)
docker run -d -p 4444:4444 -p 7900:7900 selenium/standalone-chrome
This gives you a single Chrome instance accessible at http://localhost:4444. Port 7900 provides a VNC viewer at http://localhost:7900 so you can watch the browser in action.
For parallel execution with multiple nodes:
# Start the hub
docker run -d -p 4444:4444 --name selenium-hub selenium/hub
# Start Chrome nodes (run this multiple times for more capacity)
docker run -d --link selenium-hub:hub selenium/node-chrome
Connecting from Python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("--headless=new")
# Connect to Selenium Grid instead of local Chrome
driver = webdriver.Remote(
command_executor="http://localhost:4444/wd/hub",
options=opts
)
driver.get("https://example.com")
print(driver.title)
driver.quit()
Your Python code stays the same. You just swap webdriver.Chrome() for webdriver.Remote() with the Grid URL. This lets you run dozens of browsers in parallel without changing your scraping logic.
When Grid isn't enough
Selenium Grid helps with parallelism but doesn't solve every scaling problem. You still need to manage:
- Proxy rotation across nodes to avoid IP bans.
- Memory and CPU — Each Chrome instance uses 200-500MB of RAM.
- Error handling — Nodes can crash, connections can drop, pages can timeout.
- Centralized logging — Debugging distributed scrapers is harder than debugging a single script.
For production workloads where you're scraping thousands of pages daily, a managed solution like Firecrawl handles the infrastructure, proxy rotation, and JavaScript rendering so you don't have to maintain Selenium Grid clusters.
Selenium vs. Playwright vs. Puppeteer
If you're choosing a browser automation tool for scraping in 2025+, you'll hear a lot of opinions. Here's a practical comparison based on what matters for scraping specifically.
| Feature | Selenium | Playwright | Puppeteer |
|---|---|---|---|
| Language support | Python, Java, C#, Ruby, JS | Python, JS, .NET, Java | JavaScript/TypeScript only |
| Browser support | Chrome, Firefox, Safari, Edge | Chromium, Firefox, WebKit | Chromium only |
| Auto-waiting | Manual (WebDriverWait) | Built-in | Manual |
| Speed | Slower (WebDriver protocol) | Faster (CDP + custom protocol) | Fast (CDP) |
| Headless mode | --headless=new flag | Headless by default | Headless by default |
| Setup complexity | Low (Selenium Manager) | Low (auto-downloads browsers) | Low (bundled Chromium) |
| Anti-detection | Easily detected by default | Detected, but less so | Detected, but less so |
| Community size | Largest, most mature | Growing fast | Moderate |
| Best for scraping | Learning, legacy systems | New projects, speed | Chrome-specific automation |
The developer consensus
The community is fairly clear on this. In a Hacker News discussion about Vibium (a project by Selenium's creator), developers noted that Playwright has become the de facto choice for new projects, largely due to simpler setup, better auto-waiting, and faster execution.
On r/selenium, one user put it bluntly: Playwright and Cypress are "pretty much as good as Selenium at this point," while acknowledging that Selenium's ecosystem and language support remain unmatched.
And on r/softwaretesting, a common sentiment was switching to Playwright because "wait times are handled way better."
When Selenium still wins
- Multi-language teams — If your team uses Java, C#, or Ruby, Selenium is the only option with mature bindings for all of them.
- Legacy infrastructure — Existing test suites and CI pipelines built on Selenium are expensive to migrate.
- Safari testing — Selenium has the broadest real-browser support, including Safari via SafariDriver.
- Ecosystem size — More tutorials, Stack Overflow answers, and third-party integrations exist for Selenium than any alternative.
When to choose Playwright instead
- New projects — If you're starting from scratch, Playwright's auto-waiting and faster execution are hard to ignore.
- Python or JavaScript scraping — Playwright's Python and JS APIs are well-documented and actively maintained.
- Multi-browser testing — Playwright installs Chromium, Firefox, and WebKit automatically and can run tests across all three.
The bigger picture
Both Selenium and Playwright share the same fundamental limitation for scraping: they rely on CSS selectors and XPath to locate elements. When a website redesigns its HTML, your selectors break regardless of which tool you're using.
This is the core problem that Firecrawl solves. Instead of writing selectors that break, you define a schema describing what data you want, and Firecrawl's AI extracts it regardless of the underlying HTML structure. No selectors to maintain, no scripts to fix when layouts change.
What developers are saying about Selenium scraping
The Selenium scraping landscape has shifted significantly. Here's what real developers are discussing across communities:
On brittleness and maintenance
The most common complaint about Selenium scraping is maintenance overhead. As one Reddit user in r/automation noted: Selenium was easy to start with, but as projects grew, it quickly became slow and painful to maintain.
On Hacker News, the Stagehand team (a Playwright-based AI browser tool) described the core problem well: tools like Playwright and Selenium "are notoriously prone to failure if there are minor UI or DOM changes." One commenter went further, saying that hardcoded CSS and XPath selectors "quickly break when the design changes or things are moved around."
On detection and anti-bot systems
Detection is a growing pain point. In a Hacker News thread about web scraping toolsets, one developer observed that most popular frameworks "are easily detectable (selenium, puppeteer, etc)."
A discussion on r/webscraping about fighting Cloudflare captures the frustration: developers try everything — Selenium, undetected-chromedriver, SeleniumBase, Puppeteer — and still hit Cloudflare's bot detection. The general advice: vanilla Selenium is easily detected, and you'll need stealth wrappers, residential proxies, or both.
On r/Python, one developer's advice was direct: "Selenium is painfully slow. Try Playwright with stealth plugins, it's faster and handles detection better."
On cost and resource consumption
Running headless browsers at scale is expensive. A Reddit thread titled "Headless browsers are killing my wallet" highlights the reality: developers processing thousands of pages daily with headless browsers face significant infrastructure costs. The common advice is a hybrid approach — use simple HTTP requests for static pages and save headless browsers for pages that genuinely require JavaScript rendering.
On the shift toward AI-powered scraping
There's a growing sentiment that selector-based scraping is reaching its limits. On Hacker News, one developer predicted: "No one will be writing selector based tests by hand anymore in a couple years."
Another HN commenter proposed a practical middle ground: use AI to generate selectors on the first scrape, then cache and reuse them for subsequent runs. Only when the cached selectors fail would you fall back to the AI — giving you the accuracy of AI extraction with the speed and cost-efficiency of traditional selectors.
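As a rough sketch of that caching idea (ai_generate_selectors and extract_with_selectors are hypothetical stand-ins for whatever AI and parsing layers you'd plug in):

```python
import json
from pathlib import Path

CACHE = Path("selector_cache.json")

def ai_generate_selectors(html: str) -> dict:
    """Hypothetical stand-in for an AI call that proposes selectors for a page."""
    raise NotImplementedError("plug in your LLM-based selector generator here")

def extract_with_selectors(html: str, selectors: dict) -> list:
    """Hypothetical stand-in for a parser (e.g. BeautifulSoup) applying cached selectors."""
    raise NotImplementedError("plug in your parsing layer here")

def scrape(url: str, html: str) -> list:
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}

    # Try the cached selectors first: fast and free
    if url in cache:
        try:
            data = extract_with_selectors(html, cache[url])
            if data:
                return data
        except Exception:
            pass  # Cached selectors no longer match; fall through to the AI pass

    # Fall back to the slower, costlier AI pass and refresh the cache
    selectors = ai_generate_selectors(html)
    cache[url] = selectors
    CACHE.write_text(json.dumps(cache, indent=2))
    return extract_with_selectors(html, selectors)
```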
This is essentially what Firecrawl does at the infrastructure level. Instead of maintaining selectors yourself, you define what data you want and let the system figure out how to extract it, whether the site changes its HTML or not.
Example Project: Scraping HackerNews with Selenium
Hacker News doesn't block scrapers and has maintained the same HTML structure for years. So, let’s use that for learning purposes.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
driver.get("https://news.ycombinator.com/")
# Find all story rows (class 'athing')
stories = driver.find_elements(By.CLASS_NAME, "athing")
print(f"Found {len(stories)} stories:\n")
for story in stories[:5]:
try:
# Title and link are in a span.titleline > a
title_link = story.find_element(By.CSS_SELECTOR, "span.titleline > a")
title = title_link.text
url = title_link.get_attribute("href")
# Points and comments are in the next row
story_id = story.get_attribute("id")
score_row = driver.find_element(By.ID, story_id).find_element(By.XPATH, "following-sibling::tr")
score_text = score_row.find_element(By.CLASS_NAME, "score").text
print(f"{title}")
print(f"{score_text} - {url}\n")
except Exception as e:
# Some stories don't have scores yet
print(f"{title}\n{url}\n")
continue
driver.quit()If your Selenium is set up correctly, you should see your output.
How this works:
I inspected the HackerNews homepage and identified the HTML tags they use to list their stories and posts.

Then, I used the find_element method to fetch individual elements from the homepage and piece them together.
For instance, this line pulls the linked story using the CSS selector span.titleline > a
title_link = story.find_element(By.CSS_SELECTOR, "span.titleline > a")
Once all the variables are populated, what's left is formatting and displaying them.
When this might break:
The HTML structure of HackerNews has remained the same for over a decade. But if they decide to revamp their website and the tags change, our Selenium script would need to be edited. That's the brittleness I was referring to.
Don’t want to maintain selectors? See how Firecrawl simplifies this.
Firecrawl: A simpler, more robust web scraping solution
After eight months of scraping, maintenance was consuming too much time, and raw Selenium was becoming a headache at the scale I wanted to achieve.
Back when I was scraping in 2016, I didn’t have Firecrawl, but you do now. So, let me show you how it works, and you decide if it’s a better scraper.
To give you a gist:
- Selenium requires maintaining selectors, wait conditions, and browser infrastructure.
- Firecrawl uses AI to extract data based on schemas you define.
How Firecrawl works (no CSS selectors required)
Firecrawl uses AI to understand and extract page structure using natural language.
So, instead of writing CSS selectors that break when designs change, you describe what data you want to extract in plain English and define a schema for the output format.
Here's what you can do with Firecrawl’s AI:
- Parse complex page layouts without manual selector mapping
- Extract structured data ready for databases, APIs, or ML pipelines
- Adapt automatically when websites redesign their HTML
- Handle dynamic content, infinite scroll, and JavaScript-rendered pages
For even more powerful workflows, Firecrawl’s Agent endpoint can navigate multi-page workflows, fill forms, and extract data across entire user journeys.
Let's scrape HackerNews as we did before, extracting the same elements, but this time using natural language instead of CSS selectors.
- Sign up at firecrawl.dev and grab your API key from the dashboard.
- Install the Python SDK:
pip install firecrawl-py python-dotenv
Store your API key in a .env file to keep it out of your code:
echo "FIRECRAWL_API_KEY=fc-your-key-here" >> .envThen load it in Python:
from dotenv import load_dotenv
from firecrawl import Firecrawl
load_dotenv()  # Reads FIRECRAWL_API_KEY from the .env file
app = Firecrawl()
print("Extracting structured data from https://news.ycombinator.com/ ...")
try:
# Use the agent endpoint to get structured data (Title, URL, Score)
# We ask for only 5 items to save tokens/credits
data = app.agent(
urls=['https://news.ycombinator.com/'],
prompt="Extract the top 5 stories from the page.",
schema={
'type': 'object',
'properties': {
'items': {
'type': 'array',
'items': {
'type': 'object',
'properties': {
'title': {'type': 'string'},
'url': {'type': 'string'},
'score': {'type': 'string'}
},
'required': ['title', 'url']
},
'description': 'Top 5 stories from Hacker News'
}
},
'required': ['items']
}
)
# Verify we got data
if hasattr(data, 'data') and 'items' in data.data:
print("\n--- Extracted Stories ---\n")
for item in data.data['items']:
title = item.get('title', 'No Title')
score = item.get('score', 'No Score')
url = item.get('url', 'No URL')
print(f"{title}")
print(f"{score} - {url}\n")
else:
print("No items found in response:", data)
except Exception as e:
print(f"Agent request failed: {e}")Firecrawl's AI looks at the page structure and extracts according to your schema. When sites are redesigned, the schema stays the same and you can run the same script to get the same data, as long as the data exists on the page being scraped.
If HackerNews decides to update its HTML a couple of months from now, I can continue to scrape without any changes to my script because the AI figures out how to get the data I’ve asked for in the schema.
Selenium vs. Firecrawl: How do you choose the right tool for your web scraping project?
| Category | Selenium | Firecrawl |
|---|---|---|
| Setup Time | 30+ min (drivers, dependencies, testing) | 5 min (API key, one import) |
| Maintenance | Breaks on HTML changes and requires selector updates | Schema-based, adapts to redesigns automatically |
| Dynamic Content | Manual WebDriverWait configurations | Handles JavaScript rendering automatically |
| Scale | Requires infrastructure (proxies, browsers, memory) | Cloud-based, no infrastructure needed |
| Best For | Learning the fundamentals, simple one-time scrapes, UI testing | Production scraping, LLM data pipelines, multi-site scraping |
| Cost | Engineer time × hourly rate | Per-request pricing (predictable) |
When to use Selenium
- Learning web scraping fundamentals
- Sites with simple, stable HTML structures
When to use Firecrawl
- Scraping 10+ websites regularly
- Production data pipelines feeding databases or LLMs
- Sites that frequently redesign or use complex JavaScript
- Complex workflows that require navigating across pages, filling forms, or following links
- Jobs where you need structured data from the web without knowing the exact URLs upfront
Firecrawl's Agent endpoint: scraping without selectors
The biggest limitation with Selenium (and Playwright, and Puppeteer) is that you have to tell the browser exactly what to do. Click this button. Wait for that element. Extract text from this CSS selector. If the site changes, your script breaks.
Firecrawl's Agent endpoint flips this entirely. Instead of writing step-by-step browser instructions, you describe what data you want and the agent figures out how to get it. It maps sites, navigates pages, handles pagination, and extracts structured data — all from a single API call.
Here's what that looks like in practice:
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
result = app.agent(
urls=["https://news.ycombinator.com"],
prompt="Get the top 10 stories with their titles, URLs, scores, and comment counts",
schema={
"type": "object",
"properties": {
"stories": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": {"type": "string"},
"url": {"type": "string"},
"score": {"type": "integer"},
"comments": {"type": "integer"}
}
}
}
}
}
)
for story in result.data["stories"]:
print(f"{story['title']} — {story['score']} points")Compare that to the Selenium version above — no WebDriverWait, no find_elements, no CSS selectors, no pagination logic. The agent handles all of it. And when HackerNews eventually changes its HTML, this code keeps working because it doesn't depend on any specific DOM structure.
The agent is especially useful for research-style scraping where you don't know the exact pages you need. Give it a starting URL and a prompt like "find all pricing information across this site" and it will map the site, identify the relevant pages, navigate to them, and return structured results.
Parallel scraping at scale
Selenium Grid lets you run browsers in parallel, but you're still managing Docker containers, allocating memory, handling crashes, and rotating proxies yourself. Firecrawl handles parallelism at the infrastructure level.
With the batch scrape endpoint, you can send hundreds of URLs in a single request and Firecrawl processes them concurrently:
import requests
response = requests.post(
"https://api.firecrawl.dev/v1/batch/scrape",
headers={
"Authorization": "Bearer fc-YOUR_API_KEY",
"Content-Type": "application/json",
},
json={
"urls": [
"https://example.com/page-1",
"https://example.com/page-2",
"https://example.com/page-3",
# ... hundreds more
],
"formats": ["markdown"],
}
)
batch_id = response.json()["id"]
# Poll for results or use webhooks
No Selenium Grid setup. No Docker. No memory management. No proxy rotation config. Firecrawl handles the browser instances, retries failed requests, rotates IPs, and returns clean results. For teams scraping thousands of pages daily, this eliminates the entire infrastructure layer that makes Selenium Grid painful to maintain.
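If you poll rather than use webhooks, a minimal sketch looks like this. The status endpoint and response fields shown here may differ slightly by API version, so check Firecrawl's docs for your setup:

```python
import time
import requests

# Assumes `batch_id` from the request above
status_url = f"https://api.firecrawl.dev/v1/batch/scrape/{batch_id}"
headers = {"Authorization": "Bearer fc-YOUR_API_KEY"}

while True:
    status = requests.get(status_url, headers=headers).json()
    if status.get("status") == "completed":
        for page in status.get("data", []):
            source = page.get("metadata", {}).get("sourceURL", "unknown")
            print(source, "->", len(page.get("markdown", "")), "chars of markdown")
        break
    time.sleep(5)  # Wait a few seconds before checking again
```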
Pick the tool that's cheapest on your resources, and if the costs are close, factor in what engineers could build instead of maintaining scrapers.
Ready to scale beyond Selenium? Try Firecrawl for free.
Frequently Asked Questions
What is Selenium used for in web scraping projects?
Selenium automates browser interactions to scrape data from JavaScript-heavy websites where basic HTML fetching fails. It launches a real browser, waits for dynamic content to load, and interacts with pages like a normal user would.
What's the difference between headless and regular browser mode?
Headless mode runs Chrome without a visible window, making it perfect for servers, CI pipelines, and background jobs. It uses identical rendering to regular Chrome but consumes less memory and allows running multiple browsers in parallel.
Can Selenium handle websites with infinite scrolling?
Yes. With the driver.execute_script() method, Selenium lets you scroll to the bottom of the page, wait for new content, check if page height increased, and loop until no more content loads. This works for social media feeds, product catalogs, and search results.
What's the best locator strategy for Selenium web scraping?
Prefer IDs when available since they're unique and fast. Use CSS selectors for readable, attribute-based targeting. Use XPath for complex text matching or DOM traversal. Avoid dynamically generated class names that change between page loads.
Is Selenium or Playwright better for web scraping?
Playwright is generally faster and has better auto-waiting, but Selenium has broader language support and a larger ecosystem. For new scraping projects, Playwright offers a smoother developer experience. For legacy systems or teams already using Selenium, it remains a solid choice.
How can I avoid detection when using Selenium for web scraping?
Remove the navigator.webdriver property, use realistic user-agent strings, add random delays between actions, rotate proxies, and reuse sessions. The --disable-blink-features=AutomationControlled flag also helps hide Selenium's fingerprint.
How do I combine Selenium with BeautifulSoup?
Use Selenium to render the page and handle JavaScript, then pass driver.page_source to BeautifulSoup for fast HTML parsing. This hybrid approach gives you the best of both: Selenium handles dynamic rendering while BeautifulSoup handles efficient data extraction.
How do I export Selenium scraped data to CSV or JSON?
Use Python's built-in csv module or json module. Collect your scraped data into a list of dictionaries, then write it out with csv.DictWriter for CSV or json.dump for JSON. For larger datasets, pandas DataFrames offer additional cleaning and export options.
How do I scale Selenium scraping with Selenium Grid?
Selenium Grid lets you run multiple browser sessions in parallel across machines. Use Docker with selenium/standalone-chrome for quick setups, or the hub/node pattern for distributed workloads. Each node runs in isolation, preventing memory leaks and browser conflicts.
How does Firecrawl differ from Selenium for web scraping?
Firecrawl uses AI to extract data based on schemas you define in natural language, eliminating brittle CSS selectors. When sites redesign, Firecrawl adapts automatically while Selenium requires manual selector updates. Firecrawl is better for production and multi-site scraping at scale.

data from the web