Web Scraping Change Detection with Firecrawl

May 2, 2025 • Bex Tuychiev

Introduction to Change Detection

Today, many AI applications rely on up-to-date information. For example, RAG systems need current web data to provide accurate responses, and knowledge bases must reflect the latest product information. However, with thousands of downloaded pages, identifying when original sources have changed is difficult.

This issue affects real-world systems. A customer support chatbot may give incorrect answers if it uses outdated documentation. A financial analysis tool might miss important changes if it doesn’t capture the latest market reports. In healthcare, treatment recommendations must be based on the most current clinical guidelines and research.

To address these challenges, we’ve integrated change detection into our web scraping API. Here’s how it works:

from firecrawl import FirecrawlApp

# Initialize the app
app = FirecrawlApp()

# Check if a page has changed
base_url = "https://bullet-echo.fandom.com/wiki/Special:AllPages"

result = app.scrape_url(
    base_url,
    formats=["changeTracking", "markdown"],
)

tracking_data = result.changeTracking
print(tracking_data)

This outputs:

ChangeTrackingData(previousScrapeAt='2025-05-03T15:47:57.21162+00:00', changeStatus='changed', visibility='visible', diff=None, json=None)

The code snippet first initializes the FirecrawlApp, then scrapes the specified URL with the formats "changeTracking" and "markdown". Including "changeTracking" in the formats list enables Firecrawl to compare the current scrape with the previous version of the page, if one exists.

The resulting tracking_data object provides several key fields:

  • previousScrapeAt: The timestamp of the previous scrape that this page is being compared against (or None if this is the first scrape).
  • changeStatus: Indicates the result of the comparison. Possible values are:
    • "new": The page was not previously scraped.
    • "same": The page content has not changed since the last scrape.
    • "changed": The page content has changed since the last scrape.
    • "removed": The page was present before but is now missing.
  • visibility: Shows whether the page is currently discoverable ("visible") or not ("hidden").

This allows you to programmatically detect if a page has changed, is new, or has been removed, and to take action accordingly.
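
For example, a monitoring job can branch on these fields before deciding whether to re-download a page. Here is a minimal sketch (the URL and output path are placeholders):

from firecrawl import FirecrawlApp

app = FirecrawlApp()

# Hypothetical page we want to keep in sync locally
url = "https://example.com/docs/getting-started"

result = app.scrape_url(url, formats=["changeTracking", "markdown"])
status = result.changeTracking.changeStatus

if status in ("new", "changed"):
    # First scrape or updated content: refresh the local copy
    with open("getting-started.md", "w") as f:
        f.write(result.markdown)
elif status == "removed":
    # The page disappeared since the last scrape: flag it for review
    print(f"{url} was removed from the source site")
else:  # "same"
    print(f"{url} is unchanged, nothing to do")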

Note: If you are completely new to Firecrawl, check out our documentation.

In this tutorial, we’ll build a wiki monitoring system that tracks changes on a gaming wiki page hosted on Fandom. Our system will perform weekly and monthly scrapes, intelligently identifying which pages have changed and only downloading the updated content.

About the Web Dataset

Bullet Echo game screenshot showing a top-down battle royale gameplay with characters and weapons

Bullet Echo is my favorite mobile game with excellent gameplay. It features a top-down 2D multiplayer battle royale mode where fast reflexes and stealth are crucial. Game trivia and documentation are maintained by the community on the Bullet Echo Fandom wiki, which contains over 180 articles:

Screenshot of the Bullet Echo Fandom wiki's "All Pages" section showing a list of over 180 articles about the game

Our aim is to download all of those pages as markdown files and build a system that updates them on a schedule, since new game features are added frequently.

Building a Change Detection System Step-by-Step

Now, without further ado, let’s start building our project in incremental steps.

Step 1: Project setup and configuration

In this section, we’ll set up our project environment for the change detection system. We’ll be using uv, a fast Python package installer and resolver, to create and manage our Python project. We’ll also configure the necessary environment for using the Firecrawl API.

  1. Install uv:

    pip install uv
    
  2. Create the project directory structure:

    mkdir -p change-detection
    cd change-detection
    
  3. Initialize the project with uv:

    uv init --python=3.12
    

    This creates a basic pyproject.toml file and sets Python 3.12 as our project’s Python version.

  4. Add project dependencies using uv:

    uv add firecrawl-py pydantic python-dotenv
    
  5. Create the data and code directories:

    mkdir -p data src
    
  6. Set up environment variables: Create a .env file in your project root to store your Firecrawl API key:

    echo "FIRECRAWL_API_KEY=your_api_key_here" > .env
    
  7. Create a .gitignore file:

    echo ".env" > .gitignore
    echo ".venv" > .gitignore
    
  8. Obtain a Firecrawl API key:

    • Visit Firecrawl’s website
    • Sign up for an account or log in
    • Navigate to your account settings
    • Generate a new API key
    • Copy this key to your .env file
  9. Set up Git repository:

    git init
    git add .
    git commit -m "Initial project setup"
    
  10. Initialize repository on GitHub:

    • Create a new repository on GitHub

    • Push your local repository to GitHub:

      git remote add origin https://github.com/yourusername/change-detection.git
      git push -u origin main
      
    • In your GitHub repository, go to Settings > Secrets and variables > Actions

    • Add a new repository secret named FIRECRAWL_API_KEY with your Firecrawl API key

With our project environment set up, we’re now ready to create our data models and implement the core functionality of our change detection system in the following steps.

Step 2: Defining data models and utilities

Screenshot of the Bullet Echo wiki's "All Pages" section showing an alphabetical list of wiki articles

The first order of business is scraping the links to each individual page from the above “All pages” URL located at https://bullet-echo.fandom.com/wiki/Special:AllPages.

Let’s define the data models and utility functions that will power our change detection system. These components will help us structure data and handle common operations in our scraping workflow.

Data models for structured data extraction

We’ll start by creating models.py, which defines Pydantic models that will structure our scraped data:

from pydantic import BaseModel
from typing import List


class Article(BaseModel):
    url: str
    title: str


class ArticleList(BaseModel):
    articles: List[Article]

This code creates two key models:

  • Article: Represents a single wiki article with its URL and title
  • ArticleList: A collection of Article objects

These Pydantic models serve a crucial purpose: they define the schema for Firecrawl’s structured data extraction. When we use batch scraping with extraction, Firecrawl can populate our data models directly. This approach is much more robust than parsing HTML manually.

For example, we’ll use this model as the schema in our extraction in another script:

# Example usage in our scraping code
result = app.batch_scrape_urls(
    [BASE_URL],
    formats=["extract"],
    extract={"schema": ArticleList.model_json_schema()},
)

# Now we can access the structured data
all_articles = [a["url"] for a in result.data[0].extract["articles"]]

Utility functions for change detection

Next, we’ll create utils.py with essential helper functions:

from pathlib import Path


def is_changed(firecrawl_app, url):
    result = firecrawl_app.scrape_url(url, formats=["changeTracking", "markdown"])
    return result.changeTracking.changeStatus == "changed"

This is_changed() function is the heart of our change detection system. It:

  1. Scrapes a URL with the changeTracking format enabled
  2. Examines the returned changeStatus, which can be:
    • "new": First time seeing this page
    • "same": No changes detected
    • "changed": Content has been updated
    • "removed": Page no longer exists
  3. Returns True only when content has changed

The remaining helpers in utils.py handle writing scraped content to disk:

def save_markdown(markdown, path):
    with open(path, "w") as f:
        f.write(markdown)


def save_status_data(status_data, path):
    # Create the directory if it doesn't exist
    Path(path).mkdir(parents=True, exist_ok=True)

    for s in status_data.data:
        url = s.metadata.get("url")
        title = s.metadata.get("og:title")

        if url and title:
            # Get a clean title
            title = title.replace(" ", "-").replace("/", "-").replace("|", "-")

            filename = Path(path) / f"{title}.md"
            # Check if the file already exists
            if not filename.exists():
                save_markdown(s.markdown, filename)

These additional functions handle:

  • save_markdown(): Saving markdown content to disk
  • save_status_data(): Processing batch scraping results, extracting metadata, and saving only new content to appropriately named files

Together, these models and utility functions give us a powerful foundation for building our change detection system. The Pydantic models provide a clean structure for extracted data, while the utility functions handle the mechanics of change detection and file management.
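
If you want to sanity-check these helpers before wiring them into the scraping scripts, a quick manual test might look like this (a minimal sketch; it assumes FIRECRAWL_API_KEY is set in your .env and that you run it from inside src/ so the utils import resolves):

from dotenv import load_dotenv
from firecrawl import FirecrawlApp

from utils import is_changed, save_markdown

load_dotenv()
app = FirecrawlApp()

url = "https://bullet-echo.fandom.com/wiki/Special:AllPages"

# is_changed() returns True only when the page status is "changed"
print("Changed since last scrape:", is_changed(app, url))

# Save the current markdown snapshot locally
result = app.scrape_url(url, formats=["markdown"])
save_markdown(result.markdown, "all_pages_snapshot.md")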

In the next section, we’ll use these components to implement the core scraping logic that will monitor the Bullet Echo wiki for changes.

Step 3: Initial base URL scraping

Now let’s implement the monthly scraping logic that fetches the complete list of wiki articles and tracks changes. This script will be run once a month to ensure we capture all new and updated content.

Let’s create src/monthly_scrape.py:

import time
from pathlib import Path

from firecrawl import FirecrawlApp
from dotenv import load_dotenv
from models import ArticleList
from utils import is_changed, save_status_data

# Load environment variables
load_dotenv()

# Configuration variables
BASE_URL = "https://bullet-echo.fandom.com/wiki/Special:AllPages"

# Data directory
DATA_DIR = Path("data")

# Files and Paths
ARTICLES_LIST_FILE = DATA_DIR / "all_articles.txt"
OUTPUT_DIRECTORY = DATA_DIR / "bullet-echo-wiki"
OUTPUT_DIRECTORY.mkdir(exist_ok=True, parents=True)

# Job Parameters
TIMEOUT_SECONDS = 180  # 3 minutes timeout
POLLING_INTERVAL = 30  # Seconds between status checks

# Initialize Firecrawl app
app = FirecrawlApp()

This initializes the Firecrawl client, which requires an API key that will be loaded from our environment variables. We’re also setting up constants for file paths and job parameters.

Now let’s examine the core Firecrawl functionality used in our functions:

def get_article_list():
    """Get the list of articles from the wiki or from the cached file."""
    if is_changed(app, BASE_URL) or not ARTICLES_LIST_FILE.exists():
        print("The wiki pages list has changed. Scraping the wiki pages list...")
        # Scrape the wiki pages list
        result = app.batch_scrape_urls(
            [BASE_URL],
            formats=["extract"],
            extract={"schema": ArticleList.model_json_schema()},
        )

        # Extract all article URLs
        all_articles = [a["url"] for a in result.data[0].extract["articles"]]
        print(f"Found {len(all_articles)} articles")

        # Write the links to a text file
        with open(ARTICLES_LIST_FILE, "w") as f:
            for article in all_articles:
                f.write(article + "\n")

        return all_articles
    else:
        print("The wiki pages list has not changed. Scraping from existing list...")
        with open(ARTICLES_LIST_FILE, "r") as f:
            return [line.strip() for line in f.readlines()]

This function uses two key Firecrawl methods:

  1. The is_changed() utility function (which we defined) calls app.scrape_url() with the ["changeTracking", "markdown"] formats. The changeTracking format returns metadata about whether a URL has changed since it was last scraped.

  2. The app.batch_scrape_urls() method accepts:

    • A list of URLs to scrape (here just one URL)
    • A formats parameter specifying output formats, here using extract
    • An extract parameter containing the schema for structured data extraction derived from our Pydantic model

The batch scraping function with extraction intelligently processes the page content and extracts structured data according to our schema, eliminating the need for manual HTML parsing.
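
If you prefer working with typed objects instead of raw dictionaries, the extract payload can also be validated back through the Pydantic model (a small sketch, assuming the payload matches the schema):

from models import ArticleList

# result comes from the batch_scrape_urls call in get_article_list()
article_list = ArticleList.model_validate(result.data[0].extract)
all_articles = [article.url for article in article_list.articles]

Next, the script defines the function that runs the batch scrape over all article URLs and monitors its progress: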

def scrape_and_monitor_articles(article_urls):
    """Scrape articles and monitor the process until completion or timeout."""
    print(f"Scraping {len(article_urls)} articles...")

    # Start the batch scrape job
    job = app.async_batch_scrape_urls(article_urls)
    start_time = time.time()

    # Monitor the job status and save results
    while True:
        status = app.check_batch_scrape_status(job.id)

        # Save whatever results are available so far (including the final batch)
        save_status_data(status, OUTPUT_DIRECTORY)

        if status.status == "completed":
            print("Batch scrape completed successfully!")
            break

        # Check if timeout has been reached
        if time.time() - start_time > TIMEOUT_SECONDS:
            print(f"Timeout of {TIMEOUT_SECONDS} seconds reached. Exiting.")
            break

        print("Waiting for batch scrape to complete...")
        time.sleep(POLLING_INTERVAL)

This function demonstrates Firecrawl’s asynchronous batch processing:

  1. app.async_batch_scrape_urls() initiates an asynchronous job and returns a job object with an ID
  2. app.check_batch_scrape_status(job.id) polls the job status until completion
  3. The returned status object contains partial results that can be processed incrementally

This approach is efficient for large batches of URLs, as it allows monitoring progress and handling partial results before the entire job completes.

The combination of change detection and structured data extraction capabilities provides an efficient way to monitor and update our wiki data, without having to redownload unchanged content.
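
To tie these pieces together, the script needs an entry point that fetches the article list and kicks off the batch scrape. A minimal version (mirroring the weekly script in the next step) could look like this:

def main():
    article_urls = get_article_list()
    scrape_and_monitor_articles(article_urls)


if __name__ == "__main__":
    main()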

In the next step, we’ll implement the weekly scraping script that focuses on checking individual pages for changes more frequently.

Step 4: Individual page scraping and scheduling

Now that we have our monthly scraping system in place, let’s implement a more frequent weekly check that focuses on detecting changes to individual pages without re-scraping everything.

Let’s create src/weekly_scrape.py:

from pathlib import Path

from firecrawl import FirecrawlApp
from dotenv import load_dotenv
from utils import save_markdown

# Load environment variables
load_dotenv()

# Configuration variables
# URLs
BASE_URL = "https://bullet-echo.fandom.com/wiki/Special:AllPages"

# Data directory
DATA_DIR = Path("data")

# Files and Paths
ARTICLES_LIST_FILE = DATA_DIR / "all_articles.txt"
OUTPUT_DIRECTORY = DATA_DIR / "bullet-echo-wiki"
OUTPUT_DIRECTORY.mkdir(exist_ok=True, parents=True)

# Job Parameters
TIMEOUT_SECONDS = 180  # 3 minutes timeout
POLLING_INTERVAL = 30  # Seconds between status checks

# Initialize Firecrawl app
app = FirecrawlApp()

This setup is similar to our monthly script, but notice that we’re only importing the save_markdown utility since we’ll be handling files differently in this script.

Now let’s implement the function to read our article list:

def read_article_list():
    """Read the article list from the file."""
    if ARTICLES_LIST_FILE.exists():
        with open(ARTICLES_LIST_FILE, "r") as f:
            return [line.strip() for line in f.readlines()]
    else:
        return []

Unlike the monthly script, we don’t need to check if the article list has changed or regenerate it. We simply read the existing list created by the monthly job. If the file doesn’t exist, we return an empty list.

Now for the core function that checks each article for changes:

def scrape_and_monitor_articles(article_urls):
    """Scrape and monitor the articles."""
    for url in article_urls:
        print(f"Checking {url} for changes...")
        try:
            scrape_result = app.scrape_url(url, formats=["markdown", "changeTracking"])
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            continue

        if scrape_result.changeTracking.changeStatus == "changed":
            print(f"The article {url} has changed. Saving the new version...")
            title = (
                url.split("/")[-1]
                .replace("%27s", "'")
                .replace("_", "-")
                .replace(" ", "-")
            )
            save_markdown(scrape_result.markdown, OUTPUT_DIRECTORY / f"{title}.md")
        else:
            print(f"The article {url} has not changed.")
    print("All articles have been checked for changes.")

This function processes each article individually rather than in batch:

  1. It loops through each URL in our article list
  2. For each URL, it calls app.scrape_url() with ["markdown", "changeTracking"] formats
  3. It checks the changeStatus to determine if the page has changed since the last scrape
  4. If the page has changed, it extracts a clean title from the URL and saves the updated markdown content

This approach differs significantly from our monthly batch process:

  • It processes URLs sequentially instead of in batch
  • It only saves pages that have changed, skipping unchanged content
  • It directly extracts the title from the URL rather than from metadata

Finally, let’s implement the main function:

def main():
    article_urls = read_article_list()
    scrape_and_monitor_articles(article_urls)


if __name__ == "__main__":
    main()

The weekly script is more focused and lightweight than the monthly script. While the monthly script establishes a comprehensive baseline of all content, the weekly script efficiently checks for changes to individual pages and only downloads content that has been updated.

This two-tiered approach optimizes our change detection system:

  1. Monthly scans rebuild the article list and provide a comprehensive content update
  2. Weekly scans focus on detecting and updating only changed content

By combining these two processes, we achieve both thoroughness and efficiency in our wiki monitoring system.

In the next step, we’ll set up GitHub Actions workflows to run both scripts automatically on a schedule.

Step 5: Setting up GitHub Actions to schedule scripts

Now that we have our scraping scripts ready, we need a way to run them automatically on a regular schedule. GitHub Actions provides an excellent way to automate these tasks directly from our repository.

Let’s set up GitHub Actions workflows to run our scripts automatically:

  1. First, create the necessary directories:

    mkdir -p .github/workflows
    
  2. Create the monthly scraping workflow file at .github/workflows/monthly-scrape.yml:

name: Monthly Scraping Bullet Echo Base URL

on:
  schedule:
    # Run at midnight (00:00) on the first day of every month
    - cron: "0 0 1 * *"
  workflow_dispatch: # Allow manual triggering

permissions:
  contents: write

jobs:
  scrape:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.12"
          cache: "pip"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install .

      - name: Run scraping script
        run: python src/monthly_scrape.py
        env:
          # Include any environment variables needed by the script
          # These should be set as secrets in your repository settings
          FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }}

      - name: Commit and push changes
        run: |
          git config --global user.name 'github-actions'
          git config --global user.email 'github-actions@github.com'
          git add data/
          git commit -m "Update data from monthly scrape" || echo "No changes to commit"
          git push

  3. Create the weekly scraping workflow file at .github/workflows/weekly-scrape.yml:

name: Weekly Scraping Bullet Echo Base URL

on:
  schedule:
    # Run at midnight (00:00) every Monday
    - cron: "0 0 * * 1"
  workflow_dispatch: # Allow manual triggering

permissions:
  contents: write

jobs:
  scrape:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.12"
          cache: 'pip'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install .

      - name: Run scraping script
        run: python src/weekly_scrape.py
        env:
          FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }}

      - name: Commit and push changes
        run: |
          git config --global user.name 'github-actions'
          git config --global user.email 'github-actions@github.com'
          git add data/
          git commit -m "Update data from weekly scrape" || echo "No changes to commit"
          git push

Let’s break down the workflow file:

Scheduling with Cron

on:
  schedule:
    # Run at midnight (00:00) on the first day of every month
    - cron: "0 0 1 * *"
  workflow_dispatch: # Allow manual triggering

The cron syntax is structured as minute hour day-of-month month day-of-week:

  • 0 0: At 00:00 (midnight)
  • 1: On the first day of the month
  • * *: Every month, every day of the week

For the weekly workflow, the cron expression 0 0 * * 1 runs every Monday at midnight.

Running Workflows Manually

Both workflows include the workflow_dispatch trigger, which allows you to run them manually through the GitHub UI. To run a workflow manually:

  1. Navigate to your repository on GitHub
  2. Click on the “Actions” tab at the top of the repository page
  3. From the left sidebar, select the workflow you want to run (either “Monthly Scraping Bullet Echo Base URL” or “Weekly Scraping Bullet Echo Base URL”)
  4. Click the “Run workflow” button on the right side of the page
  5. Select the branch to run the workflow on (usually “main”)
  6. Click the green “Run workflow” button to start the process

This is particularly useful for testing your workflows or running them outside of their scheduled times.

Permissions and Workflow Steps

The permissions and steps work the same for both workflows, with the only difference being which script gets executed.

By setting up both of these workflows, you create a fully automated system where:

  1. On the first day of each month, GitHub Actions runs the monthly scrape that rebuilds the article list and performs a full content update
  2. Every Monday, GitHub Actions runs the weekly scrape that efficiently checks for and updates only changed content

The results are automatically committed back to the repository, creating a version-controlled history of the data changes over time.

In the next step, we’ll discuss storage strategies for our scraped data and how to make it more accessible for downstream applications.

Step 6: Storage strategies for production use

Our current implementation stores scraped wiki content as markdown files in a local directory structure. While this approach works for our demonstration, it has several limitations for production use:

  1. Limited scalability: Local filesystem storage doesn’t scale well for large datasets with thousands of pages.
  2. Limited accessibility: Files stored locally aren’t easily accessible by other systems or applications.
  3. No indexing: Simple file storage makes it difficult to search or query content efficiently.
  4. Limited metadata: Our current approach stores minimal metadata about changes and content.
  5. No versioning: We overwrite files when content changes, losing historical versions.

For a production environment, we should consider more robust storage solutions:

Database Storage

Implementing a database solution would allow for better organization and retrieval:

# Example using SQLAlchemy with SQLite (for simplicity)
from sqlalchemy import create_engine, Column, Integer, String, Text, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
import datetime

Base = declarative_base()

class WikiArticle(Base):
    __tablename__ = 'wiki_articles'
    
    id = Column(Integer, primary_key=True)
    url = Column(String, unique=True, index=True)
    title = Column(String)
    content = Column(Text)
    last_updated = Column(DateTime)
    last_checked = Column(DateTime)
    change_status = Column(String)  # new, same, changed, removed
    
    def __repr__(self):
        return f"<WikiArticle(title='{self.title}')>"

# Initialize database
engine = create_engine('sqlite:///data/wiki_articles.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

This would allow us to:

  • Store article content with comprehensive metadata
  • Track last updated and checked timestamps
  • Query articles by various criteria
  • Build an API layer for downstream applications
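
As a rough illustration, upserting a scraped article with this model might look like the following sketch (it reuses the WikiArticle model and Session factory defined above):

from datetime import datetime, timezone

def upsert_article(url, title, content, change_status):
    """Insert or update a wiki article row after a scrape."""
    session = Session()
    try:
        article = session.query(WikiArticle).filter_by(url=url).first()
        now = datetime.now(timezone.utc)

        if article is None:
            article = WikiArticle(url=url, title=title)
            session.add(article)

        article.content = content
        article.change_status = change_status
        article.last_checked = now
        if change_status in ("new", "changed"):
            article.last_updated = now

        session.commit()
    finally:
        session.close()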

Cloud Storage Options

For production, consider cloud storage solutions:

  1. Amazon S3 or similar object storage:

    import boto3
    
    def save_to_s3(content, title, bucket_name="wiki-content"):
        s3 = boto3.client('s3')
        s3.put_object(
            Body=content,
            Bucket=bucket_name,
            Key=f"articles/{title}.md",
            ContentType="text/markdown"
        )
    
  2. Vector databases for semantic search:

    from datetime import datetime

    from pinecone import Pinecone

    def store_with_embeddings(content, title, url):
        # Generate an embedding with your model of choice;
        # get_embedding() is a placeholder for e.g. an OpenAI embeddings call
        embedding = get_embedding(content)

        # Store the vector in Pinecone alongside some metadata
        pc = Pinecone(api_key="your-api-key")
        index = pc.Index("wiki-articles")
        index.upsert(
            vectors=[{
                "id": url,
                "values": embedding,
                "metadata": {"title": title, "last_updated": datetime.now().isoformat()}
            }]
        )
    

Content API for Downstream Applications

Create a simple REST API to serve the content:

from fastapi import Depends, FastAPI, HTTPException
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker

# Reuses the WikiArticle model and database from the example above
engine = create_engine("sqlite:///data/wiki_articles.db")
SessionLocal = sessionmaker(bind=engine)

app = FastAPI()


def get_session():
    # Provide a database session per request and close it afterwards
    session = SessionLocal()
    try:
        yield session
    finally:
        session.close()


@app.get("/articles/")
def list_articles(session: Session = Depends(get_session)):
    articles = session.query(WikiArticle).all()
    return articles


@app.get("/articles/{title}")
def get_article(title: str, session: Session = Depends(get_session)):
    article = session.query(WikiArticle).filter(WikiArticle.title == title).first()
    if not article:
        raise HTTPException(status_code=404, detail="Article not found")
    return article


@app.get("/search/")
def search_articles(query: str, session: Session = Depends(get_session)):
    # Naive full-text search using a LIKE filter
    articles = session.query(WikiArticle).filter(
        WikiArticle.content.like(f"%{query}%")
    ).all()
    return articles

Change History and Versioning

Implement versioning to retain historical content:

import datetime

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text

# Extends the Base declarative class from the database example above
class WikiArticleVersion(Base):
    __tablename__ = 'wiki_article_versions'

    id = Column(Integer, primary_key=True)
    article_id = Column(Integer, ForeignKey('wiki_articles.id'))
    content = Column(Text)
    version_date = Column(DateTime, default=datetime.datetime.utcnow)
    change_description = Column(String)
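
When a page comes back with changeStatus == "changed", you can snapshot the old content as a version before overwriting it. A minimal sketch (reusing the session and models above):

def record_version(session, article, new_content, change_description=""):
    # Keep the previous content as a historical version
    if article.content:
        session.add(
            WikiArticleVersion(
                article_id=article.id,
                content=article.content,
                change_description=change_description,
            )
        )
    article.content = new_content
    session.commit()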

By implementing these more robust storage solutions, our change detection system becomes far more useful for production applications:

  1. RAG systems could directly query the API for up-to-date content
  2. Analytics tools could monitor and visualize change patterns
  3. Notification systems could alert content owners about updates
  4. Content pipelines could automatically process and transform updated content

These improvements transform our simple file-based demo into a production-ready system that can reliably feed data into downstream applications while maintaining a complete history of content changes.

Conclusion

That concludes our step-by-step guide to building a change detection system with Firecrawl. We’ve created a robust solution that efficiently monitors web content for changes and only downloads updated pages. The system uses monthly comprehensive scans to establish a baseline and weekly incremental checks to catch recent changes. By leveraging Firecrawl’s change tracking and structured data extraction capabilities, we’ve eliminated the need for complex HTML parsing and manual change detection logic.

For production implementation, consider upgrading from local file storage to databases or cloud storage solutions that better support indexing, versioning, and accessibility. To further explore Firecrawl’s capabilities, visit the Firecrawl website and read the change tracking documentation. The code from this tutorial is available on GitHub and can serve as a starting point for your own web monitoring projects.


About the Author

Bex Tuychiev (@bextuychiev)

Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics.
