O4 Mini Web Crawler
O4 Mini Web Crawler maps and scrapes websites using the Firecrawl API, then applies OpenAI's o4-mini model to identify and extract the information relevant to a user's objective.
Description
A simple web crawler that uses Firecrawl and OpenAI’s o4-mini model to search websites based on user objectives.
Features
- Maps websites to find relevant URLs
- Uses AI to rank URLs by relevance to the objective
- Scrapes content and analyzes it with o4-mini
- Returns structured data when objectives are met
Prerequisites
- Python 3.6+
- Firecrawl API key
- OpenAI API key
Installation
- Clone this repository
- Install the required packages:
pip install -r requirements.txt
- Copy .env.example to .env and fill in your API keys:
cp .env.example .env
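Before running the crawler, it can help to confirm that both keys are visible to Python. The sketch below is only an illustration: the variable names FIRECRAWL_API_KEY and OPENAI_API_KEY are assumptions (check .env.example for the names the script actually reads), and it inspects the process environment, so the values from .env must already be exported or loaded.

```python
import os

# Assumed variable names; .env.example lists the names the script really uses.
REQUIRED_KEYS = ("FIRECRAWL_API_KEY", "OPENAI_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing API keys: " + ", ".join(missing))
    else:
        print("All API keys are set.")
```

Passing the environment in as an argument keeps the check easy to test with a plain dictionary.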
Usage
Run the script:
python o4-mini-web-crawler.py
You will be prompted to:
- Enter a website URL to crawl
- Define your objective (what information you’re looking for)
The crawler will then:
- Map the website to find relevant URLs
- Rank the most relevant pages
- Scrape and analyze the content
- Return structured data if the objective is met
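The four steps above can be sketched as a single loop. This is not the script's actual code: the function crawl_for_objective, its parameters, and the top_n cutoff are hypothetical, and the Firecrawl and o4-mini calls are replaced by plain callables so the control flow can be read and run without API keys.

```python
from typing import Callable, Dict, List, Optional


def crawl_for_objective(
    url: str,
    objective: str,
    map_site: Callable[[str], List[str]],
    rank_urls: Callable[[List[str], str], List[str]],
    scrape: Callable[[str], str],
    analyze: Callable[[str, str], Optional[Dict]],
    top_n: int = 3,  # assumed cutoff, not taken from the original script
) -> Optional[Dict]:
    """Run map -> rank -> scrape -> analyze and return the first structured hit."""
    candidates = map_site(url)                         # step 1: discover URLs
    ranked = rank_urls(candidates, objective)[:top_n]  # step 2: keep the best few
    for page_url in ranked:
        content = scrape(page_url)                     # step 3: fetch page content
        result = analyze(content, objective)           # step 3: ask the model
        if result is not None:                         # step 4: objective met
            return {"source": page_url, "data": result}
    return None                                        # objective not met


# Stand-in data and callables, used only to exercise the loop.
pages = {
    "https://example.com/": "Welcome",
    "https://example.com/contact": "HQ: 1 Example Street",
}
result = crawl_for_objective(
    "https://example.com",
    "Find the company's headquarters address",
    map_site=lambda site: list(pages),
    rank_urls=lambda urls, obj: sorted(urls, key=lambda u: "contact" not in u),
    scrape=lambda u: pages[u],
    analyze=lambda text, obj: {"address": text[4:]} if text.startswith("HQ:") else None,
)
print(result)
```

In a real run, map_site and scrape would wrap Firecrawl calls, while rank_urls and analyze would wrap o4-mini prompts.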
Example
Enter the website to crawl: https://example.com
Enter your objective: Find the company's headquarters address
The crawler will search for pages likely to contain this information, analyze them, and return the address in a structured format.
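One way to turn a model reply into structured data is to ask for JSON and treat anything else as "objective not met". The sketch below assumes that convention; the function parse_model_reply and the field name headquarters_address are hypothetical and may not match the script's real output.

```python
import json
from typing import Optional


def parse_model_reply(reply: str) -> Optional[dict]:
    """Return the reply as a non-empty dict, or None if it is not a JSON object."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None  # free-text reply: treat as objective not met
    # Accept only JSON objects that actually carry at least one field.
    return data if isinstance(data, dict) and data else None


# A JSON reply yields structured data; a prose reply yields None.
found = parse_model_reply('{"headquarters_address": "1 Example Street"}')
not_found = parse_model_reply("Objective not met")
print(found)
print(not_found)
```

Returning None for unusable replies lets the caller move on to the next ranked page.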
License