
Scraping job boards to extract structured data can be a complex task, especially when dealing with dynamic websites and unstructured content. In this guide, we’ll walk through how to use Firecrawl Actions and OpenAI models to efficiently scrape job listings and extract valuable information.
Why Use Firecrawl and OpenAI?
- Firecrawl simplifies web scraping by handling dynamic content and providing actions like clicking and scrolling.
- OpenAI’s o1 and 4o models excel at understanding and extracting structured data from unstructured text. o1 is better suited to complex reasoning tasks, while 4o is the better choice for speed and cost.
Prerequisites
pip install requests python-dotenv openai
Step 1: Set Up Your Environment
Create a .env file in your project directory and add your API keys:
FIRECRAWL_API_KEY=your_firecrawl_api_key
OPENAI_API_KEY=your_openai_api_key
Step 2: Initialize API Clients
import os
import requests
import json
from dotenv import load_dotenv
import openai
# Load environment variables
load_dotenv()
# Initialize API keys
firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
openai.api_key = os.getenv("OPENAI_API_KEY")
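If a key is missing from the .env file, os.getenv returns None and the problem only shows up later as an authentication error. A minimal sketch of an early check follows; require_env is a hypothetical helper name for this guide, not part of either SDK.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a clear message.

    Hypothetical helper for this guide; it is not a Firecrawl or OpenAI function.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable: {name}. Check your .env file."
        )
    return value
```

It could be used in place of the plain lookups, for example firecrawl_api_key = require_env("FIRECRAWL_API_KEY").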
Step 3: Define the Jobs Page URL and Resume
Specify the URL of the jobs page you want to scrape and provide your resume for matching.
# URL of the jobs page to scrape
jobs_page_url = "https://openai.com/careers/search"
# Candidate's resume (as a string)
resume_paste = """
[Your resume content here]
"""
Step 4: Scrape the Jobs Page Using Firecrawl
We use Firecrawl to scrape the jobs page and return its content as markdown.
# Defaults used when the scrape fails
html_content = ""
prompt = ""
try:
    response = requests.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {firecrawl_api_key}"
        },
        json={
            "url": jobs_page_url,
            "formats": ["markdown"]
        }
    )
    if response.status_code == 200:
        result = response.json()
        if result.get('success'):
            # Despite its name, html_content holds the page as markdown
            html_content = result['data']['markdown']
            # Prepare the prompt for OpenAI (content truncated to 100,000 characters)
            prompt = f"""
            Extract up to 30 job application links from the given markdown content.
            Return the result as a JSON object with a single key 'apply_links' containing an array of strings (the links).
            The output should be a valid JSON object, with no additional text.
            Markdown content:
            {html_content[:100000]}
            """
        else:
            print("Firecrawl reported an unsuccessful scrape.")
    else:
        print(f"Firecrawl request failed with status {response.status_code}.")
except Exception as e:
    print(f"Error while scraping the jobs page: {e}")
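The nested success checks above can be collapsed into one small function. This is a sketch that assumes the response shape used in this guide (a success flag and the page under data.markdown); markdown_from_response is a hypothetical name, not a Firecrawl function.

```python
def markdown_from_response(result: object) -> str:
    """Return the markdown from a parsed scrape response, or "" if unavailable.

    Assumes the shape used in this guide: {"success": bool, "data": {"markdown": str}}.
    """
    if not isinstance(result, dict) or not result.get("success"):
        return ""
    data = result.get("data")
    if not isinstance(data, dict):
        return ""
    markdown = data.get("markdown")
    return markdown if isinstance(markdown, str) else ""
```

Returning an empty string for every failure case keeps the later if html_content: check as the single place where the pipeline decides whether to continue.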
Step 5: Extract Apply Links Using OpenAI’s gpt-4o Model
We use OpenAI’s gpt-4o model to parse the scraped content and extract application links.
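# Extract apply links using OpenAI
apply_links = []
if html_content:
    try:
        # openai>=1.0 replaced openai.ChatCompletion.create with this call
        completion = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            # Request a bare JSON object so json.loads does not fail on extra text
            response_format={"type": "json_object"}
        )
        if completion.choices:
            result = json.loads(completion.choices[0].message.content.strip())
            apply_links = result['apply_links']
    except Exception as e:
        print(f"Error while extracting apply links: {e}")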
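Model replies can arrive wrapped in a Markdown code fence, or contain duplicates and entries that are not URLs. A minimal sketch of a more defensive parser follows; parse_apply_links and FENCE are hypothetical names, and the filtering rules (http/https only, first occurrence wins) are assumptions rather than anything the APIs require.

```python
import json

# Three backticks, built programmatically so this example stays easy to embed
FENCE = "`" * 3


def parse_apply_links(raw: str, limit: int = 30) -> list[str]:
    """Parse a model reply into a de-duplicated list of http(s) links."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag)
        lines = text.splitlines()[1:]
        # Drop the closing fence line if present
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return []
    links = payload.get("apply_links") if isinstance(payload, dict) else None
    if not isinstance(links, list):
        return []
    seen = set()
    cleaned = []
    for link in links:
        is_url = isinstance(link, str) and link.startswith(("http://", "https://"))
        if is_url and link not in seen:
            seen.add(link)
            cleaned.append(link)
    return cleaned[:limit]
```

With this helper, the assignment in the step above could become apply_links = parse_apply_links(completion.choices[0].message.content).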
Step 6: Extract Job Details from Each Apply Link
We iterate over each apply link and use Firecrawl’s extraction capabilities to get job details.
# Initialize a list to store job data
extracted_data = []
# Define the extraction schema
schema = {
    "type": "object",
    "properties": {
        "job_title": {"type": "string"},
        "sub_division_of_organization": {"type": "string"},
        "key_skills": {"type": "array", "items": {"type": "string"}},
        "compensation": {"type": "string"},
        "location": {"type": "string"},
        "apply_link": {"type": "string"}
    },
    "required": ["job_title", "sub_division_of_organization", "key_skills", "compensation", "location", "apply_link"]
}
# Extract job details for each link
for link in apply_links:
    try:
        response = requests.post(
            "https://api.firecrawl.dev/v1/scrape",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {firecrawl_api_key}"
            },
            json={
                "url": link,
                "formats": ["extract"],
                # Click the job overview element before extracting
                "actions": [{
                    "type": "click",
                    "selector": "#job-overview"
                }],
                "extract": {
                    "schema": schema
                }
            }
        )
        if response.status_code == 200:
            result = response.json()
            if result.get('success'):
                extracted_data.append(result['data']['extract'])
    except Exception as e:
        # Report the failure and continue with the next link
        print(f"Error while extracting job details from {link}: {e}")
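A listing may omit a field such as compensation, and the extraction may then return an empty value even though the schema marks the field as required. The sketch below flags such records before they reach the matching step; missing_required_fields is a hypothetical helper, and treating None, empty strings, and empty lists as missing is an assumption.

```python
REQUIRED_FIELDS = [
    "job_title",
    "sub_division_of_organization",
    "key_skills",
    "compensation",
    "location",
    "apply_link",
]


def missing_required_fields(record: dict, required: list[str] = REQUIRED_FIELDS) -> list[str]:
    """Return the required field names that are absent or empty in a job record."""
    return [name for name in required if record.get(name) in (None, "", [])]


# Example: a record whose compensation came back empty
sample = {
    "job_title": "Research Engineer",
    "sub_division_of_organization": "Research",
    "key_skills": ["Python"],
    "compensation": "",
    "location": "Remote",
    "apply_link": "https://example.com/apply",
}
print(missing_required_fields(sample))
```

Whether to drop incomplete records or keep them with gaps is a design choice; keeping them preserves roles that simply do not publish a salary.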
Step 7: Match Jobs to Your Resume Using OpenAI’s o1 Model
We use OpenAI’s o1 model to analyze your resume and recommend the top 3 job listings.
# Prepare the prompt
# Literal braces in the JSON example are doubled so the f-string does not treat them as placeholders
prompt = f"""
Please analyze the resume and job listings, and return a JSON list of the top 3 roles that best fit the candidate's experience and skills. Include only the job title, compensation, and apply link for each recommended role. The output should be a valid JSON array of objects in the following format:
[
  {{
    "job_title": "Job Title",
    "compensation": "Compensation",
    "apply_link": "Application URL"
  }},
  ...
]
Based on the following resume:
{resume_paste}
And the following job listings:
{json.dumps(extracted_data, indent=2)}
"""
# Get recommendations from OpenAI
# openai>=1.0 replaced openai.ChatCompletion.create with this call
completion = openai.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)
# Extract recommended jobs
recommended_jobs = json.loads(completion.choices[0].message.content.strip())
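The reply is free text, so it may not parse as JSON, and a recommended apply link may not match any listing that was actually scraped. A minimal sketch of a validation step follows; select_recommendations is a hypothetical helper, and discarding roles with unknown links is an assumption about what you want, not a requirement.

```python
import json


def select_recommendations(raw: str, known_jobs: list[dict], top_n: int = 3) -> list[dict]:
    """Parse the model reply and keep only roles whose apply link was scraped."""
    try:
        parsed = json.loads(raw.strip())
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Links that came from the extraction step
    known_links = {job["apply_link"] for job in known_jobs if job.get("apply_link")}
    valid = [
        role
        for role in parsed
        if isinstance(role, dict) and role.get("apply_link") in known_links
    ]
    return valid[:top_n]
```

It could replace the final json.loads call, for example recommended_jobs = select_recommendations(completion.choices[0].message.content, extracted_data).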
Step 8: Output the Recommended Jobs
Finally, we can print or save the recommended jobs.
# Output the recommended jobs
print(json.dumps(recommended_jobs, indent=2))
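To save the results instead of printing them, a short sketch like the following could be used. The save_jobs name and the output path are placeholders chosen for illustration.

```python
import json
from pathlib import Path


def save_jobs(jobs: list[dict], path: str) -> Path:
    """Write the job list to a UTF-8 JSON file and return the file path."""
    out = Path(path)
    # ensure_ascii=False keeps non-ASCII characters readable in the file
    out.write_text(json.dumps(jobs, indent=2, ensure_ascii=False), encoding="utf-8")
    return out
```

For example, save_jobs(recommended_jobs, "recommended_jobs.json") writes the list next to the script.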
Full Code Example on GitHub
You can find the full code example on GitHub.
Conclusion
By following this guide, you’ve learned how to:
- Scrape dynamic job boards using Firecrawl.
- Extract structured data from web pages with custom schemas.
- Leverage OpenAI’s models to parse content and make intelligent recommendations.
This approach can be extended to other websites and data extraction tasks, providing a powerful toolset for automating data collection and analysis.
That’s it! You’ve now built a pipeline to scrape job boards and find the best job matches using Firecrawl and OpenAI. Happy coding!
