Web Page Extractor
v1 · Published
Scrape any web page into structured JSON: title, meta description, headings, body markdown, and all outbound links. Parameter: url.
Output & API
Preview the latest data, download it, or call this collector as an API.
| url | https://developer.mozilla.org/en-US/docs/Web/HTML |
|---|---|
| title | HTML: HyperText Markup Language | MDN |
| headings | |
| bodyMarkdown | - [Skip to main content](https://developer.mozilla.org/en-US/docs/Web/HTML#content) - [Skip to search](https://developer.mozilla.org/en-US/docs/Web/HTML#search) Learn frontend, backend, and AI from our course partner [Scrimba](https://scrimba.com/learn/frontend?via=mdn) # HTML: HyperText Markup Language **HTML** (HyperText Markup Language) is the most basic building block of the Web. It defines the meaning and structure of web content. Other technologies besides HTML are generally used to describe a web page's appearance/presentation ( [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS)) or functionality/behavior ( [JavaScript](https://developer.mozilla.org/en-US/docs/Web/JavaScript)). "Hypertext" refers to links that connect web pages to one another, either within a single website or between websites. Links are a fundamental aspect of the Web. By uploading content to the Internet and linking it to pages created by other people, you become an active participant in the World Wide Web. HTML uses "markup" to annotate text, images, and other content for display in a Web browser. 
HTML markup includes special "elements" such as [`<head>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/head), [`<title>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/title), [`<body>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/body), [`<header>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/header), [`<footer>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/footer), [`<article>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/article), [`<section>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/section), [`<p>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/p), [`<div>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/div), [`<span>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/span), [`<img>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/img), [`<aside>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/aside), [`<audio>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/audio), [`<canvas>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/canvas), [`<datalist>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/datalist), [`<details>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/details), [`<embed>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/embed), [`<nav>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/nav), [`<search>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/search), [`<output>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/output), [`<progress>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/progress), 
[`<video>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/video), [`<ul>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/ul), [`<ol>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/ol), [`<li>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/li) and many others. An HTML element is set off from other text in a document by "tags", which consist of the element name surrounded by `<` and `>`. The name of an element inside a tag is case-insensitive. That is, it can be written in uppercase, lowercase, or a mixture. For example, the `<title>` tag can be written as `<Title>`, `<TITLE>`, or in any other way. However, the convention and recommended practice is to write tags in lowercase. The articles below can help you learn more about HTML. ## [Beginner's tutorials](https://developer.mozilla.org/en-US/docs/Web/HTML\#beginners_tutorials) Our [learn web development core modules](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core) contain modern, up-to-date tutorials covering HTML fundamentals. [Your first website: Creating the content](https://developer.mozilla.org/en-US/docs/Learn_web_development/Getting_started/Your_first_website/Creating_the_content) This article provides a brief tour of what HTML is and how to use it, aimed at people who are completely new to web development. [Structuring content with HTML](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Structuring_content) This module covers the basics of the HTML language, before looking at key areas such as document structure, links, lists, images, forms, and more. [HTML forms](https://developer.mozilla.org/en-US/docs/Learn_web_development/Extensions/Forms) Forms are a very important part of the Web — these provide much of the functionality you need for interacting with websites, e.g., registering and logging in, sending feedback, buying products, and more. 
This module gets you started with creating the client-side/front-end parts of forms. ## [Guides](https://developer.mozilla.org/en-US/docs/Web/HTML\#guides) The [HTML guides](https://developer.mozilla.org/en-US/docs/Web/HTML/Guides) help you build with HTML on the web. They cover topics such as forms, CORS, content preloading, and responsive images. [HTML cheatsheet for syntax and common tasks](https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Cheatsheet) Quick reference for common HTML syntax and tasks. [Using HTML comments `<!-- … -->`](https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Comments) HTML comments are used to add explanatory notes to the markup or to prevent the browser from interpreting specific parts of the document. [Using HTML form validation and the Constraint Validation API](https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Constraint_validation) HTML5 introduced constraint validation to ease form validation on the client side. Basic constraints can be checked without JavaScript by setting attributes on form elements. [Content categories](https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Content_categories) HTML is comprised of several kinds of content, each of which is allowed to be used in certain contexts and is disallowed in others. Similarly, each context has a set of other content categories it can contain and elements that can or can't be used in them. This is a guide to these categories. [Using date and time formats in HTML](https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Date_and_time_formats) Certain HTML elements use date and/or time values. This guide describes the formats of the strings that specify these values. [Using microdata in HTML](https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Microdata) Microdata is used to nest metadata within existing content on web pages. Search engines and web crawlers can extract and process microdata to provide a richer browsing experience. 
[Using microformats in HTML](https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Microformats) Microformats are standards used to embed semantics and structured data in HTML for use by social web applications, search engines, aggregators, and other tools. [Understanding quirks and standards modes](https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Quirks_mode_and_standards_mode) Historical information on quirks mode and standards mode. [Using responsive images in HTML](https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Responsive_images) Learn about responsive images that work well on devices with widely differing screen sizes, resolutions, and other features, improving performance across different devices. [Media types and formats on the web](https://developer.mozilla.org/en-US/docs/Web/Media/Guides/Formats) The [`<audio>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/audio) and [`<video>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/video) elements allow you to play audio and video media natively within your content without the need for external software support. ## [How to](https://developer.mozilla.org/en-US/docs/Web/HTML\#how_to) [Define terms with HTML](https://developer.mozilla.org/en-US/docs/Web/HTML/How_to/Define_terms_with_HTML) HTML provides several ways to convey description semantics, whether inline or as structured glossaries. This article shows how to properly mark up keywords when defining them. [Use data attributes](https://developer.mozilla.org/en-US/docs/Web/HTML/How_to/Use_data_attributes) HTML5 is designed with extensibility in mind for data that should be associated with a particular element but need not have any defined meaning. `data-*` attributes allow us to store extra information on standard, semantic HTML elements. 
[Use cross-origin images in a canvas](https://developer.mozilla.org/en-US/docs/Web/HTML/How_to/CORS_enabled_image) Some HTML elements that provide support for [CORS](https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/CORS), such as [`<img>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/img) or [`<video>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/video), have a `crossorigin` attribute (`crossOrigin` property), which lets you configure the CORS requests for the element's fetched data. [Add a hitmap on top of an image](https://developer.mozilla.org/en-US/docs/Web/HTML/How_to/Add_a_hit_map_on_top_of_an_image) Image maps allow hyperlinks to be associated with different parts of an image. This article shows how to create and implement them. [Author fast-loading HTML pages](https://developer.mozilla.org/en-US/docs/Web/HTML/How_to/Author_fast-loading_HTML_pages) These tips are based on common knowledge and experimentation. An optimized web page not only provides for a more responsive site for your visitors but also reduces the load on your web servers and internet connection. [Add JavaScript to your web page](https://developer.mozilla.org/en-US/docs/Web/HTML/How_to/Add_JavaScript_to_your_web_page) This article explains how to add JavaScript code to an HTML file. ## [Reference](https://developer.mozilla.org/en-US/docs/Web/HTML\#reference) HTML consists of **elements**, each of which may be modified by some number of **attributes**. HTML documents are connected to each other with **links**. Browse the complete [HTML reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference) documentation. [HTML elements](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements) Reference for all HTML [elements](https://developer.mozilla.org/en-US/docs/Glossary/Element). [HTML attributes](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Attributes) Reference for all HTML attributes. 
Attributes are additional values that configure elements or adjust their behavior in various ways. [Global attributes](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Global_attributes) Reference for global attributes that may be specified on all HTML elements, _even those not specified in the standard_. This means that any non-standard elements must still permit these attributes, even though those elements make the document HTML5-noncompliant. ### [Attributes by element](https://developer.mozilla.org/en-US/docs/Web/HTML\#attributes_by_element) [Input types](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/input) Used to create interactive controls for web-based forms. [Script types](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/script/type) Indicates the type of script represented by the element. [meta name](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/meta/name) Provides metadata in name-value pairs for the whole page. ### [Attribute values](https://developer.mozilla.org/en-US/docs/Web/HTML\#attribute_values) [rel keywords](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Attributes/rel) Defines the relationship between a linked resource and the current document. ## [Related topics](https://developer.mozilla.org/en-US/docs/Web/HTML\#related_topics) [Applying color to HTML elements using CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/Guides/Colors/Applying_color) This article covers most of the ways you use CSS to add color to HTML content, listing what parts of HTML documents can be colored and what CSS properties to use when doing so. ## Help improve MDN Was this page helpful to you? YesNo [Learn how to contribute](https://developer.mozilla.org/en-US/docs/MDN/Community/Getting_started) This page was last modified on Dec 22, 2025 by [MDN contributors](https://developer.mozilla.org/en-US/docs/Web/HTML/contributors.txt). 
[View this page on GitHub](https://github.com/mdn/content/blob/main/files/en-us/web/html/index.md?plain=1 "Folder: en-us/web/html (Opens in a new tab)") • [Report a problem with this content](https://github.com/mdn/content/issues/new?template=page-report.yml&mdn-url=https%3A%2F%2Fdeveloper.mozilla.org%2Fen-US%2Fdocs%2FWeb%2FHTML&metadata=%3C%21--+Do+not+make+changes+below+this+line+--%3E%0A%3Cdetails%3E%0A%3Csummary%3EPage+report+details%3C%2Fsummary%3E%0A%0A*+Folder%3A+%60en-us%2Fweb%2Fhtml%60%0A*+MDN+URL%3A+https%3A%2F%2Fdeveloper.mozilla.org%2Fen-US%2Fdocs%2FWeb%2FHTML%0A*+GitHub+URL%3A+https%3A%2F%2Fgithub.com%2Fmdn%2Fcontent%2Fblob%2Fmain%2Ffiles%2Fen-us%2Fweb%2Fhtml%2Findex.md%0A*+Last+commit%3A+https%3A%2F%2Fgithub.com%2Fmdn%2Fcontent%2Fcommit%2Fd1f3f179175c80c18b1b78ba0df0ea7d15ca32cc%0A*+Document+last+modified%3A+2025-12-22T01%3A06%3A28.000Z%0A%0A%3C%2Fdetails%3E "This will take you to GitHub to file a new issue.") |
| outboundLinks | |
| metaDescription | HTML (HyperText Markup Language) is the most basic building block of the Web. It defines the meaning and structure of web content. Other technologies besides HTML are generally used to describe a web page's appearance/presentation (CSS) or functionality/behavior (JavaScript). |
| outboundLinkCount | 18 |
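
A consumer of this record may want to confirm its shape before relying on it. The sketch below assumes the downloaded JSON follows the `PageData` interface defined in the script further down; `isPageData` is a hypothetical helper name, not part of the collector, and the check that `outboundLinkCount` equals the array length is an assumption drawn from how the script computes that field.

```typescript
// Shape of one extracted record, mirroring the fields shown in the preview.
interface Heading { level: number; text: string; }
interface OutboundLink { url: string; text: string; }
interface PageData {
  url: string;
  title: string | null;
  metaDescription: string | null;
  headings: Heading[];
  bodyMarkdown: string;
  outboundLinks: OutboundLink[];
  outboundLinkCount: number;
}

// Hypothetical helper: runtime check that parsed JSON looks like a PageData record.
function isPageData(v: unknown): v is PageData {
  if (typeof v !== "object" || v === null) return false;
  const r = v as Record<string, unknown>;
  const nullableString = (x: unknown): boolean => x === null || typeof x === "string";
  const links = r.outboundLinks;
  return (
    typeof r.url === "string" &&
    nullableString(r.title) &&
    nullableString(r.metaDescription) &&
    Array.isArray(r.headings) &&
    typeof r.bodyMarkdown === "string" &&
    Array.isArray(links) &&
    // The script sets the count from the array length, so the two should agree.
    r.outboundLinkCount === links.length
  );
}
```

This only checks top-level fields; validating each heading and link entry would need a further pass.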
Parameters
--url · string · required
The web page URL to scrape (must be an http or https URL), e.g. "https://developer.mozilla.org/en-US/docs/Web/HTML"
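
The http/https rule for this parameter can be checked before a run is started. The following is a minimal sketch of that check using the standard `URL` class; `parseTargetUrl` is a hypothetical name, and the error messages are illustrative rather than the collector's exact output.

```typescript
// Hypothetical helper mirroring the parameter rule: accept only absolute http(s) URLs.
function parseTargetUrl(raw: string): URL {
  const trimmed = raw.trim();
  if (trimmed.length === 0) {
    throw new Error("Missing required parameter: url");
  }
  let target: URL;
  try {
    // The URL constructor throws on strings that are not absolute URLs.
    target = new URL(trimmed);
  } catch {
    throw new Error(`not a valid URL: ${raw}`);
  }
  if (target.protocol !== "http:" && target.protocol !== "https:") {
    throw new Error(`URL must use http or https: ${raw}`);
  }
  return target;
}
```

Surrounding whitespace is tolerated, but a scheme such as `ftp:` or a bare path is rejected.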
Marketplace
Publish this collector so others can deploy it — you keep ownership.
0 runs in 14d · published 5h ago
Versions
Every build and self-heal appends a version. Pin one to lock runs to it.
v1 · built · approved · current · 5h ago
How this script collects data
import Firecrawl from "@mendable/firecrawl-js";
import * as cheerio from "cheerio";
import { parseArgs } from "node:util";
/**
 * Generic web-page structurer.
 *
 * Given any web page URL, this scrapes the page once and returns a single
 * normalized record describing it:
 *
 *   {
 *     url,               // the page that was scraped (final/source URL)
 *     title,             // <title> / document title
 *     metaDescription,   // <meta name="description"> content
 *     headings,          // [{ level, text }] for every h1..h6 on the page
 *     bodyMarkdown,      // the main body content rendered as markdown
 *     outboundLinks,     // [{ url, text }] for every link to another host
 *     outboundLinkCount,
 *   }
 *
 * Strategy: one Firecrawl scrape requesting markdown + rawHtml + links.
 *   - markdown -> the page's main body content (clean, main-content only).
 *   - rawHtml  -> the full, unmodified page HTML. Parsed deterministically
 *                 with cheerio for the <title>, the meta description, every
 *                 heading, and every <a href>. The processed "html" format is
 *                 stripped of <head>, so rawHtml is required to recover the
 *                 title and meta description.
 *   - metadata -> used as a fallback for title / description when the markup
 *                 omits them.
 *
 * "Outbound" links are those whose host differs from the page's host (this
 * treats different subdomains as outbound). Relative, fragment-only, mailto:,
 * tel:, javascript:, and data: links are excluded.
 */
interface Heading {
  level: number; // 1..6
  text: string;
}

interface OutboundLink {
  url: string;
  text: string;
}

interface PageData {
  url: string;
  title: string | null;
  metaDescription: string | null;
  headings: Heading[];
  bodyMarkdown: string;
  outboundLinks: OutboundLink[];
  outboundLinkCount: number;
}

function cleanText(v: string | undefined | null): string {
  if (!v) return "";
  return v.replace(/\s+/g, " ").trim();
}

function nonEmptyOrNull(v: string | undefined | null): string | null {
  const t = cleanText(v);
  return t.length > 0 ? t : null;
}
async function main(): Promise<void> {
  const { values } = parseArgs({
    strict: true,
    options: {
      url: { type: "string" },
    },
  });

  const rawUrl = values.url;
  if (!rawUrl || rawUrl.trim().length === 0) {
    console.error("Missing required parameter: --url=<page URL>");
    process.exit(1);
  }

  let target: URL;
  try {
    target = new URL(rawUrl.trim());
  } catch {
    throw new Error(`OUT_OF_SCOPE: not a valid URL: ${rawUrl}`);
  }
  if (target.protocol !== "http:" && target.protocol !== "https:") {
    throw new Error(`OUT_OF_SCOPE: URL must use http or https: ${rawUrl}`);
  }

  const apiKey = process.env.FIRECRAWL_API_KEY;
  if (!apiKey) {
    console.error("FIRECRAWL_API_KEY environment variable is not set");
    process.exit(1);
  }

  const firecrawl = new Firecrawl({ apiKey });
  console.error(`Scraping ${target.toString()}`);

  const res = (await firecrawl.scrape(target.toString(), {
    formats: ["markdown", "rawHtml", "links"],
    integration: "prometheus",
  })) as {
    markdown?: string;
    rawHtml?: string;
    html?: string;
    links?: string[];
    metadata?: Record<string, unknown>;
  };

  const html = res.rawHtml ?? res.html ?? "";
  if (!html) {
    throw new Error(
      `no HTML returned for ${target.toString()} (page may have been bot-blocked or rendered empty)`,
    );
  }

  const metadata = res.metadata ?? {};

  // Firecrawl reports the final/resolved URL; fall back to the requested URL.
  const baseUrlStr =
    nonEmptyOrNull(metadata.sourceURL as string) ??
    nonEmptyOrNull(metadata.url as string) ??
    target.toString();
  let baseUrl: URL;
  try {
    baseUrl = new URL(baseUrlStr);
  } catch {
    baseUrl = target;
  }

  const $ = cheerio.load(html);

  // Title: prefer the document <title>, then Firecrawl/OpenGraph metadata.
  const title =
    nonEmptyOrNull($("title").first().text()) ??
    nonEmptyOrNull(metadata.title as string) ??
    nonEmptyOrNull(metadata.ogTitle as string);

  // Meta description: the <meta name="description"> tag, then OpenGraph/metadata.
  const metaDescription =
    nonEmptyOrNull($('meta[name="description"]').attr("content")) ??
    nonEmptyOrNull($('meta[property="og:description"]').attr("content")) ??
    nonEmptyOrNull(metadata.description as string) ??
    nonEmptyOrNull(metadata.ogDescription as string);

  // Every heading, in document order.
  const headings: Heading[] = [];
  $("h1, h2, h3, h4, h5, h6").each((_, el) => {
    const tag = (el as { tagName?: string }).tagName ?? "";
    const level = Number(tag.slice(1));
    const text = cleanText($(el).text());
    if (Number.isFinite(level) && level >= 1 && level <= 6 && text.length > 0) {
      headings.push({ level, text });
    }
  });

  // Every outbound link (different host), with its anchor text.
  const outboundLinks: OutboundLink[] = [];
  const seen = new Set<string>();
  $("a[href]").each((_, el) => {
    const href = ($(el).attr("href") ?? "").trim();
    if (!href || href.startsWith("#")) return;
    const lower = href.toLowerCase();
    if (
      lower.startsWith("mailto:") ||
      lower.startsWith("tel:") ||
      lower.startsWith("javascript:") ||
      lower.startsWith("data:")
    ) {
      return;
    }

    let resolved: URL;
    try {
      resolved = new URL(href, baseUrl);
    } catch {
      return;
    }
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") return;

    // Outbound = a host different from the page's host.
    if (resolved.host === baseUrl.host) return;

    // Anchor text, falling back to common labels for icon/image links.
    const text =
      cleanText($(el).text()) ||
      cleanText($(el).attr("aria-label")) ||
      cleanText($(el).attr("title")) ||
      cleanText($(el).find("img[alt]").first().attr("alt"));

    const url = resolved.toString();
    const key = `${url}\n${text}`;
    if (seen.has(key)) return;
    seen.add(key);
    outboundLinks.push({ url, text });
  });

  const out: PageData = {
    url: baseUrl.toString(),
    title,
    metaDescription,
    headings,
    bodyMarkdown: res.markdown ?? "",
    outboundLinks,
    outboundLinkCount: outboundLinks.length,
  };

  console.error(
    `Extracted: title=${out.title ? "yes" : "no"}, ${headings.length} headings, ` +
      `${outboundLinks.length} outbound links, ${out.bodyMarkdown.length} markdown chars`,
  );
  process.stdout.write(JSON.stringify(out));
}

main().catch((err) => {
  console.error(err instanceof Error ? err.message : String(err));
  process.exit(1);
});
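
The outbound-link rule in the script (a link counts as outbound when its resolved host differs from the page's host) can be illustrated in isolation. The sketch below is a simplified stand-in, not the script's own code: `resolveOutbound` is a hypothetical helper, and it relies on the protocol check alone to drop `mailto:`, `tel:`, `javascript:`, and `data:` links instead of the explicit prefix tests used above.

```typescript
// Hypothetical helper: classify one href relative to the page it appeared on.
// Returns the absolute URL when the link is outbound, otherwise null.
function resolveOutbound(href: string, pageUrl: string): string | null {
  const trimmed = href.trim();
  // Empty and fragment-only hrefs never leave the page.
  if (!trimmed || trimmed.startsWith("#")) return null;

  const base = new URL(pageUrl);
  let resolved: URL;
  try {
    // Relative hrefs are resolved against the page URL.
    resolved = new URL(trimmed, base);
  } catch {
    return null;
  }
  // Non-web schemes (mailto:, tel:, javascript:, data:) fail this check.
  if (resolved.protocol !== "http:" && resolved.protocol !== "https:") return null;

  // Same host (including port) means an internal link; subdomains count as outbound.
  return resolved.host === base.host ? null : resolved.toString();
}

const page = "https://developer.mozilla.org/en-US/docs/Web/HTML";
console.log(resolveOutbound("https://scrimba.com/learn/frontend?via=mdn", page));
console.log(resolveOutbound("/en-US/docs/Web/CSS", page));
```

Under these assumptions the first call yields the Scrimba URL and the second yields `null`, because the relative path resolves to the page's own host.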
Deploy this collector to unlock schedules, the API endpoint, and destinations.