
The Structured extraction schema field lets you define a JSON schema so FireScraper extracts specific, typed fields from every page it crawls — instead of returning raw text.

When to use a schema

Use structured extraction when you need specific data points from pages, not just their text content. Common use cases:

Price monitoring — extract product name and price from competitor pages

Job board scraping — extract title, company, location, salary from listings

Directory scraping — extract business name, address, phone number from listings

Content auditing — extract title, author, publish date from articles

How it works

You provide a JSON schema describing the fields you want

FireScraper crawls your pages and extracts text as usual

After crawling, it processes each page against your schema

For each field, it tries to find a matching value using direct mapping, label matching, and type coercion

Results are saved as a separate corpus-extracted.json file you can download

Schema format

The schema must be a valid JSON object with a properties field. Each property has a type.

Supported types

| Type | Description | Example value |
|------|-------------|---------------|
| string | Text value | "Widget Pro" |
| number | Numeric value | 49.99 |
| boolean | True or false | true |
| array | List of values | ["red", "blue"] |
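Before pasting a schema into the field, it can help to check it locally against the two rules above: a JSON object with a properties field, and one of the four supported types per property. The following is a minimal sketch of such a check; `validate_schema` and `SUPPORTED_TYPES` are hypothetical names for illustration and are not part of FireScraper.

```python
import json

# The four types listed in the table above.
SUPPORTED_TYPES = {"string", "number", "boolean", "array"}


def validate_schema(schema_text: str) -> list[str]:
    """Return a list of problems; an empty list means the schema looks usable."""
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    if not isinstance(schema, dict):
        return ["schema must be a JSON object"]

    properties = schema.get("properties")
    if not isinstance(properties, dict):
        return ['schema needs a "properties" object']

    problems = []
    for name, spec in properties.items():
        field_type = spec.get("type") if isinstance(spec, dict) else None
        # Guard against non-string types (for example a list) before the set lookup.
        if not isinstance(field_type, str) or field_type not in SUPPORTED_TYPES:
            problems.append(
                f'field "{name}" has an unsupported or missing type: {field_type!r}'
            )
    return problems
```

An empty result only means the schema has the expected shape; it says nothing about whether the fields will actually be found on your pages.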

Example schemas

Product pricing

json

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "price": { "type": "number" },
    "description": { "type": "string" }
  }
}

Output per page:

json

{
  "url": "https://store.example.com/widget-pro",
  "extracted": {
    "title": "Widget Pro",
    "price": 49.99,
    "description": "Professional-grade widget with 3-year warranty"
  }
}

Job listings

json

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "company": { "type": "string" },
    "location": { "type": "string" },
    "salary": { "type": "string" },
    "remote": { "type": "boolean" }
  }
}

Blog articles

json

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "author": { "type": "string" },
    "tags": { "type": "array" }
  }
}

Array values are split from text using commas, semicolons, or pipe characters. For example, "AI, Machine Learning, NLP" becomes ["AI", "Machine Learning", "NLP"].
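To make the splitting rule concrete, here is a small sketch of how such a split could be written. It assumes that surrounding whitespace is trimmed and empty items are dropped, which the description above does not state explicitly; `split_list` is a hypothetical helper, not a FireScraper function.

```python
import re


def split_list(text: str) -> list[str]:
    """Split text on commas, semicolons, or pipes and trim each item."""
    parts = re.split(r"[,;|]", text)
    # Assumption: blank items (e.g. from a trailing comma) are discarded.
    return [part.strip() for part in parts if part.strip()]
```

One consequence worth keeping in mind: a value that legitimately contains a comma, such as "Austin, TX", would be split into two items under this rule, so string may be the better type for such fields.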

How extraction works internally

FireScraper uses three strategies to match your schema fields to page content:

Direct mapping — Known fields like title map to the page's HTML <title> tag, and url maps to the page URL

Label matching — Searches the page text for patterns like "Price: $50" or "Author: Jane Smith" using the field name as a label

Type coercion — Converts extracted text to the right type (e.g. strips $ and converts "49.99" to the number 49.99)
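The three strategies can be approximated in a few lines of Python. This is an illustrative sketch only, not FireScraper's actual implementation: the function names, the shape of the `page` dictionary (`url`, `title`, `text`), the one-label-per-line pattern, and the accepted boolean words are all assumptions made for the example.

```python
import re


def direct_value(field, page):
    """Direct mapping: a few known field names come straight from page metadata."""
    known = {"title": page.get("title"), "url": page.get("url")}
    return known.get(field)


def label_value(field, text):
    """Label matching: find a 'Field: value' line, ignoring case."""
    pattern = rf"^\s*{re.escape(field)}\s*:\s*(.+?)\s*$"
    match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
    return match.group(1) if match else None


def coerce(raw, field_type):
    """Type coercion: convert the matched text to the type declared in the schema."""
    if field_type == "number":
        # Drop thousands separators, then take the first numeric run ("$49.99" -> 49.99).
        match = re.search(r"-?\d+(?:\.\d+)?", raw.replace(",", ""))
        return float(match.group()) if match else None
    if field_type == "boolean":
        lowered = raw.strip().lower()
        if lowered in {"true", "yes"}:
            return True
        if lowered in {"false", "no"}:
            return False
        return None
    if field_type == "array":
        return [part.strip() for part in re.split(r"[,;|]", raw) if part.strip()]
    return raw.strip()


def extract(schema, page):
    """Try direct mapping first, then label matching, then coerce the result."""
    result = {}
    for field, spec in schema["properties"].items():
        raw = direct_value(field, page) or label_value(field, page["text"])
        result[field] = coerce(raw, spec["type"]) if raw is not None else None
    return result
```

Run against a page whose text contains the lines "Price: $49.99" and "Description: ...", with the product pricing schema from earlier, this sketch yields a record with the same fields as the sample output shown there. Fields with no match come back as None here; how FireScraper represents a missing field is not specified above.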

Tips

Use the Product catalog template as a starting point — it pre-fills a basic schema with title, price, and description

Name your fields clearly — field names are used as search labels, so price works better than p or field_3

Use string as the default type when unsure — it's the most forgiving

Combine with a CSS selector to narrow down which page content is searched for matches

Use number for prices — FireScraper automatically strips currency symbols and extracts the numeric value

Check your results in the corpus-extracted.json download to verify extraction quality before building a pipeline around it
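One way to act on the last tip is to measure how often each field was actually filled. The sketch below assumes that corpus-extracted.json is a JSON array of per-page records shaped like the sample output above (a `url` plus an `extracted` object); the real file layout may differ, so adjust the loader accordingly. `load_records` and `field_coverage` are hypothetical helper names.

```python
import json
from collections import Counter


def load_records(path: str) -> list[dict]:
    """Load the downloaded file. Assumes a top-level JSON array of page records."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def field_coverage(records: list[dict]) -> dict[str, float]:
    """Fraction of pages on which each field has a non-empty value."""
    filled = Counter()
    seen = set()
    for record in records:
        for field, value in record.get("extracted", {}).items():
            seen.add(field)
            # Treat None, empty strings, and empty lists as "not extracted".
            if value not in (None, "", []):
                filled[field] += 1
    total = len(records)
    if total == 0:
        return {}
    return {field: filled[field] / total for field in sorted(seen)}
```

Calling `field_coverage(load_records("corpus-extracted.json"))` returns a mapping from field name to fill rate. A field with a low rate is a candidate for a clearer name, a narrower CSS selector, or the string type.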