Extracting structured data with schemas

Projects & Crawling · May 20, 2026

The Structured extraction schema field lets you define a JSON schema so FireScraper extracts specific, typed fields from every page it crawls — instead of returning raw text.

When to use a schema

Use structured extraction when you need specific data points from pages, not just their text content. Common use cases:

  • Price monitoring — extract product name and price from competitor pages
  • Job board scraping — extract title, company, location, salary from listings
  • Directory scraping — extract business name, address, phone number from listings
  • Content auditing — extract title, author, publish date from articles

How it works

  • You provide a JSON schema describing the fields you want
  • FireScraper crawls your pages and extracts text as usual
  • After crawling, it processes each page against your schema
  • For each field, it tries to find a matching value using direct mapping, label matching, and type coercion
  • Results are saved as a separate corpus-extracted.json file you can download

Schema format

The schema must be a valid JSON object with a properties field. Each property has a type.

Supported types

| Type | Description | Example value |
|------|-------------|---------------|
| string | Text value | "Widget Pro" |
| number | Numeric value | 49.99 |
| boolean | True or false | true |
| array | List of values | ["red", "blue"] |
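
The format rules above can be checked locally before a schema is pasted into the field. The following is a minimal Python sketch, not FireScraper's own validator: the function name `check_schema` is hypothetical, and it checks only what this article states (a JSON object with a `properties` field, where each property has one of the four supported types).

```python
import json

# The four types listed in the "Supported types" table.
SUPPORTED_TYPES = {"string", "number", "boolean", "array"}


def check_schema(schema_text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    if not isinstance(schema, dict):
        return ["schema must be a JSON object"]

    properties = schema.get("properties")
    if not isinstance(properties, dict):
        return ["schema must have a 'properties' object"]

    problems = []
    for name, spec in properties.items():
        field_type = spec.get("type") if isinstance(spec, dict) else None
        # Anything that is not one of the four supported type names is reported.
        if not isinstance(field_type, str) or field_type not in SUPPORTED_TYPES:
            problems.append(f"{name}: unsupported or missing type {field_type!r}")
    return problems
```

A schema that uses a type outside the table, such as `integer`, would be reported by this sketch. Whether FireScraper itself rejects or ignores such a field is not covered here.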

Example schemas

Product pricing

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "price": { "type": "number" },
    "description": { "type": "string" }
  }
}
```

Output per page:

```json
{
  "url": "https://store.example.com/widget-pro",
  "extracted": {
    "title": "Widget Pro",
    "price": 49.99,
    "description": "Professional-grade widget with 3-year warranty"
  }
}
```

Job listings

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "company": { "type": "string" },
    "location": { "type": "string" },
    "salary": { "type": "string" },
    "remote": { "type": "boolean" }
  }
}
```

Blog articles

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "author": { "type": "string" },
    "tags": { "type": "array" }
  }
}
```

Array values are split from text using commas, semicolons, or pipe characters. For example, "AI, Machine Learning, NLP" becomes ["AI", "Machine Learning", "NLP"].
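
The splitting rule can be approximated in a few lines of Python. This is an illustrative sketch rather than FireScraper's actual code: the helper name `split_array` is hypothetical, and trimming whitespace and dropping empty items are assumptions that match the example above.

```python
import re


def split_array(text: str) -> list[str]:
    """Split on commas, semicolons, or pipes; trim whitespace and drop empty items."""
    parts = re.split(r"[,;|]", text)
    return [part.strip() for part in parts if part.strip()]


print(split_array("AI, Machine Learning, NLP"))
# prints ['AI', 'Machine Learning', 'NLP']
```

One consequence worth keeping in mind: a value that legitimately contains a comma, such as "Austin, TX", would be split into two items under this rule, so a string field may suit such values better.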

How extraction works internally

FireScraper uses three strategies to match your schema fields to page content:

  • Direct mapping — Known fields like title map to the page's HTML <title> tag, and url maps to the page URL
  • Label matching — Searches the page text for patterns like "Price: $50" or "Author: Jane Smith" using the field name as a label
  • Type coercion — Converts extracted text to the right type (e.g. strips $ and converts "49.99" to the number 49.99)
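
To make the three strategies concrete, here is a rough Python sketch of how they could fit together. It is a simplified model under stated assumptions, not FireScraper's implementation: the `page` dictionary shape (`url`, `title`, `text`), the function names `coerce` and `extract`, the "label: value on one line" pattern, and the accepted boolean words are all hypothetical.

```python
import re


def coerce(raw: str, field_type: str):
    """Type coercion: convert matched text to the schema type, or None on failure."""
    raw = raw.strip()
    if field_type == "number":
        # Drop thousands separators, then take the first number, skipping symbols like "$".
        match = re.search(r"-?\d+(?:\.\d+)?", raw.replace(",", ""))
        return float(match.group()) if match else None
    if field_type == "boolean":
        lowered = raw.lower()
        if lowered in {"true", "yes"}:  # assumed set of truthy words
            return True
        if lowered in {"false", "no"}:  # assumed set of falsy words
            return False
        return None
    if field_type == "array":
        return [item.strip() for item in re.split(r"[,;|]", raw) if item.strip()]
    return raw  # "string" and any unrecognised type stay as text


def extract(page: dict, schema: dict) -> dict:
    """Fill each schema field via direct mapping first, then label matching."""
    # Direct mapping: title and url come straight from page metadata.
    direct = {"title": page.get("title"), "url": page.get("url")}
    result = {}
    for name, spec in schema["properties"].items():
        field_type = spec.get("type", "string")
        raw = direct.get(name)
        if raw is None:
            # Label matching: look for "<field name>: <value>" on a single line.
            pattern = rf"(?im)^\s*{re.escape(name)}\s*:\s*(.+)$"
            match = re.search(pattern, page.get("text", ""))
            raw = match.group(1) if match else None
        result[name] = coerce(raw, field_type) if raw is not None else None
    return result


page = {
    "url": "https://store.example.com/widget-pro",
    "title": "Widget Pro",
    "text": "Price: $49.99\nDescription: Professional-grade widget with 3-year warranty",
}
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"},
    },
}
print(extract(page, schema))
```

With this sample page, the sketch produces the same three values as the "Output per page" example. It also shows why field names matter: a field named `p` would find no "p:" label in the text and would come back empty.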

Tips

  • Use the Product catalog template as a starting point — it pre-fills a basic schema with title, price, and description
  • Name your fields clearly — field names are used as search labels, so price works better than p or field_3
  • Use string as the default type when unsure — it's the most forgiving
  • Combine with a CSS selector to narrow down which page content is searched for matches
  • Use number for prices — FireScraper automatically strips currency symbols and extracts the numeric value
  • Check your results in the corpus-extracted.json download to verify extraction quality before building a pipeline around it
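
For the last tip, a quick way to judge extraction quality is to measure how often each field was actually filled. The Python sketch below assumes that corpus-extracted.json is a JSON array of per-page objects shaped like the "Output per page" example (`url` plus `extracted`). The top-level array is an assumption, and the function names `load_records` and `fill_rates` are hypothetical.

```python
import json
from collections import Counter


def load_records(path: str) -> list[dict]:
    """Load the downloaded file; assumes a JSON array of per-page objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def fill_rates(records: list[dict]) -> dict[str, float]:
    """Return, per field, the share of pages where a non-empty value was extracted."""
    filled = Counter()
    seen = set()
    for record in records:
        for name, value in record.get("extracted", {}).items():
            seen.add(name)
            # Treat null, empty strings, and empty lists as "not extracted".
            if value not in (None, "", []):
                filled[name] += 1
    total = len(records)
    if total == 0:
        return {}
    return {name: filled[name] / total for name in sorted(seen)}
```

Calling `fill_rates(load_records("corpus-extracted.json"))` returns a mapping from field name to a value between 0 and 1. A field with a low rate is a candidate for a clearer name or a narrower CSS selector.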