Python SDK: Installation, Usage, and LangChain Integration

API & Integrations · May 22, 2026

The FireScraper Python SDK lets you scrape websites and download results directly from Python. It includes sync and async clients, plus a LangChain document loader for RAG pipelines.

PyPI: pypi.org/project/firescraper

---

Installation

```bash
pip install firescraper
```

Requires Python 3.9 or later. The only dependency is httpx.

For LangChain integration, also install:

```bash
pip install langchain-firescraper langchain-core
```

---

Authentication

Every request requires an API key. Create one from the API Keys page in your dashboard. Keys start with fsk_.

```python
from firescraper import FireScraper

client = FireScraper("fsk_your_api_key")
```
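
A key hardcoded in source is easy to leak, so a common pattern is to read it from the environment. The helper below is a minimal sketch of that idea, not part of the SDK: the variable name FIRESCRAPER_API_KEY is a hypothetical choice for this example, and the only validation is the fsk_ prefix mentioned above.

```python
import os


def load_api_key(env=None, var_name="FIRESCRAPER_API_KEY"):
    """Return the API key stored in an environment variable.

    FIRESCRAPER_API_KEY is a hypothetical name chosen for this example;
    the SDK is not documented to read it on its own.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set")
    if not key.startswith("fsk_"):
        raise ValueError(f"{var_name} should start with 'fsk_'")
    return key


# Usage:
# client = FireScraper(load_api_key())
```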

---

Quick Start

```python
from firescraper import FireScraper

client = FireScraper("fsk_your_api_key")

# Start a crawl
session = client.scrape(
    name="Docs crawl",
    urls=["https://docs.example.com/"],
    max_depth=2,
    scraper="article",
)
print(f"Session started: {session.id}")

# Wait for completion with progress updates
result = client.wait_for_completion(
    session.id,
    on_progress=lambda s: print(
        f"  {s.counts.success}/{s.counts.total} pages"
    ),
)
print(f"Done! {result.counts.success} pages scraped")

# Download results
download = client.get_results(session.id, format="json")
with open("results.json", "wb") as f:
    f.write(download.data)
```

---

Async Client

For async frameworks (FastAPI, etc.) or concurrent crawls, use AsyncFireScraper:

```python
import asyncio

from firescraper import AsyncFireScraper


async def main():
    async with AsyncFireScraper("fsk_your_api_key") as client:
        session = await client.scrape(
            name="Async crawl",
            urls=["https://docs.example.com/"],
            max_depth=2,
        )
        result = await client.wait_for_completion(session.id)
        download = await client.get_results(session.id, format="json")
        print(f"Downloaded {len(download.data)} bytes")


asyncio.run(main())
```

All methods are identical to the sync client but return coroutines.
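
For the concurrent-crawl case, one possible approach is to start several sessions and wait on them together with asyncio.gather. The sketch below is an illustration only: crawl_many is a hypothetical helper, the cap of three in-flight crawls is an arbitrary assumption, and it relies on nothing beyond the awaitable scrape() and wait_for_completion() methods shown above.

```python
import asyncio


async def crawl_many(client, jobs, max_concurrent=3):
    """Run several crawls concurrently and return their final sessions.

    client: an AsyncFireScraper, or any object with the same awaitable
            scrape() and wait_for_completion() methods.
    jobs:   dict mapping a session name to its list of seed URLs.
    """
    limit = asyncio.Semaphore(max_concurrent)  # cap on in-flight crawls

    async def run_one(name, urls):
        async with limit:
            session = await client.scrape(name=name, urls=urls)
            return await client.wait_for_completion(session.id)

    # gather() returns results in the same order as `jobs`
    return await asyncio.gather(
        *(run_one(name, urls) for name, urls in jobs.items())
    )


# Usage, inside `async with AsyncFireScraper("fsk_your_api_key") as client:`
# results = await crawl_many(client, {"Docs": ["https://docs.example.com/"]})
```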

---

Available Methods

| Method | Description |
|---|---|
| client.scrape(name, urls, ...) | Start a new crawl session |
| client.get_session(session_id) | Get status, page counts, queue depth |
| client.wait_for_completion(session_id) | Poll until the crawl finishes |
| client.list_results(session_id) | List available export files |
| client.get_results(session_id, format) | Download results (json, csv, markdown, zip, etc.) |
| client.get_partial_results(session_id) | Download mid-crawl results |

---

Scrape Options

The scrape() method accepts these parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| name | string | required | Human-readable session name |
| urls | list[str] | required | Seed URLs to crawl |
| max_depth | int | 1 | Link-hop depth (0 = seed URLs only) |
| scraper | string | "article" | "article" or "full" |
| ignore_urls | list[str] | None | URLs to exclude |
| min_text_length | int | None | Minimum word count to keep a page |
| webhook_url | string | None | URL to notify on completion |
| extraction_schema | dict | None | JSON Schema for structured extraction |
| respect_robots_txt | bool | None | Respect target site's robots.txt |
| content_selector | string | None | CSS selector to restrict extraction |
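
To show how several of these parameters combine, here is a hedged sketch that builds the keyword arguments for a structured-extraction crawl. The parameter names come from the table above; the schema fields, URLs, and selector are invented for illustration, and the exact matching rules for ignore_urls are not covered here.

```python
# JSON Schema describing the fields to extract from each page.
# The field names are illustrative, not part of the SDK.
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["title"],
}

scrape_options = {
    "name": "Product crawl",
    "urls": ["https://shop.example.com/"],
    "max_depth": 1,
    "scraper": "full",
    "ignore_urls": ["https://shop.example.com/cart"],
    "respect_robots_txt": True,
    "content_selector": "main",  # hypothetical CSS selector
    "extraction_schema": product_schema,
}

# With a client created as in the Authentication section:
# session = client.scrape(**scrape_options)
```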

---

Download Formats

Use client.get_results(session_id, format="...") with one of:

| Format | Description |
|---|---|
| json | JSON array of all pages |
| csv | Tabular data with URL, title, text, word count |
| markdown | One Markdown file with all page content |
| zip | ZIP bundle of individual text files |
| documents | JSONL, one JSON object per line (page-level) |
| chunks | JSONL, one JSON object per line (chunk-level) |
| structured | Structured extraction results |
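
The documents and chunks formats are JSONL, so the downloaded bytes can be decoded one line at a time. A minimal sketch, assuming download.data holds UTF-8 encoded bytes as in the Quick Start; parse_jsonl is a hypothetical helper and the keys in the sample records are made up for the example.

```python
import json


def parse_jsonl(data):
    """Decode JSONL bytes into a list of dicts, skipping blank lines."""
    lines = data.decode("utf-8").split("\n")
    return [json.loads(line) for line in lines if line.strip()]


# Sample bytes standing in for download.data; the keys are illustrative.
sample = (
    b'{"url": "https://docs.example.com/", "text": "Hello"}\n'
    b'{"url": "https://docs.example.com/a", "text": "World"}\n'
)
records = parse_jsonl(sample)
print(len(records))  # 2

# With the SDK:
# download = client.get_results(session.id, format="documents")
# records = parse_jsonl(download.data)
```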

---

Error Handling

The SDK raises typed exceptions for API errors:

```python
from firescraper.exceptions import (
    AuthenticationError,  # 401: bad or missing API key
    BadRequestError,      # 400: invalid parameters
    NotFoundError,        # 404: session not found
    RateLimitError,       # 429: too many requests
    ServerError,          # 5xx: server-side issue
    TimeoutError,         # poll or request timeout (shadows the built-in TimeoutError)
)

# client is the FireScraper instance created in the Authentication section
try:
    session = client.scrape(name="Test", urls=["https://example.com"])
except AuthenticationError:
    print("Check your API key")
except RateLimitError:
    print("Too many requests; retry in a moment")
```
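
Rather than printing and giving up on a rate limit, a caller could retry after a growing delay. The helper below is a generic sketch of exponential backoff with jitter, not an SDK feature: call_with_backoff, the five-attempt limit, and the one-second base delay are all assumptions, and whether the API suggests its own retry interval is not covered here.

```python
import random
import time


def call_with_backoff(fn, retry_on, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep):
    """Call fn(), retrying with exponential backoff when it raises retry_on."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 1s, 2s, 4s, ... plus up to 25% random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.25))


# Usage with the SDK (RateLimitError imported from firescraper.exceptions):
# session = call_with_backoff(
#     lambda: client.scrape(name="Test", urls=["https://example.com"]),
#     retry_on=RateLimitError,
# )
```

Retrying only on RateLimitError is deliberate in this sketch: retrying scrape() after a server error could start a second session if the first request actually succeeded.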

---

LangChain Integration

The FireScraperLoader turns any website into LangChain Document objects:

```python
from langchain_firescraper import FireScraperLoader

loader = FireScraperLoader(
    api_key="fsk_your_api_key",
    urls=["https://docs.example.com/"],
    max_depth=3,
    scraper="article",
)

# Load all pages as Documents
docs = loader.load()
print(f"Loaded {len(docs)} documents")

for doc in docs[:3]:
    print(f"  {doc.metadata['url']} - {doc.metadata['word_count']} words")
```

Each Document contains:

  • page_content — the extracted text
  • metadata — url, title, word_count, session_id, scraper, source

RAG Pipeline Example

```python
# Needs the extra packages behind these imports (langchain, langchain-openai,
# langchain-community, faiss-cpu) and an OPENAI_API_KEY in the environment.
from langchain_firescraper import FireScraperLoader
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Scrape
loader = FireScraperLoader(
    api_key="fsk_your_api_key",
    urls=["https://docs.example.com/"],
    max_depth=3,
)
docs = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(docs)

# 3. Embed and store
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 4. Query
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(),
)
answer = qa.invoke("How do I authenticate API requests?")
print(answer["result"])
```

Lazy Loading

For large crawls, use lazy_load() to process one document at a time:

```python
# Reuses loader, splitter, and vectorstore from the RAG pipeline example
for doc in loader.lazy_load():
    chunks = splitter.split_documents([doc])
    vectorstore.add_documents(chunks)
```
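
Adding documents one at a time can mean one embedding request per page. A middle ground is to pull small batches from the lazy iterator. The helper below is a hedged sketch in plain Python: batched is a hypothetical utility written out because itertools.batched only exists from Python 3.12, and the batch size of 50 is an arbitrary example value.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Usage with the loader, splitter, and vectorstore from the examples above:
# for batch in batched(loader.lazy_load(), 50):
#     vectorstore.add_documents(splitter.split_documents(batch))
```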

---

Related

  • API Documentation
  • TypeScript SDK on npm
  • Python SDK on PyPI