Python SDK: Installation, Usage, and LangChain Integration

The FireScraper Python SDK lets you scrape websites and download results directly from Python. It includes sync and async clients, plus a LangChain document loader for RAG pipelines.

PyPI: pypi.org/project/firescraper

---

Installation

bash

pip install firescraper

Requires Python 3.9 or later. The only dependency is httpx.

For LangChain integration, also install:

bash

pip install langchain-firescraper langchain-core

---

Authentication

Every request requires an API key. Create one from the API Keys page in your dashboard. Keys start with fsk_.

python

from firescraper import FireScraper

client = FireScraper("fsk_your_api_key")
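
Hard-coding the key works for a quick test, but reading it from the environment is usually safer. The sketch below shows one way to do that. The variable name `FIRESCRAPER_API_KEY` and the helper `load_api_key` are assumptions made for this example, not part of the SDK; the only documented fact it relies on is the `fsk_` prefix.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment after a basic sanity check.

    FIRESCRAPER_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get("FIRESCRAPER_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRESCRAPER_API_KEY is not set")
    if not key.startswith("fsk_"):
        raise ValueError("API keys are expected to start with 'fsk_'")
    return key


# Usage (requires the SDK and a real key):
# client = FireScraper(load_api_key())
```

Failing early on a missing or malformed key gives a clearer error than waiting for the first request to be rejected.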

---

Quick Start

python

from firescraper import FireScraper

client = FireScraper("fsk_your_api_key")

# Start a crawl
session = client.scrape(
    name="Docs crawl",
    urls=["https://docs.example.com/"],
    max_depth=2,
    scraper="article",
)
print(f"Session started: {session.id}")

# Wait for completion with progress updates
result = client.wait_for_completion(
    session.id,
    on_progress=lambda s: print(
        f"  {s.counts.success}/{s.counts.total} pages"
    ),
)
print(f"Done! {result.counts.success} pages scraped")

# Download results
download = client.get_results(session.id, format="json")
with open("results.json", "wb") as f:
    f.write(download.data)

---

Async Client

For async frameworks (FastAPI, etc.) or concurrent crawls, use AsyncFireScraper:

python

import asyncio
from firescraper import AsyncFireScraper

async def main():
    async with AsyncFireScraper("fsk_your_api_key") as client:
        session = await client.scrape(
            name="Async crawl",
            urls=["https://docs.example.com/"],
            max_depth=2,
        )
        result = await client.wait_for_completion(session.id)
        download = await client.get_results(session.id, format="json")
        print(f"Downloaded {len(download.data)} bytes")

asyncio.run(main())

AsyncFireScraper exposes the same methods as the sync client; each one returns a coroutine that must be awaited.
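
For the concurrent-crawl case mentioned above, one possible pattern is to start one crawl per seed URL and wait for all of them with `asyncio.gather`. This is a sketch, not an SDK feature: `crawl_all` is a hypothetical helper, and it only assumes that `client` offers the awaitable `scrape(...)` and `wait_for_completion(...)` methods shown in this section.

```python
import asyncio
from typing import Any, Iterable, List


async def crawl_all(client: Any, seeds: Iterable[str], max_depth: int = 1) -> List[Any]:
    """Start one crawl per seed URL and wait for all of them concurrently.

    `client` is expected to behave like AsyncFireScraper (assumption):
    awaitable scrape(...) and wait_for_completion(session_id) methods.
    """

    async def crawl_one(url: str) -> Any:
        session = await client.scrape(
            name=f"Crawl {url}", urls=[url], max_depth=max_depth
        )
        return await client.wait_for_completion(session.id)

    # gather() runs the coroutines concurrently and returns results in seed order.
    return await asyncio.gather(*(crawl_one(url) for url in seeds))


# Usage (requires the SDK and a real key):
# async with AsyncFireScraper("fsk_your_api_key") as client:
#     results = await crawl_all(client, ["https://a.example/", "https://b.example/"])
```

If any crawl raises, `gather` propagates the first exception by default; pass `return_exceptions=True` to it if partial results are preferable.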

---

Available Methods

| Method | Description |
|---|---|
| client.scrape(name, urls, ...) | Start a new crawl session |
| client.get_session(session_id) | Get status, page counts, queue depth |
| client.wait_for_completion(session_id) | Poll until the crawl finishes |
| client.list_results(session_id) | List available export files |
| client.get_results(session_id, format) | Download results (json, csv, markdown, zip, etc.) |
| client.get_partial_results(session_id) | Download mid-crawl results |
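
To make "poll until the crawl finishes" concrete, here is a generic polling loop with a timeout. It is an illustration of the idea only, not the SDK's implementation of `wait_for_completion`: `poll_until`, its parameters, and the way "finished" is detected (the `is_done` callback) are all assumptions of this sketch.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    is_done: Callable[[T], bool],
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() repeatedly until is_done(result) is true or the timeout elapses.

    Raises Python's built-in TimeoutError (not the SDK exception of the same name).
    """
    deadline = time.monotonic() + timeout
    while True:
        state = fetch()
        if is_done(state):
            return state
        # Give up if the next poll would land after the deadline.
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"not finished after {timeout} seconds")
        sleep(interval)


# Usage idea (the completion check is left to the caller, since the
# session's status field is not documented here):
# session = poll_until(lambda: client.get_session(session_id), is_done=my_check)
```

In practice the built-in `wait_for_completion` covers this; a hand-rolled loop is mainly useful when custom stop conditions are needed.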

---

Scrape Options

The scrape() method accepts these parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_depth | int | 1 | Link-hop depth (0 = seed URLs only) |
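
The meaning of link-hop depth can be illustrated with a breadth-first walk over a toy link graph. This sketch only demonstrates the depth semantics in the table (0 = seed URLs only); it is not the crawler's actual traversal logic, and `pages_within_depth` and the sample graph are made up for the example.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def pages_within_depth(
    links: Dict[str, List[str]], seeds: Iterable[str], max_depth: int
) -> Set[str]:
    """Return every page reachable from the seeds in at most max_depth link hops."""
    seen: Set[str] = set(seeds)
    queue = deque((url, 0) for url in seen)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


# Toy link graph: "/" links to "/a" and "/b"; "/a" links to "/a/1".
links = {"/": ["/a", "/b"], "/a": ["/a/1"]}
print(sorted(pages_within_depth(links, ["/"], 0)))  # ['/']
print(sorted(pages_within_depth(links, ["/"], 1)))  # ['/', '/a', '/b']
print(sorted(pages_within_depth(links, ["/"], 2)))  # ['/', '/a', '/a/1', '/b']
```

Each extra level of depth can multiply the number of pages fetched, so it is worth starting with a small value and increasing it only as needed.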

---

Download Formats

Use client.get_results(session_id, format="...") with one of:

| Format | Description |
|---|---|
| json | JSON array of all pages |
| csv | Tabular data with URL, title, text, word count |
| markdown | One Markdown file with all page content |
| zip | ZIP bundle of individual text files |
| documents | JSONL: one JSON object per line (page-level) |
| chunks | JSONL: one JSON object per line (chunk-level) |
| structured | Structured extraction results |
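
The two JSONL formats can be read line by line with the standard library. The sketch below assumes the download payload is UTF-8 bytes, as with `download.data` in the Quick Start; the helper `iter_jsonl` and the field names in the sample (`url`, `word_count`) are illustrative assumptions, not a documented schema.

```python
import json
from typing import Any, Dict, Iterator


def iter_jsonl(data: bytes) -> Iterator[Dict[str, Any]]:
    """Yield one parsed object per non-empty line of a JSONL payload."""
    # bytes.splitlines() only splits on ASCII line endings, which cannot
    # appear unescaped inside a JSON string.
    for raw_line in data.splitlines():
        if raw_line.strip():
            yield json.loads(raw_line)


# Hypothetical sample payload; real records may carry different fields.
sample = b'{"url": "https://a.example/", "word_count": 3}\n{"url": "https://b.example/", "word_count": 5}\n'
for record in iter_jsonl(sample):
    print(record["url"], record["word_count"])

# Usage with the SDK (requires a finished session):
# download = client.get_results(session.id, format="documents")
# for record in iter_jsonl(download.data): ...
```

Parsing one line at a time keeps memory use proportional to a single record rather than the whole export.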

---

Error Handling

The SDK raises typed exceptions for API errors:

python

from firescraper.exceptions import (
    AuthenticationError,  # 401 — bad or missing API key
    BadRequestError,      # 400 — invalid parameters
    NotFoundError,        # 404 — session not found
    RateLimitError,       # 429 — too many requests
    ServerError,          # 5xx — server-side issue
    TimeoutError,         # poll or request timeout
)

try:
    session = client.scrape(name="Test", urls=["https://example.com"])
except AuthenticationError:
    print("Check your API key")
except RateLimitError:
    print("Too many requests — retry in a moment")
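
For transient errors such as rate limits, a common approach is to retry with exponential backoff. The helper below is a generic sketch rather than anything the SDK provides: `call_with_retry` and its parameters are hypothetical, and the exception classes to retry on are passed in by the caller.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff when it raises retry_on."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("attempts must be at least 1")


# Usage with the SDK exceptions imported above:
# session = call_with_retry(
#     lambda: client.scrape(name="Test", urls=["https://example.com"]),
#     retry_on=(RateLimitError, ServerError),
# )
```

Errors such as `AuthenticationError` or `BadRequestError` are deliberately left out of `retry_on` in the usage comment, since repeating the same request would not fix them.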

---

LangChain Integration

The FireScraperLoader turns any website into LangChain Document objects:

python

from langchain_firescraper import FireScraperLoader

loader = FireScraperLoader(
    api_key="fsk_your_api_key",
    urls=["https://docs.example.com/"],
    max_depth=3,
    scraper="article",
)

# Load all pages as Documents
docs = loader.load()
print(f"Loaded {len(docs)} documents")

for doc in docs[:3]:
    print(f"  {doc.metadata['url']} — {doc.metadata['word_count']} words")

Each Document contains:

page_content — the extracted text

metadata — url, title, word_count, session_id, scraper, source

RAG Pipeline Example

python

from langchain_firescraper import FireScraperLoader
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Scrape
loader = FireScraperLoader(
    api_key="fsk_your_api_key",
    urls=["https://docs.example.com/"],
    max_depth=3,
)
docs = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(docs)

# 3. Embed and store
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 4. Query
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(),
)
answer = qa.invoke("How do I authenticate API requests?")
print(answer["result"])

Lazy Loading

For large crawls, use lazy_load() to process one document at a time:

python

for doc in loader.lazy_load():
    chunks = splitter.split_documents([doc])
    vectorstore.add_documents(chunks)
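
Adding documents one at a time can mean one embedding call per page. A middle ground is to consume the lazy iterator in small batches, as sketched below. `in_batches` is a hypothetical helper written for this example, and the batch size of 50 in the usage comment is an arbitrary choice.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing the whole iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice() pulls at most `size` items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


# Usage with the loader, splitter, and vectorstore defined earlier:
# for batch in in_batches(loader.lazy_load(), 50):
#     vectorstore.add_documents(splitter.split_documents(batch))
```

Only one batch of documents is held in memory at a time, so the benefit of `lazy_load()` is preserved.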

---

Related

API Documentation

TypeScript SDK on npm

Python SDK on PyPI
