Python SDK: Installation, Usage, and LangChain Integration
The FireScraper Python SDK lets you scrape websites and download results directly from Python. It includes sync and async clients, plus a LangChain document loader for RAG pipelines.
PyPI: pypi.org/project/firescraper
---
Installation
pip install firescraper
Requires Python 3.9 or later. The only dependency is httpx.
For LangChain integration, also install:
pip install langchain-firescraper langchain-core
---
Authentication
Every request requires an API key. Create one from the API Keys page in your dashboard. Keys start with fsk_.
from firescraper import FireScraper
client = FireScraper("fsk_your_api_key")
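Hard-coding the key is fine for a quick test, but in shared code it is usually safer to read it from the environment. A minimal sketch, assuming a hypothetical environment variable named FIRESCRAPER_API_KEY (the name is not defined by the SDK) and relying only on the documented fsk_ prefix:

```python
import os


def load_api_key(env_var: str = "FIRESCRAPER_API_KEY") -> str:
    """Read the API key from the environment and sanity-check its prefix.

    The variable name is a hypothetical choice for this example; the only
    fact taken from the docs is that keys start with "fsk_".
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} to your FireScraper API key")
    if not key.startswith("fsk_"):
        raise ValueError("FireScraper API keys are expected to start with 'fsk_'")
    return key
```

The returned string can then be passed to FireScraper(...) in place of the literal key.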
---
Quick Start
from firescraper import FireScraper
client = FireScraper("fsk_your_api_key")

# Start a crawl
session = client.scrape(
    name="Docs crawl",
    urls=["https://docs.example.com/"],
    max_depth=2,
    scraper="article",
)
print(f"Session started: {session.id}")

# Wait for completion with progress updates
result = client.wait_for_completion(
    session.id,
    on_progress=lambda s: print(
        f" {s.counts.success}/{s.counts.total} pages"
    ),
)
print(f"Done! {result.counts.success} pages scraped")

# Download results
download = client.get_results(session.id, format="json")
with open("results.json", "wb") as f:
    f.write(download.data)
---
Async Client
For async frameworks (FastAPI, etc.) or concurrent crawls, use AsyncFireScraper:
import asyncio

from firescraper import AsyncFireScraper


async def main():
    async with AsyncFireScraper("fsk_your_api_key") as client:
        session = await client.scrape(
            name="Async crawl",
            urls=["https://docs.example.com/"],
            max_depth=2,
        )
        result = await client.wait_for_completion(session.id)
        download = await client.get_results(session.id, format="json")
        print(f"Downloaded {len(download.data)} bytes")


asyncio.run(main())
AsyncFireScraper exposes the same methods as the sync client; each one returns a coroutine and must be awaited.
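Running several crawls concurrently is one reason to pick the async client. A rough sketch using asyncio.gather follows; it calls only scrape() and wait_for_completion() with the arguments already shown in this section, and the crawl_many helper and its jobs argument are illustrative names, not part of the SDK:

```python
import asyncio


async def crawl_many(client, jobs):
    """Run several crawls concurrently and return their final sessions.

    `client` is expected to behave like AsyncFireScraper; `jobs` is a list of
    (name, urls) tuples. Both the helper and its arguments are illustrative.
    """

    async def run_one(name, urls):
        session = await client.scrape(name=name, urls=urls)
        return await client.wait_for_completion(session.id)

    # gather() returns results in the same order as `jobs`.
    return await asyncio.gather(*(run_one(name, urls) for name, urls in jobs))
```

Inside an async with AsyncFireScraper(...) block, this could be awaited as results = await crawl_many(client, [("Docs", ["https://docs.example.com/"])]).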
---
Available Methods
| Method | Description |
|---|---|
| client.scrape(name, urls, ...) | Start a new crawl session |
| client.get_session(session_id) | Get status, page counts, queue depth |
| client.wait_for_completion(session_id) | Poll until the crawl finishes |
| client.list_results(session_id) | List available export files |
| client.get_results(session_id, format) | Download results (json, csv, markdown, zip, etc.) |
| client.get_partial_results(session_id) | Download mid-crawl results |
---
Scrape Options
The scrape() method accepts these parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | string | required | Human-readable session name |
| urls | list[str] | required | Seed URLs to crawl |
| max_depth | int | 1 | Link-hop depth (0 = seed URLs only) |
| scraper | string | "article" | "article" or "full" |
| ignore_urls | list[str] | None | URLs to exclude |
| min_text_length | int | None | Minimum word count to keep a page |
| webhook_url | string | None | URL to notify on completion |
| extraction_schema | dict | None | JSON Schema for structured extraction |
| respect_robots_txt | bool | None | Respect target site's robots.txt |
| content_selector | string | None | CSS selector to restrict extraction |
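
As an illustration of how these parameters combine, the sketch below assembles keyword arguments for scrape() and performs a few client-side checks. The build_scrape_kwargs helper and the schema contents are hypothetical; only the parameter names and the allowed scraper values come from the table above.

```python
def build_scrape_kwargs(name, urls, *, max_depth=1, scraper="article", **optional):
    """Assemble keyword arguments for scrape(), omitting unset options."""
    if not urls:
        raise ValueError("urls must contain at least one seed URL")
    if max_depth < 0:
        raise ValueError("max_depth must be 0 or greater")
    if scraper not in ("article", "full"):
        raise ValueError('scraper must be "article" or "full"')

    kwargs = {"name": name, "urls": list(urls), "max_depth": max_depth, "scraper": scraper}
    # Leave out options that are None so the service defaults apply.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


# Hypothetical JSON Schema for structured extraction of a docs page.
page_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "summary": {"type": "string"},
    },
    "required": ["title"],
}

kwargs = build_scrape_kwargs(
    "Docs crawl",
    ["https://docs.example.com/"],
    max_depth=0,  # seed URLs only
    extraction_schema=page_schema,
    content_selector="main",
    webhook_url=None,  # dropped from the result
)
```

The resulting dictionary could then be passed as client.scrape(**kwargs).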
---
Download Formats
Use client.get_results(session_id, format="...") with one of:
| Format | Description |
|---|---|
| json | JSON array of all pages |
| csv | Tabular data with URL, title, text, word count |
| markdown | One Markdown file with all page content |
| zip | ZIP bundle of individual text files |
| documents | JSONL — one JSON object per line (page-level) |
| chunks | JSONL — one JSON object per line (chunk-level) |
| structured | Structured extraction results |
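
The documents and chunks formats are JSONL, so they are parsed line by line rather than with a single json.loads call. A minimal sketch, assuming the payload is the bytes held in download.data (as in the Quick Start) and is UTF-8 encoded; the field names in the sample are made up for the example:

```python
import json


def parse_jsonl(data: bytes) -> list:
    """Parse JSONL bytes into a list of Python objects, skipping blank lines."""
    records = []
    # Split on "\n" only: str.splitlines() would also break on characters
    # such as U+2028, which may legally appear inside JSON strings.
    for line in data.decode("utf-8").split("\n"):
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Sample payload with hypothetical fields, standing in for download.data.
sample = (
    b'{"url": "https://docs.example.com/", "text": "Hello"}\n'
    b"\n"
    b'{"url": "https://docs.example.com/a", "text": "World"}\n'
)
records = parse_jsonl(sample)
print(len(records))
```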
---
Error Handling
The SDK raises typed exceptions for API errors:
from firescraper.exceptions import (
    AuthenticationError,  # 401: bad or missing API key
    BadRequestError,      # 400: invalid parameters
    NotFoundError,        # 404: session not found
    RateLimitError,       # 429: too many requests
    ServerError,          # 5xx: server-side issue
    TimeoutError,         # poll or request timeout
)

try:
    session = client.scrape(name="Test", urls=["https://example.com"])
except AuthenticationError:
    print("Check your API key")
except RateLimitError:
    print("Too many requests; retry in a moment")
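
Rate-limit and server errors are often transient, so a common pattern is to retry with exponential backoff. The helper below is a generic sketch and not part of the SDK; it takes the exception classes to retry on as an argument, so in practice one might pass (RateLimitError, ServerError). The attempt count and delays are arbitrary defaults.

```python
import time


def call_with_retry(fn, retry_on, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on the given exception types with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises the last
    error once `attempts` calls have failed.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

For example, the scrape call above could be wrapped as call_with_retry(lambda: client.scrape(name="Test", urls=["https://example.com"]), retry_on=(RateLimitError, ServerError)).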
---
LangChain Integration
The FireScraperLoader turns any website into LangChain Document objects:
from langchain_firescraper import FireScraperLoader

loader = FireScraperLoader(
    api_key="fsk_your_api_key",
    urls=["https://docs.example.com/"],
    max_depth=3,
    scraper="article",
)

# Load all pages as Documents
docs = loader.load()
print(f"Loaded {len(docs)} documents")
for doc in docs[:3]:
    print(f" {doc.metadata['url']}: {doc.metadata['word_count']} words")
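
The metadata keys used in the loop above (url and word_count) can also drive simple filtering before indexing. The sketch below drops very short pages and repeated URLs; the filter_documents helper and the 50-word threshold are illustrative choices, and it works on any objects that carry a metadata dict, as LangChain Document objects do.

```python
def filter_documents(docs, min_words=50):
    """Drop short pages and duplicate URLs, keeping first occurrences in order."""
    seen = set()
    kept = []
    for doc in docs:
        url = doc.metadata.get("url")
        if url in seen or doc.metadata.get("word_count", 0) < min_words:
            continue
        seen.add(url)
        kept.append(doc)
    return kept
```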
Each Document contains:
page_content: the extracted text
metadata: url, title, word_count, session_id, scraper, source

RAG Pipeline Example
# Besides langchain-firescraper, this example needs the langchain,
# langchain-openai, langchain-community, and faiss-cpu packages,
# plus an OPENAI_API_KEY environment variable.
from langchain_firescraper import FireScraperLoader
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Scrape
loader = FireScraperLoader(
    api_key="fsk_your_api_key",
    urls=["https://docs.example.com/"],
    max_depth=3,
)
docs = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(docs)

# 3. Embed and store
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 4. Query
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(),
)
answer = qa.invoke("How do I authenticate API requests?")
print(answer["result"])
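
To make chunk_size and chunk_overlap concrete, here is a simplified fixed-width sliding window over characters. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers to break on separators such as paragraphs and lines); it only illustrates how consecutive chunks share chunk_overlap characters.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-width chunks where neighbours overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)
```

With the pipeline's settings (1000 and 200), each window starts 800 characters after the previous one.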
Lazy Loading
For large crawls, use lazy_load() to process one document at a time:
for doc in loader.lazy_load():
    chunks = splitter.split_documents([doc])
    vectorstore.add_documents(chunks)
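
Adding documents one at a time keeps memory flat but can mean many small calls to the vector store. One possible middle ground is to group the lazy stream into small batches. The batched helper below is an illustrative sketch (Python 3.12 ships a similar itertools.batched), and any batch size chosen is arbitrary.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable, lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    # islice pulls at most `size` items; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch
```

With the loader, splitter, and vectorstore from the RAG example, this could be used as: for batch in batched(loader.lazy_load(), 20): vectorstore.add_documents(splitter.split_documents(batch)).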
