POST /crawler/jobs

Crawler jobs may take several minutes to complete. Use the Get job endpoint to check the status of a job, and fetch the results from the Get job data endpoint when the job is complete.
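A minimal sketch of assembling this request in Python. The API host (`API_BASE`) and the helper name are assumptions for illustration; only the `/crawler/jobs` path, the Bearer header, and the `urls` field come from this reference. Fields with documented defaults are left to the server here.

```python
import json

API_BASE = "https://api.example.com"  # hypothetical host; substitute your provider's API base URL

def build_crawl_job_request(token: str, urls: list[str]) -> tuple[str, dict, bytes]:
    """Assemble a POST /crawler/jobs request: endpoint URL, headers, and JSON body.

    Only `urls` is set explicitly; the remaining fields fall back to the
    documented defaults (an assumption based on the parameter list below).
    """
    endpoint = f"{API_BASE}/crawler/jobs"
    headers = {
        "Authorization": f"Bearer {token}",  # Bearer auth header, as required
        "Content-Type": "application/json",  # body is application/json
    }
    body = json.dumps({"urls": urls}).encode()
    return endpoint, headers, body
```

The returned tuple can be passed to any HTTP client; because jobs run asynchronously, the response should be a job handle to poll via the Get job endpoint rather than the crawl results themselves.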

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json
urls
string[]
required

Array of one or more website URLs to crawl

exclude_globs
string[]
required

Globs (https://developer.mozilla.org/en-US/docs/Web/API/URL_Pattern_API) matching page URLs to exclude from the crawl
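A few illustrative `exclude_globs` values; the hostnames and patterns below are examples only, not from the source.

```python
# Example exclude_globs entries; each glob is matched against full page URLs.
exclude_globs = [
    "https://example.com/login*",       # skip login and logout pages
    "https://example.com/tag/*",        # skip tag listing pages
    "https://*.example.com/private/*",  # skip private sections on any subdomain
]
```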

exclude_elements
string
default: nav, header, footer, script, style, noscript, svg, [role="alert"], [role="banner"], [role="dialog"], [role="alertdialog"], [role="region"][aria-label*="skip" i], [aria-modal="true"]
required

CSS selectors for content to exclude from the page HTML before it is converted to the output format (separate multiple selectors with commas)

output_format
enum<string>
default: text
required

Format in which to save all crawled page content

Available options:
text,
html,
markdown
output_expiry
number
default: 604800
required

Time in seconds to store crawler output before it is automatically deleted (default and maximum is 604800, i.e. 7 days)

min_length
number
default: 50
required

Skip any page whose output has fewer than the minimum number of characters (default 50)

webhook_url
string

Webhook URL to call with updates about job progress

page_limit
number
default: 10000
required

Maximum number of pages to crawl (limited to 10,000 pages on the free plan)

force_crawling_mode
enum<string>

Force the crawler to use sitemap-based or link-based crawling

Available options:
sitemap,
link
block_resources
boolean
default: true
required

Block loading of images, stylesheets, and scripts to speed up crawling

include_linked_files
boolean
default: false
required

Include linked files (e.g. PDFs, images) in the output as URLs
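Putting the parameters above together, a full request body with the documented defaults might look like the sketch below. The field names and default values come from this reference; the target URL is illustrative, and `exclude_elements` is omitted so the server-side default selector list applies (an assumption).

```python
import json

# Full job payload using the documented defaults; adjust per crawl.
job = {
    "urls": ["https://example.com"],   # required: one or more site URLs to crawl
    "exclude_globs": [],               # required: empty list excludes no URLs
    "output_format": "text",           # default; also accepts "html" or "markdown"
    "output_expiry": 604800,           # default and maximum: 7 days, in seconds
    "min_length": 50,                  # default: skip pages with fewer than 50 output chars
    "page_limit": 10000,               # default; also the free-plan cap
    "block_resources": True,           # default: skip images, stylesheets, scripts
    "include_linked_files": False,     # default: omit linked PDFs/images as URLs
}
payload = json.dumps(job)
```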