POST /crawler/jobs
curl --request POST \
  --url https://api.usescraper.com/crawler/jobs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "urls": [
    "<string>"
  ],
  "exclude_globs": [
    "<string>"
  ],
  "exclude_elements": "<string>",
  "output_format": "text",
  "output_expiry": 123,
  "min_length": 123,
  "webhook_url": "<string>",
  "page_limit": 123,
  "force_crawling_mode": "sitemap",
  "block_resources": true,
  "include_linked_files": true
}'
{
    "id": "7YEGS3M8Q2JD6TNMEJB8B6EKVS",
    "urls": [
        "https://example.com"
    ],
    "createdAt": 1699964378397,
    "status": "starting",
    "sitemapPageCount": 0,
    "progress": {
        "scraped": 0,
        "discarded": 0,
        "failed": 0
    },
    "costCents": 0,
    "webhookFails": []
}

Crawler jobs may take several minutes to complete. Use the Get job endpoint to check the status of a job, and fetch the results from the Get job data endpoint when the job is complete.
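For example, assuming the Get job endpoint follows the same base path as this one (GET /crawler/jobs/{id}; confirm the exact path on its reference page), polling the job created above might look like:

# Fetch the current status and progress of a job (path assumed)
curl --request GET \
  --url https://api.usescraper.com/crawler/jobs/7YEGS3M8Q2JD6TNMEJB8B6EKVS \
  --header 'Authorization: Bearer <token>'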

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json
urls
string[]
required

Array of one or more website URLs to crawl

exclude_globs
string[]
required

Globs (see https://developer.mozilla.org/en-US/docs/Web/API/URL_Pattern_API) to exclude matching page URLs from being crawled
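For example, to skip tag and archive pages you might send patterns like the following (the URLs are illustrative; the exact wildcard syntax follows the URL Pattern API linked above):

  "exclude_globs": [
    "https://example.com/tag/*",
    "https://example.com/archive/*"
  ]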

exclude_elements
string
default: nav, header, footer, script, style, noscript, svg, [role="alert"], [role="banner"], [role="dialog"], [role="alertdialog"], [role="region"][aria-label*="skip" i], [aria-modal="true"]
required

CSS selectors of content to exclude from the page HTML before converting to the output format (separate multiple selectors with commas)
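For example, to also strip a sidebar and a cookie banner before conversion, you could override the default with something like the following (the extra selectors are illustrative and depend on the target site's markup):

  "exclude_elements": "nav, header, footer, script, style, noscript, svg, .sidebar, #cookie-banner"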

output_format
enum<string>
default: text
required

Format to save all crawled page content to

Available options:
text,
html,
markdown
output_expiry
number
default: 604800
required

Time in seconds to store the crawler output, after which it is automatically deleted (the default and maximum value is 604800 seconds, i.e. 7 days)

min_length
number
default: 50
required

Skip any page whose output has fewer than the minimum number of characters (default 50)

webhook_url
string

Webhook to call with updates about job progress

page_limit
number
default: 10000
required

Maximum number of pages to crawl (limited to 10,000 pages on the free plan)

force_crawling_mode
enum<string>

Force the crawler to discover pages either via the sitemap or by following links

Available options:
sitemap,
link
block_resources
boolean
default: true
required

Block loading of images, stylesheets, and scripts to speed up crawling

include_linked_files
boolean
default: false
required

Include linked files (e.g. PDFs, images) in the output as URLs
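Putting several of these options together, a request that crawls a site to markdown, skips tag pages, caps the crawl at 500 pages, and reports progress to a webhook could look like this (all values are illustrative):

curl --request POST \
  --url https://api.usescraper.com/crawler/jobs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "urls": ["https://example.com"],
    "exclude_globs": ["https://example.com/tag/*"],
    "output_format": "markdown",
    "page_limit": 500,
    "webhook_url": "https://example.com/hooks/crawler"
  }'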