POST /crawler/jobs
curl --request POST \
  --url https://api.usescraper.com/crawler/jobs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "urls": [
    "<string>"
  ],
  "exclude_globs": [],
  "exclude_elements": "nav, header, footer, script, style, noscript, svg, [role=\"alert\"], [role=\"banner\"], [role=\"dialog\"], [role=\"alertdialog\"], [role=\"region\"][aria-label*=\"skip\" i], [aria-modal=\"true\"]",
  "output_format": "text",
  "output_expiry": 604800,
  "min_length": 50,
  "webhook_url": "<string>",
  "page_limit": 10000,
  "force_crawling_mode": "sitemap",
  "block_resources": true,
  "include_linked_files": false
}'
{
    "id": "7YEGS3M8Q2JD6TNMEJB8B6EKVS",
    "urls": [
        "https://example.com"
    ],
    "createdAt": 1699964378397,
    "status": "starting",
    "sitemapPageCount": 0,
    "progress": {
        "scraped": 0,
        "discarded": 0,
        "failed": 0
    },
    "costCents": 0,
    "webhookFails": []
}

Crawler jobs may take several minutes to complete. Use the Get job endpoint to check the status of a job, and fetch the results from the Get job data endpoint when the job is complete.
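
For example, a minimal polling loop might look like the sketch below. It assumes the Get job endpoint is GET /crawler/jobs/{id} and the Get job data endpoint is GET /crawler/jobs/{id}/data (check those endpoints' pages for the exact paths), and that "starting" and "running" are the in-progress statuses:

import time
import requests

API_BASE = "https://api.usescraper.com"
HEADERS = {"Authorization": "Bearer <token>"}

def wait_for_job(job_id, poll_seconds=15):
    """Poll a crawler job until it finishes, then fetch its output."""
    while True:
        job = requests.get(f"{API_BASE}/crawler/jobs/{job_id}", headers=HEADERS).json()
        print(f"status={job['status']} progress={job['progress']}")
        if job["status"] not in ("starting", "running"):  # assumed in-progress statuses
            break
        time.sleep(poll_seconds)
    # Fetch the crawled page content once the job is complete
    data = requests.get(f"{API_BASE}/crawler/jobs/{job_id}/data", headers=HEADERS)
    return data.json()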

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json
Crawler parameters
urls
string[]
required

Array of one or more website URLs to crawl

exclude_globs
string[]
required

Globs (URL patterns; see https://developer.mozilla.org/en-US/docs/Web/API/URL_Pattern_API) matching page URLs that should be excluded from the crawl
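
For illustration, two hypothetical patterns (not defaults) in URL Pattern syntax:

# Hypothetical exclude_globs values; adjust to your own site structure.
exclude_globs = [
    "https://example.com/blog/*",   # skip everything under /blog/
    "https://*.example.com/*",      # skip all subdomains
]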

exclude_elements
string
default:nav, header, footer, script, style, noscript, svg, [role="alert"], [role="banner"], [role="dialog"], [role="alertdialog"], [role="region"][aria-label*="skip" i], [aria-modal="true"]
required

CSS selectors for content to exclude from the page HTML before it is converted to the output format (separate multiple selectors with commas)

output_format
enum<string>
default:text
required

Format to save all crawled page content to

Available options: text, html, markdown

output_expiry
number
default:604800
required

Time in seconds to store crawler output for, after which it is automatically deleted (the default and maximum value is 604800 seconds, i.e. 7 days)

Required range: x <= 604800

min_length
number
default:50
required

Skip any page whose output has fewer than this minimum number of characters (default 50)

page_limit
number
default:10000
required

Maximum number of pages to crawl (limited to 10,000 pages on the free plan)

Required range: 0 < x <= 500000

block_resources
boolean
default:true
required

Block loading of images, stylesheets, and scripts to speed up crawling

include_linked_files
boolean
default:false
required

Include linked files (e.g. PDFs, images) in the output as URLs

webhook_url
string

Webhook URL to call with updates about job progress
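
As a sketch, a bare-bones receiver for these callbacks could look like the following. The exact payload shape is not documented on this page, so this example just logs whatever JSON body is posted:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlerWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            update = json.loads(body)  # payload is assumed to be JSON
        except json.JSONDecodeError:
            update = body.decode("utf-8", errors="replace")
        print("job update:", update)  # persist or trigger downstream work here
        self.send_response(200)       # acknowledge receipt
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), CrawlerWebhook).serve_forever()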

force_crawling_mode
enum<string>

Force the crawler to use either sitemap crawling or link crawling

Available options: sitemap, link

Response

201

The created job object. The job URL is provided in the Location header.
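
For example, a minimal sketch of creating a job and reading the Location header with Python's requests library (all values besides urls are the documented defaults; exclude_elements is omitted here for brevity):

import requests

resp = requests.post(
    "https://api.usescraper.com/crawler/jobs",
    headers={"Authorization": "Bearer <token>"},
    json={
        "urls": ["https://example.com"],
        "exclude_globs": [],
        "output_format": "text",
        "output_expiry": 604800,
        "min_length": 50,
        "page_limit": 10000,
        "block_resources": True,
        "include_linked_files": False,
    },
)
resp.raise_for_status()                  # expect 201 Created
job = resp.json()                        # the created job object
print(job["id"], job["status"])          # e.g. "7YEGS3M8Q2JD6TNMEJB8B6EKVS" "starting"
print(resp.headers.get("Location"))      # URL of the created job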