Make Your First Crawl

Make your first crawl request in under 5 minutes. By the end of this guide, you'll submit a crawl job, check its status, and retrieve the results.

This guide assumes you've completed Environment Setup and have your API key configured.

What You'll Build

In this tutorial, you'll create a script that:

  • Submits a crawl job to fetch a web page
  • Automatically polls until the job completes
  • Downloads and saves the results

The UpRock Crawl API fetches web content through a distributed network of real devices, giving you access to geographically specific data as real users see it.

Create and Run Your Crawl Script

Choose your language and copy the complete script. Each script handles the entire workflow: submit → poll → download.

Create a file named crawl.py:

import os
import sys
import time
import json
import requests
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
api_key = os.getenv('UPROCK_API_KEY')
if not api_key:
    sys.exit("Error: UPROCK_API_KEY not set")

base_url = os.getenv('UPROCK_BASE_URL', 'https://edge.uprock.com')

# 1. Submit a crawl job
print("Submitting crawl job...")
try:
    response = requests.post(
        f"{base_url}/crawl/v1/new",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json={"url": "https://example.com", "method": "GET"}
    )
    response.raise_for_status()
    job = response.json()
    job_id = job['job_id']
    print(f"✓ Job submitted: {job_id}")
except requests.exceptions.RequestException as e:
    sys.exit(f"Failed to submit job: {e}")

# 2. Poll for completion (max 60 seconds)
print("Waiting for crawl to complete...")
start_time = time.time()
wait_time = 2

while time.time() - start_time < 60:
    try:
        response = requests.get(
            f"{base_url}/crawl/v1/status/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"}
        )
        response.raise_for_status()
        status = response.json()
        
        if status['state'] == 'completed':
            print("✓ Crawl completed!")
            break
        elif status['state'] in ['failed', 'timeout']:
            sys.exit(f"Crawl failed: {status.get('error', status['state'])}")
        
        print(f"  Status: {status['state']}... waiting {wait_time}s")
        time.sleep(wait_time)
        wait_time = min(wait_time * 1.5, 10)  # Exponential backoff
        
    except requests.exceptions.RequestException as e:
        sys.exit(f"Failed to get status: {e}")
else:
    sys.exit("Polling timeout after 60 seconds")

# 3. Download and save results
print("Downloading results...")
try:
    response = requests.get(
        status['download_url'],
        headers={"Authorization": f"Bearer {api_key}"}
    )
    response.raise_for_status()
    
    # Parse the JSON response
    result = response.json()
    
    # Save to file
    os.makedirs('results', exist_ok=True)
    with open(f"results/{job_id}.json", 'w') as f:
        json.dump(result, f, indent=2)
    
    print(f"✓ Results saved to results/{job_id}.json")
    print(f"  Status Code: {result.get('status_code')}")
    print(f"  Body preview: {result.get('body', '')[:200]}...")
    
except requests.exceptions.RequestException as e:
    sys.exit(f"Failed to download results: {e}")

Run the script:

python crawl.py

Expected output:

Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...

Create a file named crawl.js:

const axios = require('axios');
const fs = require('fs');
require('dotenv').config();

const apiKey = process.env.UPROCK_API_KEY;
if (!apiKey) {
  console.error('Error: UPROCK_API_KEY not set');
  process.exit(1);
}

const baseUrl = process.env.UPROCK_BASE_URL || 'https://edge.uprock.com';

async function crawl() {
  try {
    // 1. Submit a crawl job
    console.log('Submitting crawl job...');
    const submitResponse = await axios.post(
      `${baseUrl}/crawl/v1/new`,
      { url: 'https://example.com', method: 'GET' },
      {
        headers: {
          'Authorization': `Bearer ${apiKey}`,
          'Content-Type': 'application/json'
        }
      }
    );
    
    const jobId = submitResponse.data.job_id;
    console.log(`✓ Job submitted: ${jobId}`);
    
    // 2. Poll for completion (max 60 seconds)
    console.log('Waiting for crawl to complete...');
    const startTime = Date.now();
    let waitTime = 2000;
    let status;
    
    while (Date.now() - startTime < 60000) {
      const statusResponse = await axios.get(
        `${baseUrl}/crawl/v1/status/${jobId}`,
        { headers: { Authorization: `Bearer ${apiKey}` } }
      );
      
      status = statusResponse.data;
      
      if (status.state === 'completed') {
        console.log('✓ Crawl completed!');
        break;
      } else if (status.state === 'failed' || status.state === 'timeout') {
        console.error(`Crawl failed: ${status.error || status.state}`);
        process.exit(1);
      }
      
      console.log(`  Status: ${status.state}... waiting ${waitTime/1000}s`);
      await new Promise(resolve => setTimeout(resolve, waitTime));
      waitTime = Math.min(waitTime * 1.5, 10000); // Exponential backoff
    }
    
    if (status.state !== 'completed') {
      console.error('Polling timeout after 60 seconds');
      process.exit(1);
    }
    
    // 3. Download and save results
    console.log('Downloading results...');
    const resultsResponse = await axios.get(
      status.download_url,
      { headers: { Authorization: `Bearer ${apiKey}` } }
    );
    
    const result = resultsResponse.data;
    
    // Save to file
    if (!fs.existsSync('results')) {
      fs.mkdirSync('results');
    }
    fs.writeFileSync(`results/${jobId}.json`, JSON.stringify(result, null, 2));
    
    console.log(`✓ Results saved to results/${jobId}.json`);
    console.log(`  Status Code: ${result.status_code}`);
    console.log(`  Body preview: ${result.body?.substring(0, 200)}...`);
    
  } catch (error) {
    if (error.response) {
      console.error(`API Error ${error.response.status}: ${error.response.data?.error || error.message}`);
      if (error.response.status === 401) {
        console.error('Check your API key');
      }
    } else {
      console.error(`Error: ${error.message}`);
    }
    process.exit(1);
  }
}

crawl();

Run the script:

node crawl.js

Expected output:

Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...

Create a file named crawl.rb:

require 'httparty'
require 'json'
require 'fileutils'
require 'dotenv/load'

api_key = ENV['UPROCK_API_KEY']
abort("Error: UPROCK_API_KEY not set") unless api_key

base_url = ENV['UPROCK_BASE_URL'] || 'https://edge.uprock.com'

# 1. Submit a crawl job
puts 'Submitting crawl job...'
response = HTTParty.post(
  "#{base_url}/crawl/v1/new",
  headers: {
    'Authorization' => "Bearer #{api_key}",
    'Content-Type' => 'application/json'
  },
  body: { url: 'https://example.com', method: 'GET' }.to_json
)

abort("Failed to submit job: #{response.code} #{response.body}") unless response.success?

job = JSON.parse(response.body)
job_id = job['job_id']
puts "✓ Job submitted: #{job_id}"

# 2. Poll for completion (max 60 seconds)
puts 'Waiting for crawl to complete...'
start_time = Time.now
wait_time = 2

while Time.now - start_time < 60
  response = HTTParty.get(
    "#{base_url}/crawl/v1/status/#{job_id}",
    headers: { 'Authorization' => "Bearer #{api_key}" }
  )
  
  abort("Failed to get status: #{response.code}") unless response.success?
  status = JSON.parse(response.body)
  
  case status['state']
  when 'completed'
    puts '✓ Crawl completed!'
    break
  when 'failed', 'timeout'
    abort("Crawl failed: #{status['error'] || status['state']}")
  else
    puts "  Status: #{status['state']}... waiting #{wait_time}s"
    sleep wait_time
    wait_time = [wait_time * 1.5, 10].min  # Exponential backoff
  end
end

abort("Polling timeout after 60 seconds") if status['state'] != 'completed'

# 3. Download and save results
puts 'Downloading results...'
response = HTTParty.get(
  status['download_url'],
  headers: { 'Authorization' => "Bearer #{api_key}" }
)

abort("Failed to download: #{response.code}") unless response.success?

result = JSON.parse(response.body)

# Save to file
FileUtils.mkdir_p('results')
File.write("results/#{job_id}.json", JSON.pretty_generate(result))

puts "✓ Results saved to results/#{job_id}.json"
puts "  Status Code: #{result['status_code']}"
puts "  Body preview: #{result['body'][0..200]}..."

Run the script:

ruby crawl.rb

Expected output:

Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...

Create a file named main.go:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
    "time"
    
    "github.com/joho/godotenv"
)

type JobSubmission struct {
    URL    string `json:"url"`
    Method string `json:"method"`
}

type JobResponse struct {
    JobID string `json:"job_id"`
}

type StatusResponse struct {
    State       string `json:"state"`
    DownloadURL string `json:"download_url"`
    Error       string `json:"error,omitempty"`
}

type CrawlResult struct {
    StatusCode int               `json:"status_code"`
    Headers    map[string]string `json:"headers"`
    Body       string            `json:"body"`
}

func main() {
    // Load environment
    godotenv.Load()
    apiKey := os.Getenv("UPROCK_API_KEY")
    if apiKey == "" {
        log.Fatal("Error: UPROCK_API_KEY not set")
    }
    
    baseURL := os.Getenv("UPROCK_BASE_URL")
    if baseURL == "" {
        baseURL = "https://edge.uprock.com"
    }
    
    client := &http.Client{Timeout: 30 * time.Second}
    
    // 1. Submit a crawl job
    fmt.Println("Submitting crawl job...")
    submission := JobSubmission{
        URL:    "https://example.com",
        Method: "GET",
    }
    
    jsonData, err := json.Marshal(submission)
    if err != nil {
        log.Fatalf("Failed to marshal request: %v", err)
    }
    
    req, err := http.NewRequest("POST", baseURL+"/crawl/v1/new", bytes.NewBuffer(jsonData))
    if err != nil {
        log.Fatalf("Failed to create request: %v", err)
    }
    
    req.Header.Set("Authorization", "Bearer "+apiKey)
    req.Header.Set("Content-Type", "application/json")
    
    resp, err := client.Do(req)
    if err != nil {
        log.Fatalf("Failed to submit job: %v", err)
    }
    defer resp.Body.Close()
    
    if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
        body, _ := io.ReadAll(resp.Body)
        log.Fatalf("Failed to submit job: %d %s", resp.StatusCode, body)
    }
    
    var job JobResponse
    if err := json.NewDecoder(resp.Body).Decode(&job); err != nil {
        log.Fatalf("Failed to decode response: %v", err)
    }
    fmt.Printf("✓ Job submitted: %s\n", job.JobID)
    
    // 2. Poll for completion (max 60 seconds)
    fmt.Println("Waiting for crawl to complete...")
    startTime := time.Now()
    waitTime := 2 * time.Second
    var status StatusResponse
    
    for time.Since(startTime) < 60*time.Second {
        req, _ := http.NewRequest("GET", baseURL+"/crawl/v1/status/"+job.JobID, nil)
        req.Header.Set("Authorization", "Bearer "+apiKey)
        
        resp, err := client.Do(req)
        if err != nil {
            log.Fatalf("Failed to get status: %v", err)
        }
        
        if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
            resp.Body.Close()
            log.Fatalf("Failed to decode status: %v", err)
        }
        resp.Body.Close()
        
        switch status.State {
        case "completed":
            fmt.Println("✓ Crawl completed!")
            goto download
        case "failed", "timeout":
            log.Fatalf("Crawl failed: %s", status.Error)
        default:
            fmt.Printf("  Status: %s... waiting %v\n", status.State, waitTime)
            time.Sleep(waitTime)
            if waitTime < 10*time.Second {
                waitTime = time.Duration(float64(waitTime) * 1.5)
            }
        }
    }
    log.Fatal("Polling timeout after 60 seconds")
    
download:
    // 3. Download and save results
    fmt.Println("Downloading results...")
    req, _ = http.NewRequest("GET", status.DownloadURL, nil)
    req.Header.Set("Authorization", "Bearer "+apiKey)
    
    resp, err = client.Do(req)
    if err != nil {
        log.Fatalf("Failed to download results: %v", err)
    }
    defer resp.Body.Close()
    
    var result CrawlResult
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        log.Fatalf("Failed to decode results: %v", err)
    }
    
    // Save to file
    os.MkdirAll("results", 0755)
    filename := fmt.Sprintf("results/%s.json", job.JobID)
    
    data, _ := json.MarshalIndent(result, "", "  ")
    if err := os.WriteFile(filename, data, 0644); err != nil {
        log.Fatalf("Failed to save results: %v", err)
    }
    
    fmt.Printf("✓ Results saved to %s\n", filename)
    fmt.Printf("  Status Code: %d\n", result.StatusCode)
    
    preview := result.Body
    if len(preview) > 200 {
        preview = preview[:200]
    }
    fmt.Printf("  Body preview: %s...\n", preview)
}

Run the script:

go run main.go

Expected output:

Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...

Run these commands in your terminal:

# 1. Submit a job and capture the job_id
JOB_ID=$(curl -s -X POST "https://edge.uprock.com/crawl/v1/new" \
  -H "Authorization: Bearer $UPROCK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "method": "GET"}' | jq -r '.job_id')

echo "✓ Job submitted: ${JOB_ID}"

# 2. Wait a few seconds, then check status
sleep 3
STATUS=$(curl -s "https://edge.uprock.com/crawl/v1/status/${JOB_ID}" \
  -H "Authorization: Bearer $UPROCK_API_KEY")

echo "Status: $(echo "$STATUS" | jq -r '.state')"

# 3. If completed, download results
DOWNLOAD_URL=$(echo "$STATUS" | jq -r '.download_url')
if [ "$DOWNLOAD_URL" != "null" ]; then
  echo "✓ Crawl completed!"
  echo "Downloading results..."
  curl -s "${DOWNLOAD_URL}" -H "Authorization: Bearer $UPROCK_API_KEY" | jq '.' > "result_${JOB_ID}.json"
  echo "✓ Results saved to result_${JOB_ID}.json"
else
  echo "Job still processing. Check status again or increase sleep time."
fi

The cURL example uses a simple sleep 3 instead of polling. For production use, implement proper polling with exponential backoff as shown in the other language examples.

About polling: Crawl jobs are processed asynchronously. The scripts automatically check the job status every few seconds using exponential backoff (2s → 3s → 4.5s → 10s max) until completion. Learn more about polling strategies.
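The backoff schedule the scripts follow can be sketched in a few lines of Python. This is an illustrative standalone helper (not part of the API client) that reproduces the 2s start, 1.5× multiplier, and 10s cap described above, stopping once the 60-second polling budget would be exceeded:

```python
def backoff_schedule(start=2.0, factor=1.5, cap=10.0, total=60.0):
    """Yield successive wait times (seconds) until their sum would exceed `total`."""
    wait, elapsed = start, 0.0
    while elapsed + wait <= total:
        yield wait
        elapsed += wait
        wait = min(wait * factor, cap)  # grow by `factor`, never past `cap`

print([round(w, 2) for w in backoff_schedule()])
# → [2.0, 3.0, 4.5, 6.75, 10.0, 10.0, 10.0, 10.0]
```

In a real polling loop you would sleep for each yielded value between status checks, which is exactly what the scripts above do inline.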

Understanding the Response

The script creates a results/ folder in your current directory and saves the crawl output to results/{job_id}.json.

Result structure:

{
  "status_code": 200,
  "headers": {
    "content-type": "text/html; charset=utf-8",
    "server": "nginx/1.18.0",
    "content-length": "1256"
  },
  "body": "<!DOCTYPE html>\n<html lang=\"en\">...</html>",
  "metadata": {
    "crawled_at": "2024-01-15T10:30:05Z",
    "device_id": "device-xyz789",
    "response_time_ms": 342,
    "device_location": {
      "country": "US",
      "region": "California",
      "city": "San Francisco"
    }
  }
}
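Once a result is on disk, you can pull out individual fields with a few lines of Python. The snippet below parses a sample payload mirroring the structure shown above (the values are illustrative; with a saved file you would use json.load on results/{job_id}.json instead):

```python
import json

# Sample payload matching the documented result structure (values are illustrative)
sample = '''
{
  "status_code": 200,
  "headers": {"content-type": "text/html; charset=utf-8"},
  "body": "<!DOCTYPE html><html>...</html>",
  "metadata": {
    "crawled_at": "2024-01-15T10:30:05Z",
    "response_time_ms": 342,
    "device_location": {"country": "US", "city": "San Francisco"}
  }
}
'''

result = json.loads(sample)  # for a saved file: json.load(open(f"results/{job_id}.json"))
loc = result["metadata"]["device_location"]
print(result["status_code"])              # → 200
print(result["headers"]["content-type"])  # → text/html; charset=utf-8
print(f"Crawled from {loc['city']}, {loc['country']}")  # → Crawled from San Francisco, US
```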

What your script did:

  1. You submitted a job to /crawl/v1/new
  2. The API assigned it to a device on the network
  3. The device fetched the page and returned results
  4. Your script saved everything to a local JSON file

Next Steps

Congratulations! You've completed your first crawl.

Need help? Check the Troubleshooting Guide or contact support@uprock.com.
