Make Your First Crawl

Make your first crawl request in under 5 minutes. By the end of this guide, you'll submit a crawl job, check its status, and retrieve the results.

This guide assumes you've completed Environment Setup and have your API key configured.

What You'll Build

In this tutorial, you'll create a script that:

  • Submits a crawl job to fetch a web page
  • Automatically polls until the job completes
  • Downloads and saves the results

The UpRock Crawl API fetches web content through a distributed network of real devices, giving you access to geographically specific data as real users see it.

Create and Run Your Crawl Script

Choose your language and copy the complete script. Each script handles the entire workflow: submit → poll → download.

Create a file named crawl.py:

import os
import sys
import time
import json
import requests
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
api_key = os.getenv('UPROCK_API_KEY')
if not api_key:
    sys.exit("Error: UPROCK_API_KEY not set")

base_url = os.getenv('UPROCK_BASE_URL', 'https://edge.uprock.com')

# 1. Submit a crawl job
print("Submitting crawl job...")
try:
    response = requests.post(
        f"{base_url}/crawl/v1/new",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json={"url": "https://example.com", "method": "GET"}
    )
    response.raise_for_status()
    job = response.json()
    job_id = job['job_id']
    print(f"✓ Job submitted: {job_id}")
except requests.exceptions.RequestException as e:
    sys.exit(f"Failed to submit job: {e}")

# 2. Poll for completion (max 60 seconds)
print("Waiting for crawl to complete...")
start_time = time.time()
wait_time = 2

while time.time() - start_time < 60:
    try:
        response = requests.get(
            f"{base_url}/crawl/v1/status/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"}
        )
        response.raise_for_status()
        status = response.json()
        
        if status['state'] == 'completed':
            print("✓ Crawl completed!")
            break
        elif status['state'] in ['failed', 'timeout']:
            sys.exit(f"Crawl failed: {status.get('error', status['state'])}")
        
        print(f"  Status: {status['state']}... waiting {wait_time}s")
        time.sleep(wait_time)
        wait_time = min(wait_time * 1.5, 10)  # Exponential backoff
        
    except requests.exceptions.RequestException as e:
        sys.exit(f"Failed to get status: {e}")
else:
    sys.exit("Polling timeout after 60 seconds")

# 3. Download and save results
print("Downloading results...")
try:
    response = requests.get(
        status['download_url'],
        headers={"Authorization": f"Bearer {api_key}"}
    )
    response.raise_for_status()
    
    # Parse the JSON response
    result = response.json()
    
    # Save to file
    os.makedirs('results', exist_ok=True)
    with open(f"results/{job_id}.json", 'w') as f:
        json.dump(result, f, indent=2)
    
    print(f"✓ Results saved to results/{job_id}.json")
    print(f"  Status Code: {result.get('status_code')}")
    print(f"  Body preview: {result.get('body', '')[:200]}...")
    
except requests.exceptions.RequestException as e:
    sys.exit(f"Failed to download results: {e}")

Run the script:

python crawl.py

Expected output:

Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...

Create a file named crawl.js:

const axios = require('axios');
const fs = require('fs');
require('dotenv').config();

const apiKey = process.env.UPROCK_API_KEY;
if (!apiKey) {
  console.error('Error: UPROCK_API_KEY not set');
  process.exit(1);
}

const baseUrl = process.env.UPROCK_BASE_URL || 'https://edge.uprock.com';

async function crawl() {
  try {
    // 1. Submit a crawl job
    console.log('Submitting crawl job...');
    const submitResponse = await axios.post(
      `${baseUrl}/crawl/v1/new`,
      { url: 'https://example.com', method: 'GET' },
      {
        headers: {
          'Authorization': `Bearer ${apiKey}`,
          'Content-Type': 'application/json'
        }
      }
    );
    
    const jobId = submitResponse.data.job_id;
    console.log(`✓ Job submitted: ${jobId}`);
    
    // 2. Poll for completion (max 60 seconds)
    console.log('Waiting for crawl to complete...');
    const startTime = Date.now();
    let waitTime = 2000;
    let status;
    
    while (Date.now() - startTime < 60000) {
      const statusResponse = await axios.get(
        `${baseUrl}/crawl/v1/status/${jobId}`,
        { headers: { Authorization: `Bearer ${apiKey}` } }
      );
      
      status = statusResponse.data;
      
      if (status.state === 'completed') {
        console.log('✓ Crawl completed!');
        break;
      } else if (status.state === 'failed' || status.state === 'timeout') {
        console.error(`Crawl failed: ${status.error || status.state}`);
        process.exit(1);
      }
      
      console.log(`  Status: ${status.state}... waiting ${waitTime/1000}s`);
      await new Promise(resolve => setTimeout(resolve, waitTime));
      waitTime = Math.min(waitTime * 1.5, 10000); // Exponential backoff
    }
    
    if (status.state !== 'completed') {
      console.error('Polling timeout after 60 seconds');
      process.exit(1);
    }
    
    // 3. Download and save results
    console.log('Downloading results...');
    const resultsResponse = await axios.get(
      status.download_url,
      { headers: { Authorization: `Bearer ${apiKey}` } }
    );
    
    const result = resultsResponse.data;
    
    // Save to file
    if (!fs.existsSync('results')) {
      fs.mkdirSync('results');
    }
    fs.writeFileSync(`results/${jobId}.json`, JSON.stringify(result, null, 2));
    
    console.log(`✓ Results saved to results/${jobId}.json`);
    console.log(`  Status Code: ${result.status_code}`);
    console.log(`  Body preview: ${result.body?.substring(0, 200)}...`);
    
  } catch (error) {
    if (error.response) {
      console.error(`API Error ${error.response.status}: ${error.response.data?.error || error.message}`);
      if (error.response.status === 401) {
        console.error('Check your API key');
      }
    } else {
      console.error(`Error: ${error.message}`);
    }
    process.exit(1);
  }
}

crawl();

Run the script:

node crawl.js

Expected output:

Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...

Create a file named crawl.rb:

require 'httparty'
require 'json'
require 'fileutils'
require 'dotenv/load'

api_key = ENV['UPROCK_API_KEY']
abort("Error: UPROCK_API_KEY not set") unless api_key

base_url = ENV['UPROCK_BASE_URL'] || 'https://edge.uprock.com'

# 1. Submit a crawl job
puts 'Submitting crawl job...'
response = HTTParty.post(
  "#{base_url}/crawl/v1/new",
  headers: {
    'Authorization' => "Bearer #{api_key}",
    'Content-Type' => 'application/json'
  },
  body: { url: 'https://example.com', method: 'GET' }.to_json
)

abort("Failed to submit job: #{response.code} #{response.body}") unless response.success?

job = JSON.parse(response.body)
job_id = job['job_id']
puts "✓ Job submitted: #{job_id}"

# 2. Poll for completion (max 60 seconds)
puts 'Waiting for crawl to complete...'
start_time = Time.now
wait_time = 2

while Time.now - start_time < 60
  response = HTTParty.get(
    "#{base_url}/crawl/v1/status/#{job_id}",
    headers: { 'Authorization' => "Bearer #{api_key}" }
  )
  
  abort("Failed to get status: #{response.code}") unless response.success?
  status = JSON.parse(response.body)
  
  case status['state']
  when 'completed'
    puts '✓ Crawl completed!'
    break
  when 'failed', 'timeout'
    abort("Crawl failed: #{status['error'] || status['state']}")
  else
    puts "  Status: #{status['state']}... waiting #{wait_time}s"
    sleep wait_time
    wait_time = [wait_time * 1.5, 10].min  # Exponential backoff
  end
end

abort("Polling timeout after 60 seconds") if status['state'] != 'completed'

# 3. Download and save results
puts 'Downloading results...'
response = HTTParty.get(
  status['download_url'],
  headers: { 'Authorization' => "Bearer #{api_key}" }
)

abort("Failed to download: #{response.code}") unless response.success?

result = JSON.parse(response.body)

# Save to file
FileUtils.mkdir_p('results')
File.write("results/#{job_id}.json", JSON.pretty_generate(result))

puts "✓ Results saved to results/#{job_id}.json"
puts "  Status Code: #{result['status_code']}"
puts "  Body preview: #{result['body'][0..200]}..."

Run the script:

ruby crawl.rb

Expected output:

Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...

Create a file named main.go:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
    "time"
    
    "github.com/joho/godotenv"
)

type JobSubmission struct {
    URL    string `json:"url"`
    Method string `json:"method"`
}

type JobResponse struct {
    JobID string `json:"job_id"`
}

type StatusResponse struct {
    State       string `json:"state"`
    DownloadURL string `json:"download_url"`
    Error       string `json:"error,omitempty"`
}

type CrawlResult struct {
    StatusCode int               `json:"status_code"`
    Headers    map[string]string `json:"headers"`
    Body       string            `json:"body"`
}

func main() {
    // Load environment
    godotenv.Load()
    apiKey := os.Getenv("UPROCK_API_KEY")
    if apiKey == "" {
        log.Fatal("Error: UPROCK_API_KEY not set")
    }
    
    baseURL := os.Getenv("UPROCK_BASE_URL")
    if baseURL == "" {
        baseURL = "https://edge.uprock.com"
    }
    
    client := &http.Client{Timeout: 30 * time.Second}
    
    // 1. Submit a crawl job
    fmt.Println("Submitting crawl job...")
    submission := JobSubmission{
        URL:    "https://example.com",
        Method: "GET",
    }
    
    jsonData, err := json.Marshal(submission)
    if err != nil {
        log.Fatalf("Failed to marshal request: %v", err)
    }
    
    req, err := http.NewRequest("POST", baseURL+"/crawl/v1/new", bytes.NewBuffer(jsonData))
    if err != nil {
        log.Fatalf("Failed to create request: %v", err)
    }
    
    req.Header.Set("Authorization", "Bearer "+apiKey)
    req.Header.Set("Content-Type", "application/json")
    
    resp, err := client.Do(req)
    if err != nil {
        log.Fatalf("Failed to submit job: %v", err)
    }
    defer resp.Body.Close()
    
    if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
        body, _ := io.ReadAll(resp.Body)
        log.Fatalf("Failed to submit job: %d %s", resp.StatusCode, body)
    }
    
    var job JobResponse
    if err := json.NewDecoder(resp.Body).Decode(&job); err != nil {
        log.Fatalf("Failed to decode response: %v", err)
    }
    fmt.Printf("✓ Job submitted: %s\n", job.JobID)
    
    // 2. Poll for completion (max 60 seconds)
    fmt.Println("Waiting for crawl to complete...")
    startTime := time.Now()
    waitTime := 2 * time.Second
    var status StatusResponse
    
    for time.Since(startTime) < 60*time.Second {
        req, _ := http.NewRequest("GET", baseURL+"/crawl/v1/status/"+job.JobID, nil)
        req.Header.Set("Authorization", "Bearer "+apiKey)
        
        resp, err := client.Do(req)
        if err != nil {
            log.Fatalf("Failed to get status: %v", err)
        }
        
        if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
            resp.Body.Close()
            log.Fatalf("Failed to decode status: %v", err)
        }
        resp.Body.Close()
        
        switch status.State {
        case "completed":
            fmt.Println("✓ Crawl completed!")
            goto download
        case "failed", "timeout":
            log.Fatalf("Crawl failed: %s", status.Error)
        default:
            fmt.Printf("  Status: %s... waiting %v\n", status.State, waitTime)
            time.Sleep(waitTime)
            if waitTime < 10*time.Second {
                waitTime = time.Duration(float64(waitTime) * 1.5)
            }
        }
    }
    log.Fatal("Polling timeout after 60 seconds")
    
download:
    // 3. Download and save results
    fmt.Println("Downloading results...")
    req, _ = http.NewRequest("GET", status.DownloadURL, nil)
    req.Header.Set("Authorization", "Bearer "+apiKey)
    
    resp, err = client.Do(req)
    if err != nil {
        log.Fatalf("Failed to download results: %v", err)
    }
    defer resp.Body.Close()
    
    var result CrawlResult
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        log.Fatalf("Failed to decode results: %v", err)
    }
    
    // Save to file
    os.MkdirAll("results", 0755)
    filename := fmt.Sprintf("results/%s.json", job.JobID)
    
    data, _ := json.MarshalIndent(result, "", "  ")
    if err := os.WriteFile(filename, data, 0644); err != nil {
        log.Fatalf("Failed to save results: %v", err)
    }
    
    fmt.Printf("✓ Results saved to %s\n", filename)
    fmt.Printf("  Status Code: %d\n", result.StatusCode)
    
    preview := result.Body
    if len(preview) > 200 {
        preview = preview[:200]
    }
    fmt.Printf("  Body preview: %s...\n", preview)
}

Run the script:

go run main.go

Expected output:

Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...

Run these commands in your terminal:

# 1. Submit a job and capture the job_id
JOB_ID=$(curl -s -X POST "https://edge.uprock.com/crawl/v1/new" \
  -H "Authorization: Bearer $UPROCK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "method": "GET"}' | jq -r '.job_id')

echo "✓ Job submitted: ${JOB_ID}"

# 2. Wait a few seconds, then check status
sleep 3
STATUS=$(curl -s "https://edge.uprock.com/crawl/v1/status/${JOB_ID}" \
  -H "Authorization: Bearer $UPROCK_API_KEY")

echo "Status: $(echo "$STATUS" | jq -r '.state')"

# 3. If completed, download results
DOWNLOAD_URL=$(echo "$STATUS" | jq -r '.download_url')
if [ "$DOWNLOAD_URL" != "null" ]; then
  echo "✓ Crawl completed!"
  echo "Downloading results..."
  curl -s "${DOWNLOAD_URL}" -H "Authorization: Bearer $UPROCK_API_KEY" | jq '.' > "result_${JOB_ID}.json"
  echo "✓ Results saved to result_${JOB_ID}.json"
else
  echo "Job still processing. Check status again or increase sleep time."
fi

The cURL example uses a simple sleep 3 instead of polling. For production use, implement proper polling with exponential backoff as shown in the other language examples.

About polling: Crawl jobs are processed asynchronously. The scripts automatically check the job status every few seconds using exponential backoff (2s → 3s → 4.5s → 10s max) until completion. Learn more about polling strategies.
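The backoff schedule the scripts follow can be sketched in a few lines of Python. This is an illustrative standalone helper (not part of the API client) that reproduces the 2s start, 1.5× multiplier, and 10s cap described above, stopping once the 60-second polling budget would be exceeded:

```python
def backoff_schedule(start=2.0, factor=1.5, cap=10.0, total=60.0):
    """Yield successive wait times (seconds) until their sum would exceed `total`."""
    wait, elapsed = start, 0.0
    while elapsed + wait <= total:
        yield wait
        elapsed += wait
        wait = min(wait * factor, cap)  # grow by `factor`, never past `cap`

print([round(w, 2) for w in backoff_schedule()])
# → [2.0, 3.0, 4.5, 6.75, 10.0, 10.0, 10.0, 10.0]
```

In a real polling loop you would sleep for each yielded value between status checks, which is exactly what the scripts above do inline.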

Understanding the Response

The script creates a results/ folder in your current directory and saves the crawl output to results/{job_id}.json.

Result structure:

{
  "status_code": 200,
  "headers": {
    "content-type": "text/html; charset=utf-8",
    "server": "nginx/1.18.0",
    "content-length": "1256"
  },
  "body": "<!DOCTYPE html>\n<html lang=\"en\">...</html>",
  "metadata": {
    "crawled_at": "2024-01-15T10:30:05Z",
    "device_id": "device-xyz789",
    "response_time_ms": 342,
    "device_location": {
      "country": "US",
      "region": "California",
      "city": "San Francisco"
    }
  }
}
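Once a result is on disk, you can pull out individual fields with a few lines of Python. The snippet below parses a sample payload mirroring the structure shown above (the values are illustrative; with a saved file you would use json.load on results/{job_id}.json instead):

```python
import json

# Sample payload matching the documented result structure (values are illustrative)
sample = '''
{
  "status_code": 200,
  "headers": {"content-type": "text/html; charset=utf-8"},
  "body": "<!DOCTYPE html><html>...</html>",
  "metadata": {
    "crawled_at": "2024-01-15T10:30:05Z",
    "response_time_ms": 342,
    "device_location": {"country": "US", "city": "San Francisco"}
  }
}
'''

result = json.loads(sample)  # for a saved file: json.load(open(f"results/{job_id}.json"))
loc = result["metadata"]["device_location"]
print(result["status_code"])              # → 200
print(result["headers"]["content-type"])  # → text/html; charset=utf-8
print(f"Crawled from {loc['city']}, {loc['country']}")  # → Crawled from San Francisco, US
```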

What your script did:

  1. You submitted a job to /crawl/v1/new
  2. The API assigned it to a device on the network
  3. The device fetched the page and returned results
  4. Your script saved everything to a local JSON file

Next Steps

Congratulations! You've completed your first crawl.

Need help? Check the Troubleshooting Guide or contact support@uprock.com.
