# Make Your First Crawl
Make your first crawl request in under 5 minutes. By the end of this guide, you'll submit a crawl job, check its status, and retrieve the results.
This guide assumes you've completed the Environment Setup and have your API key configured.
## What You'll Build
In this tutorial, you'll create a script that:
- Submits a crawl job to fetch a web page
- Automatically polls until the job completes
- Downloads and saves the results
The UpRock Crawl API fetches web content through a distributed network of real devices, giving you access to geographically specific data as real users see it.
## Create and Run Your Crawl Script
Choose your language and copy the complete script. Each script handles the entire workflow: submit → poll → download.
Create a file named `crawl.py`:

```python
import os
import sys
import time
import json

import requests
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
api_key = os.getenv('UPROCK_API_KEY')
if not api_key:
    sys.exit("Error: UPROCK_API_KEY not set")
base_url = os.getenv('UPROCK_BASE_URL', 'https://edge.uprock.com')

# 1. Submit a crawl job
print("Submitting crawl job...")
try:
    response = requests.post(
        f"{base_url}/crawl/v1/new",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json={"url": "https://example.com", "method": "GET"}
    )
    response.raise_for_status()
    job = response.json()
    job_id = job['job_id']
    print(f"✓ Job submitted: {job_id}")
except requests.exceptions.RequestException as e:
    sys.exit(f"Failed to submit job: {e}")

# 2. Poll for completion (max 60 seconds)
print("Waiting for crawl to complete...")
start_time = time.time()
wait_time = 2
while time.time() - start_time < 60:
    try:
        response = requests.get(
            f"{base_url}/crawl/v1/status/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"}
        )
        response.raise_for_status()
        status = response.json()
        if status['state'] == 'completed':
            print("✓ Crawl completed!")
            break
        elif status['state'] in ['failed', 'timeout']:
            sys.exit(f"Crawl failed: {status.get('error', status['state'])}")
        print(f"  Status: {status['state']}... waiting {wait_time}s")
        time.sleep(wait_time)
        wait_time = min(wait_time * 1.5, 10)  # Exponential backoff
    except requests.exceptions.RequestException as e:
        sys.exit(f"Failed to get status: {e}")
else:
    sys.exit("Polling timeout after 60 seconds")

# 3. Download and save results
print("Downloading results...")
try:
    response = requests.get(
        status['download_url'],
        headers={"Authorization": f"Bearer {api_key}"}
    )
    response.raise_for_status()

    # Parse the JSON response
    result = response.json()

    # Save to file
    os.makedirs('results', exist_ok=True)
    with open(f"results/{job_id}.json", 'w') as f:
        json.dump(result, f, indent=2)

    print(f"✓ Results saved to results/{job_id}.json")
    print(f"  Status Code: {result.get('status_code')}")
    print(f"  Body preview: {result.get('body', '')[:200]}...")
except requests.exceptions.RequestException as e:
    sys.exit(f"Failed to download results: {e}")
```

Run the script:

```bash
python crawl.py
```

Expected output:
```
Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...
```

Create a file named `crawl.js`:
```javascript
const axios = require('axios');
const fs = require('fs');
require('dotenv').config();

const apiKey = process.env.UPROCK_API_KEY;
if (!apiKey) {
  console.error('Error: UPROCK_API_KEY not set');
  process.exit(1);
}
const baseUrl = process.env.UPROCK_BASE_URL || 'https://edge.uprock.com';

async function crawl() {
  try {
    // 1. Submit a crawl job
    console.log('Submitting crawl job...');
    const submitResponse = await axios.post(
      `${baseUrl}/crawl/v1/new`,
      { url: 'https://example.com', method: 'GET' },
      {
        headers: {
          'Authorization': `Bearer ${apiKey}`,
          'Content-Type': 'application/json'
        }
      }
    );
    const jobId = submitResponse.data.job_id;
    console.log(`✓ Job submitted: ${jobId}`);

    // 2. Poll for completion (max 60 seconds)
    console.log('Waiting for crawl to complete...');
    const startTime = Date.now();
    let waitTime = 2000;
    let status;
    while (Date.now() - startTime < 60000) {
      const statusResponse = await axios.get(
        `${baseUrl}/crawl/v1/status/${jobId}`,
        { headers: { Authorization: `Bearer ${apiKey}` } }
      );
      status = statusResponse.data;
      if (status.state === 'completed') {
        console.log('✓ Crawl completed!');
        break;
      } else if (status.state === 'failed' || status.state === 'timeout') {
        console.error(`Crawl failed: ${status.error || status.state}`);
        process.exit(1);
      }
      console.log(`  Status: ${status.state}... waiting ${waitTime / 1000}s`);
      await new Promise(resolve => setTimeout(resolve, waitTime));
      waitTime = Math.min(waitTime * 1.5, 10000); // Exponential backoff
    }
    if (status.state !== 'completed') {
      console.error('Polling timeout after 60 seconds');
      process.exit(1);
    }

    // 3. Download and save results
    console.log('Downloading results...');
    const resultsResponse = await axios.get(
      status.download_url,
      { headers: { Authorization: `Bearer ${apiKey}` } }
    );
    const result = resultsResponse.data;

    // Save to file
    if (!fs.existsSync('results')) {
      fs.mkdirSync('results');
    }
    fs.writeFileSync(`results/${jobId}.json`, JSON.stringify(result, null, 2));
    console.log(`✓ Results saved to results/${jobId}.json`);
    console.log(`  Status Code: ${result.status_code}`);
    console.log(`  Body preview: ${result.body?.substring(0, 200)}...`);
  } catch (error) {
    if (error.response) {
      console.error(`API Error ${error.response.status}: ${error.response.data?.error || error.message}`);
      if (error.response.status === 401) {
        console.error('Check your API key');
      }
    } else {
      console.error(`Error: ${error.message}`);
    }
    process.exit(1);
  }
}

crawl();
```

Run the script:

```bash
node crawl.js
```

Expected output:
```
Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...
```

Create a file named `crawl.rb`:
```ruby
require 'httparty'
require 'json'
require 'fileutils'
require 'dotenv/load'

api_key = ENV['UPROCK_API_KEY']
abort('Error: UPROCK_API_KEY not set') unless api_key
base_url = ENV['UPROCK_BASE_URL'] || 'https://edge.uprock.com'

# 1. Submit a crawl job
puts 'Submitting crawl job...'
response = HTTParty.post(
  "#{base_url}/crawl/v1/new",
  headers: {
    'Authorization' => "Bearer #{api_key}",
    'Content-Type' => 'application/json'
  },
  body: { url: 'https://example.com', method: 'GET' }.to_json
)
abort("Failed to submit job: #{response.code} #{response.body}") unless response.success?
job = JSON.parse(response.body)
job_id = job['job_id']
puts "✓ Job submitted: #{job_id}"

# 2. Poll for completion (max 60 seconds)
puts 'Waiting for crawl to complete...'
start_time = Time.now
wait_time = 2
while Time.now - start_time < 60
  response = HTTParty.get(
    "#{base_url}/crawl/v1/status/#{job_id}",
    headers: { 'Authorization' => "Bearer #{api_key}" }
  )
  abort("Failed to get status: #{response.code}") unless response.success?
  status = JSON.parse(response.body)
  case status['state']
  when 'completed'
    puts '✓ Crawl completed!'
    break
  when 'failed', 'timeout'
    abort("Crawl failed: #{status['error'] || status['state']}")
  else
    puts "  Status: #{status['state']}... waiting #{wait_time}s"
    sleep wait_time
    wait_time = [wait_time * 1.5, 10].min # Exponential backoff
  end
end
abort('Polling timeout after 60 seconds') if status['state'] != 'completed'

# 3. Download and save results
puts 'Downloading results...'
response = HTTParty.get(
  status['download_url'],
  headers: { 'Authorization' => "Bearer #{api_key}" }
)
abort("Failed to download: #{response.code}") unless response.success?
result = JSON.parse(response.body)

# Save to file
FileUtils.mkdir_p('results')
File.write("results/#{job_id}.json", JSON.pretty_generate(result))
puts "✓ Results saved to results/#{job_id}.json"
puts "  Status Code: #{result['status_code']}"
puts "  Body preview: #{result['body'][0..200]}..."
```

Run the script:

```bash
ruby crawl.rb
```

Expected output:
```
Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...
```

Create a file named `main.go`:
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/joho/godotenv"
)

type JobSubmission struct {
	URL    string `json:"url"`
	Method string `json:"method"`
}

type JobResponse struct {
	JobID string `json:"job_id"`
}

type StatusResponse struct {
	State       string `json:"state"`
	DownloadURL string `json:"download_url"`
	Error       string `json:"error,omitempty"`
}

type CrawlResult struct {
	StatusCode int               `json:"status_code"`
	Headers    map[string]string `json:"headers"`
	Body       string            `json:"body"`
}

func main() {
	// Load environment
	godotenv.Load()
	apiKey := os.Getenv("UPROCK_API_KEY")
	if apiKey == "" {
		log.Fatal("Error: UPROCK_API_KEY not set")
	}
	baseURL := os.Getenv("UPROCK_BASE_URL")
	if baseURL == "" {
		baseURL = "https://edge.uprock.com"
	}
	client := &http.Client{Timeout: 30 * time.Second}

	// 1. Submit a crawl job
	fmt.Println("Submitting crawl job...")
	submission := JobSubmission{
		URL:    "https://example.com",
		Method: "GET",
	}
	jsonData, err := json.Marshal(submission)
	if err != nil {
		log.Fatalf("Failed to marshal request: %v", err)
	}
	req, err := http.NewRequest("POST", baseURL+"/crawl/v1/new", bytes.NewBuffer(jsonData))
	if err != nil {
		log.Fatalf("Failed to create request: %v", err)
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		log.Fatalf("Failed to submit job: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
		body, _ := io.ReadAll(resp.Body)
		log.Fatalf("Failed to submit job: %d %s", resp.StatusCode, body)
	}
	var job JobResponse
	if err := json.NewDecoder(resp.Body).Decode(&job); err != nil {
		log.Fatalf("Failed to decode response: %v", err)
	}
	fmt.Printf("✓ Job submitted: %s\n", job.JobID)

	// 2. Poll for completion (max 60 seconds)
	fmt.Println("Waiting for crawl to complete...")
	startTime := time.Now()
	waitTime := 2 * time.Second
	var status StatusResponse
	for time.Since(startTime) < 60*time.Second {
		req, _ := http.NewRequest("GET", baseURL+"/crawl/v1/status/"+job.JobID, nil)
		req.Header.Set("Authorization", "Bearer "+apiKey)
		resp, err := client.Do(req)
		if err != nil {
			log.Fatalf("Failed to get status: %v", err)
		}
		if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
			resp.Body.Close()
			log.Fatalf("Failed to decode status: %v", err)
		}
		resp.Body.Close()

		switch status.State {
		case "completed":
			fmt.Println("✓ Crawl completed!")
			goto download
		case "failed", "timeout":
			log.Fatalf("Crawl failed: %s", status.Error)
		default:
			fmt.Printf("  Status: %s... waiting %v\n", status.State, waitTime)
			time.Sleep(waitTime)
			if waitTime < 10*time.Second {
				waitTime = time.Duration(float64(waitTime) * 1.5)
			}
		}
	}
	log.Fatal("Polling timeout after 60 seconds")

download:
	// 3. Download and save results
	fmt.Println("Downloading results...")
	req, _ = http.NewRequest("GET", status.DownloadURL, nil)
	req.Header.Set("Authorization", "Bearer "+apiKey)
	resp, err = client.Do(req)
	if err != nil {
		log.Fatalf("Failed to download results: %v", err)
	}
	defer resp.Body.Close()
	var result CrawlResult
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		log.Fatalf("Failed to decode results: %v", err)
	}

	// Save to file
	os.Mkdir("results", 0755)
	filename := fmt.Sprintf("results/%s.json", job.JobID)
	data, _ := json.MarshalIndent(result, "", "  ")
	if err := os.WriteFile(filename, data, 0644); err != nil {
		log.Fatalf("Failed to save results: %v", err)
	}
	fmt.Printf("✓ Results saved to %s\n", filename)
	fmt.Printf("  Status Code: %d\n", result.StatusCode)
	preview := result.Body
	if len(preview) > 200 {
		preview = preview[:200]
	}
	fmt.Printf("  Body preview: %s...\n", preview)
}
```

Run the script:

```bash
go run main.go
```

Expected output:
```
Submitting crawl job...
✓ Job submitted: abc123xyz
Waiting for crawl to complete...
  Status: pending... waiting 2s
  Status: assigned... waiting 3s
✓ Crawl completed!
Downloading results...
✓ Results saved to results/abc123xyz.json
  Status Code: 200
  Body preview: <!DOCTYPE html><html>...
```

Run these commands in your terminal:
```bash
# 1. Submit a job and capture the job_id
JOB_ID=$(curl -s -X POST "https://edge.uprock.com/crawl/v1/new" \
  -H "Authorization: Bearer $UPROCK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "method": "GET"}' | jq -r '.job_id')
echo "✓ Job submitted: ${JOB_ID}"

# 2. Wait a few seconds, then check status
sleep 3
STATUS=$(curl -s "https://edge.uprock.com/crawl/v1/status/${JOB_ID}" \
  -H "Authorization: Bearer $UPROCK_API_KEY")
echo "Status: $(echo "$STATUS" | jq -r '.state')"

# 3. If completed, download results
DOWNLOAD_URL=$(echo "$STATUS" | jq -r '.download_url')
if [ "$DOWNLOAD_URL" != "null" ]; then
  echo "✓ Crawl completed!"
  echo "Downloading results..."
  curl -s "${DOWNLOAD_URL}" -H "Authorization: Bearer $UPROCK_API_KEY" | jq '.' > "result_${JOB_ID}.json"
  echo "✓ Results saved to result_${JOB_ID}.json"
else
  echo "Job still processing. Check status again or increase sleep time."
fi
```

The cURL example uses a simple `sleep 3` instead of polling. For production use, implement proper polling with exponential backoff, as shown in the other language examples.
About polling: Crawl jobs are processed asynchronously. The scripts automatically check the job status every few seconds, using exponential backoff (2s → 3s → 4.5s → 6.75s → 10s max), until completion. Learn more about polling strategies.
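The backoff schedule described above can be sketched in a few lines of Python; the `backoff_schedule` helper is illustrative, not part of the API:

```python
def backoff_schedule(initial=2, factor=1.5, cap=10, steps=5):
    """Return the first `steps` wait times for a capped exponential backoff."""
    waits = []
    wait = initial
    for _ in range(steps):
        waits.append(wait)
        wait = min(wait * factor, cap)  # grow the wait, but never past the cap
    return waits

print(backoff_schedule())  # [2, 3.0, 4.5, 6.75, 10]
```

Capping the wait keeps the worst-case polling interval bounded while the early retries stay fast.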
## Understanding the Response
The script creates a `results/` folder in your current directory and saves the crawl output as `results/{job_id}.json`.
Result structure:

```json
{
  "status_code": 200,
  "headers": {
    "content-type": "text/html; charset=utf-8",
    "server": "nginx/1.18.0",
    "content-length": "1256"
  },
  "body": "<!DOCTYPE html>\n<html lang=\"en\">...</html>",
  "metadata": {
    "crawled_at": "2024-01-15T10:30:05Z",
    "device_id": "device-xyz789",
    "response_time_ms": 342,
    "device_location": {
      "country": "US",
      "region": "California",
      "city": "San Francisco"
    }
  }
}
```

What your script did:

- You submitted a job to `/crawl/v1/new`
- The API assigned it to a device on the network
- The device fetched the page and returned results
- Your script saved everything to a local JSON file
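To work with a saved result programmatically, a minimal sketch like this pulls out the fields shown above. The `summarize_result` helper and the sample dict are illustrative, not part of the API:

```python
def summarize_result(result):
    """Extract commonly used fields from a crawl result dict."""
    meta = result.get('metadata', {})
    location = meta.get('device_location', {})
    return {
        'status_code': result.get('status_code'),
        'country': location.get('country'),
        'response_time_ms': meta.get('response_time_ms'),
        'body_bytes': len(result.get('body', '')),
    }

# Sample shaped like the result structure above
sample = {
    'status_code': 200,
    'body': '<!DOCTYPE html>...',
    'metadata': {
        'response_time_ms': 342,
        'device_location': {'country': 'US', 'region': 'California'},
    },
}
print(summarize_result(sample))
```

Using `.get()` with defaults keeps the helper safe when a field (for example `metadata`) is missing from a particular result.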
## Next Steps
Congratulations! You've successfully completed your first crawl. Here's what to explore next:
- **Basic Crawl Operations**: Master the job lifecycle, handle errors, and optimize your polling strategy
- **Session Management**: Chain requests together to handle logins and multi-page forms, and to maintain cookies
- **Troubleshooting Guide**: Solutions for common issues and error codes
- **API Reference**: Explore every endpoint, parameter, and response format in detail
Need help? Check the Troubleshooting Guide or contact support@uprock.com.