Retries and Resilience

The SDK retries failed requests automatically using two layers of resilience: per-request backoff with decorrelated jitter, and adaptive concurrency for bulk operations. This page explains how to tune that behavior and when to hand retry control back to your orchestrator.

For the exceptions the SDK raises when retries are exhausted, see Error Handling.


Defaults at a Glance

Out of the box — no configuration needed:

| What | Default behavior |
| --- | --- |
| Max retries (after initial attempt) | 3 for REST (4 total), 5 for gRPC (6 total) |
| Retryable HTTP status codes | 408, 429, 500, 502, 503, 504 |
| Retryable gRPC status codes | UNAVAILABLE, RESOURCE_EXHAUSTED, ABORTED |
| Backoff algorithm | Decorrelated jitter: a random walk bounded by the backoff_factor floor and the max_wait cap |
| Adaptive concurrency (bulk paths) | Self-tunes downward on throttling; max_concurrency is a ceiling, not a constant |


Configuring Retries

Pass a RetryConfig to the Pinecone constructor to customize retry behavior for all REST requests made by that client:

from pinecone import Pinecone, RetryConfig

pc = Pinecone(
    retry_config=RetryConfig(
        max_retries=5,
        backoff_factor=0.5,
        max_wait=60.0,
        retryable_status_codes=frozenset({429, 500, 503}),
    )
)

RetryConfig fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_retries | int | 3 | Number of retry attempts after the initial attempt. Total attempts = max_retries + 1. |
| backoff_factor | float | 0.25 | Minimum delay floor in seconds (lower bound of decorrelated jitter). See Jitter strategy for the full formula. |
| max_wait | float | 60.0 | Maximum delay cap in seconds. The jitter algorithm never waits longer than this between retries. |
| retryable_status_codes | frozenset[int] | {408, 429, 500, 502, 503, 504} | HTTP status codes that trigger a retry. The SDK retries on these codes and raises on all others. |

RetryConfig applies to REST only. The Rust-backed gRPC transport ignores RetryConfig and uses its own retry policy, with 5 retries by default. See Transport differences.
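To make the interaction between max_retries and retryable_status_codes concrete, here is a minimal sketch of a retry loop. It is not the SDK's implementation: send_with_retries, RetriesExhausted, and the fixed placeholder delay are hypothetical names and values chosen for illustration, and the real delay would come from the jitter algorithm described under Jitter Strategy.

```python
from typing import Callable, FrozenSet


class RetriesExhausted(Exception):
    """Hypothetical stand-in for the error raised when retries run out."""


def send_with_retries(
    send: Callable[[], int],
    max_retries: int = 3,
    retryable_status_codes: FrozenSet[int] = frozenset({408, 429, 500, 502, 503, 504}),
    sleep: Callable[[float], None] = lambda seconds: None,
) -> int:
    """Call send() at most max_retries + 1 times.

    Returns the first status code that is not in retryable_status_codes.
    Raises RetriesExhausted if every attempt returned a retryable code.
    """
    status = -1
    for attempt in range(max_retries + 1):
        status = send()
        if status not in retryable_status_codes:
            return status  # success or a non-retryable error: stop here
        if attempt < max_retries:
            sleep(0.25)  # placeholder; a real client would use a jittered delay
    raise RetriesExhausted(
        f"gave up after {max_retries + 1} attempts (last status {status})"
    )


# Two throttled responses followed by a success: three calls in total.
responses = iter([503, 503, 200])
print(send_with_retries(lambda: next(responses)))  # prints 200
```

With max_retries=0 the loop body runs exactly once, which matches the "one attempt, then raise" behavior described under Disabling retries.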

Disabling retries

To disable retries entirely, set max_retries=0:

pc = Pinecone(retry_config=RetryConfig(max_retries=0))

With max_retries=0, the SDK makes exactly one attempt and raises immediately on any error.

Handling rate limits without retrying

By default, 429 responses are retried automatically. To receive RateLimitError immediately instead (for example, so your orchestrator can handle the retry), exclude 429 from the retryable set:

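import time

from pinecone import Pinecone, RetryConfig
from pinecone.errors import RateLimitError

pc = Pinecone(
    retry_config=RetryConfig(
        retryable_status_codes=frozenset({408, 500, 502, 503, 504}),  # no 429
    )
)
index = pc.index(host="product-search-abc123.svc.pinecone.io")

try:
    index.upsert(vectors=[...])
except RateLimitError:
    # Simplest possible handling: wait, then try once more.
    # An orchestrator would normally reschedule the task instead.
    time.sleep(30.0)
    index.upsert(vectors=[...])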

Migration note: backoff_factor semantic change (v8 → v9)

In v8 and earlier, backoff_factor was an exponential multiplier. In v9, it is the minimum delay floor in seconds, that is, the lower bound of the decorrelated jitter window. The default also changed from 2.0 to 0.25. If you pinned backoff_factor=2.0 in v8, the v9 value that produces a similar mean first-retry delay is backoff_factor=0.5. To approximate the delays of the old default instead (roughly 4× longer than v9), pass backoff_factor=2.0 explicitly. Most users should leave it unset and use the v9 default.


Jitter Strategy

Jitter spreads retries across time so that concurrent clients with the same retry budget don’t collide on the server at the same moment.

Decorrelated jitter (backoff path)

When no server hint is present, the SDK uses decorrelated jitter:

delay = uniform(backoff_factor, max(backoff_factor, prev_delay * 3))
delay = min(delay, max_wait)

Starting from prev_delay = backoff_factor, each retry delay is drawn uniformly from [backoff_factor, prev_delay × 3], capped at max_wait. Because the next window’s upper bound grows with the previous delay, the sequence performs a random walk that diverges naturally without a hard exponential schedule — neighboring clients are unlikely to pick the same delay even when they start at the same time.

Concrete example with defaults (backoff_factor=0.25, max_wait=60.0):

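| Attempt | Window (seconds) | Typical delay |
| --- | --- | --- |
| 1st retry | [0.25, 0.75] | ~0.5 s |
| 2nd retry | [0.25, ~1.5] | ~0.9 s |
| 3rd retry | [0.25, ~2.6] | ~1.4 s |

Each window assumes the previous retry landed on its typical (midpoint) delay. A draw near the top of a window widens the next one, because the next upper bound is 3× that draw.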
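The formula above can be simulated in a few lines. This is a sketch of the documented formula rather than the SDK's own code: decorrelated_jitter_delays is a hypothetical helper, and feeding the capped delay back in as prev_delay is an assumption, since the formula does not say which value is carried forward.

```python
import random


def decorrelated_jitter_delays(retries, backoff_factor=0.25, max_wait=60.0, rng=None):
    """Return a list of retry delays drawn with decorrelated jitter."""
    rng = rng or random.Random()
    delays = []
    prev_delay = backoff_factor  # the walk starts at the floor
    for _ in range(retries):
        upper = max(backoff_factor, prev_delay * 3)
        delay = min(rng.uniform(backoff_factor, upper), max_wait)
        delays.append(delay)
        prev_delay = delay  # assumption: the capped value feeds the next window
    return delays


# Each run prints a different sequence; the first value is always in [0.25, 0.75].
print([round(d, 3) for d in decorrelated_jitter_delays(3)])
```

Running this a few times shows why two clients that fail at the same moment rarely retry at the same moment: their sequences separate after the first draw.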


Adaptive Concurrency for Bulk Operations

When you run bulk upserts or other parallel operations, the SDK observes throttling signals and automatically reduces the number of concurrent in-flight requests. When throttling subsides, concurrency recovers.

How it works

Each Pinecone client maintains a per-host concurrency limiter. On every retryable response (429, 503, or the equivalent gRPC code), the limiter halves the effective concurrency limit for that host, never going below 1. After a streak of consecutive successful requests, it recovers by one slot. The algorithm is AIMD (Additive Increase, Multiplicative Decrease), the same control loop used by TCP congestion control.

You don’t configure this directly. The max_concurrency parameter you pass to upsert() is a ceiling — the SDK self-tunes between 1 and that ceiling based on what the server can absorb.
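The AIMD loop described above can be sketched as a small class. This is an illustration, not the SDK's limiter: AimdLimiter and its method names are hypothetical, the streak length of 10 successes is an assumed value (the text only says "a streak"), and starting at the ceiling is also an assumption.

```python
class AimdLimiter:
    """Illustrative AIMD concurrency limit bounded by [1, ceiling]."""

    def __init__(self, ceiling, successes_per_increase=10):
        self.ceiling = ceiling
        self.limit = ceiling  # assumption: start at the ceiling
        self.successes_per_increase = successes_per_increase  # assumed streak length
        self._streak = 0

    def on_throttle(self):
        """Multiplicative decrease: halve the limit, but never go below 1."""
        self.limit = max(1, self.limit // 2)
        self._streak = 0

    def on_success(self):
        """Additive increase: one extra slot after a full streak of successes."""
        self._streak += 1
        if self._streak >= self.successes_per_increase:
            self._streak = 0
            self.limit = min(self.ceiling, self.limit + 1)


limiter = AimdLimiter(ceiling=8)
limiter.on_throttle()  # 8 -> 4
limiter.on_throttle()  # 4 -> 2
for _ in range(10):
    limiter.on_success()  # one full streak: 2 -> 3
print(limiter.limit)  # prints 3
```

Halving on throttling and recovering one slot at a time means the limit drops quickly under pressure and climbs back cautiously, which is the usual reason for choosing AIMD.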

Example

from pinecone import Pinecone

pc = Pinecone()
index = pc.index(host="product-search-abc123.svc.pinecone.io")

# max_concurrency=8 is the ceiling.
# If the index throttles during the run, the SDK will automatically
# reduce effective concurrency (e.g. to 4, then 2) and recover as
# throttling subsides. No code changes required.
response = index.upsert(
    vectors=large_list,
    batch_size=200,
    max_concurrency=8,
)
print(response.upserted_count)

Limiter scope

One limiter per index host per Pinecone client. If you create two Pinecone clients and both target the same index, they each maintain an independent limiter — there is no cross-client coordination (see Multi-process and serverless workloads).


Transport Differences

Retry behavior is intended to be the same across REST and gRPC. The remaining differences are small:

| Aspect | REST (Index, AsyncIndex) | gRPC (GrpcIndex) |
| --- | --- | --- |
| Default max_retries | 3 (4 total attempts) | 5 (6 total attempts) |
| Configured via | RetryConfig passed to Pinecone() | Not configurable via RetryConfig; max_retries can be passed when constructing GrpcIndex directly |
| Retryable codes | {408, 429, 500, 502, 503, 504} | UNAVAILABLE, RESOURCE_EXHAUSTED, ABORTED |
| Jitter algorithm | Decorrelated jitter (Python) | Decorrelated jitter (Rust) |
| Async support | Yes (AsyncIndex) | No; the gRPC transport is sync-only |
| Adaptive concurrency | Yes (REST + gRPC share the same per-host limiter registry) | Yes |

gRPC retry is not configurable via RetryConfig. If you need to tune gRPC retry behavior, construct GrpcIndex directly (rather than through Pinecone.index(grpc=True)) and pass max_retries explicitly.


Multi-Process and Serverless Workloads

What the SDK cannot do

The SDK’s retry and adaptive concurrency machinery is per-process. If your workload fans out across multiple Lambda invocations, Cloud Run instances, or Kubernetes pods, each process runs its own independent retry loop. There is no shared state, no cross-process coordination, and no distributed rate-limit awareness.

This means:

  • N simultaneously throttled invocations each independently back off and retry. Without coordination, they can collide again at the end of the retry window.

  • The adaptive concurrency limiter starts from scratch for each new process instance (e.g. a fresh Lambda cold start). It cannot inherit a reduced limit that another invocation learned from throttling.

Within a single process, the per-client adaptive-limiter registry is capped at 256 hosts with LRU eviction. Long-running services that rotate through more than 256 distinct hosts will see the adaptive state of infrequently used hosts reset on next use, which is harmless.
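The 256-host cap with LRU eviction mentioned in this section can be pictured with a small registry sketch. LimiterRegistry and its methods are hypothetical names, and storing a bare integer limit per host is a simplification; the point is only to show why an evicted host comes back with fresh state.

```python
from collections import OrderedDict


class LimiterRegistry:
    """Illustrative per-host registry with least-recently-used eviction."""

    def __init__(self, max_hosts=256):
        self.max_hosts = max_hosts
        self._limits = OrderedDict()  # host -> current concurrency limit

    def _evict_if_full(self):
        while len(self._limits) > self.max_hosts:
            self._limits.popitem(last=False)  # drop the least recently used host

    def get(self, host, ceiling):
        """Return the learned limit for host, or a fresh one at the ceiling."""
        if host in self._limits:
            self._limits.move_to_end(host)  # mark as recently used
        else:
            self._limits[host] = ceiling  # unknown or evicted host: fresh state
            self._evict_if_full()
        return self._limits[host]

    def set(self, host, limit):
        """Record a new limit for host and mark it as recently used."""
        self._limits[host] = limit
        self._limits.move_to_end(host)
        self._evict_if_full()


registry = LimiterRegistry(max_hosts=2)
registry.set("a", 4)             # host "a" was throttled down to 4
registry.get("b", ceiling=8)
registry.get("c", ceiling=8)     # third host: "a" is evicted
print(registry.get("a", ceiling=8))  # prints 8, the learned limit of 4 is gone
```

Losing the learned limit is harmless in the sense that the host simply starts again from its ceiling and re-learns under throttling.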

Why jitter still helps across processes

Even without coordination, the SDK’s decorrelated jitter provides statistical relief. If N independent Lambda invocations are all throttled at once, they don’t all retry at the same instant — each draws its own delay, spreading the retries across a window. The larger N is, the more this matters.
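A quick simulation gives a feel for the size of this effect. It is a rough sketch under stated assumptions: first_retry_times and peak_retries_per_bucket are hypothetical helpers, every client is assumed to be throttled at the same instant and to use the default backoff_factor of 0.25, and a single seeded generator stands in for the independent generators of separate processes.

```python
import random


def first_retry_times(n_clients, backoff_factor=0.25, seed=None):
    """Draw each client's first-retry delay from [backoff_factor, 3 * backoff_factor]."""
    rng = random.Random(seed)
    return sorted(
        rng.uniform(backoff_factor, backoff_factor * 3) for _ in range(n_clients)
    )


def peak_retries_per_bucket(times, bucket_seconds=0.05):
    """Count retries per time bucket and return the busiest bucket's count."""
    buckets = {}
    for t in times:
        key = int(t / bucket_seconds)
        buckets[key] = buckets.get(key, 0) + 1
    return max(buckets.values())


times = first_retry_times(n_clients=100, seed=42)
# Without jitter all 100 retries would arrive together; with jitter the
# busiest 50 ms slice sees only a fraction of them.
print(peak_retries_per_bucket(times))
```

The spread only covers the jitter window, so very large fan-outs can still overload a throttled index, which is why the summary below favors orchestrator-level retries for stateless workloads.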

Summary: when to trust the SDK vs. the orchestrator

| Scenario | Recommended approach |
| --- | --- |
| Single-process bulk upsert | Use defaults; the SDK handles everything |
| Long-running worker (persistent process) | Use defaults; the adaptive limiter learns and recovers |
| Lambda / Cloud Functions / Cloud Run (stateless) | max_retries=1, catch RateLimitError, re-raise for orchestrator retry |
| Fan-out across many pods (e.g. Kubernetes Job) | Same as stateless: set a low max_retries and rely on the orchestrator |
| Strict per-invocation SLA (must not block) | Set max_retries=0 and retryable_status_codes=frozenset() to raise immediately |


Observability

The SDK emits structured log records so you can diagnose retry storms and throttling pressure without adding instrumentation yourself.

Log namespaces

| Logger | Events |
| --- | --- |
| pinecone._internal.http_client | Throttled HTTP response received; retry delay computed |
| pinecone._internal.adaptive | AIMD concurrency limit transitions |

INFO messages

An INFO-level record is emitted the first time a given host rate-limits a client instance:

Rate limited by host=<host>. Adaptive concurrency will reduce in-flight requests.
See https://docs.pinecone.io/python/retries for details.

This fires once per host per Pinecone / AsyncPinecone object, so it surfaces in your logs without flooding them on repeated throttling.

DEBUG messages

Enable DEBUG-level logging on the two namespaces above to see granular retry events:

import logging
logging.getLogger("pinecone._internal.http_client").setLevel(logging.DEBUG)
logging.getLogger("pinecone._internal.adaptive").setLevel(logging.DEBUG)

Throttle record (emitted once per retry attempt that receives a retryable response):

Throttled response: status=429 host=my-index.svc.pinecone.io attempt=1/4 delay=0.531s

Fields: status (HTTP status code), host, attempt (N of total attempts), delay (computed wait in seconds).

AIMD limit decrease (emitted when the adaptive limiter reduces concurrency):

AIMD limiter decreased: before=8 after=4 ceiling=8

AIMD limit increase (emitted when the limiter recovers a concurrency slot):

AIMD limiter increased: now=5 ceiling=8

Increase records only fire on actual transitions — not on every successful request — so the volume is proportional to recovery events, not request throughput.
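If you want a running count of throttling per host rather than raw log lines, one option is a small logging handler attached to the HTTP client logger. The sketch below is an assumption-laden example: ThrottleCounter is a hypothetical class, and the regular expression is derived only from the sample throttle record shown above, so it would need adjusting if the message format differs in your SDK version.

```python
import logging
import re
from collections import Counter

# Pattern inferred from the sample record:
# "Throttled response: status=429 host=... attempt=1/4 delay=0.531s"
THROTTLE_RE = re.compile(
    r"Throttled response: status=(?P<status>\d+) host=(?P<host>\S+) "
    r"attempt=(?P<attempt>\d+)/(?P<total>\d+) delay=(?P<delay>[\d.]+)s"
)


class ThrottleCounter(logging.Handler):
    """Count throttle records per host as they pass through the logger."""

    def __init__(self):
        super().__init__(level=logging.DEBUG)
        self.by_host = Counter()

    def emit(self, record):
        match = THROTTLE_RE.search(record.getMessage())
        if match:
            self.by_host[match["host"]] += 1


counter = ThrottleCounter()
http_logger = logging.getLogger("pinecone._internal.http_client")
http_logger.setLevel(logging.DEBUG)  # throttle records are DEBUG-level
http_logger.addHandler(counter)
```

After a bulk run, counter.by_host.most_common() lists the hosts that throttled most often, which can point to the index that needs a lower max_concurrency.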


See Also

  • Error Handling — Exception hierarchy and how to catch specific errors

  • Performance — Bulk upsert patterns, max_concurrency tuning, and transport selection

  • Sync vs Async Clients — When to use the async client and how to manage concurrency with asyncio