# Retries and Resilience

The SDK retries failed requests automatically using multiple layers of resilience: per-request backoff with decorrelated jitter and adaptive concurrency for bulk operations. This page covers how to tune that behavior and how to decide when to hand retry control back to your orchestrator.

For the exceptions the SDK raises when retries are exhausted, see {doc}`/guides/error-handling`.

---

## Defaults at a Glance

Out of the box, with no configuration:

| What | Default behavior |
|------|------------------|
| Max retries (after initial attempt) | 3 for REST (4 total), 5 for gRPC (6 total) |
| Retryable HTTP status codes | 408, 429, 500, 502, 503, 504 |
| Retryable gRPC status codes | UNAVAILABLE, RESOURCE\_EXHAUSTED, ABORTED |
| Backoff algorithm | Decorrelated jitter: a random walk bounded by the `backoff_factor` floor and the `max_wait` cap |
| Adaptive concurrency (bulk paths) | Self-tunes downward on throttling; `max_concurrency` is a ceiling, not a constant |

---

## Configuring Retries

Pass a `RetryConfig` to the `Pinecone` constructor to customize retry behavior for all REST requests made by that client:

```python
from pinecone import Pinecone, RetryConfig

pc = Pinecone(
    retry_config=RetryConfig(
        max_retries=5,
        backoff_factor=0.5,
        max_wait=60.0,
        retryable_status_codes=frozenset({429, 500, 503}),
    )
)
```

### `RetryConfig` fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `max_retries` | `int` | `3` | Number of retry attempts *after* the initial attempt. Total attempts = `max_retries + 1`. |
| `backoff_factor` | `float` | `0.25` | Minimum delay floor in seconds (lower bound of decorrelated jitter). See [Jitter strategy](#jitter-strategy) for the full formula. |
| `max_wait` | `float` | `60.0` | Maximum delay cap in seconds. The jitter algorithm never waits longer than this between retries. |
| `retryable_status_codes` | `frozenset[int]` | `{408, 429, 500, 502, 503, 504}` | HTTP status codes that trigger a retry. The SDK retries on these codes and raises on all others. |

**`RetryConfig` applies to REST only.** The gRPC transport (Rust-backed) uses its own retry policy with 5 retries by default. See [Transport differences](#transport-differences).

### Disabling retries

To disable retries entirely, set `max_retries=0`:

```python
pc = Pinecone(retry_config=RetryConfig(max_retries=0))
```

With `max_retries=0`, the SDK makes exactly one attempt and raises immediately on any error.

### Handling rate limits without retrying

By default, 429 responses are retried automatically. To receive `RateLimitError` immediately instead (for example, so your orchestrator can handle the retry), exclude 429 from the retryable set:

```python
import time

from pinecone import Pinecone, RetryConfig
from pinecone.errors import RateLimitError

pc = Pinecone(
    retry_config=RetryConfig(
        retryable_status_codes=frozenset({408, 500, 502, 503, 504}),  # no 429
    )
)
index = pc.index(host="product-search-abc123.svc.pinecone.io")

try:
    index.upsert(vectors=[...])
except RateLimitError:
    # 429 is no longer retried by the SDK, so the caller decides how long to wait.
    time.sleep(30.0)
    index.upsert(vectors=[...])
```

### Migration note: `backoff_factor` semantic change (v8 → v9)

In v8 and earlier, `backoff_factor` was an exponential multiplier. In v9, it became the **minimum delay floor in seconds**, that is, the lower bound of the decorrelated jitter window. The default also changed from `2.0` to `0.25`.

If you pinned `backoff_factor=2.0` in v8, the new equivalent that produces a similar mean first-retry delay is `backoff_factor=0.5`; if you want to restore the old default behavior (which caused ~4× longer delays than v9), pass `backoff_factor=2.0` explicitly. Most users should leave it unset and take the v9 default.

---

## Jitter Strategy

Jitter spreads retries across time so that concurrent clients with the same retry budget don't collide on the server at the same moment.
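
To make the collision problem concrete, here is a minimal sketch that compares 100 clients retrying after the same fixed delay with 100 clients that each draw their own delay. It uses plain uniform jitter for illustration, not the SDK's own algorithm, and every name in it (`fixed_delays`, `jittered_delays`, `peak_load`) is hypothetical:

```python
import random
from collections import Counter
from typing import List


def fixed_delays(num_clients: int, delay: float) -> List[float]:
    """Every client waits the same fixed delay before retrying."""
    return [delay] * num_clients


def jittered_delays(
    num_clients: int, low: float, high: float, rng: random.Random
) -> List[float]:
    """Every client draws its own delay uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(num_clients)]


def peak_load(delays: List[float], bucket_seconds: float = 0.05) -> int:
    """Largest number of retries that land in the same time bucket."""
    buckets = Counter(int(d / bucket_seconds) for d in delays)
    return max(buckets.values())


rng = random.Random(42)  # seeded so the run is repeatable
fixed_peak = peak_load(fixed_delays(100, 0.5))
jittered_peak = peak_load(jittered_delays(100, 0.25, 0.75, rng))

print(f"fixed delay: up to {fixed_peak} retries in one 50 ms bucket")
print(f"jittered:    up to {jittered_peak} retries in one 50 ms bucket")
```

With a fixed delay, all 100 retries land in the same bucket. With jitter, the same retries are spread over the whole window, so the server sees a much lower peak.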

### Decorrelated jitter (backoff path)

When no server hint is present, the SDK uses decorrelated jitter:

```
delay = uniform(backoff_factor, max(backoff_factor, prev_delay * 3))
delay = min(delay, max_wait)
```

Starting from `prev_delay = backoff_factor`, each retry delay is drawn uniformly from `[backoff_factor, prev_delay × 3]`, capped at `max_wait`. Because the next window's upper bound grows with the previous delay, the sequence performs a random walk that diverges naturally without a hard exponential schedule, so neighboring clients are unlikely to pick the same delay even when they start at the same time.

**Concrete example with defaults** (`backoff_factor=0.25`, `max_wait=60.0`). Every window after the first depends on the previous draw; the rows below assume each previous delay landed at the midpoint of its window:

| Attempt | Window (seconds) | Typical delay |
|---------|-----------------|---------------|
| 1st retry | [0.25, 0.75] | ~0.5 s |
| 2nd retry | [0.25, ~1.5] | ~0.9 s |
| 3rd retry | [0.25, ~2.6] | ~1.4 s |

---

## Adaptive Concurrency for Bulk Operations

When you run bulk upserts or other parallel operations, the SDK observes throttling signals and automatically reduces the number of concurrent in-flight requests. When throttling subsides, concurrency recovers.

### How it works

Each `Pinecone` client maintains a per-host concurrency limiter. On every retryable response (429, 503, or equivalent gRPC code), the limiter halves the effective concurrency limit for that host. After a streak of consecutive successful requests, it recovers by one slot. The algorithm is AIMD (Additive Increase, Multiplicative Decrease), the same control loop used by TCP congestion control.

**You don't configure this directly.** The `max_concurrency` parameter you pass to `upsert()` is a *ceiling*: the SDK self-tunes between 1 and that ceiling based on what the server can absorb.

### Example

```python
from pinecone import Pinecone

pc = Pinecone()
index = pc.index(host="product-search-abc123.svc.pinecone.io")

# max_concurrency=8 is the ceiling.
# If the index throttles during the run, the SDK will automatically
# reduce effective concurrency (e.g. to 4, then 2) and recover as
# throttling subsides. No code changes required.
response = index.upsert(
    vectors=large_list,
    batch_size=200,
    max_concurrency=8,
)
print(response.upserted_count)
```

### Limiter scope

There is one limiter per index host per `Pinecone` client. If you create two `Pinecone` clients and both target the same index, each maintains an independent limiter; there is no cross-client coordination (see [Multi-process and serverless workloads](#multi-process-and-serverless-workloads)).

---

## Transport Differences

The retry design aims for parity across REST and gRPC. The remaining differences are small:

| Aspect | REST (`Index`, `AsyncIndex`) | gRPC (`GrpcIndex`) |
|--------|------------------------------|---------------------|
| Default `max_retries` | 3 (4 total attempts) | 5 (6 total attempts) |
| Configured via | `RetryConfig` passed to `Pinecone()` | Transport-level policy, not `RetryConfig` (see the note below this table) |
| Retryable codes | `{408, 429, 500, 502, 503, 504}` | UNAVAILABLE, RESOURCE\_EXHAUSTED, ABORTED |
| Jitter algorithm | Decorrelated jitter (Python) | Decorrelated jitter (Rust) |
| Async support | Yes (`AsyncIndex`) | No (gRPC transport is sync-only) |
| Adaptive concurrency | Yes (REST + gRPC share the same per-host limiter registry) | Yes |

**gRPC retry is not configurable via `RetryConfig`.** If you need to tune gRPC retry behavior, construct `GrpcIndex` directly (rather than through `Pinecone.index(grpc=True)`) and pass `max_retries` explicitly.

---

## Multi-Process and Serverless Workloads

### What the SDK cannot do

The SDK's retry and adaptive concurrency machinery is per-process. If your workload fans out across multiple Lambda invocations, Cloud Run instances, or Kubernetes pods, each process runs its own independent retry loop. There is no shared state, no cross-process coordination, and no distributed rate-limit awareness.

Because nothing is shared across processes:

- N simultaneously throttled invocations each back off and retry independently. Without coordination, they can collide again at the end of the retry window.
- The adaptive concurrency limiter starts from scratch for each new process instance (e.g. a fresh Lambda cold start). It cannot inherit a reduced limit that another invocation learned from throttling.

Within a single client, the adaptive-limiter registry is capped at 256 hosts with LRU eviction. Long-running services that rotate through more than 256 distinct hosts will see the adaptive state of infrequently used hosts reset on next use, which is harmless.

### Recommended pattern for fan-out workloads

Let your orchestrator handle retries at the job level, and keep the SDK's retry window narrow:

```python
from pinecone import Pinecone, RetryConfig
from pinecone.errors import RateLimitError

# Set max_retries=0 or 1: one attempt (or one fast retry), then raise.
# Let the SQS visibility timeout / Cloud Tasks retry / Step Functions catch
# handle the outer retry loop.
pc = Pinecone(retry_config=RetryConfig(max_retries=1))
index = pc.index(host="product-search-abc123.svc.pinecone.io")

try:
    response = index.upsert(vectors=batch, batch_size=100, max_concurrency=4)
except RateLimitError:
    # Re-raise so the orchestrator sees a task failure and schedules a retry
    # after the visibility timeout expires.
    raise
```

### Why jitter still helps across processes

Even without coordination, the SDK's decorrelated jitter provides statistical relief. If N independent Lambda invocations are all throttled at once, they don't all retry at the same instant: each draws its own delay, spreading the retries across a window. The larger N is, the more this matters.

### Summary: when to trust the SDK vs. the orchestrator

| Scenario | Recommended approach |
|----------|----------------------|
| Single-process bulk upsert | Use defaults; the SDK handles everything |
| Long-running worker (persistent process) | Use defaults; the adaptive limiter learns and recovers |
| Lambda / Cloud Functions / Cloud Run (stateless) | `max_retries=1`, catch `RateLimitError`, re-raise for orchestrator retry |
| Fan-out across many pods (e.g. Kubernetes Job) | Same as stateless: set a low `max_retries` and rely on the orchestrator |
| Strict per-invocation SLA (must not block) | `max_retries=0`, `retryable_status_codes=frozenset()` (raise immediately) |

---

## Observability

The SDK emits structured log records so you can diagnose retry storms and throttling pressure without adding instrumentation yourself.

### Log namespaces

| Logger | Events |
|--------|--------|
| `pinecone._internal.http_client` | Throttled HTTP response received; retry delay computed |
| `pinecone._internal.adaptive` | AIMD concurrency limit transitions |

### INFO messages

An INFO-level record is emitted the **first time** a given host rate-limits a client instance:

```
Rate limited by host=<host>. Adaptive concurrency will reduce in-flight requests. See https://docs.pinecone.io/python/retries for details.
```

This fires once per host per `Pinecone` / `AsyncPinecone` object, so it surfaces in your logs without flooding them on repeated throttling.

### DEBUG messages

Enable DEBUG-level logging on the two namespaces above to see granular retry events:

```python
import logging

logging.getLogger("pinecone._internal.http_client").setLevel(logging.DEBUG)
logging.getLogger("pinecone._internal.adaptive").setLevel(logging.DEBUG)
```

**Throttle record** (emitted once per retry attempt that receives a retryable response):

```
Throttled response: status=429 host=my-index.svc.pinecone.io attempt=1/4 delay=0.531s
```

Fields: `status` (HTTP status code), `host`, `attempt` (N of total attempts), `delay` (computed wait in seconds).
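
If you want to aggregate throttle records rather than read them by eye, one option is a small logging handler that parses the message. The sketch below assumes the message format matches the example record exactly; the regular expression is inferred from that single example and is not a documented contract, and `parse_throttle_record` and `ThrottleCounter` are hypothetical helper names, not part of the SDK:

```python
import logging
import re
from collections import Counter
from typing import Optional

# Pattern inferred from the example throttle record; treat it as an
# assumption, since the SDK may change the message format.
_THROTTLE_RE = re.compile(
    r"Throttled response: status=(?P<status>\d+) host=(?P<host>\S+) "
    r"attempt=(?P<attempt>\d+)/(?P<total>\d+) delay=(?P<delay>\d+(?:\.\d+)?)s"
)


def parse_throttle_record(message: str) -> Optional[dict]:
    """Return the record's fields, or None if the message does not match."""
    match = _THROTTLE_RE.search(message)
    if match is None:
        return None
    return {
        "status": int(match["status"]),
        "host": match["host"],
        "attempt": int(match["attempt"]),
        "total_attempts": int(match["total"]),
        "delay_seconds": float(match["delay"]),
    }


class ThrottleCounter(logging.Handler):
    """Hypothetical helper that counts throttle records per host."""

    def __init__(self) -> None:
        super().__init__(level=logging.DEBUG)
        self.per_host: Counter = Counter()

    def emit(self, record: logging.LogRecord) -> None:
        fields = parse_throttle_record(record.getMessage())
        if fields is not None:
            self.per_host[fields["host"]] += 1


# Attach the counter to the HTTP client logger named in the table above.
counter = ThrottleCounter()
logger = logging.getLogger("pinecone._internal.http_client")
logger.setLevel(logging.DEBUG)
logger.addHandler(counter)
```

After a bulk run, `counter.per_host` holds the number of throttle records seen for each host, which is a quick way to spot the index that is under the most pressure.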

**AIMD limit decrease** (emitted when the adaptive limiter reduces concurrency):

```
AIMD limiter decreased: before=8 after=4 ceiling=8
```

**AIMD limit increase** (emitted when the limiter recovers a concurrency slot):

```
AIMD limiter increased: now=5 ceiling=8
```

Increase records fire only on actual transitions, not on every successful request, so the volume is proportional to recovery events, not request throughput.

---

## See Also

- {doc}`/guides/error-handling`: Exception hierarchy and how to catch specific errors
- {doc}`/guides/performance`: Bulk upsert patterns, `max_concurrency` tuning, and transport selection
- {doc}`/guides/sync-vs-async`: When to use the async client and how to manage concurrency with `asyncio`