# Performance

The SDK is designed for low overhead. This page describes the key design choices and the patterns that keep your application fast.

## Transport

The SDK ships three client variants, with two different application-protocol stacks underneath:

- `Index` and `AsyncIndex` (REST): [httpx](https://www.python-httpx.org/) over HTTP/1.1 with connection keepalive.
- `GrpcIndex`: a native (Rust-backed) gRPC channel over HTTP/2 with binary protobuf framing.

This protocol gap is part of why gRPC has a measurable throughput edge on bulk upsert workloads — see [When to Use gRPC](#when-to-use-grpc). For the REST clients, parallel batched upsert (below) is what you reach for to drive concurrency.

## Connection Pooling

Both `Pinecone` and `Index` maintain a persistent `httpx.Client` (or `httpx.AsyncClient` for the async variants). Creating a new client for every request wastes time on TLS handshakes and connection setup. **Reuse the same `Index` instance** across calls rather than constructing a new one each time:

```python
# Good — one client, many calls
from pinecone import Pinecone

pc = Pinecone()
desc = pc.indexes.describe("product-search")
index = pc.index(host=desc.host)
for batch in batches:
    index.upsert(vectors=batch)

# Bad — a new HTTP client for every upsert
for batch in batches:
    index = pc.index(host=desc.host)  # new client every time
    index.upsert(vectors=batch)
```

Use the context manager protocol to ensure connections are released when you are done:

```python
with pc.index(host=desc.host) as index:
    index.upsert(vectors=large_batch)
```

## Fast Serialization with msgspec and orjson

Response models are `msgspec.Struct` instances. `msgspec` uses zero-copy deserialization and avoids the Python object allocation overhead that Pydantic-based models incur. Request bodies are serialized with `orjson`, which is typically 5–10× faster than the standard library `json` module. These libraries are always active — no configuration is needed.

## Cold Import Cost

The SDK uses lazy imports to keep its cold-start time under 10 ms. Top-level SDK symbols (`Pinecone`, `AsyncPinecone`, etc.) are available as soon as you import `pinecone`, but heavy modules — the gRPC channel, pandas (for `upsert_from_dataframe`), tqdm (for progress bars) — are only loaded when you actually use them.

If your application is latency-sensitive at startup, avoid importing `pinecone` in module-level code that runs before it is needed:

```python
# Fine — deferred to first use
def get_index() -> "Index":  # quoted annotation: `Index` is not imported at module level
    from pinecone import Pinecone

    pc = Pinecone()
    return pc.index(host="...")
```

## Batching Large Upserts

For datasets larger than a single request payload, pass `batch_size` to `Index.upsert()`. The SDK splits the input into batches and sends them in parallel — sync via a cached `ThreadPoolExecutor`, async via an `asyncio.Semaphore`. HTTP-level retries happen automatically per batch.

```python
response = index.upsert(
    vectors=large_list,   # any length
    batch_size=100,       # vectors per request
    max_concurrency=4,    # parallel in-flight requests (default 4, range 1–64)
)
print(response.upserted_count)     # successful items
print(response.failed_item_count)  # 0 if everything succeeded
```

The same kwargs are accepted on `AsyncIndex.upsert()` and `Index.upsert_from_dataframe()`. `Index.upsert_records()` does **not** accept `batch_size` or `max_concurrency` — it sends a single NDJSON request per call, so chunk the record list yourself and call `upsert_records()` once per chunk, as in the sketch below.
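For example, a minimal chunking sketch; the `namespace=` and `records=` keyword names are assumptions about the `upsert_records()` signature (check the API reference), and the chunk size of 100 is illustrative:

```python
# Illustrative only: split the record list yourself and send one
# upsert_records() call per chunk.
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

with pc.index(host=desc.host) as index:
    for chunk in chunked(records, size=100):
        index.upsert_records(namespace="products", records=chunk)
```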
When `batch_size` is set, `upsert()` returns an `UpsertResponse` with partial-failure information instead of raising on the first failed batch — see [Handling partial failures](../how-to/vectors/upsert-and-query.md#handling-partial-failures).

### How much faster is parallel batching?

Measured on 10k vectors / 1536-d / batch=100 against an aws-us-east-1 serverless index ([Methodology](#methodology)) — wall time, p50:

| Client | `max_concurrency` | REST sync | REST async | gRPC |
|---|---:|---:|---:|---:|
| v8 | sequential (baseline) | 112 s | 67 s | 34 s |
| v9 | 1 | 31.5 s | 32.7 s | 35.0 s |
| v9 | 4 (default) | **9.6 s** | **10.2 s** | **10.0 s** |
| v9 | 8 | 5.7 s | 5.9 s | 5.7 s |
| v9 | 16 | 5.0 s | 6.6 s | 4.0 s |
| v9 | 32 | 4.4 s | 5.0 s | 2.7 s |

The v8 row is the published `pinecone==8.x` client running its sequential `batch_size=` loop; the v9 rows are this client using native parallel batched upsert. The headline win for the typical caller — v8 REST sync sequential vs v9 REST sync at the default `max_concurrency=4` — is ~12×. Async REST shows a similar shape with a smaller multiplier because v8 async sequential was already faster than v8 sync sequential. gRPC is faster than REST at high concurrency — see [When to Use gRPC](#when-to-use-grpc).

The `max_concurrency=1` row isn't a setting you'd reach for in practice — at `c=1` you've opted out of the main reason to pass `batch_size=` in the first place — but it's a useful diagnostic. It isolates how much of the v8 → v9 speedup comes from non-parallelism improvements in the client (request building, serialization, response decoding, retry layer) versus the explicit fan-out parallel batching adds on top. For REST sync, ~3.6× of the 11.7× default-settings win comes from those raw client improvements alone; the remaining ~3.3× is parallelism. For gRPC, almost the entire win comes from parallelism — v8 gRPC was already efficient at the request level.

### Tuning `max_concurrency`

The default of `4` is calibrated to capture ~70% of the achievable speedup with modest pressure on the cluster — safe to use without tuning. Push higher only when you have a reason and can measure the result on your workload:

| `max_concurrency` | When to use it |
|---:|---|
| `1` | Strict per-second quota, or you want sequential semantics for ordering |
| `4` *(default)* | General use; ~70% of the win, no tuning required |
| `8` | Large bulk loads on a well-provisioned index — typically the sweet spot |
| `16–32` | Diminishing returns; the cluster (not the SDK) is usually the bottleneck above ~16 |
| `>32` | Rarely worth it for a single client; consider sharding the work across multiple clients instead |

Throughput saturates around `c≈16` for most workloads because cluster-side ingress capacity becomes the bottleneck, not the SDK. If you do need to push past that ceiling, run multiple `Index` instances from separate processes rather than raising `max_concurrency` further on one client (see the sketch below).

For multi-million-vector loads from cloud storage, prefer `index.start_import()` over batched upsert — it avoids per-batch HTTP overhead entirely.
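A minimal sketch of the process-sharding pattern mentioned above. `load_vectors()` is a hypothetical stand-in for however you produce your vectors, and the shard and concurrency counts are illustrative, not recommendations:

```python
# Shard one large upsert across worker processes; each process owns its
# own client and connection pool.
from multiprocessing import Pool

from pinecone import Pinecone

INDEX_HOST = "your-index-host"  # placeholder


def upsert_shard(shard):
    pc = Pinecone()
    with pc.index(host=INDEX_HOST) as index:
        response = index.upsert(vectors=shard, batch_size=100, max_concurrency=16)
        return response.upserted_count


if __name__ == "__main__":
    all_vectors = load_vectors()  # hypothetical loader for your data
    n_shards = 4
    shards = [all_vectors[i::n_shards] for i in range(n_shards)]  # round-robin split
    with Pool(processes=n_shards) as pool:
        print(sum(pool.map(upsert_shard, shards)), "vectors upserted")
```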
## Query Latency

Queries don't benefit from parallel batching the way bulk upserts do — each query is a single round trip — but the v9 client decodes responses substantially faster than v8 on REST for any query that returns more than a trivial payload. The wins come from `msgspec.Struct` response models and `orjson` for JSON decoding (see [Fast Serialization](#fast-serialization-with-msgspec-and-orjson)), neither of which the v8 client uses.

Measured on the same 1536-d serverless index ([Methodology](#methodology)), median latency:

| Client | Scenario | REST sync | REST async | gRPC |
|---|---|---:|---:|---:|
| v8 | `query_k10` | 35.8 ms | 34.3 ms | 32.5 ms |
| v9 | `query_k10` | 33.4 ms | 35.2 ms | 31.1 ms |
| v8 | `query_k100` + values + metadata | 800 ms | 708 ms | 133 ms |
| v9 | `query_k100` + values + metadata | **279 ms** | **260 ms** | 120 ms |
| v8 | `query_k1000` + values | 7.01 s | 7.12 s | 534 ms |
| v9 | `query_k1000` + values | **2.18 s** | **2.16 s** | 493 ms |

Two patterns stand out:

- **The REST win scales with response size.** Small `top_k=10` queries are near-parity (~1.05×); `top_k=100` with values + metadata is ~2.8× sync / ~2.7× async; `top_k=1000` with values is ~3.2× sync / ~3.3× async. The bottleneck on v8 REST queries was decoding large JSON payloads — exactly the failure mode `msgspec` + `orjson` were chosen to fix.
- **gRPC is at parity throughout** (~1.05–1.13× across scenarios). gRPC responses are protobuf, so they bypass the JSON decoding path entirely; there's no msgspec/orjson dividend to collect.

If you're already on gRPC for queries, upgrading doesn't change much on the query side. If you're on REST and run heavy queries, the upgrade is a substantial win.

Filter complexity (`eq`, `in_50`, `nested`) on `top_k=100` adds modest extra wins on REST (~1.15–1.22×) and stays at parity on gRPC. Filter overhead is small relative to network and decoding.

## Async Concurrency

Pick the async client (`AsyncPinecone` / `AsyncIndex`) when your code is already inside an `async def` — most often because you're conforming to the interface of an async web framework like **FastAPI**, **Starlette**, or **Litestar**, where request handlers are coroutines (a minimal handler sketch closes this section). In that setting, calling a blocking sync method either stalls the event loop (degrading throughput for every concurrent request) or forces you to offload to a thread; the async client lets you `await` Pinecone calls inline without either workaround. The async client is also natural when you want concurrent reads and writes that should overlap — multiple queries in flight, or a query running while an upsert finishes — though sync code can achieve the same with threads.

For **pure bulk upsert**, prefer native batched upsert over a hand-rolled `asyncio.gather` — same parallelism, less code, automatic retries, and partial-failure reporting:

```python
# Preferred
async with pc.index(host=desc.host) as index:
    response = await index.upsert(
        vectors=large_list,
        batch_size=100,
        max_concurrency=8,
    )
```

For **mixed workloads** — concurrent upserts and queries, or query fan-out across many namespaces — `asyncio.gather` over `AsyncIndex` calls is still the natural pattern:

```python
import asyncio

async with pc.index(host=desc.host) as index:
    results = await asyncio.gather(
        index.upsert(vectors=writes_batch, batch_size=100),
        index.query(vector=q1, top_k=10),
        index.query(vector=q2, top_k=10),
    )
```

Sync vs async at high concurrency: with native batched upsert at `max_concurrency=32`, sync (~4.4 s) edges out async (~5.0 s) on the 10k-vector benchmark — the cached `ThreadPoolExecutor` is competitive with `asyncio.Semaphore` once cluster-side ingress dominates. Pick the client that matches your application style; throughput is similar at the saturation point.
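To make the framework case concrete, here is a hedged sketch of sharing one `AsyncIndex` across FastAPI requests. The lifespan wiring, the request shape, and the response handling (`response.matches`, `.id`) are illustrative assumptions rather than the SDK's documented schema:

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI
from pinecone import AsyncPinecone

pc = AsyncPinecone()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Open one AsyncIndex at startup and reuse it for every request.
    async with pc.index(host="YOUR_INDEX_HOST") as index:  # placeholder host
        app.state.index = index
        yield


app = FastAPI(lifespan=lifespan)


@app.post("/search")
async def search(vector: list[float]):
    # Awaiting keeps the event loop free to serve other requests concurrently.
    response = await app.state.index.query(vector=vector, top_k=10)
    return {"ids": [m.id for m in response.matches]}  # illustrative response shape
```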
## When to Use gRPC

`pinecone.grpc.GrpcIndex` accepts the same `batch_size=` and `max_concurrency=` kwargs as the REST `Index`, so the call site looks identical. The wire-level differences are HTTP/2 framing (vs HTTP/1.1 + keepalive on REST) and binary protobuf encoding (vs JSON). The gRPC channel ships with the package — no separate install step.

Reading off the [throughput table above](#how-much-faster-is-parallel-batching), a few things about gRPC stand out:

- **Even sequential, v8 gRPC was ~3× faster than v8 REST sync** (34 s vs 112 s). HTTP/2 multiplexing and protobuf encoding buy a lot before any parallelism enters the picture — and that gap is structural to the protocols, not something parallel batching alone closes.
- **At default settings, the three transports are essentially tied** (~10 s). For typical workloads, the choice is about API style, not throughput.
- **gRPC pulls ahead as concurrency rises** — at `max_concurrency=32`, gRPC finishes the same work 1.5–1.9× faster than REST.
- **`max_concurrency=1` doesn't help gRPC** — v8 gRPC was already pipelining requests over its HTTP/2 channel, so v9's gRPC win comes from explicit fan-out at higher concurrency, not from the new partial-success machinery.

Pick gRPC when:

- You're doing **sustained bulk upserts at `max_concurrency` ≥ 16** — gRPC finishes the same work 1.5–1.9× faster than REST at that concurrency.
- You want the **lowest absolute write latency floor** on a single client (~2.7 s for 10k vectors at `c=32` on the reference workload).

Stay on REST when:

- You're at default settings or low concurrency — there is no measurable throughput benefit at `max_concurrency` ≤ 8.
- You need async — `GrpcIndex` is sync-only; for async workloads use `AsyncIndex` over REST.

```python
from pinecone import Pinecone

pc = Pinecone()
with pc.index(name="product-search", grpc=True) as index:
    response = index.upsert(
        vectors=large_list,
        batch_size=100,
        max_concurrency=16,
    )
```

## Summary

| Technique | Where it helps |
|-----------|---------------|
| HTTP keepalive (REST) / HTTP/2 (gRPC) | Reused TCP connections, lower per-call setup cost |
| Reuse `Index` instance | Eliminate per-call TLS/connection overhead |
| msgspec structs | Response deserialization — faster than Pydantic |
| orjson | Request serialization — faster than stdlib `json` |
| Lazy imports | Reduce cold-start time |
| `Index.upsert(batch_size=…, max_concurrency=…)` | Bulk upsert — typical 10–25× over a sequential loop |
| `AsyncIndex` + `asyncio.gather()` | Mixed concurrent read/write workloads |
| `GrpcIndex` (sync only) | Sustained bulk upserts at `max_concurrency` ≥ 16 — ~1.5–1.9× over REST |
| `index.start_import()` | Multi-million-vector loads from cloud storage |

## Methodology

The numbers in this guide come from a controlled benchmark — 10,000 random 1536-dimensional vectors, `batch_size=100`, single client, fresh namespace per run, against an aws-us-east-1 serverless index. The client ran on a GCP `n2-standard-2` VM (2 vCPU, 8 GB) in `us-central1-a` running Ubuntu 24.04, so every request crosses GCP → AWS — RTT and inter-cloud bandwidth are real factors in the absolute numbers.

The "v8 sequential" rows use `pinecone==8.1.2` from PyPI (sequential `batch_size=` loop, fail-fast on first batch error). The `max_concurrency=N` rows use this version of the SDK with native parallel batched upsert. Iteration counts vary by scenario.
Batched-upsert cells use n=3 measured iterations after 1 warmup — each iteration writes 10k vectors, so increasing n trades wall time for precision. That table is best read as a directional guide: the large speedup factors (≥3×) are well above run-to-run noise, but small differences between adjacent rows in the same column should not be over-interpreted. Query cells use n=25 (n=10 for `query_k1000_values`); the query numbers are statistically firm. We plan to re-run the batched-upsert sweep at higher iteration counts too; this page will be refreshed at that time.

Your numbers will vary with client region, RTT, vector dimension, batch size, payload metadata, and concurrent traffic from other clients. When in doubt, measure on your own workload.
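A minimal timing sketch for doing that, loosely following the methodology above (one warmup, then a few measured runs). `make_vectors()` is a hypothetical stand-in for your own data loader; the host, counts, and concurrency are placeholders, and the fresh-namespace-per-run step from the benchmark is omitted for brevity:

```python
import statistics
import time

from pinecone import Pinecone

pc = Pinecone()
index = pc.index(host="YOUR_INDEX_HOST")  # placeholder
vectors = make_vectors(count=10_000, dim=1536)  # hypothetical: use your own data


def timed_upsert() -> float:
    start = time.perf_counter()
    index.upsert(vectors=vectors, batch_size=100, max_concurrency=4)
    return time.perf_counter() - start


timed_upsert()  # warmup: DNS, TLS handshake, connection setup
runs = [timed_upsert() for _ in range(3)]
print(f"p50 wall time: {statistics.median(runs):.1f} s")
```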