Performance

The SDK is designed for low overhead. This page describes the key design choices and the patterns that keep your application fast.

Transport

The SDK ships three client variants, with two different application-protocol stacks underneath:

  • Index and AsyncIndex (REST): httpx over HTTP/1.1 with connection keepalive.

  • GrpcIndex: a native (Rust-backed) gRPC channel over HTTP/2 with binary protobuf framing.

This protocol gap is part of why gRPC has a measurable throughput edge on bulk upsert workloads — see When to Use gRPC. For the REST clients, parallel batched upsert (below) is what you reach for to drive concurrency.

Connection Pooling

Both Pinecone and Index maintain a persistent httpx.Client (or httpx.AsyncClient for the async variants). Creating a new client for every request wastes time on TLS handshakes and connection setup.

Reuse the same Index instance across calls rather than constructing a new one each time:

# Good — one client, many calls
from pinecone import Pinecone

pc = Pinecone()
desc = pc.indexes.describe("product-search")
index = pc.index(host=desc.host)

for batch in batches:
    index.upsert(vectors=batch)

# Bad — a new HTTP client for every upsert
for batch in batches:
    index = pc.index(host=desc.host)  # new client every time
    index.upsert(vectors=batch)

Use the context manager protocol to ensure connections are released when you are done:

with pc.index(host=desc.host) as index:
    index.upsert(vectors=large_batch)

Fast Serialization with msgspec and orjson

Response models are msgspec.Struct instances. msgspec decodes JSON directly into typed structs, avoiding the per-field validation and object-allocation overhead that Pydantic-based models incur. Request bodies are serialized with orjson, which is typically 5–10× faster than the standard-library json module.

These libraries are always active — no configuration is needed.
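To see why decode speed matters, you can measure how decoding cost scales with response size. The sketch below uses the standard-library json module as a rough stand-in (the SDK itself uses orjson and msgspec, not stdlib json); fake_query_response is a made-up helper shaped loosely like a query response with values included:

```python
import json
import time

def fake_query_response(top_k: int, dim: int = 1536) -> str:
    # Shape loosely modeled on a query response that includes vector values.
    matches = [
        {"id": f"vec-{i}", "score": 0.5, "values": [0.1] * dim}
        for i in range(top_k)
    ]
    return json.dumps({"matches": matches, "namespace": ""})

for top_k in (10, 1000):
    payload = fake_query_response(top_k)
    start = time.perf_counter()
    decoded = json.loads(payload)
    elapsed = time.perf_counter() - start
    print(f"top_k={top_k}: {len(payload):>12,} bytes, decode {elapsed * 1000:.1f} ms")
```

The top_k=1000 payload is two orders of magnitude larger than the top_k=10 one, which is why decoder speed only shows up on heavy queries.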

Cold Import Cost

The SDK uses lazy imports to keep its cold-start time under 10 ms. Top-level SDK symbols (Pinecone, AsyncPinecone, etc.) are available as soon as you import pinecone, but heavy modules — the gRPC channel, pandas (for upsert_from_dataframe), tqdm (for progress bars) — are only loaded when you actually use them.

If your application is latency-sensitive at startup, avoid importing pinecone in module-level code that runs before it is needed:

# Fine — deferred to first use
def get_index() -> Index:
    from pinecone import Pinecone
    pc = Pinecone()
    return pc.index(host="...")

Batching Large Upserts

For datasets larger than a single request payload, pass batch_size to Index.upsert(). The SDK splits the input into batches and sends them in parallel — sync via a cached ThreadPoolExecutor, async via an asyncio.Semaphore. HTTP-level retries happen automatically per batch.

response = index.upsert(
    vectors=large_list,    # any length
    batch_size=100,        # vectors per request
    max_concurrency=4,     # parallel in-flight requests (default 4, range 1–64)
)
print(response.upserted_count)         # successful items
print(response.failed_item_count)      # 0 if everything succeeded

The same kwargs are accepted on AsyncIndex.upsert() and Index.upsert_from_dataframe(). Index.upsert_records() does not accept batch_size or max_concurrency — it sends a single NDJSON request per call, so chunk the record list yourself and call upsert_records() once per chunk.
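Since upsert_records() sends one request per call, the chunking is on you. A minimal generic helper (not part of the SDK):

```python
from typing import Iterator, Sequence

def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage sketch:
# for chunk in chunked(records, 96):
#     index.upsert_records(...)  # one NDJSON request per chunk
```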

When batch_size is set, upsert() returns an UpsertResponse with partial-failure information instead of raising on the first failed batch — see Handling partial failures.
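Conceptually, the sync fan-out looks like the sketch below: a thread pool sending batches concurrently while tallying partial failures rather than raising. This is an illustration of the pattern, not the SDK's actual implementation; send_batch is a stand-in for the per-batch HTTP request:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_upsert(vectors, send_batch, batch_size=100, max_concurrency=4):
    """Split `vectors` into batches and send them concurrently.

    `send_batch` stands in for the per-batch request; it returns the number
    of vectors written, or raises on failure.
    """
    batches = [vectors[i:i + batch_size] for i in range(0, len(vectors), batch_size)]
    upserted = failed = 0
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Submit every batch first, then collect results as they finish.
        for batch, future in [(b, pool.submit(send_batch, b)) for b in batches]:
            try:
                upserted += future.result()
            except Exception:
                failed += len(batch)  # record the failure instead of raising
    return upserted, failed

ok = lambda batch: len(batch)
print(parallel_upsert(list(range(250)), ok, batch_size=100))  # → (250, 0)
```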

How much faster is parallel batching?

Measured on 10k vectors / 1536-d / batch=100 against an aws-us-east-1 serverless index (Methodology) — wall time, p50:

Client  max_concurrency        REST sync  REST async  gRPC
------  ---------------        ---------  ----------  ----
v8      sequential (baseline)  112 s      67 s        34 s
v9      1                      31.5 s     32.7 s      35.0 s
v9      4 (default)            9.6 s      10.2 s      10.0 s
v9      8                      5.7 s      5.9 s       5.7 s
v9      16                     5.0 s      6.6 s       4.0 s
v9      32                     4.4 s      5.0 s       2.7 s

The v8 row is the published pinecone==8.x client running its sequential batch_size= loop; the v9 rows are this client using native parallel batched upsert. The headline win for the typical caller — v8 REST sync sequential vs v9 REST sync at the default max_concurrency=4 — is ~12×. Async REST shows a similar shape with a smaller multiplier because v8 async sequential was already faster than v8 sync sequential. gRPC is faster than REST at high concurrency — see When to Use gRPC.

The max_concurrency=1 row isn’t a setting you’d reach for in practice — at c=1 you’ve opted out of the main reason to pass batch_size= in the first place — but it’s a useful diagnostic. It isolates how much of the v8 → v9 speedup comes from non-parallelism improvements in the client (request building, serialization, response decoding, retry layer) versus the explicit fan-out parallel batching adds on top. For REST sync, ~3.6× of the 11.7× default-settings win comes from those raw client improvements alone; the remaining ~3.3× is parallelism. For gRPC, almost the entire win comes from parallelism — v8 gRPC was already efficient at the request level.

Tuning max_concurrency

The default of 4 is calibrated to capture ~70% of the achievable speedup with modest pressure on the cluster — safe to use without tuning. Push higher only when you have a reason and can measure the result on your workload:

max_concurrency  When to use it
---------------  --------------
1                Strict per-second quota, or you want sequential semantics for ordering
4 (default)      General use; ~70% of the win, no tuning required
8                Large bulk loads on a well-provisioned index — typically the sweet spot
16–32            Diminishing returns; the cluster (not the SDK) is usually the bottleneck above ~16
>32              Rarely worth it for a single client; consider sharding the work across multiple clients instead

Throughput saturates around c≈16 for most workloads because cluster-side ingress capacity becomes the bottleneck, not the SDK. If you do need to push past that ceiling, run multiple Index instances from separate processes rather than raising max_concurrency further on one client.
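Sharding across processes can be as simple as splitting the vector list and giving each worker its own client. A hedged sketch of that split: shard and load_shard are hypothetical names, and in real use load_shard would construct its own Index (clients should not be shared across a fork) and call upsert with batch_size:

```python
from multiprocessing import Pool

def shard(items, n):
    """Split `items` into contiguous, near-equal shards (at most `n` of them)."""
    step = -(-len(items) // n)  # ceiling division
    return [items[i:i + step] for i in range(0, len(items), step)]

def load_shard(vectors):
    # Real worker would build a fresh Index here and run:
    #   index.upsert(vectors=vectors, batch_size=100, max_concurrency=8)
    return len(vectors)  # stand-in: report how many vectors this shard holds

if __name__ == "__main__":
    shards = shard(list(range(10_000)), 4)
    with Pool(processes=4) as pool:
        written = pool.map(load_shard, shards)
    print(sum(written))  # → 10000
```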

For multi-million-vector loads from cloud storage, prefer index.start_import() over batched upsert — it avoids per-batch HTTP overhead entirely.

Query Latency

Queries don’t benefit from parallel batching the way bulk upserts do — each query is a single round trip — but the v9 client decodes responses substantially faster than v8 on REST for any query that returns more than a trivial payload. The wins come from msgspec.Struct response models and orjson for JSON decoding (see Fast Serialization), neither of which the v8 client uses.

Measured on the same 1536-d serverless index (Methodology), median latency:

Client  Scenario                        REST sync  REST async  gRPC
------  --------                        ---------  ----------  ----
v8      query_k10                       35.8 ms    34.3 ms     32.5 ms
v9      query_k10                       33.4 ms    35.2 ms     31.1 ms
v8      query_k100 + values + metadata  800 ms     708 ms      133 ms
v9      query_k100 + values + metadata  279 ms     260 ms      120 ms
v8      query_k1000 + values            7.01 s     7.12 s      534 ms
v9      query_k1000 + values            2.18 s     2.16 s      493 ms

Two patterns stand out:

  • The REST win scales with response size. Small top_k=10 queries are near-parity (~1.05×); top_k=100 with values + metadata is ~2.8× sync / ~2.7× async; top_k=1000 with values is ~3.2× sync / ~3.3× async. The bottleneck on v8 REST queries was decoding large JSON payloads — exactly the failure mode msgspec + orjson were chosen to fix.

  • gRPC is at parity throughout (~1.05–1.13× across scenarios). gRPC responses are protobuf, so they bypass the JSON decoding path entirely; there’s no msgspec/orjson dividend to collect. If you’re already on gRPC for queries, upgrading doesn’t change much on the query side. If you’re on REST and run heavy queries, the upgrade is a substantial win.

Filter complexity (eq, in_50, nested) on top_k=100 adds modest extra wins on REST (~1.15–1.22×) and stays at parity on gRPC. Filter overhead is small relative to network and decoding.

Async Concurrency

Pick the async client (AsyncPinecone / AsyncIndex) when your code is already inside an async def — most often because you’re conforming to the interface of an async web framework like FastAPI, Starlette, or Litestar, where request handlers are coroutines. In that setting, calling a blocking sync method either stalls the event loop (degrading throughput for every concurrent request) or forces you to offload to a thread; the async client lets you await Pinecone calls inline without either workaround.

The async client is also natural when you want concurrent reads and writes that should overlap — multiple queries in flight, or a query running while an upsert finishes — though sync code can achieve the same with threads.

For pure bulk upsert, prefer native batched upsert over a hand-rolled asyncio.gather — same parallelism, less code, automatic retries, and partial-failure reporting:

# Preferred
async with pc.index(host=desc.host) as index:
    response = await index.upsert(
        vectors=large_list,
        batch_size=100,
        max_concurrency=8,
    )

For mixed workloads — concurrent upserts and queries, or query fan-out across many namespaces — asyncio.gather over AsyncIndex calls is still the natural pattern:

import asyncio

async with pc.index(host=desc.host) as index:
    results = await asyncio.gather(
        index.upsert(vectors=writes_batch, batch_size=100),
        index.query(vector=q1, top_k=10),
        index.query(vector=q2, top_k=10),
    )

Sync vs async at high concurrency: with native batched upsert at max_concurrency=32, sync (~4.4 s) edges out async (~5.0 s) on the 10k-vector benchmark — the cached ThreadPoolExecutor is competitive with asyncio.Semaphore once cluster-side ingress dominates. Pick the client that matches your application style; throughput is similar at the saturation point.
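When you do hand-roll async concurrency, say a query fan-out across many namespaces, bound it with a semaphore the same way the client bounds batched upserts. A generic sketch (bounded_gather and fake_query are illustrative names, not SDK APIs; fake_query stands in for an awaited AsyncIndex.query call):

```python
import asyncio

async def bounded_gather(coros, max_concurrency=8):
    """Run coroutines concurrently with at most `max_concurrency` in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run(coro):
        async with sem:
            return await coro

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(run(c) for c in coros))

async def fake_query(i):
    await asyncio.sleep(0)  # stand-in for an awaited AsyncIndex.query(...)
    return i * i

results = asyncio.run(bounded_gather([fake_query(i) for i in range(5)]))
print(results)  # → [0, 1, 4, 9, 16]
```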

When to Use gRPC

pinecone.grpc.GrpcIndex accepts the same batch_size= and max_concurrency= kwargs as the REST Index, so the call site looks identical. The wire-level differences are HTTP/2 framing (vs HTTP/1.1 + keepalive on REST) and binary protobuf encoding (vs JSON). The gRPC channel ships with the package — no separate install step.

Reading off the throughput table above, a few things about gRPC stand out:

  • Even sequential, v8 gRPC was ~3× faster than v8 REST sync (34 s vs 112 s). HTTP/2 multiplexing and protobuf encoding buy a lot before any parallelism enters the picture — and that gap is structural to the protocols, not something parallel batching alone closes.

  • At default settings, the three transports are essentially tied (~10 s). For typical workloads, the choice is about API style, not throughput.

  • gRPC pulls ahead as concurrency rises — at max_concurrency=32, gRPC finishes the same work 1.5–1.9× faster than REST.

  • max_concurrency=1 doesn’t help gRPC — v8 gRPC was already pipelining requests over its HTTP/2 channel, so the v9 win for gRPC comes from explicit fan-out at higher concurrency, not from the new partial-success machinery.

Pick gRPC when:

  • You’re doing sustained bulk upserts at max_concurrency ≥ 16 — gRPC finishes the same work 1.5–1.9× faster than REST at that concurrency.

  • You want the lowest absolute write latency floor on a single client (~2.7 s for 10k vectors at c=32 on the reference workload).

Stay on REST when:

  • You’re at default settings or low concurrency — there is no measurable throughput benefit at max_concurrency ≤ 8.

  • You need async — GrpcIndex is sync-only; for async workloads use AsyncIndex over REST.

from pinecone import Pinecone

pc = Pinecone()
with pc.index(name="product-search", grpc=True) as index:
    response = index.upsert(
        vectors=large_list,
        batch_size=100,
        max_concurrency=16,
    )

Summary

Technique                                      Where it helps
---------                                      --------------
HTTP keepalive (REST) / HTTP/2 (gRPC)          Reused TCP connections, lower per-call setup cost
Reuse Index instance                           Eliminate per-call TLS/connection overhead
msgspec structs                                Response deserialization — faster than Pydantic
orjson                                         Request serialization — faster than stdlib json
Lazy imports                                   Reduce cold-start time
Index.upsert(batch_size=…, max_concurrency=…)  Bulk upsert — typical 10–25× over a sequential loop
AsyncIndex + asyncio.gather()                  Mixed concurrent read/write workloads
GrpcIndex (sync only)                          Sustained bulk upserts at max_concurrency ≥ 16 — ~1.5–1.9× over REST
index.start_import()                           Multi-million-vector loads from cloud storage

Methodology

The numbers in this guide come from a controlled benchmark — 10,000 random 1536-dimensional vectors, batch_size=100, single client, fresh namespace per run, against an aws-us-east-1 serverless index. The client ran on a GCP n2-standard-2 VM (2 vCPU, 8 GB) in us-central1-a running Ubuntu 24.04, so every request crosses GCP → AWS — RTT and inter-cloud bandwidth are real factors in the absolute numbers. The “v8 sequential” rows use pinecone==8.1.2 from PyPI (sequential batch_size= loop, fail-fast on first batch error). The max_concurrency=N rows use this version of the SDK with native parallel batched upsert.

Iteration counts vary by scenario. Batched-upsert cells use n=3 measured iterations after 1 warmup — each iteration writes 10k vectors, so increasing n trades wall time for precision. That table is best read as a directional guide: the large speedup factors (≥3×) are well above run-to-run noise, but small differences between adjacent rows in the same column should not be over-interpreted. Query cells use n=25 (n=10 for query_k1000_values); the query numbers are statistically firm. We plan to re-run the batched-upsert sweep at higher iteration counts too; this page will be refreshed at that time.

Your numbers will vary with client region, RTT, vector dimension, batch size, payload metadata, and concurrent traffic from other clients. When in doubt, measure on your own workload.
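A minimal harness for that measurement: warmup plus median over n runs, mirroring how the tables above report p50. Generic sketch; p50_seconds is a made-up helper and op stands in for whichever SDK call you are measuring:

```python
import statistics
import time

def p50_seconds(op, n=10, warmup=1):
    """Return the median wall time of `op()` over `n` measured runs."""
    for _ in range(warmup):
        op()  # discarded: the first call may pay connection-setup costs
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Example with a cheap stand-in operation:
print(f"{p50_seconds(lambda: sum(range(1000))) * 1e6:.0f} µs")
```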