Predicting DynamoDB Hot Partitions with OpenTelemetry and CloudWatch

A scalable approach for AWS on-demand DynamoDB with bursty writes, using OpenTelemetry + CloudWatch (Metrics and Logs).

Overview

In an AWS on-demand DynamoDB scenario with bursty writes, predicting hot partitions before throttling requires granular telemetry and clever analysis. This guide outlines a scalable process to instrument your Java application with OpenTelemetry, collect high-detail metrics in Amazon CloudWatch (Metrics and Logs), and analyze trends to flag “hot” keys early. We’ll cover what metrics to capture, how to handle high-cardinality identifiers safely, strategies for aggregating the data, statistical heuristics for detecting hot partitions, and building dashboards or alerts. The focus is on AWS-native solutions (OpenTelemetry + CloudWatch) that minimize cost while providing actionable insight.

OpenTelemetry Instrumentation for DynamoDB Operations

Capture Detailed DynamoDB Metrics

Instrument all DynamoDB calls in your Java application using OpenTelemetry. The goal is to tag each operation with key attributes and performance data, enabling per-partition analysis. Important metrics and attributes to capture for each DynamoDB request include:

  • Partition Key Identifier: e.g. accountId or contractId (from keys like "accountId=1007776605~~sortId=6"). This identifies which partition (hash key) the operation targets.
  • Operation Type: The DynamoDB API action (e.g. PutItem, UpdateItem, BatchWriteItem, Query, etc.). This can be captured as a span name or metric label (e.g. commandType=PutItem).
  • Consumed Capacity Units: The read or write capacity consumed by the request. For writes, DynamoDB returns ConsumedCapacity when you set ReturnConsumedCapacity on the request; capturing this value per operation is crucial. It accounts for table + LSI writes (e.g. writing a 1.5KB item with 2 LSIs might consume ~3–4 WCUs total).
  • Request Counts: A simple count of requests per key and operation (increment a counter for each call).
  • Latency and Outcome: Record the request duration and whether it succeeded, was throttled, or failed. Noting whether a request resulted in a ProvisionedThroughputExceededException is vital for correlating with throttling events.

Use OpenTelemetry’s tracing to automatically record calls to DynamoDB (e.g. via the AWS SDK instrumentation) and attach custom attributes for the partition key and consumed capacity. Also emit custom OpenTelemetry metrics (counters) for capacity and counts. For example, a counter metric dynamodb_wcu_consumed{partitionKey=X, operation=PutItem} can accumulate the write units consumed by each key (in practice we will handle partitionKey carefully due to cardinality – see next section).
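As a concrete sketch (assuming the OpenTelemetry Java SDK is already wired up; the class and method names here are illustrative), a bounded-cardinality counter for consumed write units might look like this – note that the raw partition key is deliberately not used as a metric attribute:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class DynamoDbTelemetry {
    private static final Meter METER =
        GlobalOpenTelemetry.getMeter("dynamodb-instrumentation");

    // Counter for consumed write capacity, dimensioned by table and operation only,
    // so cardinality stays bounded. The partition key goes to logs, not metrics.
    private static final LongCounter WCU_CONSUMED = METER
        .counterBuilder("dynamodb_wcu_consumed")
        .setUnit("1")
        .build();

    public static void recordWrite(String table, String operation, long consumedWcu) {
        WCU_CONSUMED.add(consumedWcu, Attributes.of(
            AttributeKey.stringKey("table"), table,
            AttributeKey.stringKey("operation"), operation));
    }
}

The consumed value itself can be read from the SDK response (for example, PutItemResponse.consumedCapacity().capacityUnits() in AWS SDK for Java v2) when ReturnConsumedCapacity is set to TOTAL.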

Instrument Batch Operations

Since writes are done in batch processing, ensure batch size and rate are captured. For a BatchWriteItem or a loop of many writes in a job, emit a metric or log event summarizing the batch (e.g. “Processed 10,000 items for Account X in 30 seconds”). This per-task metric will highlight sudden surges for a single key. It’s safer and cheaper to record one aggregated metric per batch than 10,000 individual metrics. You can instrument the code that processes each account/contract in the batch to log the total items and capacity used for that key’s job.
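For example, a minimal per-task summary helper (a sketch; field names mirror the log schema used later in this guide) could emit one structured line per key per batch:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BatchSummaryLogger {
    private static final Logger LOG = LoggerFactory.getLogger(BatchSummaryLogger.class);

    // One structured summary line per batch task, instead of one log line per item.
    // e.g. logBatchSummary("MyTable", "account#1007776605", 5000, 10000.0, 30000);
    public static void logBatchSummary(String table, String partitionKey,
                                       long itemsWritten, double totalWcu, long durationMs) {
        String json = String.format(
            "{\"table\":\"%s\",\"partitionKey\":\"%s\",\"operation\":\"BatchWriteItem\","
          + "\"itemsWritten\":%d,\"totalWCU\":%.1f,\"durationMs\":%d}",
            table, partitionKey, itemsWritten, totalWcu, durationMs);
        LOG.info(json);
    }
}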

Throttle Alerts via Instrumentation

Also consider a custom counter metric for throttled events (e.g. dynamodb_throttled_requests{table=XYZ}), incremented whenever the app receives a ProvisionedThroughputExceededException. Medium’s engineers use a similar custom metric to track throttle errors at the application level. While our goal is to predict and avoid throttling, tracking these errors helps validate predictions and can trigger immediate alerts if throttling starts.
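A hedged sketch of that counter (names are illustrative; note that the AWS SDK’s built-in retries may absorb some throttles before this exception ever surfaces):

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.ProvisionedThroughputExceededException;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class ThrottleTracking {
    private static final LongCounter THROTTLED = GlobalOpenTelemetry
        .getMeter("dynamodb-instrumentation")
        .counterBuilder("dynamodb_throttled_requests")
        .build();

    public static void putWithThrottleTracking(DynamoDbClient client, PutItemRequest request) {
        try {
            client.putItem(request);
        } catch (ProvisionedThroughputExceededException e) {
            // Count the throttle, then let the caller's retry/backoff logic handle it.
            THROTTLED.add(1, Attributes.of(
                AttributeKey.stringKey("table"), request.tableName()));
            throw e;
        }
    }
}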

Handling High-Cardinality Metrics Safely

Recording metrics per partition key introduces high-cardinality dimensions (each accountId/contractId is potentially unique). Pushing every key as a separate CloudWatch metric can quickly blow past limits and incur high cost. We need strategies to safely emit these metrics:

  • Prefer Logs for Detailed Key Data: Instead of treating partitionKey as a CloudWatch metric dimension, log it. Use CloudWatch Logs (or Embedded Metric Format) to record each operation’s details (key, capacity, etc.) without creating a new metric time-series for each key. For example, log JSON like: {"table":"MyTable", "partitionKey":"1007776605", "operation":"PutItem", "consumedWCU":2}. This way, keys remain in logs for analysis, and we only create aggregate metrics (like total WCU) in CloudWatch.
  • Dimension Aggregation/Grouping: If you must emit metrics, aggregate by a less-granular dimension. For instance, you might hash or bucket the partitionKey (e.g. use a hash prefix or a group ID) so that only a limited set of values appear as metric dimensions. Another approach is to replace actual IDs with a template value in the metric dimension. For example, configure the OTel instrumentation to report an operation name like PutItem account/{accountId} instead of the raw ID – CloudWatch’s Application Insights can be set to aggregate such patterns. This prevents each unique ID from spawning a new time-series.
  • Top N Key Metrics Only: Implement logic to emit custom metrics only for the currently hottest keys. For example, your app (or a log-processing Lambda) could track which partition keys are consuming the most WCU/RCU in the last interval and push metrics for just those (e.g. HotPartitionKey=1007776605 dimension with value=throughput). Limit to the top 5–10 keys to control cardinality. Less-active keys won’t appear as separate metrics. This dynamic approach focuses on the keys that matter at any time.
  • Use CloudWatch EMF (Embedded Metric Format): The AWS OTel Collector’s EMF exporter allows high-cardinality metrics by writing them as logs first. With EMF, you can include the key as a dimension in the log record; CloudWatch will index some metrics but can drop others based on a limiter. For example, the CloudWatch Agent/ADOT has a metric limiter that, by default, will start grouping excess unique dimensions into an “Other” category after 500 unique values. You can tune this or use the default to ensure CloudWatch doesn’t get overwhelmed. Essentially, EMF lets you capture granular data in logs and still create some aggregated metrics (e.g. it might create an “AllOtherKeys” catch-all metric when too many keys are seen). A concrete EMF example follows this list.
  • Cost Awareness: Remember that each unique metric (unique dimension combination) in CloudWatch is a custom metric that incurs cost. Avoid an unbounded label like raw contractId. Instead, utilize one of the above techniques to keep metric counts reasonable. CloudWatch Logs costs (for ingestion and storage) are typically lower for high-cardinality data than CloudWatch Metrics costs for the same detail, so shifting detail to logs is often more cost-effective.
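To make the logs-first approach concrete, here is a sketch of an EMF log event (field names and the namespace are our own): consumedWCU is declared as a metric with only table and operation as dimensions, while partitionKey rides along as a plain log field – queryable in Logs Insights, but never its own metric time-series.

{
  "_aws": {
    "Timestamp": 1767811140000,
    "CloudWatchMetrics": [
      {
        "Namespace": "MyApp/DynamoDB",
        "Dimensions": [["table", "operation"]],
        "Metrics": [{ "Name": "consumedWCU", "Unit": "Count" }]
      }
    ]
  },
  "table": "MyTable",
  "operation": "PutItem",
  "partitionKey": "account#1007776605",
  "consumedWCU": 2
}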

CloudWatch Data Aggregation and Storage

Use CloudWatch Metrics for Global Stats

Continue to leverage CloudWatch’s built-in DynamoDB metrics and a few custom metrics for high-level monitoring. For example, track ConsumedWriteCapacityUnits and ConsumedReadCapacityUnits (table-wide) and your custom ThrottledRequests count. These give a picture of overall load and if/when throttling occurs. You might also use custom Metric Math to calculate usage as a percentage of estimated capacity (more on that in detection logic). However, these table-level metrics won’t pinpoint which key is hot – that’s where our granular data comes in.

Leverage CloudWatch Logs for Per-Key Details

All the per-request or per-key metrics collected via OpenTelemetry should be sent to a CloudWatch Log Group (either directly as log events, or via the OTel EMF pipeline). This log group acts as a reservoir of high-cardinality data that we can query. Each log entry might represent a single DynamoDB request or an aggregated batch. For example:

{
  "timestamp": "2026-01-07T18:39:00Z",
  "table": "MyTable",
  "partitionKey": "account#1007776605",
  "operation": "PutItem",
  "consumedWCU": 2
}

During a batch job, you might instead log a summary like:

{
  "timestamp": "2026-01-07T18:40:00Z",
  "table": "MyTable",
  "partitionKey": "account#1007776605",
  "operation": "BatchWriteItem",
  "itemsWritten": 5000,
  "totalWCU": 10000,
  "durationMs": 30000
}

By storing this data in CloudWatch Logs, we enable retrospective analysis (looking back at yesterday’s batch to see which keys spiked) and real-time querying via CloudWatch Logs Insights. The logs can be kept for a short retention if cost is a concern, or archived to S3 if needed for long-term analysis.

Aggregation Techniques

To reduce log volume, consider aggregating counts before logging. For instance, the OpenTelemetry Collector’s batch processor can buffer telemetry for a short window (e.g. 30 seconds) before exporting it, so fewer, larger events are emitted. Similarly, your application could accumulate per-key counters in memory and write out one log line per key per minute rather than one per request. This trades resolution for cost savings. Make sure the aggregation window is short enough (e.g. 1 minute or less) to still catch rapid spikes.
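A minimal sketch of that in-application aggregation (class and field names are ours): accumulate WCU per key in a concurrent map and flush one log line per key each minute.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.DoubleAdder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PerKeyAggregator {
    private static final Logger LOG = LoggerFactory.getLogger(PerKeyAggregator.class);
    private final ConcurrentHashMap<String, DoubleAdder> wcuByKey = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public PerKeyAggregator() {
        // Flush once a minute: short enough to still catch rapid spikes.
        scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.MINUTES);
    }

    public void record(String partitionKey, double consumedWcu) {
        wcuByKey.computeIfAbsent(partitionKey, k -> new DoubleAdder()).add(consumedWcu);
    }

    private void flush() {
        for (Map.Entry<String, DoubleAdder> e : wcuByKey.entrySet()) {
            double total = e.getValue().sumThenReset();
            if (total > 0) {
                // One aggregated line per key per minute instead of one per request.
                LOG.info(String.format(
                    "{\"partitionKey\":\"%s\",\"windowSeconds\":60,\"totalWCU\":%.1f}",
                    e.getKey(), total));
            }
        }
    }
}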

CloudWatch Logs Insights for Analysis

Use CloudWatch Logs Insights queries to analyze the log data. For retrospective analysis, you can run queries to find the keys with the highest traffic over a period. For example:

filter table="MyTable"
| stats sum(consumedWCU) as totalWCU, count(*) as requests by partitionKey, operation
| sort totalWCU desc
| limit 10

This query (run over, say, the last 24 hours of logs) will show the top 10 partition keys by write throughput and how many operations they had. You can refine the time window to zoom in on the batch processing period. Because the data is in logs, you can group by the actual key without worrying about CloudWatch metric limits. This forms the basis for identifying hot keys after the fact.

Storing for Real-Time Alerts

For immediate or predictive alerts, you might push some aggregated metrics to CloudWatch. For example, every minute, emit a custom metric for the top key’s WCU (as described earlier). You could maintain a CloudWatch metric like LeadingPartitionWCU (with a dimension for the key name or rank) to track the worst offender in near-real-time. Alternatively, skip the metric and use a CloudWatch Logs Insights query directly, either in a dashboard widget or to drive an alarm via a metric filter or a scheduled query. A simple metric filter can only count occurrences of a specific, known key, which isn’t general enough for unknown keys – so prefer a scheduled Logs Insights query that finds any key over threshold and triggers an alert via Lambda or EventBridge. This requires some custom scripting but keeps most data in logs until needed.

Detecting Emerging Hot Partitions (Heuristics & Thresholds)

To predict hot partitions before they throttle, use a combination of known DynamoDB limits and statistical heuristics on the per-key metrics:

  • Absolute Throughput Thresholds: DynamoDB imposes a per-partition limit on throughput. In on-demand mode, an individual partition (and thus a single partition key’s data, if it all lives in one partition) tops out around 1,000 write request units per second or 3,000 read units per second. In other words, a single key can sustain at most ~1000 writes/sec (or equivalent mix) before hitting a partition’s capacity. This corresponds to ~60,000 write units/min for one key. Use these as hard warning limits. If any partition key approaches, say, 800–1000 WCU/sec or 2500+ RCU/sec, that’s a red flag. You can compute this from your metrics: e.g. if a key consumed ~60,000 WCU in the last minute, it was at the max 1,000/sec rate continuously. Set a threshold slightly below the max to have buffer.
  • On-Demand Surge Factor: On-demand tables can throttle if traffic for a table doubles faster than the system can react (roughly within 30 minutes). Monitor the rate of increase of throughput per key. A key that suddenly goes from, say, 100 WCU/sec to 1,000 WCU/sec in a couple of minutes is likely to cause trouble even if 1,000 is the limit – because other partitions might not share load and AWS might not yet redistribute capacity. Define a heuristic like: if a key’s throughput has increased by >2x within the last 5–10 minutes, flag it. This could be implemented by comparing a short-term average to a slightly longer-term average for that key.
  • Proportion of Table Capacity: Another indicator is if one key consumes a very large portion of the table’s overall throughput. For example, if your table is doing 100k WCU/sec in total and one accountId is consistently 80k of that (80%), it’s a hot partition risk. Even if total is within limits, the imbalance suggests one partition is saturated. You can derive this by dividing per-key metrics by the table’s CloudWatch ConsumedWriteCapacityUnits. If a single key >50% of table usage, it’s likely hot (depending on how many partitions the table has). Medium’s team used a similar approach: they estimated the number of partitions and then watched if a single key’s request rate was nearing the per-partition share of capacity.
  • Historical Baselines & Anomaly Detection: Maintain a baseline of typical activity for each key (or for the top N keys). For instance, if account “X” usually sees 100 writes/day but suddenly does 10,000 writes in an hour, that’s anomalous. You might not set up a complex anomaly detection algorithm, but you can compute a rolling average or use CloudWatch’s built-in anomaly detection on a custom metric if you have one per key (limited by cardinality though). Simpler: if a key appears in the “top N” list that was never there before, treat it as a potential hot key emerging.
  • Threshold on Throttled Events (Leading Indicator): If you start seeing even a few throttled requests for a specific key (from your instrumentation logs), that’s a clear sign the partition is hot. Ideally we predict before throttles, but a single throttle or two can serve as an early warning to back off. Track partial throttling: e.g. if in one minute, key “X” had 5 throttled writes, even if retries succeeded, it’s at the limit and should be flagged immediately.

By combining these criteria, you can detect an emerging hot partition. For example, you might implement a rule: “If any key exceeds 50% of table WCU and >800 WCU/sec absolute, or shows >2x growth within 5 minutes, flag it as hot.” These rules can be adjusted based on your observed patterns (daily batch surges likely follow a known profile).
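Expressed as code, that combined rule might look like the following sketch (class name and thresholds are illustrative and should be tuned to your workload):

public class HotKeyRule {
    // Thresholds taken from the example rule above; tune them to your observed patterns.
    private static final double ABSOLUTE_WARN_WCU_PER_SEC = 800.0;  // ~80% of the per-partition max
    private static final double TABLE_SHARE_WARN = 0.5;             // one key consuming >50% of table WCU
    private static final double GROWTH_FACTOR_WARN = 2.0;           // >2x growth within ~5 minutes

    public static boolean isHot(double keyWcuPerSec, double tableWcuPerSec,
                                double keyWcuPerSecFiveMinutesAgo) {
        boolean overAbsolute = keyWcuPerSec > ABSOLUTE_WARN_WCU_PER_SEC;
        boolean overTableShare = tableWcuPerSec > 0
            && keyWcuPerSec / tableWcuPerSec > TABLE_SHARE_WARN;
        boolean growingFast = keyWcuPerSecFiveMinutesAgo > 0
            && keyWcuPerSec / keyWcuPerSecFiveMinutesAgo > GROWTH_FACTOR_WARN;
        return (overTableShare && overAbsolute) || growingFast;
    }
}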

Note: DynamoDB hot keys can be tricky – sometimes two moderately hot keys on the same partition can cause throttle even if each alone is under the threshold. Since you cannot know which partition a key falls into (without DynamoDB Contributor Insights or the key diagnostics library), our prediction isn’t foolproof. We assume one partition key = one partition for safety, but be aware of edge cases (if two hot accountIds hash to the same partition, throttling could occur earlier than expected).

Example (conceptual CloudWatch Metrics graph): a single partition’s throughput becomes the bottleneck. Even though the table’s provisioned capacity is higher, the consumed write units plateau at ~1,000 units/sec – the maximum per-partition throughput – indicating a hot partition caused by imbalanced key usage.

In that example, all writes went to one partition (e.g., using a single date as the key) and hit ~1,000 WCU/sec, throttling the application. Such visualizations confirm why our threshold of ~1,000 WCU/sec per key is critical.

Monitoring Write Surges and Batch Tasks

Because the writes are done via daily batch processing, special attention is needed for surge detection within a batch job. Often, a batch process might hammer one key at a time (e.g., processing contract “ABC” then moving to “DEF”). This can create brief but intense hot partitions. Here’s how to handle it:

  • Per-Task Logging: As mentioned, instrument each batch task to emit a summary. If a particular task is about to perform a massive number of writes for one key, log that intention or start. For example: “Starting processing for Account 1007776605 with 50,000 updates queued.” This can even be a dry prediction logged before the writes commence. Then log the outcome: “Completed Account 1007776605: 50,000 items, 120k WCU, took 2 minutes.” These logs let you retrospectively identify which batch tasks caused high load.
  • Batch-Level Metrics: You might create a custom metric for “batch write size” or “items per account per batch”. This need not have the accountId as a dimension, but you can alert on any single batch exceeding a threshold (e.g., if any batch processes >N items, it’s likely to be a hot surge). The presence of an extremely large batch could warn you that the corresponding key will be hot.
  • Intra-Batch Throttling Signals: Within a batch, if you see the DynamoDB client start to throttle (even lightly), you should slow down that batch. Implement backoff logic when writing bursts: for example, if you catch a ProvisionedThroughputExceededException, pause or slow the writes for that key’s batch. You can instrument a gauge metric for “current batch write rate” and reduce it dynamically upon throttles.
  • Rate Limiting by Design: As a preventive measure, consider pacing the batch writes per key. If OpenTelemetry metrics show that last night’s batch for account X hit 900 WCU/sec and caused near-throttling, you could adjust the code to insert a small delay or chunking mechanism for that account’s next run. Essentially, use yesterday’s metrics to tune today’s batch processing (this is a form of analytical prediction outside of monitoring systems). A minimal pacing sketch follows this list.
  • Batch Window Analytics: Use the logs to compute metrics per time window inside the batch. For instance, you can divide the batch processing period into 1-minute windows and see how many writes each key got per window. A sliding window counter is useful: as the batch runs, every few seconds update a count of writes in the last 60 seconds for the current key. If that sliding window count is trending toward the 60k limit, that’s a sign to throttle down. This can be done in-memory in the application and also exported to logs periodically.
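For the rate-limiting idea above, here is a minimal pacing sketch (it assumes Guava’s RateLimiter; the 600 WCU/sec cap and class names are illustrative) that throttles writes for a single key’s batch and tightens further if a throttle exception still slips through:

import com.google.common.util.concurrent.RateLimiter;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.ProvisionedThroughputExceededException;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class PacedWriter {
    // Stay well under the ~1000 WCU/sec per-partition ceiling; the value is illustrative.
    private final RateLimiter limiter = RateLimiter.create(600.0);
    private final DynamoDbClient client;

    public PacedWriter(DynamoDbClient client) {
        this.client = client;
    }

    public void write(PutItemRequest request, int estimatedWcu) {
        // Block until enough "capacity tokens" are available for this item.
        limiter.acquire(Math.max(1, estimatedWcu));
        try {
            client.putItem(request);
        } catch (ProvisionedThroughputExceededException e) {
            // Back off harder for the rest of this key's batch if we still get throttled.
            limiter.setRate(Math.max(100.0, limiter.getRate() / 2));
            throw e;
        }
    }
}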

Overall, by instrumenting at the batch task level, you gain insight into when and where surges occur. This data can feed into a prediction model – for example, if the app is about to process a notoriously large account, you might proactively alert or scale resources.

Time-Window Trending and Prediction Logic

Sliding Window Analysis

Implement a sliding time window for tracking per-key activity. For example, maintain a rolling 5-minute window of writes for each key (updated each minute). This can be done by keeping counters in a small in-memory cache keyed by partitionKey and using a deque or ring buffer for the per-minute counts. Every minute, calculate the total over the last five minutes (drop the oldest minute, add the newest). This 5-minute sum or average is a smoother indicator of trend than a single minute. If the 5-minute moving total for a key is, say, 4x higher than it was an hour ago, it’s a rising hotspot.
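A sketch of such a per-key window using a small ring of per-minute buckets (class and method names are ours; a real implementation would also evict idle keys):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

public class SlidingWindowCounter {
    private static final int WINDOW_MINUTES = 5;
    // partitionKey -> ring of per-minute WCU buckets
    private final ConcurrentHashMap<String, DoubleAdder[]> buckets = new ConcurrentHashMap<>();

    private static int slot(long epochMinute) {
        return (int) (epochMinute % WINDOW_MINUTES);
    }

    public void record(String key, double wcu, long epochMinute) {
        DoubleAdder[] ring = buckets.computeIfAbsent(key, k -> newRing());
        ring[slot(epochMinute)].add(wcu);
    }

    // Sum over the last WINDOW_MINUTES buckets: a smoother trend signal than a single minute.
    public double windowTotal(String key) {
        DoubleAdder[] ring = buckets.get(key);
        if (ring == null) return 0.0;
        double total = 0.0;
        for (DoubleAdder bucket : ring) total += bucket.sum();
        return total;
    }

    // Call at each minute boundary to clear the bucket that is about to be reused.
    public void rollOver(long newEpochMinute) {
        for (DoubleAdder[] ring : buckets.values()) ring[slot(newEpochMinute)].reset();
    }

    private static DoubleAdder[] newRing() {
        DoubleAdder[] ring = new DoubleAdder[WINDOW_MINUTES];
        for (int i = 0; i < WINDOW_MINUTES; i++) ring[i] = new DoubleAdder();
        return ring;
    }
}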

Growth Rate and Acceleration

Look not just at the current value but the derivative (rate of change). For each key, compute how much its throughput increased compared to the previous window. If you have a time-series of a key’s WCU per minute, a rapid upward curve indicates trouble. You can set a rule like: “If a key’s per-minute WCU has been increasing by >20% for each of the last 3 minutes, and it’s now above a baseline threshold, flag it.” This catches exponential growth early. It’s essentially applying a heuristic form of anomaly detection focusing on upward spikes.

Baseline Comparison

Compare recent activity to a historical baseline. For instance, keep yesterday’s same-time window as a reference (if load is daily periodic). If account 123 is usually quiet but today at 2 PM it’s suddenly very busy (compared to yesterday at 2 PM or the average of the last few days), that deviation could predict a potential hot-key issue. CloudWatch Metrics Insights or metric math could help if you feed a metric per key; otherwise, do this comparison in a script or visually via dashboards.

Statistical Thresholds

You can use statistical techniques like standard deviation: treat each key’s request count as a distribution and flag if it exceeds, say, mean + 3σ of its usual range. Given that many keys might be normally near zero, a sudden spike will easily overshoot such thresholds. Even a simpler static threshold (X writes per minute) can be statistically informed by looking at historical max. For example, if no key ever did more than 20k writes in a minute before, and now one is doing 50k, that’s clearly exceptional.

Predicting Imminent Throttle

Synthesize the above into a prediction rule. A possible logic:

“If any partition key’s 1-minute write rate exceeds 80% of 1000 WCU/sec (i.e. >800/sec) or if its 5-min moving average is >500 WCU/sec and increasing for 3 consecutive minutes, then predict throttling risk.”

You could implement this with a CloudWatch Logs Insights query that runs every minute via an EventBridge (CloudWatch Events) schedule and a Lambda: the query finds any key over X WCU in the last minute or with an increasing trend (inferred by comparing the last 5-minute total to the last 15-minute total). If the query finds such keys, the Lambda can publish an alert (or even preemptively notify the batch processor to slow down for that key).
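A sketch of that glue using the AWS SDK for Java v2 (the log group name, namespace, query, and threshold are assumptions to adapt): run the Logs Insights query over the last minute, check the results against the threshold, and publish a HotKeyAlert metric that a CloudWatch alarm can watch.

import java.time.Instant;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;
import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;
import software.amazon.awssdk.services.cloudwatchlogs.model.GetQueryResultsRequest;
import software.amazon.awssdk.services.cloudwatchlogs.model.GetQueryResultsResponse;
import software.amazon.awssdk.services.cloudwatchlogs.model.QueryStatus;
import software.amazon.awssdk.services.cloudwatchlogs.model.StartQueryRequest;

public class HotKeyQueryJob {
    // Top keys by WCU over the last minute; the threshold check happens below in code.
    private static final String QUERY =
        "filter table = \"MyTable\""
        + " | stats sum(consumedWCU) as totalWCU by partitionKey"
        + " | sort totalWCU desc | limit 5";

    private static final double THRESHOLD_WCU_PER_MINUTE = 48000.0;  // ~800 WCU/sec sustained

    public static void run(CloudWatchLogsClient logs, CloudWatchClient cw) throws InterruptedException {
        Instant now = Instant.now();
        String queryId = logs.startQuery(StartQueryRequest.builder()
                .logGroupName("/myapp/dynamodb-telemetry")   // assumed log group name
                .startTime(now.minusSeconds(60).getEpochSecond())
                .endTime(now.getEpochSecond())
                .queryString(QUERY)
                .build()).queryId();

        GetQueryResultsResponse results;
        do {  // poll until the query finishes
            Thread.sleep(1000);
            results = logs.getQueryResults(GetQueryResultsRequest.builder().queryId(queryId).build());
        } while (results.status() == QueryStatus.RUNNING || results.status() == QueryStatus.SCHEDULED);

        boolean breach = results.results().stream()
                .flatMap(row -> row.stream())
                .filter(f -> "totalWCU".equals(f.field()))
                .anyMatch(f -> Double.parseDouble(f.value()) > THRESHOLD_WCU_PER_MINUTE);

        // Publish 1 when any key breached the threshold, 0 otherwise; a CloudWatch alarm watches this.
        cw.putMetricData(PutMetricDataRequest.builder()
                .namespace("MyApp/DynamoDB")                 // assumed namespace
                .metricData(MetricDatum.builder()
                        .metricName("HotKeyAlert")
                        .unit(StandardUnit.COUNT)
                        .value(breach ? 1.0 : 0.0)
                        .build())
                .build());
    }
}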

Another approach: use CloudWatch Metric Math on a custom metric if available. For example, if you had HotPartitionWCU metric for the worst key, you could apply anomaly detection on it with CloudWatch alarms. But since our granular data is in logs, a Logs Insights + Lambda approach is more flexible.

False Positives and Tuning

Initially, you may want to set the thresholds low to catch anything suspicious, then refine. It’s better to get a few false alarms (and adjust) than to miss a surge. Over time, identify which metrics were the best predictors of actual throttling events. Perhaps you’ll find that “500 WCU/sec sustained for 5 minutes” was safe but “700 WCU/sec for 2 minutes” always preceded throttling. Adjust thresholds accordingly.

Dashboards and Key Visualizations

Visualizing the data helps both in real-time monitoring and in refining your predictive strategy:

  • Top N Keys Graph: Construct a CloudWatch dashboard that highlights the heaviest keys. Since CloudWatch Metrics doesn’t natively support a dynamic “top N”, use Logs Insights widgets. For example, add a widget with a query: “Top 5 partitionKeys by write throughput in the last 5 minutes”. This can output a table of keys and their WCU. You might have it refresh automatically every minute. While it’s not a continuous time-series graph, it gives an updating leaderboard of hot keys (an example query for this widget follows this list).
  • Custom Metrics for Top Keys: If you decided to emit metrics for a few top keys, you can add those to a line graph. For instance, you could have a graph showing HotKey1_Throughput, HotKey2_Throughput, etc. (if you encoded rank or specific key dimensions). This requires knowing which keys to track. In a stable workload, maybe the same few accounts tend to be hot, so you can hard-code them in a dashboard. However, if it varies, you’ll rely on logs for dynamic analysis.
  • Overall vs Per-Key Comparison: Include a graph of total table WCU/RCU alongside something like the top key’s WCU. CloudWatch Metric Math can take the max of all per-key series if you had them, but if not, just compare the total vs the threshold line. For example, a horizontal line at “1000 WCU/sec per partition” is a useful visual aid. If total is well above 1000, that’s fine if spread out; but if you see throttling, you can bet one key hit that line. You could also plot the ThrottledRequests metric on a secondary axis to see if spikes correlate with when a key presumably exceeded capacity.
  • Trend Charts: If you record any trend metric (like a 5-min moving average for the top key), plot that to see the shape of surges. CloudWatch Metrics Insights (a SQL-like query language for metrics) can be used if you store the data as custom metrics. For example, a query like SELECT MAX(consumed_wcu) FROM "YourNamespace" WHERE TableName = 'MyTable' GROUP BY partitionKey ORDER BY MAX() DESC LIMIT 5 could show the top 5 keys as lines (similar to Contributor Insights, but done manually). This is advanced and only feasible if the metrics are in CloudWatch; with our logs approach, Logs Insights is the go-to.
  • CloudWatch Logs Insights Live Tail: During the batch processing window, you can even run a logs insights query continuously (the CloudWatch Console can auto-refresh a query). For example, watch a query that filters the last few minutes of logs for keys exceeding a threshold. This is more ad-hoc, but useful when you know a heavy batch is running and you want a real-time eye on it.
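For the “Top 5 partition keys” widget mentioned in the first bullet, a Logs Insights query along these lines (field names match the log schema shown earlier) can back the table:

filter table = "MyTable" and ispresent(consumedWCU)
| stats sum(consumedWCU) as totalWCU, count(*) as requests by partitionKey
| sort totalWCU desc
| limit 5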

When building dashboards, prioritize clarity: use headings like “Top DynamoDB Partition Keys (WCU)”, “Total Throughput vs Throttling”, etc. Ensure that the team can quickly identify if a single key is abnormal. Metric Math can highlight a single highest value among a set, but since we avoid high-card metrics, our dashboard relies on either logs queries or a few chosen metrics.

Example Visualization (Conceptual)

Imagine a dashboard where one panel lists the top contributors in the last 5 minutes (key and WCU), and next to it a graph shows the total table WCU (stacked by key perhaps). If your instrumentation is feeding data, you could attempt a stacked area graph of the top N keys throughput over time – effectively reproducing a simplified Contributor Insights graph. Each key would be one color band, and you’d clearly see if one band dominates. Without Contributor Insights, you’d have to generate those series yourself (maybe via a scheduled job that pushes the data). If that’s too complex, stick to tables or single-value widgets that call out, for example, “Current hottest key: X (900 WCU/sec)”.

Finally, set up alerts based on this analysis. For instance, a CloudWatch alarm on the custom metric “LeadingPartitionWCU” if it exceeds 800 for 2 minutes, or an alarm on “ThrottledRequests > 0” (any throttle). Pre-throttle alarms (based on your prediction metrics) are the ultimate goal – e.g., an alarm that triggers if a Logs Insights query finds any key above threshold. This can be implemented via a Lambda that runs the query and emits a CloudWatch metric like “HotKeyAlert” when true, which an alarm then watches. It’s a bit of glue, but all within AWS.

Best Practices for Preemptive Hot Key Detection

To wrap up, here are general practices to ensure the solution is production-grade, scalable, and cost-effective:

  • Use Sliding Window Counters: Avoid instantaneous spikes by smoothing over short intervals. A sliding window or exponential moving average for each key’s activity will give a stable indicator that can predict trouble ahead better than single spikes.
  • Set Conservative Thresholds: Especially at first, set your alert thresholds low (e.g., 500 WCU/sec for 1 minute) to catch potential hot keys early. You can always raise thresholds if false alarms are frequent. It’s easier to relax a sensitive alarm than to miss an event.
  • Incorporate Table Partition Estimates: If you know your table size or have an estimate of partition count, use it. E.g., “We have ~300 partitions, so per-partition share of 360k WCU total is ~1200 WCU; seeing one key at 1000 WCU means it’s using almost an entire partition’s worth.” Such context helps justify scaling up or redesigning keys. Medium calculated expected partitions and divided capacity among them to set key-level limits.
  • Cost Management: Emit only the metrics that provide value. Use logs for high-cardinality data and keep their retention low if possible (you might only need a few days of logs to analyze patterns, since older data can be summarized and then dropped). Leverage CloudWatch’s free features like Metric Math and Logs Insights before adding new infrastructure. Everything proposed here runs on AWS-managed services (CloudWatch, possibly Lambda for glue) – no need for an external monitoring system, which keeps complexity and cost down.
  • Automate Mitigation if Possible: Prediction is only as good as the response. If you identify a hot partition emerging, have a plan: for example, the system could automatically throttle back the offending batch job, or route writes for that key to a queue to drain more slowly. At minimum, alert the team with clear information (“Account 1007776605 is trending towards hot partition: 50k WCU/min and rising”). Early detection should give you time to react (perhaps pause the batch or split it) before DynamoDB itself starts throttling heavily.
  • Iterate and Tune: Treat this monitoring like a living system. After each batch, review the metrics: Did we accurately predict the hotspots? Were there throttles that went unnoticed? Adjust the instrumentation or thresholds accordingly. Over time, you’ll identify patterns (e.g., “end-of-month processing for contract type A always creates hot keys”) and can refine the strategy (maybe add a specific check for that scenario).

By following this structured approach – instrumenting thoroughly, smartly managing metrics, and analyzing trends – you can anticipate DynamoDB hot partitions in advance. This helps avoid the surprise of throttling, ensuring your on-demand table remains performant even under intense, bursty workloads. As the adage goes (and as the Medium team learned): “You can’t improve what you don’t measure.” With these measurements in place, you’ll improve both your insight and your ability to react to DynamoDB’s scaling characteristics.

Sources