Backfill buffer overflow

We are experimenting with backfill options for a tier-2 Historian. We have been unable to find documentation that covers what we are seeing, so I hope we can get some answers through this forum.

We are attempting to backfill two full days' worth of history data. Each day weighs in at about 500 MB and contains about 2,200 tags. (We have suggested implementing value deadbands on floating-point values to reduce the size of the data, but that does not help with historical data that has already been stored.)

When we start the backfill, the replication seems to go into store-and-forward due to "Buffer_Half_Full", followed by "critically low buffer" and "buffer discarded" messages.

Each time we test, we remove the replication, shut down tier-2, delete the data from tier-2, rebuild the runtime database on tier-2, and start fresh. We have tried adjusting different parameters for each run:

  • What is HCAL buffer memory, and how can we increase it?
  • How can we reduce the data rate?
  • Should adjusting the "Buffer count" in replication settings help? We have 128, 300, 500, 2048 and it does not seem to make any difference.
  • Should adjusting the "Time increment" in backfill help? We have tried both 24 and 6 hours. 6 hours seems to produce more buffer discards than 24 hour.

Maybe limiting the bandwidth in the replication settings could help? At this point I feel like I'm just testing random parameters, like back in the day when we adjusted Modbus parameters until it finally worked (or was that yesterday?)

Reply
  • Based on the high-level metrics, I'd guess the 2,200 tags change on average every ~3 seconds overall. Does that seem right? The "buffer count" and "buffer memory" settings are related and apply to this live, streaming data rate, but they should not be directly impacted by backfilling (which uses "queued" replication). 

    At this rate, the network bandwidth between the nodes should be ~700 Kbps for the normal streaming load, but the backfill will increase that (see the back-of-envelope sketch at the end of this reply). How does that compare with your environment? Note that high-latency networks (e.g. satellite) can also have an impact. 

    From the "tags" on this post it looks like you're using Historian 2023 R2 for both systems, right? Are these new, or upgraded from an older release? Note that the underlying communications technology in this release changed to gRPC, replacing WCF--in theory, everything works the same, but there is a chance some of the details have changed. 

    There is no way to directly control the data rate just for "simple" replication--you can apply the time/value/rate deadbands for data coming into the "tier 1" and that will also benefit replication. You can also use the periodic "summary replication", which will limit the rate (mostly useful in bandwidth-limited WAN applications). 

    The rates above are well within the supported limits, but as part of your testing, try reducing the tag count by 50% to see if that makes any difference. You might also double-check that there are not a few tags being replicated at much higher data rates (e.g. subsecond). 

    While doing your experiments, also clear out the sync queue tables in the "tier 1" as part of the "reset" process.

    I don't expect the backfill interval to make a difference. Do check the size of the sync queue (it should stay below ~3,000 in your case). You can check this in the OCMC, via the system tags, or in the database table.
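
    As a rough cross-check, a minimal back-of-envelope sketch of these numbers (Python used only as a calculator; the ~120 bytes-per-value wire cost is an assumed figure for illustration, not a documented Historian number):

    ```python
    # Back-of-envelope for the streaming replication load described in this thread.
    TAGS = 2200
    AVG_UPDATE_INTERVAL_S = 3            # estimate above: each tag changes ~every 3 s
    ASSUMED_WIRE_BYTES_PER_VALUE = 120   # assumed VTQ payload + protocol overhead
    HISTORY_BYTES_PER_DAY = 500e6        # ~500 MB per daily history block (from the question)

    values_per_second = TAGS / AVG_UPDATE_INTERVAL_S
    values_per_day = values_per_second * 24 * 3600
    streaming_kbps = values_per_second * ASSUMED_WIRE_BYTES_PER_VALUE * 8 / 1000

    print(f"~{values_per_second:,.0f} values/s  (~{values_per_day / 1e6:.0f}M values/day)")
    print(f"~{streaming_kbps:,.0f} kbit/s steady-state streaming, before any backfill traffic")
    print(f"~{HISTORY_BYTES_PER_DAY / values_per_day:.0f} bytes stored per value on disk")
    ```

    With those assumptions the arithmetic lands right around the ~700 Kbps streaming figure quoted above, and implies roughly 8 bytes stored per value for the ~500 MB daily blocks.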

Children
  • I investigated a history block to look at how frequently the tags updated. I found that there are quite a few tags (I'd say 100-200) with ~160,000 VTQs per day, and, balancing that out, quite a few tags with only a few hundred updates per day (see the rough arithmetic after this post). So we really need to apply those deadbands, but that won't help for the old history data that was already generated without deadbands.

    Does that mean we could be hitting some rate limit? You mentioned reducing the tag count. The only way I have found to do this comes from a 2014 troubleshooting document, which mentions "MaxTransactionValueCount" and "MaxTransactionTagCount" as registry keys that can change the replication service behaviour. Are there other ways of reducing the tag count?

    Tier-2 is a fresh 2023 R2 install. Tier-1 has been upgraded several times; the last upgrade is just a few weeks old. They are running on a gigabit network. We will be doing summary replication afterwards to speed up trending, but we also need full data replication with backfill, so this is step one. :)

    EDIT: What does it mean when the logger says the buffer is discarded? Are we basically losing data on the tier-2?
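
    To put those counts in perspective, a quick arithmetic sketch (150 tags is just the midpoint of the 100-200 range above, and 300/day stands in for "a few hundred updates per day"):

    ```python
    # Average update interval implied by a daily VTQ count.
    SECONDS_PER_DAY = 24 * 3600

    def avg_interval_s(vtqs_per_day: float) -> float:
        """Average seconds between stored values for a tag."""
        return SECONDS_PER_DAY / vtqs_per_day

    for daily_count in (160_000, 300):
        print(f"{daily_count:>8,} VTQs/day -> one value every ~{avg_interval_s(daily_count):.2f} s")

    # Contribution of the fast tags alone (150 = midpoint of the 100-200 range):
    fast_tags = 150
    print(f"{fast_tags} fast tags x 160,000 = {fast_tags * 160_000 / 1e6:.0f}M values/day")
    ```

    So the ~160,000-VTQ tags are updating roughly every half-second, which is essentially the "subsecond" range the earlier reply warned about, and those 100-200 tags account for the bulk of the replication load.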

  • The "buffer" messages in the log are about streaming data and apply to incoming data on the node reporting it. Normally, that is for much higher data rates (e.g. >50K values/second).

    Under this lower workload I'd guess this might be a consequence of general resource constraints (e.g. high CPU elsewhere means the buffer isn't written to disk fast enough, so it fills up and triggers the message). I'd recommend using a query to get an overall sense of the resource loading (a rough stand-in is sketched at the end of this reply)--that might help you know where to make adjustments. If this is a virtualized system, also check what other workloads might be competing for those same resources. 

    The "MaxTransactionValueCount" and "MaxTransactionTagCount" are for something else. To reduce the number of tags replicating, in the OCMC just remove some of the tags from the "Replication Server-->ServerName-->Simple Replication". If you want to save the current configuration beforehand, use the "Configuration Export and Import" utility first.