How do I find the ingest timestamp on Tier 1 and Tier 2?

We work on a large System Platform solution (100k+ tags) where delays and interruptions are normal. Store-and-forward mechanisms are in place and we receive late data from time to time.

The system has two Tier 1 historians replicating to a common Tier 2 historian.

What is the best way to get the timestamp for when a data point was ingested/received/stored into Tier 1 and when it was replicated (received) on Tier 2?

  • Also as a side note this is something that could be very useful when ex. making reports to validate that every datapoint you need is actually stored in Historian before your report is generated. It kind of solves the question "is the data arrived yet" in a nice way. 

  • The system only maintains a timestamp from the source, not a timestamp of the ingestion. In cases where the source doesn't supply a timestamp, the system will apply a timestamp as close as possible to the source. For example, if a Communications Driver reads the timestamp from that PLC/RTU, that time will flow through across multiple levels of Replication; if that PLC/RTU sends the value without a timestamp, it will get a timestamp of "now" when it reaches the Communications Driver and that will be preserved.

    If you're mostly wanting to measure the latency in Replication for the "happy case" (when everything is connected and streaming data), the only mechanism I can think of is to poll a replicated tag with a known source time stamp (e.g. SysTimeSec) at the "tier 2" system repeatedly with a query like this:

    select DateTime, TagName, Value, QueryTime=getdate(), Latency=datediff(millisecond,DateTime,getdate())
    from Live where TagName like 'MyTier1.SysTimeSec'

    In my tests, "Latency" was consistently <3000 milliseconds for a tag that updates every second (so closer to 2000 milliseconds latency) in a system with no network latency and plenty of resources and very low data rates being replicated. You may have lower latency with higher data rates because buffers are sent when full or on a timer and the low data rates end up waiting for the timer. 

  • Handling the exception case isn't as straightforward (Niklas use case) because Replication "pushes" values by exception from the "tier 1", which makes it hard for the "tier 2" to detect when something wasn't pushed (does no value in 15 seconds mean the "tier 1" is offline or that the value just hasn't changed?). The only built-in instrumentation for this on the "tier 2" side are the system tags:

    • SysStatusTopicsRxData           Total number of "Topics" (sources) receiving data
    • SysStatusSFDataPending        Indicates that some HCAL clients have store-forward data to be sent to the server

    The “SysStatusSFDataPending” tag (“SF” here is “store forward”) is a digital that may be of use. Here’s a brief explanation of the state changes:

    0 = None of the connected sources have store-forward data waiting to be sent or processed
    1 = Some sources are connected which have store-forward data that is not yet available for retrieval

    In this context, a “source” is Historian component that includes store-forward capability (e.g. a System Platform “AppEngine”, a "tier 1", a Plant SCADA Connector, a Remote IDAS, an application built with the Historian SDK). It is not the actual instrumentation that feeds that source (e.g. RTU, PLC, etc.) or the I/O server. 

    An integer tag, “SysStatusTopicsRxData”, indicates how many topics are currently connected. Because the Historian is mostly a passive service for remote sources, there is no standard way to know what this number “should be”, but in a given application you may know (e.g. “there should always be 11 sources sending data”). Within ~30 seconds of a topic's disorderly disconnect or faster for an orderly one, it will decrement. Here, "topics" includes the "sources" mentioned above and the I/O server topics.

    These aren't as easy to use as we'd like for all the cases that may be important, though:

    1. Volatile Sources: If there are a large number of sources sending data in a volatile network environment, “SysStatusSFDataPending” may be “1” much of the time because there is nearly always some store-forward data being processed while connections thrash in and out of store-forward.
    2. Many Sources: If there are a large number of sources, it may not be useful to just know “some source” is offline or has pending store-forward data, because that would mean you can’t really process any data. If the sources also had a “heartbeat” tag you can use to identify which individual one is offline, you might be able to work around this, but such a heartbeat would need to be part of the application (e.g. there isn’t one built-in to Historian). In the case of Historian Replication, there are system tags on the “tier 1” which can help (SysReplicationSyncQueueItems1, SysReplicationSyncQueueValuesPerSec1), but you’ll need to replicate them to be able to make use of them on the “tier 2” and also realize they will be “stale” if the source is disconnected or there is a store-forward still in progress.
    3. When: These tags don’t directly give any indication of what data was unavailable. You may be able to infer enough from when these tags change, though. For example, if “SysStatusTopicsRxData” decremented at “10:17” and incremented at “11:09” and that “SysStatusSFDataPending” became “1” at “11:09” and “0” at “11:12”, you may have enough information. As above, this gets more complicated when there are 20 such sources going in and out of store-forward at different times.
    4. Latency: Where “#3” is less useful is when there is significant latency in the data. Using the example in “#3” above, if the data involved was stuck in an RTU for 4-hours before it made it into the Historian’s source, the actual data that corresponds to the system tag changes would be for “6:17” to “7:09” and there isn’t any information in the system which would indicate that.
  • Thank you for sharing your insights and deep understanding . I appreciate your detailed suggestions and apologize for any previous ambiguity regarding our use case concerning the "ingest time."

    Your comments align with our findings from reviewing the database schema: the system does not maintain ingestion time.

    And thanks for adding to the discussion  

    System Overview: In this system there are more than 200 sources for the two tier 1 historians combined. It is normal that some data arrives late as the System Platform solution and the underlying processes are constantly being worked on and there are several layers of SF. We see that all data eventually ends up in Tier 2, so the mechanisms in Historian to catch up works as intended with the original timestamps coming from the sensor or plc layer. There are business applications querying data from Tier 2.

    Core Concern: Our concern is not the speed of replication or the delay, it is the completeness of the data on Tier 2. We aim to accurately determine the timing of data arrival on Tier 2. I would expect “SysStatusSFDataPending” to be “1” most of the time.  As you wrote, it might be possible to infer based on certain indicators in the data, but it might probably also be cases we would miss. We will investigate more and let you know if we find a robust solution.

    Challenge in identifying late-data:
    We are investigating the following:

    • What late-data has arrived since our last query of a given time window?
    • What late-data was missed in previous X number of queries?

    We could run every query again at a later point for every past query and compare the results of the two datasets to identify the diff. In this case we would need to save all result sets for later reference, and we would potentially need an extra database to store the "book keeping". It would be very resource intensive on 100k tags when we don't know how long back we need to re-query.

    The late-data we are trying to identify could be minutes, hours or days late due to layers of SF. I would like to know "Hey, this dataset of 12-hour old data just arrived late on Tier 2". That could trigger a re-query.

    I look forward to your thoughts on these ideas and any further suggestions you might have.