LAT Advanced Topics
Data pipelines are now the preferred method for loading data into the Ocient system. For details, see Load Data (docid: qxg07ea5hv1vxat6neyg).

Addition of an LAT Instance to the Loading Path

Use these steps to add an LAT instance to the loading path; a consolidated command sketch follows the steps.

1. Stop the LAT process on all Loader Nodes using sudo systemctl stop lat.
2. Copy /opt/lat/lat-data/pipeline/pipeline.json from one of the Loader Nodes to the new node.
3. On the new node, execute this command: sudo chown lat:lat /opt/lat/lat-data/pipeline/pipeline.json
4. On the new node, execute this command: sudo chmod 644 /opt/lat/lat-data/pipeline/pipeline.json
5. Start the LAT process on all Loader Nodes using sudo systemctl start lat.
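The same procedure can be scripted. The following is a minimal sketch, not an official tool: the hostname loader2 for the new node is a placeholder, staging the file through /tmp on the new node is an assumption, and SSH access between the nodes is assumed; the file path, ownership, and permission commands come directly from the steps above.

    # Run on every Loader Node: stop the LAT process (step 1)
    sudo systemctl stop lat

    # Run from an existing Loader Node: copy the pipeline definition to a
    # staging location on the new node (step 2; "loader2" is a placeholder)
    scp /opt/lat/lat-data/pipeline/pipeline.json loader2:/tmp/pipeline.json

    # Run on the new node: move the file into place, then set owner and mode (steps 3-4)
    sudo mv /tmp/pipeline.json /opt/lat/lat-data/pipeline/pipeline.json
    sudo chown lat:lat /opt/lat/lat-data/pipeline/pipeline.json
    sudo chmod 644 /opt/lat/lat-data/pipeline/pipeline.json

    # Run on every Loader Node: start the LAT process again (step 5)
    sudo systemctl start lat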
Removal of an LAT Instance from the Loading Path

Use these steps to remove an LAT instance from the loading path.

1. Stop the LAT process on all Loader Nodes using sudo systemctl stop lat.
2. Delete /opt/lat/lat-data/pipeline/pipeline.json from the chosen node.
3. Start the LAT process on all Loader Nodes using sudo systemctl start lat.

For file loading, rebalance the load across the remaining nodes. For details, see the pipeline rebalance command in LAT Client Command Line Interface (docid: h7orjtikzcqpw0 ilkvsl).

For a load using Kafka, shut down the service on the chosen node and delete the configuration.

Dynamic Schema Changes

The loading system in Ocient is designed to support dynamic schema changes. The LAT allows data to load continuously even while database tables are altered (e.g., add column, drop column). In this way, dynamic changes can occur on the database, and the pipeline can be updated later to begin streaming the revised data through the pipeline into the revised tables.

The typical flow for updating a table is to first execute a DDL command such as "ALTER TABLE … ADD COLUMN …" and then, when it is completed, to update the LAT pipeline to include (or remove) the altered column in the transformation.

Error Sink

The LAT provides two ways to view loading, transformation, or binding errors that occur during a pipeline.

Kafka Error Topic

If a Kafka source is used, failed records can optionally be routed to a configurable error topic (see LAT Pipeline Configuration, docid: aczjrpa6a8wlrrdqa swc). This topic contains the records that fail and their failure reason or exception, and it can be used to gain insight into why a given record failed to load; a consumer sketch for inspecting the topic follows the list below.

Each entry in the topic contains the following:

- value [byte[]]: a byte array containing the record itself
- headers:
  - topic: the topic from which the record originated
  - partition: the partition from which the record originated
  - offset: the partition offset from which the record originated
  - state: the state of the record (the location where the record encountered an error)
  - exception: the exception associated with the error
  - exception message: the exception message, if it exists
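To peek at the failed records and their headers, the standard Kafka console consumer is often the quickest option. This is a minimal sketch, not part of the LAT itself: the broker address and the error topic name (lat-pipeline-errors here) are placeholders that must match your pipeline configuration, and printing headers assumes a Kafka distribution recent enough to support the print.headers formatter property.

    # Consume the configured error topic from the beginning and print each record's
    # headers (origin topic, partition, offset, state, exception) next to its value.
    kafka-console-consumer.sh \
      --bootstrap-server broker1:9092 \
      --topic lat-pipeline-errors \
      --from-beginning \
      --property print.headers=true \
      --property print.value=true

Because the value is the raw record bytes, records that were originally JSON can be piped through a tool such as jq for easier reading.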
Error Log File

If you use a file source, or if the error topic configuration is not set, the LAT sends errors to a dedicated error log file. If you use the default Log4j2 configuration, this error log can be found alongside the rest of the LAT logs in error.log and can be accessed using the LAT Client (see LAT Client Command Line Interface, docid: h7orjtikzcqpw0 ilkvsl). If you use a custom Log4j2 configuration, your appender configuration should look similar to this code:

    <RollingRandomAccessFile name="ErrorLog" append="true" fileName="logs/error.log"
                             filePattern="logs/error.%d{yyyyMMdd.HH}.%i.log.gz" immediateFlush="false">
      <PatternLayout pattern="%m%n" />
      <Policies>
        <SizeBasedTriggeringPolicy size="100 MB" />
      </Policies>
      <DefaultRolloverStrategy fileIndex="nomax">
        <Delete basePath="logs">
          <IfFileName glob="error*.log.gz" />
          <IfAccumulatedFileSize exceeds="20 GB" />
        </Delete>
      </DefaultRolloverStrategy>
    </RollingRandomAccessFile>

If the corresponding settings in the service configuration (see LAT Service Configuration, docid: qsaetjyytgqsqtwjwmvap) and the pipeline configuration (see LAT Pipeline Configuration, docid: aczjrpa6a8wlrrdqa swc) are both set to true, the error log file includes a JSON representation of the record that caused the error.

If you use a custom Log4j2 configuration, you must still use a RollingRandomAccessFileAppender named ErrorLog. Also, the PatternLayout must still be %m%n so that the errors API can retrieve errors. If you do not use RollingRandomAccessFileAppender, the LAT does not start. If the PatternLayout is not correct, the errors API does not work.

Understanding Deduplication

The LAT loads rows from data sources in an exactly-once fashion. This is made possible by row-level deduplication for LAT pipelines and the ability to replay records from a source. In short, if the same records are replayed through the LAT, there are specific scenarios that guarantee that no duplicate records are persisted into the Ocient tables. For a deeper understanding of how this works, a few key concepts are explained here:

- Partitioning data
- The durability horizon
- Deduplication scope

Partitioning Data

To deliver high-throughput loading, the LAT partitions the data source into independent sets of data. These are then loaded in parallel across all LAT instances. Each partition is considered a well-ordered sequence of rows that is replayable. Some sources, like Kafka, natively support the concept of partitions and have a native record ID as part of their protocol. Others, like a batch of files from AWS S3, require the LAT to partition the data on its own and assign a record ID. For file loading, the LAT establishes a record ID based on the sorted list of files in the file group and the row of each record within the files.

Altering the list of files in the target directory on the source system can change the record ID. This impairs the ability of the LAT to properly deduplicate rows if a file-loading pipeline is stopped and restarted.

The Durability Horizon

Within a partition, each record is assigned a unique record ID. This ID is monotonically increasing within a partition. As data is loaded through the LAT and into Ocient's page stores, the data is said to become "durable," meaning that in the event of a node shutdown, the data would be preserved on non-volatile storage. At this point the record is no longer in memory, but stored on disk in a redundant fashion. The "durability horizon" is the largest record ID that has become durable on each partition of data. If a previously loaded record is replayed, it is recognized as a duplicate of the original record and ignored. New records are loaded, and the durability horizon increases.

Deduplication Scope

Deduplication is constrained based on a few settings in the LAT pipeline configuration. Each pipeline ID is considered an independent loading task. Additionally, each topic or file group is considered an independent data set. As a result, deduplication does not apply between different pipelines or different topics and file groups, even if they are loading the same underlying files.

Records are deduplicated when all of the following are true:

- The pipeline ID matches.
- The topic name or file group name matches.

When this is true, any record with a record ID less than or equal to the current durability horizon is considered a duplicate and ignored. When the record ID has progressed higher than the durability horizon, new data begins loading into the database.

If no pipeline ID is set, the LAT Client automatically assigns a randomly generated ID, resulting in no deduplication across different pipelines. If you want deduplication between multiple pipelines, copy the pipeline ID from the previous pipeline and set it explicitly. The topics or file groups must also be the same to ensure deduplication.

When updating an existing file-load pipeline to select a different group of files by altering the start/stop time or another filter, be sure to use a new pipeline ID (best accomplished by not setting an explicit pipeline ID when creating the pipeline with the client). Otherwise, unexpected results can occur, such as rows in a new file being considered duplicates.

V1 → V2 Migration

LAT v2 modified the way the deduplication scope is calculated. In v1, only the topic name was used for scope calculation. Therefore, to maintain deduplication in a pipeline that is being upgraded from v1 to v2, the pipeline ID must be explicitly set to "" (the empty string) upon creation.

Related Links

Ingest Data with Legacy LAT Reference (docid: luijhab6vyj6g1gn5bhth)