LAT Reference
Ingest Data with Legacy LAT
Data pipelines are now the preferred method for loading data into the Ocient system. For details, see docid\ xq0tg7yph vn62uwufibu.

Loading and Transformation (LAT) Overview

This section provides an overview of how loading works in {{ocient}}, along with step-by-step examples that illustrate the key aspects of the loading process. A separate LAT reference provides a detailed explanation of the options for each data format and data source.

The key elements in the Ocient loading system are:

- Data source: an origin source for data, such as {{aws}} S3, {{kafka}}, or {{hdfs}}. Sources can be file-based or streaming in nature.
- Data type extraction: the format for data extraction from the source. Examples include JSON, CSV, and fixed-width.
- Transformations: the functions used to cleanse, route, and transform the incoming data.
- Indexer: operated by the StreamLoader role on a Loader Node, the indexer collects transformed records, stores them in replicated pages, and converts them into segments on the Foundation Nodes.

Reference:

docid 4wejuau6gpdqyii5qqtqt
docid\ uiqywt8ec9gszunelylqr
docid\ xpvlz0ewuxmgynxvxz jb
docid\ eumgc9mmid1dzmpahr9on
docid\ elwhwxe8oruff36xf4fom
docid 4gawr9v 2cqsdff9an6t

Loading Overview

The LAT is a service within an Ocient system that is responsible for fetching data from file or streaming sources, extracting records, transforming them, and routing them to Ocient tables. The LAT runs as a standalone Java® process on Loader Nodes. It runs many parallel workers on high-core-count servers to deliver the parallelism required for high-throughput loading.

This diagram shows a set of Loader Nodes that sit between the data sources and the Foundation Nodes to control the loading flow and deliver high performance across many parallel workers.

[Diagram: Loading and transformation into Foundation storage]

The LAT loads data from two general types of data sources:

- Streaming sources: continuous loading of data from an ordered stream, such as Kafka.
- File sources: a discrete batch of data loaded from files until complete, typically from a file system or an object storage solution like AWS S3.

The main difference between these modes is whether the task continues indefinitely or ends when the batch of files has been loaded. In both cases, the LAT uses identical transformation and loading processes, so users can manage both with the same mental model and configuration files. Most data types are supported identically in streaming and batch processing modes; however, some formats are not suitable for streaming due to their structure. A complete list of data sources and supported data formats is in docid 4wejuau6gpdqyii5qqtqt.

While loading, the LAT coordinates with the Ocient system to manage backpressure as new records arrive. New rows accumulate into a storage structure called a "page," which is a column-oriented storage mechanism. These pages are replicated across Foundation Nodes so that the specified redundancy is maintained in the event of a system outage. When enough pages have accumulated in a given time period (as defined by the {{timekey}} bucket), the indexer converts the pages into segment structures and places them on the Foundation Nodes for permanent storage as segment groups that span the width of the storage space. This process is transparent to analysts issuing queries: rows are federated during queries so that pages and segments are seamlessly presented in result sets.

Pipelines Overview

The process and configuration that defines how data is loaded is referred to as a "pipeline." A pipeline defines the end-to-end data flow in a loading task, including the source type (for example, Kafka), the record extraction for the data type (for example, CSV extraction), the transformations on each record (for example, JSON value extraction, data exploding, string concatenation, flattening), and where transformed data should be loaded in Ocient tables.
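To make these sections concrete, the following is a minimal sketch of a pipeline file. Its shape mirrors the four stages described above (a source, a data type extraction, per-record transformations, and a load target), but every key, value, and function name shown here is an illustrative placeholder rather than verbatim LAT syntax; see docid\ uiqywt8ec9gszunelylqr for the authoritative configuration options.

```json
{
  "_comment": "Illustrative sketch only: key names and values are placeholders, not verbatim LAT syntax.",
  "source": {
    "type": "kafka",
    "topic": "network_events"
  },
  "extract": {
    "format": "json"
  },
  "transform": {
    "event_time": "to_timestamp($ts)",
    "message": "concat($host, ':', $msg)"
  },
  "load": {
    "table": "network_events"
  }
}
```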
As shown in this diagram, a user sets each of these sections in the pipeline file to control the LAT.

[Diagram: LAT pipeline JSON configuration, with data flow from data sources to storage and processing]

LAT pipelines are managed through an HTTP API on the Loader Nodes, or through a command-line interface (CLI) that provides a convenient way to work with the HTTP API. Because most Ocient systems include many Loader Nodes, the CLI coordinates pipelines across the specified set of Loader Nodes. The LAT client issues the command to the Loader Node to create the new pipeline with the pipeline configuration file. When started, the LAT uses the pipeline to execute a highly parallelized loading task across all Loader Nodes and Foundation Nodes to load the configured tables. Detailed instructions for configuring a pipeline are found in docid\ uiqywt8ec9gszunelylqr.

Time Ordering

The Ocient Hyperscale Data Warehouse contains timeseries data. Loading in the data warehouse is considerably faster when data is presented in an ordered time sequence according to the TIMEKEY of a table. For most streaming sources, this ordering happens automatically, with limited "out of order" data, by the nature of the queue. However, for file-based sources, it is not uncommon for data to be collected in folders in a manner that can appear haphazard to the loader. For this reason, it is critical that the sort type setting for file-based sources be used correctly to inform the LAT how files should be ordered. Read more about file sorting in docid\ z8gjws65x2ybq02bns gm.

Exactly-Once Guarantees

Loading in Ocient is designed to ensure exactly-once processing of data. The LAT operates independent streams of data, each with a monotonically increasing row ID tied to an individual row in a file-based load or a partition offset in a Kafka-based load. Exactly-once processing is made possible through coordination between the loading components around a "durability horizon," which represents the highest record durably stored on non-volatile storage in each independent stream. In the event of a node outage or a replay of the data, the durability horizon is used to remove duplicate records efficiently. To ensure that data is loaded correctly, it is important to understand how the LAT determines the unique row ID for each source type; this is described in more detail in the LAT reference documentation. Learn more about the underlying approach in docid\ mryu94hffhe4j6c 8rrfk.

Dynamic Schema Changes

Ocient also supports schema changes on tables while data loads in a continuously streaming pipeline. Ocient maintains a table version with each running pipeline that serves as a loading contract until the load task completes. This ensures that existing pipelines are not interrupted when columns are added to or removed from a table that is receiving new data. Ocient continues loading records using the original table version even after a change has been made to the table. These dynamic schema changes from ALTER TABLE commands allow flexible updates to tables while complex loading processes are active. Pipelines can then be updated to add or remove data elements and match the altered schema, reducing the burden of coordination across systems.
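As an illustration, the standard SQL statement below adds a column to a table while a streaming pipeline continues to load it. The table and column names are hypothetical, and the exact ALTER TABLE syntax supported by Ocient is documented in the SQL reference.

```sql
-- Hedged sketch: the table "events" and column "device_model" are hypothetical.
-- Assumes a streaming pipeline is actively loading "events". The running
-- pipeline keeps loading against the table version it started with, so this
-- change does not interrupt it.
ALTER TABLE events ADD COLUMN device_model VARCHAR(64);

-- The pipeline can later be updated to populate device_model and match the
-- altered schema.
```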
Loading Examples

The following examples walk through a simple loading scenario using the LAT: a streaming load from Kafka and a file load from S3. These examples are very similar because the LAT uses a common language for all transformation and loading, regardless of data source or data type. The final set of examples outlines some more complex transformations.

Examples:

docid\ o2xda75bl6faotv8 6xkt
docid\ b89gu3k4ivdkaytigvohl
docid 5ewwidceufudqdvzngboa
docid 6dzgbdjottwsvyq4spr6q

For more detailed information about LAT settings, see docid\ uiqywt8ec9gszunelylqr.

Related Links

docid\ cdutjfrhb4hmidwoafvno
docid 4wejuau6gpdqyii5qqtqt