The ChaosSearch transformation capabilities (the Refinery) and the “decoupled schema” of a Refinery view mean that the ingestion pipeline can be simpler, less expensive, easier to manage, and more scalable. ChaosSearch does suggest a few best practices for file size and quantity to take full advantage of cloud-storage file management and networking.
Some file size planning can help to balance the following considerations:
- The best file sizes to hold a practical amount of data/records to store and index
- The best quantity of files to process to take advantage of parallel processing
While planning the objects for your storage buckets, keep in mind that larger files (such as 1 GB or 1 TB files) take longer to move around the network. They need more time to upload to the cloud, and more time and memory when the ChaosSearch discovery and indexing services process them. Also, a single very large file cannot take advantage of the parallel-processing capabilities of cloud-storage environments the way multiple smaller files can.
If an error occurs, such as a network connection failure, services typically need to restart the process from the beginning. For larger files, that restart also resets the “clock” on the interrupted task, so more of the work already done is lost.
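As a rough illustration of the restart cost described above, the following sketch compares transfer times under a simple model. The function names, the 100 MB/s throughput, and the failure point are all hypothetical assumptions chosen for the arithmetic; they are not ChaosSearch measurements.

```python
def transfer_seconds(size_bytes: int, bytes_per_second: float) -> float:
    """Time to move a file once at a constant (assumed) throughput."""
    return size_bytes / bytes_per_second


def seconds_with_one_restart(size_bytes: int, bytes_per_second: float,
                             failed_at_fraction: float) -> float:
    """Total time when the transfer fails once partway through and then
    restarts from the beginning: the partial attempt plus one full attempt."""
    full = transfer_seconds(size_bytes, bytes_per_second)
    return full * failed_at_fraction + full


# Hypothetical numbers: 100 MB/s throughput, failure at 90% of the transfer.
THROUGHPUT = 100 * 10**6

# One 1 GB file that fails once near the end.
one_large = seconds_with_one_restart(10**9, THROUGHPUT, 0.9)

# Ten 100 MB files sent one after another, where only one of them fails.
ten_small = (9 * transfer_seconds(10**8, THROUGHPUT)
             + seconds_with_one_restart(10**8, THROUGHPUT, 0.9))

print(f"one large file: {one_large:.1f} s, ten small files: {ten_small:.1f} s")
```

Under these assumptions the failure costs the large file almost a full extra transfer, while the small files lose only the work on the one file that failed. The small files could also be sent in parallel, which this sequential model does not capture.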
If your log and event data is written and stored as many small files, those files can be very fast to upload to storage and to read into memory for indexing. They can take better advantage of the parallel-processing environment than a smaller number of very large files.
However, small files can bring some processing disadvantages because of possible bottlenecks in the ingest pipeline. For example, AWS SQS limits the number of messages that can be pulled in one receive request. Also, small files typically result in smaller index segments, which benefit less from the compaction and performance of the ChaosSearch index.
Based on experience indexing and querying typical log data, the optimal size for data files is in the range of 10 MB when GZIP compressed (50-500 MB uncompressed), depending on the content type of the files.
- Comma-separated value (CSV) files are usually the most dense in terms of rows and indexable information.
- Log files are usually the next most-dense source type for rows and index information.
- JSON format files are sometimes the least dense in size, but the object notation format and its nested arrays of structures can carry a large amount of information. JSON files typically require some special configuration steps for indexing, as described in JSON File Processing.
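The size guidance above can be turned into a simple file-count estimate. The sketch below is only planning arithmetic: the `plan_files` helper, the 50 GB daily volume, and the 250 MB target are hypothetical values picked from within the suggested uncompressed range.

```python
MB = 1024 * 1024


def plan_files(daily_uncompressed_bytes: int, target_uncompressed_bytes: int) -> int:
    """Return how many files are needed so that no file exceeds the target size."""
    if target_uncompressed_bytes <= 0:
        raise ValueError("target size must be positive")
    # Integer ceiling division avoids floating-point rounding on large byte counts.
    count = -(-daily_uncompressed_bytes // target_uncompressed_bytes)
    return max(1, count)


# Hypothetical workload: 50 GB of raw logs per day, 250 MB per file.
files_per_day = plan_files(50 * 1024 * MB, 250 * MB)
print(files_per_day)  # 205
```

A count in the hundreds per day leaves room for parallel processing without producing a flood of tiny objects. Rerun the estimate with your own volume and target size.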
GZIP compression helps to reduce the network time to move the files, as well as the memory required for the indexing services to read and index the files.
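One way to produce files of a predictable size is to split a stream of log lines into GZIP files, each capped at a target uncompressed size. This is a minimal sketch using only the Python standard library. The `write_gzip_chunks` function, the `events` prefix, and the file naming pattern are assumptions for illustration and are not part of any ChaosSearch tooling.

```python
import gzip
from pathlib import Path
from typing import Iterable, List


def write_gzip_chunks(lines: Iterable[str], out_dir: Path, target_bytes: int,
                      prefix: str = "events") -> List[Path]:
    """Write lines into GZIP files, starting a new file before the uncompressed
    size would exceed target_bytes. Files are split only at line boundaries."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    fh = None
    written = 0  # uncompressed bytes written to the current file
    try:
        for line in lines:
            data = line.encode("utf-8")
            if not data.endswith(b"\n"):
                data += b"\n"
            # Roll over to a new file when this line would pass the target.
            if fh is not None and written + len(data) > target_bytes:
                fh.close()
                fh = None
            if fh is None:
                path = out_dir / f"{prefix}-{len(paths):05d}.log.gz"
                fh = gzip.open(path, "wb")
                paths.append(path)
                written = 0
            fh.write(data)
            written += len(data)
    finally:
        if fh is not None:
            fh.close()
    return paths
```

For example, calling `write_gzip_chunks(open("app.log", encoding="utf-8"), Path("out"), 250 * 1024 * 1024)` would split a hypothetical `app.log` into files of at most about 250 MB uncompressed each. A single line larger than the target still gets its own file, because records are never split.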
ChaosSearch has a robust library of data discovery mechanisms and built-in algorithms to quickly detect and work with many common source files and data types. Using the ChaosSearch Regex support and related customization features, you can tune the indexing and analytics to include other custom processing and data that your implementation might need.