Step 2. Define Object Groups

Create object groups to filter and select objects (files) in your cloud storage buckets for indexing.

Cloud storage and data lakes usually contain a wide variety of objects. In ChaosSearch, you define object groups as virtual filters that select the cloud-storage objects that you want to index. You can select files by a specific pathname prefix, or select files located in paths that match a regular expression.

The steps to create an object group:

  1. Select the files that you want to index in the cloud storage bucket.
  2. Specify filtering options and rules if needed.
  3. Select static (default) or live indexing (live requires a pub-sub policy).
  4. Start indexing.

After you specify the files, the object group auto-detects the file format (such as JSON, log, CSV, or Parquet) and the compression (such as GZIP or SNAPPY). ChaosSearch also supports flexible rules that let you further refine the indexing behavior for the files. You choose either static indexing for files that are already archived, or live indexing for new matching content as it is written to cloud storage.
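To make the idea of format and compression auto-detection concrete, here is a minimal Python sketch. It is not ChaosSearch's detection logic, which this topic does not describe. It assumes a simplified heuristic: GZIP is recognized by its standard two-byte magic number, Parquet by its leading `PAR1` marker, and JSON, CSV, and log content by inspecting the first line. The function names are hypothetical.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"   # standard gzip header bytes
PARQUET_MAGIC = b"PAR1"    # Parquet files begin (and end) with this marker


def detect_compression(data: bytes) -> str:
    """Return 'GZIP' when the gzip magic number is present, else 'NONE'."""
    return "GZIP" if data.startswith(GZIP_MAGIC) else "NONE"


def detect_format(data: bytes) -> str:
    """Guess a file format from raw bytes (illustrative heuristic only)."""
    if detect_compression(data) == "GZIP":
        data = gzip.decompress(data)
    if data.startswith(PARQUET_MAGIC):
        return "PARQUET"
    first_line = data.split(b"\n", 1)[0].strip()
    if first_line.startswith((b"{", b"[")):
        try:
            json.loads(first_line)
            return "JSON"
        except ValueError:
            pass  # not valid JSON; fall through to the other checks
    if b"," in first_line:
        return "CSV"
    return "LOG"
```

A real detector would sample more than one line and handle more compression types; the sketch only shows why a preview is possible even when the files are compressed.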

👍

Tips and Good Practices

As a good practice, start small: create a static object group to filter and index a small sample of files. Chaos indexing is fast, and you can easily delete sample indexed data and test object groups that you no longer want.

The sample object group and index data can help to find possible issues or considerations before indexing a wider volume of files. For example:

  • Files might have large sections of information that have no analytical value, so you could create rules to exclude that unnecessary content.
  • You can review the auto-detected fields, structure, data types, and formatting considerations found during indexing. With ChaosSearch, it is easy to create rules to include or exclude fields, change field names, override the data types of fields, and make the index more useful for data analysis requirements.

Object group testing can also help to identify subsequent Refinery materializations (that is, post-index transformations defined within the lens of the view) that might be needed to refine the columns and filters for analysis.

Also plan periodic cleanups to remove any object groups and related index data that are no longer useful. Removing early test groups and groups that are no longer used helps to avoid confusion for the analysts who create Refinery views.

Object groups can be created using the ChaosSearch console, APIs, or Terraform Provider. This topic focuses on the console steps.

Select Files to Index

To begin, click the Storage tab, then click Create Object Group. The Object Group Preview window appears:

In the Object Group Preview window, a list of cloud storage buckets appears on the left. You can navigate into a bucket to select a file, or use the Prefix filter, the Regex Filter, the advanced filter controls, or a combination of these to narrow down the files to include in the object group for indexing.
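The effect of combining a prefix with a regex filter can be sketched in a few lines of Python. This is an illustration of the filtering concept only, not the console's implementation. The function name and the sample keys are hypothetical, and the sketch assumes that the regex may match anywhere in the object key.

```python
import re


def select_objects(keys, prefix="", regex=None):
    """Return the object keys that start with `prefix` and match `regex`.

    Assumption: the regex is applied with search semantics (a match anywhere
    in the key). The actual product behavior may differ.
    """
    pattern = re.compile(regex) if regex else None
    return [
        key
        for key in keys
        if key.startswith(prefix) and (pattern is None or pattern.search(key))
    ]


# Hypothetical bucket listing
keys = [
    "logs/app/2022/a.json.gz",
    "logs/app/2022/b.csv",
    "logs/web/c.json",
    "tmp/x.json",
]
selected = select_objects(keys, prefix="logs/", regex=r"\.json(\.gz)?$")
```

Here `selected` contains only the two JSON objects under `logs/`. Testing a filter against a small listing like this before indexing mirrors the advice to start with a small sample.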

After you have identified the files for the group, click Next to display a Content Preview window.

Content Preview Window

The Content Preview window summarizes the auto-discovered format of the included files (such as CSV, LOG, JSON, PARQUET, CUSTOM, or Unknown), the compression types (such as NONE, GZIP, SNAPPY, or SNAPPY-JAVA), and displays a preview sample of the content of the selected files. ChaosSearch can provide a content preview even if the files are compressed. This allows you to stay in the window while constructing regular expressions to parse the fields for indexing.

Depending on the format of the files you selected, the window displays other options such as delimiter values and a column heading field for CSV files, or array flattening options for JSON files. For log files, a Formatted Preview area shows a more user-friendly display of the field components of the log files.

Click Schema Overrides to customize the schema of the files processed by the object group. You can override data types for one or more fields. You can also supply a JSON file of rules and processing policies that tailor how Chaos Index processes and stores the file content during indexing, such as field inclusion/exclusion rules, custom field naming, JSON Flex processing options, and similar schema-tuning controls.

Object Group Indexing Controls

After you specify the field content and controls, click Create Object Group. The final step for an object group is to name it, and to specify some indexing options.

By default, an object group runs a one-time indexing of the selected files (called a static index). Select Live indexing to automatically index new matching files after they are written to the cloud storage locations. Live indexing requires you to enter details for a storage event notification service (AWS SQS or a Google Pub/Sub Project ID, depending on the configured storage type) that sends events when new files are written to storage.
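As background on what a storage event notification carries, the following Python sketch extracts the bucket and object key from an Amazon S3 event notification message, the kind of payload that S3 can deliver to an SQS queue. It does not show how ChaosSearch consumes these events. The function name and the sample message are hypothetical. The `Records`, `eventName`, and `s3` fields follow the S3 event message structure, in which object keys are URL-encoded.

```python
import json
from urllib.parse import unquote_plus


def new_object_keys(message_body: str):
    """Return (bucket, key) pairs for object-created records in an S3 event."""
    event = json.loads(message_body)
    created = []
    for record in event.get("Records", []):
        # Skip deletions and other event types
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        # Keys arrive URL-encoded, with spaces encoded as '+'
        created.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return created


# Hypothetical message body with one created object and one deleted object
sample = json.dumps({
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "logs/app/my+file.json"}}},
        {"eventName": "ObjectRemoved:Delete",
         "s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "logs/app/old.json"}}},
    ]
})
created = new_object_keys(sample)
```

Each extracted key could then be checked against the object group's prefix and regex filters to decide whether the new file belongs to the group.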

For each object group, you can specify a retention policy to control how long to keep indexed data before it ages out and is removed. The default retention is 14 days, but you can deselect Retention Policy to keep indexed data indefinitely (no age out), or keep it selected and specify an alternative number of days or months to keep indexed data.
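The retention choices can be expressed as a small Python sketch. It assumes, for illustration only, that age is measured in whole days from a date associated with the indexed data. The topic does not state which date ChaosSearch uses, and the function name is hypothetical.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 14  # default retention stated in this topic


def is_expired(data_day: date, today: date, retention_days=DEFAULT_RETENTION_DAYS) -> bool:
    """Return True when indexed data is older than the retention period.

    Passing retention_days=None models a deselected Retention Policy,
    which keeps data indefinitely.
    """
    if retention_days is None:
        return False
    return today - data_day > timedelta(days=retention_days)
```

For example, with the default policy, data dated October 1 is still retained on October 15 (exactly 14 days old) and is eligible to age out on October 16.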

Start Indexing

After you create the group, the new group is added to the configuration, and the Storage > Properties page appears. Review the information for your new object group, and if everything looks correct, click Start Indexing to index the files. Indexing performs a deep analysis of the files specified by the object group, and includes any instructions that you specified for schema overrides and filters in the resulting indexed data.

After you start indexing an object group, the Properties tab updates to show more information about the index, and a pie chart summary of the data types for the discovered fields. When indexing is complete, the Start Indexing button changes to Restart Indexing.

Review the Properties tab for a closer look at the fields within the indexed data. The Indexed Structure list shows the name and data type of each field in the indexed data.

The Events tab lists any indexing warnings or issues to address. If any problems stopped or blocked indexing, this tab can provide more information about the problems for you or ChaosSearch Customer Success engineers to troubleshoot the indexing issues.

The Intervals tab lists the name and creation date of each daily interval, and the size in bytes of the cloud storage object files indexed for the group. By default, ChaosSearch creates one or more daily intervals with the name:

_<*object-group-name*>_<*storage-date*>_

📘

About the Daily Interval Name

The date value is in yyyy-mm-dd format and is the day on which the matching object files were written to cloud storage. For example, a daily interval named _my-app-grp_2022_10_01_ holds the indexed data for the files in the my-app-grp object group that have a cloud storage modification date of October 1, 2022. If matching files have different storage modification dates, ChaosSearch creates a daily interval for each file modification date, such as _my-app-grp_2022_10_02_, and so on.
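The grouping of files into daily intervals by modification date can be sketched in Python as follows. The function names are hypothetical, and the name format simply mirrors the example as it is printed in this topic. The exact punctuation of real interval names is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timezone


def interval_name(group: str, day) -> str:
    """Build a daily interval name that mirrors the example in this topic."""
    return f"_{group}_{day:%Y_%m_%d}_"


def group_into_intervals(group: str, objects):
    """Map interval names to object keys, one interval per modification date.

    `objects` is an iterable of (key, last_modified_datetime) pairs.
    """
    intervals = defaultdict(list)
    for key, modified in objects:
        intervals[interval_name(group, modified.date())].append(key)
    return dict(intervals)


# Hypothetical object listing with modification timestamps
objects = [
    ("a.log", datetime(2022, 10, 1, 23, 59, tzinfo=timezone.utc)),
    ("b.log", datetime(2022, 10, 2, 0, 1, tzinfo=timezone.utc)),
    ("c.log", datetime(2022, 10, 1, 8, 0, tzinfo=timezone.utc)),
]
intervals = group_into_intervals("my-app-grp", objects)
```

In this sketch, `a.log` and `c.log` fall into one interval and `b.log` into the next, even though `a.log` and `b.log` were written only two minutes apart.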

The Isolation tab lists any optional isolation keys configured for the object group. Isolation keys separate indexed data by a defined key derived from the cloud storage object pathnames. The key could be related to tenants/organizations, applications, regions, or similar relationships within the pathname. Isolation keys separate the indexed data in storage and can be used in views to filter results to only the data that matches specific key(s) defined within the view.
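As an illustration of deriving a key from an object pathname, the Python sketch below extracts a tenant name from a hypothetical path layout, `tenants/<tenant>/...`, and uses it to filter a listing. The layout, pattern, and function names are assumptions made for the example. They are not the ChaosSearch configuration syntax for isolation keys.

```python
import re

# Hypothetical layout: tenants/<tenant>/<application>/<file>
TENANT_PATTERN = r"^tenants/(?P<key>[^/]+)/"


def isolation_key(path: str, pattern: str = TENANT_PATTERN):
    """Return the key captured from the pathname, or None if it does not match."""
    match = re.match(pattern, path)
    return match.group("key") if match else None


def filter_by_key(paths, wanted_keys):
    """Keep only the paths whose derived key is in `wanted_keys`."""
    return [p for p in paths if isolation_key(p) in wanted_keys]


paths = [
    "tenants/acme/app/2022/10/01/a.log",
    "tenants/globex/app/2022/10/01/b.log",
    "shared/readme.txt",
]
acme_only = filter_by_key(paths, {"acme"})
```

The same idea applies to keys based on applications or regions: any pathname component that a pattern can capture can serve as the separating key.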

The Objects tab lists the files within the customer storage bucket that are indexed by the object group. You can review this list to confirm that all the files that you expect are indexed by the object group.

📘

Objects listing could take some time.

When an object group has a very sparse regex that matches very few S3 objects, the Objects tab listing could take a long time to find and display matching objects included in the group. In some cases, the Objects listing UI could display No Matches.

After you create and index an object group, create a Refinery view to define the content available for visualization and analytics.


What's Next

Create Refinery views to enable users to visualize and query the indexed data for one or more object groups.