JSON Log Flattening Example

In a simplified view, the CloudTrail log used in this example has a JSON format similar to the following. For brevity, this log shows the structure for a Records array with two eventVersion objects each with about 25 JSON properties, some of which are nested properties. Note that some values in this example are truncated and generalized to show sample substitute data:

{
   "Records":[
      {
         "eventVersion":"1.05",
         "userIdentity":{
            "type":"Root",
            "principalId":"EXAMPLEPRINC",
            "Arn":"arn:aws:iam::EXAMPLE:root",
            "accountId":"ACCOUNT",
            "accessKeyId":"KEYID",
            "sessionContext":{
               "attributes":{
                  "mfaAuthenticated":"true",
                  "creationDate":"2018-08-13T11:40:45Z"
               }
            }
         },
         "eventTime":"2018-08-13T16:40:21Z",
         "eventSource":"cloudtrail.amazonaws.com",
         "eventName":"DescribeTrails",
         "awsRegion":"ap-northeast-1",
         "sourceIPAddress":"10.1.207.221",
         "userAgent":"console.amazonaws.com",
         "requestParameters":{
            "trailNameList":[
            ],
            "includeShadowTrails":false
         },
         "responseElements":null,
         "requestID":"b4c3f3d6EXAMPLE",
         "eventID":"EVENT",
         "eventType":"AwsApiCall",
         "recipientAccountId":"ACCOUNT"
      },
      {
         "eventVersion":"1.05",
         "userIdentity":{
            "type":"Root",
            "principal Id":"EXAMPLEPRINC",
            "arn":"arn:aws:iam::EXAMPLE:root",
            "accountId":"ACCOUNT",
            "accessKeyId":"KEYID",
            "sessionContext":{
               "attributes":{
                  "mfaAuthenticated":"true",
                  "creationDate":"2018-08-13T11:40:45Z"
               }
            }
         },
         "eventTime":"2018-08-13T16:40:22Z",
         "eventSource":"cloudtrail.amazonaws.com",
         "eventName":"GetTrailStatus",
         "awsRegion":"ap-northeast-1",
         "sourceIPAddress":"10.1.207.221",
         "userAgent":"console.amazonaws.com",
         "requestParameters":{
            "name":"arn:aws:cloudtrail:us-east-1:EXAMPLE:trail/ch-cloudtrail"
         },
         "responseElements":null,
         "requestID":"b4c3f3d6EXAMPLE",
         "eventID":"EVENT",
         "eventType":"AwsApiCall",
         "recipientAccountId":"ACCOUNT"
      }
   ]
}

📘
This is an intentionally tiny example for the help.
JSON log files could have hundreds of array objects with hundreds of properties, multidimensional arrays, and layers of nested properties.

The following sections describe how this sample file flattens, and the resulting object group fields (and view columns).

Horizontal Expansion Behavior

With a horizontal expansion for the Records array, this sample array and any constituent properties and nested arrays/objects compact to a structure of 1 row that has 41 JSON property columns. Note that some of the nested properties flatten to one dot-notation property.

For each Records array member in the file, the flattened columns include a numeric ID (0, 1, 2, and so on) to identify which Records array object they are associated with. Records.0.eventVersion contains the version data for the first Records array object, while Records.1.eventVersion contains the version data for the second Records array object, and so on. For example, some of the flattened column names appear as follows:

Records.0.eventVersion
Records.0.userIdentity.type
Records.0.userIdentity.principalId
Records.0.userIdentity.arn
Records.0.userIdentity.sessionContext.attributes.mfaAuthenticated
Records.0.userIdentity.sessionContext.attributes.creationDate
Records.0.eventTime
Records.0.eventSource
Records.0.eventName
…
Records.1.eventVersion
Records.1.userIdentity.type
Records.1.userIdentity.principalId
Records.1.userIdentity.arn
Records.1.userIdentity.sessionContext.attributes.mfaAuthenticated
Records.1.userIdentity.sessionContext.attributes.creationDate
Records.1.eventTime
Records.1.eventSource
Records.1.eventName
…

As this list shows, when all of these individual horizontal columns render in a Refinery view, the filter list could be very long, with many multiples of Record.x.eventVersion, and all the other columns. Due to its very separated nature, this type of horizontal column list is not very useful as search filter values or for aggregated analytics.

👍
Horizontal can be useful for storage optimization.
In some cases, horizontal flattening might be the recommended option for some types of JSON files to take advantage of better storage compression for the indexes. See Effects on Storage Space.

Vertical Expansion Behavior

In contrast, if the sample JSON file is vertically expanded with an unlimited array flattening depth, the resulting JSON structure has 2 rows, one for each Records array object, and each row has columns comprised of the properties within that array. In the sample CloudTrail log, the JSON structure flattens to 2 rows that have 20 or 21 columns (the objects differ in the values specified for some nested properties). For example, some of the vertical flattened column names appear as follows:

Records.eventVersion
Records.userIdentity.type
Records.userIdentity.principalId
Records.userIdentity.arn
Records.userIdentity.sessionContext.attributes.mfaAuthenticated
Records.userIdentity.sessionContext.attributes.creationDate
Records.eventTime
Records.eventSource
Records.eventName
Records[0]
...

With vertical flattening, there is one Records.eventVersion column that has two rows -- row 1 has all the values for array object 1, and row 2 has all the values from array object 2. Some column values could be null because an array object does not have a value for that property. The vertical column design with rows of the union of columns and their values organized under each column makes this format much better suited for use in search filters and aggregation analytics.

In a vertical expansion, note the Records[0] column, which holds some metadata for the Records array objects. JSON files with multiple arrays have multiple array_name[x] fields, where x is usually 0 but could increment for multidimensional arrays. The array_name[x] fields are not useful for filtering or analytics; they exist for internal array recordkeeping. You can ignore these array_name[x] metadata fields for reporting. (In the future these array[x] metadata fields might be hidden.)

Effects on Storage Space

For complex, nested JSON files, the index for a file that is expanded horizontally is usually a fraction of the size of the index for the same file when it is expanded vertically. The format and size of the original JSON file has an impact on the storage size differences. In some cases such as very small JSON files, or for some types of JSON structure, the storage difference for horizontal and vertical expansion might be negligible. In other cases, the differences could be significant.

It is important to test with sample files that are representative of the data shape and size that you expect to see in the live environment to review the benefits of the various expansion and flattening depth options. With JSON files and nested properties and arrays, it is often the case that one solution size will not fit all, nor even most of the indexing needs.

📘
When horizontal storage is the choice, you can transform vertically for reporting.
For JSON arrays that you flatten horizontally for storage optimization, ChaosSearch offers the JSON Array Transformation feature, which can virtually transform one or more important horizontal arrays to make them behave like vertically flattened arrays for the Search Analytics advantages.