The challenge in indexing JSON files is efficiently mapping their often deeply nested objects, properties, and arrays into a two-dimensional representation such as a relational table. This process is referred to as flattening the JSON structure. Indexing stores the JSON attributes and values in a format that resembles a table of rows and columns. Depending on the complexity of the JSON structure, the resulting indexed data could have a very large number of rows and fields, and could require a large amount of storage to hold the flattened data. This effect is often referred to as the JSON permutation explosion.
Tests with JSON log files from some common application services show that a single highly nested JSON record can flatten to millions of indexed rows, or to one row with millions of columns. Some columns can also be very wide when they hold a nested object stored as a contiguous JSON string that contains its native JSON properties.
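To make the two outcomes concrete, the following is a minimal Python sketch of both flattening styles. It is an illustration only, not how any particular indexing product works: the function names `flatten_to_rows` and `flatten_to_columns`, the dotted column names, and the sample record are all assumptions made for this example. Row-style flattening emits one row per array element, so sibling arrays multiply; column-style flattening gives every array position its own column.

```python
from itertools import product


def flatten_to_rows(value, prefix=""):
    """Return a list of flat dicts (rows) for a nested JSON value.

    Objects combine the rows of their properties; arrays emit one row per
    element. Sibling arrays therefore multiply the row count.
    """
    if isinstance(value, dict):
        per_key = [
            flatten_to_rows(child, f"{prefix}.{key}" if prefix else key)
            for key, child in value.items()
        ]
        rows = []
        # Cross product: every row of one property pairs with every row of the others.
        for combo in product(*per_key):
            row = {}
            for part in combo:
                row.update(part)
            rows.append(row)
        return rows
    if isinstance(value, list):
        rows = []
        for item in value:
            rows.extend(flatten_to_rows(item, prefix))
        return rows or [{}]  # an empty array contributes no columns
    return [{prefix: value}]


def flatten_to_columns(value, prefix=""):
    """Return one flat dict (a single row) with a column per leaf value."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # array positions become part of the column name
    else:
        return {prefix: value}
    columns = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        columns.update(flatten_to_columns(child, path))
    return columns


# Hypothetical sample record: one scalar, a 3-element array, a 2-element array of objects.
record = {
    "host": "web-1",
    "tags": ["a", "b", "c"],
    "events": [{"code": 1, "msg": "x"}, {"code": 2, "msg": "y"}],
}

rows = flatten_to_rows(record)        # 3 tags x 2 events = 6 rows
columns = flatten_to_columns(record)  # 1 + 3 + (2 x 2) = 8 columns in one row
```

Even in this small record, two short sibling arrays turn one record into six rows. With more arrays, longer arrays, or deeper nesting, the row count grows as the product of the array lengths, which is the explosion described above.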
The storage requirements and the expansion problem lead some administrators to try alternatives such as excluding complex nested arrays from the source files. This approach can be expensive in time: planning the change, reworking the pipeline, and starting over if the resulting subset of data is still too deeply nested or if the removal strips out content that is valuable for analysis.
Another alternative is to use expansion techniques that convert deeply nested objects or arrays into a string blob of the native JSON properties. This helps to minimize the resulting count of rows and columns, but it can also limit how the data can be used in analytics. JSON blobs are strings that can be searched with SQL-style string matching (for example, LIKE), but the fields inside the blob are not available for finer-grained uses such as field or column filters, which are typically helpful for visualizations and more complex queries.
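The trade-off can be illustrated with a small Python sketch. The records, the column name `request.user.id`, and the search string are hypothetical and chosen only to show the difference between a text match on a blob and a filter on a flattened field.

```python
import json

# Two hypothetical log records that differ only in a nested user id.
records = [
    {"service": "auth", "request": {"user": {"id": 42}}},
    {"service": "auth", "request": {"user": {"id": 420}}},
]

# Blob approach: the nested object is stored as one JSON string column.
blob_rows = [
    {"service": r["service"], "request": json.dumps(r["request"])}
    for r in records
]
# A substring match finds id 42, but also matches id 420.
blob_hits = [row for row in blob_rows if '"id": 42' in row["request"]]

# Flattened approach: the nested value becomes its own typed column.
flat_rows = [
    {"service": r["service"], "request.user.id": r["request"]["user"]["id"]}
    for r in records
]
# An exact filter on the column matches only id 42.
flat_hits = [row for row in flat_rows if row["request.user.id"] == 42]
```

Here the blob search returns both records, while the column filter returns one. The blob keeps the table narrow, but exact filters and aggregations on nested values would require parsing the string at query time.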
Teams that want to index and analyze the data in JSON files, especially ones with complex nested structures and arrays, have some important planning and cost-value considerations, such as:
- How much to flatten JSON content with deep nested objects and arrays to make valuable analysis possible, while minimizing the potential storage explosion of indexed rows or indexed fields
- Whether to treat the JSON record as a native JSON blob for text/string searches and proprietary APIs, or to use a tiered flattening that makes some (or all) JSON objects available for filtering and aggregation analytics
- How much time might be needed to rework and rerun data pipelines, either to remove overly complicated or undesired data from the source JSON files and reduce the expansion impact, or to add data back in to recover content needed for analysis
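One way to picture the tiered option in the considerations above is a depth limit: objects are flattened into columns down to a chosen level, and anything deeper, along with every array, is kept as a JSON string blob. The Python sketch below is only an assumption-laden illustration of that idea, not a description of any product's behavior; `flatten_tiered`, `max_depth`, and the sample record are hypothetical names.

```python
import json


def flatten_tiered(value, max_depth, prefix="", depth=0):
    """Flatten nested objects into dotted columns down to max_depth.

    Objects below max_depth, and all arrays, are stored as JSON strings.
    """
    if isinstance(value, dict) and depth < max_depth:
        columns = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            columns.update(flatten_tiered(child, max_depth, path, depth + 1))
        return columns
    if isinstance(value, (dict, list)):
        return {prefix: json.dumps(value)}  # too deep, or an array: keep as a blob
    return {prefix: value}


# Hypothetical sample record with three levels of nesting.
record = {
    "host": "web-1",
    "request": {
        "method": "GET",
        "headers": {"accept": "*/*", "cookies": ["a=1", "b=2"]},
    },
}

shallow = flatten_tiered(record, max_depth=1)  # columns: host, request (blob)
middle = flatten_tiered(record, max_depth=2)   # adds request.method; headers stays a blob
deep = flatten_tiered(record, max_depth=3)     # headers split; cookies array stays a blob
```

Raising `max_depth` makes more fields available for filtering and aggregation at the cost of more columns, so the limit acts as a single knob for the storage-versus-analysis trade-off.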
When rich JSON files are in the indexing mix, the complexity of these questions, together with the effort of avoiding the JSON permutation explosion and tuning data pipelines, can often cause JSON indexing plans to be put on hold.
ChaosSearch offers an easier solution for simplifying the indexing and analysis of complex JSON files—JSON Flex.