The challenge in indexing JSON files is efficiently converting complex, nested JSON objects and arrays into a two-dimensional representation, like a relational table. This conversion is called flattening: the indexing process maps the JSON attributes and values into a format that resembles a table of rows and columns. Depending on the complexity of the JSON structure, the indexed data can require significant storage resources to hold the flattened results, a problem often referred to as the JSON permutation explosion.
Tests with JSON log files from some common services show that a highly nested JSON record could flatten to millions of indexed rows, or to one row with millions of columns—and some columns could be very wide if they hold nested array objects that were flattened to a string of native JSON properties.
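To see where the row explosion comes from, the sketch below is a minimal, illustrative flattener (not the ChaosSearch implementation): nested objects extend the column name, and arrays multiply the row count, so independent arrays in one record combine as a cartesian product.

```python
import itertools

def flatten(obj, prefix=""):
    """Flatten a nested JSON value into a list of flat row dicts.

    Nested objects extend the column name with a dot; each array
    element becomes its own row, so sibling arrays cross-multiply
    (the behavior behind the "JSON permutation explosion").
    """
    if isinstance(obj, dict):
        # Flatten each key's value, then cross-product the per-key row sets.
        per_key = []
        for key, value in obj.items():
            name = f"{prefix}.{key}" if prefix else key
            per_key.append(flatten(value, name))
        rows = [{}]
        for key_rows in per_key:
            rows = [{**r, **kr} for r, kr in itertools.product(rows, key_rows)]
        return rows
    if isinstance(obj, list):
        # One output row per array element.
        rows = []
        for item in obj:
            rows.extend(flatten(item, prefix))
        return rows or [{prefix: None}]
    return [{prefix: obj}]

# Hypothetical log record: two independent arrays of sizes 2 and 2.
record = {
    "host": "web-1",
    "tags": ["prod", "eu"],
    "requests": [
        {"path": "/a", "status": 200},
        {"path": "/b", "status": 404},
    ],
}
rows = flatten(record)
print(len(rows))  # 1 host x 2 tags x 2 requests = 4 rows
```

With a few more arrays of modest size in one record, the product of their lengths quickly reaches the millions of rows described above.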
The storage requirements of the JSON expansion cause some administrators to try alternatives, such as excluding nested arrays from the source files to index a smaller subset of the information. This can be expensive in terms of the time to plan, rework the pipeline, and start over if the resulting data is not small enough, or not useful enough for analysis. Another alternative is to use expansion techniques that convert deeply nested arrays into a blob of native JSON properties, which minimizes storage needs but also limits analytic value. JSON blobs can be searched with SQL string-style matches, but they do not support the finer-grained attribute operations that are typically needed for visualizations and more complex queries.
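The blob trade-off can be sketched in a few lines (an illustrative example with a hypothetical record, not a ChaosSearch API): serializing the nested array into one string column avoids the row explosion, but leaves only substring-style matching, analogous to a SQL `LIKE` filter.

```python
import json

# Hypothetical source record with a nested array.
record = {
    "host": "web-1",
    "requests": [
        {"path": "/a", "status": 200},
        {"path": "/b", "status": 404},
    ],
}

# Blob approach: keep the nested array as one JSON string column.
# The record stays a single row, so storage does not explode.
row = {
    "host": record["host"],
    "requests": json.dumps(record["requests"]),
}

# String-style matching still works, much like SQL:
#   WHERE requests LIKE '%"status": 404%'
print('"status": 404' in row["requests"])  # True

# But attribute-level operations (filter by requests.status, average
# latency per path, and so on) would require re-parsing the blob at
# query time, which is what flattened, indexed columns avoid.
```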
Teams that want to index and analyze the data in JSON files, especially ones with complex nested structures and arrays, have some important planning and cost-value considerations, such as:
- How much to flatten JSON content with deep nested arrays to make valuable analysis possible, while minimizing the potential storage explosion of indexed rows
- Whether to treat the JSON record as a native JSON blob for very simple string searches and proprietary APIs, or to use a tiered flattening that makes some (or all) JSON objects available for deeper filtering and aggregations
- The time required to rework and re-run data pipelines to remove overly complicated or unwanted data from the source JSON files and reduce the expansion impact, or to add data back in to find the right values for analysis
When rich JSON files are in the indexing mix, the complexity of these questions, and the effort of avoiding the JSON explosion and tuning data pipelines, can often cause JSON indexing plans to be put on hold.
ChaosSearch offers a solution for simplifying the indexing and analysis of complex JSON files—JSON Flex.