Slashing data analysis costs by reducing data

Written by Ian Tinney

July 30, 2021

Having the ability to route the same data to multiple destinations is important, but only if you can transform it into the format it needs to arrive in, as Ian Tinney explains.

To get the most from your data, you’ll need to draw it from multiple sources and send it to a variety of destinations, often using different data formats. You may also need to alter the data along the way and certainly, you will want to do all of this as fast, as simply and as cost-effectively as possible.

Transforming your data is, therefore, a must to make it intelligible but also to help you reduce the total cost of ownership (TCO), protect sensitive data, meet your compliance requirements, and streamline your data processing.

Lowering TCO

One of the challenges faced by the modern enterprise is monetising its existing investment in legacy systems while moving to a cloud-based architecture. You’ll want to be able to span both worlds by sending the data from your on-premises solutions to cutting-edge cloud applications. This causes issues because these legacy systems often use verbose formats, such as XML, making it difficult to then transfer data to cloud-based systems, which often prefer more efficient formats like JSON or metrics.

What’s needed is some method of transforming the data in real time and inflight as it passes between the source and destination tools and systems. Using an event stream processor (ESP) such as Cribl LogStream, you can take the data generated by your legacy systems and convert it into a format that the destination tool is capable of reading. This lets you take advantage of emerging cloud technologies and extends the life of your legacy systems, effectively reducing TCO.
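To make the idea concrete, here is a minimal Python sketch of converting a verbose XML event into compact JSON. It is an illustration only, not how Cribl LogStream works internally; the event layout, the field names and the `xml_event_to_json` helper are all assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET


def xml_event_to_json(xml_text: str) -> str:
    """Convert a flat XML event into a compact JSON string.

    Assumes (for this sketch) one level of child elements, each
    holding a single text value.
    """
    root = ET.fromstring(xml_text)
    event = {child.tag: (child.text or "").strip() for child in root}
    # Compact separators avoid padding the output with spaces.
    return json.dumps(event, separators=(",", ":"))


# Hypothetical legacy event, invented for illustration.
legacy_event = "<event><host>web01</host><status>200</status><bytes>512</bytes></event>"
compact = xml_event_to_json(legacy_event)
print(len(legacy_event), "characters in,", len(compact), "characters out")
```

The saving comes from dropping the closing tags that XML repeats for every field; real events with attributes or nesting would need a fuller mapping.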

Shaping data

There are also other ways to save costs by transforming or shaping your data. You can compress it, for example, to limit its volume. Let’s say you need to write the data in your analytics platform to object storage like S3 or Azure Blob. By compressing it in transit, you can store the data at a fraction of its original size, ensuring you only pay for the storage you need.
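As a rough sketch of what compressing before storage looks like, the Python below gzips a batch of newline-delimited JSON events before they would be written to object storage. The sample events and the `compress_events` helper are assumptions for illustration, and the upload step itself is left out.

```python
import gzip
import json


def compress_events(events: list) -> bytes:
    """Serialise events as newline-delimited JSON and gzip the result."""
    payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    return gzip.compress(payload)


# Hypothetical, highly repetitive log events; real data will compress
# less dramatically than this.
events = [{"host": "web01", "status": 200, "path": "/index.html"} for _ in range(1000)]
raw = "\n".join(json.dumps(e) for e in events).encode("utf-8")
compressed = compress_events(events)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

The ratio you actually achieve depends on how repetitive your logs are, so it is worth measuring on a sample of your own data.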

In addition to transforming data for efficiency purposes, you may also need to do so to meet a business need or a compliance dictum.

Log data can be highly sensitive so will often need to be encrypted with role-based access privileges applied to determine who can decrypt. In some cases, data may even need to be redacted in real-time. Compliance regulations, such as those relating to data privacy (GDPR, Schrems II, GoBD etc) or financial transactions (PCI), often require personally identifiable information (PII) to be encrypted or obfuscated and to be sent in a certain format.

We recently worked with a large retailer that needed to ensure customer data was dynamically recognised and encrypted. Using the Cribl LogStream solution, we ensured the data was suitably shaped before being sent to their data analysis platform where it could be decrypted by those with the appropriate access privileges.
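The simplest form of this kind of shaping is masking. The Python sketch below obfuscates anything that looks like a payment card number in a log line, keeping only the last four digits. It is an assumption-laden illustration rather than the approach used in the project above: the pattern, the `mask_card_numbers` helper and the sample line are invented, and a production rule would need validation (for example a Luhn check) to avoid false matches.

```python
import re

# Matches 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def mask_card_numbers(line: str) -> str:
    """Replace all but the last four digits of card-like numbers with '*'."""

    def _mask(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "*" * (len(digits) - 4) + digits[-4:]

    return CARD_PATTERN.sub(_mask, line)


# Hypothetical log line using a well-known test card number.
masked = mask_card_numbers("order=991 card=4111 1111 1111 1111 status=ok")
print(masked)  # order=991 card=************1111 status=ok
```

Masking is one-way. Where authorised users must recover the original value, as in the retailer example, the field would be encrypted instead, with the keys held behind role-based access.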

Simplifying data

It’s also possible to shape data to make it easier for the data analysis platform to assimilate, thereby reducing complexity and saving processing time. For example, Amazon Kinesis Data Firehose can send data to Splunk and in the event of failure, Firehose will send events to a backup S3 bucket.

When this happens, the original event data is placed in a base64-encoded field, a process that Splunk then needs to reverse. That’s horribly complicated. Why not just make all the data accessible to Splunk? So that’s just what we did, using the Cribl LogStream solution to reformat the failed Firehose data back to its original format, inflight before it got to Splunk.
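In outline, undoing that encoding is a one-step decode per record. The Python sketch below assumes each line of the backup object is a JSON record whose base64-encoded payload sits in a field called `rawData`; treat that field name, the sample record and the `restore_failed_record` helper as assumptions to check against your own backup files.

```python
import base64
import json


def restore_failed_record(line: str) -> str:
    """Return the original event text from one backed-up failure record."""
    record = json.loads(line)
    # "rawData" is the assumed name of the base64-encoded field.
    return base64.b64decode(record["rawData"]).decode("utf-8")


# Build a hypothetical backup record around a sample event.
original = '{"host": "web01", "status": 500}'
backup_line = json.dumps(
    {"rawData": base64.b64encode(original.encode("utf-8")).decode("ascii")}
)
restored = restore_failed_record(backup_line)
print(restored)
```

Doing this before the data reaches the analysis platform means searches there see the event exactly as it would have arrived had delivery succeeded.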

You can also simplify events that have a deeply nested JSON structure, or split a multi-line XML, JSON or other event into separate, smaller events.
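Flattening is the most common way to simplify a nested event. The short Python sketch below turns nested keys into dotted field names; the `flatten` helper and the sample event are hypothetical, and lists are passed through untouched to keep the example small.

```python
def flatten(obj: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


nested = {"user": {"id": 7, "geo": {"country": "GB"}}, "action": "login"}
flat = flatten(nested)
print(flat)  # {'user.id': 7, 'user.geo.country': 'GB', 'action': 'login'}
```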

Similarly, it’s possible to add metadata to your log records by enriching them with third-party data, or to transform data from your existing sources into multiple schemas.

You can add context to data by adding look-up values like IP location, CMDB, or threat feed info, all achieved in-flight between the sources and destinations of your data.
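A look-up of this kind boils down to matching a field against a reference table and attaching the result. The Python sketch below adds a location to each event based on its source IP; the subnets, site names, field names and the `enrich` helper are all invented for illustration, and a real deployment would load the table from a CMDB export or a GeoIP database.

```python
import ipaddress

# Hypothetical look-up table mapping internal subnets to sites.
LOCATION_LOOKUP = {
    ipaddress.ip_network("10.1.0.0/16"): "London",
    ipaddress.ip_network("10.2.0.0/16"): "Manchester",
}


def enrich(event: dict) -> dict:
    """Return a copy of the event with a 'src_location' field added."""
    ip = ipaddress.ip_address(event["src_ip"])
    location = next(
        (site for network, site in LOCATION_LOOKUP.items() if ip in network),
        "unknown",
    )
    enriched = dict(event)  # leave the caller's event untouched
    enriched["src_location"] = location
    return enriched


enriched = enrich({"src_ip": "10.2.14.9", "action": "login"})
print(enriched)
```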

Six ways to save


There are six quick ways to cull data volumes that will instantly save you money:

  1. Filtering events – Filter out whole events that come from specific sources or match prescribed conditions, and drop any events that do not contain useful data.

  2. Filtering fields – Drop any fields based on their name or values (i.e. value=“null”, “n/a”, or “-”), unwanted headers or footers, or those that are simply empty.

  3. Deduplication – Identify duplicate feeds or events, summarise the quantity of similar events and only supply the information once to your analysis platform. This can reduce data ingest by a staggering 93%.

  4. Throttling – Sometimes we get a spike of data which needs to be tamed. By throttling the data source, you can protect the destination from becoming overloaded, avoiding bill shock in the analysis platform.

  5. Convert to metrics – If the events you are collecting contain mostly numeric values you can convert these events to metrics inflight. For example, if each event has a series of header fields and one or more counters but we only require the counters to identify trends, we can turn these events into much smaller metrics in the pipeline. These metrics are far less verbose than raw events.

  6. Sampling – If you only need to get a sense of the current situation (i.e. a statistical analysis) you can take a representative sample of the data and only send aggregated statistics rather than ingesting all the data.
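Several of the techniques in the list above can be sketched in a few lines of Python. The example below drops whole events, strips empty fields, collapses duplicates into a single event with a count, and takes a simple 1-in-N sample. The field names, the rule of dropping ‘start’ events and the helper functions are assumptions chosen for illustration, not LogStream configuration.

```python
from collections import Counter

# Values treated as "empty" when filtering fields (scalar values assumed).
EMPTY_VALUES = {None, "", "null", "n/a", "-"}


def drop_empty_fields(event: dict) -> dict:
    """Filtering fields: remove fields whose value carries no information."""
    return {k: v for k, v in event.items() if v not in EMPTY_VALUES}


def reduce_events(events: list) -> list:
    """Filter events, strip empty fields, then deduplicate with a count."""
    counts = Counter()
    for event in events:
        # Filtering events: hypothetical rule that 'start' events are noise.
        if event.get("action") == "start":
            continue
        slim = drop_empty_fields(event)
        # Deduplication: identical events share one hashable key.
        counts[tuple(sorted(slim.items()))] += 1
    return [dict(key, count=n) for key, n in counts.items()]


def sample(events: list, every: int) -> list:
    """Sampling: keep one event in every `every` (deterministic sketch)."""
    return events[::every]


events = [
    {"action": "start", "src": "10.0.0.1", "note": "-"},
    {"action": "end", "src": "10.0.0.1", "bytes": 512, "note": "-"},
    {"action": "end", "src": "10.0.0.1", "bytes": 512, "note": "n/a"},
    {"action": "end", "src": "10.0.0.2", "bytes": 128, "note": ""},
]
reduced = reduce_events(events)
print(len(events), "events in,", len(reduced), "events out")  # 4 events in, 2 events out
```

Here four raw events become two, one of which carries `count=2` so the analysis platform still knows how often it occurred. How much you save in practice depends entirely on how noisy and repetitive your sources are.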

These processes can be set up in minutes using the Cribl LogStream solution, which routes and parses the data inflight, ensuring the original is sent to storage and the stripped-down data to the analysis platform.

The results can be instantaneous. On a recent project, the business wanted to optimise its use of Splunk and process more data but without increasing licensing costs. Having experienced rapid growth, the business was also retaining indexed logs on the platform despite the fact it was only utilising the last 24 hours of data. By dropping the ‘start’ firewall log events, which contained no information, and retaining the ‘end’ log events, sampling events using specific filters, and trimming Windows log event descriptions, the business reduced its firewall logs by 62.5%. Overall, the measures put in place saved the business £55,000 on Splunk licensing costs in just an hour.

Using a single tool to reduce your data also confers benefits because you no longer need to use and maintain multiple log forwarders. Consolidating these reduction methods gives you one view of the dataflow, allowing you to ingest, pre-process and forward rich data to the analysis platform of your choice. 

To discover how others have saved money by reducing their data, see our Splunk and Cribl LogStream datasheet. Or, to find out the kind of cost savings you can expect to make using your processes, contact us for a one-to-one consultation.
