There are numerous ways to control your spending on data analysis. Ian Tinney explores the options and how you can apply them in the real world.

Businesses produce and use zettabytes of data every year. To ensure your data analysis platform doesn’t become a very expensive storage space, you need to control data ingestion without throttling it. That means you need visibility of the data in flight, so you can observe and control it before it reaches an expensive analysis platform.

Data is spread across a variety of systems, and agents are often used to get it from its source to an analysis platform. Agents take care of some of the difficult tasks associated with reading log data, such as file locking and seek pointers, and they encrypt and compress data for transmission to the destination. But agents are not designed to perform complex routing or transformation tasks on the data.

A stream processor is able to sit between the source and destination and perform a series of useful functions that can deliver important benefits.

By looking at data between its source and destination, we can determine its nature and its value, and therefore where it needs to be sent. That may be more than one destination, because IT and business teams often want to use the same datasets for different purposes.

Businesses therefore typically control their data analysis costs by choosing to…

  • Route it to different technology platform destinations for a variety of business purposes, without the need for additional agents

  • Reduce it by dropping, sampling or suppressing events to make sure they are only consuming the data they need

  • Transform it in flight by shaping it into different formats supporting different business requirements

Let’s take a more detailed look at each method and what it involves in practice.

Routing data

Data analysis platforms are expensive to use, so it makes sense to separate data based on whether you need to analyse it or simply retain it. The destination will vary depending on what the data is needed for: a real-time system, a batch analytics store, or an archive to meet a compliance mandate that obliges you to keep data for a set number of years. You may also need to send the same datasets to several destinations for different purposes.

Say you want to keep your data online for analysis for 90 days while also routing it to an Amazon S3 bucket for cheaper, long-term storage. Later, you need to retrieve it to investigate a security incident. You’ve already deleted the data from the analysis platform to save costs, but because the original data is archived in S3 you can replay and analyse it at any time. Reducing data retention on fast, expensive storage in this way can lead to significant cost savings (we have seen one company save 93% on its storage costs this way).

Or perhaps you need to move some or all of your data to a different data analysis platform that requires data in a different format (e.g. JSON). You can route the selected data to either or both platforms, changing the format for each platform.

Or, if your current cloud provider has increased their prices to the point where you feel you have no alternative but to move to another provider, you could use this routing functionality to send a second copy of your data to the new provider, allowing you to migrate by moving one service at a time.
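
The routing idea above can be sketched in a few lines of Python. This is only an illustration and not how any particular product implements routing: the destination names ("archive", "analysis") and the event fields ("sourcetype", "level") are assumptions made up for the example.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]

# Hypothetical destinations and filters; real platforms, buckets and
# field names will differ.
ROUTES: List[Tuple[str, Callable[[Event], bool]]] = [
    # Every event is copied to cheap object storage for long-term retention.
    ("archive", lambda e: True),
    # Only security or error events go to the costly analysis platform.
    ("analysis", lambda e: e.get("sourcetype") == "security"
                           or e.get("level") == "error"),
]


def route(event: Event) -> List[str]:
    """Return the names of all destinations whose filter matches the event."""
    return [name for name, matches in ROUTES if matches(event)]
```

Because an event can match more than one route, the same data can be archived in full while only the valuable subset is sent on for analysis. Migrating to a new provider would amount to adding one more entry to the list.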

Reducing data

Data analysis tools often charge by the volume of data you use, by the number of events per second (eps) or, more recently, by infrastructure or workload. With most of these charging methods, the more you use, the higher the licence and storage costs will be. Reducing the volume of data being passed to the data analysis tool can therefore reap significant cost savings. Businesses can instantly reduce daily ingest volumes by 25% or even 50% when dropping, sampling or suppressing events.

There are numerous ways to reduce data. The quickest win is to route a reduced dataset to the analysis platform while sending the full dataset to object storage for long-term retention. Other options include filtering out unnecessary events, dropping fields based on their name or value (e.g. value=null), de-duplicating feeds or events, and removing unwanted headers or footers from events. It’s also possible to extract data from events and turn it into metrics in flight, or to use a representative sample of the data.

For example, in data where field values are “null”, removing both the key and the value can save a great deal of space. The same is true of header fields: if we are analysing telemetry data as events and only require the counters, we can remove the headers completely. Or perhaps the logging platforms are generating duplicates; by suppressing identical events you can reduce data ingestion significantly.
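
Two of these reductions, dropping null fields and suppressing identical events, could look something like the Python sketch below. The function names are hypothetical, and events are assumed to be plain dictionaries that can be serialised to JSON.

```python
import json
from typing import Dict, Iterable, Iterator


def drop_nulls(event: Dict[str, object]) -> Dict[str, object]:
    """Remove keys whose value is None or the literal string "null"."""
    return {k: v for k, v in event.items() if v is not None and v != "null"}


def suppress_duplicates(events: Iterable[dict]) -> Iterator[dict]:
    """Yield each distinct event once, skipping exact repeats."""
    seen = set()
    for event in events:
        # A canonical JSON string serves as the event's fingerprint.
        fingerprint = json.dumps(event, sort_keys=True)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        yield event
```

Note that the set of fingerprints here grows without limit. A production stream would more likely suppress duplicates only within a time window, so that memory stays bounded.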

Occasionally, something happens that results in a spike with an order of magnitude more data generated by a particular source or group of sources. This data is often not useful in itself – all we needed to know was that something happened – so in these instances, the data flow can be throttled from specific sources.
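
One way to throttle a noisy source is to cap the number of events it may pass in each time window and simply count the rest. The class below is a minimal sketch under that assumption; the name SourceThrottle, the fixed-window approach and the default of 60 seconds are illustrative choices, not a description of any specific tool.

```python
from collections import defaultdict


class SourceThrottle:
    """Pass at most `limit` events per source in each `window`-second period."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self._window_start = defaultdict(float)
        self._count = defaultdict(int)
        # Dropped events are counted, so we still know that a spike happened.
        self.dropped = defaultdict(int)

    def allow(self, source: str, timestamp: float) -> bool:
        # Start a new counting window once the old one has expired.
        if timestamp - self._window_start[source] >= self.window:
            self._window_start[source] = timestamp
            self._count[source] = 0
        if self._count[source] < self.limit:
            self._count[source] += 1
            return True
        self.dropped[source] += 1
        return False
```

Each source is tracked separately, so a spike from one firewall does not reduce what is accepted from other systems.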

Transforming data

As well as routing and reducing data, you may also need to transform or alter datasets in real time as they move between source and destination tools and systems. This may be because the destination tool is more modern or has different capabilities than the source tool, or because a dataset must be altered to meet a business requirement. For instance, many compliance regulations dictate that you must encrypt and/or obfuscate personally identifiable information in a specific format when it is sent to other tools, making it necessary to encrypt the data and convert it into an acceptable format.
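
As a rough illustration of obfuscating personal data in flight, the sketch below replaces email addresses with a short salted hash. It assumes email addresses are the only personal data in the text, and it uses a one-way hash rather than encryption, so the original value cannot be recovered. The regular expression is deliberately simple, and the function name is hypothetical.

```python
import hashlib
import re

# A simple pattern for illustration; it will not match every valid address.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str, salt: str = "") -> str:
    """Replace each email address with a short salted SHA-256 digest."""

    def _digest(match: "re.Match[str]") -> str:
        data = (salt + match.group(0)).encode("utf-8")
        return "email:" + hashlib.sha256(data).hexdigest()[:12]

    return EMAIL.sub(_digest, text)
```

The same address always maps to the same token for a given salt, so analysts can still correlate events by user without seeing the address itself.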

Not all systems can receive data in its original format. For example, XML data generated by legacy systems often needs to be converted into JSON because this is the only format the destination tool can receive. Transforming data therefore allows you to get the most out of your existing legacy investment by ensuring you can still extract data from these systems and send it to a cloud or contemporary environment for processing.
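
For a simple case, the XML-to-JSON conversion can be sketched with Python’s standard library alone. The example assumes a flat record (one root element with attributes and text-only children); the element names are made up, and nested or repeated elements would need more handling.

```python
import json
import xml.etree.ElementTree as ET


def xml_event_to_json(xml_text: str) -> str:
    """Convert a flat XML record into a JSON object string."""
    root = ET.fromstring(xml_text)
    # Attributes on the root element become top-level keys.
    record = dict(root.attrib)
    # Each child element becomes a key holding its text content.
    for child in root:
        record[child.tag] = (child.text or "").strip()
    return json.dumps(record, sort_keys=True)
```

For instance, xml_event_to_json('<event id="7"><host>web01</host></event>') produces a JSON object with the keys "host" and "id". All values are emitted as strings, because XML carries no type information.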

Real world results

These aren’t just hypothetical examples; they’re real-world instances where customers are using Cribl LogStream to control and manage their data. LogStream can receive data from any source, then streamline and reshape it before sending it on to one or more destinations. So, you can combine all your data flows and use one tool to parse, restructure and enrich data in flight before you pay to analyse it, shrinking consumption costs.

If you’d like to see Cribl LogStream in action, why not sign up for a demonstration?