HOW TO GUIDES

Long-term Storage for GoBD

Secure, Immutable, Auditable GoBD-Compliant Storage at Scale

Introduction

Storing data securely and immutably is a challenge.
Doing it while streaming the data elsewhere is even harder.

In this 4Data ‘How To’ guide, we walk through building a solution to achieve the following requirements:

  • Store data immutably for 11 years
  • Duplicate the data
  • Send the data to Splunk Enterprise
  • Restore GoBD data for audit from the storage
  • Audit the data in Splunk Enterprise

The whole setup can also be used to simply store your important data in an archive outside your data platform (in our case, Splunk Enterprise).

This can be beneficial in multiple ways: to fulfil legal requirements, to reduce the cost of keeping data on fast, searchable storage, or simply to create offsite storage.

The technologies we will use here are:

Storage: AWS S3 (Glacier Deep Archive storage class)

Routing: Cribl Stream

Audit/Searching: Splunk Enterprise

The setup on completion of all steps should look like this:

In this scenario, everything is already located in AWS. If you are using an on-prem environment, you will need a secure upstream connection to AWS S3 for this setup to work securely.

If you have specific requirements that you would like to discuss with us here at 4Data, simply get in touch via the contact page or the Chatbot.

Pre-requisites (all located in AWS):

  • Splunk Enterprise
  • Cribl Stream
  • AWS S3

Step 1 – Set up the AWS S3 Bucket

To keep this walkthrough simple, we are working with the AWS Console instead of the Command-line Interface (CLI) or Infrastructure as Code (IaC) platforms like Terraform.

To store something in AWS S3, we need an S3 Bucket. To create a bucket we simply log in to the AWS Console and switch to the S3 UI. Here we create a new bucket.

In our case, this will be a Bucket named “gobd_demo”.

Choose the region in which your Cribl Stream instance lives.

Now we will change some settings to fulfil some of the requirements for GoBD compliance.

1. Disable Public Access

2. To add encryption at rest, enable default bucket encryption (SSE-S3 or SSE-KMS):

3. Enable Object Lock

The bucket can now be created (a scripted sketch of these creation-time settings is shown below), after which we will make some further adjustments to the bucket properties.
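If you would rather script these creation-time settings than click through the console, they can also be applied with boto3. The sketch below is purely illustrative: the bucket name, region and other values are placeholders for you to adjust.

```python
# Minimal boto3 sketch of the creation-time bucket settings described above.
# Bucket name and region are placeholders - adjust to your environment.
import boto3

region = "eu-central-1"
bucket = "gobd-demo-archive"  # hypothetical name; S3 bucket names cannot contain underscores

s3 = boto3.client("s3", region_name=region)

# Create the bucket with Object Lock enabled (this can only be set at creation time)
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": region},
    ObjectLockEnabledForBucket=True,
)

# 1. Disable all public access
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# 2. Enable default encryption at rest (SSE-S3 shown; SSE-KMS also works)
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)
```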

4. (Optional) If you track/index your AWS data in Splunk for additional auditing capabilities, then enable server access logging.

5. Once the bucket is created, there are some further steps to complete for compliance with the GoBD regulation: edit the Object Lock settings to enable a default retention period and set the mode to “Compliance”. This keeps the data for the allotted retention period, during which no one (not even the root user) can overwrite or delete any of the objects.

6. Adjust the permissions in the bucket policy so that only principals in your AWS account can access the bucket.
This is what the policy should look like (a scripted sketch is also included after these steps):

7. The bucket ACL (Access Control List) should be adjusted so that only entities in your AWS account can read from and write to the bucket.

8. The last step is to give your Cribl Stream instance access to the bucket.
To achieve this, we will attach a policy to the instance role.

You can achieve this by going to IAM, creating a new policy and pasting in the policy JSON (a sketch of what such a policy might contain follows below). After the policy has been created, move to IAM Roles, select the role used by your Cribl Stream instance and attach the policy you just created.
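For reference, here is a hedged, scripted sketch of steps 5 to 8. The bucket name, account ID, role name and policy contents are assumptions for illustration only; your actual bucket policy and Cribl permissions may differ.

```python
# Sketch of the post-creation compliance settings and access policies.
# All names, ARNs and IDs are hypothetical placeholders.
import json
import boto3

bucket = "gobd-demo-archive"
account_id = "111111111111"            # placeholder AWS account ID
cribl_role_name = "cribl-stream-role"  # placeholder instance role used by Cribl Stream

s3 = boto3.client("s3")
iam = boto3.client("iam")

# 5. Default Object Lock retention in Compliance mode for the retention period
s3.put_object_lock_configuration(
    Bucket=bucket,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 11}},
    },
)

# 6./7. Bucket policy restricting access to principals in this AWS account
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccessFromOutsideThisAccount",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {"StringNotEquals": {"aws:PrincipalAccount": account_id}},
        }
    ],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(bucket_policy))

# 8. Policy letting the Cribl Stream instance role read, write and restore objects
cribl_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:RestoreObject",
                "s3:GetBucketLocation",
                "s3:ListBucket",
            ],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
        }
    ],
}
policy = iam.create_policy(
    PolicyName="cribl-gobd-bucket-access",
    PolicyDocument=json.dumps(cribl_policy),
)
iam.attach_role_policy(
    RoleName=cribl_role_name,
    PolicyArn=policy["Policy"]["Arn"],
)
```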

Now that we have set up AWS S3, as well as the policies for access to the bucket, let’s start setting up Cribl Stream.

Step 2 – Set up Cribl Stream

To set up Cribl Stream to send to both S3 and Splunk, we need to configure them as destinations. In Cribl Stream, navigate to Destinations and select Splunk Single Instance or, if you are using a distributed environment, Splunk Load Balanced. Create a new destination with the details of your Splunk instance and ensure the appropriate ports are open in your AWS security groups. Do the same for the S3 bucket under Destinations > Amazon S3.

To grant Cribl Stream access to the S3 bucket, adjust the settings of the bucket destination as follows:

1. Assuming Cribl Stream is running in the same environment as your S3 bucket, select the ‘Auto’ authentication method.

2. Under ‘Assume Role’, enter the ARN of a role with sufficient access to your S3 bucket. This role should be able to read, write and restore objects in your S3 bucket (a sketch of the trust relationship such a role needs is shown below).
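If you use a dedicated role for ‘Assume Role’ rather than relying on the instance role’s own permissions, that role needs a trust policy allowing the Cribl instance role to assume it, in addition to the S3 permissions from Step 1. A hypothetical sketch (all identifiers are placeholders):

```python
# Hypothetical sketch of the trust policy for a role Cribl Stream can assume.
# The S3 permissions attached to this role can mirror the bucket-access policy
# sketched in Step 1; only the trust relationship is shown here.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Placeholder: the instance role your Cribl Stream workers run under
            "Principal": {"AWS": "arn:aws:iam::111111111111:role/cribl-stream-role"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="cribl-gobd-s3-access",  # placeholder; use this role's ARN under 'Assume Role'
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
```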

Back under General Settings, there are some options available to control the directory structure Cribl Stream creates in the bucket when it stores your data.

1. The ‘Key Prefix’ determines the name of the root directory created in the S3 bucket.

2. The ‘Partitioning Expression’ allows the use of a JavaScript expression to determine how the files are partitioned, which can include fields from the events themselves.

Our example stores all files under a root directory named ‘gobd’ and partitions files into subdirectories by year/month/day/index, as illustrated in the sketch below.
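To make that layout concrete, the sketch below (plain Python, not Cribl configuration) shows how an event’s timestamp and index would map onto an object key under this scheme:

```python
# Illustrative only - not Cribl configuration. Shows the key layout produced by
# a 'gobd' key prefix with a year/month/day/index partitioning scheme.
from datetime import datetime, timezone

def object_key_for(event_time: float, index: str) -> str:
    """Return the S3 key prefix an event would land under."""
    t = datetime.fromtimestamp(event_time, tz=timezone.utc)
    return f"gobd/{t:%Y}/{t:%m}/{t:%d}/{index}/"

# An event from 17 May 2024 destined for the 'main' index (hypothetical values)
print(object_key_for(1715934000, "main"))  # -> gobd/2024/05/17/main/
```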

Navigate to Advanced Settings and select the following:

1. Under ‘Object ACL’ select ‘Bucket Owner Full Control’

2. Now, to satisfy the long-term storage requirements, select the storage class that you would like Cribl Stream to store the data in (in our case, Glacier Deep Archive)

Now, Cribl Stream is able to send events to both our S3 bucket and our Splunk environment. We also need to configure the collector which will pull the restored events from S3 in the event of an audit.

Navigate to Data -> Sources and filter by Collectors. Select the S3 collector and create a new collector. Ours will be named ‘gobd_collector’. After naming the collector, select the S3 destination we just configured in the ‘Auto-populate from’ drop-down to automatically configure the collector with the correct settings from our bucket.

Our collector is now set up. The collector also provides a ‘Path’ variable similar to the S3 destination. This determines where the collector looks for events and allows the folder values to be tokenised into fields within the event. This is where we create the “S3_*” fields used in our pack.

Now we are ready to configure our Routes and Pipelines to send the data to the relevant destinations. Fortunately, we have created a pack containing the majority of the config required here to make this easier. Download it here and install it via the Cribl Stream UI.

Once installed, open the pack and you will see two routes configured.

The first takes any data sent to it that doesn’t come from our collector and prepares it to be sent to S3. The second takes any data that does come from our collector and prepares it to be sent to Splunk. If you gave your collector a different name, make sure to update the route filters here.

Now we just need to associate these routes with their destinations. To do this, create two routes outside the pack:

1. Create a filter that selects the GoBD data you want to archive in S3, then select the GoBD_Restore pack as the pipeline and the S3 bucket as the destination.

2. For the second route, use the filter from the Pack which selects events from our GoBD collector. Set the pipeline to the GoBD_Restore pack and set the destination to Splunk.

Now Cribl is configured and ready to go.

Step 3 – Set up Splunk

We want to have a very simple audit process so the GoBD auditor doesn’t necessarily need SPL knowledge. To achieve this, we will utilise our 4Data-made GoBD Audit App, which covers some very important pieces of compliance:

1. Being able to set up the Audit Dashboard with different fields of interest

2. The audit ability itself

3. Ability to track access to the GoBD bucket

4. Alert on changes to the bucket (This shouldn’t be possible with the settings we have used to set up the bucket)

5. Log the audits done

6. Audit who has accessed the GoBD audit index (this might be added depending on the progress of development)

7. Delete audit data in Splunk when finishing the audit

You can create your own dashboards to achieve the same results.

We utilise a couple of searches to achieve those pieces of compliance.

These are the Dashboards in our App:

GoBD Bucket Access Tracker

GoBD Index Access Audit

Who has accessed the GoBD index and when?

GoBD Audit Dashboard

This dashboard walks the auditor, wizard-style, through the selection of the fields and values they would like to check. It is highly customisable.

Step 4 – Replay Audit Data

To replay the audit data from S3, we first need to restore it from Glacier. In order to keep things simple, we will do this in the AWS web UI. Simply navigate to the file(s) you would like to restore and select ‘Initiate Restore’.

Here you can select the number of days you would like the restored data to be available for and the speed of the retrieval. The speed will depend on which storage class you chose to use to store the data when configuring the S3 destination in Cribl.
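If you have many objects to restore, or simply prefer to script it, the same restore can be initiated through the S3 API. Here is a minimal boto3 sketch with placeholder bucket and key names; note that Deep Archive only supports the Standard and Bulk retrieval tiers:

```python
# Minimal sketch of initiating a Glacier restore for one archived object.
# Bucket and key are placeholders; 'Bulk' is the cheapest (and slowest) tier,
# 'Standard' is faster but costs more.
import boto3

s3 = boto3.client("s3")

s3.restore_object(
    Bucket="gobd-demo-archive",
    Key="gobd/2024/05/17/main/CriblOut-0.json.gz",  # hypothetical object key
    RestoreRequest={
        "Days": 7,  # how long the restored copy stays available
        "GlacierJobParameters": {"Tier": "Bulk"},
    },
)
```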

Once restored, you will see a banner at the top of the object’s page informing you of the restore status and how long the restored copy will remain available.

At this point, the data is ready for Cribl to collect and re-ingest into Splunk.

In Cribl, navigate to Data -> Sources and filter by Collectors. Select the S3 collector type and, on our GoBD collector, select Run.

This will show the collector screen where we can retrieve our data. Write a filter to select the data you would like to retrieve. This can include the fields created in the Path variable from the collector configuration created earlier.

Running the collector in Preview mode will show a sample of the events that will be returned when the collector is run. Running in Discovery mode will display the total number of events that will be returned by the collector, and running in Full Run mode will do what we need and return the events themselves.
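Before committing to a Full Run, it can be worth confirming outside Cribl that the objects you expect have actually finished restoring, as an in-progress restore is not yet readable. A small boto3 sketch that checks the restore status of objects under a prefix (bucket and prefix are placeholders matching the layout used earlier):

```python
# Sketch: list objects under a prefix and report their Glacier restore status.
import boto3

s3 = boto3.client("s3")
bucket = "gobd-demo-archive"
prefix = "gobd/2024/05/17/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        head = s3.head_object(Bucket=bucket, Key=obj["Key"])
        # The 'Restore' field is absent for non-archived objects and contains
        # ongoing-request="true" while the restore is still running.
        print(obj["Key"], head.get("Restore", "no restore requested / not archived"))
```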

Select Full Run and run the collector.

And that’s it! Cribl will automatically route the collected data to our GoBD pack and onto the configured Splunk index. It’s time to jump into Splunk and take a look at the data.

Step 5 – Test the GoBD Audit Dashboard

Conclusion

As we have seen, by combining a data routing solution like Cribl Stream with AWS S3 and Splunk Enterprise, it is possible to achieve what providers of proprietary solutions charge a pretty penny for.

This solution can also be implemented to store data for a long time in cheaper storage, with the flexibility to restore and search through this at any point later.

In today’s data-driven world, ensuring secure and immutable storage while complying with GoBD regulations is a challenge. However, a comprehensive solution that uses AWS S3, Cribl Stream and Splunk Enterprise can make this process easy and cost-effective. This solution offers long-term storage, data duplication and auditing capabilities, all of which are essential for organisations that need to adhere to GoBD guidelines.

By following the step-by-step instructions provided in this guide, organisations can set up the solution quickly and efficiently, while also ensuring the security and integrity of their data. With the ability to restore and search through data at any point, this solution is not only compliant, but also flexible and scalable, making it an ideal choice for businesses of all sizes.

So, if you’re looking for a secure and compliant storage solution for your organisation, look no further than AWS S3, Cribl Stream and Splunk Enterprise. If you like the solution, but are unsure whether this can be implemented by you or your team, 4Data Solutions’ Professional Services team is here to help.