Data at Rest (DAR) scanning is one of the most common DLP strategies adopted by organizations. An initial "limited scope" scan provides high-level insight into the risk landscape, and most importantly the "initial results" are quick. However, grass that looks green from a distance may not be quite as green up close :)
Why should I consider DAR as a strategy?
The modern organization spans multiple functions, departments, teams etc., and is even more dispersed with remote working. The primary approach towards securing data starts with employee education, but "Are we there yet?". We (as innocent humans) often leave the outcome of our work lying on an open network share, endpoint system, mailbox etc. "What is the impact of unauthorized personnel peeping into a confidential document?". I have seen DAR scans reveal source code, credit card numbers, project documents etc. scattered all across the organization, mostly in places we would not expect to find them. This is where DAR scanning generates true value, i.e. we can identify and correct human errors through technology, while we continue working towards an education-oriented security approach.
***Hypothetical scenario*** Don't be surprised the next time you see an unauthorized transaction on your card!! - it was simply your data with a service provider, lying around somewhere and accessible to someone.
Voila, the initial scan outcome was awesome!! Let's do it..
While the initial scan seemed promising, scanning data at full scale (terabytes to petabytes produced across every employee of an organization) is a mammoth effort. It's not just the DLP scanning technology, but various aspects like source systems, storage, network, employee experience etc. that need to be factored in. One common problem with most DLP technologies is scan failure, where the scan has to be restarted from scratch. This causes a lot of time burn and frustration, and below are a few guidelines for circumventing these technical constraints:
A) Where is the bottleneck?
There are various moving parts in a DAR scan, and we must ensure that none of them reaches peak utilization. I would recommend targeting a maximum of 80% resource utilization during a scan. Below are the key considerations:
1) Source: How much can I squeeze my data source? Each data source has a limit, and pushing it beyond that limit often results in a slower scan. We may evaluate the following parameters:
- How do we connect to the source storage? Is this a direct NAS, mounted SAN, or do we connect via a host server?
- Evaluate the CPU, RAM & IOPS utilization on the source storage or server (or both) as appropriate
2) Network connectivity: This is often a bottleneck, as we may have high-speed storage but slower network connectivity. Moreover, we also need to ensure that discovery scan servers are placed in proximity (fewest network hops) to the respective sources. It is a good idea to evaluate the difference in network utilization during an active vs paused scan, thus determining bandwidth requirements and optimal scan power (a minimal sampling sketch follows this list).
**Recommendations:**
- Leverage a separate NIC over a (dedicated) backup network, thus ensuring that user and scan traffic are segregated
- Consider running a local agent-based scan (if feasible) in scenarios where network scans present an issue
3) Scanning power: How much muscle do I have, and how much can / should I use? It is important that we evaluate utilization metrics at the source as well as the network before considering an increase in processing power (CPU / RAM) on the discovery scan server. Using too much force on an already constrained source / network would negatively impact scan performance.
4) Number of parallel scans / threads: Too many vehicles on a road tend to slow down traffic, and the same applies to DAR scans. We may certainly leverage the benefits of parallelism (multiple processing threads), as long as it does not have a detrimental effect.
5) Scan schedule: Network and resource utilization vary around the clock, and high utilization is likely to impact scan performance as well as user experience. It is imperative that we schedule scans appropriately, e.g. during hours (preferably at night) when the impact on user experience is minimal and network/resource utilization is low.
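To make points 1) to 3) concrete, here is a minimal utilization sampler, assuming a Python environment with the third-party psutil package on (or near to) the discovery scan server; the 80% threshold and 60-second interval are just the illustrative values from above. Running it once with the scan paused and once with it active gives the bandwidth delta mentioned in point 2).

```python
# utilization_sampler.py - a minimal sketch of the "watch every moving part" idea.
# Assumes the `psutil` package is installed; threshold and interval are illustrative.
import time

import psutil

THRESHOLD_PCT = 80   # target ceiling discussed above
INTERVAL_SEC = 60    # sampling interval

prev_disk = psutil.disk_io_counters()
prev_net = psutil.net_io_counters()
prev_t = time.monotonic()

while True:
    time.sleep(INTERVAL_SEC)
    cpu_pct = psutil.cpu_percent(interval=1)      # CPU utilization over a 1-second window
    mem_pct = psutil.virtual_memory().percent     # RAM utilization
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    now = time.monotonic()
    elapsed = now - prev_t

    # Approximate IOPS and network throughput from counter deltas
    iops = ((disk.read_count - prev_disk.read_count) +
            (disk.write_count - prev_disk.write_count)) / elapsed
    net_mbps = ((net.bytes_sent + net.bytes_recv) -
                (prev_net.bytes_sent + prev_net.bytes_recv)) * 8 / elapsed / 1e6

    print(f"{time.strftime('%H:%M:%S')}  cpu={cpu_pct:.0f}%  mem={mem_pct:.0f}%  "
          f"iops={iops:.0f}  net={net_mbps:.1f} Mbps")
    if cpu_pct > THRESHOLD_PCT or mem_pct > THRESHOLD_PCT:
        print(f"  WARNING: above {THRESHOLD_PCT}% utilization - consider throttling the scan")

    prev_disk, prev_net, prev_t = disk, net, now
```

The same sampling idea applies to the source storage / host server, where the native monitoring tools (or the storage vendor's own counters) are usually the better option.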
I would recommend capturing all the relevant baseline parameters into an Excel sheet, and gradually increasing the scan threads, power etc. Increment one aspect at a time (eg. scan threads), while monitoring its impact on other parameters over a few days. This should enable you to gradually optimize the DLP scanning solution, while having complete visibility over bottlenecks. The key is to evaluate where the bottlenecks exist, and optimize the solution without creating additional bottlenecks.
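As a sketch of this "one knob at a time" approach, each tuning run can be appended as a row to a CSV that opens directly in Excel. The column names and example values below are hypothetical and not tied to any specific DLP product; the point is simply that every run records the one parameter that changed alongside the observed impact.

```python
# tuning_log.py - append one row per tuning run so changes can be compared in Excel.
# Field names and example values are illustrative only.
import csv
import os
from datetime import date

LOG_FILE = "dar_tuning_log.csv"  # hypothetical file name

FIELDS = ["run_date", "scan_threads", "scanner_cpu_pct", "source_cpu_pct",
          "net_mbps", "gb_scanned_per_hour", "notes"]

def log_run(row: dict) -> None:
    """Append a single tuning run; write the header only when the file is new."""
    new_file = not os.path.exists(LOG_FILE)
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

# Example run: the only change from the previous run is scan_threads 4 -> 6.
log_run({
    "run_date": date.today().isoformat(),
    "scan_threads": 6,
    "scanner_cpu_pct": 62,
    "source_cpu_pct": 48,
    "net_mbps": 310,
    "gb_scanned_per_hour": 140,
    "notes": "increased threads from 4 to 6; no source-side degradation observed",
})
```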
B) How long should a scan run?
A large-volume scan could take anywhere between days and weeks, and any technical glitch during that window can cause the scan to fail. While most OEMs publish benchmarks around scan timelines, these may not hold in production environments (owing to environment-specific differences and constraints). Thus it is best to start with a test scan within your environment to evaluate the expected performance (post bottleneck optimization). The outcome of this exercise may be leveraged to configure multiple smaller scans, each sized to complete within an ideal time-frame of 24 - 48 hours. This reduces the risk of failed scans as well as time burn.
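A rough back-of-envelope calculation along these lines can be used to size the smaller scans. The throughput figure would come from your own test scan after bottleneck tuning; the numbers below are purely illustrative.

```python
# scan_sizing.py - split a large repository into scans that fit the 24-48 hour window.
# All inputs are illustrative; replace them with figures measured in your own test scan.
import math

total_data_tb = 120            # total volume to be scanned (hypothetical)
measured_gb_per_hour = 140     # throughput observed in the test scan
target_window_hours = 36       # aim for completion within the 24-48 hour band

# Volume one scan job can cover inside the target window
gb_per_scan = measured_gb_per_hour * target_window_hours

# Number of smaller scans needed to cover the whole estate
num_scans = math.ceil(total_data_tb * 1024 / gb_per_scan)

print(f"Each scan can cover roughly {gb_per_scan / 1024:.1f} TB in {target_window_hours} hours")
print(f"Split the {total_data_tb} TB estate into about {num_scans} scans")
```

Splitting by share, department, or top-level folder along these boundaries also means a failed scan only costs you one small slice, not the whole estate.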
Published: 6th February, 2022
Author: Denis Kattithara