Big Data from Our Point of View

Posted by Amir Szekely   |   April 8, 2014

The words Big Data get thrown around a lot these days. Large players in the security space use the term to describe their ability to collect huge amounts of data at scale through their cloud infrastructures. This raises concerns for enterprises that do not, in fact, want their critical information assets sent to an off-premise cloud where they don't control how the data is stored or secured.


At CounterTack we also talk about Big Data, but unlike most other security organizations, we apply Big Data technology in a unique way. One example is our ability to collect behavioral data across thousands of endpoints.

Where we differentiate ourselves is that all of our data collection happens on-premise, giving our customers complete control over where their information is stored. Data storage will remain one of the biggest concerns facing the market because data never stops coming in.

Here's a quick look behind the curtain at an example of CounterTack's work with Hadoop, where our goal is to consistently push the envelope in terms of improving speed and performance of our CounterTack Sentinel endpoint threat detection and response platform. There are many processes that we implement, and many challenges we solve daily - some big and some small. Here's an interesting issue I came across that I wanted to share.

I had a problem where HDFS would fill up really fast on my small test cluster. Using hdfs dfs -du I was able to track it down to the MapReduce staging directory under /user/root/.staging. For some reason, it wasn’t always deleting some old job directories. I wasn’t sure why this kept happening on multiple clusters, but I had to come up with a quick workaround.
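For anyone chasing a similar problem, here is a minimal sketch of ranking the output of hdfs dfs -du by size. It is not part of my clean-up script. The function name largest_entries and the sample listing are made up for illustration, and it assumes each line starts with a byte count and ends with the path, with no spaces in the path.

```python
def largest_entries(du_output, top=5):
    """Return the `top` largest (size_bytes, path) pairs from `hdfs dfs -du` text."""
    entries = []
    for line in du_output.splitlines():
        fields = line.split()
        # Skip blank lines and anything that does not start with a byte count.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        # First column is the size and the last column is the path. Some
        # Hadoop releases print an extra middle column, which is ignored here.
        entries.append((int(fields[0]), fields[-1]))
    return sorted(entries, reverse=True)[:top]


# Made-up sample in a two-column "size path" layout (an assumption).
sample = """\
1024  /user/root/data
987654321  /user/root/.staging
2048  /user/root/.Trash
"""

for size, path in largest_entries(sample, top=2):
    print("%12d  %s" % (size, path))
```

On a real cluster the text would come from something like subprocess.check_output(['hdfs', 'dfs', '-du', '/user/root'], universal_newlines=True) instead of the sample string.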

I created a small Python script that lists all staging directories and removes any of them not belonging to a currently running job. The script runs from cron and I can now use my cluster without worrying it’s going to run out of space.

This script is pretty slow and it’s probably possible to make it way faster with Snakebite or even some Java code. That being said, for daily or even hourly clean-up, this script is good enough.

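import re
import subprocess

# List every job the cluster knows about and keep the IDs of the active ones.
# NOTE: this pattern is an assumption about the `mapred job -list all` layout
# (job ID first, then a state shown as RUNNING/PREP or as the codes 1/4).
# Check it against your own output first: if it matches nothing, every
# staging directory below is treated as stale.
all_jobs_raw = subprocess.check_output(
  'mapred job -list all'.split(), universal_newlines=True)
running_jobs = re.findall(
  r'^\s*(job_\d+_\d+)\s+(?:RUNNING|PREP|1|4)\b',
  all_jobs_raw, re.M)

# List the staging directories; each one is named after its job ID.
# NOTE: this pattern is likewise an assumption about the `-ls` output.
staging_raw = subprocess.check_output(
  'hdfs dfs -ls /user/root/.staging'.split(), universal_newlines=True)
staging_dirs = re.findall(
  r'/user/root/\.staging/(job_\d+_\d+)\s*$',
  staging_raw, re.M)

# Anything in staging without an active job is stale.
stale_staging_dirs = set(staging_dirs) - set(running_jobs)

for stale_dir in stale_staging_dirs:
    # Delete the directory right away, bypassing the HDFS trash.
    subprocess.check_call((
      'hdfs dfs -rm -r -f -skipTrash ' +
      '/user/root/.staging/%s' % stale_dir).split())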

The script requires at least Python 2.7 and was tested with Hadoop 2.0.0-cdh4.5.0.

Big Data is defined differently by different organizations, and it can be a red flag for some but not all. There are technologies that many developers are still figuring out, and that is all part of the fun! We are incredibly lucky to have the collaborative approach we do, so we can savor the small technical victories that improve what we deliver to customers at CounterTack. I hope you'll find this tip helpful!
