THE HYPERSCALE CHALLENGE

1.0         Introduction

The emergence of mission-critical business functions that rely upon machine-generated data requires enterprises to evaluate and adopt new storage architectures that can support the ever-accelerating velocity of data ingest, the volume of data, and the variety of object data types. At the same time, these architectures must support scaling storage capacity to exabytes of data behind the firewall while delivering the economies of a public storage cloud such as those operated by Google, Facebook, and Amazon.

 

The term Big Data has become a categorical phrase encompassing a broad landscape of challenges, use cases, and requirements. Within the Big Data category, a number of unique requirements exist for different architectures and solutions. YottaStor is focused on machine-generated data, the fastest-growing segment of Big Data. Machine-generated data is overwhelming traditional, POSIX-based architectural designs and rendering them obsolete. Commercial and Federal enterprises are spending hundreds of millions, if not billions, of dollars deploying advanced sensor technologies that create and capture machine-generated data. The YottaDrive is a patented, purpose-built large data object storage service that economically stores this data and exploits it for business insight.

 

2.0         Defining the HyperScale Challenge

 

The traditional business enterprise is experiencing massive data growth. Traditional database workloads are giving way to new data types such as video, multi-spectral sensing, medical imaging, cyber packet capture, and geospatial data.

 

In early 2010, industry data organizations recognized a new category of data that accounts for much of this massive growth. This new category is called machine-generated data.

 

Machine-generated data is created by powerful sensor technologies. Examples include:

  1. smartphone cameras ranging from 8 to 41 megapixels
  2. medical imaging devices used for diagnostic processes
  3. genome sequencing technologies that are the foundation of bioinformatics discovery
  4. gigapixel-class EO/IR sensors that support the DoD ISR mission, and
  5. IP packet capture probes attached to the WAN backbone.

 

According to a recent IDC report, 80% of future enterprise data growth will consist of machine-generated data.

 

Another characteristic of machine-generated data is that each stored object is growing in size as well. The smartphone camera serves as an instructive example. As recently as three years ago, the most advanced smartphone camera captured 1 or 2 megapixels. Today the Nokia Lumia 1020 smartphone has an integrated 41 megapixel camera. A single picture is still just one picture, but with the Nokia Lumia 1020 that picture now contains 20 to 40 times more data.

 

The business processes supporting these workloads are fundamentally different from traditional enterprise workloads such as supply chain, manufacturing, and ERP that drive traditional POSIX-based storage architectures. The new workload requirements for machine-generated data include:

 

  • Data Ingest
  • Content Streaming
  • Content Collaboration
  • Content Dissemination
  • Analytical Frameworks

 

Introducing HyperScale Technologies

 

HyperScale technologies date back to the early 2000s, when they emerged to support the workload requirements of the maturing internet market. This innovation was necessary because the amount of data being stored was growing at such a rate that traditional enterprise technologies simply could not scale economically to meet the new requirements.

 

Market leaders like YouTube, Google, Facebook, and Amazon had the ability to employ hundreds of computer scientists to design and develop the first generation of technologies needed to handle the explosion of users and data to be stored. Over the past decade, as these operational concepts, architectures, and workload requirements have become operationally proven and better understood by a larger audience of IT professionals, a new class of entrepreneur has emerged to transfer these breakthrough technologies to the enterprise market.

 

Analytics at HyperScale

 

Analytics at HyperScale can be summed up in a single phrase: scale changes everything. The constantly accelerating ingest velocity creates a situation where the storage system must seamlessly expand without operational disruption. This accelerating ingest velocity also renders infeasible traditional extract, transform, and load (ETL) architectures that move data from primary storage to an analytical environment built on traditional enterprise analytic engines such as SAS, Greenplum, and Hadoop.

 

Instead, at HyperScale, the storage system must deliver integrated analytical capabilities such as MapReduce, which allow smaller, more precise data sets to be identified virtually and then moved to an analytic post-processing engine for exploitation.
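
As a minimal, hedged sketch of this pattern (the object records, metadata fields, and selection criteria below are hypothetical illustrations, not part of any YottaDrive interface), a map step filters objects using metadata captured at ingest, and a reduce step assembles only the matching object IDs for hand-off to a post-processing engine:

```python
from functools import reduce

# Hypothetical ingest-time metadata records; in practice these would be
# queried from the object store's metadata index, not held in memory.
objects = [
    {"object_id": "a1", "sensor": "EO/IR", "captured": "2013-05-01", "size_gb": 4.2},
    {"object_id": "b7", "sensor": "FMV",   "captured": "2013-05-01", "size_gb": 1.1},
    {"object_id": "c3", "sensor": "EO/IR", "captured": "2013-05-02", "size_gb": 3.8},
]

def map_phase(obj):
    # Emit (capture_date, object_id) only for objects matching the analytic
    # question, so that only a small, precise subset is ever moved.
    if obj["sensor"] == "EO/IR":
        return [(obj["captured"], obj["object_id"])]
    return []

def reduce_phase(acc, pair):
    # Group matching object IDs by capture date.
    key, object_id = pair
    acc.setdefault(key, []).append(object_id)
    return acc

mapped = [pair for obj in objects for pair in map_phase(obj)]
selection = reduce(reduce_phase, mapped, {})
print(selection)  # {'2013-05-01': ['a1'], '2013-05-02': ['c3']}
```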

 

Traditional approaches that attempt to extract every potential piece of metadata at the point of ingest are also infeasible, because the ingest velocity and sheer scale of the data volumes quickly make this an unaffordable solution.

 

Finally, one of the industry lessons from building advanced analytics in the anti-terror and law enforcement domains is that our adversaries have learned to quickly change their tactics and procedures to avoid detection by our asymmetric information dominance capability. This means that the intelligence yield of a particular algorithm can have a very “short shelf life.”

 

HyperScale Design Principles

 

In October 2011, YottaStor published a set of eight design tenets for developing HyperScale storage solutions. These design tenets were developed from work with multiple customers experiencing HyperScale data growth. The tenets have been shared with and reviewed by multiple DoD and Intelligence Community CTOs, and they incorporate those CTOs' thoughts and inputs.

 

YottaStor Design Tenets

  1. Capture and store data once

Write data to disk once during its lifecycle.  The accelerating velocity at which data is being created as well as the sheer magnitude of data being managed will overwhelm network capacity, rendering impractical any attempt to move data electronically.

  2. Process data at the edge

The only point in the architecture at which an organization can affordably process data is during ingest. This requires co-locating processing and storage so that users can create and capture the metadata required to access the data in the future.

  3. Automate data movement to less expensive storage

The storage system must continually migrate data to less expensive storage, allowing customers to lower their overall storage cost. The key metric for storing data becomes cost per GB per month. Once this metric is established for a specific organization, the year-to-year planning process will focus on reducing this cost.

  4. Adopt self-healing, self-replicating technology

In order to reduce the cost per GB per month, the technology must be self-healing and self-replicating. This capability substantially reduces the number and cost of FTEs required to manage the storage system.

  5. Deploy a federated, global name-space

Adopting name-space technologies that support billions of objects in a single name-space eliminates cost and support complexities.

  6. Access through web services (e.g., S3, REST API)

This level of application abstraction is key to allowing the operational optimization of the storage cloud without impacting the application layer.  One important benefit of this capability is the elimination of “location awareness” that applications must have in POSIX-compliant storage environments.

  7. Design for ever-increasing data variety, data volume and data velocity

The storage system must demonstrate the ability to scale in three dimensions: the data types, which will evolve and extend over time; the daily ingest requirement, which will continue to increase; and the overall capacity of the storage system, which will expand at an accelerating rate.

  8. Eliminate RAID

The data durability requirement is actually greater than in traditional storage environments. New approaches such as replication and erasure coding must be embraced to meet this requirement; a comparison of their storage overhead is sketched below, following Table 1.

 

 

Table 1 – YottaStor HyperScale Design Tenets
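
As a hedged illustration of the trade-off behind tenet 8 (the replication factor, erasure-coding layout, and capacity figure below are illustrative assumptions, not YottaDrive specifications), the following sketch compares the raw-capacity overhead of simple replication against a typical erasure-coding scheme:

```python
def replication_overhead(copies):
    # Raw bytes stored per byte of user data when keeping N full copies.
    return copies

def erasure_coding_overhead(data_shards, parity_shards):
    # Raw bytes stored per byte of user data for a k+m erasure code:
    # each object is split into k data shards and m parity shards are added.
    return (data_shards + parity_shards) / data_shards

usable_pb = 10  # illustrative usable capacity target, in petabytes

for label, overhead in [
    ("3x replication", replication_overhead(3)),
    ("10+6 erasure coding", erasure_coding_overhead(10, 6)),
]:
    raw_pb = usable_pb * overhead
    print(f"{label}: {overhead:.2f}x overhead -> {raw_pb:.1f} PB raw for {usable_pb} PB usable")
```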

 

YottaStor has used these design tenets as the basis for the design of the YottaDrive solution described below in Section 4.0 – The YottaDrive.

 

 

HyperScale Economics

 

At the Gartner IT Infrastructure, Operations and Management Summit held in June 2011, Gartner analysts discussed the growing storage challenges facing the enterprise. One of the metrics provided by Gartner was the estimate that “total lifecycle management costs for data ranges from $0.75 to $1 per GB per month for on-premise enterprise storage environments, with world class on-premise enterprise storage costs being about $0.33 per GB per month”.

 

A simple illustration can help clarify the tremendous cost advantages of HyperScale technologies versus traditional enterprise storage technologies. Consider as an example Air Force ISR data ingest volumes over the next 10 years (this analysis was originally completed in late 2011). The model is built on the operational assumption that data ingest will grow rapidly over the next five years as more Remotely Piloted Aircraft (RPAs) with longer persistence carry multiple sensors, including the new gigapixel-class sensor. The model also assumes that next-generation, low-loss compression algorithms will become a priority and find their way into operation around 2016. In parallel, however, sensor manufacturers will continue to innovate, creating higher-resolution sensors, which will result in a sort of “tug-of-war” in the later years (2017 to 2020) between capturing more data and fielding better tools to reduce the size of the data that is stored. Finally, the model assumes that the Air Force can achieve operational improvements each year, reducing the cost per GB per month by 10%.

 

This is an illustrative model and can be updated to reflect different operational assumptions, but the end result will be the same. Graphing the annual lifecycle data management costs using the Gartner enterprise cost estimates against public storage cloud economics makes the impact clear, and it can be summed up in the following statement: “A small cost difference per GB results in a huge cost differential at HyperScale.”
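
As a hedged, back-of-the-envelope calculation (the stored volume and the HyperScale unit cost below are assumptions chosen only to show the scaling effect; only the $0.33 figure comes from the Gartner estimate above), even a modest per-GB difference compounds into billions of dollars per year at exabyte scale:

```python
# One exabyte expressed in gigabytes.
EXABYTE_GB = 1_000_000_000

stored_eb = 2            # assumed steady-state stored volume, in exabytes
months = 12              # one year of lifecycle management

enterprise_cost = 0.33   # $/GB/month, Gartner's "world class" on-premise figure
hyperscale_cost = 0.10   # $/GB/month, assumed HyperScale-style cost

def annual_cost(rate_per_gb_month):
    return rate_per_gb_month * stored_eb * EXABYTE_GB * months

delta = annual_cost(enterprise_cost) - annual_cost(hyperscale_cost)
print(f"Annual difference at {stored_eb} EB: ${delta:,.0f}")
# A $0.23/GB/month gap at 2 EB works out to roughly $5.5 billion per year.
```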

 

The model demonstrates that if the Air Force were to stay with current, traditional POSIX storage architectures, then simply storing the data over the next 10 years would cost an additional $10+ billion. This does not include the cost of any analytics; it is simply the cost of storing the objects.

 

 

3.0         Example of a Common HyperScale Use Case

 

As an example of a common use case, the United States Air Force’s Intelligence, Surveillance, and Reconnaissance (ISR) mission is beginning to develop solutions that will result in HyperScale business requirements. These include:

 

  • Rapidly expanding surveillance programs using remotely piloted aircraft with EO/IR, FMV, WAMI, hyperspectral, and SAR sensors
  • IP packet capture for critical network protection
  • Unified communications using VTC
  • Growing modeling and simulation workloads

 

4.0         Our Solution – the YottaDrive

 

YottaStor has developed the YottaDrive, a patented technology built specifically for machine-generated data. The YottaDrive is a new category of mass storage solution that delivers radically superior performance at significantly lower cost than traditional enterprise storage.

 

Each YottaDrive is delivered fully operational with multiple petabytes (PB) of Large Data Object Storage capacity. As multiple YottaDrives are connected via a high-speed, low-latency IP spine-and-leaf network, they create the YottaCloud, a purpose-built large object storage cloud.

 

The YottaDrive creates a global namespace that can manage hundreds of billions of unique objects at multi-exabyte storage capacity. Each object is assigned an internally generated unique object ID at ingest. Different, segregated global namespaces can be established for different data types to meet operational requirements. The YottaDrive provides an out-of-the-box REST API access service as well as industry-standard access protocols such as SWIFT, S3, CIFS/NFS, and Open Geospatial Consortium standards.
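
As a minimal sketch of what S3-style access to such a namespace could look like (the endpoint, bucket name, credentials, and file name below are hypothetical placeholders, not YottaDrive defaults), an application stores and retrieves objects by key alone, without any awareness of where they physically reside:

```python
import boto3

# Hypothetical endpoint and credentials; an actual deployment would supply
# the values published for its own object storage service.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.mil",
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

bucket = "isr-wami-2013"  # hypothetical bucket / namespace segment

# Store a large sensor object; the service, not the application, decides
# where the bytes live and how they are protected.
with open("frame_000481.ntf", "rb") as f:
    s3.put_object(Bucket=bucket, Key="frame_000481.ntf", Body=f)

# Retrieve it later by key alone; no path or location awareness is required.
obj = s3.get_object(Bucket=bucket, Key="frame_000481.ntf")
data = obj["Body"].read()
```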

 

4.1         YottaDrive is a HyperScale Storage Node

 

The YottaDrive is manufactured at a Miami, FL facility that also designs and delivers containerized IT for eBay and Microsoft Bing.

 

Object Storage - The YottaDrive currently uses Web Object Scaler (WOS) storage technology specifically designed for exabyte-scale storage clouds. WOS is a peer-to-peer architecture based on the WOS node building block. Each WOS node currently delivers 240TB of capacity (using 4TB drives), is connected to the YottaDrive backbone via dual 10GbE network connections, and receives a unique IP subnet address during manufacturing.

 

Compute - The YottaDrive delivers 2,048 virtual CPUs with 6.144TB of RAM.  Each x86 server blade is connected with dual 10GbE connections to the 40GbE YottaDrive network backbone. Each server also has SSD for the Operating System (OS) and other local storage requirements.

 

Operating System - The YottaDrive currently utilizes the Red Hat OS and virtualization solution. The YottaDrive is delivered with Red Hat Enterprise Virtualization for Servers on RHEL 6.2.

 

Network - The YottaDrive design leverages state-of-the-art, low-latency, non-blocking switch technologies. Each YottaDrive has a 40GbE backbone and is connected to the customer’s backbone with 4x40GbE connections. Within the YottaDrive, each WOS node is connected with dual 10GbE connections, and the blade chassis is connected to the fabric with 12x40GbE uplinks and 64x10GbE links to the individual blades.

 

4.2         YottaDrive is Purpose-built to the Data Type

 

The YottaDrive is a purpose-built large data object storage node that is optimized during manufacturing to support specific large data object type requirements. This optimization ensures that the YottaDrive accurately extracts the object’s metadata and creates accurate search indexes.  In addition, the optimization ensures that the appropriate data durability and object retention rules are established in the global namespace. 

 

The YottaDrive:

 

  • Supports hundreds of billions of objects stored in a single customer global name-space. Each object is assigned an internally generated unique ID at ingest. Different, segregated global name spaces can be established for customers who require that their data assets be kept separate.
  • Allows customers to define object policies that support their operational and mission requirements and determine each object’s lifecycle management regimen (a hypothetical policy sketch follows this list).
  • Integrates with existing “data in motion” and “data at rest” encryption capabilities that the customer has already deployed, resulting in an end-to-end secure solution that meets any operational requirement.
  • Provides out-of-the-box base data durability protection of 99.9995%, with an Extreme Data Durability option providing a 99.999999999995% protection factor.
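
As a purely hypothetical sketch of what such a policy definition might look like (the field names, tiers, and dates below are illustrative assumptions, not the YottaDrive policy schema), a lifecycle policy can be expressed as structured data that the storage system evaluates for each object:

```python
from datetime import date

# Hypothetical lifecycle policy; field names and tiers are illustrative only.
wami_policy = {
    "data_type": "WAMI",
    "replicas_on_ingest": 3,            # durability scheme while data is hot
    "move_to_archive_after_days": 90,   # migrate to less expensive storage
    "archive_scheme": "erasure_coding",
    "retention_years": 7,               # delete only after retention expires
}

def placement_for(object_age_days, policy):
    """Decide which tier an object belongs in under the given policy."""
    if object_age_days < policy["move_to_archive_after_days"]:
        return "performance_tier"
    if object_age_days < policy["retention_years"] * 365:
        return "archive_tier"
    return "eligible_for_deletion"

ingested = date(2013, 1, 15)  # illustrative ingest date
age = (date(2013, 6, 1) - ingested).days
print(placement_for(age, wami_policy))  # archive_tier (137 days old)
```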

 

4.3         Multiple Connected YottaDrives Become a YottaCloud

 

As multiple YottaDrives are connected together via a high-speed IP spine-and-leaf network, they form the YottaCloud, creating a single global name-space that can store hundreds of billions of objects and scale to multiple exabytes of capacity.

 

The YottaCloud provides an out-of-the-box data durability protection of 99.9995%. If a higher data durability protection scheme is required, the Extreme Data Durability option can provide a 99.999999995% protection factor.

 

Our commercial YottaCloud Storage Service (YottaCloud) is a hosted Storage-as-a-Service (SaaS), managed in a secure hosting facility with 24x7x365 support services.