In 2012 Geoffrey Moore tweeted, “Without big data analytics, companies are blind and
deaf, wandering out onto the Web like deer on a freeway.” 
Fast-forward a decade, and a lot happened in the 2010s to deliver that sight and sound. The storage industry brought innovation to solve the petabyte+ data challenge, the analytics software/toolkit ecosystem rapidly matured, and chip manufacturers delivered accelerated compute to glean insights from ever-growing troves of data.
But the quest for better insights is never over. In fact, the constantly increasing volume of data is forcing us to take analytics into hyperdrive. To stay competitive in 2021, enterprises must continue to innovate. Below I describe four big data analytics trends I’m seeing, along with some suggested solution features to look for.
- Apache Spark will continue to dominate the big data world
The classic data scientist is known as a badass; give her Apache Spark software with a Jupyter notebook and get out of her way. Apache Spark, a unified analytics engine for large-scale data processing, is now the Kleenex of big data analytics and data engineering. It’s ubiquitous: universities offer classes for it, every Hadoop deployment leverages it, and the new Spark 3 operator brings native GPU capabilities plus S3 integration. Everyone needs to gear up for the Spark tsunami.
However, a fair amount of thrash in this space causes confusion. Major vendors are pushing businesses to shift to the cloud and dump the Hadoop Distributed File System (HDFS) for object storage. Meanwhile, a host of other dedicated solutions are sprouting up to deliver engineered Spark offerings.
The real challenge is figuring out how to easily bridge from Spark on YARN to a next-generation Spark-on-Kubernetes implementation without major disruption to the existing environment. Businesses must also account for the fact that Spark is just one of many applications supporting their analytics pipeline.
What to look for? The goal is a solution that simultaneously improves efficiency, agility, and elasticity while cutting costs and improving data exploitation. Ideally, this solution will let data scientists tap into existing data stores without having to move to the cloud or re-platform the data. On the application front, businesses will want to avoid vendor lock-in with multi-version, open-source Kubernetes support that has no dependencies on Hadoop or YARN.
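To make the YARN-to-Kubernetes bridge concrete, here is a minimal sketch of how the same Spark job is pointed at a Kubernetes cluster instead of YARN. The API-server URL, container image, and S3 endpoint are hypothetical placeholders, not real infrastructure; the flags themselves (`--master k8s://…`, `spark.kubernetes.container.image`, `fs.s3a.endpoint`) are standard Spark 3 options.

```python
# Sketch: assembling a spark-submit invocation that targets Kubernetes
# instead of YARN. All hostnames and image names are placeholders.

def k8s_spark_submit(app_path, api_server, image, executors=4):
    """Build the argument list for running a Spark app on Kubernetes."""
    return [
        "spark-submit",
        # On YARN this would be "--master yarn"; the k8s:// prefix is
        # how Spark 3 addresses a Kubernetes API server instead.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--conf", f"spark.executor.instances={executors}",
        # Executors run as pods created from this container image.
        "--conf", f"spark.kubernetes.container.image={image}",
        # s3a lets Spark read object storage directly -- no HDFS required.
        "--conf", "spark.hadoop.fs.s3a.endpoint=https://s3.example.com",
        app_path,
    ]

cmd = k8s_spark_submit(
    "local:///opt/spark/app.py",
    "https://k8s.example.com:6443",
    "example.com/spark:3.1",
)
print(" ".join(cmd))
```

Note that only the `--master` target and a handful of `spark.kubernetes.*` settings change; the application itself is untouched, which is exactly the low-disruption bridge to look for.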
- Stateful application modernization
App modernization is still red hot, and people’s minds usually go straight to microservices-based, cloud-native apps. But over the past 18 months, I’ve seen a radical shift in the open-source, ISV, and even monolithic analytics vendor space (think Splunk, Cloudera, and SAS). Businesses are now modernizing these applications to deploy on container-native infrastructure. These traditionally stateful, data-centric workloads are becoming more cloud-like: improving the efficiency of at-scale deployments and gaining the elasticity and agility needed to deploy anywhere, in minutes.
The challenge is figuring out the right modern home for these stateful applications. Data science and analytics are a team sport, so these applications will need to share data and models, while orchestrating hand-offs across the analytics lifecycle.
What to look for? Businesses will quickly need staff who can do more than just spell Kubernetes, though there are ‘no-code’ answers to part of this problem. They should look for a container platform that supports (and ideally is validated with) all these applications and can deliver data at petabyte scale. Businesses will also need to make sure their solution is based on open-source Kubernetes with proven hybrid-cloud capabilities, so they can quickly move these workloads between on-prem and the public cloud.
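Kubernetes already has a native primitive for exactly these workloads: the StatefulSet. As an illustration, here is the shape of one, written as a plain Python dict for readability (in practice it would be YAML applied with kubectl or managed by a vendor's operator); the names, image, and sizes are illustrative placeholders.

```python
# Sketch: a StatefulSet for a stateful analytics service. All names,
# images, and storage sizes are hypothetical placeholders.

statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "analytics-db"},
    "spec": {
        "serviceName": "analytics-db",
        "replicas": 3,
        "selector": {"matchLabels": {"app": "analytics-db"}},
        "template": {
            "metadata": {"labels": {"app": "analytics-db"}},
            "spec": {
                "containers": [{
                    "name": "analytics-db",
                    "image": "example.com/analytics-db:1.0",
                    "volumeMounts": [{
                        "name": "data",
                        "mountPath": "/var/lib/analytics",
                    }],
                }],
            },
        },
        # Unlike a Deployment, each replica gets its own stable network
        # identity and its own PersistentVolumeClaim, so its data
        # survives rescheduling -- the property stateful analytics
        # workloads need to become "cloud-like" safely.
        "volumeClaimTemplates": [{
            "metadata": {"name": "data"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": "1Ti"}},
            },
        }],
    },
}
```

The `volumeClaimTemplates` section is the key difference from a stateless Deployment: it is what lets a Splunk- or Cloudera-style workload keep per-instance data while still gaining container-native elasticity.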
- Solving for app dev and data-intensive workloads
When I go camping, my Swiss army knife is always on my belt, but as the adage goes, a jack of all trades is a master of none. Therefore, I also pack a hammer and hatchet for when a specialty need arises. I’m noticing the same pattern in container offerings. You may have already invested in a technology that is particularly good from the app developer perspective and are now trying to stretch that tool into new spaces.
The challenge is that we all want to minimize the number of solution providers, so we optimistically believe our vendors when they advocate using their tools for things they weren’t natively designed to do. Stateful apps are a different beast: running petabyte-scale analytics is very different from serving a microservices web application. Operating at the scale of hundreds or thousands of clusters, and/or hosts per cluster, imposes fundamentally different requirements.
What to look for? Use the right tool for the right job. Don’t be afraid to run multiple platforms side by side to complement your existing solutions and address your varied use cases around scale, performance, and data gravity. On the data side, validated CSI drivers are a great start, but you may need a dedicated or integrated high-performance, scale-out data store.
- The edge is here, and you need to solve for both data AND security
We’ve been reading about the billions of edge devices and IoT trends for years now, and I’m seeing more solutions that have actually operationalized data analytics from edge to cloud. At its simplest, organizations are bridging their data centers with the public cloud; others have brought tens of geographic locations together; and still others collect data from millions of streaming devices, even in orbit. Following this trend, analytics continually become more automated and distributed as they move toward the edge points of data creation. This creates a complex matrix of analytic edges, themselves composed of interconnected workloads that come and go, interacting with each other across physical and logical boundaries, much like today’s web interactions.
Businesses face two inherent challenges in edge analytics. First, how do organizations seamlessly bring together data from the many edges, multiple clouds, and on-prem, while still providing a single, no-silo view of all the data? Second, how do businesses liberate analytics to exploit the data across a secure matrix that has no intrinsic attested identity?
What to look for?
Data: A solution that can deliver a common data fabric for all the enterprise’s data on a global scale means faster time to value, better governance, and lower cost. Look for data platforms with proven petabyte scale, hardened enterprise feature set, and proven capabilities (like a global namespace and auto data-tiering) to deliver data from edge to cloud.
Security: A solution that can establish trust in a fluid, interconnected data landscape. Yesterday’s strategies for developing trust among workloads, like perimeter-based secrets management, are a band-aid that works in the near term but won’t scale. They leave the business vulnerable to attacks on an application estate that spans beyond the four walls of the data center. Instead, businesses should look for technologies that employ Zero Trust security to fully unlock their analytics over the next decade.
Take analytics to hyperdrive in the 2020s
Data will continue to be nothing without insights. Businesses can’t stand still – they will look to the 2020s as the decade to take their analytics to hyperdrive.
If you’re looking to learn more on this topic, please join me for HPE’s upcoming event, HPE Ezmeral Analytics Unleashed. We’ll be speaking with analysts, conducting live demos, and discussing the analytics journey with three of our clients, whose solutions range from a virtual wallet program to robotic drive for ADAS (advanced driver-assistance systems) to data science as a service.
Moore, Geoffrey (@geoffreyamoore). Twitter, 12 Aug. 2012, 7:29 p.m., https://twitter.com/geoffreyamoore/status/234839087566163968?s=20