PUBLIC SECTOR: Speed to Insight


 

>>Hi, this is Cindy Maike, vice president of industry solutions at Cloudera. Joining me today is Nasheb Ismaily, our solutions engineer for the public sector. Today we're going to talk about speed to insight: why the public sector uses machine learning, specifically around fraud, waste, and abuse. Our topics for today: we'll discuss machine learning and why the public sector uses it to target fraud, waste, and abuse; the challenges; how to enhance your data and analytical approaches; the data landscape and analytical methods; and then Nasheb will go over a reference architecture and a case study. By definition, per the Government Accountability Office, fraud is an attempt to obtain something of value through unwelcome misrepresentation, waste is about squandering money or resources, and abuse is about behaving improperly or unreasonably to obtain something of value for your personal benefit. Fraud is a top-of-mind area across all industries, and certainly within the public sector.

>>The types of fraud that we see center on cybercrime; accounting fraud, whether by individuals or within organizations; financial statement fraud; and bribery and corruption. Fraud really hits us from all angles, whether from external or internal perpetrators, and research by PwC finds over half of fraud involves some combination of internal and external perpetrators. Looking at a recent report by the Association of Certified Fraud Examiners: within the US government, roughly $148 billion in 2017 was identified as attributable to fraud, waste, and abuse. Of that, $57 billion was in reported monetary losses, and another $91 billion was in areas where the monetary impact had not yet been measured.

>>Breaking those areas down from an outlay perspective: over $65 billion within the health system, over $51 billion within social services, plus procurement fraud, fraud, waste, and abuse in the grant and loan processes, payroll fraud, and other areas; quite a few different topical areas. So looking at those broad-stroke areas, what are the actual use cases agencies are pursuing? What does the data landscape look like, and what analytical methods can we use to help curtail and prevent some of this fraud, waste, and abuse? Analytical use cases in the public sector run from taxation to social services to public safety and other agency missions. We're going to focus on some of the use cases around fraud within the tax area, and we'll briefly touch on unemployment insurance fraud, benefit fraud, and payment integrity. So fraud has its underpinnings in quite a few different government agencies, different analytical methods, and the usage of different data.
So I think one of the key elements is that you can look at your data landscape as a list of specific data sources you need, but it's really about bringing together different data sources across a different variety and a different velocity; data has different dimensions. We'll look at structured data, semi-structured data, and behavioral data. With predictive models, we're typically looking at historical information, but if we're trying to prevent fraud before it happens, or while a case is in flight, which is specifically a use case Nasheb is going to talk about later, the question becomes: how do I work with that real-time, streaming information?

>>How do I take advantage of data such as financial transactions, asset verification, tax records, and corporate filings? We can also bring in more advanced data sources, such as investigation information, where we may apply deep learning models to semi-structured or unstructured behavioral data, such as camera analysis. So there's quite a variety of data, and the breadth and the opportunity really come about when you can integrate and look at data across all of the different data sources: in essence, a more extensive data landscape. Specifically, I want to focus on some of the methods, data sources, and analytical techniques we're seeing used in government agencies, as well as opportunities to apply new methods.

>>For audit planning, or assessing the likelihood of non-compliance, we see data sources such as a constituent's profile. We might investigate the forms they've provided, compare that data against internal data sources, possibly look at net worth, compare against other financial data, and compare across other constituent groups (a brief sketch of that peer comparison follows this segment). The techniques we use include basic natural language processing, perhaps some text mining, and probabilistic modeling, where we compare information within the agency against what is reported on tax forms. Historically, much of this has been done in batch, over both structured and semi-structured information. The data volumes used to be low, but we're seeing them increase exponentially with the types of events we're dealing with and the number of transactions, so throughput matters, and Nasheb is going to talk about that in a moment. The other area of opportunity builds on that: how do I conduct compliance work and audits, investigate potential fraud, and identify under-reported tax information?
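As an aside, here is a minimal sketch of the peer-group comparison described above, expressed in Python with pandas. The filings, peer groups, and the -1.5 z-score threshold are all invented for illustration; in practice these figures would be queried at scale from the agency's governed data store.

```python
# A minimal sketch of peer-group outlier scoring for audit planning.
# All names, figures, and the -1.5 threshold are illustrative
# assumptions, not agency data.
import pandas as pd

filings = pd.DataFrame({
    "constituent_id":  [1, 2, 3, 4, 5, 6, 7, 8],
    "peer_group":      ["retail"] * 5 + ["construction"] * 3,
    "reported_income": [52_000, 48_000, 50_500, 51_200, 9_000,
                        81_000, 78_500, 80_200],
})

# Compare each filing with the mean and spread of its peer group.
stats = filings.groupby("peer_group")["reported_income"].agg(["mean", "std"])
scored = filings.join(stats, on="peer_group")
scored["z_score"] = (scored["reported_income"] - scored["mean"]) / scored["std"]

# Filings far below the peer norm become candidates for audit review.
candidates = scored[scored["z_score"] < -1.5]
print(candidates[["constituent_id", "peer_group", "reported_income", "z_score"]])
```

The same scoring logic, run in batch over millions of filings, is where the throughput concerns mentioned above come into play.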
So there you might pull in some of the other types of data sources, whether that's property records, data supplied by the actual constituents or by vendors, social media information, geographic information, or photos. Techniques we're seeing used include sentiment analysis and link analysis (sketched just after this segment), blending those data sources together with natural language processing. What's important here is also the data velocity, whether batch or near real time, and again all types of data, whether structured, semi-structured, or unstructured. The key, and the value behind this, is how we increase the potential revenue, including the under-reported revenue.

>>How do we stop fraudulent payments before they actually occur? How do we increase the level of compliance, and improve the prosecution of fraud cases? Additional areas of opportunity include economic planning: performing link analysis; bringing in more of what we saw in the data landscape around constituent interaction, social media, potentially police records, property records, and other tax department database information; and then comparing one individual to other individuals, people like a specific constituent, to spot where other aspects of fraud may be occurring. As we move forward, some of the more advanced techniques we're seeing around deep learning include computer vision, leveraging geospatial information, social network entity analysis, and agent-based modeling techniques, the simulation and Monte Carlo methods we typically see in the financial services industry, applied to fraud, waste, and abuse within the public sector.
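Before the conversation turns to the reference architecture, here is a minimal sketch of the link analysis Cindy describes, using the networkx library. The claims and shared identifiers are fabricated for illustration; a real graph would be assembled from the blended constituent, property, and vendor data discussed above.

```python
# A minimal sketch of link analysis over shared identifiers.
import networkx as nx

# Bipartite edges: each claim links to the identifiers it reuses.
edges = [
    ("claim_101", "addr:12 Oak St"),
    ("claim_102", "addr:12 Oak St"),
    ("claim_102", "acct:9981"),
    ("claim_103", "acct:9981"),
    ("claim_204", "addr:7 Pine Ave"),  # unrelated singleton
]

G = nx.Graph()
G.add_edges_from(edges)

# Components containing several claims suggest a coordinated ring
# worth an investigator's attention.
for component in nx.connected_components(G):
    claims = sorted(n for n in component if n.startswith("claim_"))
    if len(claims) > 1:
        print("Possible ring:", claims)
```

Connecting records through shared attributes is often the cheapest first pass: components that accumulate many claims around one address or account surface organized activity that record-by-record review would miss.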
>>And again, that really lends itself to new opportunities. And on that, I'm going to turn it over to Nasheb to talk about the reference architecture for these use cases.

>>Thanks, Cindy. So I'm going to walk you through an example reference architecture for fraud detection using Cloudera's underlying technology, and before I get into the technical details, I want to talk about how this would be implemented at a much higher level. With fraud detection, what we're trying to do is identify anomalies, or anomalous behavior, within our data sets. Now, in order to understand which aspects of our incoming data represent anomalous behavior, we first need to understand what normal behavior is. In essence, once we understand normal behavior, anything that deviates from it can be thought of as an anomaly. And in order to understand what normal behavior is, we need to be able to collect, store, and process a very large amount of historical data. That's where Cloudera's platform, and the reference architecture in front of you, come in. So let's start on the left-hand side of this reference architecture, with the collect phase.

>>Fraud detection will always begin with data collection. We need to collect large amounts of information from systems that could be in the cloud, in the data center, or even on edge devices, and this data needs to be collected so we can create our normal behavior profiles. Those profiles are then, in turn, used to create our predictive models for fraudulent activity. Now, on the data collection side, one of the main challenges many organizations face is finding a single technology that can handle data coming in under all the different formats, protocols, and standards, with different varieties and velocities. Let me give you an example. We could be collecting data from a database that gets updated daily, with that data arriving in Avro format.

>>At the same time, we could be collecting data from an edge device that's streaming in every second, and that data may be coming in as JSON or in a binary format. So this is a data collection challenge, and it can be solved with Cloudera DataFlow, a suite of technologies built on Apache NiFi and MiNiFi that lets us ingest all of this data through a drag-and-drop interface. So now we're collecting all of the data required to map out normal behavior. The next thing we need to do is enrich it, transform it, and distribute it to downstream systems for further processing. Let's walk through how that would work. First, enrichment: think of enrichment as adding additional information to your incoming data. Let's take financial transactions as the example, because Cindy mentioned them earlier.

>>You can store the known locations of an individual in an operational database, which with Cloudera would be HBase. As that individual makes a new transaction, the geolocation in the transaction data can be enriched with the previously known locations of that very same individual, and all of that enriched data can later be used downstream for predictive analysis. So the data has been enriched; now it needs to be transformed. We want the data coming in as Avro, JSON, binary, and every other format to be transformed into a single common format, so it can be used downstream for stream processing. Again, this is done through Cloudera DataFlow, backed by NiFi. The transformed data is then streamed into Kafka, and Kafka serves as the central repository, or buffer zone.
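To make the enrichment and buffering steps concrete, here is a hedged sketch of a consumer reading from that Kafka buffer and applying the known-locations check Nasheb describes. In the architecture above this logic would live in NiFi flows backed by HBase lookups; the plain kafka-python consumer, the topic name, and the in-memory lookup table are stand-ins invented for this example.

```python
# Sketch: enrich streaming transactions with known locations and flag
# geographic anomalies. Assumes a local Kafka broker and a
# "transactions" topic, both illustrative; the dict stands in for an
# HBase lookup table.
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Stand-in for the operational database of known locations per account.
known_locations = {
    "acct-42": {"US-VA", "US-MD"},
}

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    txn = message.value  # e.g. {"account": "acct-42", "region": "US-VA", ...}
    seen = known_locations.get(txn["account"], set())
    txn["known_regions"] = sorted(seen)             # enrichment
    txn["geo_anomaly"] = txn["region"] not in seen  # simple rule
    if txn["geo_anomaly"]:
        # In production this would be published to an alerts topic
        # for downstream scoring rather than printed.
        print("possible fraud:", txn)
```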
>>Kafka provides you with extremely fast, resilient, and fault-tolerant storage, and it also gives you the consumer APIs that enable a wide variety of applications to leverage the enriched and transformed data in your buffer zone. I'll add that you can also land that data in a distributed file system, which gives you the historical context you're going to need later on for machine learning. The next step in the architecture is to leverage Cloudera SQL Stream Builder, which enables us to write streaming SQL jobs on top of Apache Flink, so we can filter, analyze, and understand the data in the Kafka buffer zone in real time. I'll also add that if you have time series data, or if you need OLAP-style cubing, you can leverage Kudu, while exploratory data analysis and visualization can all be enabled through Cloudera's data visualization technology.

>>All right: we've filtered, analyzed, and explored our incoming data. We can now proceed to train our machine learning models, which will detect anomalous behavior in our historically collected data set. To do this, we can use a combination of supervised, unsupervised, and even deep learning techniques with neural networks, and these models can be tested on new incoming streaming data. Once we've obtained the accuracy and performance scores we want, we can deploy the models into production. And once the models are productionalized, or operationalized, they can be leveraged within our streaming pipeline: as new data is ingested in real time, NiFi can query these models to detect whether the activity is anomalous or fraudulent, and if it is, alert downstream users and systems. So this, in essence, is how fraudulent activity detection works.

>>This entire pipeline is powered by Cloudera's technology, and the IRS is one of the Cloudera customers leveraging our platform today, implementing a very similar architecture to detect fraud, waste, and abuse across a very large set of historical tax data. One of the neat things with the IRS is that they've recently leveraged the partnership between Cloudera and Nvidia to accelerate their Spark-based analytics and machine learning, and the results have been nothing short of amazing. In fact, we have a quote from Joe Ansaldi, the technical branch chief for the Research, Applied Analytics and Statistics division within the IRS: "With zero changes to our fraud detection workflow, we were able to obtain eight times the performance simply by adding GPUs to our mainstream big data servers. This improvement translates to half the cost of ownership for the same workloads." So embedding GPUs into the reference architecture I covered earlier has enabled the IRS to improve its time to insight by as much as 8x while simultaneously reducing the underlying infrastructure costs by half. Cindy, back to you.

>>Thank you, Nasheb. I hope the analysis and information Nasheb and I have provided give you some insight into how Cloudera is helping with the fraud, waste, and abuse challenges within the public sector: working with any and all types of data; bringing information together on the Cloudera platform and analyzing it, whether structured, semi-structured, or unstructured, in batch or in real time; and looking at anomalies with detection methods such as neural network analysis and time series techniques. As a next step, we'd love to have a conversation with you. You can find additional information on how Cloudera works with the federal government at cloudera.com/solutions/public-sector, and we welcome scheduling a meeting. Again, thank you for joining Nasheb and me today; we greatly appreciate your time and look forward to a future conversation.
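To round out the walkthrough, here is a hedged sketch of the unsupervised model-training step Nasheb outlines, using scikit-learn's IsolationForest. The features, distributions, and thresholds are invented; in the architecture above, training would read historical data from the data lake, and the fitted model is what the streaming layer would query as new events arrive.

```python
# Sketch: train an unsupervised anomaly detector on historical
# transaction features, then score new streaming events.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Stand-in for features engineered from the enriched historical data,
# e.g. [amount, transactions_in_last_hour, distance_from_known_region].
normal = rng.normal(loc=[50.0, 2.0, 5.0], scale=[20.0, 1.0, 3.0],
                    size=(5000, 3))

model = IsolationForest(contamination=0.01, random_state=7).fit(normal)

# New events arriving from the streaming pipeline (the second one is odd).
new_events = np.array([
    [55.0, 2.0, 4.0],
    [9000.0, 40.0, 900.0],
])
print(model.predict(new_events))        # 1 = normal, -1 = anomaly
print(model.score_samples(new_events))  # lower scores = more anomalous
```

An isolation forest is a reasonable default here because it learns the shape of normal behavior directly from unlabeled history, matching the normal-behavior-profile framing used throughout the talk.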

Published Date : Aug 5 2021


PUBLIC SECTOR V1 | CLOUDERA


 

>>Good day, everyone.
Thank you for joining me. I'm Cindy Maike, joined by Rick Taylor of Cloudera. We're here to talk about predictive maintenance for the public sector and how to increase asset service reliability. On today's agenda, we'll talk specifically about how to optimize your equipment maintenance and how to reduce costs and asset failure with data and analytics. We'll go into a little more depth on the types of data and the analytical methods we typically see used, and Rick will go over a case study as well as a reference architecture. By basic definition, predictive maintenance is about determining when an asset should be maintained and what specific maintenance activities need to be performed, based on an asset's actual condition or state. It's also about predicting and preventing failures, and performing maintenance on your time and your schedule, to avoid costly unplanned downtime.

>>McKinsey has analyzed predictive maintenance costs across multiple industries and identified an opportunity to reduce overall maintenance costs by roughly 50% with different types of analytical methods. So let's look at three types of maintenance models. First, we have the traditional method, corrective maintenance, where we perform maintenance on an asset after the equipment fails. The challenges with that are unplanned downtime, disruptions to our schedules, and reduced quality in the performance of the asset. Then there's preventive maintenance, where we perform maintenance on a set schedule. The challenge there is that we typically do it regardless of the actual condition of the asset, which results in unnecessary downtime and expense. The focus now is on condition-based, predictive maintenance: leveraging predictive techniques based on actual conditions and real-time events and processes. With that, organizations have seen, again per McKinsey, a 50% reduction in downtime as well as an overall 40% reduction in maintenance costs. That's looking across multiple industries, but let's consider it in the context of the public sector, based on work by the Department of Energy several years ago.

>>What does predictive maintenance mean to the public sector, and what is the benefit? Increased return on investment in assets, reductions in downtime, and lower overall maintenance costs. Corrective or reactive maintenance is performed once there's been a failure; the movement is toward preventive maintenance on a set schedule, and then to predictive maintenance, where we monitor real-time conditions and, most importantly, leverage IoT and data and analytics to further reduce overall downtime. There's a research report by the Department of Energy that goes into more specifics on the opportunity within the public sector. So, Rick, let's talk a little bit about some of the challenges regarding data and predictive maintenance.
>>Some of the challenges include data silos. Historically, our government organizations, and organizations in the commercial space as well, have had multiple data silos spin up over time across business units, with no single view of assets and, oftentimes, redundant information stored across those silos. Couple that with huge increases in data volume, data growing exponentially, along with new types of data we can now ingest: social media, semi-structured and unstructured sources, and the real-time data we can collect from the Internet of Things. So the challenge is to bring all of these assets together and begin to extract intelligence and insights from them, which in turn fuels machine learning and what we call artificial intelligence, and that is what enables predictive maintenance.

>>Let's look specifically at the types of use cases. Rick and I are going to focus on those where we see predictive maintenance coming into procurement, facilities, supply chain, operations, and logistics. There are various levels of maturity here: from monitoring a connected asset or vehicle, through to leveraging data from connected warehouses, facilities, and buildings. All of these bring an opportunity to increase the quality and effectiveness of agency missions, to improve cost efficiency, and to manage risk and safety. On the types of data, beyond the new sources Rick mentioned, the data elements we typically see start with failure history: when has an asset, a machine, or a component within a machine failed in the past?

>>We also bring together maintenance history for a specific machine: are we getting error codes off the machine or asset, and when have we replaced certain components? We look at how the asset has been used and under what operating conditions, pulling data from sensors on the asset. We look at the features of an asset, its engine size, its make and model, and where it's located, and at who has operated the asset, their certifications, their experience, and how they've used it. And then we bring in pattern analysis: what are the operating limits, what service reliability data are we seeing, and is the manufacturer issuing product recall information? (The sketch after this exchange shows one way these elements combine into a model-ready table.) So, Rick, I know the data landscape has really changed; let's go over some of those components.

>>Sure.
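As a hedged aside, the sketch below shows one way the data elements Cindy lists, sensor readings, maintenance history, and failure history, might be joined into a training table. Every table, column, and value is invented for illustration; real inputs would come from the ERP system of record and the sensor feeds Rick describes next.

```python
# Sketch: assemble a predictive-maintenance training table by joining
# sensor aggregates with failure/maintenance history per asset.
import pandas as pd

sensors = pd.DataFrame({
    "asset_id":        ["bus-01", "bus-02", "bus-03"],
    "avg_engine_temp": [92.1, 104.7, 90.3],
    "vibration_rms":   [0.8, 2.9, 0.7],
    "miles_30d":       [4100, 5200, 3800],
})

maintenance = pd.DataFrame({
    "asset_id":           ["bus-01", "bus-02", "bus-03"],
    "days_since_service": [20, 210, 35],
    "error_codes_30d":    [0, 7, 1],
    "failed_within_30d":  [0, 1, 0],   # label from failure history
})

training = sensors.merge(maintenance, on="asset_id")
features = training.drop(columns=["asset_id", "failed_within_30d"])
label = training["failed_within_30d"]
print(training)
```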
>>So this slide depicts some of the inputs that inform a predictive maintenance program. As we've discussed, silos of information such as the ERP system of record hold the spares and the service history. What we want to do is combine that information with sensor data, whether from facility and equipment sensors or, for example, temperature and humidity readings. All of this is combined and then used to develop machine learning models that better inform predictive maintenance, because we need to take into account the environmental factors that may cause additional wear and tear on the asset we're monitoring. Here are some examples of private-sector maintenance use cases that also have broad applicability across government. One of the busiest airports in Europe is running Cloudera on Azure to capture, secure, and correlate sensor data collected from equipment within the airport, the people-moving equipment more specifically: the escalators, the elevators, and the baggage carousels.

>>The objective there is to prevent breakdowns and improve airport efficiency and passenger safety. Another example is a container shipping port. In this case, IoT data and machine learning help the customer recognize how their cargo-handling equipment performs in different weather conditions, understand how usage relates to failure rates, and detect anomalies in transport systems. Another example is Navistar, a leading manufacturer of commercial trucks, buses, and military vehicles. Typically, vehicle maintenance, as Cindy mentioned, is based on miles traveled or on a schedule, the time since the last service. But those are only two of the thousands of data points that can signal the need for maintenance, and as it turns out, unscheduled maintenance and vehicle breakdowns account for a large share of the total cost for a vehicle owner. So to help fleet owners move from a reactive approach to a more predictive model, Navistar built an IoT-enabled remote diagnostics platform called OnCommand.

>>The platform brings in over 70 sensor data feeds from more than 375,000 connected vehicles, including engine performance, truck speed, acceleration, coolant temperature, and brake wear. This data is then correlated with other Navistar and third-party data sources, including weather, geolocation, vehicle usage, traffic, warranty, and parts inventory information. The platform uses machine learning and advanced analytics to automatically detect problems early and predict maintenance requirements. How does a fleet operator use this information? They can monitor truck health and performance from smartphones or tablets, prioritize needed repairs, and identify the nearest service location that has the relevant parts, trained technicians, and available service space. Wrapping up the benefits: Navistar has helped fleet owners reduce maintenance costs by more than 30%. The same platform is also used to help school buses run safely and on time; for example, one school district with 110 buses that travel over a million miles annually reduced the number of PTOs needed year over year, thanks to predictive insights delivered by this platform.

>>So I'd like to take a moment and walk through the data life cycle depicted in this diagram. Data ingested from the edge may include feeds from the factory floor or from connected vehicles, whether trucks, aircraft, heavy equipment, or cargo vessels. Next, the data lands on a secure and governed data platform,
where it's combined with data from existing systems of record to provide additional insights. This platform supports multiple analytic functions working together on the same data while maintaining strict security, governance, and control measures. Once processed, the data is used to train machine learning models, which are then deployed into production, monitored, and retrained as needed to maintain accuracy. The processed data is also typically placed in a data warehouse and used to support business intelligence, analytics, and dashboards. In fact, this data life cycle is representative of one of our government customers doing condition-based maintenance across a variety of aircraft.

>>The benefits they've discovered include less unscheduled maintenance and a reduction in mean man-hours to repair, increased maintenance efficiency, improved aircraft availability, and the ability to avoid cascading component failures, which typically cost more in repair cost and downtime. They're also able to better forecast the requirements for replacement parts and consumables, and last, and certainly very importantly, this leads to enhanced safety. This chart overlays the secure, open-source Cloudera platform used in support of the data life cycle we've been discussing. Cloudera DataFlow provides the data ingest, data movement, and real-time streaming data query capabilities, giving us the ability to bring data in from the assets of interest, from the Internet of Things, while the data platform provides a secure, governed data lake and visibility across the full machine learning life cycle, eliminating silos and streamlining workflows across teams. The platform includes an integrated suite of secure analytic applications, and two that we're specifically calling out here are Cloudera Machine Learning, which supports the collaborative data science and machine learning environment that facilitates machine learning and AI, and the Cloudera Data Warehouse, which supports the analytics and business intelligence, including those dashboards for leadership. Cindy, over to you.

>>Thank you, Rick. I hope that Rick and I have provided you some insights on how predictive maintenance and condition-based maintenance are being used, and can be used, within your respective agencies: bringing together data sources that maybe you're having challenges with today, bringing in more real-time information from a streaming perspective, and blending that industrial IoT data with historical information to help optimize maintenance and reduce costs within each of your agencies. To learn a little bit more about Cloudera and what we're doing around predictive maintenance, please visit cloudera.com/solutions/public-sector, and we look forward to scheduling a meeting with you. On that, we appreciate your time today, and thank you very much.
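To make the train, deploy, monitor, and retrain cycle concrete, here is a hedged sketch that trains a failure classifier on a table like the one assembled earlier. The synthetic data, feature names, and thresholds are invented; in the life cycle Rick describes, training would read from the governed data lake and the model would be deployed through Cloudera Machine Learning.

```python
# Sketch: train and evaluate a failure-prediction model on
# sensor + maintenance features, as in the table built earlier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 2000

# Synthetic features: engine temp, vibration, days since service.
X = np.column_stack([
    rng.normal(95, 8, n),      # avg_engine_temp
    rng.gamma(2.0, 0.5, n),    # vibration_rms
    rng.integers(0, 365, n),   # days_since_service
])
# Failure risk rises with heat, vibration, and service age (illustrative).
risk = 0.02 * (X[:, 0] - 95) + 0.8 * X[:, 1] + 0.004 * X[:, 2]
y = (risk + rng.normal(0, 0.5, n) > 2.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# In production, the model would be monitored and retrained as the
# incoming feature distributions drift from this training data.
```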

Published Date : Aug 4 2021


Cindy Maike & Nasheb Ismaily | Cloudera


 

>>Hi, this is Cindy Mikey, vice president of industry solutions at Cloudera. Joining me today is chef is Molly, our solution engineer for the public sector. Today. We're going to talk about speed to insight. Why using machine learning in the public sector, specifically around fraud, waste and abuse. So topic for today, we'll discuss machine learning, why the public sector uses it to target fraud, waste, and abuse, the challenges. How do we enhance your data and analytical approaches the data landscape analytical methods and Shev we'll go over reference architecture and a case study. So by definition, fraud, waste and abuse per the government accountability office is fraud is an attempt to obtain something about a value through unwelcomed. Misrepresentation waste is about squandering money or resources and abuse is about behaving improperly or unreasonably to actually obtain something of value for your personal benefit. So as we look at fraud and across all industries, it's a top of mind, um, area within the public sector. >>Um, the types of fraud that we see is specifically around cyber crime, uh, looking at accounting fraud, whether it be from an individual perspective to also, uh, within organizations, looking at financial statement fraud, to also looking at bribery and corruption, as we look at fraud, it really hits us from all angles, whether it be from external perpetrators or internal perpetrators, and specifically from the research by PWC, the key focus area is we also see over half of fraud is actually through some form of internal or external are perpetrators again, key topics. So as we also look at a report recently by the association of certified fraud examiners, um, within the public sector, the us government, um, in 2017, it was identified roughly $148 billion was attributable to fraud, waste and abuse. Specifically of that 57 billion was focused on reported monetary losses and another 91 billion on areas where that opportunity or the monetary basis had not yet been measured. >>As we look at breaking those areas down again, we look at several different topics from an out payment perspective. So breaking it down within the health system, over $65 billion within social services, over $51 billion to procurement fraud to also, um, uh, fraud, waste and abuse that's happening in the grants and the loan process to payroll fraud, and then other aspects, again, quite a few different topical areas. So as we look at those areas, what are the areas that we see additional type of focus, there's broad stroke areas? What are the actual use cases that our agencies are using the data landscape? What data, what analytical methods can we use to actually help curtail and prevent some of the, uh, the fraud waste and abuse. So, as we look at some of the analytical processes and analytical use crate, uh, use cases in the public sector, whether it's from, uh, you know, the taxation areas to looking at social services, uh, to public safety, to also the, um, our, um, uh, additional agency methods, we're going to focus specifically on some of the use cases around, um, you know, fraud within the tax area. >>Uh, we'll briefly look at some of the aspects of unemployment insurance fraud, uh, benefit fraud, as well as payment and integrity. So fraud has its, um, uh, underpinnings in quite a few different on government agencies and difficult, different analytical methods and I usage of different data. 
So I think one of the key elements is, you can look at your data landscape and the specific data sources that you need, but it's really about bringing together different data sources across a different variety and a different velocity. Data has different dimensions, so we'll look at structured types of data, semi-structured data, and behavioral data. When we look at predictive models, we're typically looking at historical information; but if we're actually trying to prevent fraud before it happens, or while a case is in flight, which is specifically a use case that Nasheb is going to talk about later, the question becomes: how do I look at more of that real-time, streaming information? How do I take advantage of data, whether it be financial transactions, asset verification, tax records, or corporate filings? And we can also look at more advanced data sources: for investigation-type information, we might be looking at deep learning models around semi-structured or behavioral, unstructured data, whether it be camera analysis and so forth. So, quite a variety of data, and the breadth and the opportunity really come about when you can integrate and look at data across all the different data sources, in essence a more extensive data landscape. Specifically, I want to focus on some of the methods, data sources and analytical techniques that we're seeing used in government agencies, as well as opportunities to look at new methods.

>>So, looking at audit planning, or at the likelihood of non-compliance: we'll see data sources where we're maybe looking at a constituent's profile, we might be investigating the forms they've provided, and we might be comparing that data or leveraging internal data sources, possibly looking at net worth, comparing it against other financial data, and also doing comparisons across other constituent groups. Some of the techniques we use are basic natural language processing (maybe we're going to do some text mining), and we might do some probabilistic modeling, where we're looking at information within the agency and comparing it against, possibly, tax forms. A lot of times this information has historically been handled from a batch perspective, both structured and semi-structured. Typically the data volumes can be low, but we're also seeing those data volumes increase exponentially based on the types of events we're dealing with and the number of transactions; so throughput matters, and Nasheb is going to specifically talk about that in a moment. The other area of opportunity to build on is: how do I actually do compliance? How do I actually conduct audits, or look at potential fraud, and also at areas of under-reported tax information?
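To make the "comparison across other constituent groups" idea concrete, here is a minimal sketch of one such technique: flagging filers whose reported figures deviate sharply from their peer group. The column names, data values, and cutoff are illustrative assumptions, not any agency's actual model.

```python
import pandas as pd

# Hypothetical filings; the columns and values are illustrative only.
filings = pd.DataFrame({
    "filer_id":        ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"],
    "peer_group":      ["retail"] * 4 + ["medical"] * 4,
    "reported_income": [52_000, 49_500, 51_000, 8_000,
                        310_000, 295_000, 305_000, 300_000],
})

# Z-score each filer's reported income against its own peer group.
group = filings.groupby("peer_group")["reported_income"]
filings["z"] = (
    filings["reported_income"] - group.transform("mean")
) / group.transform("std")

# Large negative deviations suggest under-reporting worth an audit look.
# The -1.4 cutoff is an arbitrary illustration, not a real policy threshold.
print(filings[filings["z"] < -1.4][["filer_id", "peer_group", "reported_income", "z"]])
```

In practice an agency would blend many more signals (net worth comparisons, form cross-checks, text-mined notes), but the pattern of profiling a group and scoring deviation from it stays the same.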
So there you might be pulling in some other types of data sources, whether it's property records, data that's being supplied by the actual constituents or by vendors, social media information, geographical information, or even photos. Techniques that we're seeing used are possibly some sentiment analysis and link analysis: how do we actually blend those data sources together with natural language processing? But I think what's important here is also the method and the data velocity, whether it be batch or near real time, again looking at all types of data, whether structured, semi-structured or unstructured. And the key, the value behind this, is: how do we actually increase the potential revenue, or capture the under-reported revenue? How do we stop fraudulent payments before they actually occur? How do we increase the level of compliance, and the potential for prosecution of fraud cases? Additionally, other areas of opportunity could be economic planning: how do we perform link analysis, how do we bring in some of those things we saw in the data landscape around constituent interaction (social media, potentially police records, property records, other tax department database information), and also compare one individual to other individuals, looking at people like a specific constituent: are there areas where we're seeing other aspects of fraud potentially occurring? And as we move forward, some of the more advanced techniques we're seeing around deep learning are computer vision, leveraging geospatial information, social network entity analysis, and agent-based modeling techniques: simulation, Monte Carlo-type techniques that we typically see in the financial services industry, actually applied to fraud, waste and abuse within the public sector. And again, that really lends itself to new opportunities. On that, I'm going to turn it over to Nasheb to talk about the reference architecture behind these use cases.

>>Sure, yeah. Thanks, Cindy. So I'm going to walk you through an example reference architecture for fraud detection using Cloudera's underlying technology. And before I get into the technical details, I want to talk about how this would be implemented at a much higher level. With fraud detection, what we're trying to do is identify anomalies, or anomalous behavior, within our datasets. Now, in order to understand what aspects of our incoming data represent anomalous behavior, we first need to understand what normal behavior is. In essence, once we understand normal behavior, anything that deviates from it can be thought of as an anomaly. And in order to understand what normal behavior is, we're going to need to be able to collect, store and process a very large amount of historical data. And so in comes Cloudera's platform, and the reference architecture you see before you.

>>So let's start on the left-hand side of this reference architecture with the collect phase. Fraud detection will always begin with data collection.
We need to collect large amounts of information from systems that could be in the cloud, in the data center, or even on edge devices, and this data needs to be collected so we can create normal behavior profiles; those normal behavioral profiles are then in turn used to create our predictive models for fraudulent activity. Now, on the data collection side, one of the main challenges that many organizations face in this phase involves using a single technology that can handle data coming in with all different types of formats, protocols and standards, and with different velocities and volumes. Let me give you an example: we could be collecting data from a database that gets updated daily, and maybe that data is being collected in Avro format. At the same time, we could be collecting data from an edge device that's streaming in every second, and that data may be coming in JSON or a binary format. So this is a data collection challenge that can be solved with Cloudera DataFlow, which is a suite of technologies built on Apache NiFi and MiNiFi, allowing us to ingest all of this data through a drag-and-drop interface.

>>So now we're collecting all of the data that's required to map out normal behavior. The next thing we need to do is enrich it, transform it, and distribute it to downstream systems for further processing. Let's walk through how that would work. First, let's take enrichment. For enrichment, think of adding additional information to your incoming data. Take financial transactions, for example, because Cindy mentioned them earlier: you can store the known locations of an individual in an operational database (with Cloudera, that would be HBase), and as an individual makes a new transaction, the geolocation in that transaction data can be enriched with the previously known locations of that very same individual. All of that enriched data can later be used downstream for predictive analytics.

>>So the data has been enriched; now it needs to be transformed. We want the data that's coming in as Avro, JSON, binary, and whatever other formats to be transformed into a single common format, so it can be used downstream for stream processing. Again, this is going to be done through Cloudera DataFlow, which is backed by NiFi. The transformed, normalized data is then streamed into Kafka, and Kafka serves as that central repository of streams, a buffer zone. Kafka provides you with extremely fast, resilient and fault-tolerant storage, and it also gives you the consumer APIs you need to enable a wide variety of applications to leverage that enriched and transformed data within your buffer zone. I'll add that you can also store that data in a distributed file system, to give you the historical context that you're going to need later on for machine learning. The next step in the architecture is to leverage Cloudera SQL Stream Builder, which enables us to write streaming SQL jobs on top of Apache Flink, so we can filter, analyze and understand the data that's in the Kafka buffer zone in real time.
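As a rough illustration of the enrich-transform-distribute flow Nasheb describes, here is a minimal sketch: look up a customer's known locations in HBase, attach them to an incoming transaction, and publish the result to Kafka in one common format (JSON). The table, column family, and topic names are hypothetical, and it uses the community happybase and kafka-python clients rather than any specific Cloudera API.

```python
import json
import happybase          # Thrift-based HBase client
from kafka import KafkaProducer

# Hypothetical connection details and names.
hbase = happybase.Connection("hbase-thrift.example.com")
profiles = hbase.table("customer_profiles")
producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def enrich_and_publish(txn: dict) -> None:
    """Attach known locations from HBase to a transaction, publish to Kafka."""
    row = profiles.row(txn["customer_id"].encode("utf-8"))
    known = row.get(b"loc:known_locations", b"[]")
    txn["known_locations"] = json.loads(known)
    producer.send("enriched_transactions", txn)

# Example: an incoming transaction in whatever shape the source produced.
enrich_and_publish({
    "customer_id": "c-1001",
    "amount": 420.17,
    "geo": {"lat": 28.54, "lon": -81.38},
})
producer.flush()
```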
I'll also add that if you have time series data, or if you need OLAP-style cubing, you can leverage Kudu, while EDA, or exploratory data analysis, and visualization can all be enabled through Cloudera's visualization technology.

>>All right. So, we've filtered, we've analyzed, and we've enriched our incoming data. We can now proceed to train our machine learning models, which will detect anomalous behavior in our historically collected data set. To do this, we can use a combination of supervised, unsupervised, and even deep learning techniques with neural networks. These models can be tested on new incoming streaming data, and once we've obtained the accuracy, performance and F1 scores that we want, we can take these models and deploy them into production. Once the models are productionalized, or operationalized, they can be leveraged within our streaming pipeline: as new data is ingested in real time, NiFi can query these models to detect whether the activity is anomalous or fraudulent, and if it is, alert downstream users and systems. So this, in essence, is how fraudulent activity detection works, and this entire pipeline is powered by Cloudera's technology. Cindy, next slide please.

>>Right. And so the IRS is one of Cloudera's customers that's leveraging our platform today and implementing a very similar architecture to detect fraud, waste and abuse across a very large set of historical tax data. One of the neat things with the IRS is that they've recently leveraged the partnership between Cloudera and Nvidia to accelerate their Spark-based analytics and machine learning, and the results have been nothing short of amazing. In fact, we have a quote here from Joe Ansaldi, the technical branch chief for the research, analytics and statistics division within the IRS: "With zero changes to our fraud detection workflow, we were able to obtain eight times the performance simply by adding GPUs to our mainstream big data servers. This improvement translates to half the cost of ownership for the same workloads." So embedding GPUs into the reference architecture I covered earlier has enabled the IRS to improve their time to insight by as much as 8x, while simultaneously reducing their underlying infrastructure costs by half. Cindy, back to you.

>>Nasheb, thank you. I hope the analysis and information that Nasheb and I have provided gives you some insight into how Cloudera is actually helping with the fraud, waste and abuse challenges within the public sector: looking at any and all types of data; how the Cloudera platform brings together and analyzes information, whether structured, semi-structured or unstructured, in batch or in real time; looking at anomalies and the detection methods around them; and looking at neural network analysis and time series information. So, next steps: we'd love to have an additional conversation with you, and you can find additional information on how Cloudera is working in federal government at cloudera.com/solutions/public-sector. We welcome scheduling a meeting with you. Again, thank you for joining us today; we greatly appreciate your time and look forward to future conversations. Thank you.
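To ground the model training and scoring flow just described, here is a minimal, generic sketch of unsupervised anomaly detection with scikit-learn's IsolationForest, one option among the supervised, unsupervised, and deep learning techniques Nasheb mentions, and not the IRS's actual workflow. The features and contamination rate are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Historical "normal behavior" profile: amount and hour-of-day per transaction.
historical = np.column_stack([
    rng.normal(80, 20, 5_000),   # typical transaction amounts
    rng.normal(13, 3, 5_000),    # typical transaction hours
])

# Fit on history so the model learns what normal looks like.
model = IsolationForest(contamination=0.01, random_state=42).fit(historical)

# Score new streaming events: predict() returns -1 for anomalous, 1 for normal.
new_events = np.array([[85.0, 14.0], [4_900.0, 3.0]])
labels = model.predict(new_events)
scores = model.decision_function(new_events)   # lower = more anomalous
for event, label, score in zip(new_events, labels, scores):
    status = "ANOMALY" if label == -1 else "ok"
    print(f"{event} -> {status} (score={score:.3f})")
```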

Published Date : Jul 22 2021

SUMMARY :

Cindy Maike and Nasheb Ismaily of Cloudera discuss why the public sector uses machine learning to target fraud, waste and abuse, surveying the data landscape and analytical methods involved, from batch analysis of structured filings to real-time scoring of streaming transactions. Nasheb walks through a reference architecture built on Cloudera DataFlow, Kafka, Flink and machine learning models, and a case study covers the IRS, which leveraged the Cloudera-Nvidia partnership to obtain eight times the performance at half the cost of ownership for its fraud detection workloads.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Cindy Mikey | PERSON | 0.99+
Nvidia | ORGANIZATION | 0.99+
Molly | PERSON | 0.99+
Nasheb Ismaily | PERSON | 0.99+
PWC | ORGANIZATION | 0.99+
Joe | PERSON | 0.99+
Cindy | PERSON | 0.99+
Cloudera | ORGANIZATION | 0.99+
2017 | DATE | 0.99+
Cindy Maike | PERSON | 0.99+
Today | DATE | 0.99+
over $65 billion | QUANTITY | 0.99+
today | DATE | 0.99+
NIFA | ORGANIZATION | 0.99+
over $51 billion | QUANTITY | 0.99+
57 billion | QUANTITY | 0.99+
salty | PERSON | 0.99+
single | QUANTITY | 0.98+
first | QUANTITY | 0.98+
Jason | PERSON | 0.98+
one | QUANTITY | 0.97+
91 billion | QUANTITY | 0.97+
IRS | ORGANIZATION | 0.96+
Shev | PERSON | 0.95+
both | QUANTITY | 0.95+
Avro | PERSON | 0.94+
Apache | ORGANIZATION | 0.93+
eight | QUANTITY | 0.93+
$148 billion | QUANTITY | 0.92+
zero changes | QUANTITY | 0.91+
Richmond | LOCATION | 0.91+
Sheva | PERSON | 0.88+
single technology | QUANTITY | 0.86+
Cloudera | TITLE | 0.85+
Monte Carlo | TITLE | 0.84+
eight times | QUANTITY | 0.83+
cloudera.com | OTHER | 0.79+
Kafka | TITLE | 0.77+
second | QUANTITY | 0.77+
one individual | QUANTITY | 0.76+
coffin | PERSON | 0.72+
Kafka | PERSON | 0.69+
Jace | TITLE | 0.69+
SQL | TITLE | 0.68+
17 | QUANTITY | 0.68+
over half | QUANTITY | 0.63+
Chevy | ORGANIZATION | 0.57+
elements | QUANTITY | 0.56+
half | QUANTITY | 0.56+
mini five | COMMERCIAL_ITEM | 0.54+
Apache Flink | ORGANIZATION | 0.52+
HBase | TITLE | 0.45+

Morgan McLean, Google Cloud Platform & Ben Sigelman, LightStep | KubeCon + CloudNativeCon EU 2019


 

>> Live from Barcelona, Spain, it's theCUBE, covering KubeCon, CloudNativeCon Europe 2019. Brought to you by Red Hat, the Cloud Native Computing Foundation and Ecosystem Partners.

>> Welcome back. This is theCUBE's coverage of KubeCon, CloudNativeCon 2019. I'm Stu Miniman; my co-host for two days of wall-to-wall coverage is Corey Quinn. Happy to welcome back to the program, first, Ben Sigelman, who is the co-founder and CEO of LightStep. And welcome to the program, for the first time, Morgan McLean, who's a product manager at Google Cloud Platform. Gentlemen, thanks so much for joining us.

>> Thanks for having us.

>> Yeah.

>> All right, so this was a last-minute add for us, because you guys had some interesting news in the keynote. I think the feedback everybody's heard is that there are too many projects and everything's overlapping, and how do I make a decision? But the interesting piece is that OpenCensus, which Morgan was working on, and OpenTracing, which Ben and LightStep were working on, are now moving together into OpenTelemetry, if I've got it right.

>> Yup.

>> So, is it just everybody's holding hands and singing Kumbaya around the Kubernetes campfire, or is there something more to this?

>> Well, I mean, it started when the CNCF locked us in a room and told us there were too many projects. (Stu and Ben laughing) Really wouldn't let us leave. No, to be fair, they did actually take us to a room and really start the ball rolling. But conversations have picked up over the last few months, and personally I'm just really excited that it's gone so well. If you'd told me six or nine months ago that this would happen, given just the way the projects were going (both were growing very quickly), I would've been a little skeptical. But seriously, this merger has gone beyond my wildest dreams. It's awesome, both to unite the communities and to unite the projects together.

>> What has the response been from the communities on this merger?

>> Very positive.

>> Yeah.

>> Very positive. I mean, OpenTracing and OpenCensus are both projects with healthy user bases that are growing quickly and all that, but the reason people adopt them is to future-proof their own software. They want to adopt something that's going to be here to stay. And having these two things out in the world that were both successful and overlapping in terms of their goals, I think the presence of two projects was actually really problematic for people. So the fact that they're merging is net positive, absolutely, for the end user community, and also for the vendor community; it's almost exactly the same parallel thought process. The CNCF did broker an in-person meeting where they gave us some space, and we all got together, I don't know how many people were there, like 20 or 30 people in that room.

>> They did let us leave the room though, yesterday. Yeah, that was nice.

>> They did let us leave the room, that's true. We were not locked in there. (Morgan laughing) But in the beginning, they essentially asked everyone to state what their goals were, and almost all of us really had the same goal: to make it easy for end users to adopt a telemetry project that they can stick with for the long haul. And so when you think of it in that respect, the merger seems completely obvious. It is true that it doesn't happen very often, and we could speculate about why that is.
But I think in this case it was enabled by the fact that we had pretty good social relationships with the OpenCensus people. I think Twitter tends to amplify negativity in the world in general; as I'm sure people know, not a controversial statement.

>> News alert, wait, absolutely; the negatives are amplified. It's something in the algorithm, I think.

>> Yeah, yeah.

>> Maybe they should fix that.

>> Yeah, yeah, (laughs) exactly. And it was funny, there was a lot of perceived animosity between OpenTracing and OpenCensus a year ago, nine months ago. But when you actually talked to the principals in the projects, and even just the general-purpose developers doing a huge amount of work for both projects, that wasn't a sentiment that was widely held or widely felt, I think. So it has been a very kind of happy, it's a huge relief frankly; this whole thing has been a huge relief for all of us, I think.

>> Yeah, it feels like the general ask has always been for tracing that doesn't suck. And that tends to be a bit of a tall order. The way they seem to have responded to it is a credit to the maturity of the community. And I think it also speaks to a growing realization that no one wants a monoculture of just one option, any color you want so long as it's black (Ben laughing), versus 500 different things you can pick that all stand in the same spot, at which point analysis paralysis kicks in. So this feels like a net positive for absolutely everyone involved.

>> Definitely. Yeah, one of the anecdotes that Ben and I have shared throughout a lot of these interviews is that there were a lot of projects that wanted to include distributed tracing. So various web frameworks; I think it was Hadoop or HBase that was-

>> HBase and HDFS were jointly deciding what to do about instrumentation.

>> Yeah, and so they would publish an issue on GitHub, and someone from OpenTracing would respond saying, hey, OpenTracing does this. And they'd be like, oh, that's interesting, we can go build an implementation; and then someone from OpenCensus would respond on the issue and say, no wait, you should use OpenCensus. And with these being very similar yet incompatible APIs, groups like HBase would sit there and think: this isn't mature enough, I don't want to deal with this, I've got more important things to focus on right now. And rather than even picking one and ignoring the other, they just ignored tracing, right? With things moving to microservices, with Kubernetes being so popular (I mean, just look at this conference), distributed tracing is no longer a nice-to-have for when you're a big company; you need it to understand how your app works and to understand the cause of an outage, the cause of a problem. And when you had organizations like this looking at tracing instrumentation and saying this is a bit of a joke, with two competing projects, no one was being served well.

>> All right, so you talked about the incompatible APIs. How do we get from where we were to where we're going?

>> So I can talk about that a little bit. The APIs are conceptually incredibly similar. And part of the criteria for any new language for OpenTelemetry is that we are able to build a software bridge to both OpenTracing and OpenCensus that will translate existing instrumentation alongside OpenTelemetry instrumentation and emit the correct data at the end. We've built that out in Java already, and have started working on a few other languages.
It's not a tremendously difficult thing to do if that's your goal. I've worked on this stuff (I started working on Dapper in 2004, so it's been 15 years that I've been working in this space), and I have a lot of regrets about what we did with OpenTracing. And I had this unbelievably tempting thing: to start greenfield, like, let's do it right this time. I'm suppressing every last impulse to do that. The only goal for this project, technically, is backwards compatibility.

>> Yeah.

>> 100% backwards compatibility. There's the famous xkcd comic where you have 14 standards, and someone says we need to create a new standard that will unify across all 14 standards, and now you have 15 standards. We don't want to follow that pattern. And by having the leadership from OpenTracing and OpenCensus involved wholesale in this new effort, as well as having these compatibility bridges, we can avoid the fate of IPv6, of Python 3 and things like that, where the new thing is very appealing but it's so far from the old thing that you literally can't get there incrementally. So our entire design constraint is: make sure that backwards compatibility works, get to one project, and then we can think about the grand unifying theory of observability--

>> Ben, you are ruining the best thing about standards, which is that there are so many of them to choose from. (everyone laughing)

>> There are still plenty more growing in other areas, (laughs) just in this particular space it's smaller.

>> One could argue that your approach is nonstandard in its own right. (Ben laughing) And in my own experiments with distributed tracing, it seems like step one is: first you have to go back and instrument everything you've built. And step two: hey, come back here, because that's a lot of work. The idea of an organization going back and reinstrumenting everything they've already instrumented the first time-

>> It's unlikely.

>> Unless they build things very modularly and very portably to do exactly that, it's a bit of a heavy lift.

>> I agree, yeah.

>> So going forward, are people who have deployed one or the other of your projects going to have to go back and do a reinstrumentation, or will they unify and continue to work as they are?

>> So, I'd be making up the statistic, so I shouldn't, but let's say a vast majority, I'm thinking like 95, 98%, of instrumentation is actually embedded in frameworks and libraries that people depend on. So you need Dropwizard, and Spring, and Django, and Flask, and Kafka; things like that need to be instrumented. For the application code, the instrumentation burden is a bit lower. We announced something called SpecialAgent at LightStep last week, separate from all of this. It's kind of a funny combination: a typical APM agent will interpose on individual function calls, which is a very complicated and heavyweight thing. This doesn't do any of that. Instead, it basically surveys what you have in your process, looks for OpenTracing (and, in the future, OpenTelemetry) instrumentation that matches that, and then installs it for you. So you don't have to do any of the manual work, basically gluing tab A into slot B, which is what most OpenTracing instrumentation actually looks like these days; you can get off the ground without doing any code modifications. So I think that direction, which is totally portable and vendor-neutral as well, as a layer on top of telemetry, makes a ton of sense.
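To illustrate the kind of bridge Ben describes, here is a minimal sketch using the OpenTracing shim that ships alongside the OpenTelemetry Python SDK (the opentelemetry-opentracing-shim package). Module paths and processor names have shifted across releases, so treat this as an approximation rather than the project's definitive API; the point is that legacy OpenTracing calls and new OpenTelemetry calls feed the same pipeline.

```python
import opentracing
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.shim.opentracing_shim import create_tracer

# Stand up a real OpenTelemetry SDK tracer that exports spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# The shim wraps the OTel provider in an OpenTracing-compatible facade.
opentracing.set_global_tracer(create_tracer(trace.get_tracer_provider()))

# Legacy OpenTracing instrumentation runs unchanged...
with opentracing.global_tracer().start_active_span("legacy-operation") as scope:
    scope.span.set_tag("component", "billing")

# ...while new code uses the OpenTelemetry API directly, in the same trace pipeline.
with trace.get_tracer(__name__).start_as_current_span("new-operation"):
    pass
```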
There are also data translation efforts that are part of OpenCensus being ported into OpenTelemetry, which also serve to repurpose existing sources of correlated data. So all of these things are ways to take existing software and get it into the new world without requiring any code changes or redeploys.

>> The long-term goal has always been that, because web framework and client library providers will build the instrumentation into those directly, when you're writing your own service that you're deploying in Kubernetes or somewhere else, just by linking one of the OpenTelemetry implementations you get all of that tracing and context propagation out of the box. You, as an individual developer, are only using the APIs to define custom metrics, custom spans, things that are specific to your business.

>> So Ben, you didn't name LightStep the same as your project. But that being said, a major piece of your business is going through a change here. What does this mean for LightStep?

>> That's actually not the way I see it, for what it's worth. LightStep as a product, since you're giving me an opportunity to talk about it (laughs), foolish move on your part... No, I'm just kidding. LightStep as a product is totally omnivorous; we don't really care where the data comes from. Translating any source of data that has a correlation ID and a timestamp is a pretty trivial exercise for us. So we support OpenTracing, we also support OpenCensus for what it's worth, we'll support OpenTelemetry, and we support a bunch of weird in-house things people have already built. We don't care about that at all. The reason we're pursuing OpenTelemetry is two-fold. One is that we do want to see high-quality data coming out of projects. We said it at the keynote this morning: observability literally cannot be better than your telemetry. If your telemetry sucks, your observability will also suck; that's just definitionally true, if you go back to the definition of observability from the '60s. So we want high-quality telemetry so our product can be awesome. Also, just as an individual, I'm a nerd about this stuff and I just like it. A lot of my motivation for working on this is that I personally find it gratifying. It's not really a commercial thing; I just like it.

>> Do you find, as you start talking about this more and more with companies that are becoming cloud-native rapidly, either through digital transformation or by springing fully formed from the forehead of some god, however these born-in-the-cloud companies tend to arise, that they intuitively grasp the value of tracing? Or does this wind up being a much heavier lift as you start showing them the golden path, as it were?

>> It's definitely grown, like I-

>> Well, I think with the value of tracing, you see it after you see the negative value of a really catastrophic outage.

>> Yes.

>> I was just talking to a bank (I won't name the bank, but a bank at this conference), and they were talking about their own adoption of tracing, which was pretty slow until they had a really bad outage, where they couldn't transact for an hour and they didn't know which of their 200 services was responsible for the issue. That really put some muscle behind their tracing initiative. So typically it's inspired by an incident like that, and then it's a bit reactive. Sometimes it's not, but either way, you end up in that place eventually.
>> I'm a strong proponent of distributed tracing, and I feel very seen by your last answer. (Ben laughing)

>> But it's definitely made a big impact. If you came to conferences like this two years ago, you'd have Adrian, or Yuri, or someone doing a talk on distributed tracing, and they would always start by asking the 100-to-200-person audience: who here knows what distributed tracing is? And like five people would raise their hand, and everyone else would be like, no, that's why I'm here at this talk, I want to find out about it. You go to those talks now, or even last year, and they have 400 people, and you ask who knows what distributed tracing is, and last year over half the people would raise their hand; now it's going to be even higher. And beyond anecdotes, clearly businesses are finding the value, because they're implementing it. You can see that in the number of companies that have an interest in OpenTracing, OpenTelemetry, and OpenCensus, and you can see it in the growth of startups in this space, LightStep and others.

>> The other thing I like about OpenTelemetry as a name: it's a bit of a mouthful, but it's important for people to understand the distinction between telemetry and tracing data on the one hand, and actual solutions on the other. OpenTelemetry stops when the correct data is being emitted. What you do with that data is your own business. And I also think people are realizing that tracing is more than just visualizing a single distributed trace.

>> Yeah.

>> The traces contain an enormous amount of information about resource usage, security patterns, access patterns, large-scale performance patterns embedded in thousands of traces; that sort of data is making its way into products as well. And I really like that OpenTelemetry has clearly delineated that it stops with the telemetry. OpenTracing was confusing for people: they'd want tracing, they'd adopt OpenTracing, and then ask, where's my UI? And it's like, no, it's not that kind of project. With OpenTelemetry I think we've been very clear: this is about getting--

>> The name is more clear, yeah.

>> --very high-quality data in a portable way with minimal effort. Then you can use that in any number of ways, and I like that distinction; I think it's important.

>> Okay, so how do we make sure that the combination of these two doesn't just get watered down to the least common denominator, or that Ben doesn't just get upset and say, forget it, I'm going to start from scratch and do it right this time? (Ben laughing)

>> I'm not sure I see either of those two happening. To your comment about the least common denominator: two years ago we were starting from very little prior art. Yeah, you had projects like Zipkin, and Zipkin had its own instrumentation, but it was just for tracing, it was just for Zipkin. And you had Jaeger with its own. So I think we're so far beyond that; in a few years, the least common denominator will be dramatically better than what we have today. (laughs) At this stage, I'm not even remotely worried about that. And secondly, on some vendor, I know, because Ben had just exampled this-

>> Some vendor, some vendor.

>> -that's probably not the best example. But as for vendor interference in these projects, I really don't see it, both because of what we talked about earlier, where the vendors right now want more telemetry.
I meet with them, Ben meets with them, we all meet with them all the time; we work with them. And the biggest challenge we have is just that the data we get is bad, right? Either we don't support certain platforms, or we get traces that dead-end at certain places, or we don't get metrics with the same name for certain types of telemetry. And so this project is going to fix that, and it's going to solve this problem for a lot of vendors, who frankly have a really strong economic incentive to play ball and to contribute to it.

>> Do you see this merging of the two projects as offering an opportunity for either of you to fix, or revisit if not fix, some of the mistakes, as it were, of the past? I know every time I build something, I look back and it was frankly terrible, because that's the kind of developer I am. But do you see this, as someone who's presumably much better at developing than I've ever been, as the opportunity to unwind some of the decisions you made earlier on, out of either ignorance or things not working out as well as you'd hoped?

>> There are a couple of things about each project that we see an opportunity to correct here, without doing any damage to the compatibility story. OpenTracing was just a bit too narrow. I would talk a lot about how we want to describe the software, not the tracing system; but we kind of made a mistake in that we called it OpenTracing. Really, if a request comes in, people want to describe that request and then have it go to their tracing system, but also to their metrics system, to their logging stack, and to anywhere else, their security system. You should only have to instrument that once. So OpenTracing was a bit too narrow. OpenCensus (we've talked about this a lot) built a really high-quality reference implementation into the product, the OpenCensus product I mean, and that coupling created problems for vendors to adopt, and it was a bit thick for some end users as well. So we're still keeping the reference implementation, but it's now cleanly decoupled. We have loose coupling, a la OpenTracing, but wider scope, a la OpenCensus. And in that aspect, I think philosophically this OpenTelemetry effort has taken the best of both worlds from the two projects it started with.

>> All right. Well, Ben and Morgan, thank you so much for sharing. Best of luck, and let us know if the CNCF needs to pull you guys into a room a little bit more to help work through any of the issues. (Ben laughing) But thanks again for joining us.

>> Thank you so much.

>> Thanks for having us, it's been a pleasure.

>> Yeah.

>> All right. For Corey Quinn, I'm Stu Miniman. We'll be back to wrap up day one of two days of live coverage here from KubeCon, CloudNativeCon 2019, Barcelona, Spain. Thanks for watching theCUBE. (soft instrumental music)

Published Date : May 21 2019

SUMMARY :

Ben Sigelman of LightStep and Morgan McLean of Google Cloud Platform join theCUBE at KubeCon + CloudNativeCon EU 2019 to discuss merging OpenTracing and OpenCensus into OpenTelemetry. They explain how two overlapping projects created analysis paralysis for would-be adopters, why 100% backwards compatibility (via software bridges to both predecessor APIs) is the new project's core design constraint, and how OpenTelemetry deliberately stops at emitting high-quality, portable telemetry, leaving what you do with the data to the tools of your choice.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Ben Sigelman | PERSON | 0.99+
2004 | DATE | 0.99+
Corey Quinn | PERSON | 0.99+
Stu Miniman | PERSON | 0.99+
Morgan | PERSON | 0.99+
20 | QUANTITY | 0.99+
Ben | PERSON | 0.99+
Red Hat | ORGANIZATION | 0.99+
Cloud Native Computing Foundation | ORGANIZATION | 0.99+
Stu | PERSON | 0.99+
100 | QUANTITY | 0.99+
Python 3 | TITLE | 0.99+
two projects | QUANTITY | 0.99+
yesterday | DATE | 0.99+
last year | DATE | 0.99+
Java | TITLE | 0.99+
five people | QUANTITY | 0.99+
15 years | QUANTITY | 0.99+
thousands | QUANTITY | 0.99+
LightStep | ORGANIZATION | 0.99+
Adrian | PERSON | 0.99+
last week | DATE | 0.99+
both | QUANTITY | 0.99+
400 people | QUANTITY | 0.99+
two days | QUANTITY | 0.99+
KubeCon | EVENT | 0.99+
30 people | QUANTITY | 0.99+
Morgan McLean | PERSON | 0.99+
two | QUANTITY | 0.99+
200 services | QUANTITY | 0.99+
each project | QUANTITY | 0.99+
CNCF | ORGANIZATION | 0.99+
nine months ago | DATE | 0.99+
Yuri | PERSON | 0.99+
two things | QUANTITY | 0.99+
OpenCensus | TITLE | 0.99+
Both | QUANTITY | 0.99+
Twitter | ORGANIZATION | 0.99+
one | QUANTITY | 0.99+
OpenCensus | ORGANIZATION | 0.99+
Barcelona, Spain | LOCATION | 0.99+
OpenTracing | TITLE | 0.99+
CloudNativeCon | EVENT | 0.98+
two years ago | DATE | 0.98+
95, 98% | QUANTITY | 0.98+
200 person | QUANTITY | 0.98+
Ecosystem Partners | ORGANIZATION | 0.98+
one option | QUANTITY | 0.98+
one project | QUANTITY | 0.98+
first time | QUANTITY | 0.98+
two-fold | QUANTITY | 0.98+
both projects | QUANTITY | 0.97+
six | DATE | 0.97+
Google | ORGANIZATION | 0.97+
two years ago | DATE | 0.97+
15 standards | QUANTITY | 0.97+
first | QUANTITY | 0.97+
LightStep | TITLE | 0.96+
GitHub | ORGANIZATION | 0.96+
CloudNativeCon 2019 | EVENT | 0.96+
'60s | DATE | 0.96+
OpenTracing | ORGANIZATION | 0.96+
Zipkin | ORGANIZATION | 0.96+

Susan St. Ledger, Splunk | Splunk .conf18


 

>> Live from Orlando, Florida, it's theCUBE, covering .conf18. Brought to you by Splunk.

>> Welcome back to Orlando, everybody. I'm Dave Vellante with my co-host Stu Miniman, and you're watching theCUBE, the leader in live tech coverage. We're here at Splunk .conf18, hashtag #splunkconf18. Susan St. Ledger is here, she's the president of worldwide field operations at Splunk. Susan, thanks for coming on theCUBE.

>> Thanks so much for having me today.

>> You're welcome. So, we've been reporting, actually this is our seventh year, we've been watching the evolution of Splunk going from sort of hardcore IT ops and sec ops, now really evolving and doing some of the things that, when everybody talked about big data back in the day, Splunk really didn't. They talked about doing all these things that people are actually using Splunk for now, so it's really interesting to see that this has been a big tailwind for you guys. But anyway, big week for you. How do you feel?

>> I feel incredible. We announced more innovations today, just today, than we have probably in the last three years combined, and we have another big set of innovations to announce tomorrow. Just as an indicator of that, I think you heard Tim, our CTO, say on stage today that we have 282 patents to date, we are one of the world leaders in terms of the number of patents we hold, and we have 500 pending. So if you think about 282 since the inception of the company and 500 pending, it's a pretty exciting time for Splunk.

>> People talk about that flywheel. Stu and I were talking earlier about some of the financial metrics, and you have a lot of large deals, seven-figure deals, which you guys pointed out on your call. That's the outcome of having happy customers; it's not something you can reverse-engineer, you just serve customers, and that's what they do. Talk about how Splunk Next is really bringing you into new areas.

>> Yeah, so Splunk Next is so exciting. There are really three major pillars, design principles if you will, to Splunk Next. One is to help our customers access data wherever it lives, another is to get actionable outcomes from the data, and the third is to unleash the power of Splunk to more users. So those are the three pillars. And if you think about how we got there: we have all of these people within IT and security who are the experts on Splunk, the Splunk ninjas if you will, and they see the power of Splunk and how it can help all these other departments, so they're being pulled in to help those other departments. And they're basically saying: Splunk, help us help our business partners; make it easier to get there, to help them unleash the power of Splunk for themselves, so they don't necessarily need us for all of their needs. That's really what Splunk Next is all about: again, making data access easier, actionable outcomes, and then more users. We're really excited about it.

>> So talk about those new users. Obviously the IT ops folks, they're your peeps. Are they advocating for you into the line of business, or are you being dragged into the line of business? What's that dynamic like?

>> It's definitely customer success first: we're listening to our customers, and they're asking us to go there with them. We are being pulled. What our deepest customers understand about us is that everybody needs Splunk; it's just that not everyone knows it yet.
And they're teaching their business why they need it, so it's really a powerful thing. We're partnering with them on how we help them create business applications, which you'll see tomorrow in our announcements, to help their business users.

>> You know, one of the things that strikes us: we were talking to a DevOps gentleman earlier, and when you look at the companies that are successful with so-called digital transformation, they have data at the core. I don't want to say a single data model, but it's not a data model of stovepipes, and that's what he described. Essentially, if I understand the power of Splunk just from talking to some of your customers, it's really that singular data model that everybody can collaborate on and get advice from each other across the organization, not this sort of stovepipe model. It seems like a fundamental linchpin of digital transformation, even though you guys haven't been overusing that term. Sort of a signature of Splunk: you didn't use the big data term when big data was all hot, and it's the same thing with digital transformation. But it would seem to me that you're fundamental to a lot of companies' digital transformation.

>> That's exactly right. We started in IT and security, and the reason for that is they were the first ones to truly do digital transformation; those are just the two organizations that started it. But given the way they did it, now all the other business units are trying to do it, and there's no reason that the same exact platform we use there can't be used for those other areas, those other functions. But if we want to go there faster, we have to make it easier to use Splunk, and that's what you're seeing with Splunk Next.

>> You know, I look at my career, and for the last couple of decades we've been saying we're going to leverage data, we want to be predictive with the models. But with the latest wave of AI, ML and deep learning, and what I heard in what you're talking about with Splunk Next, maybe you could talk a little bit about why it's real now, why we're actually going to be able to do more with our data, to extract the value out of it and really enable businesses.

>> Sure. I think machine learning is at the heart of it, and we actually do two things from a machine learning perspective. Number one, within each of our market groups, so IT, security, IT operations, we have data scientists who work to build models within our applications. We build our own models, and then we're hugely transparent with our customers about what those models are, so they can tweak them if they like; but we pre-build them so customers have them in each of those applications. That's number one, and that's part of the actionable outcomes: ML helps drive actionable outcomes so much faster. The second aspect is the MLTK, the machine learning toolkit, which we give our customers so they can build their own algorithms and leverage all of the models that are out there as well. I think that two-fold approach really helps us accelerate the insights we give to our customers.

>> Susan, how are you evolving your go-to-market model as you think about Splunk Next, and about more line-of-business interactions? What are you doing on the go-to-market side?

>> Yeah, so the go-to-market, when you think about reaching all of those other verticals, is very much going to be about the ecosystem.
It's going to be about the solution provider ecosystem, the ISV ecosystem, and the SIs, both boutique and global, to help us really drive Splunk into all the verticals and meet their needs. That will be one of the big things you see. We will obviously still have our horizontal focus across IT and security, but we are really working to understand the use cases within financial services, the use cases within healthcare, that can be repeated thousands of times. And if you saw some of the announcements today, in particular the Data Stream Processor, which allows you to act on data in motion with millisecond response, that now puts you as close to real time as anything we've ever seen in the data landscape, and it's going to open up a series of use cases that nobody ever thought of using Splunk for.

>> I wonder what you're hearing from customers when they talk about how they manage that pace of change out there. I walked around the show floor and I've been hearing lots of people talking about containers, and we had one of your customers talking about how Kubernetes fits into what they're doing. It seems like it really is a sweet spot for Splunk that you can deal with all of these different types of information, and it makes it even more important for customers to come to you.

>> Yeah. As you heard from Doug, our CEO, in the keynote today, it is a messy world, and part of the message is that it's a digital explosion that's not going to get any slower; it's just going to continue to get faster. I know you met with some of our customers earlier today, NIF and Carnival. If you think about the landscape at NIF, their mission is to protect the arsenal of nuclear weapons for the country, to make them more efficient, to make them safer. And if you think about all of it, they not only have traditional IT operations and security to worry about, they have this landscape of lasers and all these sensors everywhere. When you look at that, that's the messy data landscape, and I think that's where Splunk is so uniquely positioned, because with our approach you can operate on data in motion or at rest, and there is no structuring upfront.

>> I want to come back to what you said about real time, because I've said this now for a couple of years: you never used to use the term when big data was at the peak of what Gartner calls the hype cycle; you guys didn't use it. So when you think about the use cases in the big data world, you've been hearing about real time forever, and now you're talking about it: the enterprise data warehouse, cheaper EDW, fraud detection, better analytics for the line of business, and obviously security and IT ops. These are some of the use cases we used to hear about in big data. You're doing all of these now, and your platform can be used in all of these traditional big data use cases. Am I understanding that properly?

>> You're 100% understanding it properly. Splunk has really evolved, and if you think about some of the announcements today, think about Data Fabric Search. Rather than saying you have to put everything into one instance, or everything into one place, we're saying we will let you operate across your entire landscape and do your searches at scale. And Splunk was already the fastest at searching across your global enterprise to start with; we were two to three times faster than anybody who competed with us, and now we've improved that today by fourteen hundred percent.
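As a concrete (and entirely hypothetical) illustration of driving such a search programmatically, here is a minimal sketch using Splunk's open-source Python SDK (splunk-sdk). The host, credentials, index, and field names are illustrative assumptions, not a reference to any customer environment.

```python
import time
import splunklib.client as client
import splunklib.results as results

# Hypothetical connection details for a reachable Splunk instance.
service = client.connect(
    host="splunk.example.com", port=8089,
    username="admin", password="changeme",
)

# A simple fraud-flavored aggregation over the last 24 hours.
job = service.jobs.create(
    "search index=transactions sourcetype=payments amount>10000 "
    "| stats count by account_id",
    earliest_time="-24h", latest_time="now",
)
while not job.is_done():      # poll until the search completes
    time.sleep(1)

for row in results.ResultsReader(job.results()):
    print(row)                # each row is a dict-like result record
```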
>> I don't even know where to go with that; you just look at it, and again, it ties back to the innovations and what's being done by our developer community and within our engineering team.

>> In those traditional use cases that I talked about in big data, it was kind of an open source mess, really complex; ZooKeeper is the big joke, right, and always Hive and Pig and HBase and so on, and we're practitioners of a lot of that stuff, so we know it's very complex. Essentially you've got a platform that now can be used, the same platform that you're using in your traditional base, that you're bringing to the line of business. Correct?

>> Right, it's the same exact platform. We are definitely putting the power of Splunk in the users' hands, by doing things like Splunk on mobile and AR today. And again, I wish I could talk about what's coming tomorrow, but let's just say our business users are going to be pretty blown away by what they're going to see tomorrow in our announcements.

>> So I'm presuming this is modern software, microservices, API-based, so if I want to bring in those open source tools, I can?

>> In fact, what you'll actually see when you understand more about the architecture is that we're leveraging a lot of open source in what we do, capabilities like Spark and Flink, but we're masking the complexity of those from the user. So instead of you having to run your own Spark environment, your own Flink environment, and having to figure out Kafka on your own and how you subscribe to it, we're giving you all of that; we're masking all of that for you and giving you the power of leveraging those tools.

>> This becomes increasingly important, in my opinion, especially as you start bringing in things like AI and machine learning and deep learning, because that's going to be adopted both within a platform like yours and outside as well. So you have to be able to bring in innovations from others, but at the same time, to simplify it and reduce that complexity, you've got to infuse AI into your own platform, and that's exactly what you're doing.

>> It's exactly what we're doing. It's in our platform, it's in our applications, and then we provide the toolkit, the SDK if you will, so users can take it to another level.

>> All right, so you've got 16,000 customers today, and if I understand the vision of Splunk Next, you're looking at an order of magnitude more customers as your addressable market. Talk to us about the changes that need to happen in the field. Is it just that you're hitting an inflection point? You've got those evangelists out there, and I see the capes and the fezzes all over the show. How does your field get ready to reach that broader audience?

>> I think that's a great question. I'll tell you what we're doing internally, but once again, it's also about the ecosystem. In order to go broader, it has to be about the Splunk ecosystem, and on the technology side we're opening the aperture: it's microservices, it's APIs, it's cloud; there's so much available for that ecosystem. And then from a go-to-market perspective, it's really about understanding where the use cases are that can be repeated thousands of times, the big problems that each of those verticals is trying to solve, as opposed to the one-corner use case that you could solve for just one customer.
>> And that was actually one of the things we found when we did analysis. We used to do case studies on big data, and the number one use case that always came back was custom, because nothing was repeatable. And that's why we're seeing a little bit more industry-specific focus; I was at Microsoft Ignite last week, and Microsoft is going deep on verticals to get specific, for IoT and AI, on how they can get specific in those environments.

>> Agreed, and I think one of the things that's so unique about the Splunk platform is that because the same platform underlies all of those use cases, we have the ability, in my opinion, to do it in a way that's far less custom than anybody else. And so we've seen the ecosystem evolve as well. Six, seven years ago it was kind of a tiny technology ecosystem, and last year in DC we saw it really starting to expand. Now you walk around here and you see some big booths from some of the SI partners.

>> That's critical, because that's global scale, deep industry expertise, but also board-level relationships.

>> Absolutely. That's another part of the go-to-market: Splunk becomes more strategic.

>> This is a massive TAM expansion that we're potentially witnessing with Splunk. How do you see those conversations changing? Are you personally involved in more of those boardroom discussions?

>> Definitely personally involved, and you're spot on to say that that's what's happening. I think a perfect example is Carnival, who you talked to today. We didn't typically have a lot of CEOs at the Splunk conference; now we have CEOs coming to the Splunk conference, because it is at that level of strategic importance to our customers. When you think about Carnival, yes, they're using it for the traditional IT ops and security use cases, but they're also using it for their customer experience. Who would ever think, ten years ago or even five years ago, of Splunk as a customer experience platform? But really, what's at the heart of customer experience? It's data.

>> So speaking of the CEO of Carnival, Arnold Donald, it's kind of an interesting name, and he stood up on stage today talking about diversity, doubling down on diversity. As an African-American, and frankly, in our industry you don't see a lot of African-American CEOs, you don't see a ton of women CEOs, and you don't see a ton of women with president in their title. He made a really interesting statement, where he said something to the effect of: forty years ago when I started in the business, I didn't work with a lot of people like me. I thought that was a very powerful statement. And he also said, essentially: if we're diverse, we're going to beat you every time. Your thoughts, as an executive in tech and a woman in tech?

>> First of all, I 100% agree with him, and I can actually go back to my start: I was a computer scientist at NSA, so I didn't see a lot of people who looked like me. From that perspective, I know exactly where he's coming from. And I'll tell you, at Splunk we have a huge investment in diversity, not because it's a checkbox, but because we believe in exactly what he says: it's a competitive edge when you get people who think differently, because they came from a different background, because they're a different ethnicity, because they were educated differently, whatever it is, whether it's gender, whether it's ethnicity, whether it's just a different
approach to thinking all differentiation puts a different lens and and that way you don't get stove you don't have stovepipe thinking and I what I love about our culture at spunk is that we we call it a high growth mindset and if you're not intellectually curious and you don't want to think beyond the boundaries then it's probably not a good fit for you and a big part of that is having a diverse environment we do a lot of spunk to drive that we actually posted our gender diversity statistics last year because we believe if you don't measure it you're never going to improve it and it was a big step right to say we want to publish it we want to hold herself accountable and we've done a really nice job of moving it a little over 1% in one year which for our population is pretty big but we're doing really unique things like we have all job descriptions are now analyzed there's actually a scientific analysis that can be done to make sure that the job description does not bias whether men are women whether men alone or whether it's you know gender neutral so that that's exciting obviously we have a big women in technology program and we have a high potential focus on our top women as well what's interesting about your story Susan and we spent a lot of time on the cube talking about diversity generally in women in tech specifically we support a lot of WI t and we always talk him frequently we're talking about women and engineering roles or computer science roles and how they they oftentimes even when they graduate with that degree they don't come into tech and what strikes me about your path is your technical and yet now you've become this business executive so and I would imagine that having that background that technical background only helped in terms of especially in this industry so there are paths beyond just the technical role one hundred percent it first of all it's a huge advantage I believe it's the core reason why I am where I am today because I have the technical aptitude and while I enjoyed the business side of it as much and I love the sales side and the marketing side and all of the above the truth of the matter is at my core I think it's that intellectual curiosity that came out of my technical background that kept me going and really made me very I took risks right and if you look at my career it's much more of a jungle gym than a ladder and the way you know I always give advice to young people who generally it's young women who ask but oh sometimes it's the young men as well which is like how did you get to where you are how do I plan that how do I get and the truth of the matter is you can't if you try and plan it it's probably not going to work out the exactly the way you plan and so my advice is to make sure that you every time you're going to make a move your ask yourself what am I going to learn Who am I going to learn from and what is it going to add to my experience that I can materially you know say is going to help me on a path to where I ultimately want to be but I think if you try and figure it out and plan a perfect ladder I also think that when you try and do a ladder you don't have what I call pivots which is looking at things from different lenses right so me having been on the engineering side on the sales side on the services side of things it gives me a different lens and understanding the entire experience of our customers as well as the internals of an organization and I think that people who pivot generally are people who are intellectually curious and have 
intellectual capacity to learn new things and that's what I look for when I hire people I love that you took a nonlinear progression to the path that you're in now and it's speaking of you know the the technical I think if you're in this business you better like tech or what are you doing in this business but the more you understand technology the more you can connect the dots between how technology is impacting business and then how it can be applied in new ways so well congratulations on your careers you got a long way to go and thanks so much for coming on the queue so much David I really appreciate it thank you okay keep it right - everybody stew and I'll be back with our next guest we're live from Splunk Don Capcom 18 you're watching the cube [Music]

Published Date : Oct 2 2018


Pandit Prasad, IBM | DataWorks Summit 2018


 

>> From San Jose, in the heart of Silicon Valley, it's theCube. Covering DataWorks Summit 2018. Brought to you by Hortonworks. (upbeat music) >> Welcome back to theCUBE's live coverage of DataWorks here in sunny San Jose, California. I'm your host Rebecca Knight along with my co-host James Kobielus. We're joined by Pandit Prasad. He handles analytics projects, strategy, and management at IBM Analytics. Thanks so much for coming on the show. >> Thanks Rebecca, glad to be here. >> So, why don't you just start out by telling our viewers a little bit about what you do, in terms of the relationship with Hortonworks and the other parts of your job. >> Sure, as you said I am in Offering Management, which is also known as Product Management for IBM, and I manage the big data portfolio from an IBM perspective. I was also working with Hortonworks on developing this relationship, nurturing that relationship, and it's been a year since the partnership. We announced this partnership exactly last year at the same conference. And now it's been a year, so this year has been a journey in aligning the two portfolios together. Right, so Hortonworks had HDP and HDF. IBM also had similar products, so we have, for example, Big SQL; Hortonworks has Hive; so how do Hive and Big SQL align together? IBM has Data Science Experience; where does that come into the picture on top of HDP? Before this partnership, if you look into the market, it has been: you sell Hadoop, you sell a SQL engine, you sell data science. What this year has given us is more of a solution sell. Now with this partnership we go to the customers and say, here is an end-to-end experience for you. You start with Hadoop, you put more analytics on top of it, you then bring Big SQL for complex queries and federation and visualization stories, and then finally you put data science on top of it, so it gives you a complete end-to-end solution, the end-to-end experience for getting the value out of the data. >> Now IBM a few years back released a Watson data platform for team data science, with DSX, Data Science Experience, as one of the tools for data scientists. Is Watson data platform still the core, I call it dev ops for data science and maybe that's the wrong term, that IBM provides to market, or is there a broader dev ops framework within which IBM goes to market with these tools? >> Sure, Watson data platform one year ago was more of a cloud platform, and it had many components to it, and now we are getting a lot of those components on to the (mumbles), and Data Science Experience is one part of it, so Data Science Experience... >> So Watson Analytics as well, for subject matter experts and so forth. >> Yes. And again Watson has a whole suite of SaaS-based offerings; Data Science Experience is more of a particular aspect of the focus, specifically on the data science, and that's now available on-prem, and now we are building this on-prem stack, so we have HDP, HDF, Big SQL, Data Science Experience, and we are working towards adding more and more to that portfolio. >> Well you have a broader reference architecture and a stack of solutions, AI and Power and so forth, more for the deep learning development. In your relationship with Hortonworks, are they reselling more of those tools into their customer base to supplement, extend what they already resell, DSX, or is that outside of the scope of the relationship?
>> No, it is all part of the relationship. These three have been the core of what we announced last year, and then there are other solutions. We have the whole governance solution, right, so again it goes back to the partnership: HDP brings with it Atlas. IBM has a whole suite of governance portfolio including the governance catalog. How do you expand the story from being a Hadoop-centric story to an enterprise data-lake story, and now we are taking that to the cloud; that's what Truata is all about. Rob Thomas came out with a blog yesterday morning talking about Truata. If you look at it, it is nothing but a governed data-lake hosted offering, if you want to simplify it. That's one way to look at it; it caters to the GDPR requirements as well. >> For GDPR, for the IBM Hortonworks partnership, is the lead solution for GDPR compliance Hortonworks Data Steward Studio, or is it any number of solutions that IBM already has for data governance and curation, or is it a combination of all of that in terms of what you, as partners, propose to customers for soup-to-nuts GDPR compliance? Give me a sense for... >> It is a combination of all of those, so it has HDP, it has HDF, it has Big SQL, it has Data Science Experience, it has IBM governance catalog, it has IBM data quality, and it has a bunch of security products like Guardium, and it has some new IBM proprietary components that are very specific towards data (cough drowns out speaker) and how do you deal with the personal data and sensitive personal data as classified by GDPR. I'm supposed to query some high-level information, but I'm not allowed to query deep into the personal information, so how do you block those queries, how do you understand those? These are not necessarily part of Data Steward Studio. These are some of the proprietary components that are thrown into the mix by IBM. >> One of the requirements that is not often talked about under GDPR, Ricky of Hortonworks got into it a little bit in his presentation, was the requirement that if you are using an EU citizen's PII to drive algorithmic outcomes, they have the right to full transparency into the algorithmic decision paths that were taken. I remember IBM had a tool under the Watson brand that wraps up a narrative of that sort. It was called Watson Curator a few years back. Is that a solution that IBM still offers? Because I'm getting a sense right now that Hortonworks has a specific solution, not to say that they may not be working on it, that addresses that side of GDPR. Do you know what I'm referring to there? >> I'm not aware of something from the Hortonworks side beyond the Data Steward Studio, which offers basically identification of what some of the... >> Data lineage as opposed to model lineage. It's a subtle distinction. >> It can identify some of the personal information and maybe provide a way to tag it and hence mask it, but the Truata offering is the one that is bringing some new research assets. After the GDPR guidelines became clear, they got into the full details of how do we cater to those requirements. These are relatively new proprietary components; they are not even being productized, which is why I am calling them proprietary components that are going into this hosting service.
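As an aside, the field-level masking Pandit describes can be sketched in a few lines. This is a toy illustration only: the classification tags, column names, and masking rule below are invented for the example and are not part of any IBM or Hortonworks API.

# Hypothetical column classifications, e.g. pulled from a metadata catalog.
COLUMN_TAGS = {
    "name": "personal",
    "diagnosis": "sensitive_personal",
    "region": "public",
    "amount": "public",
}

def mask_row(row, allowed={"public"}):
    # Redact any field whose classification the caller is not cleared for.
    return {col: (val if COLUMN_TAGS.get(col, "public") in allowed
                  else "***REDACTED***")
            for col, val in row.items()}

# An analyst cleared only for public data sees amounts and regions, not PII.
print(mask_row({"name": "Jane Doe", "diagnosis": "flu",
                "region": "EU", "amount": 120.0}))

In a real deployment the tags would come from a governance catalog such as Atlas, and enforcement would sit in the query layer rather than in application code.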
>> IBM's got a big portfolio, so I'll understand if you guys are still working out what position. Rebecca, go ahead. >> I just wanted to ask you about this new era of GDPR. The last Hortonworks conference was sort of before it came into effect, and now we're in this new era. How would you say companies are reacting? Are they in the right space for it, in the sense of do they really understand the ripple effects and how it's all going to play out? How would you describe your interactions with companies in terms of how they're dealing with these new requirements? >> They are still trying to understand the requirements, interpret the requirements, and come to terms with what that really means. For example, I met with a customer and they are a multi-national company. They have data centers across different geos, and they told me: I have somebody from Asia trying to query the data, so the query should go to Europe, but the query processing should not happen in Asia; the query processing all should happen in Europe, and only the output of the query should be sent back to Asia. You wouldn't have been able to think in these terms before the GDPR guidance era. >> Right, exceedingly complicated. >> Decoupling storage from processing enables those kinds of fairly complex scenarios for compliance purposes.
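To make that scenario concrete, here is a toy sketch of region-pinned query processing. The region names, the dispatch function, and the result shape are all hypothetical; this shows only the shape of the constraint, not any vendor's implementation.

DATA_REGION = "eu-frankfurt"

def run_in_region(region, sql):
    # Pretend to submit SQL to a compute cluster pinned to one region.
    assert region == DATA_REGION, "processing must stay in the data's home region"
    return [("EU", 42)]  # placeholder: only aggregated output rows

def query_from(client_region, sql):
    # Processing always runs where the data lives; only results cross regions.
    rows = run_in_region(DATA_REGION, sql)
    return {"served_to": client_region, "rows": rows}

print(query_from("asia-singapore",
                 "SELECT region, COUNT(*) FROM claims GROUP BY region"))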
>> It's not just about the access to data; now you are getting into where the processing happens and where the results are getting displayed, so we are getting... >> Severe penalties for not doing that, so your customers need to keep up. There was an announcement at this show, at DataWorks 2018, of an IBM Hortonworks solution: IBM Hosted Analytics with Hortonworks. I wonder if you could speak a little bit about that, Pandit, in terms of what's provided. It's a subscription service? If you could tell us what subset of IBM's analytics portfolio is hosted for Hortonworks' customers? >> Sure, as you said, it is a hosted offering. Initially we are starting off as a base offering with three products: it will have HDP, Big SQL, IBM Db2 Big SQL, and DSX, Data Science Experience. Those are the three solutions. Again, as I said, it is hosted on IBM Cloud, so customers have a choice of different configurations they can choose, whether it be VMs or bare metal. I should say this is probably the only offering, as of today, that offers a bare metal configuration in the cloud. >> It's geared to data scientist developers, and machine-learning developers will build the models and train them in IBM Cloud, but in a hosted HDP in IBM Cloud. Is that correct? >> Yeah, I would rephrase that a little bit. There are several different offerings on the cloud today, and we can think about them, as you said, for ad-hoc or ephemeral workloads, also geared towards low cost. Think about this offering as taking your on-prem data center experience directly onto the cloud. It is geared towards very high performance. The hardware and the software are all configured and optimized for providing high performance, not necessarily for ad-hoc or ephemeral workloads; they are capable of handling massive, sticky workloads. It is not meant for turning on massive computing power for a couple of hours and then switching it off, but rather for running these massive workloads as if they were located in my data center. That's number one. It comes with the complete set of HDP. If you think about it, currently in the cloud you have Hive and HBase, the SQL engines, and the stories are separate: security is optional, governance is optional. This comes with the whole enchilada. It has security and governance all baked in. It provides the option to use Big SQL, because once you get on Hadoop, the next experience is: I want to run complex workloads, I want to run federated queries across Hadoop as well as other data storage. How do I handle those? And then it comes with Data Science Experience, also configured for best performance and integrated together. As a part of this partnership, I mentioned earlier that we have made progress towards providing this story of an end-to-end solution. The next step of that is: yes, I can say that it's an end-to-end solution, but do the products look and feel as if they are one solution? That's what we are getting into, and I have featured some of those integrations. For example Big SQL, an IBM product: we have been working on baking it very closely into HDP. It can be deployed through Ambari, and it is integrated with Atlas and Ranger for security. We are improving the integrations with Atlas for governance. >> Say you're building a Spark machine learning model inside DSX on HDP, within IBM Hosted Analytics with Hortonworks on HDP 3.0; can you then containerize that machine learning Spark model and deploy it into an edge scenario? >> Sure, first was Big SQL; the next one was DSX. DSX is integrated with HDP as well. We could run DSX workloads on HDP before, but what we have done now is this: if I want to run a Python workload, I need to have the Python libraries on all the nodes that I want to deploy to. Suppose you are running a big cluster, a 500-node cluster: I need to have the Python libraries on all 500 nodes, and I need to maintain the versioning of them. If I upgrade the versions, then I need to go and upgrade and make sure all of them are perfectly aligned. >> In this first version will you be able to build a Spark model and a TensorFlow model, and containerize them and deploy them? >> Yes. >> Across a multi-cloud, and orchestrate them with Kubernetes to do all that meshing. Is that a capability now, or planned for the future within this portfolio? >> Yeah, we have that capability demonstrated at the pedestal today, so that is a new integration. We can run what we call a virtual Python environment. DSX can containerize it and run it against data that is held in the HDP cluster. Now we are making use of both the data in the cluster, as well as the infrastructure of the cluster itself, for running the workloads. >> In terms of the layers of the stack, is it also incorporating the IBM distributed deep-learning technology that you've recently announced? Which I think is highly differentiated, because deep learning is increasingly becoming a set of capabilities that are distributed across a mesh, playing together as if they're one unified application. Is that a capability now in this solution, or will it be in the near future? DDL, distributed deep learning? >> No, not yet. >> I know that's on the AI Power platform currently, gotcha. >> It's what we'll be talking about at next year's conference. >> That's definitely on the roadmap. We are starting with the base configuration of bare metal and VM configurations; the next one, depending on how the customers react to it, is definitely bare metal with GPUs, optimized for TensorFlow workloads. >> Exciting. We'll be tuned in in the coming months and years; I'm sure you guys will have that. >> Pandit, thank you so much for coming on theCUBE. We appreciate it. I'm Rebecca Knight for James Kobielus. We will have more from theCUBE's live coverage of DataWorks, just after this.
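A footnote on the "Python libraries on all 500 nodes" problem Pandit mentions: a common pattern on Spark-backed clusters, and one way to read the virtual-environment feature he describes, is to ship a packed environment with the job instead of pre-installing libraries on every node. The paths and archive name below are hypothetical, and this shows the generic Spark-on-YARN mechanism rather than DSX's exact implementation.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("portable-env-sketch")
    # Distribute one packed virtualenv (e.g. built with conda-pack or
    # venv-pack) to every executor, instead of maintaining libraries
    # by hand on all 500 nodes.
    .config("spark.yarn.dist.archives", "hdfs:///envs/ds_env.tar.gz#ds_env")
    # Point executors at the Python interpreter inside the unpacked archive.
    .config("spark.executorEnv.PYSPARK_PYTHON", "./ds_env/bin/python")
    .getOrCreate()
)

Upgrading the environment then means repacking one archive, not touching every node.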

Published Date : Jun 19 2018


Alan Gates, Hortonworks | DataWorks Summit 2018


 

(techno music) >> (announcer) From Berlin, Germany, it's theCUBE, covering DataWorks Summit Europe 2018. Brought to you by Hortonworks. >> Well hello, welcome to theCUBE. We're here on day two of DataWorks Summit 2018 in Berlin, Germany. I'm James Kobielus. I'm lead analyst for Big Data Analytics in the Wikibon team of SiliconANGLE Media. And who we have here today: we have Alan Gates, who's one of the founders of Hortonworks, and Hortonworks of course is the host of DataWorks Summit. Well, hello Alan. Welcome to theCUBE. >> Hello, thank you. >> Yeah, so Alan, so you and I go way back. Essentially, what we'd like you to do first of all is just explain a little bit of the genesis of Hortonworks. Where it came from, your role as a founder from the beginning, how that's evolved over time, but really how the company has evolved specifically with its focus on the community: the Hadoop community, the open source community. You have a deepening open source stack that you build upon, with Atlas and Ranger and so forth. Give us a sense for all of that, Alan. >> Sure. So as I think is well-known, we started as the team at Yahoo that really was driving a lot of the development of Hadoop. We were one of the major players in the Hadoop community. I was in that team for four years; I think the team itself was going for about five. And it became clear that there was an opportunity to build a business around this. Some others had already started to do so, and we wanted to participate in that. We worked with Yahoo to spin out Hortonworks, and actually they were a great partner in that, helped us get that spun out. And the leadership team of the Hadoop team at Yahoo became the founders of Hortonworks, and brought along a number of the other engineers to help get started. And really at the beginning, it was Hadoop, Pig, Hive, HBase, you know, the beginning projects. So a pretty small toolkit. And our early customers were very engineering-heavy people, or companies who knew how to take those tools and build something directly on those tools, right? >> Well, you, like the Hadoop community as a whole, started off with a focus on the data engineers of the world >> Yes. >> And I think it's shifted, and confirm for me, over time, so that you focus increasingly with your solutions on the data scientists who are doing the development of the applications, and the data stewards, from what I can see at this show. >> I think it's really just a part of the adoption curve, right? When you're early on that curve, you have people who are very into the technology, understand how it works, and want to dive in there. So those tend to be, as you said, the data engineering types in this space. As that curve grows out, it comes wider and wider. There's still plenty of data engineers that are our customers, that are working with us, but as you said, the data analysts, the BI people, data scientists, data stewards, all those people are now starting to adopt it as well. And they need different tools than the data engineers do. They don't want to sit down and write Java code, or, you know, some of the data scientists might want to work in Python in a notebook like Zeppelin or Jupyter, but some may want to use SQL or even Tableau or something on top of SQL to do the presentation. Of course, data stewards want tools more like Atlas to help manage all their stuff.
So that does drive us to, one, put more things into the toolkit, so you see the addition of projects like Apache Atlas and Ranger for security and all that. Another area of growth, I would say, is also the kind of data that we're focused on. So early on, we were focused on data at rest. You know, we're going to store all this stuff in HDFS, and as the kind of data scene has evolved, there's a lot more focus now on a couple things. One is data, what we call data-in-motion, for our HDF product, where you've got a stream manager like Kafka or something like that >> (James) Right >> So there's processing that kind of data. But now we also see a lot of data in various places. It's not just, oh, okay, I have a Hadoop cluster on premise at my company. I might have some here, some on premise somewhere else, and I might have it in several clouds as well. >> OK, your focus has shifted, like the industry in general, towards streaming data in multi-clouds, where it's more stateful interactions and so forth? I think you've made investments in Apache NiFi, so >> (Alan) yes. >> Give us a sense for your NiFi versus Kafka and so forth inside of your product strategy. >> Sure. So NiFi is really focused on that data at the edge, right? So you're bringing data in from sensors, connected cars, airplane engines, all those sorts of things that are out there generating data, and you need to figure out what parts of the data to move upstream and what parts not to. What processing can I do here so that I don't have to move it upstream? When I have an error event or a warning event, can I turn up the amount of data I'm sending in, right? Say this airplane engine is suddenly heating up maybe a little more than it's supposed to. Maybe I should ship more of the logs upstream when the plane lands and connects than I would otherwise. That's the kind o' thing that Apache NiFi focuses on. I'm not saying it runs in all those places, but my point is, it's that kind o' edge processing. Kafka is still going to be running in a data center somewhere. It's still a pretty heavyweight technology in terms of memory and disk space and all that, so it's not going to be run on some sensor somewhere. But it is that data-in-motion, right? I've got millions of events streaming through a set of Kafka topics, watching all that sensor data that's coming in from NiFi and reacting to it, maybe putting some of it in the data warehouse for later analysis, all those sorts of things. So that's kind o' the differentiation there between Kafka and NiFi.
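The adaptive shipping Alan sketches, forward everything when an engine runs hot, only a trickle otherwise, can be captured in a few lines of logic. The threshold, reading format, and ship() transport below are invented for illustration; in practice this would be expressed as a NiFi flow rather than hand-rolled code.

import random

WARN_TEMP_C = 750.0  # hypothetical warning threshold

def should_ship(reading, sample_rate=0.01):
    if reading["temp_c"] >= WARN_TEMP_C:
        return True  # always forward anomalous readings
    return random.random() < sample_rate  # 1% sample of normal readings

def process(readings, ship):
    for r in readings:
        if should_ship(r):
            ship(r)

process([{"engine": 1, "temp_c": 640.0}, {"engine": 1, "temp_c": 812.5}],
        ship=lambda r: print("upstream:", r))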
>> Right, right, right. So, going forward, do you see more of your customers working on internet of things projects? We don't often, at least in the popular mind of the industry, associate Hortonworks with edge computing and so forth. Is that changing? >> I think that we will have more and more customers in that space. I mean, our goal is to help our customers with their data wherever it is. >> (James) Yeah. >> When it's on the edge, when it's in the data center, when it's moving in between, when it's in the cloud. All those places, that's where we want to help our customers store and process their data. Right? So I wouldn't want to say that we're going to focus on just the edge or the internet of things, but that certainly has to be part of our strategy, 'cause it has to be part of what our customers are doing. >> When I think about the Hortonworks community, now we have to broaden our understanding, because you have a tight partnership with IBM, which obviously is well-established, huge and global. Give us a sense, as you guys have teamed more closely with IBM, for how your community has changed or broadened or shifted in its focus, or has it? >> I don't know that it's shifted the focus. I mean, IBM was already part of the Hadoop community. They were already contributing. Obviously, they've contributed very heavily on projects like Spark and some of those, and they continue some of that contribution. So I wouldn't say that it's shifted it; it's just that we are working more closely together as we both contribute to those communities, working more closely together to present solutions to our mutual customer base. But I wouldn't say it's really shifted the focus for us. >> Right, right. Now at this show, we're in Europe right now, but it doesn't matter that we're in Europe: GDPR is coming down fast and furious now. Data Steward Studio, we had the demonstration today, it was announced yesterday. And it looks like a really good tool for the main requirements for compliance, which are to discover and inventory your data and to really set up a consent portal, as I like to refer to it, so the data subject can go and make a request to have their data forgotten, and so forth. Give us a sense, going forward, for how or if Hortonworks, IBM, and others in your community are going to work towards greater standardization in the functional capabilities of the tools and platforms for enabling GDPR compliance. 'Cause it seems to me that the industry's going to need some reference architecture for these kinds of capabilities, so that going forward, your ecosystem of partners can build add-on tools in some common framework, like the one that was laid out today, which looks like a good basis. Is there anything that you're doing in terms of pushing towards more open source standardization in that area? >> Yes, there is. So actually one of my responsibilities is the technical management of our relationship with ODPi, which >> (James) yes. >> Mandy Chessell referenced yesterday in her keynote, and that is where we're working with IBM, with ING, with other companies to build exactly those standards. Right? Because we do want to build it around Apache Atlas. We feel like that's a good tool for the basis of that, but we know, one, that some people are going to want to bring their own tools to it. They're not necessarily going to want to use that one platform, so we want to do it in an open way, so that they can still plug in their metadata repositories and communicate with others. And we want to build the standards on top of that of how do you properly implement these features that GDPR requires, like right to be forgotten, like, you know, what are the protocols around PII data? How do you prevent a breach? How do you respond to a breach? >> Will that all be under the umbrella of ODPi, that initiative of the partnership, or will it be a separate group? >> Well, so certainly Apache Atlas is part of Apache and remains so. What ODPi is really focused on is that next layer up of how do we engage, not the programmers, 'cause programmers can engage really well at the Apache level, but the next level up. We want to engage the data professionals, the people whose job it is, the compliance officers. The people who don't sit and write code, and frankly, if you connect them to the engineers, there's just going to be an impedance mismatch in that conversation. >> You got policy wonks and you got tech wonks, so they understand each other at the wonk level. >> That's a good way to put it.
And so that's where ODPi is really coming in: that group of compliance people that speak a completely different language. But we still need to get them all talking to each other, as you said, so that there are specifications around how do we do this, and what is compliance? >> Well Alan, thank you very much. We're at the end of our time for this segment. This has been great. It's been great to catch up with you, and Hortonworks has been evolving very rapidly. It seems to me that, going forward, you're well-positioned now for the new GDPR age to take your overall solution portfolio, your partnerships, and your capabilities to the next level, really in terms of an open source framework. In many ways, though, you're not entirely 100%, like nobody is, purely open source. You're still very much focused on open frameworks for building very scalable solutions for enterprise deployment. Well, this has been Jim Kobielus with Alan Gates of Hortonworks, here on theCUBE at DataWorks Summit 2018 in Berlin. We'll be back fairly quickly with another guest, and thank you very much for watching our segment. (techno music)

Published Date : Apr 19 2018


Greg Fee, Lyft | Flink Forward 2018


 

>> Narrator: Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans. >> This is George Gilbert. We are at Data Artisans' conference, Flink Forward. It is for the Apache Flink community, sponsored by Data Artisans, and all the work they're doing to move Flink forward, and to surround it with additional value that makes building stream-processing applications accessible to mainstream companies. Right now, though, we are not talking to a mainstream company; we're talking to Greg Fee from Lyft. Not Uber. (laughs) And Greg, tell us a little bit about what you're doing with Flink. What's the first use case that comes to mind that really exercises its capabilities? >> Sure, yeah. So the process of adopting Flink at Lyft really started with a use case, which was, we're trying to make machine learning more accessible across all of Lyft. So we already use machine learning in quite a few applications, but we want to make sure that we use machine learning as much as possible; we really think that's the path forward. And one of the fundamental difficulties with that is having consistent feature generation between these offline, batch-y training scenarios and the online, real-time streaming scenarios. And the unified processing engine of Flink really helps us bridge that gap. >> When you say unified processing engine, are you saying that the fact that you can manage code and data as sort of an application version, and some of the, either code or data, is part of the model, and so you're versioning? >> That's even a step beyond what I'm talking about. >> Okay. >> Just the basic, fundamental ability to have one piece of business logic that you can apply at the batch bulk layer and in the real-time layer. >> George: Yeah. >> So that's sort of like the core of what Flink gives you. >> Are you running both batch and streaming on Flink? >> Yes, that's right. >> And using the, so, you're using the windows? Or just periodic execution on a stream to simulate batch? >> That's right. So we have, so feature generation crosses a broad spectrum of possible use cases in Flink. >> George: Yeah. >> And this is where we sort of transition more into what dA Platform could give for us. So, we're looking to have thousands of different features across all of our machine learning models. So having a platform that can help us host many of these little programs running, help with the application life-cycle of each of these features as we version them over time. So, we're very excited about what dA Platform can do for us. >> Can you tell us a little more about how the stream processing helps you with the feature selection and engineering, and is it that you're using streaming, or simulated batch, or batch using the same programming model to train these models, and you're picking up different derived data? Is that how it's working? >> So, the typical life-cycle is, there's going to be a feature engineering stage, so the data scientist is looking at their data, trying to figure out patterns in the data, and how you apply Flink there is, as you come up with potential algorithms for how you generate your feature, you can run that through Flink, generate some data, apply a machine learning model on top of it, and sort of play around with that data, prototype things. >> So, what you're doing is offline, or out of the platform, you're doing the feature selection and the engineering. >> Man: Right.
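A toy illustration of the consistency point Greg makes: define the feature logic once, then drive it from both a bounded historical source and an unbounded live one. Flink makes this pattern first-class; the field names and sources below are hypothetical, and plain Python stands in for the actual Flink job.

def ride_features(event):
    # One shared definition of the feature, used for training and serving.
    return {
        "ride_id": event["ride_id"],
        "fare_per_km": event["fare"] / max(event["distance_km"], 0.1),
    }

# Batch path: replay historical events to build a training set.
historical = [{"ride_id": 1, "fare": 12.0, "distance_km": 4.0}]
training_rows = [ride_features(e) for e in historical]
print(training_rows)

# Streaming path: the same function applied event by event in real time.
def on_live_event(event, emit=print):
    emit(ride_features(event))

on_live_event({"ride_id": 2, "fare": 30.0, "distance_km": 10.0})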
>> Then you attach a stream to it that has just the relevant, perhaps, the relevant features. >> Man: Right. >> And then that model gets sort of, well maybe not yet, but eventually versioned as part of the application, which includes the rest of the application logic and the data. >> Right. So, like some of the stuff that was touched on this morning at the keynotes, the versioning and maintaining of machine learning applications is a very complex ecosystem. So being able to say, okay, going from the prototype stage, doing stuff in batch, to doing stuff in production and real-time, and then being able to version those over time, to move to better and better versions of the feature generation, is very important to us. >> I don't know if this is the most politically correct thing, but you just explained it better than everyone else we have talked to. >> Great. (laughs) >> About how it all fits together with the machine learning. So, once you've got that in place, it sounds like you're using the dA Platform, as well as, you know, perhaps some extensions for machine learning, to sort of add that as a separate life-cycle besides the application code. Then, is that going to be the enterprise-wide platform for developing and deploying machine learning applications? >> Yes, certainly we think there's probably a broad ecosystem to do machine learning. It's a very, sort of, wide open area. Certainly my agenda is to push it across the company and get as many things running in this system as possible. I think that's the real-time aspect of it, the unifying aspect of what Flink can give us, and what the platform can give us in terms of the life-cycles. >> So, are you set up essentially as a shared resource, a shared service, which is the platform group? >> Man: Right. >> And then all the business units adopt that platform and build their apps on it. >> Right. So my initiative is part of a greater data science platform at Lyft. We have hundreds of data scientists who are going to be looking at this data, giving me little features that they want to do, and we're probably going to end up numbering in the thousands of features, so my goal is being able to generate all those and maintain all those little programs. >> And when you say generate all those little programs, that's the application logic and the models specific to that application? >> That's right, well. >> Or is it? >> There are features that are typically shared across many models. >> Okay. >> So there's like two layers of things happening. >> So you're managing features separately from the models. >> That's right. >> Interesting. Okay, haven't heard that. And is the application manager tooling going to help address that, or is that custom stuff that you have to do? >> So, I think there's a potential that that's the way we're going to manage the model stuff as well, but it's still a little new over there. >> That you put it on the application platform? >> Right. >> Then that's sort of at the boundary of what you're doing right now, or what you will be doing shortly. >> Right. It's all a matter of use-case, whether it's online or offline, and how it fits best in with the rest of the Lyft engineering system. >> When you're talking about your application landscape, do you have lots of streaming applications that feed other streaming applications, going through a hub?
Or are they sort of more discrete, you know, artifacts, discrete programs? And then when do you stay within the streaming processors, and when do you have it in a shared database? >> That's a lot of questions, kind of a deep question. So, the goal is to have a central hub, where sort of all of our event data passes through it, and that allows us to decouple. >> So that's, to be careful, that's not a database central hub, that's a, like a? >> An event hub. >> Event hub. >> Right. >> Yeah, okay. >> So, an event hub in the middle allows us to decompose the different, sort of, smaller programs, which again are probably going to number in the thousands, so that different parts of the company can maintain their own part of the overall system, which is very important to us. I think we'll probably see Flink as a major player in terms of how those programs run, but we'll probably be shooting things off to other systems like Druid, like Hive, like Presto, like Elasticsearch. >> As derived data? >> As all derived data, from these Flink jobs. And then also pushing data directly out into some of our production systems to feed into machine learning decisions.
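A minimal sketch of that hub-and-spoke shape: one consumer reads the central event stream and fans derived records out to per-sink topics, which Druid, Hive, or Elasticsearch ingestion can each consume at its own pace. The topic names and event fields are made up, and the kafka-python client stands in for whatever Lyft actually runs.

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "events",  # the central event hub topic (hypothetical name)
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for msg in consumer:
    event = msg.value
    # Derived, per-sink views of the same event.
    producer.send("derived.druid", {"ts": event["ts"], "metric": event["type"]})
    if event.get("type") == "ride_completed":
        producer.send("derived.elasticsearch", event)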
>> Okay, this sounds like the most ambitious infrastructure that we've heard, in that it sounds pretty ubiquitous. >> We want to be a machine-learning-first company. So, it's everywhere. >> So now help clarify for me, when? Because, you know, mainstream companies have programmed with a DBMS as a shared state manager for decades. Help explain to them when you would still use a DBMS for shared state, and when you would start using the distributed state that's embedded in Flink, and the derived data, you know, at the endpoints, at the sinks. >> So I mean, I guess this kind of gets into your exact use cases and, you know, your opinions and thoughts about how to use these things best, but. >> George: Your opinion is what we're interested in. >> Right. From where I'm coming from, I see databases as basically one potential sink for this data. They do things very well, right? They do structured queries very well. You can have indices built off that, aggregates, really feed into a lot of visualization stuff. >> George: Yeah. >> But from where I'm sitting, we're really moving away from databases as something that feeds production data. We've got other stores to do that, that are sort of more tailored towards those scenarios. >> When you say to feed production data, this is transaction capture, or data capture. >> Right. So we don't have a lot of atomic transactions outside of payments at Lyft; most of the stuff is eventually consistent. So we have stores, more like Dynamo or Cassandra or HBase, that feed a lot of our production data. >> And those databases, are they for like ambient information, like influencing an interaction? It doesn't sound like automating a transaction. It sounds like context that helps with analytics, but very separate from the OLTP apps. >> That's right. So you can kind of bifurcate the company into the data that's used in production to make decisions that are facing the user, and then our analytics back end, that really helps business analysts and the executives make decisions about how we proceed. >> And so that second part, that backend, is more like operational efficiency. >> Man: Right. >> And coding new business processes to support new ways of doing business, but the customer-facing stuff, specifically like with payments, that still needs a traditional OLTP. >> Man: Right. >> But those use cases aren't growing that much. >> That's right. So, basically we have very specific use-cases for a traditional database, but in terms of capturing the types of scale and the type of growth we're looking for at Lyft, we think some of the other storage engines suit those better. >> So in that use-case, would the OLTP DBMS be at the front end, would it be a source, or a sink? It sounds like it's a source. >> So we actually do it both ways. Right, so, it's great to get our transactional data flowing through our streaming system, there's a lot of value in that, but also then pushing it out, back to some of the aggregate results to a DBMS, helps with our analytics pipeline. >> Okay, okay. Well this is actually really interesting. So, where do you see the dA Platform helping, you know, going forward? Is it something you don't really need, because you've built all that scaffolding to help with sort of application life-cycle management, or do you see it as something that'll help push Flink enterprise-wide? >> I think the dA Platform really helps people sort of adopt Flink at an enterprise level. Maintaining the applications is a core part of what it means to run it as a business. And so we're looking at dA Platform as a way of managing our applications, and, like, I'm mostly talking about one application we have for Flink at Lyft. >> Yeah. >> We have many other Flink programs actually running that are sort of unrelated to my project. >> What about managing non-Flink applications? Do you need an application manager? Is it okay that it's associated with one service or platform like Flink, or is there a desire, you know, among bleeding-edge customers, to have an overall sort of infrastructure management, application management kind of suite? >> Yes, for sure. You're touching on something I have started to push inside of Lyft, which is the need for an overall application life-cycle management product that's not technology-specific. >> Would these sort of plug into the dA Platform, and whatever the Confluent, you know, equivalent is, or would they directly tie to the, you know, operational capabilities, or the functional capabilities, not the management capabilities? In other words, would they plug into like core Flink, core Kafka, core Spark, that sort of stuff? >> I think that's sort of largely to be determined. If you go back to how distributed systems design typically works: we have a user plane, which is going to be our data users. Then you end up with the thing we're probably most familiar with, which is our data plane, technologies like Flink and Kafka and Hive, all those guys. What's missing in the middle right now is a control plane. It's a map from the user's intention to what we do with all of that data plane stuff. So launch a new program: maybe you need a new Kafka topic, maybe you need to provision in Kafka, you need to get some Flink programs running, and whether that directly talks to Flink and goes against Kubernetes, or something like that, or whether it talks to a higher level, like a more application-specific platform. >> Man: Yeah. >> I think, you know, it's certainly a lot easier if we have some of these platforms in the way.
>> Because they give you better abstractions. >> That's right. >> To talk to the platforms. >> That's right. >> That's interesting. Okay, geesh, we learn something really, really interesting with each interview. I'm curious, though: if you look out a couple years, how much of your application landscape will be continuous processing, and is that something you can see mainstream enterprises adopting, or has decades of work with, you know, batch and interactive made it too difficult for people to learn something so radically new? >> I think it's all going to be driven by the business needs, and whether the value is there for people to make that transition, 'cause it is quite expensive to invest in new infrastructure. For companies like Lyft, where we're trying to make decisions very quickly, you know, getting down to two seconds makes a difference for the customer, so we're trying to be as real-time as possible. I used to work at Salesforce. Salespeople are a little less sensitive to these things, and, you know, it's a very, very traditional world. >> That's interesting. (background applauding) >> But even Salesforce is moving towards that style. >> Even Salesforce is moving? >> Is moving toward stream processing. >> Really? >> So like, I think we're going to see it slowly be adopted across the big enterprises. >> George: I imagine that's probably for their analytics. >> That's where they're starting, of course, yeah. >> Okay. So, this was a little more affirmation on how we're going to see the control plane evolve, and the interesting use-cases that you're up to. I hope we can see you back next year, and you can tell us how far you've proceeded. >> I certainly hope so, yeah. >> This was really interesting. So, Greg Fee from Lyft. We will hopefully see you again. And this is George Gilbert. We're at the Data Artisans Flink Forward conference in San Francisco. We'll be back after this break. (techno music)

Published Date : Apr 12 2018


Steve Wilkes, Striim | Big Data SV 2018


 

>> Narrator: Live from San Jose, it's theCUBE, presenting Big Data Silicon Valley. Brought to you by SiliconANGLE Media and its ecosystem partners. (upbeat music) >> Welcome back to San Jose everybody, this is theCUBE, the leader in live tech coverage, and you're watching Big Data SV. My name is Dave Vellante. In the early days of Hadoop everything was batch oriented. About four or five years ago the market really started to focus on real time and streaming analytics, to try to really help companies affect outcomes while things were still in motion. Steve Wilkes is here; he's the co-founder and CTO of a company called Striim, a firm that's been in this business for around six years. Steve, welcome to theCUBE, good to see you. Thanks for coming on. >> Thanks Dave, it's a pleasure to be here. >> So tell us more about that. You started about six years ago, a little bit before the market really started talking about real time and streaming. So what led you to the conclusion that you should co-found Striim way ahead of its time? >> It's partly our heritage. So the four of us that founded Striim, we were executives at GoldenGate Software. In fact our CEO, Ali Kutay, was the CEO of GoldenGate Software. So when we were acquired by Oracle in 2009, after having to work for Oracle for a couple of years, we were trying to work out what to do next. And GoldenGate was replication software, right? It's moving data from one place to another. But customers would ask us in customer advisory boards: that data seems valuable, it's moving. Can you look at it while it's moving, analyze it while it's moving, and get value out of that moving data? And so that was set in our heads. And then when we were thinking about what to do next, that was kind of the genesis of the idea. So the concept around Striim when we first started the company was: we can't just give people streaming data, we need to give them the ability to process that data, analyze it, visualize it, play with it, and really truly understand the data, as well as being able to collect it and move it somewhere else. And so the goal from day one was always to build a full end-to-end platform that did everything customers needed to do for streaming integration and analytics out of the box. And that's what we've done after six years. >> I've got to ask a really basic question. So you're talking about your experience at GoldenGate, moving data from point A to point B, and somebody said, well, why don't we put that to work. But was it change data or static data? Why couldn't I just analyze it in place? >> GoldenGate works on change data. >> Okay, so that's why: there were changes going through. Why wait until it hits its target? Let's do some work in real time and learn from that, get greater productivity. And now you guys have taken that to a new level. That new level being what? Modern tools, modern technologies? >> A platform built from the ground up to be inherently distributed, scalable, and reliable, with exactly-once processing guarantees, and to be a complete end-to-end platform. There's a recognition that the first part of being able to do streaming data integration or analytics is that you need to be able to collect the data, right? And while change data capture from databases is the way to get data out of databases in a streaming fashion, you also have to deal with files and devices and message queues and anywhere else the data can reside. So you need a large number of different data collectors that all turn the enterprise data sources into streaming data.
And similarly if you want to store data somewhere you need a large collection of target adapters that deliver to things. Not just on premise but also in the cloud. So things like Amazon S3 or the cloud databases like Redshift and Google BigQuery. So the idea was really that we wanted to give customers everything they need, and that everything they need isn't trivial. It's not just, well we take Apache Kafka and then we stuff things into it and then we take things out. Pretty often, for example, you need to be able to enrich data, and that means you need to be able to join streaming data with additional context information, reference data. And that reference data may come from a database or from files or somewhere else. So you can't call out to the database and maintain the speeds of streaming data. We have customers that are doing hundreds of thousands of events per second. So you can't call out to a database for every event and ask for records to enrich it with. And you can't even do that with an external cache because it's just not fast enough. So we built in an in-memory data grid as part of our platform. So you can join streaming data with the context information in real time without slowing anything down. So when you're thinking about doing streaming integration, it's more than just moving data around. It's the ability to process it and get it in the right form, to be able to analyze it, to be able to do things like complex event processing on that data. And also to be able to visualize it and play with it is an essential part of the whole platform. >> So I wanted to ask you about end-to-end. I've seen a lot of products from larger, maybe legacy companies that will say it's end-to-end, but what it really is, is cobbled-together pieces that they bought in and then, this is our end-to-end platform, but it's not unified. Or I've seen others "Well we've got an end-to-end platform" oh really, can I see the visualization? "Well we don't have visualization, we use this third party for visualization". So convince me that you're end-to-end. >> So with our platform, when you start with it you go into a UI, you can start building data flows. Those data flows start from connectors, we have all the connectors that you need to get your enterprise data. We have wizards to help you build those. And so now you have a data stream. Now you want to start processing that, we have SQL-based processing so you can do everything from filtering, transformation, aggregation, enrichment of data. If you want to load reference data into memory you use a cache component to drag that in, configure that. You now have data in-memory you can join with your streams. If you want to now take the results of all that processing and write it somewhere, use one of our target connectors, drag that in, so you've got a data flow that's getting bigger and bigger, doing more and more processing. So now you're writing some of that data out to Kafka, oh I'm going to also add in another target adapter, write some of it into Azure Blob Storage, and some of it's going to Amazon Redshift. So now you have a much bigger data flow. But now you say okay well I also want to do some analytics on that. So you take the data stream, you build another data flow that is doing some aggregation over windows, maybe some complex event processing, and then you use the dashboard builder to build a dashboard to visualize all of that. And that's all in one product. So it literally is everything you need to get value immediately.
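The enrichment pattern Wilkes describes, joining each streaming event against reference data held in memory instead of calling out to a database per event, can be sketched in a few lines of Python. Again, this is only an illustration of the idea, not Striim's API (Striim flows are built in its UI and SQL-based language), and all field names are invented.

```python
# Sketch of stream enrichment: join events with cached reference data,
# avoiding a per-event database round trip.

reference_cache = {}  # e.g., customer_id -> customer record, preloaded

def load_reference_data(rows):
    """Preload reference data (from a DB dump, file, etc.) into memory."""
    for row in rows:
        reference_cache[row["customer_id"]] = row

def enrich(event):
    """Join one streaming event with cached context; no per-event DB call."""
    context = reference_cache.get(event["customer_id"])
    if context is None:
        return None  # unknown key; route to a dead-letter stream in practice
    return {**event, "customer_name": context["name"], "tier": context["tier"]}

load_reference_data([{"customer_id": 1, "name": "Acme", "tier": "gold"}])
print(enrich({"customer_id": 1, "amount": 250.0}))
```

The design point is the one made above: at hundreds of thousands of events per second, even a fast external cache adds too much latency, so the lookup table lives in the same process (or an in-memory data grid) as the stream processor.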
And you're right, the big vendors, they have multiple different products, and they're very happy to sell you consulting to put them all together. Even if you're trying to build this from open source and, you know, organizations try and do that, you need five or six major pieces of open source, a lot of supporting libraries, and a huge team of developers to just build a platform that you can start to build applications on. And most organizations aren't software platform companies, they're finance companies, oil and gas companies, healthcare companies. And they really want to focus on solving business problems and not on reinventing the wheel by building a software platform. So we can just go in there and say look: value immediately. And that really, really helps. >> So what are some of your favorite use cases, examples, maybe customer examples that you can share with me? >> So one of the great examples, one of my customers, they have a lot of data in an HP NonStop system. And they needed to be able to get visibility into that immediately. And this was like order processing, supply chain, ERP data. And it would've taken a very large amount of time to do analytics directly on the HP NonStop. And finding resources to do that is hard as well. So they needed to get the data out and they needed to get it into the appropriate place. And they recognized that you use the right technology to ask the right question. So they wanted some of it in Hadoop so they could do some machine learning on that. They wanted some of it to go into Kafka so they could get real-time analytics. And they wanted some of it to go into HBase so they could query it immediately and use that for reference purposes. So they utilized us to do change data capture against the HP NonStop, deliver that data stream out immediately into Kafka, and also push some of it into HDFS and some of it into HBase. So they immediately got value out of that, because then they could also build some real-time analytics on it. It would send out alerts if things were taking too long in their order processing system. And it allowed them to get visibility directly into their process that they couldn't get before, with far fewer resources and more modern technologies than they could have used before. So that's one example. >> Can I ask you a question about that? So you talked about Kafka, HBase, you talk about a lot of different open source projects. You've integrated those or you've got entries and exits into those? >> So we ship with Kafka as part of our product. It's an optional messaging bus. So, our platform has two different ways of moving data around. We have a high-speed, in-memory-only message bus, and that works at almost network speed, and it's great for a lot of different use cases. And that is what backs our data streams. So when you build a data flow, you have streams in between each step, and that is backed by an in-memory bus. Pretty often though, in use cases, you need to be able to potentially rewind data for recovery purposes, or have different applications running at different speeds, and that's where a persistent message bus like Kafka comes in, but you don't want to use a persistent message bus for everything because it's doing IO and it's slowing things down. So you typically use that at the beginning, at the sources, especially things like IoT where you can't rewind into them. Things like databases and files, you can rewind into them and replay and recover, but IoT sources, you can't do that.
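The rewind-and-replay property described here is what a persistent bus buys you. As a rough sketch, assuming the kafka-python client, and with the broker address, topic name, and checkpointed offset all being placeholders, recovery against a Kafka-backed stream can look like this:

```python
# Sketch: after a failure, seek a Kafka-backed stream back to the last offset
# the sink durably recorded, then replay. An in-memory-only bus cannot do this.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         enable_auto_commit=False)
partition = TopicPartition("orders", 0)
consumer.assign([partition])

last_checkpointed = 12345  # in practice, read from the sink's checkpoint store
consumer.seek(partition, last_checkpointed)

def process(payload: bytes) -> None:
    print("replaying:", payload)  # idempotent processing makes replay safe

for message in consumer:
    process(message.value)
```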
So you would push that into a Kafka-backed stream and then subsequent processing is in-memory. So we have that as part of our product. We also have Elastic as part of our product for results storage. You can switch to other results storage but that's our default. And we have a few other key components that are part of our product, but then on the periphery, we have adapters that integrate with a lot of the other things that you mentioned. So we have adapters to read and write HDFS, Hive, HBase, across Cloudera, Hortonworks, even MapR. So we have the MapR versions of the file system and MapR Streams and MapR-DB, and then there's lots of other more proprietary connectors like CDC from Oracle, and SQL Server, and MySQL and MariaDB. And then database connectors for delivery to virtually any JDBC-compliant database. >> I took you down a tangent before you had a chance. You were going to give us another example. We're pretty much out of time but if you can briefly share either that or the last word, I'll give it to you. >> I think the last word would be that that is one example. We have lots and lots of other types of use cases that we do, including things like: migrating data from on-premise to the cloud, being able to distribute log data, being able to analyze that log data, being able to do in-memory analytics and get real-time insights immediately and send alerts. It's a very comprehensive platform, but each one of those use cases is very easy to develop on its own and you can do them very quickly. And of course as the use case expands within a customer, they build more and more, and so they end up using the same platform for lots of different use cases within the same account. >> And how large is the company? How many people? >> We are around 70 people right now. >> 70 people and you're looking for funding? What rounds are you in? Where are you at with funding and revenue and all that stuff? >> Well I'd have to defer to my CEO for those questions. >> All right, so you've been around for what, six years you said? >> Yeah, we have a number of rounds of funding. We had initial seed funding, then we had the investment by Summit Partners that carried us through for a while. Then subsequent investment from Intel Capital, Dell EMC, Atlantic Bridge. And that's where we are right now. >> Good, excellent. Steve, thanks so much for coming on theCUBE, really appreciate your time. >> Great, it's awesome. Thank you Dave. >> Great to meet you. All right, keep it right there everybody, we'll be back with our next guest. This is theCUBE. We're live from BigData SV in San Jose. We'll be right back. (techno music)

Published Date : Mar 9 2018


Mark Grover & Jennifer Wu | Spark Summit 2017


 

>> Announcer: Live from San Francisco, it's the Cube covering Spark Summit 2017, brought to you by Databricks. >> Hi, we're back here where the Cube is live, and I didn't even know it. Welcome, we're at Spark Summit 2017. Having so much fun talking to our guests, I didn't know the camera was on. We are doing a talk with Cloudera, a couple of experts that we have here. First is Mark Grover, who's a software engineer and an author. He wrote the book, "Hadoop Application Architectures." Mark, welcome to the show. >> Mark: Thank you very much. Glad to be here. >> And just to his left we also have Jennifer Wu, and Jennifer's director of product management at Cloudera. Did I get that right? >> That's right. I'm happy to be here, too. >> Alright, great to have you. Why don't we get started talking a little bit more about what Cloudera is maybe introducing new at the show? I saw a booth over here. Mark, do you want to get started? >> Mark: Yeah, there are two exciting things that we've launched at least recently. There's Cloudera Altus, which is for transient workloads and being able to do ETL-like workloads, and Jennifer will be happy to talk more about that. And then there's Cloudera Data Science Workbench, which is this tool that allows folks to use data science at scale. So, get away from doing data science in silos on your personal laptops, and do it in a secure environment on cloud. >> Alright, well, let's jump into Data Science Workbench first. Tell me a little bit more about that, and you mentioned it's for exploratory data science. So give us a little more detail on what it does. >> Yeah, absolutely. So, there was a private beta for Cloudera Data Science Workbench earlier in the year and then it went GA a few months ago. And it's like you said, an exploratory data science tool that brings data science to the masses within an enterprise. Previously, it was this dichotomy, right? As a data scientist, I want to have the latest and greatest tools. I want to use the latest version of Python, the latest notebook kernel, and I want to be able to use R and Python to be able to crunch this data and run my models in machine learning. However, on the other side of this dichotomy is the IT organization, where they want to make sure that all tools are compliant, that your clusters are secure, and that your data is not going into places that are not secured by state-of-the-art security solutions, like Kerberos for example, right? And of course if the data scientists are putting the data on their laptops and taking the laptop around to wherever they go, that's not really a solution. So, that was one problem. And the other one was, if you were to bring them all together in the same solution, data scientists have different requirements. One may want to use Python 2.6. Another one may want to use 3.2, right? And so Cloudera Data Science Workbench is a new product that allows data scientists to visualize and do machine learning through this very nice notebook-like interface, share their work with the rest of their colleagues in the organization, but also allows you to keep your clusters secure. So it allows you to run against a Kerberized cluster, allows single sign-on to your web interface to Data Science Workbench, and provides a really nice developer experience in the sense that my workflow and my tools and my version of Python do not conflict with Jennifer's version of Python.
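As Grover explains next, that isolation comes from running each user's session in its own container. Purely as a sketch of the general pattern, not Cloudera's actual implementation, and with invented image and namespace names, per-user sessions on Kubernetes can be launched like this with the official Python client:

```python
# Sketch of per-user session isolation: each data scientist's environment is
# its own container image, so package sets and Python versions never collide.
from kubernetes import client, config

def launch_session(user: str, image: str) -> None:
    config.load_kube_config()
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"workbench-{user}"),
        spec=client.V1PodSpec(containers=[
            client.V1Container(name="session", image=image)
        ]),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="workbench", body=pod)

# One user on an older Python, another on a newer one; neither interferes
# with the other because each session gets its own image.
launch_session("alice", "example.com/ds-workbench:py2.6")
launch_session("bob", "example.com/ds-workbench:py3.2")
```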
We all have our own Docker and Kubernetes-based infrastructure that makes sure that we use the packages that we need, and they don't interfere with each other. We're going to go to Jennifer on Altus in just a few minutes, but George, first I'll give you a chance to maybe dig in on Data Science Workbench. >> Two questions on the data science side: some of the really toughest nuts to crack have been sort of a common environment for the collaborators, but also the ability to operationalize the models once you've sort of agreed on them, and manage the lifecycle across teams, you know? Like, challenger champion, promote something, or even before that doing the A/B testing, and then sort of what's in production is typically in a different language from what, you know, it was designed in, and sort of integrating it with the apps. Where is that on the roadmap? 'Cause no one really has a good answer for that. >> Yeah, that's an excellent question. In general I think it's the problem to crack these days. How do you productionalize something that was written by a data scientist in a notebook-like system onto the production cluster, right? And I think the part where the data scientist works in a different language than the language that's in production, I think that problem, the best I can say right now is to actually have someone rewrite that. Have someone rewrite that in the language you're going to put in production, right? I don't see that to be the more common part. I think the more widespread problem is, even when the language is the same in production, how do you go about making the part that the data scientist wrote, the model or whatever that would be, work on a production cluster? And so, Data Science Workbench in particular runs on the same cluster that is being managed by Cloudera Manager, right? So this is a tool that you install, but that is available to you as a web server, as a web interface, and so that allows you to move your development machine learning algorithms from your Data Science Workbench to production much more easily, because it's all running on the same hardware and same systems. There's no separate Cloudera Manager that you have to use to manage the Workbench compared to your actual cluster. >> Okay. A tangential question, but one of the difficulties of doing machine learning is finding all the training data and sort of the data science expertise to sit with the domain expert to, you know, figure out a proper model of features, things like that. One of the things we've seen so far from the cloud vendors is they take their huge datasets in terms of voice, you know, images. They do the natural language understanding, speech-to-text, you know, facial recognition. 'Cause they have such huge datasets they can train on. We're hearing noises that they're going to take that down to the more mundane statistical kind of machine learning algorithms, so that you wouldn't be, like, here's an algorithm to do churn, you know, go to town, but that they might have something that's already kind of pre-populated that you would just customize. Is that something that you guys would tackle, too? >> I can't speak for the roadmap in that sense, but I think some of that problem needs to be tackled by projects like Spark, for example. So I think as the stack matures, it's going to raise the level of abstraction as time goes on. And I think whatever benefits the Spark ecosystem has will come directly to distributions like Cloudera. >> George: That's interesting.
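One common shape of the notebook-to-production hand-off Grover describes, offered here as a generic minimal sketch rather than Cloudera's prescribed workflow, and assuming scikit-learn with invented feature names, is to serialize the trained model and reload it in a production scoring process on the same cluster:

```python
# Sketch of the hand-off: train and export in the notebook, load and score in
# production. Feature names and data are invented for illustration.
import pickle
from sklearn.linear_model import LogisticRegression

# --- notebook side: train and export ---
X = [[0.1, 200.0], [0.9, 15.0], [0.2, 180.0], [0.8, 30.0]]
y = [0, 1, 0, 1]  # e.g., churned or not
model = LogisticRegression().fit(X, y)
with open("churn_model.pkl", "wb") as f:
    pickle.dump(model, f)

# --- production side: load and score ---
with open("churn_model.pkl", "rb") as f:
    scorer = pickle.load(f)
print(scorer.predict_proba([[0.85, 25.0]])[0][1])  # churn probability
```

This is exactly the case where running development and production on the same hardware and the same managed stack, as described above, removes most of the friction: the serialized model makes the trip, not a rewrite.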
>> Yeah >> Okay >> Alright, well let's go to Jennifer now and talk about Altus a little bit. Now you've been on the Cube show before, right? >> I have not. >> Okay, well, I'm familiar with your work. Tell us again, you're the product manager for Altus. What does it do, and what was the motivation to build it? >> Yeah, we're really excited about Cloudera Altus. So, we released Cloudera Altus in its first GA form in April, and we launched Cloudera Altus in a public environment at Strata London about two weeks ago, so we're really excited about this and we are very excited to now open this up to all of the customer base. And what it is is a platform-as-a-service offering designed to leverage, basically, the agility and the scale of cloud, and make a very easy-to-use type of experience to expose Cloudera capacity, in particular for data engineering types of workloads. So the end user will be able to very easily, in a very agile manner, get data engineering capacity on Cloudera in the cloud, and they'll be able to do things like ETL and large-scale data processing, productionized machine learning workflows in the cloud with this new data-engineering-as-a-service experience. And we wanted to abstract away the cloud and cluster operations, and make the end user experience very easy. So, jobs and workloads are first-class objects. You can do things like submit jobs, clone jobs, terminate jobs, troubleshoot jobs. We wanted to make this very, very easy for the data engineering end user. >> It does sound like you've sort of abstracted away a lot of the infrastructure that you would associate with on-prem, and sort of almost made it, like, programmable and invisible. But, um, I guess one of my questions is, when you put it in a cloud environment, when you're on-prem you have a certain set of competitors, which is kind of restrictive, because you are the standalone platform. But when you go on the cloud, someone might say, "I want to use Redshift on Amazon," or Snowflake, you know, as the MPP SQL database at the end of a pipeline. And it's not just, I'm using those as examples. There's, you know, dozens, hundreds, thousands of other services to choose from. >> Yes. >> What happens to the integrity of that platform if someone carves off one piece? >> Right. So, interoperability and a unified data pipeline are very important to us, so we want to make sure that we can still service the entire data pipeline all the way from ingest and data processing to analytics. So our team has 24 different open source components that we deliver in the CDH distribution, and we have committers across the entire stack. We know the application, and we want to make sure that everything's interoperable, no matter how you deploy the cluster. So if you deploy data engineering clusters through Cloudera Altus, but you deployed Impala clusters for data marts in the cloud through Cloudera Director or through any other format, we want all these clusters to be interoperable, and we've taken great pains in order to make everything work together well. >> George: Okay. So how do Altus and Data Science Workbench interoperate with Spark? Maybe start with
>> Sure, so, we, in terms of interoperability we focus on things like making sure there are no data silos so that the data that you use for your entire data lake can be consumed by the different components in our system, the different compute engines and different tools, and so if you're processing data you can also look at this data and visualize this data through Data Science Workbench. So after you do data ingestion and data processing, you can use any of the other analytic tools and then, and this includes Data Science Workbench. >> Right, and for Data Science Workbench runs, for example, with the latest version of Spark you could pick, the currently latest released version of Spark, Spark 2.1, Spark 2.2 is being boarded of course, and that will soon be integrated after its release. For example you could use Data Science Workbench with your flavor of Spark two's version and you can run PySpark or Scala jobs on this notebook-like interface, be able to share your work, and because you're using Spark Underneath the hood it uses yarn for resource management, the Data Science Workbench itself uses Docker for configuration management, and Kubernetes for resource managing these Docker containers. >> What would be, if you had to describe sort of the edge conditions and the sweet spot of the application, I mean you talked about data engineering. One thing, we were talking to Matei Zaharia and Ronald Chin about was, and Ali Ghodsi as well was if you put Spark on a database, or at least a, you know, sophisticated storage manager, like Kudu, all of a sudden there're a whole new class of jobs or applications that open up. Have you guys thought about what that might look like in the future, and what new applications you would tackle? >> I think a lot of that benefit, for example, could be coming from the underlying storage engine. So let's take Spark on Kudu, for example. The inherent characteristics of Kudu today allow you to do updates without having to either deal with the complexity of something like Hbase, or the crappy performance of dealing HDFS compactions, right? So the sweet spot comes from Kudu's capabilities. Of course it doesn't support transactions or anything like that today, but imagine putting something like Spark and being able to use the machine learning libraries and, we have been limited so far in the machine learning algorithms that we have implemented in Spark by the storage system sometimes, and, for example new machine learning algorithms or the existing ones could rewritten to make use of the update features for example, in Kudu. >> And so, it sounds like it makes it, the machine learning pipeline might get richer, but I'm not hearing that, and maybe this isn't sort of in the near term sort of roadmap, the idea that you would build sort of operational apps that have these sophisticated analytics built in, you know, where the analytics, um, you've done the training but at run time, you know, the inferencing influences a transaction, influences a decision. Is that something that you would foresee? >> I think that's totally possible. Again, at the core of it is the part that now you have one storage system that can do scans really well, and it can also do random reads and writes any place, right? 
And so that allows applications which were previously siloed, because one application ran off of HDFS and another application ran out of HBase and you had to correlate them, to become one single application that can be used to train, and then also use the trained data to make decisions on the new transactions that come in. >> So that's very much within the sort of scope of imagination, or scope. That's part of sort of the ultimate plan? >> Mark: I think it's definitely conceivable now, yeah. >> Okay. >> We're up against a hard break coming up in just a minute, so you each get a 30-second answer here, and it's the same question. You've been here for a day and a half now. What's the most surprising thing you've learned that you think should be shared more broadly with the Spark community? Let's start with you. >> I think one of the great things that's happening in Spark today is, people have been complaining about latency for a long time. So if you saw the keynote yesterday, you would see that Spark is making forays into reducing that latency. And if you are interested in Spark, using Spark, it's very exciting news. You should keep tabs on it. We hope to deliver lower latency as a community sooner. >> How long is one millisecond? (Mark laughs) >> Yeah, I'm largely focused on cloud infrastructure, and I found here at the conference that many, many people are very much prepared to actually start taking on more, you know, more POCs and more interest in cloud, and the response in terms of all of this and Altus has been very encouraging. >> Great. Well, Jennifer, Mark, thank you so much for spending some time here on the Cube with us today. We're going to come by your booth and chat a little bit more later. It's some interesting stuff. And thank you all for watching the Cube today here at Spark Summit 2017, and thanks to Cloudera for bringing us these two experts. And thank you for watching. We'll see you again in just a few minutes with our next interview.

Published Date : Jun 7 2017


Matthew Hunt | Spark Summit 2017


 

>> Announcer: Live from San Francisco, it's theCUBE covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE, we're talking about data science and engineering at scale, and we're having a great time, aren't we, George? >> We are! >> Well, we have another guest now we're going to talk to, I'm very pleased to introduce Matt Hunt, who's a technologist at Bloomberg, Matt, thanks for joining us! >> My pleasure. >> Alright, we're going to talk about a lot of exciting stuff here today, but I want to first start with, you're a long-time member of the Spark community, right? How many Spark Summits have you been to? >> Almost all of them, actually, it's quite amazing to see the 10th one, yes. >> And you're pretty actively involved with the user group on the east coast? >> Matt: Yeah, I run the New York users group. >> Alright, well, what's that all about? >> We have some 2,000 people in New York who are interested in finding out what goes on, which technologies to use, and what people are working on. >> Alright, so hopefully, you saw the keynote this morning with Matei? >> Yes. >> Alright, any comments or reactions from the things that he talked about as priorities? >> Well, I've always loved the keynotes at the Spark Summits, because they announce something that you don't already know is coming, at least for most people. The second Spark Summit actually had people gasping in the audience while they were demoing, a lot of senior people-- >> Well, the one millisecond today was kind of a wow one-- >> Exactly, and I would say that the one thing to pick out of the keynote that really stood out for me was the changes and improvements they've made for streaming, including potentially being able to do sub-millisecond times for some workloads. >> Well, maybe talk to us about some of the apps that you're building at Bloomberg, and then I want you to join in, George, and drill down on some of the details. >> Sure. Bloomberg is a large company with 4,000-plus developers, we've been working on apps for 30 years, so we actually have a wide range of applications, almost all of which are for news in the financial industry. We have a lot of homegrown technology that we've had to adapt over time, starting from when we built our own hardware, but there are some significant things that some of these technologies can potentially really help simplify over time. Some recent ones, I guess: trade anomaly detection would be one. How can you look for patterns of insider trading? How can you look for bad trades or attempts to spoof? There's a huge volume of trade data that comes in, that's a natural application. Another one would be regulatory: there's a regulatory regime called MiFID, or MiFID II, the regulations required for Europe, where you have to be able to record every trade for seven years, provide daily reports, so there's clearly a lot around that. And then I would also just say, our other internal databases have significant analytics that can be done, which is just kind of scratching the surface. >> These applications sound like they're oriented towards streaming solutions, and really low latency. Has that been a constraint on what you can build so far?
>> I would definitely say that we have some things that are latency constrained, it tends to be not like high frequency trading, where you care about microseconds, but milliseconds are important, how long does it take to get an answer, but I would say equally important with latency is efficiency, and those two often wind up being coupled together, though not always. >> And so when you say coupled, is it because it's a trade-off, or 'cause you need both? >> Right, so it's a little bit of both, for a number of things, there's an upper threshold for the latency that we can accept. Certain architectural changes imply higher latencies, but often, greater efficiencies. Micro-batching often means that you can simplify and get greater throughput, but at a cost of higher latency. On the other hand, if you have a really large volume of things coming in, and your method of processing them isn't efficient enough, it gets too slow simply from that, and that's why it's not just one or the other. >> So in getting down to one millisecond or below, can they expose knobs where you can choose the trade-offs between efficiency and latency, and is that relevant for the apps that you're building? >> I mean, clearly if you can choose between micro-batching and not micro-batching, that's a knob that you can have, so that's one explicit one, but part of what's useful is, often when you sit down to try and determine what is the main cause of latency, you have to look at the full profile of a stack of what it's going through, and then you discover other inefficiencies that can be ironed out, and so it just makes it faster overall. I would say, a lot of what the Databricks guys in the Spark community have worked on over the years is connected to that, Project Tungsten and so on, well, all these things that make things much slower, much less efficient than they need to be, and we can close that gap a lot, I would say that from the very beginning. >> This brings up something that we were talking about earlier, which is, Matei has talked for a long time about wanting to take N 10 control of continuous apps, for simplicity and performance, and so there's this, we'll write with transactional consistency, so we're assuring the customer of exactly one's semantics when we write to a file system or database or something like that. But, Spark has never really done native storage, whereas Matei came here on the show earlier today and said, "Well, Databricks as a company "is going to have to do something in that area," and he talks specifically about databases, and he said, he implied that Apache Spark, separate from Databricks, would also have to do more in state management, I don't know if he was saying key value store, but how would that open up a broader class of apps, how would it make your life simpler as a developer? >> Right. Interesting and great question, this is kind of a subject that's near and dear to my own heart, I would say. So part of that, when you take a step back, is about some of the potential promise of what Spark could be, or what they've always wanted to be, which is a form of a universal computation engine. So there's a lot of value, if you can learn one small skillset, but it can work in a wide variety of use cases, whether it's streaming or at rest or analytics, and plug other things in. 
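Hunt's micro-batching trade-off can be made concrete with a toy Python sketch: batching amortizes per-call overhead (throughput) while a flush interval bounds the added delay (latency). The numbers are arbitrary, and this stands in for what real engines do internally, not for any Bloomberg system.

```python
# Toy micro-batcher: flush on size (throughput) or on age (latency ceiling),
# whichever comes first. Tuning max_size vs. max_delay_ms is the knob.
import time

class MicroBatcher:
    def __init__(self, sink, max_size=500, max_delay_ms=50):
        self.sink, self.max_size = sink, max_size
        self.max_delay = max_delay_ms / 1000.0
        self.buf, self.oldest = [], None

    def add(self, event):
        if not self.buf:
            self.oldest = time.monotonic()
        self.buf.append(event)
        if len(self.buf) >= self.max_size or \
           time.monotonic() - self.oldest >= self.max_delay:
            self.flush()

    def flush(self):
        if self.buf:
            self.sink(self.buf)  # one amortized call instead of N calls
            self.buf = []

b = MicroBatcher(sink=lambda batch: print(f"wrote {len(batch)} events"))
for i in range(1200):
    b.add(i)
b.flush()
```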
As always, there's a gap in any such system between theory and reality, and how much can you close that gap, but as for storage systems, this is something that, you and I have talked about this before, and I've written about it a fair amount too, Spark is historically an analytic system, so you have a bunch of data, and you can do analytics on it, but where's that data come from? Well, either it's streaming in, or you're reading from files, but most people need, essentially, an actual database. So what constitutes the universal system? You need file store, you need a distributive file store, you need a database with generally transactional semantics because the other forms are too hard for people to understand, you need analytics that are extensible, and you need a way to stream data in, and there's how close can you get to that, versus how much do you have to fit other parts that come together, very interesting question. >> So, so far, they've sort of outsourced that to DIY, do-it-yourself, but if they can find a sufficiently scalable relational database, they can do the sort of analytical queries, and they can sort of maintain state with transactions for some amount of the data flowing through. My impression is that, like Cassandra would be the, sort of the database that would handle all updates, and then some amount of those would be filtered through to a multi-model DBMS. When I say multi-model, I mean handles transactions and analytics. Knowing that you would have the option to drop that out, what applications would you undertake that you couldn't use right now, where the theme was, we're going to take big data apps into production, and then the competition that they show for streaming is of Kafka and Flink, so what does that do to that competitive balance? >> Right, so how many pieces do you need, and how well do they fit together is maybe the essence of that question, and people ask that all the time, and one of the limits has been, how mature is each piece, how efficient is it, and do they work together? And if you have to master 5,000 skills and 200 different products, that's a huge impediment to real-world usage. I think we're coalescing around a smaller set of options, so in the, Kafka, for example, has a lot of usage, and it seems to really be, the industry seems to be settling on that is what people are using for inbound streaming data, for ingest, I see that everywhere I go. But what happens when you move from Kafka into Spark, or Spark has to read from a database? This is partly a question of maturity. Relational databases are very hard to get right. The ones that we have have been under development for decades, right? I mean, DB2 has been around for a really long time with very, very smart people working on it, or Oracle, or lots of other databases. So at Bloomberg, we actually developed our own databases for relational databases that were designed for low latency and very high reliability, so we actually just opensourced that a few weeks ago, it's called ComDB2, and the reason we had to do that was the industry solutions at the time, when we started working on that, were inadequate to our needs, but we look at how long that took to develop for these other systems and think, that's really hard for someone else to get right, and so, if you need a database, which everyone does, how can you make that work better with Spark? 
And I think there're a number of very interesting developments that can make that a lot better, short of Spark becoming or integrating a database directly, although there are interesting possibilities with that too. How you make them work well together, we could talk about for a while, 'cause that's a fascinating question. >> On that one topic, maybe the Databricks guys don't want to assume responsibility for the development, because then they're picking a winner, perhaps? Maybe, as Matei told us earlier, they can make the APIs easier to use for a database vendor to integrate, but like we've seen Splice Machine and SnappyData do the work, take it upon themselves to take data frames, the core data structure in Spark, and give them transactional semantics. Does that sound promising? >> There're multiple avenues for potential success, and who can use which, in a way, depends on the audience. If you look at things like Cassandra and HBase, they're distributed key-value stores that additional things are being built on, so they started as distributed, and they're moving towards more encompassing systems; versus relational databases, which generally started as a single image on a single machine, and are moving towards federation and distribution, and there's been a lot of that with Postgres, for example. One of the questions would be, is it just knobs, or why don't they work well together? And there're a number of reasons. One is, what can be pushed down, how much knowledge do you have to have to make that decision, and optimizing that, I think, is actually one of the really interesting things that could be done: just as we have database query optimizers, can you determine the best way to execute down a chain? In order to do that well, there are two things that you need that haven't yet been widely adopted, but are coming. One is the very efficient copy of data between systems, and Apache Arrow, for example, is very, very interesting, and it's nearing the time when I think it's just going to explode, because it lets you connect these systems radically more efficiently in a standardized way, and that's one of the things that was missing: as soon as you hop from one system to another, all of a sudden you have the serialization computational expense. That's a problem, and we can fix that. The other is, the next level of integration requires, basically, exposing more hooks. In order to know where a query should be executed and which operator should be pushed down, you need something that I think of as a meta-optimizer, and also knowledge about the shape of the data, the underlying statistics, and ways to exchange that back and forth to be able to do it well. >> Wow, Matt, a lot of great questions there. We're coming up on a break, so we have to wrap things up, and I wanted to give you at least 30 seconds to maybe sum up what you'd like to see your user community, the Spark community, do over the next year. What are the top issues, things you'd love to see worked on? >> Right. It's an exciting time for Spark, because as time goes by, it gets more and more mature, and more real-world applications are viable.
The hardest thing of all, anywhere, in any organization, is to get people working together, but the more people work together to enable these pieces, how do I efficiently work with databases, or have these better optimizations make streaming more mature, the more people can use it in practice, and that's why people develop software: to actually tackle these real-world problems. So I would love to see more of that. >> Can we all get along? (chuckling) Well, that's going to be the last word of this segue. Matt, thank you so much for coming on and spending some time with us here to share the story! >> My pleasure. >> Alright, thank you so much. Thank you George, and thank you all for watching this segment of theCUBE, please stay with us, as Spark Summit 2017 will be back in a few moments.

Published Date : Jun 6 2017


Show Wrap - Data Platforms 2017 - #DataPlatforms2017


 

>> Announcer: Live from the Wigwam in Phoenix, Arizona. It's theCUBE. Covering Data Platforms 2017. Brought to you by Qubole. >> Hey welcome back everybody. Jeff Frick here with theCUBE along with George Gilbert from Wikibon. We've had a tremendous day here at Data Platforms 2017 at the historic Wigwam Resort, just outside of Phoenix, Arizona. George, you've been to a lot of big data shows. What's your impression? >> I thought we're sort of at the edge of what could be a real bridge to something new, which is, we've built big data systems out of traditional software for deployment on traditional infrastructure. Even if you were going to put it in a virtual machine, it's still not a cloud. You're still dealing with server abstractions. But what's happening with Qubole is, they're saying, once you go to the cloud, whether it's Amazon, Azure, Google or Oracle, you're going to be dealing with services. Services are very different. It greatly simplifies the administrative experience, the developer experience, and more than that, they're focused on turning Qubole the product into Qubole the service, so that they can automate the management of it. And we know that big data has been choking itself on complexity. Both admin and developer complexity. And they're doing something unique, both on sort of the big data platform management, but also data science operations. And their point, their contention, which we still have to do a little more homework on, is that the vendors who started with software on-prem can't really make that change very easily without breaking what they've done on-prem, 'cause they have traditional perpetual-license physical software, as opposed to services, which is what is in the cloud. >> The question is, are people going to wait for them to figure it out? I talked to somebody in the hallway earlier this morning and we were talking about their move to put all their data into, it was S3, in their data lake. And he said it's part of a much bigger transformational process that we're doing inside the company. And so this move, from "is public cloud viable?" to "tell me, give me a reason why it shouldn't go to the cloud," has really kicked in big time. And we hear over and over and over that speed and agility, not just in deploying applications, but in operating as a company, is the key to success. And we hear over and over how short the tenure is on the Fortune 500 now, compared to what it used to be. So if you're not speedy and agile, for which you pretty much have to use cloud, and software-driven automated decision-making >> Yeah. >> that's powered by machine learning to eat >> Those two things. >> a huge percentage of your transaction and decision-making, you're going to get smoked by the person that is. >> Let's, let's sort of peel that back. I was talking to Monte Zweben who is the co-founder of Splice Machine, one of the most advanced databases that sort of came out of nowhere over the last couple of years. And it's now, I think, in closed beta on Amazon. He showed me, like, a couple of screens for spinning it up and configuring it on Amazon. And he said, if I were doing that on-prem, he goes, I'd need a Hadoop cluster with HBase. It would take me like four-plus months. And that's an example of software versus services. >> Jeff: Right.
>> And when you said, when you pointed out that automated decision-making, powered by machine learning, that's the other part, which is these big data systems ultimately are in the service of creating machine learning models that will inform ever better decisions with ever greater speed, and the key then is to plug those models into existing systems of record. >> Jeff: Right. Right. >> Because we're not going to, >> We're not going to rip those out and rebuild them from scratch. >> Right. But as you just heard, you can pull the data out that you need, run it through a new-age application. >> George: Yeah. >> And then feed it back into the old system. >> George: Yes. >> The other thing that came up, it was Oskar, I have to look him up, Oskar Austegard from Gannett was on one of the panels. We always talk about the flexibility to add capacity very easily in a cloud-based solution. But he talked about, with the separation of storage and compute, that they actually have times where they turn off all their compute. It's off. Off. >> And that was... If you had to boil down the fundamental compatibility break between on-prem and in the cloud, the Qubole folks, both the CEO and CMO, said, look, you cannot reconcile what's essentially server SAN, where the storage is attached to the compute node, the server, with cloud, where you have storage separate from compute, allowing you to spin it down completely. He said those are just fundamentally incompatible. >> Yeah, yeah. And also, Andretti, one of the founders, in his talk, he talked about the big three trends, which we just kind of talked about, and he summarized them right: serverless, this continual push towards smaller and smaller units >> George: Yeah. >> of storage and compute. And the increasing speed of networks is one; from virtual servers to just no servers, to just compute. The second one is automation, you've got to move to automation. >> George: Right. If you're not, you're going to get passed by your competitor that is. Or the competitor that you don't even know exists, that's going to come out from over your shoulder. And the third one was the intelligence, right. There is a lot of intelligence that can be applied. And I think the other cusp that we're on is this continuing crazy increase in compute horsepower. Which just keeps going. The speed and the intelligence of these machines is growing on an exponential curve, not a linear curve. It's going to be bananas in the not-too-distant future. >> We're soaking up more and more of that intelligence with machine learning. The training part of machine learning, where the datasets to train a model are immense. Not only are the datasets large, but so is the amount of time to sort of chug through them to come up with just the right mix of variables and values for those variables. Or maybe even multiple models. That's what we're going to see in the cloud. And that's going to chew up more and more cycles. Even as we have >> Jeff: Right. Right. >> specialized processors. >> Jeff: Right. But in the data ops world, in theory yes, but I don't have to wait to get it right. Right? I can get it 70% right. >> George: Yeah. >> Which is better than not right. >> George: Yeah. >> And I can continue to iterate over time. That, I think, was the genius of dev-ops. To stop writing PRDs and MRDs. >> George: Yeah. >> And deliver something. And then listen and adjust. >> George: Yeah. >> And within the data ops world, it's the same thing. Don't try to figure it all out.
Take the data you know, have some hypotheses. Build some models and iterate. That's really tough to compete with. >> George: Yeah. >> Fast, fast, fast iteration. >> We're doing actually a fair amount of research on that. On the Wikibon side. Which is, if you build an enterprise application that is reinforced or informed by models in many different parts, in other words, you're modeling more and more digital entities within the business. >> Jeff: Right. >> Each of those has feedback loops. >> Jeff: Right. Right. >> And when you get the whole thing orchestrated and moving or learning in concert, then you have essentially what Michael Porter many years ago called competitive advantage. Which is when each business process reinforces all the other business processes in service of delivering a value proposition. And those models represent business processes, and when they're learning and orchestrated all together, you have, what Trump called, a finely-tuned machine. >> I won't go there. >> Leaving out that it was bigly, and it was a finely-tuned machine. >> Yeah, yeah. But at the end of the day, if you're using resources and effort to improve a different resource and effort, you're getting a multiplier effect. >> Yes. >> And that's really the key part. Final thought as we go out of here. Are you excited about this? Do you see, they showed the picture of the NASA headquarters with the big giant Snowball truck loading up? Do you see more and more of this big enterprise data going into S3, going into Google Cloud, going into Microsoft Azure? >> You're asking-- >> Is this the solution for the data lake swamp issue that we've been talking about? >> You're asking the 64-dollar question. Which is, companies, we sensed a year ago at the Hortonworks DataWorks Summit, it was in June, down in San Jose last year. That was where we first got the sense that people were sort of throwing in the towel on trying to build large-scale big data platforms on-prem. And what changes now is, are they now evaluating Hortonworks versus Cloudera versus MapR in the cloud, or are they widening their consideration as Qubole suggests? Because now they want to look not only at cloud-native Hadoop, but they actually might want to look at cloud-native services that aren't necessarily related to Hadoop. >> Right. Right. And we know as-a-service wins. It just continues. Platform as a service, software as a service. Time and time again, as-a-service either eats a lot of share from the incumbent or knocks the incumbent out. So, Hadoop as a service, regardless of your distro, via one of these types of companies on Amazon, it seems like it's got to win, right. It's going to win. >> Yeah, but the difference is, so far, the Clouderas and the MapRs and the Hortonworks of the world are more software than service when they're in the cloud. They don't hide all the knobs. You still need a highly trained admin to get them up-- >> But not if you buy it as a service, in theory, right. It's going to be packaged up by somebody else and they'll have your knobs all set. >> They're not designed yet that way. >> HDInsight. >> Then, then, then, they better be careful, 'cause it might be a new as-a-service distro of the Hadoop system. >> My point, which is what this is. >> Okay, very good, we'll leave it at that. So George, thanks for spending the day with me. Good show as always. >> And I'll be in a better mood next time when you don't steal my candy bars. >> All right. He's George Gilbert. I'm Jeff Frick.
You're watching theCUBE. We're at the historic, 99-years-young Wigwam Resort, just outside of Phoenix, Arizona. DataPlatforms 2017. Thanks for watching. It's been a busy season. It'll continue to be a busy season. So keep it tuned to SiliconAngle.TV or YouTube.com/SiliconAngle. Thanks for watching.
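To ground the storage/compute separation George and Jeff keep coming back to, here is a minimal, illustrative sketch, not anything demonstrated at the show: the data and the results live in S3, while the compute cluster exists only for the duration of one job and then terminates, taking compute cost to zero. It assumes the boto3 library; the bucket names, job script, and instance settings are hypothetical placeholders.

```python
# Illustrative sketch, not anything shown at the event: data and
# results live in S3 while the cluster exists only for one job.
# Assumes boto3; bucket names, the job script, and instance settings
# are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="ephemeral-training",
    ReleaseLabel="emr-5.5.0",
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        # No steps left to run means the cluster shuts itself down.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "train-model",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/train.py",
                     "--input", "s3://my-bucket/data/",
                     "--output", "s3://my-bucket/models/"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("cluster:", response["JobFlowId"])
# When the step finishes, the cluster terminates: compute spend stops,
# while the data and the trained model persist in S3.
```

This is the "turn off all their compute" pattern: because nothing durable lives on the cluster nodes, the same model training can also be re-run on a bigger or smaller cluster the next day without any migration.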

Published Date : May 26 2017


Carlo Vaiti | DataWorks Summit Europe 2017


 

>> Announcer: You are CUBE Alumni. Live from Munich, Germany, it's theCUBE. Covering DataWorks Summit Europe 2017. Brought to you by Hortonworks. >> Hello, everyone, welcome back to live coverage at DataWorks 2017, I'm John Furrier with my cohost, Dave Vellante. Two days of coverage here in Munich, Germany, covering Hortonworks and Yahoo, presenting Hadoop Summit, now called DataWorks 2017. Our next guest is Carlo Vaiti, who's the HPE chief technology strategist, EMEA Digital Solutions, Europe, Middle East, and Africa. Welcome to theCUBE. >> Thank you, John. >> So we were just chatting before we came on about your historic background at IBM, Oracle, and now HPE, and now back in the saddle there. >> Don't forget Sun Microsystems. >> Sun Microsystems, sorry, Sun, yeah. I mean, great, great run. >> It was a long run. >> You've seen the computer revolution happen. I worked at HP for nine years, from '88 to '97. Again, Dave was a premier analyst during that run of client-server. We've seen the computer revolution happen. Now we're seeing the digital revolution, where the iPhone is now 10 years old, cloud is booming, data's at the center of the value proposition, so a completely new disruptive capability. So what are you doing as the CTO, chief technologist for HPE? How are you guys bringing this story together? 'Cause there's so much going on at HPE. You got the services split, you got the software split, and HP's focusing on the new style of IT, as Meg Whitman calls it. >> So, yeah. My role in EMEA is actually about having basically a visionary kind of strategy role for what's going to be HP in the future, in terms of IT. And one of the things that we are looking at is, we split our strategy into three different aspects, three transformation areas. The first one, which we usually talk about, is what I call hybrid IT, right, which is basically making services around either on-premise or on cloud for our customer base. The second one is actually powering the Intelligent Edge, so it's actually looking after our collaboration and the Aruba components we acquired. And the third one, which is in the middle, and that's why I'm here at the DataWorks Summit, is actually the data-analytics aspect. And we have a couple of solutions in there. One is the enterprise-grade Hadoop, which is part of this. That is actually how we frame the whole picture and the strategy for HP. >> It's interesting, Dave and I were talking yesterday. Being in Europe, it's obviously a different-sized show, it's smaller than the DataWorks or Hadoop Summit in North America in San Jose, but there's a ton of Internet of Things, IoT or IIoT, 'cause here in Germany, obviously, a lot of industrial nations, but in Europe in general, a lot of smart cities initiatives, a lot of mobility, a ton of Internet of Things opportunity, more than in the US. >> Absolutely. >> Can you comment on how you guys are tackling the IoT? Because it's an Intelligent Edge, certainly, but it's also data, it's in your wheelhouse. >> Yes, sure. So I'm actually working, it's a good question, because I'm actually working on a couple of projects in Eastern Europe, where it's all about Industrial IoT Analytics, IIoTA. That's the new terminology we use. So what we do is actually, we analyze from a business perspective what the business pain points are, in an oil and gas company for example. And we understand, for example, what kinds of things they need and must have.
And what I'm saying here is, one of the aspects, for example, is the drilling opportunity. So how much oil you can extract from a specific rig in the middle of the North Sea, for example. This is one of the key questions, because the customer wants to understand, in the future, how much oil they can extract. The other one is, for example, the upstream business. So on the retail side: say, when my customer is stopping in a gas station and goes in the shop, immediately giving, I dunno, my daughter a kind of campaign for the Barbie, because they like the Barbie. So IoT, Industrial IoT, helps us in actually making a much better customer experience, and that's the case of the upstream business, but it's also helping us toward much faster business outcomes. And that's what the customer wants, right? 'Cause, as I was talking with your colleague before, I'm talking to the business guy. I'm not talking to the IT anymore in these kinds of places, and that's how IoT allows us a chance to change the conversation at the industry level. >> These are first-time conversations too. You're getting at the kinds of business conversations that weren't possible five years ago. >> Carlo: Yes, sure. >> I mean, and 10 years ago, they would have seemed fantasy. Now they're reality. >> The role of analytics, in my opinion, is becoming extremely key, and I said this this morning: for me, my best sentence is that data is the stone foundation of the digital economy. I continue to repeat this terminology, because it's actually where everything is starting from. So what I mean is, let's take a look at the analytics aspect. So if I'm able to analyze the data close to the shop floor, okay, close to the shop manufacturing floor, if I'm able to analyze my data on the rig, in the oil and gas industry, if I'm able to do preprocessing analytics, with Kafka, Druid, these kinds of open-source software, close to the Intelligent Edge, then my customer is going to be happy, because I give them a very fast response, and the decision-maker can get to a decision in a faster time. Today, it takes a long time to make these types of decisions. So that's why we want to move into powering the Intelligent Edge. >> So you're saying data's foundational, but if you get to the Intelligent Edge, it's dynamic. So you have a dynamic, reactive, realtime time series, or presence of data, but you need the foundational pre-data. >> Perfect. >> Is that kind of what you're getting at? >> Yes, that's the first step. Preprocessing analytics is what we do. In the next generation of what we think is going to be Industrial IoT Analytics, we're going to actually put massive amounts of compute close to the shop manufacturing floor. We call it, internally and actually externally, converged plant infrastructure. And that's the key point, right? >> John: Converged plant? >> Converged plant infrastructure, CPI. If you look it up on Google, you will find it. It's a solution we brought to the market a few months ago. We announced it in December last year. >> Yeah, Antonio's smart. He also had converged systems as well. One of the first ones. >> Yeah, so that's converged compute at the edge, basically. >> Correct, converged compute-- >> Very powerful. >> Very powerful, and we run analytics on the edge. That's the key point. >> Which we love, because that means you don't have to send everything back to the cloud, because it's too expensive, it's going to take too long, it's not going to work. >> Carlo: The bandwidth on the network is much less.
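A rough sketch of the edge preprocessing Carlo describes: aggregate raw sensor readings locally and forward only summaries plus anomalies, so the constrained uplink carries a fraction of the raw feed. This is not HPE's implementation; it assumes the kafka-python client, and the topic names, one-minute window, and alarm threshold are all hypothetical.

```python
# Illustrative sketch only, not HPE's implementation: aggregate raw
# sensor readings at the edge and forward summaries plus anomalies,
# so the uplink carries far less than the raw feed. Assumes the
# kafka-python client; topic names, the 60-second window, and the
# 900.0 alarm threshold are hypothetical.
import json
import time

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "rig.sensors.raw",
    bootstrap_servers="edge-broker:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="edge-broker:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

window, window_start = [], time.time()
for message in consumer:
    reading = message.value               # e.g. {"pressure": 812.4}
    window.append(reading["pressure"])

    if reading["pressure"] > 900.0:       # anomaly: forward immediately
        producer.send("rig.sensors.alerts", reading)

    if time.time() - window_start >= 60:  # one summary per minute
        producer.send("rig.sensors.summary", {
            "count": len(window),
            "avg": sum(window) / len(window),
            "max": max(window),
        })
        window, window_start = [], time.time()
```

Only the one-minute summaries and the rare alert records cross the network; the raw stream never leaves the edge.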
>> There's no way that's going to be successful unless you go to the edge and-- >> It takes time. >> With a cost. >> Now the other thing is, of course, you've got the Aruba asset, to be able to, I always say, joke, connect the windmill. But, Carlo, can we go back to the IIoTA example? >> Carlo: Correct, yeah. >> I want to help our audience understand, sort of, the new HP, post these spin merges. So previously you would say, okay, we have Vertica. You still have a partnership, or you still own Vertica, but after September 1st-- >> Absolutely, absolutely. It's part of the columnar side-- >> Right, yes, absolutely, but, so. But the new strategy is to be more of a platform for a variety of technology. So how, for instance, would you solve, or did you solve, that problem that you described? What did you actually deliver? >> So again, as I said, especially in the Industrial IoT, we are an ecosystem, okay? So we're one element of the ecosystem solution. For oil and gas specifically, we're working with other system integrators. We're working with oil and gas industry expertise, like DXC, right, the company that we just split off a few days ago, and we're working with them. They're providing the industry expertise. We are an infrastructure provider around that, and the services around that for the infrastructure element. But for the industry expertise, we try to have a kind of little bit of knowledge, to start the conversation with the customer. But again, my role in the strategy is actually to be an ecosystem digital integrator. That's the new terminology we like to bring to the market, because we really believe that's what HP's role is going to be. And the relevance of HP totally depends on whether we are going to be successful in these types of things. >> Okay, now a couple other things you talked about in your keynote. I'm just going to list them, and then we can go wherever we want. There was Data Lake 3.0, storage disaggregation, which is kind of interesting, 'cause it's been a problem, Hadoop as a service, realtime everywhere, and then analytics at the edge, which we kind of just talked about. Let's pick one. Let's start with Data Lake 3.0. What is that? John doesn't like the term data lake. He likes data ocean. >> I like data ocean. >> Is Data Lake 3.0 becoming an ocean? >> It's becoming an ocean. So, Data Lake 3.0 for us is actually following what is going to be the future for HDFS 3.0. So we have three elements. The erasure coding feature, which is coming in HDFS. The second element is around having an HDFS data tier, a multi-data tier. So we're going to have faster SSD drives. We're going to have big memory nodes. We're going to have GPU nodes. And the reason why I say disaggregation is because some of the workloads will be compute only, and some of the workloads will be storage only, okay? So we're going to bring, and the customers require this, because they're getting more data, and they need to have, for example, YARN applications running on compute nodes, and at the same level, they want to have storage components running on the storage nodes, like HBase for example, like HDFS 3.0 with the multi-tier option. So that's why the data disaggregation, or disaggregation between compute and storage, is the key point. We call this asymmetric, right? Hadoop is becoming asymmetric. That's what it means.
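For concreteness, the two HDFS features Carlo lists can be exercised from the stock Hadoop command-line tools roughly as follows; a sketch only, with hypothetical paths and policy choices (erasure coding arrived with HDFS 3.0, tiered storage policies somewhat earlier).

```python
# Sketch only: exercising HDFS erasure coding and tiered storage
# policies through the stock command-line tools. Paths and policy
# names are hypothetical placeholders.
import subprocess

def hdfs(*args):
    subprocess.run(["hdfs", *args], check=True)

# Erasure coding: cold data at ~1.5x storage overhead instead of
# 3x replication.
hdfs("ec", "-setPolicy", "-path", "/data/cold", "-policy", "RS-6-3-1024k")

# Tiering: pin hot paths to SSD, demote old partitions to archive.
hdfs("storagepolicies", "-setStoragePolicy", "-path", "/data/hot",
     "-policy", "ALL_SSD")
hdfs("storagepolicies", "-setStoragePolicy", "-path", "/data/year=2015",
     "-policy", "COLD")
```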
>> And the problem you're solving there is, when I add a node to a cluster, I don't have to add compute and storage together, I can disaggregate and choose whatever I need, >> Exactly what we did. >> based on the workload. >> They are all multi-tenancy kinds of workloads, and they are independent and they scale out. Of course, it's much more complex, but we have actually proved that this is the way to go, because that's what the customer is demanding. >> So, 3.0 is actually functional. It's erasure coding, you said. There's a data tier. You've got different memory levels. >> And I forgot to mention, the containerization of the applications. Having dockerized applications, for example. Using Mesosphere, for example, right? So having the containerization of the applications is what all of that means, because what we do in Hadoop, we actually build the different clusters, and they need to talk to each other and exchange data in a faster way. And a solution like, a product like SQL Manager, from Hortonworks, is actually helping us to get this connection between the clusters faster and faster. And that's what the customer wants. >> And then Hadoop as a service, is that an on-premise solution, is that a hybrid solution, is it a cloud solution, all three? >> I can offer all of them. Hadoop as a service could be run on-premise, could be run on a public cloud, could be run on Azure, or could be a mix of them, partially on-premise and partially on public. >> And what are you seeing with regard to customer adoption of cloud, and specifically around Hadoop and big data? >> I think, the way I see that adoption, all the customers want to start very small. The maturity is actually better from a technology standpoint. If you had asked me the same question maybe a year ago, I would have said, it's difficult. Now I think they've got the point. Every large customer, they want to build this big data ocean, or the lake, ocean, whatever you want to call it. >> John: Love that. (laughs) >> All right. They want to build this data ocean, and the point I want to make is, they want to start small, but they want to think very high. Very big, right, from their perspective. And the way they approach us is, we have a kind of methodology. We establish the maturity assessment. We do a kind of capability maturity assessment, where we find out if the customer is actually a pioneer, or is actually a very traditional one, so very slow-going. Once we determine what stage the customer is at, we propose some specific proof of concept. And in three months, usually, we're putting this in place. >> You also talked about realtime everywhere. We in our research, we talk about the, historically, you had batch and interactive, and now you have what we call continuous, or realtime streaming, workloads. How prevalent is that? Where do you see it going in the future? >> So I think that is another trend for the future, as I mentioned this morning in my presentation. So Spark, which is actually the open-source in-memory engine process, is actually the core of this stuff. We see 60 to 70 times faster analytics compared to not using Spark. So many customers implemented Spark because of this. The requirement is that the customer needs an immediate response time, okay, for a specific decision-making that they have to do, in order to improve their business, in order to improve their life. But this requires a different architecture.
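The 60-to-70-times figure is the kind of gain usually attributed to Spark's in-memory execution on iterative analytics, and a minimal sketch of the pattern looks like the following; the path and column names are hypothetical, and real speedups depend entirely on the workload.

```python
# Illustrative sketch, not a benchmark: the Spark pattern behind
# claims like "60 to 70 times faster" is keeping the working set in
# cluster memory so iterative analytics stop re-reading from disk.
# The path and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-analytics").getOrCreate()

events = spark.read.parquet("hdfs:///data/events")
events.cache()          # pin the working set in memory

# The first action pays the disk I/O cost once...
events.count()

# ...subsequent passes over the same data run from memory.
events.groupBy("customer_id").count().show()
events.filter(events["amount"] > 1000).count()
```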
>> I have a question, 'cause you've lived in the United States, you're obviously global, and you've spent a lot of time in Europe as well, and a lot of times people want to discuss the differences between, let's make it specific here, the European continent and North America. And from a sophistication standpoint, the same, we can agree on that, but there are still differences. Maybe greater privacy concerns. The whole thing with the cloud and the NSA in the United States created some concerns. What do you see as the differences today between North America and Europe? >> From my perspective, I think, for example, take IoT, Industrial IoT. I think in Europe we are much more advanced. I think in the manufacturing and the automotive space, the connected-car kind of things, autonomous driving, this is something that we already know how to manage, how to do. I mean, Tesla in the US is a good example that what I'm saying is not true, but if I look at, for example, the large German car manufacturers, they already implemented these types of things today. >> Dave: For years, yeah. >> That's the difference, right? I think the second step is about the faster analytics approach. So, what I mentioned before. Powering the Intelligent Edge, in my opinion, is at the moment much more advanced in the US compared to Europe. But I think Europe is starting to run to catch up and going down the same route. Because we believe that putting compute capacity on the edge is what the customer actually wants. But those are the two big differences I see. >> The other two big external factors that we like to look at are Brexit and Trump. So (laughs) how 'bout Brexit? Now that it's starting to actually begin the process, how should we think about it? Is it overblown? Is it critical? What's your take? >> Well, I think it's too early to say. The UK just started the split a few days ago, right, officially. It's going to take another 18 months before it's going to be completed. From a commercial standpoint, we don't see any difference so far. We're actually working the same way. For me it's too early to say if there's going to be any implication from that. >> And we don't know about Trump. We don't have to talk about it, but I saw some data recently that European sentiment, business sentiment, is trending stronger than the US, which is different than it's been for the last many years. What do you see in terms of just sentiment, business conditions in Europe? Do you see a pickup? >> It's getting better, it is getting better. I mean, if I look at the major countries, the GDP is going positive, 1.5%. So I think from that perspective, we are getting better. Of course, we are still suffering from the Chinese and Japanese markets sometimes. Especially in some of the big, large deals. The influence of the Japanese market, I feel it, and the Chinese market, I feel that. But I think the economy is going to be okay, so it's going to be good. >> Carlo, I want to thank you for coming on and sharing your insight. Final question for you. You're new to HPE, okay. We have a lot of history, obviously. I spent a long part of my career there, early in my career. Dave and I have covered the transformation of HP for many, many years, with theCUBE certainly. What attracted you to HP, and what would you say is going on at HP from your standpoint that people should know about? >> So I think the number one thing is that for us the word is going to be hybrid.
It means that some of the services that you can implement, either on-premise or on cloud, could be done very well by the new Pointnext organization. I'm not part of Pointnext. I'm in EG, the Enterprise Group division. But I am a fan of Pointnext, because I believe this is the future of our company; it's on the services side, that's where it's going. >> I would just point out, Dave and I, our commentary on the spin merge has been: create these highly cohesive entities, very focused. Antonio now running EG, big fans of that, where it's actually an efficient business model. >> Carlo: Absolutely. >> And Chris Hsu is running the Micro Focus, CUBE Alumni. >> Carlo: It's a very efficient model, yes. >> Well, congratulations, and thanks for coming on and sharing your insights here in Europe. And certainly it is an IoT world, IIoT. I love the analytics story, foundational services. It's going to be great, open source powering it, and this is theCUBE, opening up our content and sharing that with you. I'm John Furrier, with Dave Vellante. Stay with us for more great coverage, here from Munich, after the short break.

Published Date : Apr 6 2017


Scott Gnau | DataWorks Summit Europe 2017


 

>> Announcer: Live from Munich, Germany, it's theCUBE. Covering DataWorks Summit Europe 2017. Brought to you by Hortonworks. (soft technological music) >> Okay, welcome back everyone, we're here in Munich, Germany for DataWorks Summit 2017, formerly Hadoop Summit, powered by Hortonworks. It's their event, but now called DataWorks because data is at the center of the value proposition, Hadoop plus real-time data and storage. I'm John, with my cohost David. Our next guest is Scott Gnau, he's the CTO of Hortonworks, joining us again from the keynote stage. Good to see you again. >> Thanks for having me back, great to be here. >> Good having you back. Get down and dirty and get technical. I'm super excited about the conversations that are happening in the industry right now, for a variety of reasons. One is, you can't get more excited about what's happening in the data business. Machine learning, AI has really brought up the hype around it, and to me it's humanizing: people can visualize AI and see the self-driving cars and understand how software's powering all this. But still it's data driven, and Hadoop is extending into data, seeing that natural extension, and Cloudera has filed their S-1 to go public. So it brings back the conversations of this open-source community that's been doing all this work in the big data industry, originally riding in on the horse of Hadoop. You guys have an update to your Hadoop data platform, which we'll get to in a second, but I want to ask you, a lot of stories around Hadoop. I say Hadoop was the first horse that everyone rode in on in the big data industry... When I say big data, I mean like DevOps, cloud, the whole open-source ethos, but it's evolving, it's not being replaced. So I want you to clarify your position on this, because we were just talking about some of the false premises: a lot of stories being written about the demise of Hadoop, long live Hadoop. >> Yeah, well, how long do we have? (laughing) I think you hit it first, we're at DataWorks Summit 2017, and we rebranded, and it was previously Hadoop Summit. We rebranded it to really recognize that there's this bigger thing going on, and it's not just Hadoop. Hadoop is a big contributor, a big driver, a very important part of the ecosystem, but it's more than that. It's really about being able to manage and deliver analytic content on all data, across that data's lifecycle: from when it gets created at the edge, to it's moving through networks, to it's landed and stored in a cluster, to analytics being run and decisions going back out. It's that entire lifecycle, and you mentioned some of the megatrends, and I talked about this this morning in the opening keynote. With AI and streaming and IoT, all of these things kind of converging are creating a much larger problem set and frankly, opportunity for us as an industry to go solve. So that's the context that we're really looking-- >> And there's real demand there. This is not like, I mean, there's certainly a hype factor on AI, but IoT is real. You have data now, not just a back-office concept, you have a front-facing, business-centric... I mean, there's real customer demand here. >> There's real customer demand, and it really creates the ability to dramatically change a business. A simple example that I used onstage this morning is, think about the electric utility business. I live in Southern California.
25 years ago, by the way, I studied to be an electrical engineer; 20 years ago, 30 years ago, that business, not entirely simple, was about building a big power plant and distributing electrons out to all the consumers of electrons. One direction, and optimization of that grid, that network and that business, was very hard, and there were billions of dollars at stake. Fast forward to today. Now you've still got those generating plants online, but you've also got folks like me generating their own power and putting it back into the grid. So now you've got bidirectional electrons. The optimization is totally different. Then how do you figure out how most effectively to create capacity and distribute that capacity, because created capacity that's not consumed is 100% spoiled. So it's a huge data problem, but it's a huge data problem meeting IoT, right? Devices, smart-meter devices out at the edge, creating data, doing it in realtime. A cloud blew over, my generating capacity on my roof went down, so I've got to pull from the grid. Combining all of that data to make realtime decisions, we're talking hundreds of billions of dollars, and it's being done today in an industry that's not a high-tech Silicon Valley kind of industry: electric utilities are taking advantage of this technology today. >> So we were talking off-camera about, you know, some commentary that Hadoop has failed, and obviously you take exception to that, and you also made the point that it's not just about Hadoop, but in a way it is, because Hadoop was the catalyst of all this open source. Why has Hadoop not failed, in your view? >> Well, because we have customers, and you know, the great thing about conferences like this is we're actually able to get a lot of folks to come in and talk about what they're doing with the technology and how they're driving business benefit, and share that business benefit with their colleagues. So we see that business benefit coming along. In any hype cycle, you know, people can go down a path; maybe they had false expectations early on. You know, six years ago we were talking about, hey, is open-source Hadoop going to come along and replace EDW? Complete fallacy, right? What I talked about in that opportunity, being able to store all kinds of disparate data, being able to manage and maneuver analytics in real time, that value proposition is very different than some of the legacy tech. So if you view it as, hey, this thing is going to replace that thing, okay, maybe not, but the point is, it's very successful for what it was built for-- >> Just to clarify what you just said there: you guys never took that position. Cloudera kind of did, with their Impala, as their initial position. Would you agree, or don't you agree with that? >> Publicly they would say, oh, it's not a replacement, but you're right, I mean, the actions were maybe designed to do that. >> And set expectations in the marketplace that that might be one of the outcomes. >> Yeah, but they pivoted quickly when they realized that was a failed strategy. But that became a premise that people locked in on. >> If that becomes your yardstick for measuring, then, so-- >> Oh, but wouldn't you agree that Hadoop in many respects was designed to solve some of the problems that EDW never could? >> Exactly. So, you know, again, when you think about the variety of data, when you think about the analytic content, doing time-series analysis is very hard to do in a relational model, so it's a new tool in the workbench to go solve analytic problems. And so when you look at it from that perspective, and I use the utility example, the manufacturing example, financial, consumer finance, telco, all of these companies are using this technology, leveraging this technology, to solve problems they couldn't solve before, and frankly, to build new businesses that they couldn't build before, because they didn't have access to that real-time-- >> And so money did shift from pouring money into the EDW, with limited returns because you were at the steep part, or the flat part, of the S-curve, to, hey, let's put it over here in this so-called big data thing, and that's why the market, I think, was conditioned to sort of come to that simple conclusion. But dollars, the spending, did shift, did it not? >> Yeah, I mean, if you subscribe kind of to that herd mentality, you know, the net increase, the net new expenditure in the new technology, is always going to outpace the growth of the existing kind of plateaued technologies. That's just math. >> The growth, yes, but not the size, not the absolute dollars, and so you have a lot of companies right now struggling in the traditional legacy space, and you've got this rocket ship going in-- >> And again, I think if you think about kind of the converging forces that are out there, in addition to, you know, IoT and streaming: frankly, Hadoop is an enabler of AI. When you think about the success of AI and machine learning, it's about having massive, massive, massive amounts of data, right? And I think back 25 years ago, my first data mart was 30 gigabytes, and we thought that was all the data in the world. Now it fits on your phone. So when you think about just having the utter capacity and the ability to actually process that capacity of data, these are technology breakthroughs that have been driven in the core open-source Hadoop community. You combine that with the ability then to execute in clouds and ephemeral kinds of workloads, you combine all that stuff together, and now, instead of going to capital committee for 20 million dollars for a bunch of hardware to do an exabyte kind of study where you may not get an answer that means anything, you can now spin that up in the cloud, and for a couple of thousand dollars get the answer, take that answer and go build a new system of insight that's going to drive your business. And this is a whole new area of opportunity, driven by the convergence of all that. >> So I agree. I mean, it's absurd to say Hadoop and big data has failed, it's crazy.
Okay, but despite the growth, what I've called profitless prosperity, can the industry fund itself? I mean, you've got to make big bets: YARN, Tez, different clouds. How does the industry turn into one that is profitable and growing? >> Well, I mean, obviously it creates new business models and new ways of monetizing software and deploying software. You know, one of the key things that is core to our belief system is really leveraging and working with and nurturing the community; that is going to be a key success factor for our business, right? Nurturing that innovation and collaboration across the community, to keep up with the rate and pace of change, is one of the aspects of being relevant as a business. And then obviously creating a great service experience for our customers, so that they know they can depend on enterprise-class support, enterprise-class security and governance, and operational management, in the cloud and on-prem. Creating that value proposition, along with the advanced and accelerated delivery of innovation, is where I think we intersect uniquely in the industry. >> And one of the things that I think people point out, and I have this conversation all the time with people who try to squint through, you know, the Wall Street implications of the value proposition of the industry, and this and that, and I want to get your thoughts on it, because open source in this era that we're living in today is bringing so much value outside of just Hortonworks, your company. Dave made a comment on the intro package we were doing, that the practitioners are getting a lot of value, people out in the field. So these are the white spaces of value, and they're actually transformative. Can you give some examples where things are getting done that are of real value, use cases that highlight that? I think that's the unwritten story that no one thought about, that rising tide floating all boats that's happening. >> Yeah, yes. I mean, most of those white-space use cases again really involve kind of integrating legacy, traditional transactional information, right, very valuable information about a company, its operations, its customers, its products, and all this kind of thing, and being able to combine that with the ability to do real-time sensor management, and ultimately have a technology stack that enables kind of the connection of all of those sources of data for an analytic. And that's an important differentiation. You know, for the first 25 years of my career, right, it was all about pulling all this data into one place, and then let's do something with it, and then we can push analytics back out. Not an entirely bad model, but a model that breaks in the world of IoT-connected devices. There just frankly isn't enough money to spend on bandwidth to make that happen, and as fast as the speed of light is, it creates latency, so those decisions aren't going to be able to be made in time. So we're seeing, even in the traditional, I mentioned the utility business, think about manufacturing, oil and gas, right, sensors everywhere: being able to take advantage not of collecting all the sensor data centrally and all of that, but being able to actually create analytics based on sensor data and push those analytics out to the sensors, to make real-time decisions that can affect hundreds of millions of dollars of production or equipment. Those are the use cases that we're seeing deployed today, and that's complete white space that was unavailable before.
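A sketch of the push-analytics-to-the-sensor pattern Scott describes: fit a model centrally on the cluster, ship only its coefficients to the device, and make the decision locally with no network round trip. The features, weights, and threshold here are hypothetical, not any customer's actual model.

```python
# Hypothetical coefficients and threshold: the model is fitted
# centrally, but only this tiny scoring function ships to the device.
import math

WEIGHTS = {"vibration": 2.1, "temperature": 0.04, "bias": -6.3}

def failure_probability(vibration, temperature):
    z = (WEIGHTS["vibration"] * vibration
         + WEIGHTS["temperature"] * temperature
         + WEIGHTS["bias"])
    return 1.0 / (1.0 + math.exp(-z))   # logistic score, cheap enough for a sensor

# Runs on the device itself: act immediately, no round trip.
if failure_probability(vibration=2.4, temperature=88.0) > 0.8:
    print("throttle equipment and flag for maintenance")
```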
>> Yeah, and customer demand too. I mean, Dave and I were also debating about this not being a new trend; this is just big data happening, the customers are demanding production workloads, so you've seen a lot more forcing function driven by the customer. And you guys have some news I want to get to and get your thoughts on: HDP, the Hortonworks Data Platform, 2.6. What's the key news there? How's it real time? You're talking about real time. >> Yeah, it's about real time, real-time flexibility and choice, you know, motherhood and apple pie. >> And the major highlights of that are? >> So the upgrade's really inside of Hive. We now have operational analytic query capabilities, where you get tactical response times: second, sub-second kind of response time. You know, Hadoop and Hive weren't previously known for that kind of tactical response. We've been able to now add, inside of that technology, the ability to do that workload. We have customers who are building these white-space applications, who have hundreds or thousands of users or applications that depend on consistency of very quick analytic response time; we now deliver that inside the platform. What's really cool about it, in addition to the fact that it works, is that we did it inside of Hive. So we didn't create yet another project, or yet another thing that a customer has to integrate with or rewrite their application for; any Hive-based application can now take advantage of this performance enhancement. And that's part of our thinking of it as a platform. The second thing inside of that that we've done, that really caters to those kinds of workloads, is we've really enhanced the ability to do incremental data acquisition, whether it be streaming, whether it be batch upserts, right, on the SQL side doing upserts: being able to do that data maintenance in an ACID-compliant fashion, completely automatically and behind the scenes, so that those applications, again, can just kind of run without any heavy lifting. >> Just staying in motion, kind of thing going on. >> Right, it's anywhere from data in motion, even to batch, to mini-batch, and anywhere kind of in between, but we're doing those incremental data loads, and you know, it's easy to get the same file twice by mistake. You don't want to double count; you want to have sanctity of the transactions. We now handle that inside of Hive, with ACID compliance. >> So a layperson question for the CTO, if I may. You mentioned Hadoop was not known for a sort of real-time response; you just mentioned ACID, and it was never in the early days known for a sort of ACID, you know, compliance. Others would say, you know, Hadoop, the original big data platform, is not designed for the matrix math of AI, for example. Are these misconceptions? And like Tim Berners-Lee, when we met Tim Berners-Lee, said of web 2.0, this is what the web was designed for: would you say the same thing about Hadoop? >> Yeah. Ultimately, from my perspective, and kind of netting it out, Hadoop was designed for the easy acquisition of data, the easy onboarding of data, and then, once you've onboarded that data, it also was known for enabling new kinds of analytics that could be plugged in, certainly starting out with MapReduce and HDFS at the beginning. But the whole idea is, I now have a flexible way to easily acquire data in its native form, without having to apply schema, without having to have any formatting distortion. I can get it exactly as it was and store it, and then I can apply whatever schema, whatever rules, whatever analytics on top of that that I want. So the center of gravity, to my mind, has really moved up to YARN, which enables a multi-tenancy approach to having pluggable, multiple different kinds of file formats, and pluggable different kinds of analytics and data access methods, whether it be SQL, whether it be machine learning, whether it be HBase lookup and indexing, and anywhere kind of in between. It's that Swiss Army knife, as it were, for handling all of this new stuff that is changing every second: we sit here, data has changed. >> And just a quick follow-up if I can, just a clarification. So you said new types of analytics can be plugged in, by design, because of its openness, is that right? >> By design, because of its openness and the flexibility that the platform was built for. In addition, on the performance side, we've also got a new update to Spark, and usability, consumability, and collaboration for data scientists using the latest versions of Spark inside the platform. We've got a whole lot of other features and functions that our customers have asked for. And then on the flexibility and choice, it's available on public cloud infrastructure as a service, public cloud platform as a service, on-prem x86, and net new on-prem with Power. >> Just got a final question for you. Just as the industry evolves, what are some of the key areas that open source can pivot to that really take advantage of the machine learning, the AI trends going on? Because you start to see that really increase the narrative around the importance of data, and a lot of people are scratching their heads going, okay, I need to do the back office, set up my IT, to have all those great open-source projects, all that, the Hadoop data platform, but then I've got to get down and dirty: I might do multiple clouds, I've got the hybrid cloud going on, I might want to leverage the new cool containers and Kubernetes and microservices, and almost DevOps everywhere. Where's that transition happening? As a CTO, what do you see there? How do you talk to customers about that transition, this evolution of how the data business is getting more and more mainstream? >> Yeah, I mean, I think the big thing that people have had to get over is, we've reversed polarity from, again, 30 years of, I want a stack vendor to have an integrated stack of everything, plug-and-play, it's integrated end to end, and it might not be a hundred percent what I want, but the cost leverage that I get out of the stack versus going to build what's perfect... In this world it's the opposite. It's about enabling the ecosystem, and that's where, and by the way, it's a combination of open-source and proprietary software, you know, some of our partners have proprietary software, and that's okay. But it's really about enabling the ecosystem, and I think the biggest service that we as an open-source community can do is to continue to kind of keep that standard kernel for the platform, and make it very usable and very easy for many apps and software providers and other folks. >> A thousand flowers bloom kind of concept, and that's what you've done with the white spaces, as these use cases are evolving very rapidly, and then the bigger apps are kind of settling into a workload with realtime. >> Yeah, all the time. You know, think about the next generation of IT professional, the next generation of business professional: they grew up with iPhones, and here they come, they grew up in a mini-app world. I mean, download an app, I'm going to try it, it's a widget, boom, and it's going to help me get something done, but it's not a big stack that I'm going to spend 30 years to implement. And I like it, and then I want to take those widgets and connect them together to do things that I haven't been able to do before, and that's how this ecosystem is really-- >> Great DevOps culture, very agile, that's their mindset. So Scott, congratulations on your 2.6 upgrade, and-- >> Scott: We're thrilled about it. >> Great stuff, ACID compliance, really big deal, again, that compliance, because little things are important in the enterprise. Great, all right, thanks for coming on theCUBE, at DataWorks in Munich, Germany. I'm John, thanks for watching. More coverage live here in Germany after this short break.
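A hedged sketch of what the Hive side of the upsert capability Scott describes might look like: a transactional table taking idempotent incremental loads via MERGE. Connection details, table names, and the staging layout are hypothetical, and it assumes the PyHive client against a Hive 2.x / HDP 2.6-era cluster already configured for ACID transactions.

```python
# Hedged sketch of an ACID upsert flow in Hive; assumes the PyHive
# client and a cluster already configured for transactions. Host,
# table names, and the staging layout are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive.example.com", port=10000)
cursor = conn.cursor()

# Transactional tables must be bucketed ORC with the flag set.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS accounts (
        id BIGINT,
        balance DECIMAL(18,2),
        updated TIMESTAMP
    )
    CLUSTERED BY (id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")

# Idempotent incremental load: re-delivered rows update in place
# instead of double-counting.
cursor.execute("""
    MERGE INTO accounts t
    USING staging_accounts s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET balance = s.balance, updated = s.updated
    WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.balance, s.updated)
""")
```

Because the merge is keyed on id, feeding the same staging file through twice leaves the table unchanged, which is exactly the "same file twice by mistake" problem discussed above.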

Published Date : Apr 5 2017


Day One Kickoff – DataWorks Summit Europe 2017 - #DW17 - #theCUBE


 

>> Narrator: Covering DataWorks Summit Europe 2017. Brought to you by Hortonworks. >> Hello everyone, welcome to theCUBE's special presentation here in Munich, Germany for DataWorks Summit 2017. This is the Hadoop Summit powered by Hortonworks. This is their event, and again, it shows the transition from the Hadoop world to the big data world. I'm John Furrier. My co-host Dave Vellante, good to see you Dave. We're back in the seats together, usually on different events, but now here together in Munich. Great beer, great scene here. A small European event for Hortonworks and the ecosystem, but it's called DataWorks 2017. Strata Hadoop is calling themselves Strata and Data. You're starting to see the word Hadoop being sunsetted from these events, which is a big theme of this year: the transition from Hadoop being the branded category to data. >> Well, you're certainly seeing that in a number of ways. The titles of these events. Well, first of all, I love being in Europe. These venues are great, right? They're so Euro, very clean and magnificent. But back to your point. You're seeing the Hadoop Summit now called the DataWorks Summit. You're seeing Strata Plus Hadoop is now Strata Plus, I don't even know what it is. Right, it's not Hadoop driven anymore. You see it also in Cloudera's IPO. They barely talk about Hadoop; they're a Hadoop distro vendor, but they talked about being a data management company. And John, I think we are entering the era, or well deep into the era, of what I have been calling for the last couple of years profitless prosperity. Really, where you see the Cloudera IPO, as you know, they raised money from Intel, over $600 million at a $4.1 billion valuation. The Wall Street Journal says they'll have a tough time getting a billion-dollar valuation. For every dollar each of these companies spends, Hortonworks and Cloudera, they lose between $1.70 and $2.50. So we've always said at SiliconANGLE, Wikibon, and theCUBE that the people who are going to make money in big data are the practitioners of big data, and it's hard to find those guys, it's hard to see them, but that's really what's happening: the industries are transforming, and those are the guys that are putting money into their bottom line. Not so much the technology vendors. >> Great to unpack that, but first of all, I want to just say congratulations to Wikibon for getting it right again. As usual, Wikibon, ahead of the curve and being out there and getting it right, because I think you nailed it, and I think Wikibon saw this first of all the research firms. Kind of, you know, pat ourselves on the back here, but the truth is that practitioners are making the money, and I think you're going to see more of that. In fact, last night, as I'm having a nice beer here in Germany, I just like to listen to the conversations in the bar area, and a lot of conversations around, real conversations around, you know, doing deals and, you know, deployments. You know, you're hearing about HBase, you're hearing about clusters, you're hearing about service revenue, and I think this is the focus. Cloudera, I think, in a classic Silicon Valley way, their hubris was tempered by their lack of scale. I mean, they didn't really blow it out. I mean, now they do 200 million in revenue. Nothing to shake a stick at, they did a great job, but they're buying revenue, and Hortonworks is as well. But the ecosystem is the factor, and this is the wildcard. I'm making a prediction.
Profitless prosperity, as you point out, is right, but I think that it has longevity with these companies like Hortonworks and Cloudera and others, like MapR, because the ecosystem's robust. If you factor in the ecosystem revenue, that is enough rising tide, in my opinion. The question is how they become sustainable as a standalone venture; that Red Hat for Hadoop never worked out, as Pat Gelsinger, you know, predicted. So, I think you're going to see a quick shift and pivot quickly by Hortonworks, and certainly Cloudera's going to be under the microscope once they go public. I'm expecting that valuation to plummet like a rock. They're going to go public, Silicon Valley people are going to get their exits, but-- >> Accel will be happy. >> Everyone, yeah, they'll be happy. They already sold in 2013. They did a big sale, I mean, all of them cashed out two years ago when that liquidation event happened with Intel, but that's fine. But now it's back to business building, and Hortonworks has been doing it for years. So when you see your valuation is less than a billion... so I'm expecting Cloudera to plummet like a rock. I would not buy the IPO at all, because I think it's going to go well under a billion dollars.
And I think it's the right call, and as we know, at the end of last year, Fidelity and other mutual funds devalued their holdings in Cloudera, and so, you know, you've got this situation where, as you say, a couple hundred, maybe, you know, on the way to 300 million in revenue, Hortonworks on the way to 200 million in revenue. Add up the ecosystem, yeah, maybe you get to a billion; throw in all of what IBM and Oracle call big data, and it's kind of a more interesting business. But you've called it same wine, new bottle. Is it a new bottle? Now, what I mean by that is the shift from Hadoop, and then again, you read Cloudera's S-1, it's all about AI, machine learning, you know, the cloud. Interesting, we'll talk about the cloud a little later, but is it same wine, new bottle, or is this really a shift toward a new era of innovation? >> It's not a new shift. It's the same innovation that Hortonworks was founded on. Big data is the category, and Hadoop was the horse it rode in on. But I think what's changing is the fact that customers are now putting real projects on the table, and the scrutiny around those projects has to produce value; the value comes down to total cost of ownership and business value. And you look at all the successes in the big data world, Spark and others: you're seeing a focus on cloud integration and real-time workloads. These are real projects. This isn't fantasy. This isn't hype. This isn't early adopter. These are real companies saying, we are moving to a new paradigm of digitally transforming our companies, and we need cost efficiencies, but also revenue-producing applications and workloads that are going to be running in the cloud, with data at the heart of it. So, this is a customer forcing function, where the customers are genuinely excited about machine learning, moving to real-time classification of workloads. This is the deal, and no hubris, no technology posturing, no open-standards jockeying can right the situation. Customers have demands and they want them filled, and we're going to have a lot of guests on here, and I'm going to ask them those direct questions: what are you looking for? >> Well, I totally agree with what you're saying, and when we first met, it was right around, you know, the midpoint of the web 2.0 era, and I remember Tim Berners-Lee commenting on all this excitement, everybody's doing... he said, this is what the web was invented to do. And this is what big data was invented to do: to produce deep analytics, deep learning, machine learning, you know, cognitive, as IBM likes to brand that. And so, it really is the next era, even though people don't like to use the term big data anymore. We were talking to, you know, some of the folks in our community earlier, John, you and I, about some of the challenges. Why is it profitless, you know? Why is there so much growth but no profit? And you know, we have to point out here that people like Hortonworks and Cloudera, they've made some big bets. Take HDFS for example. And now you have the cloud guys, particularly Amazon, coming in, you know, with S3. Look at YARN, big open-source project. But you've got Docker and Kubernetes, which seem to be mopping that up. Tez was supposed to replace MapReduce, and now you've got-- >> I mean, I wouldn't say mopping up, I mean-- >> You've got Spark. >> At the end of the day, the ecosystem's going to revolve around what the customers want, and portability of workloads, Kubernetes and microservices, these are areas that just absolutely make a lot of sense, and I think, you know, people will move to where the frictionless action is, and that's going to happen with Kubernetes and containers and microservices. But that just speaks to the DevOps culture, and I think the Hadoop ecosystem, again, was grounded in the DevOps culture. So, yeah, there's some projects that are maybe going to go out of flavor, but there's other stuff coming up through the ranks in open source, and I think it's compelling. >> But where I disagree with what you're saying, well, the point I'm trying to make, is you have to, if you're Cloudera and Hortonworks, you have to support those multiple projects, and it's expensive as hell. Whereas the cloud guys put all their wood behind one arrow, to use an old Scott McNealy phrase, and you know, Amazon, I would argue, is mopping up in big data. I think the cloud guys, you know, it's ironic to me that Cloudera, in the cloud era, picked that name, you know, but really never had-- >> John: They missed the cloud. >> They've never really had a strong cloud play, and I would say the same thing about Hortonworks and MapR. They have to play in the cloud, and they talk about cloud, but they've got to support hybrid, they've got to support on-prem, they've got to pick the clouds that they're going to support: AWS, Azure, maybe IBM's cloud. >> Look, Cloudera completely missed the cloud era, pun intended. However, they didn't miss open source. What I'm an admirer of Cloudera and Hortonworks on is that their open-source ethos is what drove them. And so they kind of got isolated in with some of their product decisions, but that's not a bad thing. I mean, ultimately, I'm really bullish on Cloudera and Hortonworks, because of the ecosystem points I mentioned earlier. Now, I wouldn't buy the IPO, I think I'd buy them at a discount, but Cloudera's not going to go away, Dave. They're going to go public. I think the valuation's going to drop like a rock and then settle around a billion, but they have good management. The founders are still there, Michael Olson, Amr Awadallah. So, you're going to see Cloudera transform as a company.
They have to do business out in the open and they're not afraid to, obviously, they're open source. So, we're going to start to see that transition from a private, venture-backed, scale-up, buy-revenue play, right out of the playbook of Silicon Valley venture capital, Accel Partners and Greylock. Now they go public and get liquid, and then the next phase of their journey is going to be to build a public company, and I think that they will do a good job doing it, and I'm not down on them at all for that. I think it's just going to be a transition. >> Well, they're going to raise what? A couple hundred million dollars? But this industry, yeah, this industry's cashflow negative, so I agree with you. Open source is great, let's rah-rah for open source, and it drives innovation, but how does this industry pay for itself? That's what I want to know. How do you respond to that? >> Well, I think they have sustainability issues around services, and I think partnering with the big companies like Intel that have professional services might help them on that front, but Michael Olson said in his founder's letter in his S1, kind of AI washing, he said AI and cognitive. But that's okay, because Cloudera could easily pivot with their brain power, and same with Hortonworks, to AI. Machine learning is very open source driven. Open source culture is growing, it's not going away, so I think Cloudera's in a very good position. >> I think the cloud guys are going to kill them in that game, and the cloud guys and IBM are going to cream these profitless startups in that AI and machine learning game. >> We'll see. >> You disagree? >> I disagree, I think. Well, I mean, it depends. I mean, you know, I'm not going to, you know, forecast what the managements might do, but I mean, if I'm Cloudera, looking at what Cloudera's done. >> What would you do? >> I would do exactly what Mike Olson's doing, I'd basically pivot immediately to machine learning. Look at Google. TensorFlow has got so much traction with their cloud because it's got machine learning built into it. Open source is where the action is, and that's where you could do a lot of good work and use it as an advantage, in that they know that game. I would not count out the open source game. >> So, we know how IBM makes money at that, you know, in theory anyway. We know how Amazon's going to make money at that with their proprietary approach, Microsoft will do the same thing. How do Cloudera and Hortonworks make money? >> I think it's a product transition around getting to the open source with cloud technologies. Amazon is not out to kill open source, so I think there's an opportunity to wedge in a position there, and so they've just got to move quickly. If they don't make these decisions, then that's a failed execution on the part of the management teams at Cloudera and Hortonworks, and I think they're on it. So, we'll keep an eye on that. >> No, Amazon's not trying to kill open source, I would agree, but they are bogarting open source in a big way and profiting amazingly from it. >> Well, they just do what Andy Jassy would say, they're customer driven. So, if a customer doesn't want to do five things to do one thing, this is back to my point. The customers want real-time workloads. They want it with open source and they don't want all these steps in the cost of ownership. That's why this is not a new shift, it's the same wine, new bottle, because now you're just seeing real projects that are demanding successful and efficient code and support, and whoever delivers it builds the better mousetrap.
In this case, the better mousetrap will win. >> And I'm arguing that the better mousetrap and the better marginal economics, I know I'm like a broken record on this, but if I take Kinesis and DynamoDB and Redshift and wrap it into my big data play, offer it as a service with a set of APIs on the cloud, like AWS is going to do, or is doing, and Azure is doing, that's a better business model than, as you say, five different pieces that I have to cobble together. It's just not economically viable for customers to do that. >> Well, we've got some big news coming up here. We're going to have two days of wall-to-wall coverage of DataWorks 2017. Hortonworks is announcing version 2.6 of the Hortonworks Data Platform. We're going to talk to Scott Gnau, the CTO, coming up shortly. Stay with us for exclusive coverage of DataWorks 2017 in Munich, Germany. We'll be back with more after this short break.

Published Date : Apr 5 2017

Darren Chinen, Malwarebytes - Big Data SV 17 - #BigDataSV - #theCUBE


 

>> Announcer: Live from San Jose, California, it's The Cube, covering Big Data Silicon Valley 2017. >> Hey, welcome back everybody. Jeff Frick here with The Cube. We are at Big Data SV in San Jose at the Historic Pagoda Lounge, part of Big Data week, which is associated with Strata + Hadoop. We've been coming here for eight years and we're excited to be back. The innovation and dynamism of big data, and evolutions now with machine learning and artificial intelligence, just continues to roll, and we're really excited to be here talking about one of the nasty aspects of this world, unfortunately, malware. So we're excited to have Darren Chinen. He's the senior director of data science and engineering from Malwarebytes. Darren, welcome. >> Darren: Thank you. >> So for folks that aren't familiar with the company, give us just a little bit of background on Malwarebytes. >> So Malwarebytes is basically next-generation anti-virus software. We started off from humble roots, with our founder at 14 years old getting infected with a piece of malware, and he reached out into the community and, at 14 years old, with the help of some people, wrote his first lines of code to remediate a couple of pieces of malware. It grew from there, and I think by the ripe old age of 18, he founded the company. And he's now, I want to say, 26 or 27, and we're doing quite well. >> It was interesting, before we went live you were talking about his philosophy and how important that is to the company, and it's now turned into really a strategic asset: that no one should have to suffer from malware, and he decided to really offer a solution for free to help people rid themselves of this bad software. >> Darren: That's right. Yeah, so Malwarebytes was founded under the principle, Marcin believes, that everyone has the right to a malware-free existence, and so we've always offered a free version of Malwarebytes that will help you to remediate if your machine does get infected with a piece of malware. And that's actually still going to this day. >> And that's now given you the ability to have a significant amount of endpoint data, transactional data, trend data, that now you can bake back into the solution. >> Darren: That's right. It's turned into a strategic advantage for the company; it's not something, I don't think, that we could have planned at 18 years old when he was doing this. But we've instrumented it so that we can get some anonymous-level telemetry and we can understand how malware proliferates. For many, many years we've been positioned as a second-opinion scanner, and so we're able to see a lot of things, some trends happening in there, and we can actually now see that in real time. >> So, starting out as a second-opinion scanner, you're basically looking at, you're finding what others have missed. And how can you, what do you have to do to become the first line of defense? >> Well, with our new product Malwarebytes 3.0, I think some of that landscape is changing. We have a very complete and layered offering. I'm not the product manager, so, as the data science guy, I don't know that I'm qualified to give you the ins and outs, but I think some of that is changing as we've combined a lot of products and we have a much more complete suite of layered protection built into the product.
>> And so, maybe tell us, without giving away all the secret sauce, what sort of platform technologies did you use that enabled you to scale to these hundreds of millions of endpoints, and then to be fast enough at identifying things that were trending that are bad, that you had to prioritize? >> Right, so traditionally, I think AV companies, they have these honeypots, right, where they go and they collect a piece of virus or a piece of malware, and they'll take the MD5 hash of that and then they'll basically insert that into a definitions database. And that's a very exact way to do it. The problem is that there's so much malware or viruses out there in the wild, it's impossible to get all of them. I think one of the things that we did was we set up telemetry, and we have a phenomenal research team where we're able to actually have our team catch entire families of malware, and that's really the secret sauce to Malwarebytes. There's several other levels, but that's where we're helping out in the immediate term. What we do is we have, internally, we sort of jokingly call it a Lambda Two architecture. We had considered Lambda long ago, and I say long ago, but it's about a year ago, when we first started this journey. But Lambda is riddled with, as you know, a number of issues. If you've ever talked to Jay Kreps from Confluent, he has a lot of opinions on that, right? And one of the key problems with that is that if you do a traditional Lambda, you have to implement your code in two places, it's very difficult, things get out of sync, you have to have replay frameworks. And these are some of the challenges with Lambda. So we do processing in a number of areas. The first thing that we did was we implemented Kafka to handle all of the streaming data. We use Kafka Streams to do inline stateless transformations and then we also use Kafka Connect. And we write all of our data both into HBase, we use that, we may swap that out later for something like Redis, and that would be a thin speed layer. And then we also move the data into S3 and we use some ephemeral clusters to do very large-scale batch processing, and that really provides our data lake. >> When you call that Lambda Two, is that because you're still working essentially on two different infrastructures, so your code isn't quite the same? You still have to check the results on either fork. >> That's right, yeah, we didn't feel like it was, we did evaluate doing everything in the stream. But there are certain operations that are difficult to do with purely stream processing, and so we did need to have a thin, what we call real-time indicators, a speed layer, to supplement what we were doing in the stream. And so that's the differentiating factor between a traditional Lambda architecture, where you'd want to have everything in the stream and everything in batch, and the batch is really more of a truing mechanism, as opposed to ours, where the real time is really directional. So in the traditional sense, if you look at traditional business intelligence, you'd have KPIs that would allow you to gauge the health of your business. We have RTIs, Real Time Indicators, that allow us to gauge, directionally, what is important to look at this day, this hour, this minute? >> This thing is burning up the charts, >> Exactly. >> Therefore it's priority one. >> That's right, you got it. >> Okay.
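To make the inline stateless transformation idea concrete, a minimal Kafka Streams sketch of the kind of filter-and-normalize step Chinen describes might look like the following. The topic names, broker address, and normalization logic here are invented for illustration, not taken from Malwarebytes' actual pipeline.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TelemetryCleaner {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The application id doubles as the consumer group; the broker address is a placeholder.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "telemetry-cleaner");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw endpoint telemetry, drop empty records, normalize, and re-publish.
        KStream<String, String> raw = builder.stream("telemetry-raw");
        raw.filter((key, value) -> value != null && !value.isEmpty())
           .mapValues(value -> value.trim().toLowerCase()) // stand-in for real normalization
           .to("telemetry-clean");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because Kafka Streams is just a library, this runs as an ordinary JVM process next to the brokers, which is the no-separate-cluster point that comes up next.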
And maybe tell us a little more, because everyone I'm sure is familiar with Kafka, but the streams product from them is a little newer, as is Kafka Connect, so it sounds like you've got, it's not just the transport, but you've got some basic analytics, and you've got the ability to do the ETL because you've got Connect that comes from sources and destinations, sources and sinks. Tell us how you've used that. >> Well, the streams product is quite different than something like Spark Streaming. It's not working off micro-batching, it's actually working off the stream. And the second thing is, it's not a separate cluster. It's just a library, effectively a .jar file, right? And so because it works natively with Kafka, it handles certain things there quite well. It handles back pressure, and when you expand the cluster, it's pretty good with things like that. We've found it to be a fairly stable technology. It's just a library, and we've worked very closely with Confluent to develop that. Whereas Kafka Connect is really something that we use to write out to S3. In fact, Confluent just released a new S3 connector that goes direct. We were using StreamX, which was a wrapper on top of an HDFS connector, and they rigged that up to write to S3 for us. >> So tell us, as you look out, what sorts of technologies do you see as enabling you to build a platform that's richer, and then how would that show up in the functionality consumers like we would see? >> Darren: With respect to the architecture? >> Yeah. >> Well, one of the things that we had to do is we had to evaluate where we wanted to spend our time. We're a very small team; the entire data science and engineering team is less than, I think, 10 months old. So all of us got hired, we've started this platform, we've gone very, very fast. And we had to decide, how are we going to, a, we've made this big investment, how are we going to get value to our end customer quickly, so that they're not waiting around and you get the traditional big data story where, we've spent all this money and now we're not getting anything out of it. And so we had to make some of those strategic decisions, and because of the fact that the data was really, truly big data in nature, there's just a huge amount of work that has to be done in these open-source technologies. They're not baked; it's not like going out to Oracle and giving them a purchase order and you install it and away you go. There's a tremendous amount of work, and so we've made some strategic decisions on what we're going to do in open source and what we're going to do with a third-party vendor solution. And one of those solutions that we decided on was workload automation. So I just did a talk on this, about how Control-M from BMC was really the tool that we chose to handle a lot of the coordination, the sophisticated coordination, and the workload automation on the batch side, and we're about to implement that in a data-quality monitoring framework. And that's turned out to be an incredibly stable solution for us. It's allowed us to not spend time with open-source solutions that do the same things, like Airflow, which may or may not work well, but there's really no support around that, and focus our efforts on what we believe to be the really, really hard problems to tackle in Kafka, Kafka Streams, Connect, et cetera.
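As a side note, that write-out-to-S3 leg via Kafka Connect is typically driven by configuration rather than code. A sketch of what a Connect S3 sink setup can look like with Confluent's connector, where the bucket, region, topic, and sizing values are all made-up placeholders:

```properties
# Hypothetical Kafka Connect S3 sink configuration (all names illustrative).
name=telemetry-s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=4
topics=telemetry-clean
s3.bucket.name=example-telemetry-lake
s3.region=us-west-2
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
# Roughly how many records to buffer before flushing an object to S3.
flush.size=10000
```

The Connect workers take care of partitioning, retries, and offset tracking, which is part of why teams reach for it instead of hand-rolled ETL jobs.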
>> Is it fair to say that Kafka plus Kafka Connect solves many of the old ETL problems, or do you still need some sort of orchestration tool on top of it to completely commoditize, essentially, moving and transforming data from an OLTP or operational system to a decision support system? >> I guess the answer to that is, it depends on your use case. I think there's a lot of things that Kafka and the Streams jobs can solve for you, but I don't think that we're at the point where everything can be streaming. I think that's a ways off. There are legacy systems that really don't natively stream to you anyway, and there are just certain operations that are more efficient to do in batch. And so that's why, for us, I don't think batch is going away any time soon, and that's one of the reasons why workload automation in the batch layer initially was so important, and we've decided to extend that, actually, into building out a data-quality monitoring framework to put a collar around how accurate our data is on the real-time side. >> 'Cause it's really horses for courses; it's not one or the other, it's application-specific, whatever the best solution for that particular need is. >> Yeah, I don't think that there's, if there was a one-size-fits-all, it'd be a company, and there would be no need for architects, so I think that you have to look at your use case, your company, what kind of data, what style of data, what type of analysis do you need. Do you really, actually, need the data in real time, and if you do put in all the work to get it in real time, are you going to be able to take action on it? And I think Malwarebytes was a great candidate. When it came in, I said, "Well, it does look like we can justify the need for real-time data, and the effort that goes into building out a real-time framework." >> Jeff: Right, right. And we always say, what is real time? In time to do something about it, (all chuckle) and if there's not time to do something about it, depending on how you define real time, really, what difference does it make if you can't do anything about it that fast? So as you look out in the future with IoT, all these connected devices, this is a hugely increased attack surface, as we just read about a few weeks back. How does that work into your planning? What do you guys think about the future, where there's so many more connected devices out on the edge, and various degrees of intelligence, and opportunities to hijack, if you will? >> Yeah, I think, I don't think I'm qualified to speak about the Malwarebytes product roadmap as far as IoT goes. >> But more philosophically, from a professional point of view, 'cause every coin has two sides, there's a lot of good stuff coming from IoT and connected devices, but as we keep hearing over and over, just this massive attack-surface expansion. >> Well, I think, for us, the key is we're small and we're not operating, like, I came from Apple, where we operated on a budget of infinity, so we're not-- >> You have to address infinity (Darren laughs) with an actual budget. >> We're small and we have to make sure that whatever we do creates value. And so what I'm seeing in the future is, as we get more into the IoT space and logs begin to proliferate and data just exponentiates in size, it's really, how do we do the same thing, and how are we going to manage that in terms of cost? Generally, big data is very low in information density.
It's not like transactional systems where you get the data, it's effectively an Excel spreadsheet and you can go run some pivot tables and filters and away you go. I think big data in general requires a tremendous amount of massaging to get to the point where a data scientist or an analyst can actually extract some insight and some value. And the question is, how do you massage that data in a way that's going to be cost-effective as IoT expands and proliferates? So that's the question that we're dealing with. We're, at this point, all in with cloud technologies; we're leveraging quite a few of Amazon's services, server-less technologies as well. We're just in the process of moving to Athena, as just an on-demand query service. And we use a lot of ephemeral clusters as well, and that allows us to actually run all of our ETL in about two hours. And so these are some of the things that we're doing to prepare for this explosion of data and making sure that we're in a position where we're not spending a dollar to gain a penny, if that makes sense. >> That's his business. Well, he makes fun of that business model. >> I think you could do it; you want to drive revenue? Just sell dollars for 90 cents. >> That's the dot com model, I was there. >> Exactly, and make it up in volume. All right, Darren Chinen, thanks for taking a few minutes out of your day and giving us the story on Malwarebytes. Sounds pretty exciting and a great opportunity. >> Thanks, I enjoyed it. >> Absolutely. He's Darren, he's George, I'm Jeff, you're watching The Cube. We're at Big Data SV at the Historic Pagoda Lounge. Thanks for watching, we'll be right back after this short break. (upbeat techno music)
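For context on the Athena piece mentioned above: an on-demand query there is a single API call against data already sitting in S3, with no cluster to stand up or tear down. A rough sketch with the AWS SDK for Java, where the database, table, and output location are invented:

```java
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
import com.amazonaws.services.athena.model.QueryExecutionContext;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;
import com.amazonaws.services.athena.model.StartQueryExecutionResult;

public class AthenaQuerySketch {
    public static void main(String[] args) {
        AmazonAthena athena = AmazonAthenaClientBuilder.defaultClient();

        // Query S3-resident data in place; results land in another S3 location.
        StartQueryExecutionRequest request = new StartQueryExecutionRequest()
            .withQueryString("SELECT event_type, count(*) FROM telemetry GROUP BY event_type")
            .withQueryExecutionContext(new QueryExecutionContext().withDatabase("telemetry_lake"))
            .withResultConfiguration(new ResultConfiguration()
                .withOutputLocation("s3://example-athena-results/"));

        StartQueryExecutionResult result = athena.startQueryExecution(request);
        System.out.println("Query execution id: " + result.getQueryExecutionId());
    }
}
```

Pricing is per data scanned, which fits the not-spending-a-dollar-to-gain-a-penny discipline described in the interview.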

Published Date : Mar 15 2017

Wikibon Big Data Market Update pt. 2 - Spark Summit East 2017 - #SparkSummit - #theCUBE


 

(lively music) >> [Announcer] Live from Boston, Massachusetts, this is the Cube, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to Spark Summit in Boston, everybody. This is the Cube, the worldwide leader in live tech coverage. We've been here two days, wall-to-wall coverage of Spark Summit. George Gilbert, my cohost this week, and I are going to review part two of the Wikibon Big Data Forecast. Now, it's very preliminary. We're only going to show you a small subset of what we're doing here. And so, well, let me just set it up. So, these are preliminary estimates, and we're going to look at different ways to triangulate the market. So, at Wikibon, what we try to do is focus on disruptive markets, and try to forecast those over the long term. What we try to do is identify where the traditional market research estimates really, we feel, might be missing some of the big trends. So, we're trying to figure out, what's the impact, for example, of real time? And, what's the impact of this new workload that we've been talking about around continuous streaming? So, we're beginning to put together ways to triangulate that, and we're going to show you, give you a glimpse today of what we're doing. So, if you bring up the first slide, we showed this yesterday in part one. This is our last year's big data forecast. And, what we're going to do today, is we're going to focus in on that line, that S-curve. That really represents the real-time component of the market. Spark would be in there. Streaming analytics would be in there. Add some color to that, George, if you would. >> [George] Okay, for 60 years, since the dawn of computing, we had two ways of interacting with computers. You put your punch cards in, or whatever else, and you come back and you get your answer later. That's batch. Then, starting in the early 60's, we had interactive, where you're at a terminal. And then, the big revolution in the 80's was you had a PC, but you were still either interactive with a terminal or batch, typically for reporting and things like that. What's happening is the rise of a new interaction mode, which is continuous processing. Streaming is one way of looking at it, but it might be more effective to call it continuous processing because you're not going to get rid of batch or interactive, but your apps are going to have a little of each. So, what we're trying to do, since this is early, early in its life cycle, is we're going to try and look at that streaming component from a couple of different angles. >> Okay, as I say, that's represented by this ogive curve, or the S-curve. On the next slide, we're at the beginning when you think about these continuous workloads. We're at the early part of that S-curve, and of course, most of you, or many of you, know how the S-curve works. It's slow, slow, slow; for a lot of effort, you don't get much in return. Then you hit the steep part of that S-curve, and that's really when things start to take off. So, the challenge is, things are complex right now. That's really what this slide shows, and Spark is designed, really, to reduce some of that complexity. We've heard a lot about that, but take us through this. Look at this data flow from ingest, to explore, to process, to serve. We talked a lot about that yesterday, but this underscores the complexity in the marketplace.
[George] Right, and while we're just looking mostly at numbers today, the point of the forecast is to estimate when the barriers, representing complexities, start to fall, and then, when we can put all these pieces together, ingest, explore, process, serve, when that becomes an end-to-end pipeline. When you can start taking the data in on one end, get a scientist to turn it into a model, inject it into an application, and that process becomes automated, that's when it's mature enough for the knee in the curve to start. >> And that's when we think the market's going to explode. But now, how do you bound this? Okay, when we do forecasts, we always try to bound things, because if they're not bounded, then you get no foundation. So, if you look at the next slide, we're trying to get a sense of real-time analytics. How big can it actually get? That's what this slide is really trying to-- >> [George] So this one was one firm's take on real-time analytics, where by 2027, they see it peaking just under-- >> [Dave] When you say one firm, you mean somebody from the technology industry? >> [George] Publicly available data. And we take it as, since they didn't have a lot of assumptions published, okay, one data point. And then, we're going to come at it with some bottoms-up and top-down data points, and compare. >> [Dave] Okay, so the next slide, we want to drill into the DBMS market, and when you think about DBMS, you think about the traditional RDBMS and what we know, or the Oracle, SQL Server, IBM DB2's, etc. And then, you have these emergent NewSQL and NoSQL entrants, which are, obviously, we talked today to a number of folks, the number of suppliers is exploding. The revenue's still relatively small, certainly small relative to the RDBMS marketplace. But, take us through what your expectations are here, and what some of the assumptions are behind this. >> [George] Okay, so the first thing to understand is the DBMS market, overall, is about $40 billion, of which 30 billion goes to online transaction processing supporting real operational apps. 10 billion goes to OLAP or business intelligence type stuff. The OLAP one is shrinking materially. The online transaction processing one, new sales is shrinking materially, but there's a huge maintenance stream. >> [Dave] Yeah, which companies like Oracle and IBM and Microsoft are living off of, trying to fund new development. >> We modeled that declining gently and beginning to accelerate more going out into the latter years of the ten-year period. >> What's driving that decline? Obviously, you've got the big sucking sound of Hadoop, in part, driving that. But really, increasingly it's people shifting their resources to some of these new emergent applications and workloads and new types of databases to support them, right? But these are still, those new databases, you can see here, the NewSQL and NoSQL, still relatively small. A lot of it's open source. But then it starts to take off. What's your assumption there? >> So here, what's going on is, if you look at dollars today, it's, actually, interesting. If you take the NoSQL databases, you take DynamoDB, you take Cassandra, Hadoop HBase, Couchbase, Mongo, Kudu, and you add all those up, it's about, with DynamoDB, it's, probably, about 1.55 billion out of a $40 billion market today. >> [Dave] Okay, but it's starting to get meaningful. We're approaching two billion. >> But where it's meaningful is the unit share. If that were translated into Oracle pricing.
The market would be much, much bigger. So the point is. >> Ten X? >> At least, at least. >> Okay, so in terms of work being done, if there's a measure of work being done. >> [George] We're looking at dollars here. >> Operations per second, etcetera, it would be enormous. >> Yes, but that's reflective of the fact that the data volumes are exploding but the prices are dropping precipitously. >> So do you have a metric to demonstrate that? We're, obviously, not going to show it today, but. >> [George] Yes. >> Okay great, so-- >> On the business intelligence side, without naming names, the data warehouse appliance vendors are charging anywhere from 25,000 per terabyte up to, when you include running costs, as high as 100,000 a terabyte, that their customers are estimating. That's not the selling cost, but that's the cost of ownership per terabyte. Whereas, if you look at, let's say, Hadoop, which is comparable for offloading some of the data warehouse workloads, that's down to the 5K per terabyte range. >> Okay great, so you expect that these platforms will have a bigger and bigger impact? What's your pricing assumption? Are prices going to go up, or is it just volume's going to go through the roof? >> I'm, actually, expecting pricing to roughly hold. It's difficult, because we're going to add more and more functionality. Volumes go up, and if you add sufficient functionality, you can maintain pricing. But as volumes go up, typically, prices go down. So it's a matter of how much the NoSQL and NewSQL databases add in terms of functionality, and I distinguish between them because NewSQL databases are scaled-out versions of Oracle or Teradata, but they are based on the more open-source pricing model. >> Okay, and NoSQL, don't forget, stands for not-only-SQL, not not-SQL. >> If you look at the slides, big existing markets never fall off a cliff when they're in decline. They just slowly fade, and, eventually, that accelerates. But what's interesting here is, the data volumes could explode, but the revenue associated with the NoSQL, which is the dark gray, and the NewSQL, which is the blue, those don't explode. You could ask, what's the DBMS cost of supporting YouTube? It would be in the many, many, many billions of dollars. It would probably fund half of an Oracle by itself. But it's all open source there, so. >> Right, so that's minimizing the opportunity, is what you're saying? >> Right. >> You can see the database market is flat, certainly flattish and even declining, but you do expect some growth in the out years as part of that equation, that volume, presumably-- >> And that's the next slide, which is where we see that growth coming from. >> Okay, so let's talk about that. So the next slide, again, I should have set this up better. The Y-axis here is worldwide dollars and the horizontal axis is time. And we're talking here about these continuous application workloads, this new workload that you talked about earlier. So take us through the three. >> [George] There's three types of workloads that, in large part, are going to be driving most of this revenue. Now, these aren't completely comparable to the DBMS market, because some of these don't use traditional databases. Or if they do, they're telemetry databases, and I'll explain that. >> [Dave] Sure, but if I look at the IoT Edge, the Cloud, and the microservices and streaming, that's a tailwind to the database forecast in the previous slide, is that right?
>> [George] It's, actually, interesting, but the application and infrastructure telemetry, this is what Splunk pioneered, which is all the torrents of data coming out of your data center and your applications, where you're trying to manage what's going on. That is a database application. And we know Splunk, for 2016, was 400 million in software revenue; Hadoop was 750 million. And the various other management vendors, New Relic, AppDynamics, start-ups, and 5% of Azure and AWS revenue. If you add all that up, it comes out to $1.7 billion for 2016. And so, we can put a growth rate on that. And we talked to several vendors to say, okay, how much will that workload be compared to IoT Edge Cloud? And the IoT Edge Cloud is the smart devices at the Edge, and the analytics are in the fog, but not counting the database revenue up in the Cloud. So it's everything surrounding the Cloud. And that, actually, if you look out five years, that's, maybe, 20% larger than the app and infrastructure telemetry, but growing much, much faster. Then the third one, where you were asking whether it's a tailwind to the database: microservices and streaming are very different ways of building applications from what we do now. Today, people build the logic for their application and then store their data in a centralized external database. In microservices, you build a little piece of the app, and whatever data you need, you store within that little piece of the app. And so the database requirements are rather primitive, and so that piece will not drive a lot of database revenue. >> So if you could go back to the previous slide, Patrick. What's driving database growth in the out years? Why wouldn't database continue to get eaten away and decline? >> [George] In broad terms, the overall database market is staying flat, because prices collapse but the data volumes go up. >> [Dave] But there's an assumption in here that the NoSQL space, actually, grows in the out years. What's driving that growth? >> [George] Both the NoSQL and the NewSQL. The NoSQL, probably, is best suited to capturing the IoT data, because you don't need lots of fancy query capabilities or concurrency. >> [Dave] So it is a tailwind, in a sense, in that-- >> [George] IoT, but that's different. >> [Dave] Yeah, sure, but you've got the overall market growing. And that's because the new stuff, NewSQL and NoSQL, is growing faster than the decline of the old stuff. And in the 2020 to 2022 time frame, it's not enough to offset that decline, and then you have it start growing again. You're saying that's going to be driven by IoT and other Edge use cases? >> Yes, IoT Edge, and the NewSQL, actually, is where, when they mature, you start to substitute them for the traditional operational apps, for people who want to write database apps, not people who want to write microservice-based apps. >> Okay, alright, good. Thank you, George, for setting it up for us. Now, we're going to be at Big Data SV in mid-March? Is that right? Middle of March. And George is going to be releasing the actual final forecast there. We do it every year. We use Spark Summit to look at our preliminary numbers, some of the Spark-related forecasts like continuous workloads. And then we harden those forecasts going into Big Data SV. We publish our big data report like we've done for the past five, six, seven years. So check us out at Big Data SV. We do that in conjunction with the Strata events. So we'll be there again this year at the Fairmont Hotel.
We've got a bunch of stuff going on all week there, some really good programs going on. So check out siliconangle.tv for all that action. Check out Wikibon.com. Look for new research coming out. You're going to be publishing this quarter, correct? And of course, check out siliconangle.com for all the news. And, really, we appreciate everybody watching. George, been a pleasure co-hosting with you. As always, really enjoyable. >> Alright, thanks Dave. >> Alright, that's a wrap from Spark Summit. We're going to try to get out of here, beat the snowstorm, and work our way home. Thanks everybody for watching. A great job by everyone here, Seth, Ava, Patrick and Alex. And thanks to our audience. This is the Cube. We're out, see you next time. (lively music)
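An aside on the "continuous processing" mode George describes at the top of this segment: it is the workload that Spark's Structured Streaming API targets, where a query runs indefinitely and its results update as events arrive. A minimal, hypothetical sketch, with the socket source and port invented purely for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class ContinuousCount {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("continuous-count")
            .getOrCreate();

        // Treat the stream as an unbounded table: rows keep arriving forever.
        Dataset<Row> events = spark.readStream()
            .format("socket")
            .option("host", "localhost")
            .option("port", 9999)
            .load();

        // The aggregation is continuously maintained; the query never "finishes".
        StreamingQuery query = events.groupBy("value").count()
            .writeStream()
            .outputMode("complete")
            .format("console")
            .start();

        query.awaitTermination();
    }
}
```

This is the batch-plus-interactive-plus-continuous blend in miniature: the same DataFrame operations used for batch queries, kept running against an endless input.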

Published Date : Feb 9 2017

Jack Norris, MapR - Spark Summit East 2016 #SparkSummit #theCUBE


 

>> From New York, extracting the signal from the noise, it's the Cube, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Dave Vellante and George Gilbert. >> Right here in Midtown at the Hilton hotel, this is Spark Summit and this is the Cube. The Cube goes out to the events. We extract the signal from the noise. Jack Norris is here. He's the CMO of MapR, a long-time Cube alum. Jack, it's great to see you again. Hey, you've been here since the beginning of this whole big data >> Meme, and it might've started here, I don't know. I think we've, yeah, >> I think you're right. I mean, it really did start, I think, in this building. It was our first big data show at the original, you know, Hadoop World. And you guys, like I say, have been there from the start. You were kind of impatient early on. You said, you know, we're just going to go build solutions and ignore the noise, and you built a really nice business. You guys have been growing, you're growing your sales force, and things are good, and all of a sudden, boom, the Spark thing comes in. So we're seeing the evolution. I remember saying to George in the early days of Hadoop, we were geeking out talking about all the bits and bytes, and then it turned into a business discussion. It's like we're back to the hardcore bits and bytes. So give us the update from MapR's point of view. Where are we in the whole big data space? >> Well, I think it has transitioned. I mean, if you look at the typical large Fortune company, the Web 2.0s, it's really, how do we best leverage our data, and how do we leverage our data so that we can make decisions much faster, right? That high-frequency decision-making process. And typically that involves taking production data and analytics and joining them together so that you're actually impacting business as it happens, and to do that effectively requires innovations. So the exciting thing about Spark is having a distributed compute engine that's much easier to develop on and much faster. >> So I remember in the early days, we'd be at these shows and the big question was, you know, can you take the humans out of the equation? It's like, no, humans are the last mile. Is that changing, or do we still need that human interaction, or? >> Humans are an important part of the process, but increasingly, if you can adjust and make, you know, small algorithmic decisions, and make those decisions at that kind of moment of truth, you've got big impact, and I'll give you a few examples. Ad platforms, you know, Rubicon Project, over a hundred billion ad auctions a day. Humans are part of that process in terms of setting that up and reviewing the process, but for each supply and demand decision, there is an automated decision optimizing, and that has a huge impact on the bottom line. Fraud, you know, credit card: swiping that transaction and deciding, is this fraudulent or not, avoiding false positives, et cetera, a big leverage item. So we're seeing things like that across manufacturing, across retail, healthcare. And it isn't about asking bigger questions or doing reports and looking back at, you know, what happened last week. It's more, how can I have an infrastructure in place that allows this organization to be agile? Because it's not the companies with the most data that are going to win.
It's the companies that are the most agile at making intelligent decisions. >> So there's so much data, humans can't ingest it fast enough. I mean, we just can't keep up. So the world needs data scientists, it needs trained developers. You've got some news I want to talk about on the training side, but even then, we can only throw so many bodies at the problem. So it's really software that's going to allow us to scale it. Software's hard. Software takes time. So we've seen a lot of the spend in the analytics, big data world on services. And obviously you guys and others have been working hard to shift it towards software. I want to come back to that training issue. We heard this morning that Databricks launched a MOOC and trained 20,000 people. That's a lot, but still a long way to go. You guys are putting some investment into training. Talk about that news. >> Yeah, well, it starts with the underlying software. If you can do things in the platform to make it much easier, and do things that are hard to surround with services, like data protection, right? If you've lost data, it doesn't matter how many people you throw at it, you can't recover it. Right? So that's kind of the starting point. >> You're going to get fired. >> The approach we've taken is to take a software product approach to the training as well. So we rolled out on-demand training. So it's free, it's on demand, you work at your own pace. It's got different modules, there's some training associated with that, or some hands-on labs, if you will. We launched that last January, so it's basically coming up on the one-year anniversary. We recently celebrated; we've trained 50,000 people on Hadoop and big data. Today we're announcing an expansion of Spark classes. We've got a full curriculum around Spark, including a certification. So you can get Spark certification through this MapR on-demand training. >> Gotcha. You said something really, really intriguing that I want to dive into a little bit, where we were talking about the small decisions that can be made really, really fast. A human in the loop might have to train the models, but at runtime, where you said it's not about asking bigger questions, it's finding faster answers, what had to change in your platform or in the underlying technology to make that possible? >> You know, there's a lot that goes into it. It's typically a series of functions, a kind of breadth that needs to be brought to the problem, as well as squeezing out latencies. So instead of the traditional approach, which is different applications and different analytic techniques dictating a separate silo, a separate, you know, schema of data, and you've got those all around the organization, and data kind of travels, and you get an answer at the end of some period of time, it's converging all that together onto a single platform, squeezing out those latencies so that you can have an informed action at the speed of business, if you will.
So you could do that with, with HBase or map RDB and all of that together. What spark has really done is made that whole development process just much easier and much more streamlined. And that's where a lot of the excitements happen. >>So you mentioned earlier, um, to, to use cases, ad tech and fraud detection. Um, and I want to ask you about those in the state of those. So ad tech obviously has come a long way, but it's still got a ways to go. I mean, you look at, I mean, who's making money on ads. Obviously Google will make tons of money. Everybody else is sorta chasing them Facebook making money. It's probably cause they didn't let Google in. Okay. So how will spark affect sort of that business? Uh, and, and what's map, R's sort of role in evolving that, you know, to the next level. >>So, so, um, there's, there's different kind of compute and the types of things you can do, um, on the data. I think increasingly we're seeing the kind of streaming analytics and making those decisions as the data arrives, right. And then there's the whole ecosystem in terms of how do you coordinate those flows of data? It's not just a simple, here's the origin, here's the destination. There's typically a complex data flow. Um, that's where we've kind of focused on map our streams, this huge publish and subscribe infrastructure so that you can get real-time data to the appropriate location and then do the right operations, a lot of that involved with spark, but not exclusively. >>Okay. And then on fraud detection, um, obviously come a long way. Sampling could have died. Yes. And now, but now we're getting too many false positives. You get the call and, you know, I mean, I get a lot of calls because we can buy so much equipment, but, um, but now what about the next level? What are you guys doing to take fraud detection to the next level? So that when I get on the plane in Boston and I land in London, it knows, um, is that a database problem? Is it an integration problem, a systems problem, and how, what role you guys play in solving that? >>Well, there's, there's, um, you know, there's, there's a lot of details and techniques that probably go, um, beyond, you know, what, what we'll share publicly or what are our customers talk about publicly? I think in general, it's the more data that you can apply to a problem. The more context, the better off you are, that's the way I kind of summarize it so that instead of a sampling or instead of a boy, that's a strange purchase over there, it's understanding, well, this is Dave Valenti and this is the full body of, of, uh, expenditures he's done, then the types of things and here's who he frequently purchases from. And here's kind of a transaction trend started in San Francisco, went to New York, et cetera. So in context it would make more sense. So >>Part of that is more data. And the other part of that is just better algorithms and better, better learnings and applying that on a continuous basis. How are your customers dealing with that, that constraint? I mean, if they got a, a hundred dollars to spend, yeah. They can only spend so much on, on each of those gathering more data, cleaning the data, they spent so much time getting it ready versus making their machine learning algorithms or whatever the other techniques to do. What are you seeing there as sort of best practice? It was probably varies. I'm sure, but give us some color on it. 
>>Um, I'll actually go back to Google and Google a letter last round, um, you know, excellent, excellent insights coming from Google. They wrote a paper called the unreasonable effectiveness of data and in it, they basically squarely addressed that problem. And given the choice to invest in either the complex model and algorithm or put more data at it, putting more data, had a huge impact. And, um, you know, my simple explanation is if you're sampling the data, you have to have a model that tries to recreate reality. If you're looking at all of the data, then the anomalies can, can pop up and be more apparent. And, um, the more context you can bring, the more data from other sources. So you get around, you know, a better picture of what's happening, the better off you are. And so that requires scale. It requires speed and requires different techniques that can be brought to bear, right? The database operation, here's a streaming operation, here's a deep, you know, file machine learning algorithm. >>So there's a lot of vendors in the sort of big data ecosystem are coming at spark from different angles and, um, are, are trying to add value to it and sort of bathe themselves in sort of the halo. Yep. Now you guys took some time upfront to build a converged platform so that you weren't trying to wrap your arms around 37 different projects. Can you tell us how having perhaps not anticipated spark how this converts platform allows you to add more value to it than other approaches? >>So, so we simplify, if you look at the Hadoop ecosystem, it's basically separated into the components for compute and management on top of the data layer, right? The Hadoop distributed file system. So how do you scale data? How do you protect it? It's very simply what's going on. Spark really does a great job at that top layer. Doesn't do anything about defining the underlying storage layer in the Hadoop community that underlying storage layer is a batch system. So you're trying to do, you know, micro batch kind of streaming operations on top of batch oriented data. What we addressed was to take that whole data layer, make it real time, make it random. Read-write converge enterprise storage together with Hadoop support and spark support on a single platform. And that's basically >>With the difference and to make an enterprise great. You guys were really the first to lead the lecture. You were, everybody started talking about attic price straight after you were kind of delivering it. So you've had a lead there. Do you feel like you still have a lead there, or is that the kind of thing where you sort of hit the top of the S-curve and start innovating elsewhere? >>NC state did a study, uh, just this past year, a recent study identified that only 25% of data corruption issues are identified and properly handled by the Hadoop distributed file system. 42% of those are silent. So there's a huge gap in terms of quote unquote enterprise grade features and what we think. >>Yes, silent data corruption has been a problem for decades now. And you're saying it's no different in the duke ecosystem, especially as, as mainstream businesses start to, uh, to adopt this what's happening in the valley. Uh, we're seeing, you know, in the wall street journal every day you read about down rounds, flat rounds, people can't get B rounds. Uh, you guys are funded, you know, you're growing, you're talking about investments, you know, what do you see? Do you, do you feel like you're achieving escape velocity? 
Um, maybe give us sort of an update on, uh, the state of the business. >>Yeah. I, I think the state of the business is best represented by the customers, right? And the customers kind of vote, right. They vote in terms of, you know, how well is this technology driving their business? So we've got a recent study, um, that kind of shows the, the returns that customers, um, are getting, uh, we've got a 1% chance, a 99% retention rate with our customers. We've got, uh, an expansion rate. That's, that's unbelievable. We've got multi-million dollar customers in, uh, in seven of the top verticals and nine out of the top $10 million customers. So we're seeing significant investments and more importantly, significant returns on the part of customers where they're not just doing a single application on the platform, but multiple >>Applications, Jack Norris map are always focused. Always a pleasure having you on the cube. Thanks very much for coming on. Appreciate it. Keep right there, buddy. We'll be back with our next guest is the cube we're live from spark somebody's right back. Okay.

Published Date : Feb 17 2016



James Hamilton, AWS | AWS Re:Invent 2013


 

(mellow electronic music) >> Welcome back, we're here live in Las Vegas. This is SiliconANGLE and Wikibon's theCUBE, our flagship program. We go out to the events, extract the signal from the noise. We are live in Las Vegas at Amazon Web Services' re:Invent conference, about developers, large-scale cloud, big data, the future. I'm John Furrier, the founder of SiliconANGLE. I'm joined by co-host, Dave Vellante, co-founder of Wikibon.org, and our guest is James Hamilton, VP and Distinguished Engineer at Amazon Web Services. Welcome to theCUBE. >> Well thank you very much. >> You're a tech athlete, certainly in our book, is a term we coined, because we love to use sports analogies. You're kind of the cutting edge. You've been in business and technology, innovating for many years, going back to the database days at IBM, Microsoft, and now Amazon. You gave a great presentation at the analyst briefing. Very impressive. So I got to ask you the first question, when did you first get addicted to the notion of what Amazon could be? When did you first taste the Kool-Aid? >> Super good question. Couple different instances. One is I was general manager of Exchange Hosted Services and we were doing a decent job, but what I noticed was customers were loving it, we're expanding like mad, and I saw opportunity to improve by at least a factor of two, I'm sorry, 10, it's just amazing. So that was a first hint that this is really important for customers. The second one was S3 was announced, and the storage price pretty much froze the whole industry. I've worked in storage all my life, I think I know what's possible in storage, and S3 was not possible. It was just like, what is this? And so, I started writing apps against it, I was just blown away. Super reliable. Unbelievably priced. I wrote a fairly substantial app, I got a bill for $7. Wow. So that's really the beginnings of where I knew this was going to change the world, and I've been, as you said, addicted to it since. >> So you also mentioned some stats there. We'll break it down, 'cause we love to talk about the software defined data center, which is basically not even at the hype stage yet. It's still undefined, but software virtualization, network virtualization really is pushing that movement of the software focus, and that's essentially what you guys are doing. You're talking about innovations, and basically it's a large-scale systems problem. You guys are building a global operating system, as Andy Jassy would say. Well, he didn't say that directly, he said internet operating system, but if you believe that APIs are critical services. So I got to ask you that question around this notion of a data center, I mean come on, nobody's really going to give up their data center. It might change significantly, but you pointed out the data center costs come in this order: servers, power and cooling systems, and then actual power itself. Is that right, did I get that right? >> Pretty close, pretty close. Servers dominate, and then after servers, if you look at data centers together, that's power, cooling, and the building and facility itself. That is the number two cost, and the actual power itself is number three. >> So that's a huge issue. When we talk to CIOs, it's like can you please take the facility's budget off my back? For many reasons, one, it's going to be written off soon maybe. All kinds of financial issues around-- >> A lot of them don't see it, though, which is a problem.
>>That is a problem, that is a problem. Real estate sees it, and then, yes. >>And then they go, "Ah, it's not my problem," so money just flies out the window. >>So it's obviously a cost improvement for you. So what are you guys doing in that area, and what's your big ah-ha for the customers, when you walk in the door and say, look, we have this cloud, we have this system, and all those headaches can be, not shifted, but relieved, if you will, some big aspirin for them. What's the communication like? What do you talk to them about? >>Really it depends an awful lot on who it is. I mean, different people care about different things. What gets me excited is, I know that the dominant cost of offering a service is all of this muck. It's all of this complexity, it's all of this high, high capital cost up front. A facility will run 200 million before there's servers in it. This is big money, and so from my perspective, taking that away from most companies is one contribution. Second contribution is, if you build a lot of data centers you get good at it, and so as a consequence of that I think we're building very good facilities. They're very reliable, and the costs are plummeting fast. That's a second contribution. Third contribution is because... because we're making capacity available to customers it means they don't have to predict two years in advance what they're going to need, and that means there's less wastage, and that's just good for the industry as a whole. >>So we're getting some questions on our crowd chat application. If you want to ask a question, ask him anything. It's kind of like Reddit. Go to crowdchat.net/reinvent. The first question came in was, "James, when do you think ARM will be in the data center?" >>Ah ha, that's a great question. Well, many people know that I'm super excited about ARM. It's early days, the reason why I'm excited is partly because I love seeing lots of players. I love seeing lots of innovation. I think that's what's making our industry so exciting right now. So that's one contribution that ARM brings. Another is if you look at the history of server-side computing, most of the innovation comes from the volume-driven side, usually on clients first. The reason why X86 ended up in such a strong position is so many desktops were running X86 processors, and as a consequence it became a great server processor. High R&D flow into it. ARM is in just about every device that everyone's carrying around. It's in almost every disk drive, it's just super broadly deployed. And whenever you see a broadly deployed processor it means there's an opportunity to do something special for customers. I think it's good for the industry. But in a precise answer to your question, I really don't have one right now. It's something that we're deeply interested in and investigating deeply, but at this point it hasn't happened yet, but I'm excited by it. >>Do you think that... Two lines of questioning here. One is things that are applicable to AWS, the other's just your knowledge of the industry and what you think. We talked about that yesterday with OCP, right? >>Yep. >>Not a right fit for us, but you applaud the effort. We should talk about that, too, but does splitting workloads up into little itty, bitty processors change the utilization factor and change the need for things like virtualization, you know? What do you think? >>Yeah, it's a good question. I first got excited about the price performance of micro-servers back in 2007.
And at that time it was pretty easy to produce a win by going to a lower-powered processor. At that point memory bandwidth wasn't as good as it could be. It was actually hard on some workloads to fully use a processor. Intel's a very smart company, they've done great work on improving the memory bandwidth, and so today it's actually harder to produce a win, and so you kind of have workloads in classes. At the very, very high end we've got database workloads. They really love single-threaded performance, and performance really is king, but there are lots of highly parallel workloads where there's an opportunity for a big gain. I still think virtualization is probably something where the industry's going to want to be there, just because it brings so many operational advantages. >>So I got to ask the question. Yesterday we had Jason Stowe on, CEO of Cycle Computing, and he had an amazing thing that he did, it's not new to you, but it's new to us. He basically created a supercomputer and spun up hundreds of thousands of cores in 30 minutes, which is like insane, but he did it for like 30 grand. Which would've cost, if you try to provision it, run the TCO calculator or whatever your model, months and years, maybe, and years. But the thing that he said I want to get your point on, and I'm going to ask you questions specifically on it, is, Spot instances were critical for him to do that, and the creativity of his solutions. So I got to ask you, did you see Spot pricing and instances being a big deal, and what impact has that had on AWS' vision of large scale? >>I'm super excited by Spot. In fact, it's one of the reasons I joined Amazon. I went through a day of interviews, I met a bunch of really smart people doing interesting work. Someone probably shouldn't have talked to me about Spot because it hadn't been announced yet, and I just went, "This is brilliant! "This is absolutely brilliant!" It's taking the ideas from financial markets, where you've got high-value assets, and saying why don't we actually make a market on the basis of that and sell it off? So two things happen that make Spot interesting. The first is an observation up front that poor utilization is basically the elephant in the room. Most folks can't use more than 12% to 15% of their overall server capacity, and so all the rest ends up being wasted. >>You said yesterday 30% is outstanding. It's like have a party. >>30% probably means you're not measuring it well. >>Yeah, you're lying. >>It's real good, yeah, basically. So that means 70% or more is wasted, it's a crime. And so the first thing that says is that one of the most powerful advertisements for cloud computing is if you bring a large number of non-correlated workloads together. What happens is, when you're supporting a workload you've got to have enough capacity to support the peak, but you only get to monetize the average. And so as the peak to average gets further apart, you're wasting more. So when you bring a large number of non-correlated workloads together, what happens is it flattens out just by itself. Without doing anything it flattens out, but there's still some ups and downs. And the Spot market is a way of filling in those ups and downs so we get as close to 100% as possible.
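Hamilton's peak-versus-average point can be sanity-checked numerically. Below is a minimal sketch with invented workload shapes and tenant counts (nothing here is an AWS number): each simulated tenant must provision for its own peak, while a pool of uncorrelated tenants flattens out.

    import numpy as np

    rng = np.random.default_rng(seed=7)
    hours = np.arange(24)

    def workload():
        """One tenant: a sinusoidal daily curve with a random peak hour."""
        phase = rng.uniform(0, 24)
        base = rng.uniform(10, 50)
        return base * (1.2 + np.sin(2 * np.pi * (hours - phase) / 24))

    tenants = [workload() for _ in range(200)]

    # Each tenant alone must provision for its own peak but bills its average.
    solo = np.mean([w.max() / w.mean() for w in tenants])

    # Pooled, uncorrelated peaks overlap and the aggregate curve flattens.
    pooled = np.sum(tenants, axis=0)
    print(f"avg solo peak/average: {solo:.2f}")
    print(f"pooled peak/average:   {pooled.max() / pooled.mean():.2f}")

The gap between those two ratios is the waste Hamilton describes, and the residual ripple in the pooled curve is exactly what the Spot market fills in.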
>>Are there certain workloads that fit Spot? Obviously certain workloads might fit it, but what workloads don't fit the Spot price? Because, I mean, it makes total sense, and it's an arbitrage opportunity for excess capacity laying around, and it's priced based on usage. So is there a workload, 'cause it'll be turned up, turned down, I mean, what's the use cases there? >>Workloads that don't operate well in an interrupted environment, that are very time-critical, those workloads shouldn't be run in Spot. It's just not what the resource is designed for. But workloads like the one that we were talking about with Cycle Computing are awesome, where you need large numbers of resources, the workload can restart if it needs to, and price is really the focus. >>Okay, and question from crowd chat. "Ask James what are his thoughts "on commodity networking and merchant silicon." >>I think an awful lot about that. >>This guy knows you. (both laughing) >>Who's that from? >>It's your family. >>Yeah, exactly! >>They're watching. >>No, network commoditization is a phenomenal thing. The whole industry's needed that for 15 years. We've got a vertically-integrated ecosystem that's kind of frozen in time. Costs everywhere are falling except in networking. We just got to do something, and so it's happening. I'm real excited by that. It's really changing the Amazon business and what we can do for customers. >>Let's talk a little bit about server design, because I was fascinated yesterday listening to you talk how you've come full circle. Over the last decade, right, you started with what's got to be stripped down, basic commodity, and now you're of a different mindset. So describe that, and then I have some follow-up questions for you. >>Yeah, I know what you're alluding to. Years ago I used to argue you don't want hardware specialization, it's crazy. The magic's in software. You want to specialize software running on general-purpose processors, and that's because there was a very small number of servers out there, and I felt like it was the most nimble way to run. However today, in AWS, when we're running tens of thousands of copies of a single type of server, hardware optimizations are absolutely vital. You end up getting a power-performance advantage of 10X. You can get a price-performance advantage that's substantial, and so I've kind of gone full circle where now we're pulling more and more down into the hardware, and starting to do hardware optimizations for our customers. >>So heat density is a huge problem in data centers and server design. You showed a picture of a Quanta package yesterday. You didn't show us your server, said "I can't show you ours," but you said, "but we blow this away, "and this is really good." But you describe that you're able to get around a lot of those problems because of the way you design data centers. >>Yep. >>Could you talk about that a little bit? >>Sure, sure, sure. One of the problems when you're building a server is it could end up anywhere. It could end up in a beautiful data center that's super well engineered. It could end up at the end of a row in a very badly run data center. >>Or in a closet. >>Or in a closet. The air is recirculating, and so the servers have to be designed with huge headroom on cooling requirements, and they have to be able to operate in any of those environments without driving warranty costs for the vendors. We take a different approach.
We say we're not going to build terrible data centers. We're going to build really good data centers, and we're going to build servers that exploit the fact those data centers are good, and what happens is more value. We don't have to waste as much, because we know that we don't have to operate in the closet. >>We got some more questions coming here, by the way. This is awesome. This ask-me-anything crowd chat thing is going great. We got someone, he's from Nutanix, so he's a geek. He's been following your career for many years. I got to ask you about kind of the future of large-scale. So Spot, in his comment, David's comment: Spot instances prove that solutions like VMware's distributed power management are not valuable. Don't power off the most expensive asset. So, okay, that brings up an interesting point. I don't want to slam on VMware right now, but I just wanted to bring it to the next logical question, which is, this is a paradigm shift. That's a buzz word, but really a lot's happening that's new and innovative. And you guys are doing it and leading. What's next in the large-scale paradigm of computing and computer science? On the science side you mentioned merchant silicon. Obviously the genie's out of the bottle there, but what's around the corner? Is it the notifications, the scheduling? Is it virtualization, is it compiler design? What are some of the things that you see out on the horizon that you've got your eyes on? >>That's interesting, I mean, name your area, and I'll tell you some interesting things happening in the area, and it's one of the cool things of being in the industry right now. Ten years ago we had a relatively static, kind of slow pace. You really didn't have to look that far ahead, because if anything was coming you'd see it coming for five years. Now if you ask me about power distribution, we've got tons of work going on in power distribution. We're researching different power distribution topologies. We're researching higher voltage distribution, direct current distribution. We haven't taken any of those steps yet, but we're working in that. We've got a ton going on in networking. You'll see an announcement tomorrow of a new instance type that has some interesting characteristics from a networking perspective. There's a lot going on. >>Let's pre-announce, no. >>Gary's over there like-- >>How 'about database, how 'about database? I mean, 10 years ago, John always says database was kind of boring. You go to a party, say, oh, welcome to the database business, oh yeah, see ya. 25 years ago it was really interesting. >>Now you go to a party it's like, hey, ah! Have a drink! >>It's a whole new ballgame, you guys are participating. Google Spanner is this crazy thing, right? So what are your thoughts on the state of the database business today, in-memory, I mean? >>No, it's beautiful. I did a keynote at SIGMOD a few years ago, and what I said is that 10 years ago Bruce Lindsay, I used to work with him in the database world, Bruce Lindsay called it polishing the round ball. It's just we're making everything a little, tiny bit better, and now it's fundamentally different. I mean, what's happening right now in the database world, every year, if you stepped out for a year, you wouldn't recognize it. It's just, yeah, it's amazing. >>And DynamoDB has had rapid success. You know, we're big users of that.
We actually built this app, the crowd chat app that people are using, on Hadoop and HBase, and we immediately moved that to DynamoDB, and your stack was just so much faster and scalable. So I got to ask you the-- >>And less labor. >>Yeah, yeah. So it's just been very reliable, and all the other goodness of Elastic Beanstalk and SQS, all that other good stuff we're working with Node, et cetera. So I got to ask you, the area that I want your opinion on around the corner is version control. So at large scale, one of the challenges that we have is, as we're pushin' new code, making sure that the integrated stack is completely updated and synchronized with open-source projects. So where does that fit into the scaling up? 'Cause at large scale, version control used to be easy to manage, downloading software and putting in patches, but now you guys handle all that at scale. So that, I'm assuming there's some automation involved, some real tech involved, but how are you guys handling the future of making sure the code is all updated in the stack? >>It's a great question. It's super important from a security perspective that the code be up to date and current. It's super important from a customer perspective, and you need to make sure that these upgrades are just non-disruptive. The best answer I heard was yesterday from a customer who was on a panel. They were asked how they dealt with Amazon's upgrades, and what she said is, "I didn't even know when they were happening. "I can't tell when they're happening." Exactly the right answer. That's exactly our goal. We monitor the heck out of all of our systems, and our goal, and boy we take it seriously, is we need to know any issue before a customer knows it. And if you fail on that promise, you'll meet Andy really quick. >>So some other paradigm questions coming in. Floyd asks, "Ask James what his opinion of cloud brokerage "companies such as Jamcracker or Graviton. "Do they have a place, or is it wrong thinking?" (James laughs) >>From my perspective, the bigger and richer the ecosystem, the happier our customers all are. It's all goodness. >>It's Darwinism, that's the answer. You know, the fit shall survive. No, but I think that brings up this new marketplace, that Spot pricing came out of the woodwork. It's a paradigm that exists in other industries, apply it to cloud. So brokering of cloud might be something, especially with regional and geographical focuses. You can imagine a world of brokering. I mean, I don't know, I'm not qualified to answer that. >>Our goal, honestly, is to provide enough diversity of services that we completely satisfy customers' requirements, and that's what we intend to do. >>How do you guys think about the make versus buy? Are you at a point now where you say, you know what, we can make this stuff for our specific requirements better than we can get it off the shelf, or is that not the case? >>It changes every few minutes. It really does. >>So what are the parameters? >>Years ago when I joined the company, we were buying servers from OEM suppliers, and they were doing some tailoring for our uses. It's gotten to the point now where that's not the right model, and we have our own custom designs that are being built. We've now gotten to the point where some of the components in servers are being customized for us, partly because we're driving sufficient volume that it's justified, and partly because the component suppliers are happy to work with us directly and they want input from us.
And so every year it's a little bit more specialized, and that line's moving, so it's shifting towards specialization pretty quickly. >>So now I'm going to be replaced by the crowd, gettin' great questions, I'm going to be obsolete! No earbud, I got it right here. So this question's more of a fun one, probably for you to answer, or just kind of lean back and kind of pull your hair out, but how the heck does AWS add so much infrastructure per day? How do you do it? >>It's a really interesting question. I know abstractly how much infrastructure we put out every day, but when you actually think about this number in context, it's mind boggling. So here's the number. Here's the number. Every day, we deploy enough servers to support Amazon when it was a seven billion dollar company. You think of how many servers a seven billion dollar e-commerce company would actually require? Every day we deploy that many servers, and it's just shocking to me to think that the servers are in the logistics chain, they're being built, they're delivered to the appropriate data centers, there's rack positions there, there's networking there, there's power there. I'm actually, every day I'm amazed, to be quite honest with you. >>It's mind-boggling. And then for a while I was there, okay, wait a minute. Would that be Moore's Law? Uh no, not even in particular. 'Cause you said every day. Not every year, every day. >>Yeah, it really is. It's a shocking number, and my definition of scale changes almost every day, where if you look at the number of customers that are trusting us with their workloads today, that's what's driving that growth, it's phenomenal! >>We got to get wrapped up, but I got to ask the Hadoop World SQL-over-Hadoop question. Obviously Hadoop is great, great for storing stuff, but now you're seeing hybrids come out. Again this comes back down to, you can't recognize the database world anymore if you were asleep for a year. So what's your take on that ecosystem? You guys have Elastic MapReduce and a bunch of other things. There's some big data stuff going on. How do you, from a database perspective, how do you look at Hadoop and SQL over Hadoop? >>I personally love 'em both, and I love the diversity that's happening in the database world. There's some people that kind of have a religion and think it's crazy to do anything else. I think it's a good thing. MapReduce in particular, I think, is a good thing, because it takes... First time I saw MapReduce being used was actually by a Google advertising engineer. And what I loved about it, I was actually talking to him about it, and what I loved is he had no idea how many servers he was using. If you ask me or anyone in the technology how many servers they're using, they know. And the beautiful thing is he's running multi-thousand node applications and he doesn't know. He doesn't care, he's solving advertising problems. And so I think it's good. I think there's a place for everything. >>Well my final question, I'm asking guests this show: put the bumper sticker on the car leaving re:Invent this year. What's it say? What does the bumper sticker say on the car? Summarize for the folks, what is the tagline this year? The vibe, and the focus? >>Yeah, for me this was the year. I mean, the business has been growing, but this is the year where suddenly I'm seeing huge companies 100% dependent upon AWS or on track to be 100% dependent upon AWS.
This is no longer an experiment, something people want to learn about. This is real, and this is happening. This is running real businesses. So it's real, baby! >> It's real baby, I like, that's the best bumper... James, distinguished guest now CUBE alum for us, thanks for coming on, you're a tech athlete. Great to have you, great success. Sounds like you got a lot of exciting things you're working on and that's always fun. And obviously Amazon is killing it, as we say in Silicon Valley. You guys are doing great, we love the product. We've been using it for crowd chats. Great stuff, thanks for coming on theCUBE. >> Thank you. >> We'll be right back with our next guest after this short break. This is live, exclusive coverage with siliconANGLE theCUBE. We'll be right back.
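For readers who want to experiment with the Spot mechanics Hamilton describes above, here is a minimal sketch using today's boto3 library; the region, AMI ID, instance type, and bid price are placeholders, not recommendations, and the call creates a real (billable) request if run against live credentials.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Bid for one interruptible instance; suits restartable, price-sensitive
    # work (the Cycle Computing pattern), not time-critical workloads.
    response = ec2.request_spot_instances(
        SpotPrice="0.05",                  # max price per hour, in USD
        InstanceCount=1,
        LaunchSpecification={
            "ImageId": "ami-12345678",     # placeholder AMI, not a real image
            "InstanceType": "m3.medium",
        },
    )
    print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])

If the market price rises above the bid, the instance is reclaimed, which is why interrupt-tolerant batch work is the natural fit.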

Published Date : Nov 14 2013



Jack Norris - BigDataNYC 2013 - theCUBE - #BigDataNYC


 

>>Live from Midtown Manhattan, it's theCUBE, coverage of Big Data NYC, a SiliconANGLE and Wikibon production made possible by Hortonworks. And now your hosts, John Furrier and Dave Vellante. >>Hi buddy, we're back. This is Dave Vellante with Jeff Kelly of Wikibon, and this is theCUBE, SiliconANGLE's continuous production. We're here at Big Data NYC, right across the street from the Hilton where Strata Conference and Hadoop World is going on. We've got a multi-time cube guest, Jack Norris, the CMO of MapR, here. Jack, welcome back to theCUBE. First, so, by the way, thank you so much for the support. As you know, we're across the street here at the Warwick hotel. MapR, you guys have always been so generous supporting theCUBE. We can't thank you enough for that. So really appreciate it. Thank you. So we were able to listen to your keynote yesterday. We weren't broadcasting, you know, head to head yesterday, and had an opportunity to hear your keynote. So, first of all, how did that go? I want to ask you some questions about it. >>It, it was really well-received, and I think people were kind of clamoring to try to separate the myths from, from reality on, on Hadoop. >>We had three myths that you talked about, you know, one related to the distributions. I'd like to get into some of those. So the first myth was around the, the distribution battle. So take us through that. >>So, you know, the impression that it's a knock-down drag-out competitive battle across Hadoop distributions was the first myth. And the reality is that all of the distributions share the same open source Apache code. And this is one of the first markets that's really been created, or one of the first open-source technologies that's really created a market. I mean, look what's happened here with this whole big data and Hadoop. But given that early stage, there's the requirement to really combine that open source code with additional innovations to meet customer needs. And so what you see is, you see those aggregators that are taking open source, you see others that are taking the open source and then adding maybe management utility, a couple of, you know, different applications on top. And then our approach at MapR is we're taking the open source with those management innovations, doing some development in the open source community with things like Apache Drill, and then really focusing on the underlying architecture, the data platform, and providing innovations at that layer. So >>Actually, sort of the three major distros that we talk about all the time. You know, you guys have been consistent the whole time, as has Hortonworks, right? Cloudera basically put out a post recently saying, hey, kind of going in a different direction, sort of what I call tapped out of the Hadoop distro, you know, piece of it. But so there's a lot of discussion around it. You're putting forth the, hey, it's not an internecine war, but does it matter is my question. >>Well, I think if you take a step back, the Hadoop ecosystem is incredibly strong, growing very, very quickly, the fastest growing big data technology, one of the top 10 technologies overall. And I think it's because we are sharing the same API. It is possible for customers to learn on one, develop, and move seamlessly to another.
And, you know, in the keynote, I talked about the difference between the NoSQL market, which is, you know, there is no consensus there, and customers have to figure out not only what's the right workload, but what's the technology that's actually going to have some staying power, right? >>That's a powerful comment. Amazon turned the data center into an API, and you, as the Hadoop community, are essentially turning data access into an API. And that is a very powerful and leverageable concept. Okay. Your second myth was around the whole NoSQL piece of it. You put up a slide. I thought I read Jeff Kelly's reports, and I thought, I thought I knew them all, but there were a couple in there that I didn't recognize, as you probably knew them all, but so take us through myth number two. >>I'm sure we missed some. >>There wasn't room on the slide for any more. >>Yeah, it's basically about the consensus. There is no real consensus. There's no common API. There's no ability to move applications seamlessly across NoSQL solutions. If you look at one NoSQL solution, and that's HBase, it has a big inherent advantage because it's integrated with Hadoop, you know, this whole trend is about compute and data together. So if you've got a NoSQL solution that's on that same, you know, massive data store, you know, big leg up. And, and then we got into the, well, if you've got HBase, it's included in all the distributions, and all the distributions share the same open source, then obviously it must run the same across all distributions. And there we shared some pretty interesting data to show the difference. When you do architectural differences and innovations underneath, you can dramatically change the performance of not only MapReduce, but of NoSQL. Yes.
And I think that's the difference because what we're talking about is not Hadoop as this cash, right? This temporary processing, where we can do, you know, some interesting batch analytics and then take that and put that someplace else. And yes, there are applications like that, but companies soon realized that if I'm going to use this as a key part of my operations, and it's about data on compute, then I want a consistent permanent store. I want a system of record. So all of the SLS and high availability and data protection features that they expect in their enterprise applications should be present in Hadoop, right? That's where we focus. Let's run down a couple of those. >>What are some of the key capabilities that you need in an enterprise enterprise grade platform? That map bar is >>Well, let's, let's take, let's take business continuity cause that's important if you're really going to trust data there. And you know, one of the big drivers as you expand data is how much am I going to spend on it? And if you look at a large investment bank, $270 million of their budget, not total, but incremental to address the additional capacity, there's a big emphasis for let's look at a better way to do that. So instead of spending $15,000 a terabyte, if you can spend a few hundred dollars a terabyte, that's a huge, huge advantage. And that's the focus of Hindu, but to do that, well, then the features that are in this enterprise storage have to be present. And we're talking about, you know, mirroring and not a copy table function, but replication, that's how that's how organizations do it, right. If you're going to recovery and recovery, you know, you can't back up a petabyte of information through a copy function, right? You have to do a snapshot and the snapshots have to be consistent, right. And, and we're not saying anything that, you know, an enterprise administrator doesn't know, there is some confusion when you're more on the developer side as to what these features are and the difference between a fuzzy snapshot and a point in time, consistent snaps. >>Got it. So let's talk a little bit about the, the enterprise data hub, this, this concept that Michael Wilson with clutter introduced yesterday. Tell us a little bit about your take on, on, on Mike's I guess, definition and, and essentially I think trying to name the category of kind of what Hadoop can do and what, and where it sits in the architecture. Did you agree with his, his, >>Yeah. I mean, if you look at, at that description, it's about I'm taking important data and I'm putting it in a dupe and I'm combining a lot of different data sources and it's been referred to as a data lake and a data reservoir and a data ocean. I mean, we've heard a lot of terms. We worked with an outside consultant that was originally an architect at Terre data. It's been about eight months, almost a year ago now where he defined it and enterprise data hub. And it's it's, he went through kind of the list of requirements. And once you move from a transitory to a permanent store, then that becomes an enterprise data hub. And an enterprise data hub can be used to select and process information, maybe it's ETL and serve some downstream applications. It can also be useful to do analysis directly on it, to, you know, to serve different business functions. But the system requirements that he established for that I think are absolutely true. And it's, you have to have the full data protection. You have to have the full disaster recovery. 
You have to have the full high availability because this is going to be important data serving the organization. If it's data that you can lose, if it's data that you, you don't really care about having highly available, then it's a very narrow use case that that data hub serves. >>So you're saying the enterprise data hub isn't ready for prime time. >>No, I'm saying that there, there are requirements. And we have companies today that have deployed an enterprise data hub and they are quite successful with it. And, you know, the quotes are the ETL functions that they're doing on that hub are 10 times faster and it's 10 times cheaper than what they're seeing. >>Soundbite, Dave, >>I agree, but it's nuanced. Right. And so, you know, the customers cause a lot of vendors, right? They're all saying the same thing to the customers, right? So you've got your messaging that you've, you know, you've proven out over the last several years and then the entire market starts to use the same terminology. So it is, this is why I, like, I think this, what is, what are those >>Things? We're in a little bit of this, this kind of marketing fog here in the relative early stages. I think the best response there is customer proof points. And I think some education in the very beginning, you know, when they're in development and test, it's really important to understand, you know, what is Hadoop and what can I use it for and what data source am I going to leverage? I think the features that we're talking about really start to show up as you deploy in production. And as you expand its use in production and there we've enjoyed tremendous success, >>But he would argue that you have a lead in this space. I wouldn't, I don't think you would either the space being robustness enterprise ready, mission criticality is your lead increasing, decreasing staying the same. >>What's your sense? Well, it's hard cause there's no, you know, th th there's no external service that's out there, you know, interviewing every customer and, and giving numbers. I do know that we passed 500 paying customers. I do know that we've got significant deployments and you can measure those in terms of number of nodes, you know, in the thousands of nodes, you can measure those in terms of use cases. So we've got, you know, one company they've passed 20 different use cases on the same cluster. I think that's an interesting proof point. We're scaling in terms of the number of, of people in an organization that are trained in leveraging the data in map are again in the, in the thousands. So, you know, I think this market is so big and so dynamic that this isn't about, you know, one company success at the expense of everyone. Else's zero sum game. I think, you know, we're all here kind of raising this, this boat and focusing on this paradigm shift, but when it comes to production success, that's our focus. And I think that's where we've, we've proven that >>One thing I'm really want to get your opinion on, you know, as, as to do matures and some of the innovations you guys are doing and, and making the platform, you know, basically a multi application platform, you can do more things with Hadoop. And we've been talking about this on the cube, is that as that happens, you're going to start you as an industry. You're going to start bumping up against the EDW vendors and some of the other database vendors in the traditional world. 
And you're now you're doing some of the things that those, those tools can do now, you know, two years ago, it was very much just, this is all very complimentary Hadoop and your EDW. There's no overlap. We're gonna all play nice. But increasingly we're seeing that there is an overlap. How do you view that? Is that, and what is your relationship with those, with those EDW vendors and, and what are you hearing from customers when you go into a customer? Okay. >>So, I mean, there's a, there's a lot in that question. I think the F the first comment though, is don't look at Hadoop through this single data warehouse lens. And if you look at, at trying to use Hadoop to completely replace an enterprise data warehouse where there's, here's a few decades of experience, there, there are many organizations that have a lot of activities that are based in that data warehouse. And that's where we're seeing a data warehouse offload that is complimentary, but it gives organizations this lever to say, well, I'm going to control the fill rate, and I'm going to take some of the data that's no longer, you know, really active and put that on Hadoop and really change my ability to manage the costs in a data warehouse environment. The other thing that's interesting is that the types of applications that duper doing, I think are creating a new class it's about operations and analytics, kind of combined together, taking high arrival rate data and making very quick micro changes to optimize whether that's fraud detection or recommendation engines, or taking sensor data and predictive analytics for, for maintenance, et cetera. There is just a tremendous number of, of applications. In some cases, leveraging a new data source in some cases, doing new applications, but it's just opening things up. And, and I think organizations are moving to be very data-driven and Hadoop is at the center of that. >>And you control the field, right? That's another really good soundbites. And, and these that, you mentioned this high arrival rate data, this fraud detection, predictive analytics, maintenance, these are things that you're doing today with >>Navarre right? Yeah, >>Absolutely. Great. All right, Jack. Well, listen, always a pleasure. Thanks very much for coming by. Great to see you again. All right. Keep it right there about Uber, right back with our next guest. This is the cube we're live from the big apple.

Published Date : Oct 30 2013



Jack Norris - Hadoop Summit 2013 - theCUBE - #HadoopSummit


 

>>As flash hits, you know, what will that mean to my investment? And the announcement with Fusion-io is that, you know, we're 25 times faster on read-intensive HBase applications, the combination. So as organizations are deploying Hadoop, and they're looking at technology changes coming down the pike, they can rest assured that they'll be able to take advantage of those in a much more aggressive fashion with MapR than with other distributions. >>Jack, I got to ask you, we were talking last night at the Hadoop Summit, kind of the kickoff party, and, you know, everyone was there. All the top execs were there, and all the developers, you know, we were in the queue. I think that either Dave or myself coined the term, the big three of big data: you guys, Cloudera, MapR, and Hortonworks, really at the, at the beginning, the key players early on. And Charles from Cloudera was just recently on, and he's like, oh no, this, this enterprise grade stuff has been kicked around, it's been there from the beginning. You guys have been there from the beginning, and MapR has never, ever waffled on your, on your messaging. You've always been very clear: hey, we're going to take open source Hadoop and turn it into an enterprise grade product, right? So that's clear, right? That's a great position. So what's your take on this? Because now enterprise grade is kind of there, I guess, the buzz around it, like the folks that have crossed the chasm and implemented. So can you comment on that, about, one, enterprise grade, the reality of it, certainly from your perspective but also others'? And then, for those folks that are now rolling it out for the first time, what can you share with them around what it means to be enterprise grade? >>So enterprise grade is more about the customer experience than, than a marketing claim. And, you know, by enterprise grade, what we're talking about are some of the capabilities and features that they've grown to expect in their, their other enterprise applications. So, you know, the ability to meet full SLAs, full HA, recovery from multiple failures, rolling upgrades, data protection with consistent snapshots, business continuity with mirroring, the ability to share a cluster across multiple groups and have, you know, volumes. I mean, there's a, there's a host of features that fall under the umbrella of enterprise grade. And when you move from no support for any of those features to support for a few of them, I don't think that gets you to HA, it's more like moving to low availability. And there's just a lot of differences in terms of what we mean when we say enterprise grade with those features, versus what we view as kind of an incomplete story. >>What do you mean by low availability? Well, I mean, it's tongue in cheek. It's a good term. It's really saying, you know, just available sometimes, is that what you mean? Is this not true availability? I mean, availability is 99.9%, right? >>Right. So if you've got an HA solution that can't recover from multiple failures, that's downtime. If you've got an HBase application that's running online and you have data that goes down, and it takes 10 to 30 minutes to have the region servers recover it from another place in the distribution, that's downtime. If you have snapshots that aren't consistent across the cluster, that doesn't provide data protection, there's no point-in-time recovery for, for a cluster.
So, you know, there's a lot of details underneath that, but what it amounts to is: do you have interruptions? Do you have downtime? Do you have the potential for losing data? And our answer is, you need a series of features that are hardened and proven to deliver that. >>What about recoverability? You mentioned that you guys have done a lot of work in that area with snapshotting, that's kind of being kicked around. Are, are folks addressing it, what's your competition doing in those areas of recoverability? You just mentioned availability, okay, got that. Recoverability, security, compliance, and usability. Those are the areas that seem to be the hot focus areas. What's going on there? How would you give them the grade, the letter grade, if you will, candidly, compared to what you guys offer? >>Well, first of all, let's take recoverability. You know, one of the tenets is you have point-in-time recovery, the ability to restore to a previous point that's consistent across the cluster. And right now there's no point-in-time recovery for HDFS, for the files, and there's no point-in-time recovery for HBase tables. So there's snapshot support, it's being talked about in the open source community with respect to snapshots, but it's being referred to in the JIRAs as fuzzy snapshots, and really compared to copy table. >>So, Jack, I want to turn the conversation to kind of the topic we've talked about before, the open versus proprietary, that whole debate. We've heard about that, we talked about that before here on theCUBE. So just kind of reiterate for us your take. I mean, we hear, perhaps because of the show we're at, there's a lot of talk about the open source nature of Hadoop, and some of the purists, as you might call them, are saying it's got to be open, a hundred percent Apache compatible, et cetera. And then there's others that are taking a different approach. Explain your approach, and why you think that's the key way to really spur adoption of Hadoop and make it >>We're, we're a part of the community. We've got, you know, commits going on. We've, you know, pioneered and pushed Apache Drill, but we have done innovations as well. And I think that those innovations are really required to support and extend the, the whole ecosystem. So Canonical distributes our M3 distribution. We've got, you know, all our, our packages are available on GitHub and open source. So it's not, it's not a binary debate. And I think the, the point being that there's companies that have jumped ahead, and now the peloton is, is, you know, pedaling faster and, and we'll, we'll catch up, we'll streamline. I think the difference is we rearchitected. So we're basically in a race car and, you know, are, are racing ahead with, with enterprise grade features that are required. And there's a lot of work that still needs to be accomplished before that full rearchitecture is, is in place. >>Well, I mean, I think for me, the proof is really in the pudding when it comes to customers that are doing real things, real production-grade, mission-critical applications that they're running. And to me that shows the relative success of a given approach. So I know you guys are working with companies like Ancestry.com, Live Nation and Quicken Loans. Maybe you could walk us through a couple of those scenarios. Let's take Ancestry.com.
Obviously they've got a huge amount of data based on genealogical information. What do you do with them?
>>Yeah, they've got the world's largest family genealogy service available on the web, so there's a massive amount of data that they make accessible and available for analysis. And they've rolled out new features and new applications. One of them is to ship a kit out, have people spit in a tube and return it, and they do DNA matching and reveal additional details. Some really fabulous leading-edge things are being done with the use of Hadoop.
>>Interesting. So when you went to work with them, what were some of their key requirements? Was it more around the enterprise-grade security and uptime equation, or more around the analytics? What's the killer use case for them?
>>It's hard to generalize, even within a single company, because there are really three main areas: ease of use and administration; dependability, which includes the full HA; and then performance. In some cases just one of those drives the decision and is used to justify it; in other cases it's the collection. Ease of use means being able to use the cluster not only as Hadoop, but to access it and treat it like enterprise storage. There's a complete POSIX-compliant file system underneath that allows mounting, access, and updates with dynamic read-write. What that means at the application level is that it's faster, it's much easier to administer, and it's much easier and more reliable for developers to work with.
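That "treat it like enterprise storage" point is easiest to see in code. Here is a hedged sketch, not vendor documentation: it assumes the cluster is NFS-mounted at /mapr/mycluster (a hypothetical mount point) and uses nothing but standard Java file I/O, with no Hadoop client libraries at all.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Sketch: reading and writing cluster data through an NFS mount with
// plain file APIs. The mount point and file names are assumptions.
public class NfsAccessDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("/mapr/mycluster/apps/demo"); // hypothetical mount
        Files.createDirectories(dir);

        // Any existing tool that writes files can now write "into Hadoop":
        Path log = dir.resolve("events.log");
        Files.write(log, "user=42 action=click\n".getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);

        // ...and read results back the same way, no HDFS API in sight.
        for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

The design point is that random read-write POSIX semantics let unmodified applications and scripts use the cluster directly, where HDFS's write-once, append-style model normally forces a copy-in, copy-out workflow.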
>>I've got to ask you the marketing question, because MapR has done a good job of marketing, and certainly we want to thank you for supporting theCUBE; you've been great supporters of our mission. But now the ecosystem's evolving and there's a lot more competition. Cloudera mentioned those eight companies they're tracking in, quote, Hadoop, and certainly Jeff and I at SiliconANGLE see even more, because Hadoop-washing has been going on for at least a year: jumping in and slapping "Hadoop" onto an existing solution. What's next for you guys to break above the noise? Obviously the communities are very active and projects are coming online, and you have your mission in the enterprise. Is the strategy going forward more of the same, or is there anything new you can share?
>>I think as far as breaking above the noise, it will be our customers, their success and their use cases, that really put the spotlight on the differences in a big data platform. I'd draw an analogy to supply chain. The big revolution in supply chain was focusing on inventory at each stage in the chain: how do you reduce that inventory level, speed the flow of goods, and increase the agility of the company for competitive advantage? I think we're going to view data the same way. So instead of copying and moving raw data across different silos, companies that can process data in place and send small result sets are going to be faster, more agile, and more competitive.
>>And that puts the spotlight on which data platform can support a broad set of applications with the broadest functionality. What we're delivering is an enterprise-grade, mission-critical platform that supports MapReduce at high performance, provides NFS and POSIX access so you can use it like a file system, and integrates enterprise-grade NoSQL applications, so you can do high-speed, consistent-performance, real-time operations in addition to batch, streaming, integrated search, et cetera. It's really exciting to provide that platform and watch organizations transform what they're doing.
>>How's the feedback on Ted Dunning? I haven't checked the Twittersphere, but he's getting positive feedback here. He's a tech athlete: a guru, an expert, hands in all the pies, a scientist type. What's he up to, and what's his role within MapR? He's obviously active in the open source community.
>>Chief application architect. He's on the leading edge of Mahout, so machine learning, and he's sharing insights there. He was speaking at the Storm meetup two nights ago, showing how you can integrate long-running batch predictive analytics with real-time streaming, and how the use of snapshots makes that easy and possible. He travels the world helping organizations understand how they can take some very complex, long-running processes and really simplify and shorten them.
>>I had a chance to meet him in New York City at the last Hadoop World at a party. Great guy, fantastic geek, doing great work; shout out to Ted. How's everyone else? How are John and the rest of the team at MapR doing, pedaling as fast as you can, growing?
>>Really quickly. No, we're not just pedaling; we're shifting gears.
>>Give us an update on the company in terms of growth and where you're headed.
>>We're expanding worldwide. Just in the last few months we've opened offices in London, Munich, and Paris, and we're expanding in Asia: Japan and Korea. Our sales, services, and engineering, basically the whole company, continue to expand rapidly, along with some really interesting partnerships. And a lot of the growth is not only that we add customers; it's nice to see customers that continue to grow their use of MapR within their organizations, both in the amount of data they're analyzing and in the number of applications they're bringing to bear on the platform.
>>Well, talk about that a little bit, because one of the trends we do see is that a company brings in a big data platform, starts experimenting with it, builds an application, maybe in the marketing department, and then the sales guys see it and say, well, maybe we can do something with that.
How typical is that experience, and how do you support companies that want to expand beyond those initial use cases to support other departments, potentially even other physical locations around the world?
>>That's been the beauty of it: you have a platform that can support those new applications. Mission-critical workloads are not an issue, and we support volumes so you can logically separate things, which makes it much easier. One of our customers, Zions Bank, brought in MapR to do fraud detection, and pretty soon, because they were able to collect all of that data, they had other departments coming to them and saying, hey, we'd like to use that for analysis, because we're not getting that data from our existing systems.
>>They come in and you're sitting on a goldmine of use cases. You also mentioned expanding internationally. What's your take on the international market for big data and Hadoop specifically? Is the U.S. leaps and bounds ahead of the rest of the world in adoption?
>>I wouldn't say leaps and bounds, and internationally they may be able to skip some of the experimental steps. We're seeing deployments across financial services and telecom, and it's fairly broad. Recruit Technologies there, the largest provider of recruiting services (Indeed.com is one of their subsidiaries), is doing a lot with Hadoop, and with MapR specifically. It's been expanding rapidly.
>>Fantastic. And when you think about Europe, with what's going on with Google and some of the privacy concerns, even here: are there different regulatory environments you've got to navigate around data and how you use it when you expand into other locales?
>>It's typically by vertical. There are different requirements: HIPAA in healthcare, Basel II in financial services, and so on. And it's basically the same theme: when you bring Hadoop into an organization and into a data center, the same concerns, requirements, and privacy policies you apply in other areas will be applied to Hadoop.
>>Turning back to the technology: you mentioned Apache Drill. I'd love an update on where that stands, and put it into context for people. We hear a lot about the SQL-on-Hadoop question here; where does Drill fit into that equation?
>>Well, there are a lot of different approaches to providing SQL access, and a lot of that is driven by how you leverage the talent in an organization that already speaks SQL. There are developments with respect to Hive, and there are other projects out there. Apache Drill is an open source project getting a lot of community involvement, and the design center is pretty interesting: it started from the beginning as an open source project, with two main differences. One was, in supporting SQL, let's do full ANSI SQL, full SQL:2003, not a SQL-like language; that supports the greatest number of applications and avoids a lot of support issues. And the second design center is to support a broad set of data sources: nested sources like JSON, schema discovery, basically fitting into an enterprise environment, which is sometimes messy and can get messier as acquisitions happen, et cetera. So it's complementary; it's about enabling interactive, low-latency queries.
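For a sense of what that looks like from an application, here is a hedged sketch of querying raw JSON with ANSI SQL through Drill's JDBC driver. It assumes a later Drill release that ships the driver (Drill was still incubating at the time of this interview) and an embedded, local drillbit; the file path and field names are invented for illustration.

```java
import java.sql.*;

// Sketch: ANSI SQL over raw JSON files via Apache Drill's JDBC driver.
// No schema is declared anywhere; Drill discovers it from the data.
public class DrillQueryDemo {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:drill:zk=local"; // embedded-mode connection string
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT t.user_id, t.event FROM dfs.`/data/clicks.json` t LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getString("user_id") + " " + rs.getString("event"));
            }
        }
    }
}
```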
>>Jack, I want to give you the final word; we're out of time. Thanks for coming on theCUBE, great to see you again. We'll end the segment with your quick thoughts on what's happening here at Hadoop Summit. What is this show about? What's the vibe, the quick soundbite on Hadoop?
>>I'll go back to how we started. It's not if you use Hadoop, it's how you use Hadoop. Look not only at the first application, but at what it's going to look like across multiple applications, and pay attention to what enterprise grade means.
>>Okay, there you have it. We've got more coverage coming. Jack Norris with MapR, one of the original big three, still on the list in our minds and the market's, with a unique approach to Hadoop. This is theCUBE; I'm John Furrier with Jeff Kelly. We'll be right back after this short break.

Published Date : Jun 27 2013


Jack Norris | Strata Data Conference 2013


 

>>Okay, we're back here inside theCUBE, our flagship program about the events, where we extract the signal from the noise. This is Strata Conference, O'Reilly Media's big data event. We're talking Hadoop analytics and data platforms, and big data has come into the enterprise through the front door, as we heard yesterday. I'm John Furrier with Dave Vellante of Wikibon.org, where people go for free research and peers collaborate to solve problems. And we're here with Jack Norris, our CUBE alumni and a favorite guest, the executive in charge of marketing at MapR. You guys are leading the charge with this use of Hadoop; welcome back to theCUBE. So let's chat about what's going on. What's your take on all the big news out here on distributions, all the big power moves? You have an exclusive relationship with EMC, Intel's got a distribution, Hortonworks is with Microsoft; a lot of things going on. This is your wheelhouse. What's your take on the Hadoop action here?
>>Well, there's an article in Forbes where I think they said it best: this shows that MapR has had the right strategy all along. What we're seeing is that there's a fairly low bar to taking Apache Hadoop and providing a distribution, so there are a lot of new entrants in the market and a lot of options if you want to try Hadoop, experiment, and get started. And then there's production-class Hadoop, which includes enterprise data protection, snapshots, mirrors, and the ability to integrate. That's basically MapR: start in test and dev with a lot of options, then move into production-class MapR.
>>So break it down for the folks out there tipping a toe in the water and hearing all the noise, because right now the noise level is very high with the recent announcements. You've been doing business in this area for years. When people say, hey, I want a Hadoop distribution with enterprise capabilities, what should they look for? It's not easy to swing through the noise: a lot of claims, a lot of different options, some teams with more committers than others. What are the key things customers need to know, the table stakes, the checkboxes?
>>I think there are three areas. One is how it integrates into your enterprise. With Hadoop, you have the Hadoop distributed file system API; that's how you interact. Well, if you're also able to use standard tools with standard file and database access, it makes things much, much easier. MapR is unique in supporting NFS, and that's a big difference. The second is dependability: there are high availability capabilities, and then there's data protection. I'll focus on snapshots as an example. You've got data replicated in Hadoop; that's great. But if you have a user error or an application error, that error is replicated just as quickly. So it's about having the ability to recover and go back in time. If I made a mistake, can I go back to two minutes earlier? With snapshots that's possible, and MapR is unique in its snapshot support.
And then finally there's disaster recovery: mirroring, where you can go across clusters, mirror what's going on across the WAN, and recover in the case of a disaster where you lose a whole cluster or a whole section.
>>And that's not available in the others?
>>Those aren't available either: the NFS support, and snapshots have been on the JIRA list for over five years.
>>Okay. And the third? You said three and almost said two.
>>The third is performance and scale. So: integration, dependability, and speed.
>>Okay, so dependability covers the HA, the snapshots, the mirroring and DR. Let's talk about performance, because Google's a big partner of you guys; we just had them on theCUBE at Strata, and you have a record-setting result. Talk about the performance real quick, then we'll get to the EMC conversation. You have a variety of performance benchmarks, with Google and within the enterprise. Can you talk about those?
>>So what we announced this week was the MinuteSort world record. MinuteSort runs across technologies; it's simply how much data you can sort in 60 seconds. The previous record was done in the lab by Microsoft with special-purpose software, and they did 1.4 terabytes. Hadoop hasn't been used for it since 2009, because it has features that work against performance, things like checkpointing and logging, since it assumes long-running MapReduce jobs. So we set the record with our distribution of Hadoop, kind of one hand tied behind our back given that technology. Secondly, we set it in the cloud, a virtualized environment, which is the other hand tied behind our back. And we did 1.5 terabytes in 60 seconds. Very proud of that.
>>That's interesting, because we've been doing a lot of labs testing, Dave and I and our teams, on cost. It's an interesting benchmark because people don't always look at the nuance, the cost of cloud performance versus bare metal. Most people don't factor in the setup cost of deployment. Can you quickly talk about that, and how significant an order of magnitude it is for your customers?
>>The previous Hadoop record took 3,400 servers, about 27,000 cores, almost 14,000 disks, and did 600 gigs, actually a little less than that, at 578. On Google, we did it with 2,100 virtual instances and 8,000 cores, and did 1.5 terabytes.
>>And the cost of spinning up on Google versus...?
>>Basically, if you look at that and assume conservatively $4,000 per server, it's $13.8 million worth of hardware previously. And the cost to do that run on Google was $20 and 33 cents.
>>Well, you've got a discount; I mean, come on, you're a partner. Does it really cost that much? That's what they would charge for it?
>>In MapR's case, on that minute run, if you look at the actual charges it would be about twelve dollars.
>>Okay, so it's not six millions; it's millions down to tens of dollars. That's impressive. We'll have to go look at the numbers, like we're going to look at Greenplum's numbers in the next couple of weeks.
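As a back-of-the-envelope check on those figures: the $4,000-per-server price is Jack's stated assumption, the rest just follows from the quoted counts (the product comes out slightly under the $13.8 million quoted), and capital cost versus one run's rental is admittedly not a like-for-like comparison.

```java
// Rough arithmetic on the quoted MinuteSort figures; an illustration,
// not an audited cost model.
public class MinuteSortCostCheck {
    public static void main(String[] args) {
        long previousServers = 3_400;          // prior record's server count
        double costPerServer = 4_000.0;        // stated assumption
        double hardware = previousServers * costPerServer;
        System.out.printf("On-prem hardware: $%,.0f%n", hardware); // ~$13.6M

        double cloudRun = 20.33;               // quoted cost of the cloud run
        System.out.printf("One cloud run: $%.2f (about %,.0f runs per hardware budget)%n",
                cloudRun, hardware / cloudRun);
    }
}
```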
>>And talking about the Google relationship, where are you with that?
>>Very excited about it. We're actually deployed throughout the cloud. We've got multiple partners; Google's in limited preview, so we've got a number of customers testing that and doing some really interesting things.
>>So we monitor the data center market with our proprietary tools, the viewfinder and crowd spots, and the data center vertical is interesting. If you look at the sentiment analysis of the conversation in the Twitter data, it's Facebook, Apple, these companies. And when we dig into the numbers, it's not so much the companies; it's that their data center operations are being looked at as the leading indicator for where CIOs are going. So I want to ask you: in your conversations with customers, what are the conversations around moving to the cloud, and where are they on that transition? They hear about the cloud for all the benefits you mentioned, and Google and Facebook are the gold standards of architecture, not necessarily a cut-and-paste architecture, but they see the benefits. What are your conversations with enterprise customers around cloud architecture, and what features beyond replication and disaster recovery are they looking at?
>>It's basically workload-driven and dataset-driven. For data that's already in the cloud, a natural first step is: why don't I do the analysis there as well? Things like Google Earth and digital advertising data are really interesting candidates for that. Also periodic workloads: if they have workloads that need to spin up and spin down, the cloud works really well. And in some cases it's driven by their own environments. They've got data centers approaching capacity and need to offload, so they look at the cloud because it's easy to get up and running quickly.
>>I want to come back to one of your three value props, the dependability piece, specifically snapshots. Somebody asked me a couple of years ago, how do you back up a petabyte? His answer was, well, you don't. So I want to ask how your customers are protecting data, and what you're bringing to the table.
>>Snapshots are not a bolt-on feature; they're a low-level capability based on the underlying data architecture. When we architected it from the beginning, snapshots were a core feature. If you use a technique called redirect-on-write, you're not copying the data; you're tracking the pointers to the latest blocks that have been written. So you can do a petabyte snapshot basically almost instantaneously, and if the data change rate is low, you can snapshot every minute and not have any additional storage overhead.
>>And your customers can set that? Dial it up, dial it down, switch it?
>>We support logical volumes, so you can set policies at the volume level. You can say, this volume is critical data, and critical data gets snapshotted every minute. And then you can change what the definition of critical data is; maybe it's every five minutes, et cetera. So you can set up different policies on different volumes and have snapshots happen independently for each.
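A toy sketch of the redirect-on-write idea described above: writes always land in fresh blocks, so a snapshot is just a frozen copy of the pointer list, with cost independent of data size. This is an illustration of the concept, not MapR's on-disk format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy redirect-on-write store: a "file" is a list of pointers into an
// append-only block store. A snapshot copies only the pointers, so old
// and new views share all unchanged blocks.
public class RedirectOnWriteDemo {
    private final List<String> blocks = new ArrayList<>();      // append-only
    private final List<Integer> current = new ArrayList<>();    // live pointers
    private final Map<String, List<Integer>> snapshots = new HashMap<>();

    void write(int pos, String data) {
        blocks.add(data);                       // redirect: always a new block
        int ptr = blocks.size() - 1;
        if (pos < current.size()) current.set(pos, ptr);
        else current.add(ptr);
    }

    void snapshot(String name) {
        snapshots.put(name, new ArrayList<>(current)); // O(pointers), no data copy
    }

    String read(String snapName, int pos) {
        List<Integer> view = (snapName == null) ? current : snapshots.get(snapName);
        return blocks.get(view.get(pos));
    }

    public static void main(String[] args) {
        RedirectOnWriteDemo store = new RedirectOnWriteDemo();
        store.write(0, "v1");
        store.snapshot("two-minutes-ago");
        store.write(0, "v2");                                 // user error overwrites
        System.out.println(store.read(null, 0));              // v2: current state
        System.out.println(store.read("two-minutes-ago", 0)); // v1: point in time
    }
}
```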
>>Can you do that by workload, or dataset, or application? Essentially provided as a service, as opposed to a one-size-fits-all approach?
>>Exactly. And that also corresponds to user access, administrative privileges, and other features and policies within the cluster.
>>How about this whole trend toward bringing SQL into Hadoop? What's your take, and what's your angle?
>>Interactive SQL is an important aspect, because you've got so many people in the organization trained in SQL to leverage. But it's one of many use cases that need to run across a big data platform. There's a range: batch analytics, interactive capabilities with SQL, database operations, NoSQL, search, streaming; all of those are functions that need to run across the platform. So it's a piece, but it's not the big driver. What we've seen is a higher arrival rate of machine-generated data, and machine-generated responses to it, for digital advertising, for recommendation engines, for fraud detection, can really move the needle for an organization, with huge swings in profitability.
>>Moving the ball down the field big time. Yeah.
>>An interactive piece with a human element involved doesn't really scale and work on a 24/7 basis.
>>Jack, final question; we're over by a minute, so one last one. Obviously it's a very competitive landscape right now, and the stakes are higher because the market opportunity is massive. What's MapR's business strategy going forward? No change in direction, same as before? Anything new, given how you see the marketplace?
>>We've got a huge lead when it comes to mission-critical, enterprise-grade features, and our focus is one platform: the ability to support enterprise Hadoop and enterprise HBase, with full capabilities for ease of use, dependability, and performance. We've seen a lot of companies test on one distribution and switch to MapR, and we'll continue to help that happen in the future.
>>Well, we've been covering this big data space going on four years now, Dave and I, and we've watched all the players pivot a few times. You guys have not; you've been true to your mission from day one, and everyone knows where you stand: enterprise grade. It's a good strategy. Everyone's putting that on their label now; enterprise-grade washing, we call it. Congratulations, MapR, says theCUBE. We'll be right back with our next guest here on day three, wall-to-wall coverage at O'Reilly Media, with our news hour next from 12 to 1, right after this short break.

Published Date : Mar 4 2013


Jack Norris | Strata-Hadoop World 2012


 

>>Okay, we're back here live in New York City for big data week. This is SiliconANGLE.tv's exclusive coverage of Strata plus Hadoop World, the big event of big data week, and we just wrote a blog post on SiliconANGLE.com calling this the South by Southwest for data geeks. It's my prediction that this is going to turn into quite the geek fest; obviously the crowd here is enormous, packed, an amazing event. I'm the founder, John Furrier, joined by my co-host Dave...
>>Vellante, of Wikibon.org, where people go for free research and peers collaborate to solve problems. And we're here with Jack Norris, the vice president of marketing at MapR, a company we've been tracking for quite some time. Jack, welcome back to theCUBE.
>>Thank you, Dave.
>>I'm going to hand it to you. We met quite a while ago now, well over a year ago, and we were pushing at you guys, and you were saying: look, we're solving problems for customers, we've got the right model, this is our strategy, we're sticking to it, watch what happens. And like I said, I have to hand it to you: you really have great traction in the market, and you're doing what you said. Congratulations on that. I know you've got a lot more work to do, but...
>>Yeah, and actually the topic of openness is pretty interesting. If you look at the different options out there, all of them combine open source with some proprietary elements. In the case of some distributions it's very small, like a proprietary ODBC driver. But I think it shows that any solution combining the two in a more open way is important. So what we've done is make innovations, but open those innovations up behind standard APIs: NFS for standard access, REST, ODBC drivers, et cetera.
>>So it's a spectrum. Actually, we were at Oracle OpenWorld a few weeks ago, and you listen to Larry Ellison talk about the Oracle public cloud, and he actually makes a very strong case that it's open: you can move data, it's all Java, so it's all about standards. But it was really all about the business value; that's the bottom line. So, we had your CEO, John Schroeder, on yesterday, and John and I were both very impressed with what he described as your philosophy: you don't announce a product until you have customers running it. That's impressive.
>>And he also gave good feedback for startup entrepreneurs out there, with all the action going on in the startup community. He basically said the same thing: get customers. Use your tech, but don't be so locked into the tech; get the customers, understand their needs, and deliver on them. So you guys have done great. And I want to talk about the show here, because you have a big booth and a big presence. What are you learning? How's the positioning, how's the new news hitting? Give us a quick update.
>>A lot of news. It started on Tuesday, when we announced the M7 edition, and I brought a demo of it here for you all.
Because the big thing about M7 is what we don't have. We're not demoing region servers, we're not demoing compactions, and we're not demoing a lot of manual administrative tasks. What that really means: consider the stack. If you look at HBase, about half of Hadoop users are adopting HBase, so there's a lot of momentum in the market, and it's used for everything from real-time analytics to lightweight OLTP processing. But it's an infrastructure that sits on top of a JVM, that stores its data in the Hadoop distributed file system, that sits on a JVM, that stores its data in a Linux file system, that writes to disk.
>>And a lot of the complexity is that stack. As an administrator, you have to worry about how data gets written across it. You've got region servers to keep up; when you're doing writes, you have things called compactions, which increase response time; there's pre-splitting of tables and manual merges. It's a complex environment, and we've spent quite a bit of time collapsing that infrastructure. With the M7 edition, you've got files and tables together in the same layer, writing directly to disk. There are no region servers, no compactions to deal with, no pre-splitting of tables and trying to do manual merges. It just makes it much, much simpler.
>>Let's talk about the profile of your customers. I'm assuming, and correct me if I'm wrong, that you're not selling to the tire kickers; you're selling to people who have experience with Hadoop, have run into some of its limitations, and you come in and say, hey, we can solve some of those problems. Is that right?
>>That's a fair characterization. Part of it is the evaluation process: when you first hear about Hadoop, it's like the Gartner hype curve; this stuff does everything. Of course you've got data protection, because things are replicated across the cluster; of course you've got scalability, because you can just add nodes, and so forth. Once you start using it, you realize that yes, my data is replicated across the cluster, but if I accidentally delete something, or have corruption, that's replicated just as quickly too. So things like snapshots are really important, so you can return to what the data was five minutes before. Performance, so you can get the most out of your hardware. Ease of administration, where I can cut the cluster into logical volumes and set policies at that level instead of on individual files.
>>So there's a bunch of features that really resonate with users after they've had some experience, and those tend to be our key customers. And there's another phase, too: when you're testing Hadoop, you're looking at what's possible with the platform; when you go into production, all of a sudden you're looking at how it fits your SLAs, your data protection policies, how you integrate with your different data sources, and whether you can leverage existing code. We had one customer, a large systems integrator for the federal government, with a million lines of code that they were told to rewrite to run with other distributions, which they could use just out of the box with MapR.
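The flip side of the "what we don't have" pitch is what stays the same: the claim is that the change sits below the API, so ordinary HBase application code keeps working. As a reference point, here is a sketch of a plain HBase-0.94-era client call, the standard Apache client API of that period. Table and column names are invented, and whether any particular backend runs it unchanged is something to verify against vendor docs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Ordinary HBase client code of the era: write one cell, read it back.
public class HBaseClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "web_events");

        Put put = new Put(Bytes.toBytes("row-42"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("url"), Bytes.toBytes("/home"));
        table.put(put);

        Result result = table.get(new Get(Bytes.toBytes("row-42")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("url"))));
        table.close();
    }
}
```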
They have a million lines of code that they were told to rewrite, to run with other distributions that they could use just out of the box with Matt BARR. >>So, um, let's talk about some of those customers. Can you name some names and get >>Sure. So, um, actually I'll, I'll, I'll talk with, uh, we had a keynote today and, uh, we had this beautiful customer video. They've had to cut because of times it's running in our booth and it's screaming on our website. And I think we've got to, uh, actually some of the bumper here, we kind of inserted. So, um, but I want to shout out to those because they ended up in the cutting room floor running it here. Yeah. So one was Rubicon project and, um, they're, they're an interesting company. They're a real-time advertising platform at auction network. They recently passed a Google in terms of number one ad reach as mentioned by comScore, uh, and a lot of press on that. Um, I particularly liked the headline that mentioned those three companies because it was measured by comScore and comScore's customer to map our customer. And Google's a key partner. >>And, uh, yesterday we announced a world record for the Hadoop pterosaur running on, running on Google. So, um, M seven for Rubicon, it allows them to address and replace different point solutions that were running alongside of Hadoop. And, uh, you know, it simplifies their, their potentially simplifies their architecture because now they have more things done with a single platform, increases performance, simplifies administration. Um, another customer is ancestry.com who, uh, you know, maybe you've seen their ads or heard, uh, some of their radio shots. Um, they're they do a tremendous amount of, of data processing to help family services and genealogy and figure out, you know, family backgrounds. One of the things they do is, is DNA testing. Uh, so for an internet service to do that, advanced technology is pretty impressive. And, uh, you know, you send them it's $99, I believe, and they'll send you a DNA kit spit in the tube, you send it back and then they process that and match and give you insights into your family background. So for them simplifying HBase meant additional performance, so they could do matches faster and really simplified administration. Uh, so, you know, and, and Melinda Graham's words, uh, you know, it's simpler because they're just not there. Those, those components >>Jack, I want to ask you about enterprise grade had duped because, um, um, and then, uh, Ted Dunning, because he was, he was mentioned by Tim SDS on his keynote speech. So, so you have some rockstars stars in the company. I was in his management team. We had your CEO when we've interviewed MC Sri vis and Google IO, and we were on a panel together. So as to know your team solid team, uh, so let's talk about, uh, Ted in a minute, but I want to ask you about the enterprise grade Hadoop conversation. What does that mean now? I mean, obviously you guys were very successful at first. Again, we were skeptics at first, but now your traction and your performance has proven this is a market for that kind of platform. What does that mean now in this, uh, at this event today, as this is evolving as Hadoop ecosystem is not just Hadoop anymore. It's other things. Yeah, >>There's, there's, there's three dimensions to enterprise grade. Um, the first is, is ease of use and ease of use from an administrator standpoint, how easy does it integrate into an existing environment? How easy does it, does it fit into my, my it policies? 
Do you run in a lights-out data center? Does the Hadoop distribution fit into that? So that's one whole dimension, and a key to it is complete NFS support, so the cluster functions like standard storage. A second dimension is dependability and reliability. It's not just having a checkbox HA feature; it's automated stateful failover, self-healing, handling multiple failures, and automated recovery. In a lights-out data center, can you actually go in just once a week and replace drives? A great example: one of our customers had a test cluster with MapR. It was a POC; they went on and did other things. They had a power failure, came back a week later, and the cluster was up and running without any manual intervention. They were blown away; the recovery process for the other distributions was a long laundry list.
>>So I've got to ask you: what's the third one?
>>The third one is performance, and performance is not just raw speed. It's also how you leverage the infrastructure: can you take advantage of the network, of multiple NICs, of heterogeneous hardware? Can you mix and match for different workloads? It's really about sharing a cluster across different use cases and different users. There are a lot of features there; it's not just raw speed.
>>So it's fitting the existing IT infrastructure policies, the whole question of what happens when something goes wrong, and whether you can automate that. And then...
>>And the same themes apply to HBase: making HBase easy, dependable, and fast as well.
>>So the talk of the show right now, from the keynote this morning, is that MapR marketing has dropped the big data term and is going with "Datacosm." Is that true? Joe Hellerstein just had a tweet. Joe, the famous Berkeley computer science professor, now CEO of a startup, has had a couple of epic tweets this week, so shout out to Joe Hellerstein. His tweet says MapR marketing has decided to drop the term big data and go with Datacosm, with a shout out to George Gilder; kind of middlebrow intellectual humor. So, Mr. VP of Marketing: what's your response? Is it true? What's happening?
>>Well, if you look at the term big data, there's a lot of big-data-washing going on, where architectures that have been out there for 30 years are suddenly all about big data. So there's a need for a more descriptive term. The purpose of Datacosm was not to try to coin something or change the big data label; it was just to get people to take a step back and realize that we are in a massive paradigm shift. And with a shout out to George Gilder, acknowledging what he recognized: with the microcosm, what the impact of abundant compute would mean, and with the telecosm, what bandwidth would mean. If you look at the combination, we've got all this compute efficiency plus bandwidth, and the datacosm is basically taking those resources, unleashing them, and changing the way we do things.
>>And I think one of the ways to look at that is the new things that will be possible. There's been a lot of focus on SQL interfaces on top of Hadoop, which are important, but I think some of the more interesting use cases are taking the machine-generated data that's being produced very rapidly and having automated operational analytics that respond very quickly to change how you do business: how you're communicating with customers, how you're responding to different risk factors in the environment for fraud, or just improving your response time to cost events.
>>Actionable insight, as we said earlier: assessing intent and being able to respond. It's interesting that you mention George Gilder, because we like to riff on abstract concepts, and he was also very big in supply-side economics. If you look at the business value conversation, one of the things we pointed out yesterday and in this morning's opening review is that the top conversations are insight and analytics as the killer app right now. The app market hasn't fully developed yet, which is why we like what companies like yours are working on under the hood, at many levels of performance. But analytics is the no-brainer insight; the other piece is business value.
>>And that's what people want. Everyone knows Gilder's politics; he bought The American Spectator. But politics aside, the business impact of what big data enables is massive, for businesses of every stripe. And relative to Datacosm: nobody talks about "e-business" anymore. We were talking to IBM at their conference, and they said, hey, that was a great marketing campaign, but no one asks "are you an e-business?" today. We think big data will have the same effect: "do you have big data?" will just be assumed. That's what you're basically trying to establish, that it's not just about "big."
>>Yeah. Let me give you one small example from a business value standpoint. Ted Dunning, our chief application architect, whom you mentioned earlier, and one of the coauthors of "Mahout in Action," which deals with machine learning, worked with one of our large financial services customers. One of the techniques on Hadoop is clustering: k-nearest neighbors and related algorithms. They looked at a particular process and sped it up by 30,000 times. There's a blog post on our website with additional information on it.
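The technique itself is simple enough to sketch. Below is a generic, in-memory k-nearest-neighbors classifier, a toy version of the approach mentioned above, with invented data; the actual work described was a distributed Hadoop job at a very different scale.

```java
import java.util.Arrays;
import java.util.Comparator;

// k-nearest-neighbors in miniature: classify a point by majority vote
// among the k closest labeled points (Euclidean distance).
public class KnnDemo {
    record Labeled(double[] x, String label) {}

    static String classify(Labeled[] train, double[] query, int k) {
        Labeled[] byDistance = train.clone();
        Arrays.sort(byDistance, Comparator.comparingDouble(p -> dist(p.x(), query)));
        long fraud = Arrays.stream(byDistance).limit(k)
                .filter(p -> p.label().equals("fraud")).count();
        return fraud * 2 > k ? "fraud" : "ok";   // two-class majority vote
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        Labeled[] train = {
            new Labeled(new double[]{1.0, 0.9}, "fraud"),
            new Labeled(new double[]{0.9, 1.1}, "fraud"),
            new Labeled(new double[]{0.1, 0.2}, "ok"),
            new Labeled(new double[]{0.2, 0.1}, "ok"),
        };
        System.out.println(classify(train, new double[]{0.95, 1.0}, 3)); // fraud
    }
}
```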
>>One point on that: to your point about business value, that's an incredible speedup, and it changes how companies can react in real time and do pattern recognition. Google did a really interesting paper called "The Unreasonable Effectiveness of Data," and in there they say that simple algorithms on massive amounts of data beat a complex model every time. So I think we'll see a movement away from data sampling and trying to do an 80/20, toward looking at all your data and identifying the exceptions you want to amplify, because they're revenue, or address, because they're a cost or fraud.
>>Well, on that I'd give a shout out to the guys at Digital Reasoning. Tim Estes plugged Ted, practically idolized his work, and obviously his work is awesome. But he also brought up this concept of the understanding gap, and showed an interesting chart in his keynote: data exploding, 64% of it unstructured by his calculation, going straight up, against a flat line called attention, meaning user attention. As data explodes over time, humans can't expand their minds fast enough, so machine learning technologies have to bridge that gap. That's analytics; that's insight.
>>Yeah. There's a big conversation going on now about more data versus better models, people trying to squint through the comments Google made and asking: does that mean we just throw out the models?
>>Data trumps algorithms.
>>Data trumps algorithms. But the question I have, and your customers are talking about this: okay, now they have more data. Can they actually develop better algorithms that are simpler? Is it a virtuous cycle?
>>There's a lot of debate there, but I think one of the interesting things is this: given the compute efficiency and the bandwidth we now have, you can take a model, iterate on it very quickly, and arrive at insight. In the past, the amount of data and the time to process it meant a run could take 40 days; you can get to the same point now in hours.
>>Right. The great example is fraud detection. We used to sample: six months later, hey, your credit card might have been hacked. Now you get a phone call right away, or you simply can't use the card. But there are still plenty of use cases, weather being one, where better modeling would be very helpful. Excellent. So, Datacosm: are you planning other marketing initiatives around it, or is this tongue-in-cheek fun, a little red meat in the water?
>>You know what really motivated us? theCUBE is here talking for the whole day. What could we possibly do to help give you a topic of conversation?
>>Okay, Datacosm it is; you heard it here. Jack Norris, thanks for coming on theCUBE. We appreciate your support; you guys have been great, and we've been following you and will continue to follow you. And I want to thank you personally while we're here: MapR has been a generous underwriter and supporter of our independent editorial, and we want to recognize that. We look forward to watching you guys grow and kick ass.
And we'll be right back with our next guest after this short break.
>>Thank you.
>>Ten years ago, the video news business believed the internet was a fad. The science is settled: the internet is here to stay. Bubbles and busts come and go, but the industry deserves a news team that goes the distance. Coming up on SocialANGLE: some interesting new metrics for measuring the worth of a customer on the web. Every morning we're on the air to bring you the most up-to-date information on the tech industry, with scrutiny on the releases of the day and news of industry-wide trends. We're here daily with breaking analysis from the best minds in the business. Join me, Kristin Filetti, daily at the news desk on SiliconANGLE TV, your reference point for tech innovation.

Published Date : Oct 25 2012

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Joe Hellerstein | PERSON | 0.99+
George Gilder | PERSON | 0.99+
Ted Dunning | PERSON | 0.99+
Kristin Filetti | PERSON | 0.99+
Joel Hellison | PERSON | 0.99+
John Schroeder | PERSON | 0.99+
Joe | PERSON | 0.99+
Jack | PERSON | 0.99+
Larry Ellison | PERSON | 0.99+
Jack Norris | PERSON | 0.99+
John | PERSON | 0.99+
40 days | QUANTITY | 0.99+
Melinda Graham | PERSON | 0.99+
64% | QUANTITY | 0.99+
$99 | QUANTITY | 0.99+
comScore | ORGANIZATION | 0.99+
Tim | PERSON | 0.99+
Dave | PERSON | 0.99+
Tuesday | DATE | 0.99+
Matt BARR | PERSON | 0.99+
Hellerstein | PERSON | 0.99+
Google | ORGANIZATION | 0.99+
George Gilder | PERSON | 0.99+
Ted | PERSON | 0.99+
John ferry | PERSON | 0.99+
30 years | QUANTITY | 0.99+
30,000 times | QUANTITY | 0.99+
today | DATE | 0.99+
IBM | ORGANIZATION | 0.99+
a week later | DATE | 0.99+
yesterday | DATE | 0.99+
two | QUANTITY | 0.99+
three companies | QUANTITY | 0.99+
Dana | PERSON | 0.99+
Tim SDS | PERSON | 0.99+
one point | QUANTITY | 0.99+
Java | TITLE | 0.99+
first | QUANTITY | 0.99+
six months later | DATE | 0.99+
one | QUANTITY | 0.99+
Oracle | ORGANIZATION | 0.99+
one customer | QUANTITY | 0.99+
Linux | TITLE | 0.98+
once a week | QUANTITY | 0.98+
18 months | QUANTITY | 0.98+
Rubicon | ORGANIZATION | 0.98+
HBase | TITLE | 0.98+
Kozum | PERSON | 0.98+
Gartner | ORGANIZATION | 0.98+
this morning | DATE | 0.97+
Telekom | ORGANIZATION | 0.97+
this week | DATE | 0.97+
10 years ago | DATE | 0.97+
second dimension | QUANTITY | 0.97+
both | QUANTITY | 0.97+
Kozum | ORGANIZATION | 0.95+
third one | QUANTITY | 0.95+
One | QUANTITY | 0.94+
three things | QUANTITY | 0.94+
a year ago | DATE | 0.94+
Hadoop | TITLE | 0.93+
siliconangle.com | OTHER | 0.93+
Knicks | ORGANIZATION | 0.93+
Regents | ORGANIZATION | 0.92+

Jack Norris | Hadoop Summit 2012


 

>>Okay, we're back live in Silicon Valley, in San Jose, California, for continuing siliconangle.tv coverage of Hadoop Summit 2012. This is ground zero for the alpha geeks in big data, the tech elite. We call them tech athletes, and, uh, we're excited to cover it on the ground and extract the signal from the noise. This is theCUBE, our flagship telecast. I'm joined by my co-host Jeff Kelly from Wikibon.org, the best analyst in the business. Jeff, welcome back for another segment. End of the day, day one, loving every minute. Okay, we're here with our guest. Jack Norris is the CMO of MapR. Jack, welcome back to theCUBE. You've been on a few times. Um, so you guys have some news. >>Yes. >>So let's get right to the news. You guys are a player in this business, so share your news with the folks. >>Excellent, jump right in. So, uh, two big announcements today. We announced that Amazon is integrating MapR as part of their Elastic MapReduce service, and both editions, the free edition, M3, as well as M5, are available directly with Amazon, Amazon in the cloud. >>So what's the value proposition? Why would a customer say, all right, I want to do this in the cloud, MapR on the Amazon cloud, rather than doing it on premise? >>Okay. So let's start with, I mean, there are a lot of value propositions all balled up into one here. Uh, first of all, in the cloud it allows them to spin up very quickly. Within a couple minutes you can get, uh, you know, hundreds of nodes available. Um, and depending on where you're processing the data, if you've got a lot of data in the cloud already, it makes a lot of sense to do the Hadoop processing directly there. So that's one area. A second is, you might have an on-premise deployment and need to have disaster recovery. So MapR provides point-in-time snapshots, uh, as well as wide-area replication. So you can use mirroring, and having Amazon available as a target is a huge advantage. And then there's also a third application area, where you can do processing of the data in the cloud and then synchronize those results to an on-premise cluster. So basically, process where the data is, and combine the results into a cluster on premise. >>So you don't have to move the raw data. >>Actually, it's all about: let's do the processing on the data. >>Well, you know, the whole value proposition in big data in general is, let's move data as little as possible. You bring the computation to the data, if you can.
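To make "spin up very quickly" concrete, here is a rough sketch of what launching a MapR-backed job flow looked like with the AWS SDK for Java of that era. Treat the product name, edition argument, instance types, and S3 paths below as illustrative assumptions rather than values taken from the announcement.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.SupportedProductConfig;

public class LaunchMaprCluster {
    public static void main(String[] args) {
        // Placeholder credentials; real code would load these from the environment.
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("mapr-m3-demo")
                .withLogUri("s3://my-bucket/emr-logs/")   // hypothetical bucket
                // Ask EMR for the MapR distribution instead of stock Hadoop;
                // "--edition,m3" selects the free edition (m5 for the paid one).
                .withNewSupportedProducts(new SupportedProductConfig()
                        .withName("mapr")
                        .withArgs("--edition,m3"))
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(10)
                        .withMasterInstanceType("m1.large")
                        .withSlaveInstanceType("m1.large")
                        .withKeepJobFlowAliveWhenNoSteps(true));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}
```

The point of the exchange above is that a handful of lines like these, or the equivalent console clicks, stands in for racking and configuring hundreds of nodes yourself.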
>>Uh, so what's your take on this event? I mean, this is, you know, the Hadoop Summit, and Hortonworks has now fully taken over the show. Talk about what you see out here in terms of the other vendors at play, and, uh, just the attendees, the vibe you're seeing. >>Uh, it's a lot of excitement. I think a big difference from last year, which seemed to be very developer focused, is that we're seeing a lot of presentations by customers. A lot of information was shared by our customers today. It was fun to see that, uh, comScore shared, uh, shared their success as a MapR customer, which was great for us. >>Fantastic. Look at Amazon. Amazon, first of all, is the gold standard for public cloud, right? They've knocked it out of the park; everyone knows Amazon. Um, but they've been criticized on the big data front because of the cycle times involved. Um, for some developers, I mean, for web services spinning up and down, no problem. Um, and we're seeing businesses like Netflix run on Amazon. So Amazon is not a stranger to running at scale in the cloud, but Hadoop has kind of been a kludgey thing for Amazon. So I think, you know, talk about why Amazon and you guys is a good fit. The market reach is great, so you guys now have a huge addressable market. Are you guys helping solve some of that complexity on the, uh, on the MapReduce side? >>What's the core... I guess the first response would be, I think every customer should have that type of kludge. Uh, they could have the success that Amazon has had with Hadoop. They have a huge number of, uh, of Hadoop deployments that have been very, very successful. I think... >>I mean, you know what I mean. By "it's natural," I mean it's kludgey everywhere right now. That's the problem. But Amazon has huge scale, um, and it had not been a natural fit. >>It is not a natural fit >>For the data, for the data component. And, uh, HBase, for example... >>Component. So where Amazon, you know, made it very frictionless is the ability to spin up Hadoop to do the analysis. The gap that was missing is some of the, the HA capabilities, the data protection features, the disaster recovery. And, you know, with MapR now, it gives options to those customers. You know, if they want those kinds of enterprise-grade features, now they have an option within EMR: they can select M5 and, and get moving. If they want the performance and the NFS, they've got the M3 option. >>Well, congratulations. I think it's a great deal for you guys and for Amazon customers. My question for you is, as you guys explore the enterprise-ready equation, which has been a big topic this week, um, what does that mean to you guys? Because it means different things to different people, depending on how high up to OLTP you go, right? I mean, how far from batch to real-time transactional levels do you go? I mean, slow batch, no problem; but as you start to get more near real time, it's going to be a little bit different game: you know, security, HDFS. Yeah. >>Yeah. So, so Hadoop represents a strategic platform, right? Deploying that in an organization, um, you know, moving from kind of an experimental, lab-based environment to a production environment creates a different set of feature requirements. How available is it? How easy is it to integrate, right? How do I protect that information, and how do I share it? So when we say enterprise grade, we mean you can have SLAs; you can put the data there and be confident that the data will remain there; you can have point-in-time recovery for an application error or a user mistake; uh, you can have disaster recovery features in place. And then the integration is about not recreating the wheel to get access to the information. So Hadoop is very powerful, but it requires interacting through an HDFS API. If you can leverage it, like through MapR, with NFS standard file-based access and standard ODBC access, you open it up. So you can use standard file browser applications to see and manipulate the data, which really opens up the use cases. And then finally, what we announced in 2.0 was multi-tenancy features. So as you share that information, all of a sudden you've got the SLAs of different groups: well, these guys need it immediately, and if you've got some low-grade batch jobs, they're going to impact that. So you want the ability to protect, to isolate, to secure information, and basically have virtual clusters within a cluster. And those features are important in the cloud, but they're also important on-premise.
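Here is a minimal sketch of the integration point Norris describes: the same file read once through the Hadoop FileSystem API that HDFS requires, and once with plain java.io against a cluster NFS-mounted at a local path. The mount point /mapr/cluster1 and the file names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AccessComparison {

    // Hadoop-aware access: needs the Hadoop client libraries and its API.
    static String readFirstLineViaHdfsApi() throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/logs/events.txt"))))) {
            return in.readLine();
        }
    }

    // Standard file access: plain java.io, assuming the cluster is
    // NFS-mounted at /mapr/cluster1 (hypothetical mount point).
    static String readFirstLineViaNfsMount() throws IOException {
        try (BufferedReader in = new BufferedReader(
                new FileReader("/mapr/cluster1/logs/events.txt"))) {
            return in.readLine();
        }
    }
}
```

The second method is the whole point: any existing script, desktop tool, or legacy application that can open a local file can work with the cluster's data, no Hadoop client libraries required.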
>>So, great for the hybrid cloud environments out there. I mean, multi-tenancy, cracking the code on that: huge. I mean, right now most enterprises are, like, private cloud, because it's basically an extension of their data center, and you're seeing a lot more activity in the hybrid cloud as a gateway to the public cloud. >>And, you know, frankly, people have been kind of struggling with that in experimenting with Apache Hadoop and the other distributions: the policies are either at the individual file level or at the whole-cluster level. And that almost forced the creation of separate physical clusters, which kind of goes against the whole Hadoop concept. So the ability to manage at a logical layer, to have separate volumes where you can apply policies that apply to all the content underneath, really makes it much, much easier for administrators to deal with these multiple use cases. >>Amazon has always been one of those cases for the enterprise, and this has been talked about for years: put the credit card down, go play on Amazon, but then bring it back into the IT group for certification. And so I think this is a nice product for you guys to bring that comfort. >>We're very excited. >>The enterprise saying, hey, come play on Amazon. It's bulletproof, enterprise ready. So congratulations. >>I wonder, can we talk, uh, talk use cases? So what are you seeing in terms of evolving use cases as Hadoop continues to become more enterprise grade? Uh, depending on your definition, uh, depending on who you talk to, it's been that way for a bit. But what kind of use cases are you seeing develop now that it's starting to gain acceptance? Like, okay, we can trust our data is going to be there, et cetera. >>So there's a huge range of use cases that differ by industry, differ by the kind of dataset being used: everything from really a deep store where you can do analytics on it, so you're selecting the content, to something that's very, very analytic and machine-learning intensive, where you're doing sophisticated clustering algorithms, et cetera. Um, where we've seen kind of an expansion of use cases is around real-time streaming, where you get streaming data sets that are kind of entering into the cloud, and, um, some of the more mission-critical data, moving beyond just, maybe, clickstream data, or things where, if you happen to drop a few, you know, it's not a big deal, versus the kind of trust-the-business type of content. >>Talk a little bit about the streaming, uh, aspects, uh, because of course, you know, we think of Hadoop, we think of a batch system. In terms of streaming data into Hadoop, that's something we haven't heard a lot about. So how do you guys approach that? >>So, uh, one of the artifacts of HDFS, which is a distributed file system that stores its data in the underlying Linux file system, is that it's append-only. So as an administrator, you decide: how frequently do I close the file?
Am I going to do that on an hourly basis? Every eight hours? Because you have to close the file for other applications to see the data that's been written, right? So one of the innovations that, uh, that we pursued was to rewrite that and create this dynamic read-write layer. So you can continue to write data, and any application is seeing the latest data that's written. So you can mount the cluster as if it's standard storage and just continue to write data. That really opens up what's, uh, what's possible. Companies like Informatica, whose Ultra Messaging product integrates directly in with, with MapR. >>So what kind of advantage does that provide to the end user? Translate that into real business value: why is that important? >>Well, so one example is comScore. comScore handles 30 billion, uh, objects a day, uh, as they go out and try to measure the use of the web, and being able to continually write and stream that information, at scale, and handle that in real time and do analytics and turn around data faster has tremendous business value to them. If they're stuck in a batch environment where the load times lengthen to the point where all of a sudden they can't keep up, then they're actually reporting on, you know, old news. And I think the analogy is: forecasting rain a day after it's wet isn't exactly valuable.
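A toy sketch of the access pattern behind that read-write layer: one thread keeps appending records to a file while a reader polls the same path and sees each record as it lands, with no close-the-file step in between. This is a single-process illustration of the semantics being described, not MapR's implementation, and the NFS mount point is hypothetical.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class StreamingTail {
    // Hypothetical NFS mount point for the cluster; any writable path
    // works for running the demo locally.
    static final Path LOG = Paths.get("/mapr/cluster1/streams/events.log");

    public static void main(String[] args) throws Exception {
        Files.createDirectories(LOG.getParent());
        Files.deleteIfExists(LOG);
        Files.createFile(LOG);

        // Writer: appends one record per second, like a feed handler would.
        Thread writer = new Thread(() -> {
            try (PrintWriter out = new PrintWriter(new FileWriter(LOG.toFile(), true))) {
                for (int i = 0; i < 10; i++) {
                    out.println("event-" + i);
                    out.flush();              // record is visible as soon as it lands
                    Thread.sleep(1000);
                }
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        // Reader: polls the same file and picks up each new record,
        // with no "close the file every N hours" step in between.
        int seen = 0;
        while (writer.isAlive()) {
            List<String> lines = Files.readAllLines(LOG);
            for (; seen < lines.size(); seen++) {
                System.out.println("reader saw: " + lines.get(seen));
            }
            Thread.sleep(250);
        }
        writer.join();
    }
}
```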
>>Yeah. So, you guys obviously got a great deal on the enterprise-ready front with Amazon: big story, big coup for the company. What's next for you? I want to ask that and make sure you get your agenda for the next year out there, but then I want you to take a step back a year, maybe a year and a half, and look at how much has changed in this landscape. Um, share your perspective, because the market has gone through an evolution where there's been a market opportunity, and then everyone goes, oh my God, it's bigger than we actually thought. I mean, Jeff Kelly's groundbreaking report about the $50 billion market is now being talked about as too low. So big data has absolutely opened up huge, and it's changed some of the tactics around strategies: your strategy, Hortonworks' strategy, even Cloudera's. And it's still evolving. So what's changed for the folks out there, from a year and a half ago, a year ago, to today? And then look out over the next 12 months: what's on your agenda? >>Well, if you look back, I think we've been fairly consistent. Um, I'm not going to take credit for the vision of our CEO and CTO, uh, but they recognized early on that Hadoop was, uh, a strategic platform, and that to be a strategic platform that applied to the broadest number of use cases and organizations, some areas required innovation: in particular, how it scaled, how it was managed, and how you stored and protected the information needed a rearchitecture. And I think that, you know, architecture matters when you're going through a paradigm shift; having the right one in place creates this ability, you know, to speed innovation. And I think that's, if there's anything that's changed, I think it's the speed of innovation, which has even increased in the Hadoop community. I think it's created a focus on these enterprise-grade features, on how do we store this valuable information and, and continue to explore. >>And I think one of the observations I'll make on that note is that it really focuses everyone to just, you know, mind your own business and get the products out. You know what I'm saying? We've seen the product focus be the number-one conversation. >>What we've seen is customers, you know, start and then expand rapidly. Some of that is due to data growth, but a lot of it is due to more and more applications being delivered and, and, uh, the value being extracted from the Hadoop platform. And success breeds success. >>Well, congratulations on all your success: great win with Amazon Web Services, making that a little bit easier, more robust, more features for them and for you, uh, more revenue for MapR. Um, and I want to personally thank you for your support of theCUBE. Uh, we've expanded with a new Studio B for extra interviews, um, and want to expand the conversation; thanks to your generous support, we can bring the independent coverage out to the market. And, um, great community. Thanks for helping us out, we appreciate it. >>So thank you. >>Okay, Jack Norris with MapR. We'll be right back to wrap up day one; Jeff and I will give our analysis right after this short break.

Published Date : Jun 14 2012

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Jeff Kelly | PERSON | 0.99+
Jeff | PERSON | 0.99+
Amazon | ORGANIZATION | 0.99+
Jack Norris | PERSON | 0.99+
Jack Dorsey | PERSON | 0.99+
Netflix | ORGANIZATION | 0.99+
$50 billion | QUANTITY | 0.99+
Silicon valley | LOCATION | 0.99+
30 billion | QUANTITY | 0.99+
today | DATE | 0.99+
Informatica | ORGANIZATION | 0.99+
a year ago | DATE | 0.99+
next year | DATE | 0.99+
comScore | ORGANIZATION | 0.99+
a year and a half ago | DATE | 0.99+
Kelly | PERSON | 0.99+
last year | DATE | 0.99+
Amazons | ORGANIZATION | 0.99+
Linux | TITLE | 0.99+
Matt BARR | PERSON | 0.99+
San Jose, California | LOCATION | 0.99+
one example | QUANTITY | 0.98+
one area | QUANTITY | 0.97+
third application | QUANTITY | 0.97+
Matt | PERSON | 0.97+
one | QUANTITY | 0.97+
Hadoop | TITLE | 0.97+
this week | DATE | 0.96+
2012 | DATE | 0.95+
hundreds of nodes | QUANTITY | 0.94+
Hortonworks | ORGANIZATION | 0.94+
Jack | PERSON | 0.93+
both edition | QUANTITY | 0.93+
a day | QUANTITY | 0.93+
two big announcements | QUANTITY | 0.92+
second | QUANTITY | 0.9+
next 12 months | DATE | 0.88+
day one | QUANTITY | 0.86+
two dot | QUANTITY | 0.85+
M three | OTHER | 0.85+
M three | TITLE | 0.84+
MapReduce | ORGANIZATION | 0.82+
Hadoop Summit 2012 | EVENT | 0.79+
first response | QUANTITY | 0.79+
every eight hours | QUANTITY | 0.78+
SLA | TITLE | 0.77+
June | DATE | 0.77+
first comment | QUANTITY | 0.77+
Lastic MapReduce | TITLE | 0.69+
M five | OTHER | 0.69+
Boeing | ORGANIZATION | 0.68+
M five | TITLE | 0.67+
siliconangle.tv | OTHER | 0.67+
ground zero | QUANTITY | 0.67+
Wiki bond.org | ORGANIZATION | 0.62+
Apache | ORGANIZATION | 0.61+
4th of | EVENT | 0.6+

Dr. Amr Awadallah - Interview 1 - Hadoop World 2011 - theCUBE


 

>>Okay, we're back live in New York City for Hadoop World 2011. I'm John Furrier, the founder of SiliconANGLE.com, and we have a special walk-in guest: Amr Awadallah, the VP of engineering and co-founder of Cloudera, who's going to be on at two-thirty Eastern time on theCUBE to go more in depth. But since we saw him in the hallway, we had a quick spot and wanted to grab him in here. This is theCUBE, our flagship telecast, where we go out to the events and tap the smartest people, and I'm here with my co-host, Dave Vellante of Wikibon.org. Welcome back; you're a longtime CUBE alum, so appreciate you coming back on and doing a quick drive-by here. >>Thanks for the nice welcome. >>So, you know, we go talk to the smart people in the room, and you're one of the smartest guys that I know. We've been friends for years, and it was the tweet heard around the world, my tweet to you about finding space, and we've been sharing the office space at Cloudera for a year. >>And we're going to have to try to find you new space, because you're expanding so fast; we have to get a new home. Sorry about that. >>But I wanted to really thank you personally, here on live: you've enabled SiliconANGLE and Wikibon. We figured it out early because of you. I mean, we had our noses sniffing around the big data area before it was called big data, but when we met and talked, we had been tracking the social web, and really it's exploded in an amazing way. And I'm just really thankful, because I've had a front-row seat in the trenches with you guys, and it's been amazing. So I want to thank you. >>You're welcome, and it's great to have you on board. >>And so, you've been evangelizing in the trenches since Yahoo, you were at, uh, Accel Partners announcing the hundred-million-dollar fund, which is all great news today, but you've been the real spark at Cloudera. One of them, there are others, but I know you're one of the main sparks, a co-founder, you and your co-founder from Facebook. I mean, you both said this before: you saw the future at your companies, you saw the future where everybody is going to go next. And now Jeff's going to be on as well; he's now taking this whole data science thing, yep, building out a team; you've got to drill down on that with him. What do you think about all this? I mean, right now, how do you feel, personally, emotionally, looking at the marketplace? Share with us. >>Yeah, I'm very emotional today, actually. Lots of good news. You heard about the funding news? >>Yes, the hundred million dollars for startups. >>No, but the 40. >>Oh yeah, yeah. >>Actually, the news was supposed to come out today; it came out a bit earlier, yesterday. But yeah, I'm very, very emotional because of that. It's a testament, from very big-name investors, to how well we are doing, and a recognition of how big this wave really is. Also the hundred-million fund from Accel: that's also a huge testament, and hopefully lots of new innovations and startups will come out of that. So I'm very emotional about that, but also overwhelmed by the size of this event and how many people are really gravitating towards the technology, which shows how much work we still have to do going forward. >>A bit scared? >>A bit scared, yeah. >>Mike Olson is a great CEO, great on stage there, great guy. We love Mike; he's geeky and he's pragmatic, a strategist, and you've got Kirk, who's the operator. But he showed a slide up at his keynote that showed the evolution of Hadoop: yes, the core Hadoop, and then he showed it year by year, and now
we've got those columns extending and you've got new, new components coming out. Take us through that progression. Just go back a few years and walk us through: why is this going on so fast, what's the community doing, and, yeah, what happened in 2008, when it was just the one piece? >>Yeah, when we started. So, I mean, first, 2008, when we started: who was believing us back then that, hey, this thing is going to be big? Like, we had the belief, because we saw it happen firsthand, but many folks were dismissive: no, no, no, this big data thing is a fad and nobody will care about it. And lo and behold, today it's obviously proving not to be the case. In terms of the maturity of the platform, you're absolutely right. I mean, the slide that Mike showed showed that only thirty percent of the contributions happening today are in the Hadoop core layer. And the overall kind of vision there is very similar to an operating system, right? Except what this really is, is a data operating system, right? It's how to operate large amounts of data in a big data center. So, sorry, it's like an operating system for many machines, as opposed to Linux, which is an operating system for a single machine, right? So Hadoop, when it came out: Hadoop is only the kernel, it's only that inner layer. If you look at any operating system, like Windows or Linux and so on, the core functionality is two things: storing files and running applications on top of these files. That's what Windows does, that's what Linux does, and that's what Hadoop does at the heart. But then, to really get an operating system to work, you need many ancillary components around it that really make it functional: you need libraries, you need applications, you need integration, I/O devices, et cetera, et cetera. And that's really what's happening in the Hadoop world. So it started with the core OS layer, which is Hadoop: HDFS for storage, MapReduce for computation. But now all of these other things are showing up around that core kernel to really make it a fully functional, extensible data operating system. >>I wish I had a little replay button, but let's just hit pause on that, because this is kind of an important point for folks out there. There are a lot of metaphors used in this business: "it's the Linux of data," "it's just like Red Hat," right? We kind of use those terms. The business model: talk a little bit about that. We just mentioned, you know, "not like Linux"; just unpack that a little bit deeper for us. What's the difference? You mentioned Linux. Can you replay what you just said? That was really... >>So I was actually talking about the similarity, and then I can talk about the difference. The similarity is: the heart of Hadoop is a system for storing files, which is HDFS, and a system for running applications on top of these files, which is MapReduce. The heart of Linux is the same thing: a system for storing files, which is ext4, and a system for scheduling applications on top of these files. That's the same heart of Windows, and so on. The difference, though, and that's the similarity, the difference is: Linux is made to run on a single node, right? Hadoop is really made to run on many, many nodes. So Hadoop basically cares about taking a rack of servers, or a data center of servers, and having them look like one big, massive mainframe, built out of commodity hardware, that can store arbitrary amounts of data and run any type of workload on top of that data.
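The "storing files plus running applications" split is easiest to see in the canonical WordCount program that ships with Hadoop: the input sits in HDFS, and a small MapReduce job ships the counting out to the nodes that hold the blocks. Below is a minimal sketch of that standard example, with hypothetical input and output paths; it is also a preview of the level of plumbing Awadallah goes on to call the assembly language of Hadoop.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for every line of input, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the 1s for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/books"));        // hypothetical HDFS input
        FileOutputFormat.setOutputPath(job, new Path("/data/wordcounts")); // hypothetical HDFS output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

For comparison, the Hive equivalent he describes next is roughly one line of SQL: SELECT word, COUNT(*) FROM words GROUP BY word.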
>>Hence the new components, like the Hives of the world. >>So now these new components are coming up, like Hive, for example. Hive makes it easier to write queries for Hadoop. It's a SQL language for writing queries on top of Hadoop, so you don't have to go and write them in MapReduce, which we call the assembly language of Hadoop. If you write it in MapReduce, you will get the most flexibility and you will get the most performance, but only if you know what you're doing. It's very similar to writing machine code: if you write assembly, you'll be able to do anything, but you can also shoot yourself in the foot, right? It's the same thing with MapReduce. When you use Hive, Hive abstracts that out for you: you write SQL, and then Hive takes care of doing all of the plumbing work to get that compiled down to MapReduce for you. So that's Hive. HBase, for example, is a very nice system that augments Hadoop: it makes it low latency, and it makes it support update and insert and delete transactions, which HDFS does not support out of the box. >>So it's more like a database. >>It's more like MySQL, yeah. The analogy of MySQL to Linux is very similar to HBase to HDFS.
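To ground the MySQL analogy: where HDFS files are write-once, the HBase client exposes row-level inserts, point reads, updates, and deletes. Here is a minimal sketch against the classic, circa-2011 HBase Java API; the table, column family, and values are hypothetical, and later HBase releases replaced HTable with a Connection/Table API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrud {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");      // hypothetical table

        // Insert/update: put a cell into row "u42", column family "profile".
        Put put = new Put(Bytes.toBytes("u42"));
        put.add(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                Bytes.toBytes("amr@example.com"));
        table.put(put);

        // Read it back with a point lookup: low latency, no MapReduce job.
        Result row = table.get(new Get(Bytes.toBytes("u42")));
        String email = Bytes.toString(
                row.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email")));
        System.out.println("email = " + email);

        // Delete: something append-only HDFS files can't express in place.
        table.delete(new Delete(Bytes.toBytes("u42")));
        table.close();
    }
}
```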
>>And what's your take, with your founder's hat on now, on the business model? Similarities and differences with Red Hat? >>Yes, so actually they are different. I mean, the similarity stops at open source. We are both open source, right, in the sense that the core system is open source and available out there; you can look at the source code, and so on. The difference is that Red Hat actually has a license on their bits. So there's the source code, and then there are the bits, and when Red Hat compiles the source code into bits, you cannot deploy those bits without having a Red Hat license. With us it's very different. We have the source code, which is Apache; it's all in Apache. We compile the source code into a bunch of bits, which is our distribution, called CDH. These bits are one hundred percent open source: you can deploy them, use them, and you don't have to pay us anything. The only reason you would come back and pay us is for Cloudera Enterprise, which is really for when you go operational, when you become operational and mission-critical. Cloudera Enterprise gives you two things. First, it gives you a proprietary management suite that we built, and it's very unique to us; nobody in the market has anything close to what we have right now. It makes it easier for you to deploy, configure, monitor, provision, do capacity planning, security management, et cetera, for Hadoop. Nobody else has anything close to what we have right now for that management suite. That is unique to Cloudera and not part of Apache open source. >>Yes, it's not part of the open-source bits; you only get that as a subscriber to Cloudera. >>We do have a free version of that that's available for download, and it can run up to 50 nodes, just for you to get up and running quickly. And it's really very simple; it has a very simple installer. You should be able to go fire off that software and say, install Hadoop, these are my servers, and it would take care of everything else for you. It's like those installers, you know, when Windows came out in the beginning and it had this nice progress bar and you could install applications very easily; imagine that now for a cluster of servers, right? The other reason why people subscribe to Cloudera Enterprise, in addition to getting this management suite, is getting our support services. And support is necessary for any software, even if it's free; even for hardware. Think: if I give you a free airplane right now, just, here you go, here is an airplane, you can run this airplane and make money from passengers, but you still need somebody to maintain that plane for you, right? You can still go hire your own mechanics to maintain that airplane. But we tell you: if you subscribe with us as the mechanics for your airplane, the support you will get from us will be way better than anything else, and the economics of it will also be way better than having your own staff doing the maintenance for that airplane. >>Okay, final question, and we've got one minute, because we slid you in real quick; we're going to come back, folks, Amr is going to come back at two-thirty Eastern time, so come back and we'll have a more in-depth conversation. But just share with the folks watching your view of what's going on in Apache. You know, there's all this kind of weird, you know, FUD being thrown around, that Cloudera is not this and not that, and you guys are clearly the leader; we talked with Kirk about that, so we don't need to go into it. But just share what's going on, what's the real deal happening with Apache and the code? And you have a unique offering. >>I mean, the real deal, and I advise people to go look at the blog post that our CEO, Mike Olson, wrote, called "The Community Effect": the real deal is that there is a very big, healthy community developing the source code for Hadoop, the core system, which is HDFS and MapReduce, and all the components around that core system. We at Cloudera employ a very large engineering organization; that engineering organization is bigger than many of these other companies in the space in their entirety, and if you look at the whole company itself, it's much, much bigger than any of these other players. So we do a lot of contributions to the core system and to the projects around it. However, we are part of the community, and we're definitely doing this with the community; it's not just a Cloudera thing for the core platform. So that's the real deal. >>All right. So here we are with Amr, the co-founder. Congratulations: great funding, the hundred million from Accel Partners, who invested in you guys. Congratulations. You're part of the community, we all know that, just kind of clarifying that for the record, and you have a unique differentiator: the management suite, the enterprise offering, and the experience. >>Yeah, I think a huge differentiation we have is that we have been doing this for three years, ahead of everybody else. We have the experience across all the industries that matter, so when you come to us, we know how to do this in the finance industry, in the retail industry, in the health industry, and in government. So that's something also... >>So, just for the audience out there: Amr is coming back at two-thirty, and we're going to go deeper; he's the highly decorated general here, and thanks for the extra intel; he's in the uniform with the Cloudera logo. >>Yes, sir. >>Great. So we'll see you again, our great, great friend.

Published Date : May 1 2012

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Rebecca | PERSON | 0.99+
Mike | PERSON | 0.99+
Cloudera | ORGANIZATION | 0.99+
2008 | DATE | 0.99+
Excel | TITLE | 0.99+
Hadoop | TITLE | 0.99+
three years | QUANTITY | 0.99+
linux | TITLE | 0.99+
one-minute | QUANTITY | 0.99+
windows | TITLE | 0.99+
Michaels | PERSON | 0.99+
Jeff | PERSON | 0.99+
john furrier | PERSON | 0.99+
2011 | DATE | 0.99+
Linux | TITLE | 0.99+
Kirk | PERSON | 0.99+
today | DATE | 0.99+
thirty percent | QUANTITY | 0.99+
Yahoo | ORGANIZATION | 0.99+
hbase | TITLE | 0.98+
single note | QUANTITY | 0.98+
two things | QUANTITY | 0.97+
single note | QUANTITY | 0.97+
two bits | QUANTITY | 0.97+
dave vellante | PERSON | 0.97+
HDFS | TITLE | 0.97+
10 | QUANTITY | 0.97+
first | QUANTITY | 0.97+
Jerry | PERSON | 0.97+
facebook | ORGANIZATION | 0.97+
hundred L | QUANTITY | 0.96+
both | QUANTITY | 0.96+
million dollars | QUANTITY | 0.96+
one hundred percent | QUANTITY | 0.95+
Red Hat | TITLE | 0.95+
August | DATE | 0.95+
MapReduce | TITLE | 0.95+
Amr Awadallah | PERSON | 0.95+
tomorrow | DATE | 0.94+
hundred million | QUANTITY | 0.94+
Dr. | PERSON | 0.94+
hundred million dollar | QUANTITY | 0.94+
up to 15 hours | QUANTITY | 0.93+
hadoop | TITLE | 0.93+
Windows | TITLE | 0.93+
single machine | QUANTITY | 0.92+
HBase | TITLE | 0.92+
new york city | LOCATION | 0.9+
years | QUANTITY | 0.9+
a year | QUANTITY | 0.9+
Apache | ORGANIZATION | 0.9+
one | QUANTITY | 0.89+
a lot of people | QUANTITY | 0.87+
red hat | TITLE | 0.85+
Hadoop World | TITLE | 0.84+
SiliconANGLE | ORGANIZATION | 0.82+
two-thirty | DATE | 0.8+
Fudd | PERSON | 0.77+
Michaelson road | PERSON | 0.74+