Breaking Analysis: How JPMC is Implementing a Data Mesh Architecture on the AWS Cloud
>> From theCUBE studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante. >> A new era of data is upon us, and we're in a state of transition. You know, even our language reflects that. We rarely use the phrase big data anymore; rather, we talk about digital transformation, or digital business, or data-driven companies. Many have come to the realization that data is not the new oil, because unlike oil, the same data can be used over and over for different purposes. We still use terms like data as an asset. However, that same narrative, when it's put forth by the vendor and practitioner communities, includes further discussions about democratizing and sharing data. Let me ask you this: when was the last time you wanted to share your financial assets with your coworkers or your partners or your customers? Hello everyone, and welcome to this week's Wikibon Cube Insights powered by ETR. In this Breaking Analysis, we want to share our assessment of the state of the data business. We'll do so by looking at the data mesh concept and how a leading financial institution, JP Morgan Chase, is practically applying these relatively new ideas to transform its data architecture. Let's start by looking at what a data mesh is. As we've previously reported many times, data mesh is a concept and set of principles that was introduced in 2018 by Zhamak Dehghani, who's director of technology at ThoughtWorks, a global consultancy and software development company. She created this movement because her clients, some of the leading firms in the world, had invested heavily in predominantly monolithic data architectures that had failed to deliver the desired outcomes and ROI. So her work went deep into trying to understand that problem.
And her main conclusion that came out of this effort was that the world of data is distributed, and shoving all the data into a single monolithic architecture is an approach that fundamentally limits agility and scale. Now, a profound concept of data mesh is the idea that data architectures should be organized around business lines with domain context, and that the highly technical and hyper-specialized roles of a centralized, cross-functional team are a key blocker to achieving our data aspirations. This is the first of four high-level principles of data mesh. So first, again, the business domain should own the data end-to-end, rather than have it go through a centralized big data technical team. Second, a self-service platform is fundamental to a successful architectural approach, where data is discoverable and shareable across an organization and an ecosystem. Third, product thinking is central to the idea of data mesh; in other words, data products will power the next era of data success. And fourth, data products must be built with governance and compliance that is automated and federated. Now, there's a lot more to this concept, and there are tons of resources on the web to learn more, including an entire community that has formed around data mesh, but this should give you a basic idea. Now, the other point is that, in observing Zhamak Dehghani's work, she has deliberately avoided discussions around specific tooling, which I think has frustrated some folks, because we all like to have references that tie to products and tools and companies. So this has been a double-edged sword: on the one hand it's good, because data mesh is designed to be tool-agnostic and technology-agnostic; on the other hand, it's led some folks to take liberties with the term data mesh and claim mission accomplished when their solution, you know, may be more marketing than reality. So let's look at JP Morgan Chase and their data mesh journey.
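To make those four principles a little more concrete, here's a minimal sketch, entirely my own illustration and not anything JPMC or Dehghani has published, of what a domain-owned, discoverable data product might look like. All class and field names here are hypothetical shorthand for the principles, not a real framework.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Principle one: a data product owned end-to-end by a business domain."""
    name: str
    domain: str              # owning line of business
    owner: str               # accountable product owner
    tags: list = field(default_factory=list)        # discoverability metadata
    governance: dict = field(default_factory=dict)  # federated policy, e.g. {"pii": True}

class MeshCatalog:
    """Principle two: self-service discovery. A trivial in-memory stand-in
    for a real metadata catalog."""
    def __init__(self):
        self._products = {}

    def register(self, product: DataProduct):
        self._products[product.name] = product

    def discover(self, tag: str):
        """Any authorized user can find products by tag, regardless of domain."""
        return [p for p in self._products.values() if tag in p.tags]

catalog = MeshCatalog()
catalog.register(DataProduct("party", "wholesale-credit", "alice",
                             tags=["customer", "kyc"], governance={"pii": True}))
catalog.register(DataProduct("trade-position", "markets", "bob",
                             tags=["trade"]))

hits = catalog.discover("kyc")  # discovery works across domains
```

The point of the sketch is the shape, not the implementation: ownership lives with the domain, while discovery and governance metadata are shared across the mesh.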
This is why I got really excited when I saw, this past week, that a team from JPMC held a meetup to discuss what they called data lake strategy via data mesh architecture. I saw that title and thought, well, that's a weird title, and I wondered, are they just taking their legacy data lakes and claiming they're now transformed into a data mesh? But in listening to the presentation, which was over an hour long, the answer is a definitive no, not at all, in my opinion. A gentleman named Scott Hollerman organized the session, which comprised these three speakers here: James Reid, who's a divisional CIO at JPMC; Arup Nanda, who is a technologist and architect; and Serita Bakst, who is an information architect, again, all from JPMC. This was the most detailed and practical discussion that I've seen to date about implementing a data mesh. And this is JP Morgan's approach, and we know they're extremely savvy and technically sound, and they've invested, it has to be billions, in the past decade on data architecture across their massive company. And rather than dwell on the downsides of their big data past, I was really pleased to see how they're evolving their approach and embracing new thinking around data mesh. So today, we're going to share some of the slides that they used and comment on how it dovetails into the concept of data mesh that Zhamak Dehghani has been promoting, at least as we understand it, and dig a bit into some of the tooling that is being used by JP Morgan, particularly around its AWS cloud. So the first point is, it's all about business value. JPMC, they're in the money business, and in that world, business value is everything. So James Reid, the CIO, showed this slide and talked about their overall goals, which centered on a cloud-first strategy to modernize the JPMC platform. I think it's simple and sensible, but there were three factors on which he focused. Number one, cut costs, always, you've got to do that.
Number two was about unlocking new opportunities, or accelerating time to value. But I was really happy to see number three: data reuse. That's a fundamental value ingredient in the slide that he's presenting here. And his commentary was all about aligning with the domains and maximizing data reuse, i.e. data is not like oil, and making sure there's appropriate governance around that. Now, don't get caught up in the term data lake; I think it's just how JP Morgan communicates internally. It's invested in the data lake concept, so they use water analogies. They use things like data puddles, for example, which are single-project data marts, or data ponds, which comprise multiple data puddles, and these can feed into data lakes. And as we'll see, JPMC doesn't strive to have a single version of the truth from a data standpoint that resides in a monolithic data lake; rather, it enables the business lines to create and own their own data lakes that comprise fit-for-purpose data products. And they do have a single source of truth for metadata, okay, we'll get to that. But generally speaking, each of the domains will own their own data end-to-end and be responsible for those data products; we'll talk about that more. Now, the genesis of this was sort of a cloud-first platform. JPMC is leaning into public cloud, which is ironic since in the early days of cloud, all the financial institutions were like, never. Anyway, JPMC is going hard after it; they're adopting agile methods and microservices architectures, and it sees cloud as a fundamental enabler, but it recognizes that on-prem data must be part of the data mesh equation. Here's a slide that starts to get into some of that generic tooling, and then we'll go deeper. And I want to make a couple of points here that tie back to Zhamak Dehghani's original concept. The first is that, unlike many data architectures, this puts data as products right in the fat middle of the chart.
The data products live in the business domains and are at the heart of the architecture. The databases, the Hadoop clusters, the files and APIs on the left-hand side, they serve the data product builders. The specialized roles on the right-hand side, the DBAs, the data engineers, the data scientists, the data analysts, we could have put in quality engineers, et cetera, they serve the data products. Because the data products are owned by the business, they inherently have the context that is the middle of this diagram. And you can see at the bottom of the slide, the key principles include domain thinking and end-to-end ownership of the data products. They build it, they own it, they run it, they manage it. At the same time, the goal is to democratize data with self-service as a platform. One of the biggest points of contention of data mesh is governance, and as Serita Bakst said on the meetup, metadata is your friend. And she kind of made a joke; she said, "This sounds kind of geeky, but it's important to have a metadata catalog to understand where data resides and the data lineage and overall change management." So to me, this really passed the data mesh sniff test pretty well. Let's look at data as products. CIO Reid said the most difficult thing for JPMC was getting their heads around data products, and they spent a lot of time getting this concept to work. Here's the slide they used to describe their data products as it related to their specific industry. They said a common language and taxonomy is very important, and you can imagine how difficult that was. He said, for example, it took a lot of discussion and debate to define what a transaction was. But you can see at a high level these three product groups around wholesale credit risk, party, and trade and position data as products, and each of these can have sub-products; party, for example, will have know your customer, KYC.
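Serita's point about the metadata catalog, knowing where data resides, its lineage, and its change history, can be sketched in a few lines. This is a toy illustration with hypothetical names, not JPMC's actual catalog.

```python
# A toy metadata store: where each dataset lives, and what it was derived from.
class MetadataCatalog:
    def __init__(self):
        self.location = {}   # dataset -> where it resides
        self.upstream = {}   # dataset -> datasets it was derived from

    def record(self, dataset, location, derived_from=()):
        self.location[dataset] = location
        self.upstream[dataset] = list(derived_from)

    def lineage(self, dataset):
        """Walk upstream edges to find every source feeding a dataset."""
        sources, stack = set(), [dataset]
        while stack:
            d = stack.pop()
            for up in self.upstream.get(d, []):
                if up not in sources:
                    sources.add(up)
                    stack.append(up)
        return sources

cat = MetadataCatalog()
cat.record("raw_trades", "s3://raw")
cat.record("positions", "s3://refined", derived_from=["raw_trades"])
cat.record("credit_exposure", "s3://trusted", derived_from=["positions"])

upstream_of_exposure = cat.lineage("credit_exposure")
```

With lineage recorded at registration time, change management becomes a graph question: if `raw_trades` changes, everything downstream of it can be found and re-validated.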
So a key for JPMC was to start at a high level and iterate to get more granular over time. So lots of decisions had to be made around who owns the products and the sub-products. The product owners, interestingly, had to defend why that product should even exist, what boundaries should be in place, and what data sets do and don't belong in the various products. And this was a collaborative discussion; I'm sure there was contention between the lines of business around that, and around which sub-products should be part of these circles. They didn't say this, but tying it back to data mesh, each of these products, whether in a data lake or a data hub or a data pond or a data warehouse or a data puddle, each of these is a node in the global data mesh that is discoverable and governed. And supporting this notion, Serita said that this should not be infrastructure-bound; logically, any of these data products, whether on-prem or in the cloud, can connect via the data mesh. So again, I felt like this really stayed true to the data mesh concept. Now, let's look at some of the key technical considerations that JPMC discussed in quite some detail. This chart here shows a diagram of how JP Morgan thinks about the problem, and some of the challenges they had to consider were: how to write to various data stores; can you, and how can you, move data from one data store to another; how can data be transformed; where is the data located; can the data be trusted; how can it be easily accessed; who has the right to access that data? These are all problems that technology can help solve. And to address these issues, Arup Nanda explained that the heart of this slide is the data ingestor, instead of ETL. All data producers and contributors send their data to the ingestor, and the ingestor then registers the data so it's in the data catalog. It does a data quality check and it tracks the lineage.
Then data is sent to the router, which persists the data in the data store based on the best destination as informed by the registration. This is designed to be a flexible system; in other words, the data store for a data product is not fixed. It's determined at the point of inventory, and that allows changes to be easily made in one place; the router simply reads that optimal location and sends the data to the appropriate data store. Now, as you see, the schema inferrer is used when there is no clear schema on write. In this case, the data product is not allowed to be consumed until the schema is inferred: the data goes into a raw area, the inferrer determines the schema and then updates the inventory system, so that the data can be routed to the proper location and properly tracked. So that's some of the detail of how the sausage factory works in this particular use case; it was very interesting and informative. Now let's take a look at the specific implementation on AWS and dig into some of the tooling. As described in some detail by Arup Nanda, this diagram shows the reference architecture used by this group within JP Morgan, and it shows all the various AWS services and components that support their data mesh approach. So start with the authorization block right there underneath Kinesis. AWS Lake Formation is the single point of entitlement, and there are a number of buckets including, you can see there, the raw area that we just talked about, a trusted bucket, a refined bucket, et cetera. Depending on the data characteristics, at the data catalog registration block, where you see the Glue catalog, that determines in which bucket the router puts the data.
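As I understand the flow Arup described, the ingestor registers and quality-checks incoming data, the schema inferrer handles anything without a schema on write, and the router looks up the registered destination. Here's a rough sketch of that pattern; every name is my own hypothetical shorthand for the talk's concepts, not JPMC code.

```python
# Hypothetical sketch of the ingestor / schema inferrer / router pattern.
RAW_AREA = "raw"

# registration/inventory: data product -> destination store, changeable in one place
inventory = {
    "trades": "warehouse",
    "positions": "lake",
}

def infer_schema(record):
    """Stand-in for the schema inferrer: derive field types from a sample record."""
    return {key: type(value).__name__ for key, value in record.items()}

def ingest(product, record, schema=None):
    """Register, quality-check, and route one record."""
    if not record:                      # trivial stand-in for a data quality check
        raise ValueError("empty record fails the quality check")
    if schema is None:                  # no schema on write: infer one, and register
        schema = infer_schema(record)   # unknown products to land in the raw area
        inventory.setdefault(product, RAW_AREA)
    destination = inventory.get(product, RAW_AREA)   # the router's lookup
    return {"product": product, "schema": schema, "store": destination}

routed = ingest("trades", {"id": 1, "notional": 5_000_000.0})
print(routed["store"])  # trades is registered to the warehouse -> warehouse
```

The flexibility point is the `inventory` lookup: moving a product to a different store is a one-line registration change, and the router picks it up on the next write.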
And you can see the many AWS services in use here: identity, EMR, the Elastic MapReduce cluster from the legacy Hadoop work done over the years, Redshift Spectrum and Athena. JPMC uses Athena for single-threaded workloads and Redshift Spectrum for nested types, so they can be queried independent of each other. Now remember, very importantly, in this use case there is not a single Lake Formation; rather, multiple lines of business will be authorized to create their own lakes, and that creates a challenge. So how can that be done in a flexible and automated manner? And that's where the data mesh comes into play. So JPMC came up with this federated Lake Formation accounts idea, and each line of business can create as many data producer or consumer accounts as they desire and roll them up into their master line-of-business Lake Formation account, and they cross-connect these data products in a federated model. And these all roll up into a master Glue catalog, so that any authorized user can find out where a specific data element is located. So this is like a superset catalog that comprises multiple sources and stays in sync across the data mesh. So again, to me, this was a very well-thought-out and practical application of data mesh. Yes, it includes some notion of centralized management, but much of that responsibility has been passed down to the lines of business. It does roll up to a master catalog, but that's a metadata management effort that seems compulsory to ensure federated and automated governance. As well, at JPMC, the office of the chief data officer is responsible for ensuring governance and compliance throughout the federation. All right, so let's take a look at some of the suspects in this world of data mesh and bring in the ETR data. Now, of course, ETR doesn't have a data mesh category; there's no such thing as a data mesh vendor. You build a data mesh, you don't buy it.
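The federated-catalog idea, each line of business keeps its own catalog and a master superset catalog is the merged view, can be modeled in a few lines. This is purely illustrative: the account names and structure here are invented, not actual JPMC or AWS account layouts.

```python
# Toy model of the federated catalog rollup: each line of business keeps its own
# catalog of {product: account}, and the master catalog is the merged view.
lob_catalogs = {
    "wholesale-credit": {"party": "lf-account-101", "kyc": "lf-account-102"},
    "markets": {"trades": "lf-account-201", "positions": "lf-account-201"},
}

def build_master_catalog(catalogs):
    """Roll every line-of-business catalog up into one superset catalog,
    refusing silent collisions between product names."""
    master = {}
    for lob, products in catalogs.items():
        for product, account in products.items():
            if product in master:
                raise ValueError(f"duplicate product name: {product}")
            master[product] = {"lob": lob, "account": account}
    return master

master = build_master_catalog(lob_catalogs)
# Any authorized user can now locate a data element across the federation:
print(master["positions"])  # -> {'lob': 'markets', 'account': 'lf-account-201'}
```

The design choice worth noting is that only metadata rolls up; the data itself stays in the line-of-business accounts, which is what keeps the governance federated rather than centralized.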
So what we did is we used the ETR dataset to select and filter on some of the culprits that we thought might contribute to the data mesh, to see how they're performing. This chart depicts a popular view that we often like to share. It's a two-dimensional graphic with net score, or spending momentum, on the vertical axis and market share, or pervasiveness in the data set, on the horizontal axis. And we filtered the data on sectors such as analytics, data warehouse, and the adjacencies to things that might fit into data mesh. And we think that these pretty well reflect participation; data mesh is certainly not all-encompassing, and this is a subset, obviously, of all the vendors who could play in the space. Let's make a few observations. Now, as is often the case, Azure and AWS are almost literally off the charts, with very high spending velocity and a large presence in the market. Oracle, you can see, also stands out, because much of the world's data lives inside of Oracle databases. It doesn't have the spending momentum or growth, but the company remains prominent. And you can see Google Cloud doesn't have nearly the presence in the dataset, but its momentum is highly elevated. Remember that red dotted line there, the 40% line: anything over that indicates elevated spending momentum. Let's go to Snowflake. Snowflake has consistently shown itself to be the gold standard in net score in the ETR dataset. It continues to maintain highly elevated spending velocity in the data. And in many ways, Snowflake, with its data marketplace and its data cloud vision and data sharing approach, fits nicely into the data mesh concept. Now, a caution: Snowflake has used the term data mesh in its marketing, but in our view it lacks clarity, and we feel like they're still trying to figure out how to communicate what that really is. But there is really, we think, a lot of potential to that vision.
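For readers unfamiliar with the vertical axis: net score is, roughly, the share of survey respondents increasing spend on a vendor minus the share decreasing it. The arithmetic below is an illustration on made-up responses; treat the exact response categories and weights as assumptions, since ETR's published methodology is more granular than this.

```python
# Illustrative only: rough net-score arithmetic on made-up survey responses.
responses = ["increase", "increase", "flat", "decrease", "increase", "flat"]

def net_score(responses):
    """Percent of respondents increasing spend minus percent decreasing it."""
    n = len(responses)
    up = sum(r == "increase" for r in responses) / n
    down = sum(r == "decrease" for r in responses) / n
    return round((up - down) * 100, 1)

score = net_score(responses)
print(score)  # 3 up, 1 down out of 6 -> 33.3
# In the chart, anything above the 40% dotted line indicates elevated momentum.
```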
Databricks is also interesting, because the firm has momentum, and we expect further elevated levels on the vertical axis in upcoming surveys, especially as it readies for its IPO. The firm has a strong product and managed service, and is really one to watch. Now, we included a number of other database companies for obvious reasons, like Redis and Mongo, MariaDB, Couchbase and Teradata. SAP as well is in there; that's not all database, but SAP is prominent, so we included them, as is IBM, more of a traditional database player, also with a big presence. Cloudera includes Hortonworks, and HPE Ezmeral comprises the MapR business that HPE acquired. So these guys got the big data movement started: Cloudera; Hortonworks, which was born out of Yahoo, the early Hadoop innovator; and of course MapR; and now that's all come together in various forms. And of course we've got Talend and Informatica in there, two data integration companies that are worth noting. We also included some of the AI and ML specialists and data science players in the mix, like DataRobot, who just did a monster $250 million round, Dataiku, H2O.ai, and ThoughtSpot, which is all about democratizing data and injecting AI, and I think fits well into the data mesh concept. And you know, we put VMware Cloud in there for reference, because it really is the predominant on-prem infrastructure platform. All right, let's wrap with some final thoughts here. First, thanks a lot to the JP Morgan team for sharing this data. I really want to encourage practitioners and technologists to go watch the YouTube video of that meetup; we'll include the link with this session. And thank you to Zhamak Dehghani and the entire data mesh community for the outstanding work that you're doing, challenging the established conventions of monolithic data architectures.
The JPM presentation gives real credibility; it takes data mesh well beyond concept and demonstrates how it can be, and is being, done. And you know, this is not a perfect world; you're going to start somewhere, and there are going to be some failures. The key is to recognize that shoving everything into a monolithic data architecture won't support the massive scale and agility that you're after. It's maybe fine for smaller use cases in smaller firms, but if you're building a global platform in a data business, it's time to rethink data architecture. Now, much of this is enabled by the cloud, but cloud-first doesn't mean cloud-only, and it doesn't mean you'll leave your on-prem data behind. On the contrary, you have to include non-public-cloud data in your data mesh vision, just as JPMC has done. You've got to get some quick wins; that's crucial so you can gain credibility within the organization and grow. And one of the key takeaways from the JP Morgan team is, there is a place for dogma, like organizing around data products and domains and getting that right. On the other hand, you have to remain flexible, because technologies are going to come and technologies are going to go, so you've got to be flexible in that regard. And look, if you're going to embrace the metaphor of water, like puddles and ponds and lakes, we suggest, maybe a little tongue in cheek, but still we believe in this, that you expand your scope to include the data ocean, something John Furrier and I have talked about and laughed about extensively on theCUBE. Data oceans, it's huge. It's the new data lake: go transcend the data lake, think oceans. And think about this: just as we're evolving our language, we should be evolving our metrics. Much of the last decade of big data was about just getting the stuff to work, getting it up and running, standing up infrastructure and managing massive, how much data you got, massive amounts of data.
And there were many KPIs built around, again, standing up that infrastructure and ingesting data, a lot of technical KPIs. This decade is not just about enabling better insights; it's more than that. Data mesh points us to a new era of data value, and that requires new metrics around monetizing data products, like how long does it take to go from data product conception to monetization, and how does that compare to what it is today? And what is the time to quality? If the business owns the data and the business has the context, the quality that comes out of the chute should be, at a basic level, pretty good, and at a higher mark than what comes out of a big data team with no business context. Automation, AI, and, very importantly, organizational restructuring of our data teams will heavily contribute to success in the coming years. So we encourage you: learn, lean in, and create your data future. Okay, that's it for now. Remember, these episodes are all available as podcasts wherever you listen; all you've got to do is search Breaking Analysis podcast, and please subscribe. Check out ETR's website at etr.plus for all the data and all the survey information. We publish a full report every week on wikibon.com and siliconangle.com. And you can get in touch with us: email me at david.vellante@siliconangle.com, you can DM me @dvellante, or you can comment on my LinkedIn posts. This is Dave Vellante for theCUBE Insights powered by ETR. Have a great week everybody; stay safe, be well, and we'll see you next time. (upbeat music)
Jacque Istok, Pivotal | BigData NYC 2017
>> Announcer: Live from midtown Manhattan, it's theCUBE, covering Big Data New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. >> Welcome back everyone, we're here live in New York City for the week, three days of wall-to-wall coverage of Big Data NYC. It's big data week here, in conjunction with Strata Hadoop, Strata Data, which is an event running right around the corner. This is theCUBE, I'm John Furrier with my cohost, Peter Burris. Our next guest is Jacque Istok, who's the head of data at Pivotal. Welcome to theCUBE, good to see you again. >> Likewise. >> You guys had big news we covered at VMware; obviously the Kubernetes craze is fantastic, and you're starting to see cloud native platforms front and center even in some of these operational worlds. In cloud and data, you guys have been here a while with Greenplum, and Pivotal's been adding more to the data suite, so you guys are a player in this ecosystem. >> Correct. >> As it grows to be much more developer-centric and enterprise-centric and AI-centric, what's the update? >> I'd like to talk about a couple things, just three quick things here, focused primarily on simplicity. First and foremost, as you said, there's a lot going on on the Cloud Foundry side, a lot that we're doing with Kubernetes, etc., super exciting. I will say Tony Baer has written a nice piece about Greenplum on ZDNet, essentially calling Greenplum the best-kept secret in the analytic database world. Why I think that's important is that what isn't really well known is that over the period of Pivotal's history, the last four and a half years, we focused really heavily on the Cloud Foundry side, on dev/ops, on getting users to actually be able to publish code.
What we haven't talked about as much is what we're doing on the data side, and I find it very interesting to repeatedly tell analysts and customers that the Greenplum business has been, and continues to be, a profitable business unit within Pivotal. So as we're growing on the Cloud Foundry side, we're continuing to grow a business that many of the organizations that I see here at Strata are still looking to get to, that often-forgotten profitability zone. >> There's a legacy around Greenplum, and I'm not going to say they pivoted, pun intended, Pivotal, but there's been added stuff around Greenplum; Greenplum might get lost in the messaging because it's now one of many ingredients, right? >> It's true, and when we formed Pivotal, I think there were 34-some different SKUs that we have now focused in on over the last two years or so. What's super exciting is, again, over that time period, one of the things that we took to heart on the Greenplum side is this idea of extreme agile. As you guys know, Pivotal Labs, being a core part of the Pivotal mission, helps our customers figure out how to actually build software. We finally are drinking our own champagne, and over the last year and a half of Greenplum R&D, we're shipping code, a complete data platform, on a cadence of about four to five weeks, which again is a little bit unheard of in the industry, being able to move at that pace. We work through the backlog, and what is also super exciting, and I'm glad that you guys are able to help me tell the world, is that we released version five last week. Version five is actually the only parallel open source data platform that has native ANSI-compliant SQL, and I feel a little bit like I've rewound the clock 15 years in that I have to actually throw in the ANSI compliance, but I think that in a lot of ways, the SQL alternatives that are out there in the world are very much not ANSI compliant, and that hurts.
>> It's a nuance but it's table stakes in the enterprise. ANSI compliance is just, >> There's a reason you want to be ANSI compliant, because there's a whole swath of analytic applications mainly in the data warehouse world, that were built using ANSI compliant SQL, so why do this with version five? I presume it's got to have something to do with you want to start capturing some of those applications and helping customers modernize. >> That is correct. I think the SQL piece is one part of the data platform, of really a modern data platform. The other parts are again, becoming table stakes. Being able to do text analytics, we've backed Apache Solar within Green Plum, being able to do graph analytics or spatial analytics, anything from classifications, regressions, all of that, actually becomes table stakes and we feel that enterprises have suffered a little bit over the last five or six years. They've had this promise of having a new platform that they can leverage for doing interesting new things, machine learning, AI, etc. but the existing stuff that they were trying to do has been super, super hard. What we're trying to do is bridge those together and provide both in the same platform, out of the gate so that customers can actually use it immediately and I think one of the things we've seen is there's about 1000 to one SQL experienced individuals within the enterprise versus say Haduk experience in individuals. The other thing that I think is actually super important and almost bigger than everything else I talked about is we're the, a lot of the old school postgres deriviants of MBD databases forked their databases at some point in postgres's history, for a variety of reasons from licensing to when they started. Green Plum's no different. We forked right around eight dot too with this last release of version five, we've actually up leveled the postgres base within Green Plum's 8.3. Now in and of itself, it doesn't sound, >> What does that mean? 
>> We are now taking a 100% commitment both to open source and both to the postgres community. I think if you look at postgres today, in its latest versions, it is a full fledged, mission critical database that can be used anywhere. What we feel is that if we can bring our core engineering developments around parallelism, around analytics and combine that with postgres itself, then we don't have to implement all of the low level database things that a lot of our competitors have to do. What's unique about it is one, Green Plum continues to be open source, which again most of our competitors are not, two if you look at primarily what they're doing, nobody's got that level of commitment to the postgres community which means all of their resources are going to be stuck building core database technology, even building that ANSI SQL compliance in, which we'll get "for free" which will let us focus on things like machine learning, artificial intelligence. >> Just give a quick second and tell about the relevance of postgres because of the success, first of all it's massive, it's everywhere, but it's not going anywhere. Just give a quick, for the audience watching, what's the relevance of it. >> Sure like you said, it is everywhere. It is the most full featured, actual database in the open source community. Arguably my SQL has "more" market share, but my SQL projects that generally leverage them are not used for mission critical enterprise applications. Being able to have parity allows us not only to have that database technology baked into Green Plum, but it also gives us all of the community stuff with it. Everything from being able to leverage the most recent ODBC and JDBC libraries, but also integrations into everything from the post GIS travert for geospatial to being able to connect to other types of data sources, etc. >> It's a big community, shows that it's successful, but again, >> And it doesn't come in a red box. 
>> It does not come in a red box, that is correct. >> Which is not a bad thing. Look, Postgres as a technology was developed a long time ago, largely in response to thinking about where analytics and transactional, or analytics and operational, applications might eventually come together, and we're now living in a world where we can actually see the hardware and a lot of practices, etc. beginning to find ways where this may start to happen. With Greenplum building its MPP engine on Postgres, by going this route you're able to stay more modern, more up to date on all the new technology that's coming together to support these richer, more complex classes of applications. >> You're spot on. I suppose I would argue that Postgres, I feel, came up as a response to Oracle in the past, of we need an open source alternative to Oracle, but other than that, 100% correct. >> There was always a difference between Postgres and MySQL. MySQL always was, okay, let's do that open source; Postgres, coming out of Berkeley and coming out of some other places, always had a slightly different notion of the types of problems it was going to take on. >> 100% correct, 100%. But to your question before, what does this all mean to customers, I think the one thing that version five really gives us the confidence to say is, and a lot of times I hate lobbing when the ball's out like this, but we welcome and embrace with open arms any Teradata customers out there that are looking to save millions if not tens of millions of dollars on a modern platform that can actually run not only on premise, not only on bare metal, but virtually and off premise. We're truly the only open source MPP data platform that can allow you to build analytics and move those analytics from Amazon to Azure to back on prem. >> Talk about this, the Teradata thing, for a second. I want to get down and double-click on that.
Customers don't want to change code, so what specifically are you guys offering Teradata customers? >> With the release of version five, with a lot of the development that we've done and some of the partnering that we've done, we are now able, without changing a line of code of your Teradata applications, to let you load the data within the Greenplum platform, point those applications directly to Greenplum, and run them unchanged. So I think in the past, the reticence to move to any other platform was really the amount of time it would take to actually redevelop all of the stuff that you had. We offer an ability to go from an immediate ROI to a platform that, again, bridges that gap and allows you to really be modern. >> Peter, I want to talk to you about that importance that we just said, because you've been studying the private cloud report, true private cloud, which is on premises but with a cloud operating model, automating away undifferentiated labor and shifting that to differentiated labor. But this brings up what customers want in hybrid cloud, ultimately having public cloud and private cloud so hybrid sits there. They don't want to change their code bases; this is a huge deal. >> Obviously a couple things to go along with what Jacque said. The first thing is that you're right, people want the data to run where the data naturally needs to run or should run. That's the big argument about public versus hybrid versus what we call true private cloud. The idea is that increasingly the workload needs to be located where the data naturally should be located because of the physical, legal, regulatory, and intellectual property attributes of the data. Being able to do that is really, really important.
The other thing that Jacque said that goes right into this question, John, is that ultimately, in too many domains in this analytics world, which is fundamentally predicated on the idea of breaking data out of applications so that you can use it in new and novel and more value-creating ways, the data gets locked up in a data warehouse. What's valuable in a data warehouse is not the hardware. It's the data. By providing the facility for being able to point an application at a couple of different data sources, including one that's more modern, or which takes advantage of more modern technology that can be considerably cheaper, it means the shop can elevate the story about the asset, and the asset here is the data and the applications that run against it, not the hardware and the system where the data's stored and located. One of the biggest challenges, we talked earlier, just to go on for a second, with a couple of other guests, is the fact that the industry, your average person, still doesn't understand how to value data, how to establish a data asset, and one of the reasons is because it's so constantly co-mingled with the underlying hardware. >> And actually I'd even go further. I think the advent of some of these cloud data warehouses forgets that notion of being able to run in different places, and provides one of the things that customers are really looking for, which is simplicity. The ability to spin up a quick MPP SQL system within, say, Amazon for example. Almost without a doubt, a lot of the business users that I speak to are willing to sacrifice capabilities within the platform for the simplicity of getting up and going.
One of the things that we really focused on in V5 is being able to give that same turnkey feel, and so Greenplum exists within the Amazon marketplace, within the Azure marketplace, and Google later this quarter. Then in addition to the simplicity, it has all of the functionality that is missing in those platforms, again, all the analytics, all the ability to reach out and federate queries against different types of data. I think it's exciting as we continue to progress in our releases. Greenplum has, for a number of years, had this ability to seamlessly query HDFS, like a lot of the competitors, but HDFS isn't going away, and neither is a generic object store like S3. We continue to extend that to things like Spark, for example, so now you have the ability to actually house your data within a data platform and seamlessly integrate with Spark back and forth. If you want to use Spark, use Spark, but somewhere that data needs to be materialized so that other applications can leverage it as well. >> But even then people have been saying, well, if you want to put it on this disk, then put it on this disk. The question about Spark versus another database manager is a higher-level conversation for many of the shops who are investing millions and millions and millions of dollars in their analytic application portfolio, and all you're trying to do, as I interpret it, is say look, the value in the portfolio is the applications and the data. It's not the underlying elements. There's a whole bunch of new elements we can use. You can put it in the cloud, you can put it on premise if that's where the data belongs. Use some of these new and evolving technologies, but you're focused on how the data and the applications continue to remain valuable to the business over time, and not the traditional hardware assets.
>> Correct, and I'll again leverage a notion that we get from labs, which is this idea of user-centric design, and so everything that we've been putting into the Greenplum database is around, ideally, the four primary users of our system. Not just the analysts and not just the data scientists, but also the operators and the IT folks. That is where I'd say the last tenet of where we're going really is this idea of coopetition. As the Pivotal Greenplum guy that's been around for 10-plus years, I would tell you very straight up that we are, again, an open source MPP data platform that can rival any other platform out there. Whether it's Teradata, whether it's Hadoop, we can beat that platform. >> Why should customers call you up? Why should they call you? There's all this other stuff out there, you got legacy, you got Teradata, might have other things, people are knocking at my door, they're getting pounded with sales messages, buy me, I'm better than the other guy. Why Pivotal data? >> The first thing I would say is, the latest reviews from Gartner for example, well, actually let me rewind. I will easily argue that Teradata has been the data warehouse platform for the last 30 years that everyone has tried to emulate. I'd even argue that when Hadoop came on the scene eight years ago, what they did was change the dynamics, and what they're doing now is actually trying to emulate the Teradata success through things like SQL on top of Hadoop. What that has basically gotten us to is: we're looking for a Teradata replacement at Hadoop-like prices, and that's what Greenplum has to offer in spades. Now, if you actually extend that just a little bit, I still recognize that not everybody's going to call us; there are still 200 other vendors out there that are selling a similar product or similar kinds of stories.
What I would tell you in response to those folks is that Greenplum has been around in production for the last 10-plus years. We're a proven technology for solving problems; many of those are not. We work very well in this cooperative spirit of, Greenplum can be the end-all be-all, but I recognize it's not going to be the end-all be-all, so this is why we have to work within the ecosystem. >> You have to, open source is dominating. At the Linux event, we just covered Open Source Summit: 90% of software written will be open source libraries, 10% is where the value's being added. >> For sure. If you were to start a new startup right now, would you go with a commercial product? >> No, just the Postgres database is good. All right, final question to end the segment. This big data space that's now being called data, certainly Strata Hadoop is now Strata Data, just trying to keep that show going longer. But you got Microsoft Azure making a lot of waves right now with Microsoft Ignite, so cloud is into the play here, data's changed, so the question is how has this industry changed over the past eight years. You go back to 2010, I saw Greenplum coming prior to even getting bought out, but they were kicking ass, same product evolved. Where has the space gone? What's happened, how would you summarize it to someone who's walking in for the first time, like hey, back in the old days we used to walk to school in the snow with no shoes on, both ways. Now it's like get off my lawn, you young developers. Seriously, what is the evolution of that, how would you explain it? >> Again, I would start with Teradata started the industry, by far, and then folks like Netezza and Greenplum came around to really give a lower-cost alternative.
Hadoop came on the scene eight-some years ago, and what I pride myself on, being at Greenplum for this long, is that Greenplum implemented the MapReduce paradigm as Hadoop was starting to build, and as it continued to build, we focused on building our own distribution and SQL on Hadoop. I think what we're getting down to is the brass tacks: the business is tired of technological science experiments and they just want to get stuff done. >> And a cost of ownership that's manageable. >> And sustainable. >> And sustainable, and not in a spot where they're going to be locked into a single vendor, hence the open source. >> The ones that are winning today employed what strategy that ended up working out, and what strategy didn't end up working out? If you go back and say, the people who took this path failed, the people who took this approach won. What's the answer there? >> Clearly anybody who was an appliance has long since drifted. I'd also say Greenplum's in this unique position where, >> An appliance too though. >> Well, pseudo-appliance, yes, I still have to respond to that, we were always software. >> You pivoted luckily. >> But putting that aside, the hardware vendors have gone away, and all of the software competitors that we had have actually either been sunset, sold off, or forgotten, and so Greenplum, here we sit as the sole standard-bearer that's been around for the long haul. We are now seeing a spot where we have no competition other than the forgotten, really legacy guys like Teradata. People are longing to get off of legacy and onto something modern. The trick will be whether that modern is some of these new and upcoming players and technologies, or whether it really focuses on solving problems. >> What was the winning strategy?
Stick to your knitting, stick to what you know, or was it more of, >> For us it was twofold. One, it was continuing to service our customers and make them successful, so that was how we built a profitable data platform business, and then the other was to double down on the strategies that seemed to be interesting to organizations, which were cloud, open source, and analytics. And like you said, I talked to one of the folks over at the Air Force, and he was mentioning how, to him, data's actually more important than fuel. Being able to understand where the airplanes are, where the fuel is, where the people are, where the missiles are, etc., that's actually more important than the fuel itself. Data is the thing that powers everything. >> Data's the currency of everything now. Great, Jacque, thanks so much for coming on the Cube. Pivotal Data Platform, Data Suite, Greenplum now with all these other adds, that's great, congratulations. Stay on the path helping customers, you can't lose. >> Exactly. >> The Cube here helping you figure out the big data noise. We're obviously at the Big Data New York City event for our annual Cube Wikibon event, in conjunction with Strata Data across the street. More live coverage here for three days in New York City. I'm John Furrier, with Peter Burris, we'll be back after this short break. (electronic music)
Wikibon Research Meeting
>> Dave: The cloud. There you go. I presume that worked. >> David: Hi there. >> Dave: Hi David. We had agreed, Peter and I had talked, and we said let's just pick three topics and allocate enough time. Maybe a half hour each, and then maybe a little bit longer if we have the time. Then try and structure it so we can gather some opinions on what it all means. Ultimately the goal is to have an outcome with some research that hits the network. The three topics today: Jim Kobielus is going to present on agile and data science, David Floyer on NVMe over fabric, of course keying off of the Micron news announcement. I think Nick is, is that Nick who just joined? He can contribute to that as well. Then George Gilbert has this concept of the digital twin. We'll start with Jim. I guess what I'd suggest is maybe present this in the context of, present a premise or some kind of thesis that you have and maybe the key issues that you see, and then kind of guide the conversation and we'll all chime in. >> Jim: Sure, sure. >> Dave: Take it away, Jim.
It's not just unicorns that are producing really work on their own, but more to the point, it's teams of specialists that come together in co-location, increasingly in co-located environments or in co-located settings to produce (banging) weekly check points and so forth. That's the basic premise that I've laid out for the piece. The themes. First of all, the themes, let me break it out. In terms of the overall how I design or how I'm approaching agile in this context is I'm looking at the basic principles of agile. It's really practices that are minimal, modular, incremental, iterative, adaptive, and co-locational. I've laid out how all that maps in to how data science is done in the real world right now in terms of tight teams working in an iterative fashion. A couple of issues that I see as regards to the adoption and sort of the ramifications of agile in a data science context. One of which is a co-location. What we have increasingly are data science teams that are virtual and distributed where a lot of the functions are handled by statistical modelers and data engineers and subject matter experts and visualization specialists that are working remotely from each other and are using collaborative tools like the tools from the company that I just left. How can agile, the co-location work primer for agile stand up in a world with more of the development team learning deeper and so forth is being done on a scrutiny basis and needs to be by teams of specialists that may be in different cities or different time zones, operating around the clock, produce brilliant results? Another one of which is that agile seems to be predicated on the notion that you improvise the process as you go, trial and error which seems to fly in the face of documentation or tidy documentation. Without tidy documentation about how you actually arrived at your results, how come those results can not be easily reproduced by independent researchers, independent data scientists? 
If you don't have well-defined processes for achieving results in a certain data science initiative, it can't be reproduced, which means it's not terribly scientific. By definition it's not science if it can't be reproduced by independent teams. To the extent that it's all loosey-goosey and improvised and undocumented, it's not reproducible. And if it's not reproducible, to what extent should you put credence in the results of a given data science initiative? Agile seems to fly in the face of reproducibility of data science results. Those are sort of the core themes or core issues that I'm pondering. >> Dave: Jim, just a couple questions. You had mentioned, you rattled off a bunch of parameters. You went really fast. One of them was co-location. Can you just review those again? What were they?
We'll need a very simplified feature set of predictive variables like maybe two or three at the most, to predict customer behavior, and use one very well understood algorithm like linear regressions and do it. With just a few lines of programming code in Python or Aura or whatever and build us some very crisp, simple rules. That's the notion in a data science context of a minimum viable product. That's the foundation of agile. Then there's the notion of modular which I've implied with minimal viable product. The initial product is the foundation upon which you build modular add ons. The add ons might be building out more complex algorithms based on more data sets, using more predictive variables, throwing other algorithms in to the initiative like logistic regression or decision trees to do more fine-grained customer segmentation. What I'm giving you is a sense for the modular add ons and builds on to the initial product that generally weaken incrementally in the course of a data science initiative. Then there's this, and I've already used the word incremental where each new module that gets built up or each new feature or tweak on the core model gets added on to the initial deliverable in a way that's incremental. Ideally it should all compose ultimately the sum of the useful set of capabilities that deliver a wider range of value. For example, in a data science initiative where it's customer data, you're doing predictive analysis to identify whether customers are likely to accept a given offer. One way to add on incrementally to that core functionality is to embed that capability, for example, in a target marketing application like an outbound marketing application that uses those predictive variables to drive responses in line to, say an e-commerce front end. Then there's the notion of iterative and iterative really comes down to check points. 
Regular reviews of the standards and check points where the team comes together to review the work in a context of data science. Data science by its very nature is exploratory. It's visualization, it's model building and testing and training. It's iterative scoring and testing and refinement of the underlying model. Maybe on a daily basis, maybe on a weekly basis, maybe adhoc, but iteration goes on all the time in data science initiatives. Adaptive. Adaptive is all about responding to circumstances. Trial and error. What works, what doesn't work at the level of the clinical approach. It's also in terms of, do we have the right people on this team to deliver on the end results? A data science team might determine mid-way through that, well we're trying to build a marketing application, but we don't have the right marketing expertise in our team, maybe we need to tap Joe over there who seems to know a little bit about this particular application we're trying to build and this particular scenario, this particular customers, we're trying to get a good profile of how to reach them. You might adapt by adding, like I said, new data sources, adding on new algorithms, totally changing your approach for future engineering as you go along. In addition to supervised learning from ground troops, you might add some unsupervised learning algorithms to being able to find patterns in say unstructured data sets as you bring those into the picture. What I'm getting at is there's a lot, 10 zillion variables that, for a data science team that you have to add in to your overall research plan going forward based on, what you're trying to derive from data science is its insights. They're actionable and ideally repeatable. That you can embed them in applications. It's just a matter of figuring out what actually helps you, what set of variables and team members and data and sort of what helps you to achieve the goals of your project. Finally, co-locational. 
It's all about the core team needing to be, usually, in the same physical location, according to the book, how people normally think of agile. The company that I just left is basically doing a massive, ongoing social engineering exercise, making their marketing and R&D teams a little more agile by co-locating them in cities like San Francisco and Austin and so forth. The whole notion is that people will collaborate far better if they're not virtual. That's highly controversial, but nonetheless, that's the foundation of agile as it's normally considered. One of my questions, really an open question, is: you might have a sprawling team that's doing data science, doing various aspects, but what solid core of that team needs to be physically co-located all or most of the time? Is it the statistical modeler and a data engineer alone, the one who stands up the cluster and the person who actually does the building and testing of the model? Do the visualization specialists need to be co-located as well? Do other specialties, like subject matter experts who have the knowledge in marketing or whatever it is, also need to be in the physical location day in, day out, week in and week out, to achieve results on these projects? Anyway, so there you go. That's how I sort of framed the argument of (mumbling). >> Dave: Okay. I got minimal, modular, incremental, iterative, adaptive, co-locational. What was six again? I'm sorry. >> Jim: Co-locational. >> Dave: What was the one before that? >> Jim: I'm sorry. >> Dave: Adaptive. >> Jim: Minimal, modular, incremental, iterative, adaptive, and co-locational. >> Dave: Okay, there were only six. Sorry, I thought it was seven. Good. A couple of questions, then we can get the discussion going here. Of course, you're talking specifically in the context of data science, but one of the questions I've seen around agile generally is that it's not for everybody, so when and where should it be used?
Waterfall still makes sense sometimes. Some of the criticisms I've read, heard, seen, and sometimes experienced with agile are quality issues, I'll call it lack of accountability. I don't know if that's the right terminology. We're going for speed, so as long as we're fast, we've checked that box, and quality can be sacrificed. Thoughts on that. Where does it fit, and again, understanding specifically you're talking about data science, does it always fit in data science because it's so new and hip and cool, or, like traditional programming environments, is it horses for courses? >> David: Can I add to that, Dave? It's a great, fundamental question. It seems to me there are two really important aspects of artificial intelligence. The first is the research part of it, which is developing the algorithms, developing the potential data sources that might or might not matter. Then the second is taking that and putting it into production. That is, somewhere along the line, it's saving money, time, etc., and it's integrated with the rest of the organization. The first piece seems to be like most research projects, where the ROI is difficult to predict. The second piece, actually implementing it, is where you're going to make money. If you can integrate that with your systems of record, for example, and get automation of many of the aspects that you've researched, is agile the right way of doing it at that stage? How would you bridge the gap between the initial development and then the final instantiation? >> That's an important concern, David. Dev Ops, that's a closely related issue, but it's not exactly the same scope. Let's just net it out. As machine learning and deep learning get embedded in applications, in operations I should say, like in your e-commerce site or whatever it might be, then data science itself becomes an operational function.
The people who continue to iterate those models do so inline with the operational applications. Really, when it comes down to an operational function, everything that these people do needs to be documented and version-controlled and so forth, these people meaning data science professionals. You need documentation. You need accountability. The development of these assets, machine learning and so forth, needs compliance. When you look at compliance, algorithmic accountability comes into it, where lawyers will, like e-discovery, theoretically subpoena all your algorithms and data and say, explain how you arrived at this particular recommendation that you made, to grant somebody or not grant somebody a loan or whatever it might be. The transparency of the entire development process is absolutely essential to the data science process downstream, when it's a production application. In many ways agile, by saying speed's the most important thing, screw documentation, you can sort of figure that out and it's not as important, that whole ethos goes by the wayside. Agile cannot, should not, skimp on documentation. Documentation is even more important as data science becomes an operational function. That's one of my concerns. >> David: It seems to me that the whole rapid idea development is difficult to combine with operational, boring testing, regression testing, etc. The two worlds are very different. The interface between the two is difficult. >> Everybody does their e-commerce tweaks through A/B testing of different layouts and so forth. A/B testing is fundamentally data science, and so it's an ongoing thing. (static) ... On A/B testing in terms of tweaking. All these channels and all the service flows, systems of engagement and so forth. All this stuff has to be documented, so agile in many ways flies in the face of that, or potentially compromises the visibility of (garbled) access. >> David: Right.
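The A/B testing mentioned above is, at bottom, a simple statistical comparison. A minimal sketch of checking whether two page layouts convert differently, using a standard two-proportion z-test with invented conversion counts and pure-Python math, might look like this:

```python
# Two-proportion z-test for an A/B layout test -- a hypothetical sketch
# with invented conversion counts, illustrating why even routine
# e-commerce tweaks are data science worth documenting.
import math

def ab_z_score(conv_a, n_a, conv_b, n_b):
    """Z-score for the difference between two observed conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Layout A: 200 conversions in 5,000 visits; layout B: 260 in 5,000.
z = ab_z_score(200, 5000, 260, 5000)
print(round(z, 2))  # |z| > 1.96 -> significant at roughly the 5% level
```

Logging the inputs and the test statistic for each such tweak is exactly the kind of documentation trail the discussion argues agile must not skip.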
If you're thinking about IOT for example, you've got very expensive machines out there in the field where you're trying to optimize throughput and trying to minimize machines breaking, etc. At the Micron event, it was interesting that in Micron's use of different methodologies of putting systems together, they were focusing on the data analysis, etc., to drive greater efficiency through their manufacturing process. Having said that, they need really, really tested algorithms, etc. to make sure there isn't a major (mumbling) or loss of huge amounts of potential revenue if something goes wrong. I'm just interested in how you would create the final product that has to go into production in a very high value chain like IOT. >> When you're running, say, AI from learning algorithms all the way down to the end points, it gets even trickier than simply documenting the data and feature sets and the algorithms and so forth that were used to build up these models. It also comes down to having to document the entire life cycle in terms of how these algorithms were trained to make the predictions, whatever it is you're trying to do at the edge with a particular algorithm. The whole notion of how are all of these edge point applications being trained, with what data, at what interval? Are they being retrained on a daily basis, hourly basis, moment by moment basis? All of those are critical concerns to know whether they're making the best automated decisions or actions possible in all scenarios. That's like a black box in terms of the sheer complexity of what needs to be logged to figure out whether the application is doing its job as best as possible. You need a massive event log from end to end of the IOT to do that right and to provide that ongoing visibility into the performance of these AI driven edge devices. I don't know anybody who's providing the tool to do it.
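The end-to-end event log Jim says nobody is yet providing could, at its simplest, be an append-only record of every training and deployment event per model. A hypothetical sketch; the `ModelEventLog` class, the model IDs, and the field names are all invented for illustration:

```python
import json
import time

class ModelEventLog:
    """Minimal append-only audit log for edge-model lifecycle events.

    Each record captures what the discussion calls for: which model,
    what happened to it (trained, deployed, scored), when, and with
    what data or result, so the lineage of an automated decision can
    be reconstructed later.
    """
    def __init__(self):
        self.events = []

    def record(self, model_id, event, detail):
        self.events.append({
            "ts": time.time(),
            "model_id": model_id,
            "event": event,    # e.g. "trained", "deployed", "scored"
            "detail": detail,  # data window, hyperparameters, metrics...
        })

    def history(self, model_id):
        """Replay every logged event for one model, oldest first."""
        return [e for e in self.events if e["model_id"] == model_id]

log = ModelEventLog()
log.record("turbine-7", "trained",
           {"window": "2017-05-01/2017-05-07", "auc": 0.91})
log.record("turbine-7", "deployed", {"site": "edge-gateway-3"})
print(json.dumps(log.history("turbine-7"), indent=2))
```

At production scale this would be a distributed, immutable event store rather than an in-memory list, but the shape of the record, model, event, data window, metrics, is the point.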
>> David: If I think about how it's done at the moment, it's obviously far too slow. At the same time, you've got to have some testing and things like that. It seems to me that you've got a research model on one side and then you need to create a working model from that, which is your production model. That's the one that goes through the testing and everything of that sort. It seems to me that the interface would be that transition from the research model to the working model that would be critical here, and the working model is obviously a subset and it's going to be optimized for performance, etc. in real time, as opposed to the development model, which can do a lot more and take half a week to run if necessary. It seems to me that you've got a different set of business pressures on the working model and a different set of skills as well. I think having one team here doesn't sound right to me. You've got to have a Dev Ops team who are going to take the working model from the developers and then make sure that it's sound and safe. Especially in a high value IOT area, the level of iteration is not going to be nearly as high as in a lower cost marketing type application. Does that sound sensible? >> That sounds sensible. In fact in Dev Ops, the Dev Ops team would definitely be the ones that handle the continuous training and retraining of the working models on an ongoing basis. That's a core observation. >> David: Is that the right way of doing it, Jim? It seems to me that the research people would be continuing to adapt from data from a lot of different places, whereas the operational model would be at a specific location with a specific IOT and they wouldn't necessarily have all the data there to do that. I'm not quite sure whether - >> Dave: Hey guys? Hey guys, hey guys? Can I jump in here?
Interesting discussion, but highly nuanced, and I'm struggling to figure out how this turns into a piece, or we're sort of debating certain specifics that are very kind of weedy. I wonder if we could just reset for a second and come back to sort of what I was trying to get to before, which is really the business impact. Should this be applied broadly? Should this be applied specifically? What does it mean if I'm a practitioner? What should I take away from, Jim, your premise and your sort of fixed parameters? Should I be implementing this? Why? Where? What's the value to my organization - the value I guess is obvious, but does it fit everywhere? Should it be across the board? Can you address that? >> Neil: Can I jump in here for a second? >> Dave: Please, that would be great. Is that Neil? >> Neil: Neil. I've never been a data scientist, but I was an actuary a long time ago. When the chief actuary came to me and said we need to develop a liability insurance coverage for floating oil rigs in the North Sea, I'm serious, it took a couple of months of research and modeling and so forth. If I had to go to all of those meetings and stand ups in an agile development environment, I probably would have gone postal on the place. I think that there's some confusion about what data science is. It's not a vector. It's not like a Dev Ops situation where you start with something and you go (mumbling). When a data scientist or whatever you want to call them comes up with a model, that model has to be constantly revisited until it's put out of business. It's refined, it's evaluated. It doesn't have an end point like that. The other thing is that a data scientist is typically going to be running multiple projects simultaneously, so how in the world are you going to agilize that?
I think if you look at the data science group, they're probably, I think Nick said this, there are probably groups in there that are doing more pure Dev Ops, software engineering and so forth, and you can apply agile techniques to them. The whole data science thing is too squishy for that, in my opinion. >> Jim: Squishy? What do you mean by squishy, Neil? >> Neil: It's not one thing. I think if you try to represent data science as here's a project, we gather data, we work on a model, we test it, and then we put it into production, it doesn't end there. It never ends. It's constantly being revised. >> Yeah, of course. It's akin to application maintenance. The application, meaning the model, the algorithm, has to continually be evaluated, possibly tweaked, always retrained to determine its predictive fit for whatever task it's been assigned. You don't build it once and assume it's a strong predictive fit forever and ever. You can never assume that. >> Neil: James and I called that adaptive control mechanisms. You put a model out there and you monitor the return you're getting. You talk about AB testing, that's one method of doing it. I think that a data scientist is somebody who really is keyed into the machine learning and all that jazz. I just don't see them as being project oriented. I'll tell you one other thing, I have a son who's a software engineer and he said something to me the other day. He said, "Agile? Agile's dead." I haven't had a chance to find out what he meant by that. I'll get back to you. >> Oh, okay. If you look at - Go ahead. >> Dave: I'm sorry, Neil. Just to clarify, he said agile's dead? Was that what he said? >> Neil: I didn't say it, my son said it. >> Dave: Yeah, yeah, yeah right. >> Neil: No idea what he was talking about. >> Dave: Go ahead, Jim. Sorry. >> If you look at waterfall development in general, for larger projects it's absolutely essential to get requirements nailed down and the functional specifications and all that.
Where you have some very extensive projects and many moving parts, obviously you need a master plan that it all fits into, and in waterfall, those checkpoints and so forth, those controls that are built into that methodology, are critically important. Within the context of a broad project, some of the assets being built up might be machine learning models and analytics models and so forth, so in the context of a broader waterfall oriented software development initiative, you might need to have multiple data science projects spun off as sub-projects. Each of those would fit into the broader plan, but by itself might be conducted sort of like an exploration task, where you have a team doing data visualization and exploration in more of an open-ended fashion while they're trying to figure out the right set of predictors and the right set of data to be able to build out the right model to deliver the right result. What I'm getting at is that agile approaches, agile data science approaches, might be embedded into broader waterfall oriented development initiatives. Fundamentally, data science began and still is predominantly very smart people, PhDs in statistics and math, doing open-ended exploration of complex data looking for non-obvious patterns that you wouldn't be able to find otherwise. Sort of a fishing expedition, a high priced fishing expedition. That's kind of the mode of operation, how data science often is conducted in the real world. Looking for that eureka moment when the correlations just jump out at you. There's a lot of that that goes on. A lot of that is very important data science; it's more akin to pure science. What I'm getting at is there might be some role for more structured, waterfall development approaches in projects that have a core data science capability to them. Those are my thoughts. >> Dave: Okay, we probably should move on to the next topic here, but just in closing can we get people to chime in on sort of the bottom line here?
If you're writing to an audience of data scientists or data scientist wannabes, what's the one piece of advice, or a couple of pieces of advice, that you would give them? >> First of all, data science is a developer competency. Many modern developers need to be data scientists, or have a strong grounding in and understanding of data science, because much of that machine learning and all that is increasingly the core of what software developers are building, so you can't not understand data science if you're a modern software developer. You can't understand data science as it (garbled) if you don't understand the need for agile iterative steps, because they're looking for the needle in the haystack quite often. The right combination of predictive variables and the right combination of algorithms and the right training regimen in order to get it all fit. It's a new world competency that needs to be mastered if you're a software development professional. >> Dave: Okay, anybody else want to chime in on the bottom line there? >> David: Just my two penny worth is that the key aspect for all the data scientists is to come up with the algorithms and then implement them in a way that is robust and is part of the system as a whole. The return on investment on the data science piece as an insight isn't worth anything until it's actually implemented and put into production of some sort. It seems the second stage, creating the working model, is what is the output of your data scientists. >> Yeah, it's the repeatable, deployable asset that incorporates the crux of data science, which is algorithms that are data driven, statistical algorithms that are data driven. >> Dave: Okay. If there's nothing else, let's close this agenda item out. Is Nick on? Did Nick join us today? Nick, you there? >> Nick: Yeah. >> Dave: Sounds like you're on. Tough to hear you. >> Nick: How's that? >> Dave: Better, but still not great. Okay, we can at least hear you now.
David, you wanted to present on NVMe over fabric, pivoting off the Micron news. What is NVMe over fabric and who gives a fuck? (laughing) >> David: This is Micron, we talked about it last week. This is the Micron announcement. What they announced is NVMe over fabric, which, as we talked about last time, is the ability to create a whole number of nodes. They've tested 250, and the architecture will take them to 1,000 processors, 1,000 nodes, and be able to access the data on any single node at roughly the same speed. They are quoting 200 microseconds. It's 195 if it's local and it's 200 if it's remote. That is a very, very interesting architecture, which is like nothing else that's been announced. >> Participant: David, can I ask a quick question? >> David: Sure. >> Participant: This latency and the node count sounds astonishing. Is Intel not replicating this or challenging in scope with their 3D XPoint? >> David: 3D XPoint, Intel would love to sell that as a key component of this. 3D XPoint as a storage device is very, very, very expensive. You can replicate most of the function of 3D XPoint at a much lower price point by using a combination of D-RAM and protected D-RAM and Flash. At the moment, 3D XPoint is a nice to have and there'll be circumstances where they will use it, but at the meeting yesterday, I don't think they, they might have brought it up once. They didn't emphasize it (mumbles) at all as being part of it. >> Participant: To be clear, this means rather than buying Intel servers rounded out with lots of 3D XPoint, you buy Intel servers just with the CPU and then all the Micron niceness for their NVMe and their Interconnect? >> David: Correct. They are still Intel servers. The ones they were displaying yesterday were HP's, and they also used SuperMicro. They want certain characteristics of the chip set that are used, but those are just standard pieces.
The other parts of the architecture are the Mellanox 100 gigabit converged ethernet, using RoCE, which is RDMA over Converged Ethernet. That is the secret sauce, and Mellanox themselves, their cards offload a lot of functionality. That's the secret sauce which allows you to go from any point to any point in 5 microseconds. Then you create a transfer and other things. Files are on top of that. >> Participant: David, another quick question. The latency is incredibly short. >> David: Yep. >> Participant: What happens if, say in an MPP SQL database with 1,000 nodes, they have to shuffle a lot of data? What's the throughput? Is it limited by that 100 gig or is that so insanely large that it doesn't matter? >> David: The key is this: it allows you to move the processing to wherever the data is very, very easily. The principle that will evolve from this architecture is that you know where the data is, so don't move the data around, that'll block things up. Move the processing to that particular node or some adjacent node and do the processing as close as possible. That, as an architecture, is a long term goal. Obviously in the short term, you've got to take things as they are. Clearly, a different type of architecture for databases will need to eventually evolve out of this. At the moment, what they're focusing on is big problems which need low latency solutions, using databases as they are and the whole end-to-end stack, which is a much faster way of doing it. Then over time, they'll adapt new databases, new architectures to really take advantage of it. What they're offering is a POC at the moment. It's in Beta. They had their customers talking about it and they were very complimentary in general about it. They hope to get it into full production this year. There's going to be a host of other people that are doing this.
I was trying to bottom line this in terms of really what the link is with digital enablement. For me, true digital enablement is enabling any relevant data to be available for processing at the point of business engagement, in real time or near real time. That's the definition that this architecture enables. It's, in my view, a potential game changer in that this is an architecture which will allow any data to be available for processing. You don't have to move the data around, you move the processing to that data. >> Is Micron the first to market with this capability, David? NV over Me? NVMe. >> David: Over fabric? Yes. >> Jim: Okay. >> David: Having said that, there are a lot of start ups which have got a significant amount of money and who are coming to market with their own versions. You would expect Dell, HP to be following suit. >> Dave: David? Sorry. Finish your thought and then I have another quick question. >> David: No, no. >> Dave: The principle, and you've helped me understand this many times, going all the way back to Hadoop, is bring the application to the data, but when you're using conventional relational databases and you've had it all normalized, you've got to join stuff that might not be co-located. >> David: Yep. That's the whole point about the five microseconds. Now the impact of non co-location, if you have to join stuff or whatever it is, is much, much lower. So you can do the logical join, whatever it is, very quickly and very easily across that whole fabric. In terms of processing against that data, then you would choose to move the application to that node because it's much less data to move; that's an optimization of the architecture as opposed to a fundamental design point. You can then optimize where you run the thing.
This is an ideal architecture for where I personally see things going, which is traditional systems of record which need to be exactly as they've ever been, and then alongside them, the artificial intelligence, the systems of understanding, data warehouses, etc. Having that data available in the same space so that you can combine those two elements in real time or in near real time. The advantage of that in terms of business value, digital enablement, is the biggest thing of all. That's a 50% improvement in overall productivity of a company; that's the thing that will drive, in my view, 99% of the business value. >> Dave: Going back just to the join thing, 100 gigs with five microseconds, that's really, really fast, but if you've got petabytes of data on these thousand nodes and you have to do a join, you still got to go through that 100 gig pipe for stuff that's not co-located. >> David: Absolutely. The way you would design that is as you would design any query. You've got a process you would need in front of that, which is query optimization, to be able to farm out all of the independent jobs needed in each of the nodes and take the output of that and bring it together. Both of those concepts are already there. >> Dave: Like a map. >> David: Yes. That's right. All of the computer science is there. You're starting from an architecture which is fundamentally different from the traditional architectures that have existed, by removing that huge overhead of going from one node to another. >> Dave: Oh, because this goes, it's like a mesh not a ring? >> David: Yes, yes. >> Dave: It's like the high performance compute of this MPI type architecture? >> David: Absolutely. NVMe, by definition, is a point to point architecture. RoCE, underneath it, is a point to point architecture. Everything is point to point. Yes. >> Dave: Oh, got it. That really does call for a redesign. >> David: Yes, you can take it in steps.
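The scatter-gather pattern David and Dave are circling, a query optimizer farming independent jobs out to the nodes holding the data and only gathering the filtered results, can be sketched in a few lines. The in-memory `shards` list stands in for per-node storage; everything here is an illustrative assumption, not Micron's actual software:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(nodes, predicate):
    """Ship the (small) predicate to every node, scan locally, and
    pull back only the qualifying rows, instead of shipping raw data
    across the fabric.

    `nodes` is a hypothetical stand-in for per-node storage:
    a list of row lists, one per fabric node.
    """
    def local_scan(rows):
        # This is the work that runs "as close as possible" to the data.
        return [r for r in rows if predicate(r)]

    with ThreadPoolExecutor() as pool:
        partials = pool.map(local_scan, nodes)
    # The coordinator only gathers the already-filtered partial results.
    return [row for part in partials for row in part]

# Three nodes, each holding a shard of (order_id, amount) rows.
shards = [[(1, 40), (2, 910)], [(3, 15), (4, 1200)], [(5, 600)]]
big_orders = scatter_gather(shards, lambda r: r[1] > 500)
```

This is the same map-then-gather shape Dave alludes to; a real optimizer would also push joins and aggregations down to the nodes before the final merge.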
It'll work as it is and then over time you'll optimize it to take advantage of it more. Does that definition of (mumbling) make sense to you guys? The one I quoted to you? Enabling any relevant data to be available for processing at the point of business engagement, in real time or near real time? That's where you're trying to get to and this is a very powerful enabler of that design. >> Nick: You're emphasizing the network topology, while I kind of thought the heart of the argument was performance. >> David: Could you repeat that? It's very - >> Dave: Let me repeat. Nick's a little light, but I could hear him fine. You're emphasizing the network topology, but Nick's saying his takeaway was the whole idea was the thrust was performance. >> Nick: Correct. >> David: Absolutely. Absolutely. The result of that network topology is a many times improvement in performance of the systems as a whole that you couldn't achieve in any previous architecture. I totally agree. That's what it's about is enabling low latency applications with much, much more data available by being able to break things up in parallel and delivering multiple streams to an end result. Yes. >> Participant: David, let me just ask, if I can play out how databases are designed now, how they can take advantage of it unmodified, but how things could be very, very different once they do take advantage of it which is that today, if you're doing transaction processing, you're pretty much bottle necked on a single node that sort of maintains the fresh cache of shared data and that cache, even if it's in memory, it's associated with shared storage. What you're talking about means because you've got memory speed access to that cache from anywhere, it no longer is tied to a node. That's what allows you to scale out to 1,000 nodes even for transaction processing. That's something we've never really been able to do. 
Then the fact that you have a large memory space means that you no longer optimize for mapping back and forth from disk and disk structures, but you have everything in a memory native structure and you don't go through this thin straw of IO to storage, you go through memory speed IO. That's a big, big - >> David: That's the end point. I agree. That's not here quite yet. It's still IO, so the IO has been improved dramatically, the protocol within NVMe and the over-fabric part of it. The elapsed time has been improved, but it's not yet the same as, for example, the HPE initiative. That's saying you change your architecture, you change your way of processing, everything is just in memory. Everything is assumed to be memory. We're not there yet. 200 microseconds is still a lot, lot slower than the processor - one impact of this architecture is that the amount of data that you can pass through it is enormously higher, and therefore the memory sizes themselves within each node will need to be much, much bigger. There is a real opportunity for architectures which minimize the impact, which hold data coherently across multiple nodes, where there's minimal impact, no tapping on the shoulder for every byte transferred, so you can move large amounts of data into memory and then tell people that it's there and allow it to be shared, for example between the different cores and the GPUs and FPGAs that will be in these processors. There's more to come in terms of the architecture in the future. This is a step along the way, it's not the whole journey. >> Participant: Dave, another question. You just referenced 200 milliseconds or microseconds? >> David: Did I say milliseconds? I meant microseconds. >> Participant: You might have, I might have misheard. Relate that to the five microsecond thing again. >> David: If you have data directly attached to your processor, the access time is 195 microseconds.
If you need to go to a remote node, anywhere else in the thousand nodes, your access time is 200 microseconds. In other words, the additional overhead of getting to that data is five microseconds. >> Participant: That's incredible. >> David: Yes, yes. That is absolutely incredible. That's something that computer scientists have been working on for years and years. Okay. That's the reason why you can now do what I talked about, which was you can have access from any node to any data within that large number of nodes. You can have petabytes of data there and you can have access from any single node to any of that data. That, in terms of data enablement, digital enablement, is absolutely amazing. In other words, you don't have to pre-position the data that's local to one application in one place. You're allowing an enormous flexibility in how you design systems. That, coming back to artificial intelligence, etc., allows you a much, much larger amount of data that you can call on for improving applications. >> Participant: You can explore and train models, huge models, really quickly? >> David: Yes, yes. >> Participant: Apparently that process works better when you have an MPI-like mesh than a ring. >> David: If you compare this architecture to the DSSD architecture, which was the first entrance into this, that EMC bought for a billion dollars, then that one stopped at 40 nodes. Its architecture was very, very proprietary all the way through. This one takes you to 1,000 nodes with much, much lower cost. They believe that the cost, compared to the equivalent DSSD system, will be between 10 and 20% of that cost. >> Dave: Can I ask a question about, you mentioned query optimizer. Who develops the query optimizer for the system? >> David: Nobody does yet. >> Jim: The DBMS vendor would have to re-write theirs, at a whole different expense. >> Dave: So we would need an optimizer per database system? >> David: Who's asking the question, I'm sorry. I don't recognize the voice. >> Dave: That was Neil.
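The arithmetic behind "the additional overhead of that data is five microseconds" is worth spelling out, since it is what makes remote data effectively free in this architecture:

```python
# Worked version of the numbers quoted above: 195 microseconds for a
# local access, 200 microseconds for a remote one anywhere in the
# 1,000-node fabric.
local_us, remote_us = 195, 200

fabric_overhead_us = remote_us - local_us           # 5 microseconds
overhead_pct = fabric_overhead_us / local_us * 100  # ~2.6% penalty

print(f"Remote-access penalty: {fabric_overhead_us} us "
      f"({overhead_pct:.1f}%)")
```

A roughly 2.6% penalty for reaching any node in the fabric is why, as David says, you no longer have to pre-position data next to the application that uses it.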
Hold on one second, David. Hold on one second. Go ahead Nick. You talk about translation. >> Nick: ... On a network. It's a SAN. It happens to be very low latency and very high throughput, but it's just a storage sub-system. >> David: Yep. Yep. It's a storage sub-system. It's called a server SAN. That's what we've been talking about for a long time: you need the same characteristics, which is that you can get at all the data, but you need to be able to get at it in compute time as opposed to taking-a-stroll-down-the-road time. >> Dave: Architecturally it's a SAN without an array controller? >> David: Exactly. Yeah, the array controller is software from a company called Xcellate, what was the name of it? I can't remember now. Say it again. >> Nick: Excelero or Xceleron? >> David: Excelero. That's the company that has produced the software for the data services, etc. >> Dave: Let's, as we sort of wind down this segment, let's talk about the business impact again. We're talking about different ways potentially to develop applications. There's an ecosystem requirement here, it sounds like, from the ISVs to support this, and other developers. It portends the elimination of the last electromechanical device in computing, which has implications for a lot of things. Performance value, application development, application capability. Maybe you could talk about that a little bit, again thinking in terms of how practitioners should look at this. What are the actions that they should be taking and what kinds of plans should they be making in their strategies? >> David: I thought Neil's comment last week was very perceptive, which is, you wouldn't start with people like me who have been imbued with the 100 database call limits for umpteen years. You'd start with people, millennials, or sub-millennials or whatever you want to call them, who can take a completely fresh view of how you would exploit this type of architecture.
Fundamentally you will be able to get through 10 or 100 times more data in real time than you can with today's systems. There's two parts of that data as I said before. The traditional systems of record that need to be updated, and then a whole host of applications that will allow you to do processes which are either not possible, or very slow today. To give one simple example, if you want to do real time changing of pricing based on availability of your supply chain, based on what you've got in stock, based on the delivery capabilities, that's a very, very complex problem. The optimization of all these different things and there are many others that you could include in that. This will give you the ability to automate that process and optimize that process in real time as part of the systems of record and update everything together. That, in terms of business value is extracting a huge number of people who previously would be involved in that chain, reducing their involvement significantly and making the company itself far more agile, far more responsive to change in the marketplace. That's just one example, you can think of hundreds for every marketplace where the application now becomes the systems of record, augmented by AI and huge amounts more data can improve the productivity of an organization and the agility of an organization in the marketplace. >> This is a godsend for AI. AI, the draw of AI is all this training data. If you could just move that in memory speed to the application in real time, it makes the applications much sharper and more (mumbling). >> David: Absolutely. >> Participant: How long David, would it take for the cloud vendors to not just offer some instances of this, but essentially to retool their infrastructure. (laughing) >> David: This is, to me a disruption and a half. The people who can be first to market in this are the SaaS vendors who can take their applications or new SaaS vendors. ISV. Sorry, say that again, sorry. 
>> Participant: The SaaS vendors who have their own infrastructure? >> David: Yes, but it's not going to be long before the AWSes and Microsofts put this in their tool bag. The SaaS vendors have the greatest capability of making this change in the shortest possible time. To me, that's one area where we're going to see results. Make no mistake about it, this is a big change, and at the Micron conference, I can't remember what the guy's name was, he said it takes two Olympics for people to start adopting things for real. I think that's going to be shorter than two Olympics, but it's going to be quite a slow process for pushing this out. It's radically different and a lot of the traditional ways of doing things are going to be affected. My view is that SaaS is going to be first, and then there are going to be individual companies that solve the problems themselves. Large companies, even small companies, that put in systems of this sort and then use them to outperform the marketplace in a significant way. Particularly in the finance area and particularly in other data-intense areas. That's my two pennies' worth. Anybody want to add anything else? Any other thoughts? >> Dave: Let's wrap some final thoughts on this one. >> Participant: Big deal for big data. >> David: Like it, like it. >> Participant: It's actually more than that, because there used to be a major trade off between big data and fast data, latency and throughput, and this starts to push some of those boundaries out so that you sort of can have both at once. >> Dave: Okay, good. Big deal for big data and fast data. >> David: Yeah, I like it. >> Dave: George, you want to talk about digital twins? I remember when you first sort of introduced this, I was like, "Huh? What's a digital twin? That's an interesting name." I guess, I'm not sure you coined it, but why don't you tell us what a digital twin is and why it's relevant. >> George: All right. GE coined it.
I'm going to, at a high level talk about what it is, why it's important, and a little bit about as much as we can tell, how it's likely to start playing out and a little bit on the differences of the different vendors who are going after it. As far as sort of defining it, I'm cribbing a little bit from a report that's just in the edit process. It's data representation, this is important, or a model of a product, process, service, customer, supplier. It's not just an industrial device. It can be any entity involved in the business. This is a refinement sort of Peter helped with. The reason it's any entity is because there is, it can represent the structure and behavior, not just of a machine tool or a jet engine, but a business process like sales order process when you see it on a screen and its workflow. That's a digital twin of what used to be a physical process. It applied to both the devices and assets and processes because when you can model them, you can integrate them within a business process and improve that process. Going back to something that's more physical so I can do a more concrete definition, you might take a device like a robotic machine tool and the idea is that the twin captures the structure and the behavior across its lifecycle. As it's designed, as it's built, tested, deployed, operated, and serviced. I don't know if you all know the myth of, in the Greek Gods, one of the Goddesses sprang fully formed from the forehead of Zeus. I forgot who it was. The point of that is digital twin is not going to spring fully formed from any developers head. Getting to the level of fidelity I just described is a journey and a long one. Maybe a decade or more because it's difficult. You have to integrate a lot of data from different systems and you have to add structure and behavior for stuff that's not captured anywhere and may not be captured anywhere. 
Just for example, CAD data might have design information; manufacturing information might come from there or another system. CRM data might have support information. Maintenance, repair, and overhaul applications might have information on how it's serviced. Then you also connect the physical version with the digital version with, essentially, telemetry data that says how it's been operating over time. That sort of helps define its behavior, so you can manipulate that and predict things or simulate things that you couldn't do with just the physical version. >> You have to think about, combined with, say, 3D printers, you could create a hot physical backup of some malfunctioning thing in the field, because you have the entire design, you have the entire history of its behavior and its current state before it went kablooey. Conceivably, it can be fabricated on the fly and reconstituted as a physical object from the digital twin that was maintained. >> George: Yes, you know what, actually that raises a good point, which is that the behavior that was represented in the telemetry helps the designer simulate a better version for the next version. Just what you're saying. Then with 3D printing, you can either make a prototype or another instance. Some of the printers are getting sophisticated enough to punch out better versions, or parts for better versions. That's a really good point. There's one thing that has to hold all this stuff together, which is really kind of difficult, which is challenging technology. IBM calls it a knowledge graph. It's pretty much in anyone's version; they might not call it a knowledge graph. A graph is, instead of a tree where you have a parent and then children and then the children have more children, a structure where many things can relate to many things. The reason I point that out is that it puts a holistic structure over all these disparate sources of data and behavior. You essentially talk to the graph, sort of like with Arnold, talk to the hand.
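The many-to-many structure George is describing can be sketched in a few lines. Everything below, the class, the relation labels, the node IDs, is invented for illustration and is not any vendor's actual knowledge-graph API:

```python
# A minimal sketch of the "knowledge graph" idea: unlike a tree, a graph
# lets many entities relate to many others, so disparate sources (CAD,
# CRM, maintenance, telemetry) can all hang off the same machine-tool node.
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        # edges[node] -> set of (relation, other_node) pairs
        self.edges = defaultdict(set)

    def relate(self, a, relation, b):
        # Store the link in both directions; a node can participate in
        # any number of relations (many-to-many, not parent/child).
        self.edges[a].add((relation, b))
        self.edges[b].add(("inverse:" + relation, a))

    def neighbors(self, node):
        return {b for _, b in self.edges[node]}

g = KnowledgeGraph()
g.relate("robotic_machine_tool_17", "designed_in", "cad_model_A12")
g.relate("robotic_machine_tool_17", "serviced_by", "mro_ticket_993")
g.relate("robotic_machine_tool_17", "emits", "telemetry_stream_17")
g.relate("cad_model_A12", "referenced_by", "mro_ticket_993")

# "Talking to the graph": one query point over all the sources.
print(sorted(g.neighbors("robotic_machine_tool_17")))
# → ['cad_model_A12', 'mro_ticket_993', 'telemetry_stream_17']
```

The point of the sketch is only the shape: every data source becomes edges on shared nodes, so one traversal reaches design, service, and telemetry information at once.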
That didn't, I got crickets. (laughing) Let me give you guys the, I put a definitions table in this doc. I had a couple things. Data models. These are some important terms. The data model represents the structure, but not the behavior, of the digital twin. The API represents the behavior of the digital twin, and it should conform to the data model for maximum developer usability. Jim, jump in anywhere you feel like you want to correct or refine. The object model is a combination of the data model and the API. You were going to say something? >> Jim: No, I wasn't. >> George: Okay. The object model ultimately is the digital twin. Another way of looking at it: defining the structure and the behavior. Then there's one of these scary terms, the canonical model. It's a generic version of the digital twin, or really the one where you're going to have a representation that doesn't have customer-specific extensions. This is important because the way these things are getting built today is mostly custom bespoke, and so you want to be able to reuse work. If someone's building this for you, like a system integrator, you want to be able to, or they want to be able to, reuse this on the next engagement, and you want to be able to take the benefit of what they've learned on the next engagement back to you. There has to be this canonical model that doesn't break every time you essentially add new capabilities. It doesn't break your existing stuff. The knowledge graph, again, is this thing that holds together all the pieces and makes them look like one coherent whole. I'll get to, I talked briefly about network compatibility, and I'll get to level of detail. Let me go back to, I'm sort of doing this from crib notes. We talked about telemetry, which is sort of combining the physical and the twin. Again, telemetry's really important because this is like the time series database. It says, this is all the stuff that was going on over time.
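Those definitions, a data model for structure, an API for behavior, and a canonical twin that customer-specific extensions don't break, can be sketched roughly like this. All the field and class names are made up for the example:

```python
# Hedged sketch: the data model is structure only, the twin's methods are
# the API (behavior conforming to that structure), and the object model is
# their combination. Customer extensions layer on without touching the
# canonical version, so it can be reused on the next engagement.
from dataclasses import dataclass, field

@dataclass
class DataModel:              # structure, no behavior
    fields: dict

@dataclass
class DigitalTwin:            # object model = data model + API
    model: DataModel
    extensions: dict = field(default_factory=dict)

    def read(self, name):
        # API behavior: extensions shadow canonical fields.
        if name in self.extensions:
            return self.extensions[name]
        return self.model.fields[name]

    def extend(self, name, value):
        # Customer-specific additions never mutate the canonical model.
        return DigitalTwin(self.model, {**self.extensions, name: value})

canonical = DigitalTwin(DataModel({"rpm": 0, "temperature_c": 20.0}))
customer = canonical.extend("plant_zone", "humid-line-3")

print(customer.read("plant_zone"))   # extension is visible on the copy
print(canonical.extensions)          # canonical stays untouched: {}
```

The design choice the sketch illustrates is the one George is after: `extend` returns a new twin rather than modifying the canonical one, which is what keeps the reusable core from breaking as capabilities are added.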
Then you can look at telemetry data that tells you, we got a dirty power spike, and after three of those, this machine sort of started vibrating. That's part of how you're looking to learn about its behavior over time. In that process, the models get better and better at predicting and enabling you to optimize the machine's behavior and the business process with which it integrates. I'll give some examples of that. These digital twins can themselves be composed in levels of detail. I think I used the example of a robotic machine tool. Then you might have a bunch of machine tools on an assembly line, and then you might have a bunch of assembly lines in a factory. As you start modeling, not just the single instance, but the collections higher up, at higher and higher levels of abstraction, or levels of detail, you get a richer and richer way to model the behavior of your business. More and more of your business. Again, it's not just the assets, it's some of the processes. Let me now talk a little bit about how the continual improvement works. As Jim was talking about, we have data feedback loops in our machine learning models. Once you have a good quality digital twin in place, you get the benefit of increasing returns from the data feedback loops. In other words, if you can get to a better starting point than your competitor, and then you get on the increasing returns of the data feedback loops, you're improving the fidelity of the digital twin faster than your competitor. That's for one twin; I'll talk about how you want to make the whole ecosystem of twins sort of self-reinforcing. I'll get to that in a sec. There's another point to make about these data feedback loops, which is, traditional apps, and this came up with Jim and Neil, traditional apps are static. You want upgrades, you get stuff from the vendor.
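The dirty-power-spike rule above can be written as a toy detector over a telemetry series. The threshold, the spike limit, and the readings are all invented for illustration:

```python
# Illustrative sketch: watch a voltage time series for dirty-power spikes
# and flag the machine once three have occurred, since that's the pattern
# after which vibration tended to follow in the anecdote.
SPIKE_THRESHOLD = 260.0   # volts; anything above counts as a dirty spike

def flag_after_spikes(voltage_series, limit=3):
    """Return the index at which the spike count reaches `limit`, else None."""
    spikes = 0
    for i, v in enumerate(voltage_series):
        if v > SPIKE_THRESHOLD:
            spikes += 1
            if spikes >= limit:
                return i
    return None

telemetry = [230.1, 231.0, 272.5, 229.8, 268.9, 230.4, 281.2, 230.0]
print(flag_after_spikes(telemetry))   # third spike occurs at index 6
```

In a real twin, this kind of rule would itself be learned and refined by the feedback loop George describes, rather than hard-coded.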
With digital twins, they're always learning from the customer's data, and that has implications when the partner or vendor who helped build it for a customer takes learnings from the customer and goes to a similar customer for another engagement. I'll talk about the implications of that. This is important because it's half packaged application and half bespoke. The vendor doesn't have to take the customer's data, but their model learns from the data. Think of it as: I'm not going to take your coffee beans, your data, but I'm going to make coffee from your beans, and I'm going to take that to the next engagement with another customer, who could be your competitor. In other words, you're extracting all the value from the data, and that helps modify the behavior of the model, and the next guy gets the benefit of it. Dave, this is the stuff where IBM keeps saying, we don't take your data. You're right, but you're taking the juice you squeezed out of it. That's one of my next reports. >> Dave: It's interesting, George. Their contention is that they uniquely, unlike Amazon and Google, don't swap spit, your spit, with their competitors. >> George: That's misleading. To say Amazon and Google, those guys aren't building digital twins. Parametric Technology is. I got this definitively from a Parametric technical fellow at an AWS event last week, which is, they not only don't use the data, they don't use the structure of the twin either from engagement to engagement. That's a big difference from IBM. I have a quote from Chris O'Connor of IBM Munich saying, "We'll take the data model, but we won't take the data." I'm like, so you take the coffee from the beans even if you don't take the beans? I'm going to be very specific about saying that claiming you don't do what Google and Facebook do, while doing what you do, is misleading. >> Dave: My only caution there is, do some more vetting and checking.
A lot of times, what some guy says in a Cube interview, he or she doesn't even know, in my experience. Make sure you validate that. >> George: I'll send it to them for feedback, but it wasn't just him. I got it from the CTO of the IoT division as well. >> Dave: When you were in Munich? >> George: This wasn't on the Cube either. This was on the side, at the coffee table during our break. >> Dave: I understand, and CTOs in theory should know. I can't tell you how many times I've gotten a definitive answer from a pretty senior level person, and it turns out either they weren't listening to me, or they didn't know, or they were just yessing me, or whatever. Just be really careful and make sure you do your background checks. >> George: I will. I think the key is, leave them room to provide a nuanced answer. It's more about being really, really, really concrete about really specific edge conditions and saying, do you or don't you. >> Dave: This is a pretty big one. If I'm a CIO, a chief digital officer, a chief data officer, a COO, head of IT, head of data science, what should I be doing in this regard? What's the advice? >> George: Okay, can I go through a few more, or are we out of time? >> Dave: No, we have time. >> George: Let me do a couple more points. I talked about training a single twin, or an instance of a twin, and I talked about the acceleration of the learning curve. There's edge analytics. David has educated us with the help of looking at GE Predix. David, you have been talking about this for a long time. You want edge analytics to inform or automate a low latency decision, and so this is where you're going to have to run some amount of analytics right near the device. Although, I've got to mention, hopefully this will elicit a chuckle, when you get some vendors telling you what their edge and cloud strategies are: MapR said, we'll have a Hadoop cluster that only needs four or five nodes as our edge device. And we'll need five admins to care for and feed it.
He didn't say the last part, but that obviously isn't going to work. The edge analytics could be things like recalibrating the machine for a different tolerance if it's seeing that it's getting outside the tolerance window, or something like that. The cloud, and this is old news for anyone who's been around David, but you're going to have a lot of data, not all of it, going back to the cloud to train both the instances of each robotic machine tool and the master of that machine tool. The reason is, an instance would be, oh, I'm operating in a high humidity environment, something like that. Another one would be operating where there's a lot of sand, or something that screws up the behavior. Then the master might be something that has behavior that's sort of common to all of them. The training will take place on the instances and the master, and it will in all likelihood push down versions of each. Next to the physical device, or process, or whatever, you'll have the instance one and a class one, and between the two of them, they should give you the optimal view of behavior and the ability to simulate to improve things. It's worth mentioning, again as David found out, not by talking to GE but by accidentally looking at their documentation, that their whole positioning of edge versus cloud is a little bit hand-waving, and in talking to the guys from ThingWorx, which is a division of what used to be called Parametric Technology and is now just PTC, it appears that they're negotiating with GE to give them the orchestration and distributed database technology that GE can't build itself. I've also heard from two ISVs, one major and one minor, who are both in the IoT ecosystem, one of whom is part of the GE ecosystem, that Predix is a mess. It's analysis paralysis. It's not that they don't have talent, it's just that they're not getting shit done.
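The instance/master split George and David describe can be sketched as two cooperating models pushed down to the edge. The wear formula and the environment factor are purely illustrative numbers, not anything from GE's or PTC's actual products:

```python
# Sketch of the instance/master idea: a "master" model captures behavior
# common to every robotic machine tool in the fleet, while each "instance"
# model holds corrections learned from one machine's environment
# (humidity, sand). Both are combined next to the device for the
# low-latency decision.
class MasterModel:
    # Fleet-wide behavior: nominal wear per operating hour.
    def predicted_wear(self, hours):
        return 0.010 * hours

class InstanceModel:
    # Per-machine correction learned from that machine's own telemetry,
    # e.g. high humidity accelerating wear.
    def __init__(self, environment_factor):
        self.environment_factor = environment_factor

    def adjust(self, base_wear):
        return base_wear * self.environment_factor

def edge_estimate(master, instance, hours):
    # The edge runs both models together: class behavior, then the
    # instance-specific adjustment.
    return instance.adjust(master.predicted_wear(hours))

humid_line = InstanceModel(environment_factor=1.4)
print(edge_estimate(MasterModel(), humid_line, 100))  # 0.010 * 100 * 1.4
```

In the scenario described, training for both models happens in the cloud, and updated versions of each are pushed back down; the sketch only shows the inference side.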
Anyway, the key thing now is, when you get all this - >> David: Just from what I learned when I went to the GE event recently, they're aware of that requirement. They've actually already got some subparts of Predix which they can put in the cloud, but there needs to be more of it, and they're aware of that. >> George: As usual, just another reason I need a red phone hotline to David for any and all questions I have. >> David: Flattery will get you everywhere. >> George: All right. One of the key takeaways, not the action item, but the takeaway for a customer, is when you get these data feedback loops reinforcing each other, the instances of, say, the robotic machine tools to the master, then the instance to the assembly line to the factory, when all that is being orchestrated and all the data is continually enhancing the models, as well as the manual process of adding contextual information or new levels of structure, that's when you're on the increasing-returns sort of curve that really contributes to sustaining competitive advantage. Remember, think of how, when Google started off on search, it wasn't just their algorithm, it was collecting data about which links you picked, in which order, and how long you were there, that helped them reinforce the search rankings. They got so far ahead of everyone else that even if others had those algorithms, they didn't have that data to help refine the rankings. You get this same process going when you essentially have your ecosystem of learning models across the enterprise all orchestrating. This sounds like motherhood and apple pie, and there are going to be a lot of challenges to getting there, and I haven't gotten all the warts from having gone through and talked to a lot of customers who've gotten the arrows in the back, but that's the theoretical, really cool end point, or position, where the entire company becomes a learning organization from these feedback loops.
I want to, now that we're in the edit process on the overall digital twin report, I do want to do a follow-up on IBM's approach. Hopefully we can do it both as a report and then as a version for SiliconANGLE, because that thing I wrote on Cloudera got the immediate attention of Cloudera and Amazon, and hopefully we can both provide client proprietary value-add but also the public impact stuff. That's my high level. >> This is fascinating. If you're the chief of data science, for example, in a large industrial company, having the ability to compile digital twins of all your edge devices can be extraordinarily valuable, because then you can use that data to do more fine-grained segmentation of the different types of edge devices based on their behavior and their state under various scenarios. Basically, your team of data scientists can then begin to identify the extent to which they need to write different machine learning models that are tuned to the specific requirements or status or behavior of different endpoints. What I'm getting at is, ultimately, you're going to have 10 zillion different categories of edge devices performing in various scenarios. They're going to be driven by an equal variety of machine learning, deep learning, AI, and all that. All that has to be built up by your data science team in some coherent architecture, where there might be a common canonical template that all devices, all the algorithms and so forth on those devices, are being built from. Each of those algorithms will then be tweaked to the specific digital twin's profile of each device, is what I'm getting at.
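The per-segment tuning idea above can be illustrated with a deliberately tiny sketch: a single hand-rolled threshold split standing in for real clustering. The device IDs, the vibration feature, and the cutoff are all invented:

```python
# Illustrative sketch of behavioral segmentation: group edge devices by a
# behavioral feature so each segment can get its own tuned model variant.
# A real pipeline would cluster on many features; one threshold on one
# feature is enough to show the shape.
def segment_devices(devices, vibration_cutoff=0.5):
    """devices: list of (device_id, avg_vibration) pairs."""
    segments = {"smooth": [], "rough": []}
    for device_id, vibration in devices:
        key = "rough" if vibration > vibration_cutoff else "smooth"
        segments[key].append(device_id)
    return segments

fleet = [("dev-1", 0.1), ("dev-2", 0.8), ("dev-3", 0.4), ("dev-4", 0.9)]
print(segment_devices(fleet))
# Each resulting segment would then map to its own tweaked model variant,
# built from the common canonical template described above.
```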
>> George: That's a great point that I didn't bring up, which is, for folks who remember object-oriented programming, not that I ever was able to write a single line of code, but the idea is, go into this robotic machine tool, and you can inherit a couple of essentially component objects that can also be used in slightly different models. Let's say in this machine tool there's a model for a spinning device, I forget what it's called. Like a drive shaft. That drive shaft can be in other things as well. Eventually you can compose these twins, even instances of a twin, out of essentially component models. ThingWorx does this. I don't know if GE does this. I don't think IBM does. The interesting thing about IBM is, their go-to-market really influences their approach to this, which is, they have this huge industry solutions group and then obviously the global business services group. These guys are all custom development and domain experts, so they'll go in, they're literally working with Airbus with the goal of building a model of a particular airliner. Right now I think they're doing the de-icing subsystem, I don't even remember on which model. In other words, they're helping to create this bespoke thing, and that's what actually gets them into trouble with potentially channel conflict, or maybe it's more competitor conflict, because Airbus is not going to be happy if they take their learnings and go work with Boeing next. Whereas with PTC and ThingWorx, at least their professional services arm, they treat this much more like the implementation of a packaged software product, and all the learnings stay with the customer. >> Very good. >> Dave: I've got a question, George.
In terms of the industrial design and engineering aspect of building products, you mentioned PTC, which has been in the CAD business and the engineering software business for 50 years, and ANSYS and folks like that who do the simulation of industrial products, or any kind of a product that gets built. Is there a natural starting point for digital twin coming out of that area? That is, the vice president of engineering would be a key target for this kind of thinking. >> George: Great point. I think PTC is closely aligned with Teradata, and their attitude is, hey, if it's not captured in the CAD tool, then you're just hand-waving, because you won't have a high fidelity twin. >> Dave: Yeah, it's a logical starting point for any mechanical kind of device. What's a thing built to do, and what's it built like? >> George: Yeah, if it's something that was designed in a CAD tool, yes, but if it's something that was not, then you start having to build it up in a different way. I'm trying to remember, but IBM did not look like they had something that was definitely oriented around CAD. Theirs looked like it was more that the knowledge graph was the core glue that pulled all the structure and behavior together. Again, that was a reflection of their product line, which doesn't have a CAD tool, and the fact that they're doing these really, really, really bespoke twins. >> Dave: It strikes me that, from the industrial design and engineering area, it's really the individual product that's the focus. That's one part of the map. The dynamic you're pointing at, there are lots of other elements of the map in terms of an operational, a business process. That might be the fleet of wind turbines or the fleet of trucks. How they behave collectively. There are lots of different entry points.
I'm just trying to grapple with, doesn't the CAD area, the engineering area, at least for hard products, have an obvious starting point for users to begin to look at this? The VP of Engineering needs to be on top of this stuff. >> George: That's a great point that I didn't bring up, which is, a guy at Microsoft who was the CTO in their IT organization gave me an example: you have a pipeline that's 1,000 miles long. It's got 10,000 valves in it, but you're not capturing the CAD design of the valve; you just put in a really simple model that measures pressure, temperature, and leakage or something. You string 10,000 of those together into an overall model of the pipeline. That is a low fidelity thing, but that's all they need to start with. Then they can see, when they're doing maintenance, or when the flow-through is higher, what the impact is on each of the different valves or flanges or whatever. It doesn't always have to start with super high fidelity. It depends on what you're optimizing for. >> Dave: It's funny. I had a conversation years ago with a guy from the engineering firm MacNeal-Schwendler, if you remember those folks. He was telling us that about 30 to 40 years ago, when they were doing computational fluid dynamics, they were doing one-dimensional computational fluid dynamics, if you can imagine that. Then they were able, because of the compute power or whatever, to get to two-dimensional computational fluid dynamics, and finally they got to three-dimensional, and they're looking at four and five dimensions as well. I guess what I'm saying is, in that pipeline example, the way that they build that thing, or the way that they manage that pipeline, the one-dimensional model of a valve is good enough, but over time, maybe a two- or three-dimensional one is going to be better. >> George: That's why I say that this is a journey that's got to take a decade or more. >> Dave: Yeah, definitely. >> Take the example of an airplane.
The old joke is, it's six million parts flying in close formation. It's going to be a while before you fit that in one model. >> Dave: Got it. Yes. Right on. When you have that model, that's pretty cool. All right guys, we're about out of time. I need a little time to prep for my next meeting, which is in 15 minutes, but final thoughts. Do you guys feel like this was useful in terms of guiding things that you might be able to write about? >> George: Hugely. This is hugely more valuable than anything we've done as a team. >> Jim: This is great, I learned a lot. >> Dave: Good. Thanks, you guys. This has been recorded. It's up on the cloud, and I'll figure out how to get it to Peter, and we'll go from there. Thanks, everybody. (closing thank-yous)
Rob Thomas, IBM | BigDataNYC 2016
>> Narrator: Live from New York, it's the Cube. Covering Big Data New York City 2016. Brought to you by headline sponsors: Cisco, IBM, Nvidia, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and Jeff Frick. >> Welcome back to New York City, everybody. This is the Cube, the worldwide leader in live tech coverage. Rob Thomas is here, he's the GM of products for IBM Analytics. Rob, always good to see you, man. >> Yeah, Dave, great to see you. Jeff, great to see you as well. >> You too, Rob. World traveler. >> Been all over the place, but good to be here, back in New York, close to home for one day. (laughs) >> Yeah, at least a day. So the whole community is abuzz with this article that hit. You wrote it last week. It hit NewCo Shift, I guess just today or yesterday: The End of Tech Companies. >> Rob: Yes. >> All right, and you've got some really interesting charts in there, you've got some ugly charts. You've got HDP, you've got, let's see... >> Rob: You've got Imperva. >> Teradata, Imperva. >> Rob: Yes. >> Not looking pretty. We talked about this last year, just about a year ago. We said, the nose of the plane is up. >> Yep. >> Dave: But the planes are losing altitude. >> Yep. >> Dave: And when the funding dries up, look out. Interestingly, some companies still are getting funding, so this makes for rip currents. But in general, it's not pretty for pure play Hadoop companies. >> Right. >> Dave: Something that you guys predicted, a long time ago, I guess. >> So I think there's a macro trend here, and this is really, I did a couple months of research, and this is what went into that End of Tech Companies post. And it's interesting, you look at the stock market today: the five highest valued companies are all what we would call tech companies. And that's not a coincidence.
The reality is, I think we're getting past the phase of there being tech companies; tech is becoming the default, and either you're going to be a tech company, or you're going to be extinct. I think that's the MO that every company has to operate with, whether you're a retailer, or in healthcare, or insurance, or banking, it doesn't matter. If you don't become a tech company, you're not going to be a company. That's what I was getting at. And so some of the pressures I was highlighting were, I think what's played out in enterprise software is what will start to play out in other traditional industries over the next five years.
And I think this kind of strikes at the nerve that says, one, you have to make this transition, and then I go into the article with some specific things that I think every company has to be doing to make this transition. It starts with, you've got to rethink your capital structure, because the investments you made, the distribution model that you had that got you here, are not going to be sufficient for the future. You have to rethink the tools that you're utilizing and the workforce, because you're going to have to adopt a new way to work. And that starts at the top, by the way. And so I go through a couple different suggestions of what I think companies should look at to make this transition, and I guess what scares me is, I visit companies all over the world, and I see very few companies making these kinds of moves. 'Cause it's a major shake-up to culture, it's a major shake-up to how they run their business, and, you know, I use the Warren Buffett quote, "When the tide goes out, you can see who's been swimming naked." The tide may go out pretty soon here, you know, it'll be in the next five years, and I think you're going to see a lot of companies that thought they could never be threatened by tech, if you will, go the wrong way because they're not making those moves now.
One thing I say in the article is I think we're on the cusp of the great reskilling, which is, you take all the traditional IT jobs, I think over the next decade half those jobs probably go away, but they're replaced by a new set of capabilities around data science and machine learning, and advanced analytics, things that are leveraging cognitive capabilities, but doing it with human focus as well. And so, you're going to see a big shift in skills. This is why we're partnering with companies like Galvanize, I saw Jim Deters when I was walking in. Galvanize is at the forefront of helping companies do that reskilling. We want to help them do that reskilling as well, and we're going to provide them a platform that automates the process of doing a lot of these analytics. That's what the new project Dataworks, the new Watson project is all about, is how we begin to automate what have traditionally been very cumbersome and difficult problems to solve in an organization, but we're helping clients that haven't done that reskilling yet, we're helping them go ahead and get an advantage through technology. >> Rob, I want to follow up too on that concept on the capital markets and how this stuff is measured, because as you pointed out in your article, valuations of the top companies are huge. That's not a multiple of data right now. We haven't really figured that out, and it's something that we're looking at, the Wikibon team is how do you value the data from what used to be liability 'cause you had to put it on machines and pay for it. Now it's really the driver, there's some multiple of data value that's driving those top-line valuations that you point out in that article. 
>> You know it's interesting, and nobody has really figured that out, 'cause you don't see it showing up, at least I don't think, in any stock prices, maybe CoStar would be one example where it probably has, they've got a lot of data around commercial real estate, that one sticks out to me, but I think about in the current era that we're in there's three ways to drive competitive advantage: one is economies of scale, low-cost manufacturing; another is through network effects, you know, a number of social media companies have done that well; but third is, machine learning on a large corpus of data is a competitive advantage. If you have the right data assets and you can get better answers, your models will get smarter over time, how's anybody going to catch up with you? They're not going to. So I think we're probably not too far from what you say, Jeff, which is companies starting to be looked at as a value of their data assets, and maybe data should be on the balance sheet. >> Well that's what I'm saying, eventually does it move to the balance sheet as something that you need to account for? Because clearly there's something in the Apple number, in the Alphabet number, in the Microsoft number, that's more than regular. >> Exactly, it's not just about, it's not just about the distribution model, you know, large companies for a long time, certainly in tech, we had a huge advantage because of distribution, our ability to get to other countries face to face, but as the world has moved to the Internet and digital sales and try/buy, it's changed that. Distribution can still be an advantage, but is no longer the advantage, and so companies are trying to figure out what are the next set of assets? It used to be my distribution model, now maybe it's my data, or perhaps it's the insight that I develop from the data. That's really changed. >> Then, in the early days of the sort of big data meme taking off, people would ask, OK, how can I monetize the data? 
As opposed to what I think they're really asking is, how could I use data to support making money? >> Rob: Right. Right. >> And that's something a lot of people I don't think really understood, and it's starting to come into focus now. And then, once you figure that out, you can figure out what data sources, and how to get quality in that data and enrich that data and trust that data, right? Is that sort of a logical sequence that companies are now going through? >> It's an interesting observation, because you think about it, the companies that were early on in purely monetizing data, companies like Dun & Bradstreet come to mind, Nielsen come to mind, they're not the super-fast-growing companies today. So it's kind of like, there was an era where data monetization was a viable strategy, and there's still some of that now, but now it's more about, how do you turn your data assets into a new business model? There was actually a great, new Clay Christensen article, it was published I think last week, talking about companies need to develop new business models. We're at the time, everybody's kind of developed in, we sell hardware, we sell software, we sell services, or whatever we sell, and his point was now is the time to develop a new business model, and those will, now my view, those will largely be formed on the basis of data, so not necessarily just monetizing the data, to your point, Dave, but on the basis of that data. >> I love the music industry, because they're always kind of out at the front of this evolving business model for digital assets in this new world, and it keeps jumping, right? It jumped, it was free, then people went ahead and bought stuff on iTunes, now Spotify has flipped it over to a subscription model, and the innovation of change in the business model, not necessarily the products that much, it's very different. The other thing that's interesting is just that digital assets don't have scarcity, right? >> Rob: Right. 
>> There's scarcity around the data, but not around the assets, per se. So it's a very different way of thinking about distribution and kind of holding back, how do you integrate with other people's data? It's not, not the same. >> So think about, that's an interesting example, because think about the music, there's a great documentary on Netflix about Tower Records, and how Tower Records went through the big spike and now is kind of, obviously no longer really around. Same thing goes for the Blockbusters of the world. So they got disrupted by digital, because their advantage was a distribution channel that was in the physical world, and that's kind of my assertion in that post about the end of tech companies is that every company is facing that. They may not know it yet, but if you're in agriculture, and your traditional dealer network is how you got to market, whether you know it or not, that is about to be disrupted. I don't know exactly what form that will take, but it's going to be different. And so I think every company to your point on, you know, you look at the music industry, kind of use it as a map, that's an interesting way to look at a lot of industries in terms of what could play out in the next five years. >> It's interesting that you say though in all your travels that people aren't, I would think they would be clamoring, oh my gosh, I know it's coming, what do I do, 'cause I know it's coming from an angle that I'm not aware of as opposed to, like you say a lot of people don't see it coming. You know, it's not my industry. Not going to happen to me. >> You know it's funny, I think, I hear two, one perception I hear is, well, we're not a tech company so we don't have to worry about that, which is totally flawed. Two is, I hear companies that, I'd say they use the right platitudes: "We need to be digital." OK, that's great to say, but are you actually changing your business model to get there? Maybe not. 
So I think people are starting to wake up to this, but it's still very much in its infancy, and some people are going to be left behind. >> So the tooling and the new way to work are sort of intuitive. What about capital structure? What's the implication to capital structures, how do you see that changing? >> So it's a few things. One is, you have to relook at where you're investing capital today. The majority of companies are still investing in what got them to where they are versus where they need to be. So you need to make a very conscious shift, and I use the old McKinsey model of horizon one, two and three, but I insert the idea that there should be a horizon zero, where you really think about what are you really going to start to just outsource, or just altogether stop doing, because you have to aggressively shift your investments to horizon two, horizon three, you've really got to start making bets on the future, so that's one is basically a capital shift. Two is, to attract this new workforce. When I talked about the great reskilling, people want to come to work for different reasons now. They want to come to work, you know, to work in the right kind of office in the right location, that's going to require investment. They want a new comp structure, they're no longer just excited by a high base salary like, you know, they want participation in upside, even if you're a mature company that's been around for 50 years, are you providing your employees meaningful upside in terms of bonus or stock? Most companies say, you know, we've always reserved that stuff for executives. That's not, there's too many other companies that are providing that as an alternative today, so you have to rethink your capital structure in that way. So it's how you spend your money, but also, you know, as you look at the balance sheet, how you actually are, you know, I'd say spreading money around the company, and I think that changes as well. 
>> So how does this all translate into how IBM behaves, from a product standpoint? >> We have changed a lot of things in IBM. Obviously we've made a huge move towards what we think is the future, around artificial intelligence and machine learning with everything that we've done around the Watson platform. We've made huge capital investments in our cloud capability all over the world, because that is an arms race right now. We've made a huge change in how we're hiring, we're rebuilding offices, so we put an office in Cambridge, downtown Boston. Put an office here in New York downtown. We're opening the office in San Francisco very soon. >> Jeff: The Sparks Center downtown. >> Yeah. So we've kind of come to urban areas to attract this new type of skill 'cause it's really important to us. So we've done it in a lot of different ways. >> Excellent. And then tonight we're going to hear more about that, right? >> Rob: Yes. >> You guys have a big announcement tonight? >> Rob: Big announcement tonight. >> Ritika was on, she showed us a little bit about what's coming, but what can you tell us about what we can expect tonight? >> Our focus is on building the first enterprise platform for data, which is steeped in artificial intelligence. First time you've seen anything like it. You think about it, the platform business model has taken off in some sectors. You can see it in social media, Facebook is very much a platform. You can see it in entertainment, Netflix is very much a platform. There hasn't really been a platform for enterprise data and IP. That's what we're going to be delivering as part of this new Watson project, which is Dataworks, and we think it'll be very interesting. Got a great ecosystem of partners that will be with us at the event tonight, that're bringing their IP and their data to be part of the platform. It will be a unique experience. 
>> What do you, I know you can't talk specifics on M&A, but just in general, in concept, in terms of all the funding, we talked last year at this event how the whole space was sort of overfunded, overcrowded, you know, and something's got to give. Do you feel like there's been, given the money that went in, is there enough innovation coming out of the Hadoop big data ecosystem? Or is a lot of that money just going to go poof? >> Well, you know, we're in an interesting time in capital markets, right? When you loan money and get back less than you loan, because interest rates are negative, it's almost, there's no bad place to put money. (laughing) Like you can't do worse than that. But I think, you know the Hadoop ecosystem, I think it's played out about like we envisioned, which is it's becoming cheap storage. And I do see a lot of innovation happening around that, that's why we put so much into Spark. We're now the number one contributor around machine learning in the Spark project, which we're really proud of. >> Number one. >> Yes, in terms of contributions over the last year. Which has been tremendous. And in terms of companies in the ecos-- look, there's been a lot of money raised, which means people have runway. I think what you'll see is a lot of people that try stuff, it doesn't work out, they'll try something else. Look, there's still a lot of great innovation happening, and as much as it's the easiest time to start a company in terms of the cost of starting a company, I think it's probably one of the hardest times in terms of getting time and attention and scale, and so you've got to be patient and give these bets some time to play out. >> So you're still sanguine on the future of big data? Good. When Rob turns negative, then I'm concerned. >> It's definitely, we know the endpoint is going to be massive data environments in the cloud, instrumented, with automated analytics and machine learning. 
That's the future, Watson's got a great headstart, so we're proud of that. >> Well, you've made bets there. You've also, I mean, IBM, obviously great services company, for years services led. You're beginning to automate a lot of those services, package a lot of those services into industry-specific software and other SaaS products. Is that the future for IBM? >> It is. I mean, I think you need it two ways. One is, you need domain solutions, verticalized, that are solving a specific problem. But underneath that you need a general-purpose platform, which is what we're really focused on around Dataworks, is providing that. But when it comes to engaging a user, if you're not engaging what I would call a horizontal user, a data scientist or a data engineer or developer, then you're engaging a line-of-business person who's going to want something in their lingua franca, whether that's wealth management and banking, or payer underwriting or claims processing in healthcare, they're going to want it in that language. That's why we've had the solutions focus that we have. >> And they're going to want that data science expertise to be operationalized into the products. >> Rob: Yes. >> It was interesting, we had Jim on and Galvanize and what they're doing. Sharp partnership, Rob, you guys have, I think made the right bets here, and instead of chasing a lot of the shiny new toys, you've sort of thought ahead, so congratulations on that. >> Well, thanks, it's still early days, we're still playing out all the bets, but yeah, we've had a good run here, and look forward to the next phase here with Dataworks. >> Alright, Rob Thomas, thanks very much for coming on the Cube. >> Thanks guys, nice to see you. >> Jeff: Appreciate your time today, Rob. >> Alright, keep it right there, everybody. We'll be back with our next guest right after this. This is the Cube, we're live from New York City, right back. (electronic music)
Ed Albanese - Hadoop World 2011 - theCUBE
>>Ed, welcome to the Cube. All right, Thanks guys. Good >>To see you. Thanks. Good to see you as well, >>John. Okay. Ed runs biz dev for Cloudera, industry veteran, worked at VMware. Ed, gotten to know you the past year. You guys have been doing great. What a difference one year makes, right? I mean, absolutely. Tell us, just let's start it off with what's happened in a year. I mean, you know, here at Hadoop World, Cloudera, the ecosystem. Just give us your view of your perspective of what a difference one year makes. >>I think more than double is probably the, the fastest answer I could give you, which is, I mean, even looking around at the conference, it itself is literally double from what it was last year. But in terms of the number of partners that have entered the market and really decided to work with, with Cloudera, but also in general, just the scope and size of the ecosystem itself, investors from every angle. You've got really well-branded marquee companies like Oracle coming into the mix and saying, Hey, Hadoop is the, is the real deal and we need to invest here. Marquee companies like IBM and EMC also doing the same. And of course, you know, as a result, you know, lots and lots of customer interest in the technology. And Cloudera's been fortunate to have been in the market early and really made the right investments with the right team. And so we're able to serve a lot of those customer needs. So it's been really, it's been a fantastic year for the company. >>So we had a great day yesterday with Cloudera. We had Kirk on, we had Amr on twice, who by the way went viral with his Modern Warfare review, but we had Jeff Hammerbacher on, so we had pretty much the brain trust, Mike Olson. Yep. The brain trust of Cloudera. So we talked about the risk factors for Cloudera. Obviously you guys are number one, you've kind of had an untouchable lead and then all of a sudden, boom, competition. So Mike talked about that. 
So the strategy and the product side, they addressed, you're on the, the biz dev side, so you know, when you were number one, everyone wants to stand next to you and your phone rings off the hook from tier one partners all the way down to anyone just getting in the business who wants a big data strategy. On the execution now, what are you guys doing right now to, to continue your lead on the, on the sales, marketing, biz dev side? I mean, I know you've got the partner program, but what's your strategy, how do you continue in that lead? >>The, the beautiful thing is honestly, our strategy hasn't changed at all. And I know that might sound counterintuitive, but we started off with a, a really crisp vision. And what we wanna do is create a very attractive platform for partners. And, and, you know, one of the core, you know, sort of corporate strategy edicts for Cloudera is a recognition that at the end of the day, the platform itself, Hadoop, is an input into a solution. And Cloudera is not likely to deliver the complete solution to market. Instead, it's going to be companies like Dell, for example, or it's going to be companies on the, on the ISV side like Informatica, which are gonna deliver not only a base platform, but also the BI or analytics or data integration technologies on top. And as a result, what we've done is we've really focused in on creating a very attractive platform for vendors to build on. 
And you know, I mean, I think you, you may have familiarity with exactly what CDH is, but for the sake of the audience here, what I'd like to do is say, say, first off, you know, first and foremost this is a hundred percent free in Apache license open source. But more importantly, it is everything that we build on the platform, meaning it's completely full featured. >>We put all of that out in the open. There's no turbo version of Hadoop that we've got hiding in the closet for our, our four pay customers. We're absolutely making investment. But I think, you know, when you think about it from the vendor perspective, and that's my bias. So I always think about, I treat all of the potential partners as really my customer. And when you think about it from that perspective, the things that matter most to vendors, number one, transparency. They need to understand exactly what our business model is, where we plan to make money and where we plan, don't make money. They need to know what we're really good at developing and what we're not so good at developing. And sort of where we draw the, the boundaries around that investment. I think, you know, a testament to that, for example, is tomorrow we're hosting a partner summit. >>So after this event, there are gonna be over 60 individuals, but they max two per per vendor. So we're gonna have over 35 vendors attending this event. And what they're gonna hear from is our entire management team is as deeply as we can and as open as we can. And you know, it, it's, it's, it's funny, you know, I think I saw this article in Forbes the other day about Cloudera. It was this, the title of the article was something like Spies Like Us. And it it, and it, what it highlighted was that some, some competitor of Cloudera had actually hired a, a, a competitive intelligence agency to go on and, and try to engage with, you know, and, and try to learn more about Cloudera. And so they went on to Cora, which we have a lot of active engineers on Cora. 
And they, you know, they went out and they asked a bunch of product related questions to our to, to someone on Cora. And our engineers immediately responded and they started being very transparent, completely open to what, what they're building and why they're building it. And the article basically summarized to say, Hey, you know what, you know, clearly some people aren't all that sophisticated in figuring out, you know, who they're talking to. And it's really important to do that. And they got the absolute wrong conclusion. Our engineers are actually encouraged and in fact rewarded for being extremely transparent in the market because we believe that it's transparency will ultimately allow us to be that platform vendor. >>And that's what attracts me. Jeff Hummer Bucker, who's active on core as well, he's recruiting there too. So you guys are out engaging the community. Yeah. So just let me just review, cuz this is cool that you're addressing this because Hortonworks and others, and I'll say the name Hortonworks has been pumping up the PR and creating a lot of noise around open and kind of Depositioning Cloudera. So you guys are completely open, a hundred percent Hadoop, open source, everything you build in, in every way, in every way. You have engineers building core, you've got tools and all the other stuff is being built in Cloudera then contributing into the community. >>Actually it's the other way around. We build it and the community@apache.org. So all of our technology is built@apache.org. It's, it's developed there. It's, it's, it's initially shared there. And then we have another team inside our company that pulls down bits from apache.org and then assembles them and integrates them. So it's really, it's a really key thing. And there's no, we do, we have no bits that we don't develop@apache.org that are part of cdh. So there, I mean there can be no mistake that everything that that is in CDH is everything we got. >>So CDH is free. 
>>It is free >>And every it's open source. It's open you >>Charge enterprise edition. That's the only thing that's different you guys charge >>Yeah. Which is your management console, right. >>Management >>Suite and all kinds of >>The tools. And that's not free and that's not open source. That's correct. Just to be clear. Yep. But so AER took us yesterday through, I don't know, half a dozen probably open source projects and then the one is the, the management console. And that's what you charge for, that's where you're gonna make money? >>Yeah. We, we manufacture, essentially we manufacture two products, but we sell one. So we manufacture the Quadera distribution, including Apache Duke, that's free. It's free. And then we all in open source and built it Apache and, and really heavily tested and well documented and, and, and well integrated. And then we also manufacture quadera Enterprise, which includes support and indemnities and warranties for that full featured CDH product and also includes the Quadra management suite. And >>That's a subscription. >>And that's a subscription. And so customers can, can run cdh, they can then buy and license Cloudera Enterprise and then someday if they decide they don't need Cloud Air Enterprise for whatever reason, if they're, if their team are scripting wizards and they've decided that they, you know, they don't need the extra opportunity for being able to track all of the things that Cloudier Enterprise allows 'em to, they can step off of cloud enterprise and continue to use full feature to do as they see >>Fit. So take an example of one of your partners that you announced this week. NetApp NetApp's gonna package your cdh CDH and the subscription Correct. To their, their customers. And then they're gonna let their channel either, you know, they'll pre bule it or do a reference architecture, you'll get paid for that subscription that's bundled. That's correct. Will make money off of its filers. Yes. 
And the customer gets a package solution. >>Exactly. Right. And in fact, that's another important thing that you know, is probably worth discussing, which is our go to market model. I don't know if you guys had a chance to talk with anyone yesterday on that, but I'm responsible for our channel strategy and one of the key things that we've agreed to as a, as a company is that we really are gonna go to market through channel partners. Yeah. >>We covered sgi, that was a great announcement. >>Yep, a >>Hundred percent >>As, as close as we can get. Okay. I mean that is our, he's >>Still doing the direct deals. You still have that belly to belly sales force because it's still early, right? So there's a mix of direct and indirects, not a pure >>Indirect, but as, and that's only, that's only as we're able to, until we're able to ramp up our partners fully, in which case we really want our, the current team that is working belly to belly to really support our partners. >>So all so VMware like, but I I wanted to ask >>You VMware, like NetApp, like very similar. >>Yes. Very, very NetApp. Like NetApp probably 75%, you know. Exactly. What are the similarities and differences with VMware in, in the ecosystem? You know it well, >>I do know it well. Yeah. I spent several years working at VMware and you know, I think, I mean the first and most obvious difference is that when you think, when I think about platform software in general, you know, there are a few different flavors of platform. One of the things that makes Hadoop very unique, very unique relative to other platforms is that it, not only is it Apache license, but it really is, it's dependent upon other external innovators to, to create the entire full value of the ecosystem. So, or, or you know, of the solution, right? So unlike for example, so like, let's take a platform like everyone's familiar with like Apple iTunes, right? 
What happens is Apple creates the platform and they put it kind of in the middle on top of and behind the scenes is the innovator, the app builder, he builds it, he publishes it on Apple, and then Apple controls all access to the >>Customer. Yep. >>That's not adu, right? Right. Let's take VMware or Red Hat for example. So in that case, they publish a platform they own and control the, the absolute structure and boundaries of what that platform is. And then on top of that application vendors build and then they deliver to the, the customer. But you know, at the end of the day, the, you know, the relationship really is, you know, from that external innovator straight down, and there's no, there's, you know, there's no way for them to really modify the platform. And you take kadu, which is a hundred percent Apache licensed to open source, and you really, you really open up the opportunity for vendors to take ADU as an input into their system and then deliver it straight to their customers or for customers themselves to say, I want straight up vanilla Hadoop, I'm gonna go this way and I'm gonna add on my own be app of applications. So you're, we're seeing all sorts of variants right now in the market. We're seeing software as a service being delivered that's based on Hadoop. There was a great announcement a few weeks ago from a company named Tidemark, previously known as Per Ferry, and they're taking all of cdh. They're, but they're, the customer doesn't know that they're, and what they're doing is they're delivering software as a, as a service based on adu. >>Yeah. So I mean, you know, we are psyched that you're clearing this up because obviously we're seeing, we saw all that stuff, but I really think that indirect strategy as a home run, I'm said it when we talked about the SGI thing, and it's accelerates you guys, you enable, but you know, channels is an interesting business. 
I mean, you have to have pure transparency, as you mentioned, but people need confidence, and they worry about competition. So channel conflict is always the big issue, right? Is Cloudera going to compete with us? Talk us through that strategy. Obviously the market's growing, new solutions are coming around the corner, and these guys want to make money. The channel is all about: what have you done for me today? >> Right, that is exactly right. And that's why we decided on the channel strategy specifically around our product: because we recognize that each and every potential channel partner of ours can actually innovate on top of it and create differentiation, and we're not an obstacle to that process. So we provide our platform as an input, and we're capable of managing that platform, but ultimately creating differentiation is all in the hands of our partners. We're there to help, but it gives them wide latitude. Take, for example, the differences between the Dell and NetApp solutions: they are very different reference architectures leveraging the exact same platform. >> Yeah, and they have to make money. The money-making side of it is something people don't really talk about, but channel partners' loyalty is all about who can help them make cash. >> Right. Exactly. >> What are you hearing there in terms of the ecosystem? What's the profile of your partners? Can you give us the breakdown? We know Dell and NetApp, but they're gear guys.
Another part of our strategy, and a key requirement from our customers, is to work with a whole variety of ISVs, particularly in the data management space. So you've got really marquee companies in the database space like IBM's Netezza or Teradata, you've got companies like Informatica and Talend, and you've got companies on the BI side like MicroStrategy and Tableau. These kinds of technologies are currently in play at customers that have made substantial investments, and ultimately they want to be able to continue to leverage them with whichever data platform they end up choosing. So we invest considerably there. A big part of that has been our Cloudera Connect partner program.
It's an opportunity for us to help the customer understand which technologies work well with our platform. It's also an opportunity for us to engage directly and assist the vendor. One of the things we created as part of that program is, first off, immediate discounted access to any part of our training. Second, lots of free information: access to our world-class knowledge base, and direct access to our support team. The vendors also get access to a developer portal that we created specifically for them. If you think about it this way: Hadoop gets built at apache.org, but solutions don't get built at apache.org, right? So what we're really trying to help our vendors do is develop their solutions by having really clear visibility into the API-level points of Hadoop. They're not necessarily interested in trying to figure out how MR2 works, or in contributing code to it.
But they absolutely are interested in figuring out how to run and execute their software on top of Hadoop. So when I think about the things that matter in creating an attractive platform, and at the end of the day that's what we're really trying to do, first and foremost is transparency. Second is really clear visibility into the APIs and the documentation of that platform, so there's no ambiguity, and the vendor, who is the user in this case, building a solution, can absorb all of that content really cleanly. And then ultimately, I think, it's customers: users of the technology. And our download numbers are something we're proud of. >> We're hearing good feedback. The feedback we hear from folks is: I love how they take away the complexity of handling versions and whatnot. CDH is a great bundle. The question we have for you is: what are you hearing about the other products, the ones you're actually selling? Does that create the lock-in? That's something we asked Elmer directly: is that the lock-in, and what happens when the deployments get so big? >> I don't really see an issue there. >> But that's what people are afraid of. It's more fear, and some people can use that fear and play on it. >> I think what we've seen in other markets is that management tools are ultimately interchangeable, and the only way we're going to retain a customer is by out-innovating the competition on the management side. The lock-in component, as it were, is not really part of our business model. It's very difficult to achieve with an Apache-licensed platform and a management suite that sits outside of that licensed artifact. So ultimately, if we don't out-innovate, we're going to lose. So we're working on the innovation. >> How's the hiring going? >> Oh, go ahead. >> I wanted to come back to that. You mentioned download numbers. Can you share the numbers with us?
I can't share them publicly, but what I can say is that they've been on an incredible trajectory, and month to month we continue to see really significant growth rates. >> And I had a follow-up question on the partner program. How do you manage all those partners? How do you prioritize them? The hardware vendors are pretty easy, there are a few big whales, but the ISVs... your phone, like John said, must be ringing off the hook. How do you juggle that, and can you do it better than VMware, for example? >> We handle the influx of partner interest in two ways. One, we've been relatively structured with the Cloudera Connect partner program, and we make real investments there. We have dedicated folks who are there to help, we have our engineering team feeding inputs, and we're leveraging some of the same resources we provide to our customers and feeding those directly to our partners as well. That's one way we handle it. But the other way, frankly, is customers. Having access to a real customer population helps you set priorities pretty quickly. We track inside our systems which technologies our customers use. So we know, for example, what percentage of our customer base has SAS installed and wants to use it with Hadoop, and we know what percentage of our customer base is currently running on Red Hat and which is not. Having that core visibility helps us prioritize. >> How about incentives? Obviously the channel business, as I said, is very fickle. You know the channel business; I spent almost a decade in HP's channel organization, and you have to provide soft dollars. There's a lot of blocking and tackling.
You guys are clearly building out that tier one with the SGIs of the world and other vendors, and then you've got the partner connect program for everyone else who's going to grow up into a tier one. Training, soft dollars, incentives: do you have that going yet, or is that roadmap? >> We do. And in fact, in addition to the more widely publicized relationships you see with companies like Dell and NetApp, we're actually building a very successful network of independent VARs. What we do is prioritize and select VARs based on the top-level relationships we have, because that really helps them hone in. They've got validation. For example, an organization that resells SGI has now heard loud and clear from SGI the specific platform configurations SGI is going to represent to its customers, and they ultimately want to represent those directly. And as for how we make investments: the investments we're making are ultimately in our sales org, and I'm going to lose the word "direct" from that conversation, because our sales org is being built to help our partners succeed. And I think that's where you're...
And if I'm a product manufacturer with a solution, I fold you in. I need to have that step-up. >> Absolutely, great question. Depending upon the partner we're dealing with, they like to generate their revenue in different ways. NetApp, for example, is a company with very limited services; their focus as a business is really on delivering hardware and software configured together, and they rely heavily on a services channel to fulfill. Contrast that with a company like Dell, which has a very successful services business and is really excited about having service offerings around Hadoop. So it depends upon the company. But when we talk about our VAR channel in particular, there's an internal acronym I'll share publicly here. We call our VARs "super VARs," and what makes them super, and why we've selected the organizations we're selecting right now, is that they not only can fulfill orders for hardware and software, particularly data management or infrastructure software, but they also have a services team on hand, because we recognize that there is a services opportunity with every Hadoop deployment. And we want our partners to have that. So as an organization, we're structuring our services staff to facilitate and enable our partners, not to sell directly. >> Okay, so that's the follow-up the partners will ask tomorrow: what do you want to be when you really grow up? Is it services, is it software? >> Cloudera is a software company, through and through. >> Well, he didn't say it, but we said it: it's an operating system. >> So given that, you can make money on services, right? People need services. >> And partners will make that money for us.
And early on you had to do some of that, but you've been very clear about where it's going. It's hard to make money in software when you're giving all the software away for free. >> We're not giving all the software away. >> I know you've got that piece now, but here's my question. As Hadoop goes into the enterprise, which it clearly is doing, is that bundling, like what you're doing with NetApp, really ultimately how you're going to successfully monetize your software? In other words, is the enterprise customer going to be more receptive to that solution package than, say, the fringe that has been using Hadoop for the last two or three years? >> I think there's no question about it. Look at what Cloudera Enterprise does. I don't know if you've had a chance to attend any of the sessions where Cloudera Enterprise is currently being demonstrated. >> We just had Alex Williams on the air; he did a review. >> And has it been going well? Were you impressed? >> Yeah, there's no question about it. And Alex probably hasn't seen the new version that our team is quietly working on in the background. Incredible developments. And that's really a function of having direct access to so many customers and getting so much input and feedback, and they're the kinds of customers we ultimately want to serve: real enterprises. What you get is really fast innovation from a really talented team that knows what to do. We are years ahead on the management side. Absolutely, years ahead.
And I was a guy who worked at VMware for several years, and I can tell you that while the hypervisor itself was a core component of VMware's success, the monetization strategy was very squarely around vCenter. And we're not ignorant of that. >> You can learn a lot from your VMware experience, because the market changed significantly. >> There were free hypervisors available all of a sudden. VMware itself had a free hypervisor: we had VMware Server and we had the VMware Player products, and those were all free. And they were very good technology, the best available in the market for free; in my opinion, better than anything else, open or not. >> They still are. >> They were superior products in every way. But how VMware was successful was in recognizing that in the interest of running a production environment with an SLA, you need management software. And they also built the best management software. There's no question that we understand that strategy. >> And a phenomenal ecosystem. >> There are similarities, right? And the ecosystem was in large part predicated on transparency: very clear access to the APIs, a willingness to help partners be successful with those APIs, and ultimately drawing a very tight box around what the company wanted to do and didn't want to do. >> Look, you're not going to lose friends when you make people money. That's my philosophy, right? >> I agree. >> So when you're in that business where you can come in and enable a channel, and you have options on your growth strategy, which you do, you can say: okay, bundling, or I can have this sold direct. As long as you've got the options, you can grow with that market.
So again, it's a money-making opportunity for the partnerships, but there's more than that, right? Because you mentioned Apple and iTunes; Oracle's another example. And the way you make money with Apple or Oracle is different from the way you make money with VMware, and presumably Cloudera. >> Yeah. Our strategy is: if you make this base platform easier to install, more reliable, and really rock solid from an integration standpoint, more people are going to use it. So what happens when more people use it? First, more solutions get built. When more solutions get built, more clusters get deployed. When more clusters are out there, they start to move into production, and then they need an SLA. When they need an SLA, Cloudera Enterprise gets purchased. But along that path, when those solutions got built, guess what else happened? More servers got sold, more networking gear got sold, more services got created, more operating systems got sold, more databases got data into them, more BI clients got created. The ecosystem is deep and rich, and a lot of people stand to make money. >> Hop in, people, the water's great. >> What about support? The other guys are saying: we're just going to make money on support. You guys are still doing support, right? You're selling support. >> There's no question. Cloudera Enterprise contains two things: the management suite and support. This is not uncomplicated technology, and having a world-class support team is of value; customers do want to pay for that value. But we believe that support in and of itself is not enough, and that ultimately, when you want to deliver an SLA, being able to call when you have a problem is the wrong approach.
You want to be proactive and understand the problem well in advance of it actually occurring. That's really important when, for example, a lot of our customers have a data pipeline that... >> They're building out, basically. It's new and emerging, so they're building it out. It's not just support; they need other tools. >> Yeah, and "building out" is an understatement for where some of our customers are. When you have a thousand-node cluster that you're operating, and that's mission-critical to your business, I don't think that's building out anymore. I think that's an investment in a technology that's mission-critical. And what you want when you have a mission-critical technology is to know early and often when a problem may emerge. Not: oh my gosh, we have a problem, now I need to phone a friend. Phone-a-friend is a last resort. We offer that, but that's why we don't decouple our support from our management suite. It's not about phone-a-friend; it's about understanding the operation of your cluster the entire way through, 24/7. >> And the other thing people don't talk about with support is that with open source, a lot of support gets handled in the community as well. So in a way, you're already pre-cannibalized by the community. >> By us and by others, absolutely. But, to that Forbes article I referenced earlier, you will never see our engineers trained to withhold information, under any circumstances, from anyone, free or paying. >> You don't want to hold back your business. You have nothing to hide; it's open, right? >> It's open source, it's open, and we're here to help, whether you're paying us or not. >> So the value is in that anticipatory remediation. >> Yeah.
That's what you're packaging, and thanks for clearing up the air. You're a great Cube guest; we're going to have you on more, because it's great to get the info out there. Really impressed with the channel strategy, and I love the growth strategy, Cloudera. You guys are really impressive: everything pumping on all cylinders, Kirk cranking on the business execution, and the team playing this chess match in the open. Congratulations on the financing as well. >> Oh, thank you. >> Ed from Cloudera, clearing it up here inside theCUBE. We're going to take a quick break, and we'll be right back with more video. >> Thanks, guys. >> All right.