Wikibon Research Meeting
>> Dave: The cloud. There you go. I presume that worked. >> David: Hi there. >> Dave: Hi David. We had agreed, Peter and I had talked and we said let's just pick three topics, allocate enough time. Maybe a half hour each, and then maybe a little bit longer if we have the time. Then try and structure it so we can gather some opinions on what it all means. Ultimately the goal is to have an outcome with some research that hits the network. The three topics today, Jim Kobeielus is going to present on agile and data science, David Floyer on NVMe over fabric and of course keying off of the Micron news announcement. I think Nick is, is that Nick who just joined? He can contribute to that as well. Then George Gilbert has this concept of digital twin. We'll start with Jim. I guess what I'd suggest is maybe present this in the context of, present a premise or some kind of thesis that you have and maybe the key issues that you see and then kind of guide the conversation and we'll all chime in. >> Jim: Sure, sure. >> Dave: Take it away, Jim. >> Agile development and team data science. Agile methodology obviously is well-established as a paradigm and as a set of practices in various schools in software development in general. Agile is practiced in data science in terms of development, the pipelines. The overall premise for my piece, first of all starting off with a core definition of what agile is as a methodology. Self-organizing, cross-functional teams. They sprint toward results in steps that are fast, iterative, incremental, adaptive and so forth. Specifically the premise here is that agile has already come to data science and is coming even more deeply into the core practice of data science where data science is done in team environment. It's not just unicorns that are producing really work on their own, but more to the point, it's teams of specialists that come together in co-location, increasingly in co-located environments or in co-located settings to produce (banging) weekly check points and so forth. That's the basic premise that I've laid out for the piece. The themes. First of all, the themes, let me break it out. In terms of the overall how I design or how I'm approaching agile in this context is I'm looking at the basic principles of agile. It's really practices that are minimal, modular, incremental, iterative, adaptive, and co-locational. I've laid out how all that maps in to how data science is done in the real world right now in terms of tight teams working in an iterative fashion. A couple of issues that I see as regards to the adoption and sort of the ramifications of agile in a data science context. One of which is a co-location. What we have increasingly are data science teams that are virtual and distributed where a lot of the functions are handled by statistical modelers and data engineers and subject matter experts and visualization specialists that are working remotely from each other and are using collaborative tools like the tools from the company that I just left. How can agile, the co-location work primer for agile stand up in a world with more of the development team learning deeper and so forth is being done on a scrutiny basis and needs to be by teams of specialists that may be in different cities or different time zones, operating around the clock, produce brilliant results? Another one of which is that agile seems to be predicated on the notion that you improvise the process as you go, trial and error which seems to fly in the face of documentation or tidy documentation. Without tidy documentation about how you actually arrived at your results, how come those results can not be easily reproduced by independent researchers, independent data scientists? If you don't have well defined processes for achieving results in a certain data science initiative, it can't be reproduced which means they're not terribly scientific. By definition it's not science if you can't reproduce it by independent teams. To the extent that it's all loosey-goosey and improvised and undocumented, it's not reproducible. If it's not reproducible, to what extent should you put credence in the results of a given data science initiative if it's not been documented? Agile seems to fly in the face of reproducibility of data science results. Those are sort of my core themes or core issues that I'm pondering with or will be. >> Dave: Jim, just a couple questions. You had mentioned, you rattled off a bunch of parameters. You went really fast. One of them was co-location. Can you just review those again? What were they? >> Sure. They are minimal. The minimum viable product is the basis for agile, meaning a team puts together data a complete monolithic sect, but an initial deliverable that can stand alone, provide some value to your stakeholders or users and then you iteratively build upon that in what I call minimum viable product going forward to pull out more complex applications as needed. There's sort of a minimum viable product is at the heart of agile the way it's often looked at. The big question is, what is the minimum viable product in a data science initiative? One way you might approach that is saying that what you're doing, say you're building a predictive model. You're predicting a single scenario, for example such as whether one specific class of customers might accept one specific class of offers under the constraining circumstances. That's an example of minimum outcome to be achieved from a data science deliverable. A minimum product that addresses that requirement might be pulling the data from a single source. We'll need a very simplified feature set of predictive variables like maybe two or three at the most, to predict customer behavior, and use one very well understood algorithm like linear regressions and do it. With just a few lines of programming code in Python or Aura or whatever and build us some very crisp, simple rules. That's the notion in a data science context of a minimum viable product. That's the foundation of agile. Then there's the notion of modular which I've implied with minimal viable product. The initial product is the foundation upon which you build modular add ons. The add ons might be building out more complex algorithms based on more data sets, using more predictive variables, throwing other algorithms in to the initiative like logistic regression or decision trees to do more fine-grained customer segmentation. What I'm giving you is a sense for the modular add ons and builds on to the initial product that generally weaken incrementally in the course of a data science initiative. Then there's this, and I've already used the word incremental where each new module that gets built up or each new feature or tweak on the core model gets added on to the initial deliverable in a way that's incremental. Ideally it should all compose ultimately the sum of the useful set of capabilities that deliver a wider range of value. For example, in a data science initiative where it's customer data, you're doing predictive analysis to identify whether customers are likely to accept a given offer. One way to add on incrementally to that core functionality is to embed that capability, for example, in a target marketing application like an outbound marketing application that uses those predictive variables to drive responses in line to, say an e-commerce front end. Then there's the notion of iterative and iterative really comes down to check points. Regular reviews of the standards and check points where the team comes together to review the work in a context of data science. Data science by its very nature is exploratory. It's visualization, it's model building and testing and training. It's iterative scoring and testing and refinement of the underlying model. Maybe on a daily basis, maybe on a weekly basis, maybe adhoc, but iteration goes on all the time in data science initiatives. Adaptive. Adaptive is all about responding to circumstances. Trial and error. What works, what doesn't work at the level of the clinical approach. It's also in terms of, do we have the right people on this team to deliver on the end results? A data science team might determine mid-way through that, well we're trying to build a marketing application, but we don't have the right marketing expertise in our team, maybe we need to tap Joe over there who seems to know a little bit about this particular application we're trying to build and this particular scenario, this particular customers, we're trying to get a good profile of how to reach them. You might adapt by adding, like I said, new data sources, adding on new algorithms, totally changing your approach for future engineering as you go along. In addition to supervised learning from ground troops, you might add some unsupervised learning algorithms to being able to find patterns in say unstructured data sets as you bring those into the picture. What I'm getting at is there's a lot, 10 zillion variables that, for a data science team that you have to add in to your overall research plan going forward based on, what you're trying to derive from data science is its insights. They're actionable and ideally repeatable. That you can embed them in applications. It's just a matter of figuring out what actually helps you, what set of variables and team members and data and sort of what helps you to achieve the goals of your project. Finally, co-locational. It's all about the core team needs to be, usually in the same physical location according to the book how people normally think of agile. The company that I just left is basically doing a massive social engineering exercise, ongoing about making their marketing and R&D teams a little more agile by co-locating them in different cities like San Francisco and Austin and so forth. The whole notion that people will collaborate far better if they're not virtual. That's highly controversial, but none-the-less, that's the foundation of agile as it's normally considered. One of my questions, really an open question is what hard core, you might have a sprawling team that's doing data science, doing various aspects, but what solid core of that team needs to be physically co-located all or most of the time? Is it the statistical modeler and a data engineer alone? The one who stands up how to do cluster and the person who actually does the building and testing of the model? Do the visualization specialists need to be co-located as well? Are other specialties like subject matter experts who have the knowledge in marketing, whatever it is, do they also need to be in the physical location day in, day out, week in and week out to achieve results on these projects? Anyway, so there you go. That's how I sort of appealed the argument of (mumbling). >> Dave: Okay. I got a minimal modular, incremental, iterative, adaptive, co-locational. What was six again? I'm sorry. >> Jim: Co-locational. >> Dave: What was the one before that? >> Jim: I'm sorry. >> Dave: Adaptive. >> Minimal, modular, incremental, iterative, adaptive, and co-locational. >> Dave: Okay, there were only six. Sorry, I thought it was seven. Good. A couple of questions then we can get the discussion going here. Of course, you're talking specifically in the context of data science, but some of the questions that I've seen around agile generally are, it's not for everybody, when and where should it be used? Waterfalls still make sense sometimes. Some of the criticisms I've read, heard, seen, and sometimes experienced with agile are sort of quality issues, I'll call it lack of accountability. I don't know if that's the right terminology. We're going for speed so as long as we're fast, we checked that box, quality can sacrifice. Thoughts on that. Where does it fit and again understanding specifically you're talking about data science. Does it always fit in data science or because it's so new and hip and cool or like traditional programming environments, is it horses for courses? >> David: Can I add to that, Dave? It's a great, fundamental question. It seems to me there's two really important aspects of artificial intelligence. The first is the research part of it which is developing the algorithms, developing the potential data sources that might or might not matter. Then the second is taking that and putting it into production. That is that somewhere along the line, it's saving money, time, etc., and it's integrated with the rest of the organization. That second piece is, the first piece it seems to be like most research projects, the ROI is difficult to predict in a new sort of way. The second piece of actually implementing it is where you're going to make money. Is agile, if you can integrate that with your systems of record, for example and get automation of many of the aspects that you've researched, is agile the right way of doing it at that stage? How would you bridge the gap between the initial development and then the final instantiation? >> That's an important concern, David. Dev Ops, that's a closely related issue but it's not exactly the same scope. As data science and machine learning, let's just net it out. As machine learning and deep learning get embedded in applications, in operations I should say, like in your e-commerce site or whatever it might be, then data science itself becomes an operational function. The people who continue to iterate those models in line the operational applications. Really, where it comes down to an operational function, everything that these people do needs to be documented and version controlled and so forth. These people meaning data science professionals. You need documentation. You need accountability. The development of these assets, machine learning and so forth, needs to be, is compliance. When you look at compliance, algorithmic accountability comes into it where lawyers will, like e-discovery. They'll subpoena, theoretically all your algorithms and data and say explain how you arrived at this particular recommendation that you made to grant somebody or not grant somebody a loan or whatever it might be. The transparency of the entire development process is absolutely essential to the data science process downstream and when it's a production application. In many ways, agile by saying, speed's the most important thing. Screw documentation, you can sort of figure that out and that's not as important, that whole pathos, it goes by the wayside. Agile can not, should not skip on documentation. Documentation is even more important as data science becomes an operational function. That's one of my concerns. >> David: I think it seems to me that the whole rapid idea development is difficult to get a combination of that and operational, boring testing, regression testing, etc. The two worlds are very different. The interface between the two is difficult. >> Everybody does their e-commerce tweaks through AB testing of different layouts and so forth. AB testing is fundamentally data science and so it's an ongoing thing. (static) ... On AB testing in terms of tweaking. All these channels and all the service flow, systems of engagement and so forth. All this stuff has to be documented so agile sort of, in many ways flies in the face of that or potentially compromises the visibility of (garbled) access. >> David: Right. If you're thinking about IOT for example, you've got very expensive machines out there in the field which you're trying to optimize true put through and trying to minimize machine's breaking, etc. At the Micron event, it was interesting that Micron's use of different methodologies of putting systems together, they were focusing on the data analysis, etc., to drive greater efficiency through their manufacturing process. Having said that, they need really, really tested algorithms, etc. to make sure there isn't a major (mumbling) or loss of huge amounts of potential revenue if something goes wrong. I'm just interested in how you would create the final product that has to go into production in a very high value chain like an IOT. >> When you're running, say AI from learning algorithms all the way down to the end points, it gets even trickier than simply documenting the data and feature sets and the algorithms and so forth that were used to build up these models. It also comes down to having to document the entire life cycle in terms of how these algorithms were trained to make the predictors of whatever it is you're trying to do at the edge with a particular algorithm. The whole notion of how are all of these edge points applications being trained, with what data, at what interval? Are they being retrained on a daily basis, hourly basis, moment by moment basis? All of those are critical concerns to know whether they're making the best automated decisions or actions possible in all scenarios. That's like a black box in terms of the sheer complexity of what needs to be logged to figure out whether the application is doing its job as best a possible. You need a massive log, you need a massive event log from end to end of the IOT to do that right and to provide that visibility ongoing into the performance of these AI driven edge devices. I don't know anybody who's providing the tool to do it. >> David: If I think about how it's done at the moment, it's obviously far too slow at the moment. At the same time, you've got to have some testing and things like that. It seems to me that you've got a research model on one side and then you need to create a working model from that which is your production model. That's the one that goes through the testing and everything of that sort. It seems to me that the interface would be that transition from the research model to the working model that would be critical here and the working model is obviously a subset and it's going to be optimized for performance, etc. in real time, as opposed to the development model which can be a lot to do and take half a week to manage it necessary. It seems to me that you've got a different set of business pressures on the working model and a different set of skills as well. I think having one team here doesn't sound right to me. You've got to have a Dev Ops team who are going to take the working model from the developers and then make sure that it's sound and save. Especially in a high value IOT area that the level of iteration is not going to be nearly as high as in a lower cost marketing type application. Does that sound sensible? >> That sounds sensible. In fact in Dev Ops, the Dev Ops team would definitely be the ones that handle the continuous training and retraining of the working models on an ongoing basis. That's a core observation. >> David: Is that the right way of doing it, Jim? It seems to me that the research people would be continuing to adapt from data from a lot of different places whereas the operational model would be at a specific location with a specific IOT and they wouldn't have necessarily all the data there to do that. I'm not quite sure whether - >> Dave: Hey guys? Hey guys, hey guys? Can I jump in here? Interesting discussion, but highly nuanced and I'm struggling to figure out how this turns into a piece or sort of debating some certain specifics that are very kind of weedy. I wonder if we could just reset for a second and come back to sort of what I was trying to get to before which is really the business impact. Should this be applied broadly? Should this be applied specifically? What does it mean if I'm a practitioner? What should I take away from, Jim your premise and your sort of fixed parameters? Should I be implementing this? Why? Where? What's the value to my organization - the value I guess is obvious, but does it fit everywhere? Should it be across the board? Can you address that? >> Neil: Can I jump in here for a second? >> Dave: Please, that would be great. Is that Neil? >> Neil: Neil. I've never been a data scientist, but I was an actuary a long time ago. When the truth actuary came to me and said we need to develop a liability insurance coverage for floating oil rigs in the North Sea, I'm serious, it took a couple of months of research and modeling and so forth. If I had to go to all of those meetings and stand ups in an agile development environment, I probably would have gone postal on the place. I think that there's some confusion about what data science is. It's not a vector. It's not like a Dev Op situation where you start with something and you go (mumbling). When a data scientist or whatever you want to call them comes up with a model, that model has to be constantly revisited until it's put out of business. It's refined, it's evaluated. It doesn't have an end point like that. The other thing is that data scientist is typically going to be running multiple projects simultaneously so how in the world are you going to agilize that? I think if you look at the data science group, they're probably, I think Nick said this, there are probably groups in there that are doing fewer Dev Ops, software engineering and so forth and you can apply agile techniques to them. The whole data science thing is too squishy for that, in my opinion. >> Jim: Squishy? What do you mean by squishy, Neil? >> Neil: It's not one thing. I think if you try to represent data science as here's a project, we gather data, we work on a model, we test it, and then we put it into production, it doesn't end there. It never ends. It's constantly being revised. >> Yeah, of course. It's akin to application maintenance. The application meaning the model, the algorithm to be fit for purpose has to continually be evaluated, possibly tweaked, always retrained to determine its predictive fit for whatever task it's been assigned. You don't build it once and assume its strong predictive fit forever and ever. You can never assume that. >> Neil: James and I called that adaptive control mechanisms. You put a model out there and you monitor the return you're getting. You talk about AB testing, that's one method of doing it. I think that a data scientist, somebody who really is keyed into the machine learning and all that jazz. I just don't see them as being project oriented. I'll tell you one other thing, I have a son who's a software engineer and he said something to me the other day. He said, "Agile? Agile's dead." I haven't had a chance to find out what he meant by that. I'll get back to you. >> Oh, okay. If you look at - Go ahead. >> Dave: I'm sorry, Neil. Just to clarify, he said agile's dead? Was that what he said? >> Neil: I didn't say it, my son said it. >> Dave: Yeah, yeah, yeah right. >> Neil: No idea what he was talking about. >> Dave: Go ahead, Jim. Sorry. >> If you look at waterfall development in general, for larger projects it's absolutely essential to get requirements nailed down and the functional specifications and all that. Where you have some very extensive projects and many moving parts, obviously you need a master plan that it all fits into and waterfall, those checkpoints and so forth, those controls that are built into that methodology are critically important. Within the context of a broad project, some of the assets being build up might be machine loading models and analytics models and so forth so in the context of our broader waterfall oriented software development initiative, you might need to have multiple data science projects spun off within the sub-projects. Each of those would fit into, by itself might be indicated sort of like an exploration task where you have a team doing data visualization, exploration in more of an open-ended fashion because while they're trying to figure out the right set of predictors and the right set of data to be able to build out the right model to deliver the right result. What I'm getting at is that agile approaches might be embedded into broader waterfall oriented development initiatives, agile data science approaches. Fundamentally, data science began and still is predominantly very smart people, PhDs in statistics and math, doing open-ended exploration of complex data looking for non-obvious patterns that you wouldn't be able to find otherwise. Sort of a fishing expedition, a high priced fishing expedition. Kind of a mode of operation as how data science often is conducted in the real world. Looking for that eureka moment when the correlations just jump out at you. There's a lot of that that goes on. A lot of that is very important data science, it's more akin to pure science. What I'm getting at is there might be some role for more structure in waterfall development approaches in projects that have a data science, core data science capability to them. Those are my thoughts. >> Dave: Okay, we probably should move on to the next topic here, but just in closing can we get people to chime in on sort of the bottom line here? If you're writing to an audience of data scientists or data scientist want to be's, what's the one piece of advice or a couple of pieces of advice that you would give them? >> First of all, data science is a developer competency. The modern developers are, many of them need to be data scientists or have a strong grounding and understanding of data science, because much of that machine learning and all that is increasingly the core of what software developers are building so you can't not understand data science if you're a modern software developer. You can't understand data science as it (garbled) if you don't understand the need for agile iterative steps within the, because they're looking for the needle in the haystack quite often. The right combination of predictive variables and the right combination of algorithms and the right training regimen in order to get it all fit. It's a new world competency that need be mastered if you're a software development professional. >> Dave: Okay, anybody else want to chime in on the bottom line there? >> David: Just my two penny worth is that the key aspect of all the data scientists is to come up with the algorithm and then implement them in a way that is robust and it part of the system as a whole. The return on investment on the data science piece as an insight isn't worth anything until it's actually implemented and put into production of some sort. It seems that second stage of creating the working model is what is the output of your data scientists. >> Yeah, it's the repeatable deployable asset that incorporates the crux of data science which is algorithms that are data driven, statistical algorithms that are data driven. >> Dave: Okay. If there's nothing else, let's close this agenda item out. Is Nick on? Did Nick join us today? Nick, you there? >> Nick: Yeah. >> Dave: Sounds like you're on. Tough to hear you. >> Nick: How's that? >> Dave: Better, but still not great. Okay, we can at least hear you now. David, you wanted to present on NVMe over fabric pivoting off the Micron news. What is NVMe over fabric and who gives a fuck? (laughing) >> David: This is Micron, we talked about it last week. This is Micron announcement. What they announced is NVMe over fabric which, last time we talked about is the ability to create a whole number of nodes. They've tested 250, the architecture will take them to 1,000. 1,000 processor or 1,000 nodes, and be able to access the data on any single node at roughly the same speed. They are quoting 200 microseconds. It's 195 if it's local and it's 200 if it's remote. That is a very, very interesting architecture which is like nothing else that's been announced. >> Participant: David, can I ask a quick question? >> David: Sure. >> Participant: This latency and the node count sounds astonishing. Is Intel not replicating this or challenging in scope with their 3D Crosspoint? >> David: 3D Crosspoint, Intel would love to sell that as a key component of this. The 3D Crosspoint as a storage device is very, very, very expensive. You can replicate most of the function of 3D Crosspoint at a much lower price point by using a combination of D-RAM and protective D-RAM and Flash. At the moment, 3D Crosspoint is a nice to have and there'll be circumstances where they will use it, but at the meeting yesterday, I don't think they, they might have brought it up once. They didn't emphasize it (mumbles) at all as being part of it. >> Participant: To be clear, this means rather than buying Intel servers rounded out with lots of 3D Crosspoint, you buy Intel servers just with the CPU and then all the Micron niceness for their NVMe and their Interconnect? >> David: Correct. They are still Intel servers. The ones they were displaying yesterday were HP1's, they also used SuperMicro. They want certain characteristics of the chip set that are used, but those are just standard pieces. The other parts of the architecture are the Mellanox, the 100 gigabit converged ethernet and using Rocky which is IDMA over converged ethernet. That is the secret sauce which allows you and Mellanox themselves, their cards have a lot of offload of a lot of functionality. That's the secret sauce which allows you to go from any point to any point in 5 microseconds. Then create a transfer and other things. Files are on top of that. >> Participant: David, Another quick question. The latency is incredibly short. >> David: Yep. >> Participant: What happens if, as say an MPP SQL database with 1,000 nodes, what if they have to shuffle a lot of data? What's the throughput? Is it limited by that 100 gig or is that so insanely large that it doesn't matter? >> David: They key is this, that it allows you to move the processing to wherever the data is very, very easily. In the principle that will evolve from this architecture, is that you know where the data is so don't move the data around, that'll block things up. Move the processing to that particular node or some adjacent node and do the processing as close as possible. That is as an architecture is a long term goal. Obviously in the short term, you've got to take things as they are. Clearly, a different type of architecture for databases will need to eventually evolve out of this. At the moment, what they're focusing on is big problems which need low latency solutions and using databases as they are and the whole end to end use stack which is a much faster way of doing it. Then over time, they'll adapt new databases, new architectures to really take advantage of it. What they're offering is a POC at the moment. It's in Beta. They had their customers talking about it and they were very complimentary in general about it. They hope to get it into full production this year. There's going to be a host of other people that are doing this. I was trying to bottom line this in terms of really what the link is with digital enablement. For me, true digital enablement is enabling any relevant data to be available for processing at the point of business engagement in real time or near real time. The definition that this architecture enables. It's a, in my view a potential game changer in that this is an architecture which will allow any data to be available for processing. You don't have to move the data around, you move the processing to that data. >> Is Micron the first market with this capability, David? NV over Me? NVMe. >> David: Over fabric? Yes. >> Jim: Okay. >> David: Having said that, there are a lot of start ups which have got a significant amount of money and who are coming to market with their own versions. You would expect Dell, HP to be following suit. >> Dave: David? Sorry. Finish your thought and then I have another quick question. >> David: No, no. >> Dave: The principle, and you've helped me understand this many times, going all the way back to Hadoop, bring the application to the data, but when you're using conventional relational databases and you've had it all normalized, you've got to join stuff that might not be co-located. >> David: Yep. That's the whole point about the five microseconds. Now that the impact of non co-location if you have to join stuff or whatever it is, is much, much lower. It's so you can do the logical draw in, whatever it is, very quickly and very easily across that whole fabric. In terms of processing against that data, then you would choose to move the application to that node because it's much less data to move, that's an optimization of the architecture as opposed to a fundamental design point. You can then optimize about where you run the thing. This is ideal architecture for where I personally see things going which is traditional systems of record which need to be exactly as they've ever been and then alongside it, the artificial intelligence, the systems of understanding, data warehouses, etc. Having that data available in the same space so that you can combine those two elements in real time or in near real time. The advantage of that in terms of business value, digital enablement, and business value is the biggest thing of all. That's a 50% improvement in overall productivity of a company, that's the thing that will drive, in my view, 99% of the business value. >> Dave: Going back just to the joint thing, 100 gigs with five microseconds, that's really, really fast, but if you've got petabytes of data on these thousand nodes and you have to do a join, you still got to go through that 100 gig pipe of stuff that's not co-located. >> David: Absolutely. The way you would design that is as you would design any query. You've got a process you would need, a process in front of that which is query optimization to be able to farm all of the independent jobs needed to do in each of the nodes and take the output of that and bring that together. Both the concepts are already there. >> Dave: Like a map. >> David: Yes. That's right. All of the data science is there. You're starting from an architecture which is fundamentally different from the traditional let's get it out architectures that have existed, by removing that huge overhead of going from one to another. >> Dave: Oh, because this goes, it's like a mesh not a ring? >> David: Yes, yes. >> Dave: It's like the high performance compute of this MPI type architecture? >> David: Absolutely. NVMe, by definition is a point to point architecture. Rocky, underneath it is a point to point architecture. Everything is point to point. Yes. >> Dave: Oh, got it. That really does call for a redesign. >> David: Yes, you can take it in steps. It'll work as it is and then over time you'll optimize it to take advantage of it more. Does that definition of (mumbling) make sense to you guys? The one I quoted to you? Enabling any relevant data to be available for processing at the point of business engagement, in real time or near real time? That's where you're trying to get to and this is a very powerful enabler of that design. >> Nick: You're emphasizing the network topology, while I kind of thought the heart of the argument was performance. >> David: Could you repeat that? It's very - >> Dave: Let me repeat. Nick's a little light, but I could hear him fine. You're emphasizing the network topology, but Nick's saying his takeaway was the whole idea was the thrust was performance. >> Nick: Correct. >> David: Absolutely. Absolutely. The result of that network topology is a many times improvement in performance of the systems as a whole that you couldn't achieve in any previous architecture. I totally agree. That's what it's about is enabling low latency applications with much, much more data available by being able to break things up in parallel and delivering multiple streams to an end result. Yes. >> Participant: David, let me just ask, if I can play out how databases are designed now, how they can take advantage of it unmodified, but how things could be very, very different once they do take advantage of it which is that today, if you're doing transaction processing, you're pretty much bottle necked on a single node that sort of maintains the fresh cache of shared data and that cache, even if it's in memory, it's associated with shared storage. What you're talking about means because you've got memory speed access to that cache from anywhere, it no longer is tied to a node. That's what allows you to scale out to 1,000 nodes even for transaction processing. That's something we've never really been able to do. Then the fact that you have a large memory space means that you no longer optimize for mapping back and forth from disk and disk structures, but you have everything in a memory native structure and you don't go through this thing straw for IO to storage, you go through memory speed IO. That's a big, big - >> David: That's the end point. I agree. That's not here quite yet. It's still IO, so the IO has been improved dramatically, the protocol within the Me and the over fabric part of it. The elapsed time has been improved, but it's not yet the same as, for example, the HPV initiative. That's saying you change your architecture, you change your way of processing just in the memory. Everything is assumed to be memory. We're not there yet. 200 microseconds is still a lot, lot slower than the process that - one impact of this architecture is that the amount of data that you can pass through it is enormously higher and therefore, the memory sizes themselves within each node will need to be much, much bigger. There is a real opportunity for architectures which minimize the impact, which hold data coherently across multiple nodes and where there's minimal impact of, no tapping on the shoulder for every byte transferred so you can move large amounts of data into memory and then tell people that it's there and allow it to be shared, for example between the different calls and the GPUs and FPGAs that will be in these processes. There's more to come in terms of the architecture in the future. This is a step along the way, it's not the whole journey. >> Participant: Dave, another question. You just referenced 200 milliseconds or microseconds? >> David: Did I say milliseconds? I meant microseconds. >> Participant: You might have, I might have misheard. Relate that to the five microsecond thing again. >> David: If you have data directly attached to your processor, the access time is 195 microseconds. If you need to go to a remote, anywhere else in the thousand nodes, your access time is 200 microseconds. In other words, the additional overhead of that data is five microseconds. >> Participant: That's incredible. >> David: Yes, yes. That is absolutely incredible. That's something that data scientists have been working on for years and years. Okay. That's the reason why you can now do what I talked about which was you can have access from any node to any data within that large amount of nodes. You can have petabytes of data there and you can have access from any single node to any of that data. That, in terms of data enablement, digital enablement, is absolutely amazing. In other words, you don't have to pre put the data that's local in one application in one place. You're allowing an enormous flexibility in how you design systems. That coming back to artificial intelligence, etc. allows you a much, much larger amount of data that you can call on for improving applications. >> Participant: You can explore and train models, huge models, really quickly? >> David: Yes, yes. >> Participant: Apparently that process works better when you have an MPI like mesh than a ring. >> David: If you compare this architecture to the DSST architecture which was the first entrance into this that MP bought for a billion dollars, then that one stopped at 40 nodes. It's architecture was very, very proprietary all the way through. This one takes you to 1,000 nodes with much, much lower cost. They believe that the cost of the equivalent DSSD system will be between 10 and 20% of that cost. >> Dave: Can I ask a question about, you mentioned query optimizer. Who develops the query optimizer for the system? >> David: Nobody does yet. >> Jim: The DBMS vendor would have to re-write theirs with a whole different pensive cost. >> Dave: So we would have an optimizer database system? >> David: Who's asking a question, I'm sorry. I don't recognize the voice. >> Dave: That was Neil. Hold on one second, David. Hold on one second. Go ahead Nick. You talk about translation. >> Nick: ... On a network. It's SAN. It happens to be very low latency and very high throughput, but it's just a storage sub-system. >> David: Yep. Yep. It's a storage sub-system. It's called a server SAN. That's what we've been talking about for a long time is you need the same characteristics which is that you can get at all the data, but you need to be able to get at it in compute time as opposed to taking a stroll down the road time. >> Dave: Architecturally it's a SAN without an array controller? >> David: Exactly. Yeah, the array controller is software from a company called Xcellate, what was the name of it? I can't remember now. Say it again. >> Nick: Xcelero or Xceleron? >> David: Xcelero. That's the company that has produced the software for the data services, etc. >> Dave: Let's, as we sort of wind down this segment, let's talk about the business impact again. We're talking about different ways potentially to develop applications. There's an ecosystem requirement here it sounds like, from the ISDs to support this and other developers. It's the final, portends the elimination of the last electromechanical device in computing which has implications for a lot of things. Performance value, application development, application capability. Maybe you could talk about that a little bit again thinking in terms of how practitioners should look at this. What are the actions that they should be taking and what kinds of plans should they be making in their strategies? >> David: I thought Neil's comment last week was very perceptive which is, you wouldn't start with people like me who have been imbued with the 100 database call limits for umpteen years. You'd start with people, millennials, or sub-millenials or whatever you want to call them, who can take a completely fresh view of how you would exploit this type of architecture. Fundamentally you will be able to get through 10 or 100 times more data in real time than you can with today's systems. There's two parts of that data as I said before. The traditional systems of record that need to be updated, and then a whole host of applications that will allow you to do processes which are either not possible, or very slow today. To give one simple example, if you want to do real time changing of pricing based on availability of your supply chain, based on what you've got in stock, based on the delivery capabilities, that's a very, very complex problem. The optimization of all these different things and there are many others that you could include in that. This will give you the ability to automate that process and optimize that process in real time as part of the systems of record and update everything together. That, in terms of business value is extracting a huge number of people who previously would be involved in that chain, reducing their involvement significantly and making the company itself far more agile, far more responsive to change in the marketplace. That's just one example, you can think of hundreds for every marketplace where the application now becomes the systems of record, augmented by AI and huge amounts more data can improve the productivity of an organization and the agility of an organization in the marketplace. >> This is a godsend for AI. AI, the draw of AI is all this training data. If you could just move that in memory speed to the application in real time, it makes the applications much sharper and more (mumbling). >> David: Absolutely. >> Participant: How long David, would it take for the cloud vendors to not just offer some instances of this, but essentially to retool their infrastructure. (laughing) >> David: This is, to me a disruption and a half. The people who can be first to market in this are the SaaS vendors who can take their applications or new SaaS vendors. ISV. Sorry, say that again, sorry. >> Participant: The SaaS vendors who have their own infrastructure? >> David: Yes, but it's not going to be long before the AWS' and Microsofts put this in their tool bag. The SaaS vendors have the greatest capability of making this change in the shortest possible time. To me, that's one area where we're going to see results. Make no mistake about it, this is a big change and at the Micron conference, I can't remember what the guys name was, he said it takes two Olympics for people to start adopting things for real. I think that's going to be shorter than two Olympics, but it's going to be quite a slow process for pushing this out. It's radically different and a lot of the traditional ways of doing things are going to be affected. My view is that SaaS is going to be the first and then there are going to be individual companies that solve the problems themselves. Large companies, even small companies that put in systems of this sort and then use it to outperform the marketplace in a significant way. Particularly in the finance area and particularly in other data intent areas. That's my two pennies worth. Anybody want to add anything else? Any other thoughts? >> Dave: Let's wrap some final thoughts on this one. >> Participant: Big deal for big data. >> David: Like it, like it. >> Participant: It's actually more than that because there used to be a major trade off between big data and fast data. Latency and throughput and this starts to push some of those boundaries out so that you sort of can have both at once. >> Dave: Okay, good. Big deal for big data and fast data. >> David: Yeah, I like it. >> Dave: George, you want to talk about digital twins? I remember when you first sort of introduced this, I was like, "Huh? What's a digital twin? "That's an interesting name." I guess, I'm not sure you coined it, but why don't you tell us what digital twin is and why it's relevant. >> George: All right. GE coined it. I'm going to, at a high level talk about what it is, why it's important, and a little bit about as much as we can tell, how it's likely to start playing out and a little bit on the differences of the different vendors who are going after it. As far as sort of defining it, I'm cribbing a little bit from a report that's just in the edit process. It's data representation, this is important, or a model of a product, process, service, customer, supplier. It's not just an industrial device. It can be any entity involved in the business. This is a refinement sort of Peter helped with. The reason it's any entity is because there is, it can represent the structure and behavior, not just of a machine tool or a jet engine, but a business process like sales order process when you see it on a screen and its workflow. That's a digital twin of what used to be a physical process. It applied to both the devices and assets and processes because when you can model them, you can integrate them within a business process and improve that process. Going back to something that's more physical so I can do a more concrete definition, you might take a device like a robotic machine tool and the idea is that the twin captures the structure and the behavior across its lifecycle. As it's designed, as it's built, tested, deployed, operated, and serviced. I don't know if you all know the myth of, in the Greek Gods, one of the Goddesses sprang fully formed from the forehead of Zeus. I forgot who it was. The point of that is digital twin is not going to spring fully formed from any developers head. Getting to the level of fidelity I just described is a journey and a long one. Maybe a decade or more because it's difficult. You have to integrate a lot of data from different systems and you have to add structure and behavior for stuff that's not captured anywhere and may not be captured anywhere. Just for example, CAD data might have design information, manufacturing information might come from there or another system. CRM data might have support information. Maintenance repair and overhaul applications might have information on how it's serviced. Then you also connect the physical version with the digital version with essentially telemetry data that says how its been operating over time. That sort of helps define its behavior so you can manipulate that and predict things or simulate things that you couldn't do with just the physical version. >> You have to think about combined with say 3D printers, you could create a hot physical back up of some malfunctioning thing in the field because you have the entire design, you have the entire history of its behavior and its current state before it went kablooey. Conceivably, it can be fabricated on the fly and reconstituted as a physicologic from the digital twin that was maintained. >> George: Yes, you know what actually that raises a good point which is that the behavior that was represented in the telemetry helps the designer simulate a better version for the next version. Just what you're saying. Then with 3D printing, you can either make a prototype or another instance. Some of the printers are getting sophisticated enough to punch out better versions or parts for better versions. That's a really good point. There's one thing that has to hold all this stuff together which is really kind of difficult, which is challenging technology. IBM calls it a knowledge graph. It's pretty much in anyone's version. They might not call it a knowledge graph. It's a graph is, instead of a tree where you have a parent and then children and then the children have more children, a graph, many things can relate to many things. The reason I point that out is that puts a holistic structure over all these desperate sources of data behavior. You essentially talk to the graph, sort of like with Arnold, talk to the hand. That didn't, I got crickets. (laughing) Let me give you guys the, I put a definitions table in this dock. I had a couple things. Beta models. These are some important terms. Beta model represents the structure but not the behavior of the digital twin. The API represents the behavior of the digital twin and it should conform to the data model for maximum developer usability. Jim, jump in anywhere where you feel like you want to correct or refine. The object model is a combination of the data model and API. You were going to say something? >> Jim: No, I wasn't. >> George: Okay. The object model ultimately is the digital twin. Another way of looking at it, defining the structure and behavior. This sounds like one of these, say "T" words, the canonical model. It's a generic version of the digital twin or really the one where you're going to have a representation that doesn't have customer specific extensions. This is important because the way these things are getting built today is mostly custom spoke and so if you want to be able to reuse work. If someone's building this for you like a system integrator, you want to be able to, or they want to be able to reuse this on the next engagement and you want to be able to take the benefit of what they've learned on the next engagement back to you. There has to be this canonical model that doesn't break every time you essentially add new capabilities. It doesn't break your existing stuff. Knowledge graph again is this thing that holds together all the pieces and makes them look like one coherent hole. I'll get to, I talked briefly about network compatibility and I'll get to level of detail. Let me go back to, I'm sort of doing this from crib notes. We talked about telemetry which is sort of combining the physical and the twin. Again, telemetry's really important because this is like the time series database. It says, this is all the stuff that was going on over time. Then you can look at telemetry data that tells you, we got a dirty power spike and after three of those, this machine sort of started vibrating. That's part of how you're looking to learn about its behavior over time. In that process, models get better and better about predicting and enabling you to optimize their behavior and the business process with which it integrates. I'll give some examples of that. Twins, these digital twins can themselves be composed in levels of detail. I think I used the example of a robotic machine tool. Then you might have a bunch of machine tools on an assembly line and then you might have a bunch of assembly lines in a factory. As you start modeling, not just the single instance, but the collections that higher up and higher levels of extractions, or levels of detail, you get a richer and richer way to model the behavior of your business. More and more of your business. Again, it's not just the assets, but it's some of the processes. Let me now talk a little bit about how the continual improvement works. As Jim was talking about, we have data feedback loops in our machine learning models. Once you have a good quality digital twin in place, you get the benefit of increasing returns from the data feedback loops. In other words, if you can get to a better starting point than your competitor and then you get on the increasing returns of the data feedback loops, that is improving the fidelity of the digital twins now faster than your competitor. For one twin, I'll talk about how you want to make the whole ecosystem of twins sort of self-reinforcing. I'll get to that in a sec. There's another point to make about these data feedback loops which is traditional apps, and this came up with Jim and Neil, traditional apps are static. You want upgrades, you get stuff from the vendor. With digital twins, they're always learning from the customer's data and that has implications when the partner or vendor who helped build it for a customer takes learnings from the customer and goes to a similar customer for another engagement. I'll talk about the implications from that. This is important because it's half packaged application and half bespoke. The fact that you don't have to take the customer's data, but your model learns from the data. Think of it as, I'm not going to take your coffee beans, your data, but I'm going to run or make coffee from your beans and I'm going to take that to the next engagement with another customer who could be your competitor. In other words, you're extracting all the value from the data and that helps modify the behavior of the model and the next guy gets the benefit of it. Dave, this is the stuff where IBM keeps saying, we don't take your data. You're right, but you're taking the juice you squeezed out of it. That's one of my next reports. >> Dave: It's interesting, George. Their contention is, they uniquely, unlike Amazon and Google, don't swap spit, your spit with their competitors. >> George: That's misleading. To say Amazon and Google, those guys aren't building digital twins. Parametric technology is. I've got this definitely from a parametric technical fellow at an AWS event last week, which is they, not only don't use the data, they don't use the structure of the twin either from engagement to engagement. That's a big difference from IBM. I have a quote, Chris O'Connor from IBM Munich saying, "We'll take the data model, "but we won't take the data." I'm like, so you take the coffee from the beans even if you don't take the beans? I'm going to be very specific about saying that saying you don't do what Google and FaceBook do, what they do, it's misleading. >> Dave: My only caution there is do some more vetting and checking. A lot of times what some guy says on a Cube interview, he or she doesn't even know, in my experience. Make sure you validate that. >> George: I'll send it to them for feedback, but it wasn't just him. I got it from the CTO of the IOT division as well. >> Dave: When you were in Munich? >> George: This wasn't on the Cube either. This was by the side of, at the coffee table during our break. >> Dave: I understand and CTO's in theory should know. I can't tell you how many times I've gotten a definitive answer from a pretty senior level person and it turns out it was, either they weren't listening to me or they didn't know or they were just yessing me or whatever. Just be really careful and make sure you do your background checks. >> George: I will. I think the key is leave them room to provide a nuanced answer. It's more of a really, really, really concrete about really specific edge conditions and say do you or don't you. >> Dave: This is a pretty big one. If I'm a CIO, a chief digital officer, a chief data officer, COO, head of IT, head of data science, what should I be doing in this regard? What's the advice? >> George: Okay, can I go through a few more or are we out of time? >> Dave: No, we have time. >> George: Let me do a couple more points. I talked about training a single twin or an instance of a twin and I talked about the acceleration of the learning curve. There's edge analytics, David has educated us with the help of looking at GE Predicts. David, you have been talking about this fpr a long time. You want edge analytics to inform or automate a low latency decision and so this is where you're going to have to run some amount of analytics. Right near the device. Although I got to mention, hopefully this will elicit a chuckle. When you get some vendors telling you what their edge and cloud strategies are. Map R said, we'll have a hadoop cluster that only needs four or five nodes as our edge device. And we'll need five admins to care and feed it. He didn't say the last part, but that obviously isn't going to work. The edge analytics could be things like recalibrating the machine for different tolerance. If it's seeing that it's getting out of the tolerance window or something like that. The cloud, and this is old news for anyone who's been around David, but you're going to have a lot of data, not all of it, but going back to the cloud to train both the instances of each robotic machine tool and the master of that machine tool. The reason is, an instance would be oh I'm operating in a high humidity environment, something like that. Another one would be operating where there's a lot of sand or something that screws up the behavior. Then the master might be something that has behavior that's sort of common to all of them. It's when the training, the training will take place on the instances and the master and will in all likelihood push down versions of each. Next to the physical device process, whatever, you'll have the instance one and a class one and between the two of them, they should give you the optimal view of behavior and the ability to simulate to improve things. It's worth mentioning, again as David found out, not by talking to GE, but by accidentally looking at their documentation, their whole positioning of edge versus cloud is a little bit hand waving and in talking to the guys from ThingWorks which is a division of what used to be called Parametric Technology which is just PTC, it appears that they're negotiating with GE to give them the orchestration and distributed database technology that GE can't build itself. I've heard also from two ISV's, one a major one and one a minor one who are both in the IOT ecosystem one who's part of the GE ecosystem that predicts as a mess. It's analysis paralysis. It's not that they don't have talent, it's just that they're not getting shit done. Anyway, the key thing now is when you get all this - >> David: Just from what I learned when I went to the GE event recently, they're aware of their requirement. They've actually already got some sub parts of the predix which they can put in the cloud, but there needs to be more of it and they're aware of that. >> George: As usual, just another reason I need a red phone hotline to David for any and all questions I have. >> David: Flattery will get you everywhere. >> George: All right. One of the key takeaways, not the action item, but the takeaway for a customer is when you get these data feedback loops reinforcing each other, the instances of say the robotic machine tools to the master, then the instance to the assembly line to the factory, when all that is being orchestrated and all the data is continually enhancing the models as well as the manual process of adding contextual information or new levels of structure, this is when you're on increasing returns sort of curve that really contributes to sustaining competitive advantage. Remember, think of how when Google started off on search, it wasn't just their algorithm, but it was collecting data about which links you picked, in which order and how long you were there that helped them reinforce the search rankings. They got so far ahead of everyone else that even if others had those algorithms, they didn't have that data to help refine the rankings. You get this same process going when you essentially have your ecosystem of learning models across the enterprise sort of all orchestrating. This sounds like motherhood and apple pie and there's going to be a lot of challenges to getting there and I haven't gotten all the warts of having gone through, talked to a lot of customers who've gotten the arrows in the back, but that's the theoretical, really cool end point or position where the entire company becomes a learning organization from these feedback loops. I want to, now that we're in the edit process on the overall digital twin, I do want to do a follow up on IBM's approach. Hopefully we can do it both as a report and then as a version that's for Silicon Angle because that thing I wrote on Cloudera got the immediate attention of Cloudera and Amazon and hopefully we can both provide client proprietary value add, but also the public impact stuff. That's my high level. >> This is fascinating. If you're the Chief of Data Science for example, in a large industrial company, having the ability to compile digital twins of all your edge devices can be extraordinarily valuable because then you can use that data to do more fine-grained segmentation of the different types of edges based on their behavior and their state under various scenarios. Basically then your team of data scientists can then begin to identify the extent to which they need to write different machine learning models that are tuned to the specific requirements or status or behavior of different end points. What I'm getting at is ultimately, you're going to have 10 zillion different categories of edge devices performing in various scenarios. They're going to be driven by an equal variety of machine learning, deep learning AI and all that. All that has to be built up by your data science team in some coherent architecture where there might be a common canonical template that all devices will, all the algorithms and so forth on those devices are being built from. Each of those algorithms will then be tweaked to the specific digital twins profile of each device is what I'm getting at. >> George: That's a great point that I didn't bring up which is folks who remember object oriented programming, not that I ever was able to write a single line of code, but the idea, go into this robotic machine tool, you can inherit a couple of essentially component objects that can also be used in slightly different models, but let's say in this machine tool, there's a model for a spinning device, I forget what it's called. Like a drive shaft. That drive shaft can be in other things as well. Eventually you can compose these twins, even instances of a twin with essentially component models themselves. Thing Works does this. I don't know if GE does this. I don't think IBM does. The interesting thing about IBM is, their go to market really influences their approach to this which is they have this huge industry solutions group and then obviously the global business services group. These guys are all custom development and domain experts so they'll go into, they're literally working with Airbus and with the goal of building a model of a particular airliner. Right now I think they're doing the de-icing subsystem, I don't even remember on which model. In other words they're helping to create this bespoke thing and so that's what actually gets them into trouble with potentially channel conflict or maybe it's more competitor conflict because Airbus is not going to be happy if they take their learnings and go work with Boeing next. Whereas with PTC and Thing Works, at least their professional services arm, they treat this much more like the implementation of a packaged software product and all the learnings stay with the customer. >> Very good. >> Dave: I got a question, George. In terms of the industrial design and engineering aspect of building products, you mentioned PTC which has been in the CAD business and the engineering business for software for 50 years, and Ansis and folks like that who do the simulation of industrial products or any kind of a product that gets built. Is there a natural starting point for digital twin coming out of that area? That would be the vice president of engineering would be the guy that would be a key target for this kind of thinking. >> George: Great point. This is, I think PTC is closely aligned with Terradata and they're attitude is, hey if it's not captured in the CAD tool, then you're just hand waving because you won't have a high fidelity twin. >> Dave: Yeah, it's a logical starting point for any mechanical kind of device. What's a thing built to do and what's it built like? >> George: Yeah, but if it's something that was designed in a CAD tool, yes, but if it's something that was not, then you start having to build it up in a different way. I think, I'm trying to remember, but IBM did not look like they had something that was definitely oriented around CAD. Theirs looked like it was more where the knowledge graph was the core glue that pulled all the structure and behavior together. Again, that was a reflection of their product line which doesn't have a CAD tool and the fact that they're doing these really, really, really bespoke twins. >> Dave: I'm thinking that it strikes me that from the industrial design in engineering area, it's really the individual product is really the focus. That's one part of the map. The dynamic you're pointing at, there's lots of other elements of the map in terms of an operational, a business process. That might be the fleet of wind turbines or the fleet of trucks. How they behave collectively. There's lots of different entry points. I'm just trying to grapple with, isn't the CAD area, the engineering area at least for hard products, have an obvious starting point for users to begin to look at this. The BP of Engineering needs to be on top of this stuff. >> George: That's a great point that I didn't bring up which is, a guy at Microsoft who was their CTO in their IT organization gave me an example which was, you have a pipeline that's 1,000 miles long. It's got 10,000 valves in it, but you're not capturing the CAD design of the valve, you just put a really simple model that measures pressure, temperature, and leakage or something. You string 10,000 of those together into an overall model of the pipeline. That is a low fidelity thing, but that's all they need to start with. Then they can see when they're doing maintenance or when the flow through is higher or what the impact is on each of the different valves or flanges or whatever. It doesn't always have to start with super high fidelity. It depends on which optimizing for. >> Dave: It's funny. I had a conversation years ago with a guy, the engineering McNeil Schwendler if you remember those folks. He was telling us about 30 to 40 years ago when they were doing computational fluid dynamics, they were doing one dimensional computational fluid dynamics if you can imagine that. Then they were able, because of the compute power or whatever, to get the two dimensional computational fluid dynamics and finally they got to three dimensional and they're looking also at four and five dimensional as well. It's serviceable, I guess what I'm saying in that pipeline example, the way that they build that thing or the way that they manage that pipeline is that they did the one dimensional model of a valve is good enough, but over time, maybe a two or three dimensional is going to be better. >> George: That's why I say that this is a journey that's got to take a decade or more. >> Dave: Yeah, definitely. >> Take the example of airplane. The old joke is it's six million parts flying in close formation. It's going to be a while before you fit that in one model. >> Dave: Got it. Yes. Right on. When you have that model, that's pretty cool. All right guys, we're about out of time. I need a little time to prep for my next meeting which is in 15 minutes, but final thoughts. Do you guys feel like this was useful in terms of guiding things that you might be able to write about? >> George: Hugely. This is hugely more valuable than anything we've done as a team. >> Jim: This is great, I learned a lot. >> Dave: Good. Thanks you guys. This has been recorded. It's up on the cloud and I'll figure out how to get it to Peter and we'll go from there. Thanks everybody. (closing thank you's)
SUMMARY :
There you go. and maybe the key issues that you see and is coming even more deeply into the core practice You had mentioned, you rattled off a bunch of parameters. It's all about the core team needs to be, I got a minimal modular, incremental, iterative, iterative, adaptive, and co-locational. in the context of data science, and get automation of many of the aspects everything that these people do needs to be documented that the whole rapid idea development flies in the face of that create the final product that has to go into production and the algorithms and so forth that were used and the working model is obviously a subset that handle the continuous training and retraining David: Is that the right way of doing it, Jim? and come back to sort of what I was trying to get to before Dave: Please, that would be great. so how in the world are you going to agilize that? I think if you try to represent data science the algorithm to be fit for purpose and he said something to me the other day. If you look at - Just to clarify, he said agile's dead? Dave: Go ahead, Jim. and the functional specifications and all that. and all that is increasingly the core that the key aspect of all the data scientists that incorporates the crux of data science Nick, you there? Tough to hear you. pivoting off the Micron news. the ability to create a whole number of nodes. Participant: This latency and the node count At the moment, 3D Crosspoint is a nice to have That is the secret sauce which allows you The latency is incredibly short. Move the processing to that particular node Is Micron the first market with this capability, David? David: Over fabric? and who are coming to market with their own versions. Dave: David? bring the application to the data, Now that the impact of non co-location and you have to do a join, and take the output of that and bring that together. All of the data science is there. NVMe, by definition is a point to point architecture. Dave: Oh, got it. Does that definition of (mumbling) make sense to you guys? Nick: You're emphasizing the network topology, the whole idea was the thrust was performance. of the systems as a whole Then the fact that you have a large memory space is that the amount of data that you can pass through it You just referenced 200 milliseconds or microseconds? David: Did I say milliseconds? Relate that to the five microsecond thing again. anywhere else in the thousand nodes, That's the reason why you can now do what I talked about when you have an MPI like mesh than a ring. They believe that the cost of the equivalent DSSD system Who develops the query optimizer for the system? Jim: The DBMS vendor would have to re-write theirs I don't recognize the voice. Dave: That was Neil. It happens to be very low latency which is that you can get at all the data, Yeah, the array controller is software from a company called That's the company that has produced the software from the ISDs to support this and other developers. and the agility of an organization in the marketplace. AI, the draw of AI is all this training data. for the cloud vendors to not just offer are the SaaS vendors who can take their applications and then there are going to be individual companies Latency and throughput and this starts to push Dave: Okay, good. I guess, I'm not sure you coined it, and the idea is that the twin captures the structure Conceivably, it can be fabricated on the fly and it should conform to the data model and that helps modify the behavior Dave: It's interesting, George. saying, "We'll take the data model, Make sure you validate that. I got it from the CTO of the IOT division as well. This was by the side of, at the coffee table I can't tell you how many times and say do you or don't you. What's the advice? of behavior and the ability to simulate to improve things. of the predix which they can put in the cloud, I need a red phone hotline to David and all the data is continually enhancing the models having the ability to compile digital twins and all the learnings stay with the customer. and the engineering business for software hey if it's not captured in the CAD tool, What's a thing built to do and what's it built like? and the fact that they're doing these that from the industrial design in engineering area, but that's all they need to start with. and finally they got to three dimensional that this is a journey that's got to take It's going to be a while before you fit that I need a little time to prep for my next meeting This is hugely more valuable than anything we've done how to get it to Peter and we'll go from there.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
David | PERSON | 0.99+ |
Jim | PERSON | 0.99+ |
Chris O'Connor | PERSON | 0.99+ |
George | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
Airbus | ORGANIZATION | 0.99+ |
Boeing | ORGANIZATION | 0.99+ |
Jim Kobeielus | PERSON | 0.99+ |
James | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
Neil | PERSON | 0.99+ |
Joe | PERSON | 0.99+ |
Nick | PERSON | 0.99+ |
David Floyer | PERSON | 0.99+ |
George Gilbert | PERSON | 0.99+ |
1,000 miles | QUANTITY | 0.99+ |
10 | QUANTITY | 0.99+ |
Peter | PERSON | 0.99+ |
195 microseconds | QUANTITY | 0.99+ |
Tendu Yogurtcu, Syncsort - #BigDataSV 2016 - #theCUBE
from San Jose in the heart of Silicon Valley it's the kue covering big data sv 2016 now your host John furrier and George Gilbert okay welcome back on we are here live in Silicon Valley for the cubes looking angles flagship program we go out to the events and extract the signal from the noise i'm john furrier mykos george gilbert big data analyst at Wikibon calm our next guest is 10 do yoga coo to yogurt coo coo I you see your last name yo Joe okay I gots clothes GM with big David sinks or welcome back to the cube sink starts been a long time guess one of those companies we love to cover because your value publishes is right in the center of all the action around mainframes and you know Dave and I always love to talk about mainframe not mean frame guys we know that we remember those days and still powering a lot of the big enterprises so I got to ask you you know what's your take on the show here one of the themes that came up last night on crowd chatters why is enterprise data warehousing failing so you know got some conversation but you're seeing a transformation what do you guys see thank you for having me it's great to be here yes we are seeing the transformation of the next generation data warehouse and evolution of the data warehouse architecture and as part of that mainframes are a big part of this data warehouse architecture because still seventy percent of data is on the mainframes world's data seventy percent of world's data this is a large amount of data so when we talk about big data architecture and making big data and enterprise data useful for the business and having advanced analytics not just gaining operational efficiencies with the new architecture and also having new products new services available to the customers of those organizations this data is intact and making that part of this next-generation data warehouse architecture is a big part of the initiatives and we play a very strong core role in this bridging the gap between mainframes and the big data platforms because we have product offerings spanning across platforms and we are very focused on accessing and integrating data accessing and integrating in a secure way from mainframes to the big data plan one is one of the things that's the mainframe highlights kind of a dynamic in the marketplace and wrong hall customers whether they have many firms are not your customers who have mainframes they already have a ton of data their data full as we say in the cube they have a ton of data do it but they spend a lot of times you mentioned cleaning the data how do you guys specifically solve that because that's a big hurdle that they want to just put behind they want to clean fast and get on to other things yes we see a few different trends and challenges first of all from the Big Data initiatives everybody is really trying to either gain operational efficiency business agility and make use of some of the data they weren't able to make use of before and enrich this data with some of the new data sources they might be actually adding to the data pipeline or they are trying to provide new products and services to their customers so when we talk about the mainframe data it's a it's really a how you access this mainframe data in a secure way and how you make that data preparation very easy for the data scientists the data scientists are still spending close to eighty percent of their time in data preparation and if you can't think of it when we talk about the compute frameworks like spark MapReduce flink versus the technology stack technologies these should not be relevant to the data scientist they should be just worried about how do i create my data pipeline what are the new insights that I'm trying to get from this data the simplification we bring in that data cleansing and data preparation is one well we are bringing simple way to access and integrate all of the enterprise data not just the legacy mainframe and the relational data sources and also the emerging data sources with streaming data sources the messaging frameworks new data sources we also make this in a cross-platform secure way and some of the new features for example we announced where we were simply the best in terms of accessing all of the mainframe data and having this available on Hadoop and spark we now also makes park and Hadoop understand this data in its original format you do not have to change the original record format which is very important for highly regulated industries like financial services banking and insurance and health care because you want to be able to do the data sanitization and data cleansing and yet bring that mainframe data in its original format for audit and compliance reasons okay so this is this is the product i think where you were telling us earlier that you can move the processing you can move the data from the mainframe do processing at scale and at cost that's not possible or even ii is is easy on the mainframe do it on a distributed platform like a dupe it preserves its original sort of way of being encoded send it back but then there's also this new way of creating a data fabric that we were talking about earlier where it used to be sort of point-to-point from the transactional systems to the data warehouse and now we've basically got this richer fabric and your tools sitting on some technologies perhaps like spark and Kafka tell us what that world looks like and how it was different from we see a greater interest in terms of the concept of a database because some organizations call it data as a service some organizations call it a Hadoop is a service but ultimately an easy way of publishing data and making data available for both the internal clients of the organization's and external clients of the organization's so Kafka is in the center of this and we see a lot of other partners of us including hot dog vendors like Cloudera map r & Horton works as well as data bricks and confluent are really focused on creating that data bus and servicing so we play a very strong there because phase one project for these organizations how do I create this enterprise data lake or enterprise data hub that is usually the phase one project because for advanced analytics or predictive analytics or when you make an engine your mortgage application you want to be able to see that change on your mobile phone under five minutes likewise when you make a change in your healthcare coverage or telecom services you want to be able to see that under five minutes on your phone these things really require easy access to that enterprise data hub what we have we have a tool called data funnel this basically simplifies in a one click and reduces the time for creating the enterprise data hub significantly and our customers are using this to migrate and make I would not say my great access data from the database tables like db2 for example thousands of tables populating an automatically mapping metadata whether that metadata is higher tables or parquet files or whatever the format is going to be in the distributed platform so this really simplifies the time to create the enterprise data hub it sounds actually really interesting when I'm hearing what you're saying the first sort of step was create this this data lake lets you know put data in there and start getting our feet wet and learning new analysis patterns but what if I'm hearing you correctly you're saying now radiating out of that is a new sort of data backbone that's much lower latency that gets data out of the analytic systems perhaps back into the operational systems or into new systems at a speed that we didn't do before so that we can now make decisions or or do an analysis and make decisions very quickly yes that's true basically operational intelligence and mathematics are converging okay and in that convergence what we are basically seeing is that I'm analyzing security data I'm analyzing telemetry data that's a streamed and I want to be able to react as fast as possible and some of the interest in the emerging computer platforms is really driven by this they eat the use case right many of our customers are basic saying that today operating under five minutes is enough for me however I want to be prepared I want to future-proof my applications because in a year it might be that I have to respond under a minute even in sub seconds when they talk about being future proofed and you mentioned to time you know time sort of brackets on either end our customers saying they're looking at a speed that current technologies don't support in other words are they evaluating some things that are you know essentially research projects right now you know very experimental or do they see a set of technologies that they can pick and choose from to serve those different latency needs we published a Hadoop survey earlier this year in january according to the results from that Hadoop survey seventy percent of the respondents were actually evaluating spark and this is very confused consistent with our customer base as well and the promise of spark is driven by multiple use cases and multiple workload including predictive analytics and streaming analytics and bat analytics all of these use cases being able to run on the same platform and all of the Hadoop vendors are also supporting this so we see as our customer base are heavy enterprise customers they are in production already in Hadoop so running spark on top of their Hadoop cluster is one way they are looking for future proofing their applications and this is where we also bring value because we really abstract that insulate the user while we are liberating all of the data from the enterprise whether it's on the relational legis data warehouse or it's on the mainframe side or it's coming from new web clients we are also helping them insulate their applications because they don't really need to worry about what's the next compute framework that's going to be the fastest most reliable and low latency they need to focus on the application layer they need to focus on creating that data pipeline today I want to ask you about the state of syncsort you guys have been great success with the mainframe this concept of data funneling or you can bring stuff in very fast new management new ownership what's the update on the market dynamics because now ingestion zev rethink data sources how do you guys view what's the plan for syncsort going forward share with the folks out there sure our new investors clearlake capital is very supportive of both organic and inorganic growth so acquisitions are one of the areas for us we plan to actually make one or two acquisitions this year and companies with the products in the near adjacent markets are real value add for us so that's one area in addition to organic growth in terms of the organic growth our investments are really we have been very successful with a lot of organizations insurance financial services banking and healthcare many many of the verticals very successful with helping our customers create the enterprise data hub integrate access all of the data integrated and now carrying them to the next generating generation frameworks those are the areas that we have been partnering with them the next is for us is really having streaming data sources as well as batch data sources through the single data pipeline and this includes bringing telemetry data and security data to the advanced analytics as well okay so it sounds like you're providing a platform that can handle the today's needs which were mostly batch but the emerging ones which are streaming and so you've got that sort of future proofing that customers are looking for once they've got that those types of data coming together including stuff from the mainframe that they want might want to enrich from public sources what new things do you see them doing predictive analytics and machine learning is a big part of this because ultimately once there are different phases right operational efficiency phase was the low-hanging fruit for many organizations I want to understand what I can do faster and serve my clients faster and create that operational efficiency in a cost-effective scalable way second was what our new for go to market opportunities with transformative applications what can I do by recognizing how my telco customers are interacting with the SAS services to help and how like under a couple of minutes I react to their responses or cell service is the second one and then the next phase is that how do I use this historical data in addition to the streaming of data rapidly I'm collecting to actually predict and prevent some of the things and this is already happening with a guy with banking for example it's really with the fraud detection a lot of predictive analysis happens so advanced analytics using AI advanced analytics using machine learning will be a very critical component of this moving forward this is really interesting because now you're honing in on a specific industry use case and something that you know every vendor is trying to sort of solve the fraud detection fraud prevention how repeatable is it across your customers is this something they have to build from scratch because there's no templates that get them fifty percent of the way there seventy percent of the way there actually there's an opportunity here because if you look at the health care or telco or financial services or insurance verticals there are repeating patterns and that one is fraud for fraud or some of the new use cases in terms of customer churn analytics or cosmetics estate so these patterns and the compliance requirements in these verticals creates an opportunity actually to come up with application applications for new companies start for new startups okay then do final question share with the folks out there to view the show right now this is ten years of Hadoop seven years of this event Big Data NYC we had a great event there New York City Silicon Valley what's the vibe here in Silicon Valley here this is one of the best events I really enjoy strata San Jose and I'm looking forward two days of keynotes and hearing from colleagues and networking with colleagues this is really the heartbeat happens because with the hadoop world and strata combined actually we started seeing more business use cases and more discussions around how to enable the business users which means the technology stack is maturing and the focus is really on the business and creating more insights and value for the businesses ten do you go to welcome to the cube thanks for coming by really appreciate it go check out our Dublin event on fourteenth of April hadoop summit will be in europe for that event of course go to SiliconANGLE TV check out our women in check every week we feature women in tech on wednesday thanks for joining us thanks for sharing the inside would sink so i really appreciate it thanks for coming by this turkey will be right back with more coverage live and Silicon Valley into the short break you
**Summary and Sentiment Analysis are not been shown because of improper transcript**
ENTITIES
Entity | Category | Confidence |
---|---|---|
George Gilbert | PERSON | 0.99+ |
fifty percent | QUANTITY | 0.99+ |
john furrier | PERSON | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
seventy percent | QUANTITY | 0.99+ |
two days | QUANTITY | 0.99+ |
San Jose | LOCATION | 0.99+ |
one | QUANTITY | 0.99+ |
Dave | PERSON | 0.99+ |
John furrier | PERSON | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
ten years | QUANTITY | 0.99+ |
telco | ORGANIZATION | 0.99+ |
george gilbert | PERSON | 0.99+ |
seven years | QUANTITY | 0.99+ |
Wikibon | ORGANIZATION | 0.99+ |
NYC | LOCATION | 0.98+ |
Hadoop | TITLE | 0.98+ |
thousands of tables | QUANTITY | 0.98+ |
today | DATE | 0.98+ |
under five minutes | QUANTITY | 0.98+ |
europe | LOCATION | 0.98+ |
Joe | PERSON | 0.98+ |
january | DATE | 0.97+ |
wednesday | DATE | 0.97+ |
one click | QUANTITY | 0.97+ |
under five minutes | QUANTITY | 0.97+ |
under a minute | QUANTITY | 0.97+ |
one area | QUANTITY | 0.97+ |
phase one | QUANTITY | 0.96+ |
earlier this year | DATE | 0.96+ |
both | QUANTITY | 0.96+ |
a year | QUANTITY | 0.96+ |
second one | QUANTITY | 0.96+ |
Dublin | LOCATION | 0.95+ |
this year | DATE | 0.95+ |
New York City Silicon Valley | LOCATION | 0.92+ |
last night | DATE | 0.91+ |
one way | QUANTITY | 0.9+ |
a ton of data | QUANTITY | 0.9+ |
Cloudera map r & Horton | ORGANIZATION | 0.9+ |
a ton of data | QUANTITY | 0.89+ |
two acquisitions | QUANTITY | 0.89+ |
turkey | LOCATION | 0.88+ |
one of the best events | QUANTITY | 0.87+ |
first sort | QUANTITY | 0.86+ |
eighty percent | QUANTITY | 0.85+ |
mykos | PERSON | 0.84+ |
SiliconANGLE TV | ORGANIZATION | 0.83+ |
2016 | DATE | 0.82+ |
second | QUANTITY | 0.81+ |
Tendu Yogurtcu | PERSON | 0.81+ |
single data | QUANTITY | 0.8+ |
David | PERSON | 0.8+ |
park | TITLE | 0.8+ |
a lot of times | QUANTITY | 0.78+ |
10 | QUANTITY | 0.77+ |
db2 | TITLE | 0.76+ |
under a couple of minutes | QUANTITY | 0.75+ |
areas | QUANTITY | 0.71+ |
things | QUANTITY | 0.71+ |
Syncsort | ORGANIZATION | 0.71+ |
every week | QUANTITY | 0.7+ |
one of | QUANTITY | 0.7+ |
sv | EVENT | 0.69+ |
of April | DATE | 0.69+ |
first | QUANTITY | 0.68+ |
#BigDataSV | EVENT | 0.66+ |
themes | QUANTITY | 0.66+ |
spark | ORGANIZATION | 0.65+ |
summit | EVENT | 0.62+ |
Kafka | TITLE | 0.62+ |
every vendor | QUANTITY | 0.61+ |
use | QUANTITY | 0.6+ |
Big Data | EVENT | 0.6+ |
MapReduce | TITLE | 0.55+ |
Jack Norris - Hadoop Summit 2014 - theCUBE - #HadoopSummit
>>The queue at Hadoop summit, 2014 is brought to you by anchor sponsor Hortonworks. We do, I do. And headline sponsor when disco we make Hadoop invincible >>Okay. Welcome back. Everyone live here in Silicon valley in San Jose. This is a dupe summit. This is Silicon angle and Wiki bonds. The cube is our flagship program. We go out to the events and extract the signal to noise. I'm John barrier, the founder SiliconANGLE joins my cohost, Jeff Kelly, top big data analyst in the, in the community. Our next guest, Jack Norris, COO of map R security enterprise. That's the buzz of the show and it was the buzz of OpenStack summit. Another open source show. And here this year, you're just seeing move after, move at the moon, talking about a couple of critical issues. Enterprise grade Hadoop, Hortonworks announced a big acquisition when all in, as they said, and now cloud era follows suit with their news. Today, I, you sitting back saying, they're catching up to you guys. I mean, how do you look at that? I mean, cause you guys have that's the security stuff nailed down. So what Dan, >>You feel about that now? I think I'm, if you look at the kind of Hadoop market, it's definitely moving from a test experimental phase into a production phase. We've got tremendous customers across verticals that are doing some really interesting production use cases. And we recognized very early on that to really meet the needs of customers required some architectural innovation. So combining the open source ecosystem packages with some innovations underneath to really deliver high availability, data protection, disaster recovery features, security is part of that. But if you can't predict the PR protect the data, if you can't have multitenancy and separate workflows across the cluster, then it doesn't matter how secure it is. You know, you need those. >>I got to ask you a direct question since we're here at Hadoop summit, because we get this question all the time. Silicon lucky bond is so successful, but I just don't understand your business model without plates were free content and they have some underwriters. So you guys have been very successful yet. People aren't looking at map are as good at the quiet leader, like you doing your business, you're making money. Jeff. He had some numbers with us that in the Hindu community, about 20% are paying subscriptions. That's unlike your business model. So explain to the folks out there, the business model and specifically the traction because you have >>Customers. Yeah. Oh no, we've got, we've got over 500 paying customers. We've got at least $1 million customer in seven different verticals. So we've got breadth and depth and our business model is simple. We're an enterprise software company. That's looking at how to provide the best of open source as well as innovations underneath >>The most open distribution of Hadoop. But you add that value separately to that, right? So you're, it's not so much that you're proprietary at all. Right. Okay. >>You clarify that. Right. So if you look at, at this exciting ecosystem, Hadoop is fairly early in its life cycle. If it's a commoditization phase like Linux or, or relational database with my SQL open source, kind of equates the whole technology here at the beginning of this life cycle, early stages of the life cycle. There's some architectural innovations that are really required. If you look at Hadoop, it's an append only file system relying on Linux. And that really limits the types of operations. That types of use cases that you can do. What map ours done is provide some deep architectural innovations, provide complete read-write file systems to integrate data protection with snapshots and mirroring, et cetera. So there's a whole host of capabilities that make it easy to integrate enterprise secure and, and scale much better. Do you think, >>I feel like you were maybe a little early to the market in the sense that we heard Merv Adrian and his keynote this morning. Talk about, you know, it's about 10 years when you start to get these questions about security and governance and we're about nine years into Hadoop. Do you feel like maybe you guys were a little early and now you're at a tipping point, whereas these more, as more and more deployments get ready to go to production, this is going to be an area that's going to become increasingly important. >>I think, I think our timing has been spectacular because we, we kind of came out at a time when there was some customers that were really serious about Hadoop. We were able to work closely with them and prove our technology. And now as the market is just ramping, we're here with all of those features that they need. And what's a, what's an issue. Is that an incremental improvement to provide those kind of key features is not really possible if the underlying architecture isn't there and it's hard to provide, you know, online real-time capabilities in a underlying platform that's append only. So the, the HDFS layer written in Java, relying on the Linux file system is kind of the, the weak underbelly, if you will, of, of the ecosystem. There's a lot of, a lot of important developments happening yarn on top of it, a lot of really kind of exciting things. So we're actively participating in including Apache drill and on top of a complete read-write file system and integrated Hindu database. It just makes it all come to life. >>Yeah. I mean, those things on top are critical, but you know, it's, it's the underlying infrastructure that, you know, we asked, we keep on community about that. And what's the, what are the things that are really holding you back from Paducah and production and the, and the biggest challenge is they cited worth high availability, backup, and recovery and maintaining performance at scale. Those are the top three and that's kind of where Matt BARR has been focused, you know, since day one. >>So if you look at a major retailer, 2000 nodes and map bar 50 unique applications running on a single cluster on 10,000 jobs a day running on top of that, if you look at the Rubicon project, they recently went public a hundred million add actions, a hundred billion ad auctions a day. And on top of that platform, beats music that just got acquired for $3 billion. Basically it's the underlying map, our engine that allowed them to scale and personalize that music service. So there's a, there's a lot of proof points in terms of how quickly we scale the enterprise grade features that we provide and kind of the blending of deep predictive analytics in a batch environment with online capabilities. >>So I got to ask you about your go to market. I'll see Cloudera and Hortonworks have different business models. Just talk about that, but Cloudera got the massive funding. So you get this question all the time. What do you, how do you counter that army and the arms race? I think >>I just wrote an article in Forbes and he says cash is not a strategy. And I think that was, that was an excellent, excellent article. And he goes in and, you know, in this fast growing market, you know, an amount of money isn't necessarily translate to architectural innovations or speeding the development of that. This is a fairly fragmented ecosystem in terms of the stack that runs on top of it. There's no single application or single vendor that kind of drives value. So an acquisition strategy is >>So your field Salesforce has direct or indirect, both mixable. How do you handle the, because Cloudera has got feet on the street and every squirrel will find it, not if they're parked there, parking sales reps and SCS and all the enterprise accounts, you know, they're going to get the, squirrel's going to find a nut once in awhile. Yeah. And they're going to actually try to engage the clients. So, you know, I guess it is a strategy if they're deploying sales and marketing, right? So >>The beauty about that, and in fact, we're all in this together in terms of sharing an API and driving an ecosystem, it's not a fragmented market. You can start with one distribution and move to another, without recompiling or without doing any sort of changes. So it's a fairly open community. If this were a vendor lock-in or, you know, then spending money on brand, et cetera, would, would be important. Our focus is on the, so the sales execution of direct sales, yes, we have direct sales. We also have partners and it depends on the geographies as to what that percentage is. >>And John Schroeder on with the HP at fifth big data NYC has updated the HP relationship. >>Oh, excellent. In fact, we just launched our application gallery app gallery, make it very easy for administrators and developers and analysts to get access and understand what's available in the ecosystem. That's available directly on our website. And one of the featured applications there today is an integration with the map, our sandbox and HP Vertica. So you can get early access, try it and get the best of kind of enterprise grade SQL first, >>First Hadoop app store, basically. Yeah. If you want to call it that way. Right. So like >>Sure. Available, we launched with close to 30, 30 with, you know, a whole wave kind of following that. >>So talk a little bit about, you know, speaking of verdict and kind of the sequel on Hadoop. So, you know, there's a lot of talk about that. Some confusion about the different methods for applying SQL on predicts or map art takes an open approach. I know you'll support things like Impala from, from a competitor Cloudera, talk about that approach from a map arts perspective. >>So I guess our, our, our perspective is kind of unbiased open source. We don't try to pick and choose and dictate what's the right open source based on either our participation or some community involvement. And the reality is with multiple applications being run on the platform, there are different use cases that make difference, you know, make different sense. So whether it's a hive solution or, you know, drill drills available, or HP Vertica people have the choice. And it's part of, of a broad range of capabilities that you want to be able to run on the platform for your workflows, whether it's SQL access or a MapReduce or a spark framework shark, et cetera. >>So, yeah, I mean there is because there's so many different there's spark there's, you know, you can run HP Vertica, you've got Impala, you've got hive. And the stinger initiative is, is that whole kind of SQL on Hadoop ecosystem, still working itself out. Are we going to have this many options in a year or two years from now? Or are they complimentary and potentially, you know, each has its has its role. >>I think the major differences is kind of how it deals with the new data formats. Can it deal with self-describing data? Sources can leverage, Jason file does require a centralized metadata, and those are some of the perspectives and advantages say the Apache drill has to expand the data sets that are possible enabled data exploration without dependency on a, on an it administrator to define that, that metadata. >>So another, maybe not always as exciting, but taking workloads from existing systems, moving them to Hadoop is one of the ways that a lot of people get started with, to do whether associated transformation workloads or there's something in that vein. So I know you've announced a partnership with Syncsort and that's one of the things that they focus on is really making it as easy as possible to meet those. We'll talk a little bit about that partnership, why that makes sense for you and, and >>When your customer, I think it's a great proof point because we announced that partnership around mainframe offload, we have flipped comScore and experience in that, in that press release. And if you look at a workload on a mainframe going to duke, that that seems like that's a, that's really an oxymoron, but by having the capabilities that map R has and making that a system of record with that full high availability and that data protection, we're actually an option to offload from mainframe offload, from sand processing and provide a really cost effective, scalable alternative. And we've got customers that had, had tried to offload from the mainframe multiple times in the past, on successfully and have done it successfully with Mapbox. >>So talk a little bit more about kind of the broader partnership strategy. I mean, we're, we're here at Hadoop summit. Of course, Hortonworks talks a lot about their partnerships and kind of their reseller arrangements. Fedor. I seem to take a little bit more of a direct approach what's map R's approach to kind of partnering and, and as that relates to kind of resell arrangements and things like, >>I think the app gallery is probably a great proof point there. The strategy is, is an ecosystem approach. It's having a collection of tools and applications and management facilities as well as applications on top. So it's a very open strategy. We focus on making sure that we have open API APIs at that application layer, that it's very easy to get data in and out. And part of that architecture by presenting standard file system format, by allowing non Java applications to run directly on our platform to support standard database connections, ODBC, and JDBC, to provide database functionality. In addition to kind of this deep predictive analytics really it's about supporting the broadest set of applications on top of a single platform. What we're seeing in this kind of this, this modern architecture is data gravity matters. And the more processing you can do on a single platform, the better off you are, the more agile, the more competitive, right? >>So in terms of, so you're partnering with people like SAS, for example, to kind of bring some of the, some of the analytic capabilities into the platform. Can you kind of tell us a little bit about any >>Companies like SAS and revolution analytics and Skytree, and I mean, just a whole host of, of companies on the analytics side, as well as on the tools and visualization, et cetera. Yeah. >>Well, I mean, I, I bring up SAS because I think they, they get the fact that the, the whole data gravity situation is they've got it. They've got to go to where the data is and not have the data come to them. So, you know, I give them credit for kind of acknowledging that, that kind of big data truth ism, that it's >>All going to the data, not bringing the data >>To the computer. Jack talk about the success you had with the customers had some pretty impressive numbers talking about 500 customers, Merv agent. The garden was on with us earlier, essentially reiterating not mentioning that bar. He was just saying what you guys are doing is right where the puck is going. And some think the puck is not even there at the same rink, some other vendors. So I gotta give you props on that. So what I want you to talk about the success you have in specifically around where you're winning and where you're successful, you guys have struggled with, >>I need to improve on, yeah, there's a, there's a whole class of applications that I think Hadoop is enabling, which is about operations in analytics. It's taking this, this higher arrival rate machine generated data and doing analytics as it happens and then impacting the business. So whether it's fraud detection or recommendation engines, or, you know, supply chain applications using sensor data, it's happening very, very quickly. So a system that can tolerate and accept streaming data sources, it has real-time operations. That is 24 by seven and highly available is, is what really moves the needle. And that's the examples I used with, you know, add a Rubicon project and, you know, cable TV, >>The very outcome. What's the primary outcomes your clients want with your product? Is it stability? And the platform has enabled development. Is there a specific, is there an outcome that's consistent across all your wins? >>Well, the big picture, some of them are focused on revenues. Like how do we optimize revenue either? It's a new data source or it's a new application or it's existing application. We're exploding the dataset. Some of it's reducing costs. So they want to do things like a mainframe offload or data warehouse offload. And then there's some that are focused on risk mitigation. And if there's anything that they have in common it's, as they moved from kind of test and looked at production, it's the key capabilities that they have in enterprise systems today that they want to make sure they're in Hindu. So it's not, it's not anything new. It's just like, Hey, we've got SLS and I've got data protection policies, and I've got a disaster recovery procedure. And why can't I expect the same level of capabilities in Hindu that I have today in those other systems. >>It's a final question. Where are you guys heading this year? What's your key objectives. Obviously, you're getting these announcements as flurry of announcements, good success state of the company. How many employees were you guys at? Give us a quick update on the numbers. >>So, you know, we just reported this incredible momentum where we've tripled core growth year over year, we've added a tremendous amount of customers. We're over 500 now. So we're basically sticking to our knitting, focusing on the customers, elevating the proof points here. Some of the most significant customers we have in the telco and financial services and healthcare and, and retail area are, you know, view this as a strategic weapon view, this is a huge competitive advantage, and it's helping them impact their business. That's really spring our success. We've, you know, we're, we're growing at an incredible clip here and it's just, it's a great time to have made those calls and those investments early on and kind of reaping the benefits. >>It's. Now I've always said, when we, since the first Hadoop summit, when Hortonworks came out of Yahoo and this whole community kind of burst open, you had to duke world. Now Riley runs at it's a whole different vibe of itself. This was look at the developer vibe. So I got to ask you, and we would have been a big fan. I mean, everyone has enough beachhead to be successful, not about map arbors Hortonworks or cloud air. And this is why I always kind of smile when everyone goes, oh, Cloudera or Hortonworks. I mean, they're two different animals at this point. It would do different things. If you guys were over here, everyone has their quote, swim lanes or beachhead is not a lot of super competition. Do you think, or is it going to be this way for awhile? What's your fork at some? At what point do you see more competition? 10 years out? I mean, Merv was talking a 10 year horizon for innovation. >>I think that the more people learn and understand about Hadoop, the more they'll appreciate these kind of set of capabilities that matter in production and post-production, and it'll migrate earlier. And as we, you know, focus on more developer tools like our sandbox, so people can easily get experienced and understand kind of what map are, is. I think we'll start to see a lot more understanding and momentum. >>Awesome. Jack Norris here, inside the cube CMO, Matt BARR, a very successful enterprise grade, a duke player, a leader in the space. Thanks for coming on. We really appreciate it. Right back after the short break you're live in Silicon valley, I had dupe December, 2014, the right back.
SUMMARY :
The queue at Hadoop summit, 2014 is brought to you by anchor sponsor I mean, cause you guys have that's the security stuff nailed down. I think I'm, if you look at the kind of Hadoop market, I got to ask you a direct question since we're here at Hadoop summit, because we get this question all the time. That's looking at how to provide the best of open source But you add that value separately to So if you look at, at this exciting ecosystem, Talk about, you know, it's about 10 years when you start to get these questions about security and governance and we're about isn't there and it's hard to provide, you know, online real-time And what's the, what are the things that are really holding you back from Paducah So if you look at a major retailer, 2000 nodes and map bar 50 So I got to ask you about your go to market. you know, in this fast growing market, you know, an amount of money isn't necessarily all the enterprise accounts, you know, they're going to get the, squirrel's going to find a nut once in awhile. We also have partners and it depends on the geographies as to what that percentage So you can get early If you want to call it that way. a whole wave kind of following that. So talk a little bit about, you know, speaking of verdict and kind of the sequel on Hadoop. And it's part of, of a broad range of capabilities that you want So, yeah, I mean there is because there's so many different there's spark there's, you know, you can run HP Vertica, of the perspectives and advantages say the Apache drill has to expand the data sets why that makes sense for you and, and And if you look at a workload on a mainframe going to duke, So talk a little bit more about kind of the broader partnership strategy. And the more processing you can do on a single platform, the better off you are, Can you kind and I mean, just a whole host of, of companies on the analytics side, as well as on the tools So, you know, I give them credit for kind of acknowledging that, that kind of big data truth So what I want you to talk about the success you have in specifically around where you're winning and you know, add a Rubicon project and, you know, cable TV, And the platform has enabled development. the key capabilities that they have in enterprise systems today that they want to make sure they're in Hindu. Where are you guys heading this year? So, you know, we just reported this incredible momentum where we've tripled core and this whole community kind of burst open, you had to duke world. And as we, you know, focus on more developer tools like our sandbox, a duke player, a leader in the space.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jeff Kelly | PERSON | 0.99+ |
Jack Norris | PERSON | 0.99+ |
John Schroeder | PERSON | 0.99+ |
HP | ORGANIZATION | 0.99+ |
Jeff | PERSON | 0.99+ |
$3 billion | QUANTITY | 0.99+ |
December, 2014 | DATE | 0.99+ |
Jason | PERSON | 0.99+ |
Matt BARR | PERSON | 0.99+ |
10,000 jobs | QUANTITY | 0.99+ |
Today | DATE | 0.99+ |
10 year | QUANTITY | 0.99+ |
Syncsort | ORGANIZATION | 0.99+ |
Dan | PERSON | 0.99+ |
Silicon valley | LOCATION | 0.99+ |
John barrier | PERSON | 0.99+ |
Java | TITLE | 0.99+ |
Yahoo | ORGANIZATION | 0.99+ |
10 years | QUANTITY | 0.99+ |
24 | QUANTITY | 0.99+ |
Hadoop | TITLE | 0.99+ |
Cloudera | ORGANIZATION | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
this year | DATE | 0.99+ |
Jack | PERSON | 0.99+ |
fifth | QUANTITY | 0.99+ |
Linux | TITLE | 0.99+ |
Skytree | ORGANIZATION | 0.99+ |
each | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
today | DATE | 0.98+ |
one | QUANTITY | 0.98+ |
Merv | PERSON | 0.98+ |
about 10 years | QUANTITY | 0.98+ |
San Jose | LOCATION | 0.98+ |
Hadoop | EVENT | 0.98+ |
about 20% | QUANTITY | 0.97+ |
seven | QUANTITY | 0.97+ |
over 500 | QUANTITY | 0.97+ |
a year | QUANTITY | 0.97+ |
about 500 customers | QUANTITY | 0.97+ |
SQL | TITLE | 0.97+ |
seven different verticals | QUANTITY | 0.97+ |
two years | QUANTITY | 0.97+ |
single platform | QUANTITY | 0.96+ |
2014 | DATE | 0.96+ |
Apache | ORGANIZATION | 0.96+ |
Hadoop | LOCATION | 0.95+ |
SiliconANGLE | ORGANIZATION | 0.94+ |
comScore | ORGANIZATION | 0.94+ |
single vendor | QUANTITY | 0.94+ |
day one | QUANTITY | 0.94+ |
Salesforce | ORGANIZATION | 0.93+ |
about nine years | QUANTITY | 0.93+ |
Hadoop Summit 2014 | EVENT | 0.93+ |
Merv | ORGANIZATION | 0.93+ |
two different animals | QUANTITY | 0.92+ |
single application | QUANTITY | 0.92+ |
top three | QUANTITY | 0.89+ |
SAS | ORGANIZATION | 0.89+ |
Riley | PERSON | 0.88+ |
First | QUANTITY | 0.87+ |
Forbes | TITLE | 0.87+ |
single cluster | QUANTITY | 0.87+ |
Mapbox | ORGANIZATION | 0.87+ |
map R | ORGANIZATION | 0.86+ |
map | ORGANIZATION | 0.86+ |
Jack Norris - BigDataNYC 2013 - theCUBE - #BigDataNYC
>>I from Midtown Manhattan, the cute quiet coverage of big data NYC Civicon angled, Wiki bonds production made possible by Hortonworks. We do hairdo and lamb disco and new made invincible. And now your hosts, John furrier and Volante >>Hi buddy. We're back. This is Dave Volante with Jeff Kelly with Wiki bond. And this is the cube Silicon angle's continuous production. We're here at big data NYC right across the street from the Hilton where strata comp and a dupe world is going on. We've got a multi-time cube guest, Jack Norris, the CMO of map bars here, Jack. Welcome back to the cube first. So by the way, thank you so much for the support. As you know, we're across the street here at the Warwick hotel map, our, you guys have always been so generous supporting the cube. We can't thank you enough for that. So really appreciate it. Thank you. So we were able to listen to your keynote yesterday. It was, we, we, we weren't broadcasting, you know, head to head yesterday and had an opportunity to hear your keynote. So, first of all, how did that go? I want to ask you some questions about it. >>It, it was a really well-received and I think people were kind of clamoring to try to separate the myths from, from reality on, on Hadoop, >>We had three myths that you talked about, you know, one related to the distraction. I'd like to get into some of those. So what was the, the first myth was around the, the, the, the district distribution battle. So take us through that. >>So, you know, th the impression that it's a knock-down drag-out competitive battle across Hadoop distributions was the first myth. And the reality is that all of the distribution share the same open source Apache code. And this is one of the first markets that's really, really created, or the first open-source technologies it's really created a market. I mean, look, what's happened here with this whole, this whole big data and Hadoop, but given that early stage, there's the requirement to really combine that open source code with additional innovations to meet customer needs. And so what you see is you see those aggregators that are taken open source, you see others that are taking the open source, and then adding maybe management utility, couple of, of, you know, different applications on top. And then our approach at map R is we're taking the open source with those management innovations, doing some development, the open source community with things like Apache drill, and then really focusing on the underlying architecture, the data platform and providing innovations at that layer. So >>Actually sort of the three major destroys that we talk about all the time. You know, you guys, Hortonworks and Hadoop, you guys have been consistent the whole time as has Hortonworks, right? Cloud era basically put out a post recently saying, Hey, kind of going in a different direction, sort of what I call the tapped out of the Hadoop distro, you know, piece of it. But so there's a lot of discussion around it. You're putting forth the, Hey, it's not an internet seen war, but does it matter is my question? >>Well, I think if you take a step back, the Hadoop ecosystem is incredibly strong growing very, very quickly, fastest growing big data technology, one of the top 10 technologies overall. And I think it's because we are sharing the same API. It is possible for customers to learn on one, develop and move seamlessly to another. And, you know, in the keynote, I talked about the difference between the no SQL market, which is, you know, there is no consensus there and, and customers have to figure out not only what's the right word workload, but what's the technology that's actually going to have some staying power, right? >>That's a powerful comment. Amazon turn the data center and into an API, or you as the duke community is essentially turning data, access into an API. And that is a very powerful and leverageable concept. Okay. Your second myth was around the whole, no SQL yes. Piece of it. You help you put up a slide. I thought I read Jeff Kelly's reports. And I thought, I thought I knew them all, but there were a couple in there that I didn't recognize as you probably knew them all, but so take us through myth. Number two >>Too. I'm sure we missed some >>There wasn't room on the slide for anymore. >>The, yeah, it's basically about the consensus. There is no real consensus. There's no common API. There's no ability to move applications seamlessly across no SQL solutions. If you look at one no SQL solution, and that's, HBase a big inherent advantage because it's integrated with Hindu, you know, this whole trend is about compute and data together. So if you've got a no sequel solution, that's on that same, you know, massive data store, you know, big leg up. And, and then we got into the, well, if you've got HBase, it's included in all the distributions and all the distribution share the same open source, then obviously it must run the same across all distributions. And there, we shared some pretty interesting data to show the difference. When you, when you do architectural differences and innovations underneath that you can dramatically change the performance of, of not only MapReduce, but of no SQL. Yes. >>Okay. So not all no SQL is created equally. Not all HBase is created equally as essentially what you're saying there. Now the third piece was to dupe is enterprise ready, right? Yeah. So you guys were first to say, well, we have a Hadoop platform that's enterprise ready way ahead on that. Got criticized a lot for going down that path shrugged and said, okay, we'll just keep doing business with customers. And you've been again, very clear and consistent on that. So talk about the third myth >>And that's, you know, is, is Hadoop ready for prime time? And I think the way to combat that myth is by customer examples and showing the tremendous success that customers are enjoying with Hadoop. And, you know, we, we don't have time on the cube here to go through all of them, but, you know, I like to point out 90 billion auctions a day with Rubicon, they've surpassed Google in terms of ad reach. They're doing that on Mapbox 1.7 trillion events a month with comScore that's on, on map bar. You look in, in traditional enterprise, you know, a single retailer with over 2000 nodes of Hadoop. I mean, it's a key part of their merchandising and retail operations, and combining all sorts of, of data feeds and all sorts of use cases there, financial services over a thousand nodes of risk medication, personalized offers streamlining their operations. I mean, it's, it's dramatic. And then, you know, we shared some of the more, more interesting ones, esoteric ones like garbage and whiskey and weather prediction. >>There was consider these, we even as diverse and eclectic as they are, they consider these mission critical application. >>Oh, absolutely. No it it's. And I think that's the difference because what we're talking about is not Hadoop as this cash, right? This temporary processing, where we can do, you know, some interesting batch analytics and then take that and put that someplace else. And yes, there are applications like that, but companies soon realized that if I'm going to use this as a key part of my operations, and it's about data on compute, then I want a consistent permanent store. I want a system of record. So all of the SLS and high availability and data protection features that they expect in their enterprise applications should be present in Hadoop, right? That's where we focus. Let's run down a couple of those. >>What are some of the key capabilities that you need in an enterprise enterprise grade platform? That map bar is >>Well, let's, let's take, let's take business continuity cause that's important if you're really going to trust data there. And you know, one of the big drivers as you expand data is how much am I going to spend on it? And if you look at a large investment bank, $270 million of their budget, not total, but incremental to address the additional capacity, there's a big emphasis for let's look at a better way to do that. So instead of spending $15,000 a terabyte, if you can spend a few hundred dollars a terabyte, that's a huge, huge advantage. And that's the focus of Hindu, but to do that, well, then the features that are in this enterprise storage have to be present. And we're talking about, you know, mirroring and not a copy table function, but replication, that's how that's how organizations do it, right. If you're going to recovery and recovery, you know, you can't back up a petabyte of information through a copy function, right? You have to do a snapshot and the snapshots have to be consistent, right. And, and we're not saying anything that, you know, an enterprise administrator doesn't know, there is some confusion when you're more on the developer side as to what these features are and the difference between a fuzzy snapshot and a point in time, consistent snaps. >>Got it. So let's talk a little bit about the, the enterprise data hub, this, this concept that Michael Wilson with clutter introduced yesterday. Tell us a little bit about your take on, on, on Mike's I guess, definition and, and essentially I think trying to name the category of kind of what Hadoop can do and what, and where it sits in the architecture. Did you agree with his, his, >>Yeah. I mean, if you look at, at that description, it's about I'm taking important data and I'm putting it in a dupe and I'm combining a lot of different data sources and it's been referred to as a data lake and a data reservoir and a data ocean. I mean, we've heard a lot of terms. We worked with an outside consultant that was originally an architect at Terre data. It's been about eight months, almost a year ago now where he defined it and enterprise data hub. And it's it's, he went through kind of the list of requirements. And once you move from a transitory to a permanent store, then that becomes an enterprise data hub. And an enterprise data hub can be used to select and process information, maybe it's ETL and serve some downstream applications. It can also be useful to do analysis directly on it, to, you know, to serve different business functions. But the system requirements that he established for that I think are absolutely true. And it's, you have to have the full data protection. You have to have the full disaster recovery. You have to have the full high availability because this is going to be important data serving the organization. If it's data that you can lose, if it's data that you, you don't really care about having highly available, then it's a very narrow use case that that data hub serves. >>So you're saying the enterprise data hub isn't ready for prime time. >>No, I'm saying that there, there are requirements. And we have companies today that have deployed an enterprise data hub and they are quite successful with it. And, you know, the quotes are the ETL functions that they're doing on that hub are 10 times faster and it's 10 times cheaper than what they're seeing. >>Soundbite, Dave, >>I agree, but it's nuanced. Right. And so, you know, the customers cause a lot of vendors, right? They're all saying the same thing to the customers, right? So you've got your messaging that you've, you know, you've proven out over the last several years and then the entire market starts to use the same terminology. So it is, this is why I, like, I think this, what is, what are those >>Things? We're in a little bit of this, this kind of marketing fog here in the relative early stages. I think the best response there is customer proof points. And I think some education in the very beginning, you know, when they're in development and test, it's really important to understand, you know, what is Hadoop and what can I use it for and what data source am I going to leverage? I think the features that we're talking about really start to show up as you deploy in production. And as you expand its use in production and there we've enjoyed tremendous success, >>But he would argue that you have a lead in this space. I wouldn't, I don't think you would either the space being robustness enterprise ready, mission criticality is your lead increasing, decreasing staying the same. >>What's your sense? Well, it's hard cause there's no, you know, th th there's no external service that's out there, you know, interviewing every customer and, and giving numbers. I do know that we passed 500 paying customers. I do know that we've got significant deployments and you can measure those in terms of number of nodes, you know, in the thousands of nodes, you can measure those in terms of use cases. So we've got, you know, one company they've passed 20 different use cases on the same cluster. I think that's an interesting proof point. We're scaling in terms of the number of, of people in an organization that are trained in leveraging the data in map are again in the, in the thousands. So, you know, I think this market is so big and so dynamic that this isn't about, you know, one company success at the expense of everyone. Else's zero sum game. I think, you know, we're all here kind of raising this, this boat and focusing on this paradigm shift, but when it comes to production success, that's our focus. And I think that's where we've, we've proven that >>One thing I'm really want to get your opinion on, you know, as, as to do matures and some of the innovations you guys are doing and, and making the platform, you know, basically a multi application platform, you can do more things with Hadoop. And we've been talking about this on the cube, is that as that happens, you're going to start you as an industry. You're going to start bumping up against the EDW vendors and some of the other database vendors in the traditional world. And you're now you're doing some of the things that those, those tools can do now, you know, two years ago, it was very much just, this is all very complimentary Hadoop and your EDW. There's no overlap. We're gonna all play nice. But increasingly we're seeing that there is an overlap. How do you view that? Is that, and what is your relationship with those, with those EDW vendors and, and what are you hearing from customers when you go into a customer? Okay. >>So, I mean, there's a, there's a lot in that question. I think the F the first comment though, is don't look at Hadoop through this single data warehouse lens. And if you look at, at trying to use Hadoop to completely replace an enterprise data warehouse where there's, here's a few decades of experience, there, there are many organizations that have a lot of activities that are based in that data warehouse. And that's where we're seeing a data warehouse offload that is complimentary, but it gives organizations this lever to say, well, I'm going to control the fill rate, and I'm going to take some of the data that's no longer, you know, really active and put that on Hadoop and really change my ability to manage the costs in a data warehouse environment. The other thing that's interesting is that the types of applications that duper doing, I think are creating a new class it's about operations and analytics, kind of combined together, taking high arrival rate data and making very quick micro changes to optimize whether that's fraud detection or recommendation engines, or taking sensor data and predictive analytics for, for maintenance, et cetera. There is just a tremendous number of, of applications. In some cases, leveraging a new data source in some cases, doing new applications, but it's just opening things up. And, and I think organizations are moving to be very data-driven and Hadoop is at the center of that. >>And you control the field, right? That's another really good soundbites. And, and these that, you mentioned this high arrival rate data, this fraud detection, predictive analytics, maintenance, these are things that you're doing today with >>Navarre right? Yeah, >>Absolutely. Great. All right, Jack. Well, listen, always a pleasure. Thanks very much for coming by. Great to see you again. All right. Keep it right there about Uber, right back with our next guest. This is the cube we're live from the big apple.
SUMMARY :
I from Midtown Manhattan, the cute quiet coverage of big data NYC So by the way, thank you so much for the We had three myths that you talked about, you know, one related to the distraction. So, you know, th the impression that it's a knock-down drag-out sort of what I call the tapped out of the Hadoop distro, you know, piece of it. And, you know, in the keynote, I talked about the difference between the no SQL market, And I thought, I thought I knew them all, but there were a couple in there that I didn't recognize as you probably knew them all, that's on that same, you know, massive data store, you know, big leg up. So you guys were first to say, And that's, you know, is, is Hadoop ready for prime time? where we can do, you know, some interesting batch analytics and then take that and put that someplace else. And you know, one of the big drivers as you expand Did you agree with his, his, to, you know, to serve different business functions. And, you know, the quotes are the ETL functions that they're doing on that hub are 10 And so, you know, the customers cause a lot of you know, when they're in development and test, it's really important to understand, you know, I wouldn't, I don't think you would either the space being robustness enterprise so dynamic that this isn't about, you know, one company success at the expense those tools can do now, you know, two years ago, it was very much just, this is all very complimentary Hadoop and your EDW. And if you look at, at trying to use Hadoop to completely replace an enterprise data warehouse And you control the field, right? Great to see you again.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jeff Kelly | PERSON | 0.99+ |
Michael Wilson | PERSON | 0.99+ |
10 times | QUANTITY | 0.99+ |
Jack | PERSON | 0.99+ |
Jack Norris | PERSON | 0.99+ |
10 times | QUANTITY | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
$270 million | QUANTITY | 0.99+ |
Mike | PERSON | 0.99+ |
yesterday | DATE | 0.99+ |
Dave Volante | PERSON | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
third piece | QUANTITY | 0.99+ |
Dave | PERSON | 0.99+ |
Hadoop | TITLE | 0.99+ |
Midtown Manhattan | LOCATION | 0.99+ |
Uber | ORGANIZATION | 0.99+ |
Volante | PERSON | 0.99+ |
thousands | QUANTITY | 0.99+ |
first | QUANTITY | 0.99+ |
20 different use cases | QUANTITY | 0.99+ |
ORGANIZATION | 0.99+ | |
second | QUANTITY | 0.99+ |
John furrier | PERSON | 0.98+ |
NYC | LOCATION | 0.98+ |
two years ago | DATE | 0.98+ |
Hadoop | ORGANIZATION | 0.98+ |
first comment | QUANTITY | 0.98+ |
Rubicon | ORGANIZATION | 0.98+ |
SQL | TITLE | 0.97+ |
Terre data | ORGANIZATION | 0.97+ |
One | QUANTITY | 0.97+ |
1.7 trillion events | QUANTITY | 0.97+ |
third | QUANTITY | 0.97+ |
today | DATE | 0.97+ |
one | QUANTITY | 0.96+ |
single | QUANTITY | 0.96+ |
a year ago | DATE | 0.95+ |
one company | QUANTITY | 0.94+ |
HBase | TITLE | 0.94+ |
Navarre | PERSON | 0.93+ |
EDW | ORGANIZATION | 0.92+ |
over 2000 nodes | QUANTITY | 0.91+ |
big apple | ORGANIZATION | 0.91+ |
first markets | QUANTITY | 0.9+ |
nodes | QUANTITY | 0.89+ |
about eight months | QUANTITY | 0.88+ |
2013 | DATE | 0.88+ |
Soundbite | ORGANIZATION | 0.87+ |
three myths | QUANTITY | 0.87+ |
Hindu | ORGANIZATION | 0.87+ |
first open-source | QUANTITY | 0.86+ |
Wiki bond | ORGANIZATION | 0.85+ |
BigDataNYC | EVENT | 0.85+ |
$15,000 a terabyte | QUANTITY | 0.85+ |
three major | QUANTITY | 0.82+ |
90 billion auctions a day | QUANTITY | 0.81+ |
500 paying customers | QUANTITY | 0.79+ |
comScore | ORGANIZATION | 0.79+ |
map R | ORGANIZATION | 0.78+ |
over a thousand nodes | QUANTITY | 0.77+ |
Hilton | LOCATION | 0.77+ |
few hundred dollars a terabyte | QUANTITY | 0.76+ |
Number two | QUANTITY | 0.76+ |
10 technologies | QUANTITY | 0.74+ |
Jack Norris - Hadoop Summit 2013 - theCUBE - #HadoopSummit
>>Ash it's, you know, what will that mean to my investment? And the announcement fusion IO is that, you know, we're 25 times faster on read intensive HBase applications. The combination. So as organizations are deploying Hadoop, and they're looking at technology changes coming down the pike, they can rest assured that they'll be able to take advantage of those in a much more aggressive fashion with map R than, than other distribution. >>Jack, how I got to ask you, we were talking last night at the Hadoop summit, kind of the kickoff party and, you know, everyone was there. All the top execs were there and all the developers, you know, we were in the queue. I think, I think that either Dave or myself coined the term, the big three of big data, you guys ROMs cloud Cloudera map R and Hortonworks, really at the, at the beginning of the key players early on and Charles from Cloudera was just recently on. And, and he's like, oh no, this, this enterprise grade stuff has been kicked around. It's been there from the beginning. You guys have been there from the beginning and Matt BARR has never, ever waffled on your, on your messaging. You've always been very clear. Hey, we're going to take a dupe open source a dupe and turn it into an enterprise grade product. Right. So that's clear, right? That's, that's, that's a great, that's a great, so what's your take on this because now enterprise grade is kind of there, I guess, the buzz around getting the, like the folks that have crossed the chasm implemented. So what can you comment on that about one enterprise grade, the reality of it, certainly from your perspective, you haven't been any but others. And then those folks that are now rolling it out for the first time, what can you share with them around? What does it mean to be enterprise grade? >>So enterprise grade is more about the customer experience than, than a marketing claim. And, you know, by enterprise grade, what we're talking about are some of the capabilities and features that they've grown to expect in their, their other enterprise applications. So, you know, the ability to meet full S SLA is full ha recovery from multiple failures, rolling upgrades, data protection was consistent snapshots business continuity with mirroring the ability to share a cluster across multiple groups and have, you know, volumes. I mean, there's a, there's a host of features that fall under the umbrella enterprise grade. And when you move from no support for any of those features to support to a few of them, I don't think that's going to, to ha it's more like moving to low availability. And, and there's just a lot of differences in terms of when we say enterprise grade with those features mean versus w what we view as kind of an incomplete story. So >>What do you, what do you mean by low availability? Well, I mean, it's tongue in cheek. It's nice. It's a good term. It's really saying, you know, just available when you sometimes is that what you mean? Is this not true availability? I mean, availability is 99.9%. Right? >>Right. So if you've got a, an ha solution that can't recover from multiple failures, that's downtime. If you've got an HBase application that's running online and you have data that goes down and it takes 10 to 30 minutes to have the region servers recover it from another place in the distribution, that's downtime. If you have snapshots that aren't consistent across the cluster, that doesn't provide data protection, there's no point in time recovery for, for a cluster. So, you know, there's a lot of details underneath that, but what it, what it amounts to is, do you have interruptions? Do you have downtime? Do you have the potential for losing data? And our answer is you need a series of features that are hardened and proven to deliver that. >>What about recoverability? You mentioned that you guys have done a lot of work in that area with snapshotting, that's kind of being kicked around, are our folks addressing, what are the comp what's your competition doing in those areas of recoverability just mentioned availability. Okay, got that. Recoverability security, compliance, and usability. Those are the areas that seem to be the hot focus areas what's going on in the energy. How would you give them the grade, the letter grade, if you will, candidly, compared to what you guys offer? Well, the, >>The first of all, it's take recoverability. You know, one of the tenants is you have a point in time recovery, the ability to restore to a previous point that's consistent across the cluster. And right now there's, there's no point in time recovery for, for HDFS, for the files. And there's no point in time recovery for HBase tables. So there's snapshot support. It's being talked about in the open source community with respect to snapshots, but it's being referred to in the JIRAs as fuzzy snapshots and really compared to copy table. >>So, Jack, I want to turn the conversation to the, kind of the topic we've talked about before kind of the open versus a proprietary that, that whole debate we've, we've, we've heard about that. We talked about that before here on the cube. So just kind of reiterate for us your take. I mean, we, we hear perhaps because of the show we're at, there's a lot of talk about the open source nature of Hadoop and some of the purists, as you might call them are saying, it's gotta be open a hundred percent Patrick compatible, et cetera. And then there's others that are taking a different approach, explain your approach and why you think that's the key way to make, to really spur adoption of a dupe and make it >>W w we're we're a part of the community we're, we've got, you know, commitment going on. We've, you know, pioneered and pushed a patchy drill, but we have done innovations as well. And I think that those innovations are really required to support and extend the, the whole ecosystem. So canonical distributes RN, three D distribution. We've got, you know, all our, our packages are, are available on get hub and, and open source. So it's not, it's not a binary debate. And I think the, the point being that there's companies that have jumped ahead and now that Peloton is, is, you know, pedaling faster and, and we'll, we'll catch up. We'll streamline. I think the difference is we rearchitected. So we're basically in a race car and, you know, are, are racing ahead with, with enterprise grade features that are required. And there's a lot of work that still needs to be done, needs to be accomplished before that full rearchitecture is, is in place. >>Well, I mean, I think for me, the proof is really in the pudding when you, when it comes to talk about customers that are doing real things and real production, grade mission, critical applications that they're running. And to me that shows the successor or relative success of a given approach. So I know you guys are working with companies like ancestry.com, live nation and Quicken loans. Maybe you could, could you walk us through a couple of those scenarios? Let's take ancestry.com. Obviously they've got a huge amount of data based on the kind of geological information, where do you guys do >>With them? Yeah, so they've got, I mean, they've got the world's largest family genealogy services available on the web. So there's a massive amount of data that they make accessible and, and, you know, ability for, for analysis. And then they've rolled out new features and new applications. One of which is to ship a kit out, have people spit in a tube, returned back and they do DNA matching and reveal additional details. So really some really fabulous leading edge things that are being done with, with the use of, of Hadoop. >>Interesting. So talk about when you went to, to work with them, what were some of their key requirements? Was it around, it was more around the enterprise enterprise, grade security and uptime kind of equation, or was it more around some of the analytics? What, what, what's the kind of the killer use case for them? >>It's kind of, you know, it's, it's hard with a specific company or even, you know, to generalize across companies. Cause they're really three main areas in terms of ease of use and administration dependability, which includes the full ha and then, and then performance. And in some cases, it's, it's just one of those that kind of drives it. And it's used to justify, in other cases, it's kind of a collection. The ease of use is being able to use a cluster, not only as Hadoop, but to access it and treat it like enterprise storage. So it's a complete POSIX compliance file system underneath that allows the, the mounting and access and updates and using it in dynamic read-write. So what that means from an application level, it's, it's faster, it's much easier to administer and it's much easier and reliable for developers to, to utilize. >>I got to ask you about the marketing question cause I see, you know, map our, you guys have done a good job of marketing. Certainly we want to be thankful to you guys is supporting the cube in the past and you guys have been great supporters of our mission, but now the ecosystem's evolving a lot more competition. Claudia mentioned those eight companies they're tracking in quote Hadoop, and certainly Jeff and I, and, and SiliconANGLE by look at there's a lot more because Hadoop washing has been going on now for the term Hadoop watching me and jumping in and doing Hadoop, slapping that onto an existing solution. It's not been happening full, full, full bore for a year. At least what's the next for you guys to break above the noise? Obviously the communities are very active projects are coming online. You guys have your mission in the enterprise. What's the strategy for you guys going forward is more of the same and anything new even share. >>Yeah, I, I, I think as far as breaking above the noise, it will be our customers, their success and their use cases that really put the spotlight on what the differences are in terms of, of, you know, using a big data platform. And I think what, what companies will start to realize is I'd rather analogy between supply chain and the big, the big revolution in supply chain was focusing on inventory at each stage in the supply chain. And how do you reduce that inventory level and how do you speed the, the flow of goods and the agility of a company for competitive advantage. And I think we're going to view data the same way. So companies instead of raw data that they're copying and moving across different silos, if they're able to process data in place and send small results sets, they're going to be faster, more agile and more competitive. >>And that puts the spotlight on what data platform is out there that can support a broad set of applications and it can have the broadest set of functionality. So, you know, what we're delivering is a mission grade, you know, enterprise grade mission, critical support platform that supports MapReduce and does that high performance provides NFS POSIX access. So you can use it like a file system integrates, you know, enterprise grade, no SQL applications. So now you can do, you know, high-speed consistent performance, real time operations in addition to batch streaming, integrated search, et cetera. So it's, it's really exciting to provide that platform and have organizations transform what they're doing. >>How's the feedback on with Ted Dunning? I haven't seen a lot of buzz on the Twittersphere is getting positive feedback here. He's a, a tech athlete. He's a guru, he's an expert. He's got his hands in all the pies. He's a scientist type. What's he up to? What's his, what's his role within Mapa and he's obviously playing in the open-source community. What's he up to these days, >>Chief application architect, he's on the leading edge of my house. So machine learning, so, you know, sharing insights there, he was speaking at the storm meetup two nights ago and sharing how you can integrate long running batch, predictive analytics with real-time streaming and how the use of snapshots really that, that easy and possible. He travels the world and is helping organizations understand how they can take some very complex, long running processes and really simplify and shorten those >>Chance to meet him in New York city had last had duke world at a, at a, a party and great guy, fantastic geek, and certainly is doing a great work and shout out to Ted. Congratulations, continue up that support. How's everyone else doing? How's John and Treevis doing how's the team at map are we're pedaling as best as you can growing >>Really quickly. No, we're just shifting gears. Would it be on pedaling >>Engine? >>Yeah. Give us an update on the company in terms of how the growth and kind of where you guys are moving that. >>Yeah. We're, we're expanding worldwide, you know, just this, you know, last few months we've opened up offices and in London and Munich and Paris, we're expanding in Asia, Japan and Korea. So w our, our sales and services and engineering, and basically across the whole company continues to expand rapidly. Some really great, interesting partnerships and, and a lot of growth Natalie's we add customers, but it's, it's nice to see customers that continue to really grow their use of map are within their organization, both in terms of amount of data that they're analyzing and the number of applications that they're bringing to bear on the platform. >>Well, that a little bit, because I think, you know, one of the, one of the trends we do see is when a company brings in big data, big data platform, and they might start experiment experimenting with it, build an application. And then maybe in the, maybe in the marketing department, then the sales guys see it and they say, well, maybe we can do something with that. How is that typically the kind of the experience you're seeing and how do you support companies that want to start expanding beyond those initial use cases to support other departments, potentially even other physical locations around the world? How do you, how do you kind of, >>That's been the beauty of that is if you have a platform that can support those new applications. So if you know, mission critical workloads are not an issue, if you support volumes so that you can logically separate makes it much easier, which we have. So one of our customers Zions bank, they brought in Matt BARR to do fraud detection. And pretty soon the fact that they were able to collect all of that data, they had other departments coming to them and saying, Hey, we'd like to use that to do analysis on because we're not getting that data from our existing system. >>Yeah. They come in and you're sitting on a goldmine, there are use cases. And you also mentioned kind of, as you're expanding internationally, what's your take on the international market for big data to do specifically is, is the U S kind of a leaps and bounds ahead of the rest of the world in terms of adoption of the technology. What are you seeing out there in terms of where, where the rest of the, >>I wouldn't say leaps and bounds, and I think internationally, they're able to maybe skip some of the experimental steps. So we're seeing, we're seeing deployment of class financial services and telecom, and it's, it's fairly broad recruit technologies there. The largest provider of recruiting services, indeed.com is one of their subsidiaries they're doing a lot with, with Hadoop and map are specifically, so it's, it's, it's been, it's been expanding rapidly. Fantastic. >>I also, you know, when you think about Europe, what's going on with Google and some of the, the privacy concerns even here, or I should say, is there, are there different regulatory environments you've got to navigate when you're talking about data and how you use data when you're starting to expand to other, other locales? >>Yeah. There's typically by vertical, there's different, different requirements, HIPAA and healthcare, and basal to, and financial services. And so all of those, and it, it, it basically, it's the same theme of when you're bringing Hadoop into an organization and into a data center, the same sorts of concerns and requirements and privacy that you're applying in other areas will be applied on Hindu. >>I'm now kind of turning back to the technology. You mentioned Apache drill. I'd love to get an update on kind of where, where that stands. You know, it's put, then put that into context for people. We hear a lot about the SQL and Hadoop question here, where does drill fit into that, into that equation? >>Well, the, the, you know, there's a lot of different approaches to provide SQL access. A lot of that is driven by how do you, how do you leverage some of the talent and organization that, you know, speak SQL? So there's developments with respect to hive, you know, there's other projects out there. Apache drill is an open source project, getting a lot of community involvement. And the design center there is pretty interesting. It started from the beginning as an open source project. And two main differences. One was in looking at supporting SQL it's, let's do full ANSI SQL. So it's full 2003 ANSI, sequel, not a SQL like, and that'll support the greatest number of applications and, you know, avoid a lot of support and, and issues. And the second design center is let's support a broad set of data sources. So nested sources like Jason scheme on discovery, and basically fitting it into an enterprise environment, which sometimes is kinda messy and can get messy as acquisitions happen, et cetera. So it's complimentary, it's about, you know, enabling interactive, low latency queries. >>Jack, I want to give you the final word. We are out of time. Thanks for coming on the cube. Really preached. Great to see you again, keep alumni, but final word. And we'll end the segment here on the cube is your quick thoughts on what's happening here at Hadoop world. What is this show about? Share with the audience? What's the vibe, the summary quick soundbite on Hadoop. >>I think I'll go back to how we started. It's not, if you used to do putz, how you use to do and, you know, look at not only the first application, but what it's going to look like in multiple applications and pay attention to what enterprise grade means. >>Okay. They were secure. We got a more coverage coming, Jack Norris with map R I'll say one of the big three original, big three, still on the, on the list in our mind, and the market's mind with a unique approach to Hadoop and the mid-June great. This is the cube I'm Jennifer with Jeff Kelly. We'll be right back after this short break, >>Let's settle the PR program out there and fighting gap tech news right there. Plenty of the attack was that providing a new gadget. Let's talk about the latest game name, but just the.
SUMMARY :
IO is that, you know, we're 25 times faster on read intensive HBase applications. All the top execs were there and all the developers, you know, So, you know, the ability to meet full S SLA is full ha It's really saying, you know, just available when So, you know, there's a lot of details compared to what you guys offer? You know, one of the tenants is you have a point of Hadoop and some of the purists, as you might call them are saying, it's gotta be open a hundred percent that Peloton is, is, you know, pedaling faster and, and we'll, we'll catch up. So I know you guys are working with companies like ancestry.com, live nation and Quicken that they make accessible and, and, you know, ability for, So talk about when you went to, to work with them, what were some of their key requirements? It's kind of, you know, it's, it's hard with a specific company or even, I got to ask you about the marketing question cause I see, you know, map our, you guys have done a good job of marketing. And how do you reduce that inventory level and how do you speed the, you know, what we're delivering is a mission grade, you know, enterprise grade mission, How's the feedback on with Ted Dunning? so, you know, sharing insights there, he was speaking at the storm meetup How's John and Treevis doing how's the team at map are we're pedaling as best as you can No, we're just shifting gears. and basically across the whole company continues to expand rapidly. Well, that a little bit, because I think, you know, one of the, one of the trends we do see is when a company brings in big data, That's been the beauty of that is if you have a platform that can support those And you also mentioned kind of, they're able to maybe skip some of the experimental steps. and it, it, it basically, it's the same theme of when you're bringing Hadoop into We hear a lot about the SQL and Hadoop question support the greatest number of applications and, you know, avoid a lot of support and, Great to see you again, you know, look at not only the first application, but what it's going to look like in multiple This is the cube I'm Jennifer with Jeff Kelly. Plenty of the attack was that providing a new gadget.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Ted | PERSON | 0.99+ |
London | LOCATION | 0.99+ |
Claudia | PERSON | 0.99+ |
Jeff Kelly | PERSON | 0.99+ |
Asia | LOCATION | 0.99+ |
Ted Dunning | PERSON | 0.99+ |
Jack Norris | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
John | PERSON | 0.99+ |
Jack | PERSON | 0.99+ |
10 | QUANTITY | 0.99+ |
Paris | LOCATION | 0.99+ |
Korea | LOCATION | 0.99+ |
Matt BARR | PERSON | 0.99+ |
Munich | LOCATION | 0.99+ |
New York | LOCATION | 0.99+ |
99.9% | QUANTITY | 0.99+ |
Jennifer | PERSON | 0.99+ |
Treevis | PERSON | 0.99+ |
25 times | QUANTITY | 0.99+ |
Japan | LOCATION | 0.99+ |
ORGANIZATION | 0.99+ | |
both | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
Jeff | PERSON | 0.99+ |
eight companies | QUANTITY | 0.99+ |
first time | QUANTITY | 0.99+ |
mid-June | DATE | 0.99+ |
Charles | PERSON | 0.98+ |
Europe | LOCATION | 0.98+ |
30 minutes | QUANTITY | 0.98+ |
One | QUANTITY | 0.98+ |
first application | QUANTITY | 0.98+ |
Ash | PERSON | 0.98+ |
two nights ago | DATE | 0.98+ |
Hortonworks | ORGANIZATION | 0.98+ |
each stage | QUANTITY | 0.97+ |
SQL | TITLE | 0.97+ |
SiliconANGLE | ORGANIZATION | 0.97+ |
Natalie | PERSON | 0.97+ |
ancestry.com | ORGANIZATION | 0.96+ |
Hadoop | TITLE | 0.96+ |
Patrick | PERSON | 0.96+ |
last night | DATE | 0.95+ |
Jason | PERSON | 0.95+ |
2003 | DATE | 0.95+ |
Hadoop | EVENT | 0.94+ |
Apache | ORGANIZATION | 0.94+ |
Hadoop | PERSON | 0.93+ |
indeed.com | ORGANIZATION | 0.93+ |
hundred percent | QUANTITY | 0.92+ |
HBase | TITLE | 0.92+ |
Hadoop Summit 2013 | EVENT | 0.92+ |
Quicken loans | ORGANIZATION | 0.92+ |
two main differences | QUANTITY | 0.89+ |
HIPAA | TITLE | 0.89+ |
#HadoopSummit | EVENT | 0.89+ |
S SLA | TITLE | 0.89+ |
Hadoop | ORGANIZATION | 0.88+ |
Cloudera | ORGANIZATION | 0.85+ |
map R | TITLE | 0.85+ |
a year | QUANTITY | 0.83+ |
Zions bank | ORGANIZATION | 0.83+ |
Peloton | LOCATION | 0.78+ |
NFS | TITLE | 0.78+ |
MapReduce | TITLE | 0.77+ |
Cloudera map R | ORGANIZATION | 0.75+ |
live | ORGANIZATION | 0.74+ |
second design center | QUANTITY | 0.73+ |
Hindu | ORGANIZATION | 0.7+ |
theCUBE | ORGANIZATION | 0.7+ |
three main areas | QUANTITY | 0.68+ |
one enterprise grade | QUANTITY | 0.65+ |