theCUBE Previews Supercomputing 22
(inspirational music) >> The history of high performance computing is unique and storied. You know, it's generally accepted that the first true supercomputer was shipped in the mid 1960s by Control Data Corporation, CDC, designed by an engineering team led by Seymour Cray, the father of supercomputing. He left CDC in the 70's to start his own company, of course, carrying his own name. Now that company, Cray, became the market leader in the 70's and the 80's, and then the decade of the 80's saw attempts to bring new designs, such as massively parallel systems, to reach new heights of performance and efficiency. Supercomputing design was one of the most challenging fields, and a number of really brilliant engineers became kind of quasi-famous in their little industry. In addition to Cray himself, Steve Chen, who worked for Cray, then went out to start his own companies. Danny Hillis, of Thinking Machines. Steve Frank of Kendall Square Research. Steve Wallach tried to build a mini supercomputer at Convex. These new entrants, they all failed, for the most part because the market at the time just wasn't really large enough and the economics of these systems really weren't that attractive. Now, the late 80's and the 90's saw big Japanese companies like NEC and Fujitsu entering the fray, and governments around the world began to invest heavily in these systems to solve societal problems and make their nations more competitive. And as we entered the 21st century, we saw the coming of petascale computing, with China actually cracking the Top500 list of high performance computing. And today, we're now entering the exascale era, with systems that can complete a billion, billion calculations per second, or 10 to the 18th power. Astounding. And today, the high performance computing market generates north of $30 billion annually and is growing in the high single digits. 
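To put "a billion, billion calculations per second" in perspective, here is a quick back-of-the-envelope comparison. The exaFLOPS figure is from the narrative; the laptop figure is an illustrative assumption, not a number from the transcript:

```python
# Back-of-the-envelope: an exascale system versus an ordinary laptop.
exaflops = 10**18      # one exaFLOPS: a billion, billion (10^18) calculations per second
laptop_flops = 10**11  # assumed ~100 GFLOPS for a typical laptop (illustrative only)

speedup = exaflops // laptop_flops    # how many times faster the exascale machine is
days = speedup / 86_400               # one exascale-second of work, measured in laptop-days

print(speedup, round(days, 1))        # 10000000 115.7
```

In other words, under these assumed numbers, a workload the exascale system finishes in one second would keep a laptop busy for roughly four months.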
Supercomputers solve the world's hardest problems in things like simulation, life sciences, weather, energy exploration, aerospace, astronomy, automotive industries, and many other high value examples. And supercomputers are expensive. You know, the highest performing supercomputers used to cost tens of millions of dollars, maybe $30 million. And we've seen that steadily rise to over $200 million. And today we're even seeing systems that cost more than half a billion dollars, even into the low billions when you include all the surrounding data center infrastructure and cooling required. The US, China, Japan, and EU countries, as well as the UK, are all investing heavily to keep their countries competitive, and no price seems to be too high. Now, there are five mega trends going on in HPC today, in addition to this massive rising cost that we just talked about. One, systems are becoming more distributed and less monolithic. The second is the power of these systems is increasing dramatically, both in terms of processor performance and energy consumption. The x86 today dominates processor shipments, it's going to probably continue to do so. Power has some presence, but ARM is growing very rapidly. Nvidia with GPUs is becoming a major player with AI coming in, we'll talk about that in a minute. And both the EU and China are developing their own processors. We're seeing massive densities with hundreds of thousands of cores that are being liquid-cooled with novel phase change technology. The third big trend is AI, which of course is still in the early stages, but it's being combined with ever larger and massive, massive data sets to attack new problems and accelerate research in dozens of industries. Now, the fourth big trend, HPC in the cloud reached critical mass at the end of the last decade. And all of the major hyperscalers are providing HPC as a service capability. 
Now finally, quantum computing is often talked about and predicted to become more stable by the end of the decade and crack new dimensions in computing. The EU has even announced a hybrid QC, with the goal of having a stable system in the second half of this decade, most likely around 2027, 2028. Welcome to theCUBE's preview of SC22, the big supercomputing show which takes place the week of November 13th in Dallas. theCUBE is going to be there. Dave Nicholson will be one of the co-hosts and joins me now to talk about trends in HPC and what to look for at the show. Dave, welcome, good to see you. >> Hey, good to see you too, Dave. >> Oh, you heard my narrative up front Dave. You got a technical background, CTO chops, what did I miss? What are the major trends that you're seeing? >> I don't think you really- You didn't miss anything, I think it's just a question of double-clicking on some of the things that you brought up. You know, if you look back historically, supercomputing was sort of relegated to things like weather prediction and nuclear weapons modeling. And these systems would live in places like Lawrence Livermore Labs or Los Alamos. Today, that requirement for cutting edge, leading edge, highest performing supercompute technology is bleeding into the enterprise, driven by AI and ML, artificial intelligence and machine learning. So when we think about the conversations we're going to have and the coverage we're going to do of the SC22 event, a lot of it is going to be looking under the covers and seeing what kind of architectural things contribute to these capabilities moving forward, and asking a whole bunch of questions. >> Yeah, so there's this sort of theory that the world is moving toward this connectivity beyond compute-centricity to connectivity-centric. We've talked about that, you and I, in the past. Is that a factor in the HPC world? How is it impacting, you know, supercomputing design? 
>> Well, so if you're designing an island that is, you know, the tip of the spear, doesn't have to offer any level of interoperability or compatibility with anything else in the compute world, then connectivity is important simply from a speeds and feeds perspective. You know, lowest latency connectivity between nodes and things like that. But as we sort of democratize supercomputing, to a degree, as it moves from solely the purview of academia into truly ubiquitous architecture leveraged by enterprises, you start asking the question, "Hey, wouldn't it be kind of cool if we could have this hooked up into our ethernet networks?" And so, that's a whole interesting subject to explore because with things like RDMA over Converged Ethernet, you now have the ability to have these supercomputing capabilities directly accessible by enterprise computing. So that level of detail, opening up the box of looking at the NICs, or the storage cards that are in the box, is actually critically important. And as an old-school hardware knuckle-dragger myself, I am super excited to see what the cutting edge holds right now. >> Yeah, when you look at the SC22 website, I mean, they're covering all kinds of different areas. They got, you know, parallel clustered systems, AI, storage, you know, servers, system software, application software, security. I mean, you know, HPC is no longer this niche. It really touches virtually every industry, and most industries anyway, and is really driving new advancements in society and research, solving some of the world's hardest problems. So what are some of the topics that you want to cover at SC22? >> Well, I kind of, I touched on some of them. I really want to ask people questions about this idea of HPC moving from just academia into the enterprise. And the question of, does that mean that there are architectural concerns that people have that might not be the same as the concerns that someone in academia or in a lab environment would have? 
And by the way, just like, little historical context, I can't help it. I just went through the upgrade from iPhone 12 to iPhone 14. This has got one terabyte of storage in it. One terabyte of storage. In 1997, I helped build a one terabyte NAS system that a government defense contractor purchased for almost $2 million. $2 million! This was, I don't even know, it was $9.99 a month extra on my cell phone bill. We had a team of seven people who were going to manage that one terabyte of storage. So, similarly, when we talk about just where are we from a supercompute resource perspective, if you consider it historically, it's absolutely insane. I'm going to be asking people about, of course, what's going on today, but also the near future. You know, what can we expect? What is the sort of singularity that needs to occur where natural language processing across all of the world's languages exists in a perfect way? You know, do we have the compute power now? What's the interface between software and hardware? But really, this is going to be an opportunity that is a little bit unique in terms of the things that we typically cover, because this is a lot about cracking open the box, the server box, and looking at what's inside and carefully considering all of the components. >> You know, Dave, I'm looking at the exhibitor floor. It's like, everybody is here. NASA, Microsoft, IBM, Dell, Intel, HPE, AWS, all the hyperscale guys, Weka IO, Pure Storage, companies I've never heard of. It's just, hundreds and hundreds of exhibitors, Nvidia, Oracle, Penguin Solutions, I mean, just on and on and on. Google, of course, has a presence there, theCUBE has a major presence. We got a 20 x 20 booth. So, it's really, as I say, to your point, HPC is going mainstream. 
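Dave's one-terabyte anecdote works out to a staggering price drop. A quick sketch: the 1997 price is from the transcript, while the phone-plan total uses his $9.99-per-month quote with an assumed two-year term, which is an illustrative assumption:

```python
# 1997: roughly $2 million for a 1 TB NAS (from the anecdote above).
nas_1997 = 2_000_000

# Today: $9.99/month for 1 TB of extra phone storage,
# over an assumed 24-month term (hypothetical contract length).
phone_tb = 9.99 * 24

ratio = nas_1997 / phone_tb   # how many times cheaper per terabyte

print(round(phone_tb, 2), round(ratio))   # 239.76 8342
```

Under those assumptions, the same terabyte costs on the order of eight thousand times less, before even counting the team of seven people it no longer takes to manage it.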
You know, I think a lot of times, we think of HPC supercomputing as this just sort of, off in the eclectic, far off corner, but it really, when you think about big data, when you think about AI, a lot of the advancements that occur in HPC will trickle through and go mainstream in commercial environments. And I suspect that's why there are so many companies here that are really relevant to the commercial market as well. >> Yeah, this is like the Formula 1 of computing. So if you're a Motorsports nerd, you know that F1 is the pinnacle of the sport. SC22, this is where everybody wants to be. Another little historical reference that comes to mind, there was a time in, I think, the early 2000's when Unisys partnered with Intel and Microsoft to come up with, I think it was the ES7000, which was supposed to be the mainframe, the sort of Intel mainframe. It was an early attempt to use... And I don't say this in a derogatory way, commodity resources to create something really, really powerful. Here we are 20 years later, and we are absolutely smack in the middle of that. You mentioned the focus on x86 architecture, but all of the other components that the silicon manufacturers bring to bear, companies like Broadcom, Nvidia, et al, they're all contributing components to this mix in addition to, of course, the microprocessor folks like AMD and Intel and others. So yeah, this is big-time nerd fest. Lots of academics will still be there. The supercomputing.org, this loose affiliation that's been running these SC events for years. They have a major focus, major hooks into academia. They're bringing in legit computer scientists to this event. This is all cutting edge stuff. >> Yeah. So like you said, it's going to be kind of, a lot of techies there, very technical computing, of course, audience. At the same time, we expect that there's going to be a fair amount, as they say, of crossover. And so, I'm excited to see what the coverage looks like. 
Yourself, John Furrier, Savannah, I think even Paul Gillin is going to attend the show, because I believe we're going to be there three days. So, you know, we're doing a lot of editorial. Dell is an anchor sponsor, so we really appreciate them providing funding so we can have this community event and bring people on. So, if you are interested- >> Dave, Dave, I just have- Just something on that point. I think that's indicative of where this world is moving when you have Dell so directly involved in something like this, it's an indication that this is moving out of just the realm of academia and moving in the direction of enterprise. Because as we know, they tend to ruthlessly drive down the cost of things. And so I think that's an interesting indication right there. >> Yeah, as do the cloud guys. So again, this is mainstream. So if you're interested, if you got something interesting to talk about, if you have market research, you're an analyst, you're an influencer in this community, you've got technical chops, maybe you've got an interesting startup, you can contact David, david.nicholson@siliconangle.com. John Furrier is john@siliconangle.com. Or me, david.vellante@siliconangle.com. I'd be happy to listen to your pitch and see if we can fit you onto the program. So, really excited. It's the week of November 13th. I think November 13th is a Sunday, so I believe David will be broadcasting Tuesday, Wednesday, Thursday. Really excited. Give you the last word here, Dave. >> No, I just, I'm not embarrassed to admit that I'm really, really excited about this. It's cutting edge stuff and I'm really going to be exploring this question of where does it fit in the world of AI and ML? I think that's really going to be the center of what I'm really seeking to understand when I'm there. >> All right, Dave Nicholson. Thanks for your time. theCUBE at SC22. Don't miss it. Go to thecube.net, go to siliconangle.com for all the news. This is Dave Vellante for theCUBE and for Dave Nicholson. 
Thanks for watching. And we'll see you in Dallas. (inquisitive music)
Shimon Ben David | KubeCon + CloudNativeCon NA 2021
welcome back to los angeles lisa martin here with dave nicholson day three of the cube's coverage of kubecon cloud native con north america 2020 we've been having some great comp live conversations in the last three days with actual guests on set we're very pleased to welcome to for the first time to our program shimon ben david the cto of weka welcome hey nice to be here nice to be here great to be at an in-person event isn't it no it's awesome they've done a great job i think you're green you're green like we're green fully green which is fantastic actually purple and hearts wake up yeah good to know green means you're shaking hands and maybe the occasional hug so talk to us about weka what's going on we'll kind of dig into what you guys are doing with kubernetes but give us that overview of what's going on at weka io okay so weka has been around for several years already uh we actually jade our product of 2016 so it's been out there uh actually eight of the fortune 50 are using weka um for those of you that don't know weka by the way we're a fully software defined parallel file system cloud native i know it's a mouthful and it's buzzword compliant but we actually baked all of that into the product from day one because we did other storage companies in the past and we actually wanted to take the best of all worlds and put that into one storage that is is not another me too it's not another compromise so we built the the environment we built weka to actually accommodate for upcoming technologies so we identified also that cloud technology is upcoming network actually exploded in a good way one gig 10 gig 100 gig 200 gig came out so we knew that that's going to be a trend and also cloud we saw cloud being utilized more and more and we kind of like bet that being able to be a parallel file system for the cloud would be amazing and it does how are you not on me too tell me tell us that when you're talking with customers what are the like the top three things that 
really differentiate weka speed scale and simplicity speed skills i like how fast you said that like quicker so speed sorry you see a lot of file system a lot of storage environments that are very um throughput oriented so speed how many gigabytes can you do to be honest a lot of storage environments are saying we can do that in that many gigabytes when we designed weka actually we wanted to provide an environment that would actually be faster than your local nvme on your local server because that's what we see are actually customers using for performance they're copying the data for their local to their local nvmes and process it we created an environment that is actually throughput oriented iops oriented latency sensitive and metadata performance so it's kind of like the best of all worlds and it's just not just a claim we actually showed it in many benchmarks uh top 500s supercomputing centers can talk for hours about performance but that's performance um scalability we actually are able to scale uh and we did show that we scaled to multiple petabytes we actually uh took some projects from scale-out nas appliances that actually got to their limit of their scale out and we we just continued from there double digit triple digits petabytes upcoming um and also scale is also how many clients can you service at once so it's not only how much capacity but also how many clients can you can you work with concurrently and simplicity all of that we from the initial design points were let's make something that is usable by users and not like so my mother can really use it right and so we have a very simple intuitive user interface but it's also api driven so you can automate around it so simplicity speed and scale love it so shimon it's interesting you said that your company was founded in 2016 in that in that time period because uh before jade ga ga 2016. 
um but but in those in in those surrounding years uh there were a lot of companies that were coming out at sort of the tail end of the legacy storage world yeah trying to just cannibalize that business you came out looking into the future where are we in that future now because you could argue that you guys maybe started a little early you could have taken a couple of years off and waited for uh for for the wave in the world of containerization as an example to come through but this is really this is like your time to shine isn't it exactly and being fully software defined we can always um adapt and we're always adapting so we bet on new technologies networking flash environments and these keep just keep on going and improving right when we went out we were like in 10 gig environments with ssds but we already knew that we're going to go to 100 and we also designed already for nvmes so kind like hardware constantly improved uh cpus for example the new intel cpus the new amd cpus we just accommodated for them because being software defined means that we actually bypass most of their inner workings and do things ourselves so that's awesome and then the cloud environment is growing massively and containers we see containers now in everyday uh use cases where initially it was maybe vms maybe bare metal but now everything is containerized and we're actually starting to see more and more kubernetes orchestrated environment uh coming out as well i still have a feeling that this is still a bit of dev property hey i'm a developer i'm a devops engineer i'm going to do it uh and it's there is i actually saw a lot of exciting things here um taking it to the next level to the it environment so um that's where we will show benefit as well so talk about how kubernetes users are are working with weka what is what superpower does that give them so um i think if you look at the current storage solutions that you have for kubernetes um they're interesting but they're more of like the 
let's take what we have today and plug it in right um so what kind of has a csi uh plug-in so it's easy to integrate and work with but also when you look at it um block is still being used in in kubernetes environments that i'm familiar with block was still being used for high performance so i i used uh pvs and pvcs to manage my pods uh claims and then but then i mounted them as read write once right because i couldn't share them then if a pod failed i had to reclaim the pvc and connect it to multiple environments because i wanted block storage because it's fast and then nfs environments was were used as read write many uh to be a shared environment but low performance so by being able to say hey we now have an environment that is fully covered kubernetes integrated and it provides all the performance aspects that you need you don't need to choose just run your fleet of pods your cluster of pods read write many you don't need to to manage old reclamations just to create new pods you get the best of all words ease of use and also uh the performance additionally because there's always more right we now see more and more uh cloud environments right so weka also has the ability and i didn't focus on that but it's it's really uh amazing it has the ability to move data around between different environments so imagine and we see that imagine on-prem environments that are now using weka you're in the terabytes or petabyte scale obviously you can copy and rsync and rclone right but nobody really does it because it doesn't work for these capacities so weka has the ability to say hey i can move data around between different environments so create more copies or simply burst so we see customers that are working on-prem throwing data to the cloud we see customers working on the cloud and and then we actually now see customers starting to bridge the gap because cloud bursting is again is a very nice buzzword we see some customers exploring it we don't really see customers doing 
it at the moment but the customers that are exploring it are exploring uh throwing the compute out to the cloud using the kubernetes cluster and throwing the data to the cloud using the weka cluster so there's and and one last thing because that's another interesting use case weka can be run converged on the same kubernetes cluster so there is no need to have even it's so in essence it's a zero footprint storage you don't need to even add more servers so i don't need to buy a box and connect my cluster to that box i just run it on the same servers and if i want more compute nodes i add more nodes and i'll add more storage by doing that so it's that simple so i was just looking at the website and see that waka was just this was just announced last week a visionary in the gartner mq for what's the mq4 distributed file systems and object storage talk to me talk talk to us about that what does that distinction mean for the company and how does the voice of the customer validate that great so actually this is interesting this is a culmination of a lot of hard work that all of the team did writing the product and all of the customers by adopting the product because it was in order to get to that i know we don't know if anybody is familiar with the criteria but you need to have a large footprint a distinguished footprint worldwide so we worked hard on getting that and we see that and we see that in multiple markets by the way financials we see a massive amounts of aiml projects containerized kubernetes orchestrated so getting to that was a huge achievement you could see other storage devices not being there because not not every storage appliance is is a parallel file system usually i think uh when you look at parallel file systems you you you attribute complexity and i need an army of people to manage it and to tweak it so that's again one of the things that we did and that's why we really think that we're a cool vendor in that magikarp magic quarter right because you 
it's that simple to manage uh you don't have any uh find you you cannot you don't need to find unity in like a bazillion different ways just install it we work it works you map it to your containers simple so we're here at kubecon a lot of talk about cloud native a lot of projects a lot of integration a lot of community development you've described installing weka into a kubernetes cluster where you know are there are there integrations that are being worked on what are the is there connective tissue between essentially this parallel file system that's spanning you say you have five nodes you have weka running on those five nodes you have a kubernetes cluster spanning those five nodes um what kinds of things are happening in the community maybe that you're supporting or that you're participating in to connect those together so right now you you don't uh we only have the csi plugin we didn't invest in in anything more actually one of the reasons that i'm here is to get to know the community a bit more and to get more involved and we're definitely looking into how more can we help customers utilize kubernetes and and enjoy the worker storage uh do we need to do some sort of integration i'm actually exploring that and i think you'll see some well so we got interesting so we got you at a good time now exactly yeah because you can say with with it with an api approach um you have the you have the connectivity and you're providing this storage layer that provides all the attributes that you described but you are here live living proof green wristband and all showing that the future will be even more interesting voting on the future yeah and and seeing how we can help the community and what can we do together and actually i'm really impressed by the the conference it's been amazing we've been talking about that all week being impressed with the fact that there's we've been hearing between 2 700 and 3 100 people here which is amazing in person of course there's many more 
that are participating virtually but they've done a great job of these green wristbands by the way we've talked about these a minute ago um this you have a red yellow or green option to to tell others are you comfortable with contact handshakes hugs etc i love that the fact that i am i'm sandwiched by two grains but they've done a great job of making this safe and i hope that this is a message this is a big community um the cncf has 138 000 contributors i hope this is a message that shows that you can do these events we can get together in person again because there's nothing like the hallway track you can't replicate that on video exactly grabbing people in the hallway in the hotel in the lobby talking about their problems seeing what they need what we do it's amazing right so so give us a little bit in our last few minutes here about the go to market what is the the gtm strategy for weka so that's an interesting question so being fully software defined when we started we we thought do we do another me too another storage appliance even though we're storage defined could we just go to market with our own boxes and we actually uh decided to go differently because our market was actually the storage vendors sorry the server vendors we actually decided to go and enable other bare metal environments manufacturers to now create storage solutions so we now have a great partnership with hpe with supermicro with hitachi uh and and more as well with aws because again being software defined we we can run on the cloud we do have massive projects on the clouds some of the we're all familiar with some but i can't mention um so and we we chose that as our go to market because we we are fully software defined we don't need any specific hardware for we just need a server with nvmes or an instance with nvmes and that's it there's no usually when i talk about what we need is as a product i also talk about the list of what we don't need is longer we don't need j bar j buffs servers 
ups. We don't need all of those RAID arrays, we just need the servers, so a lot of the server vendors actually identified that, and then when we approach them and say, hey, this is what we can do on your bare metal, in your environment, is that valuable? Of course. So that's mostly our go-to-market. Another thing is that we chose to focus on the markets that we're going after. We're not another me-too, we're not another storage for your home directories, even though obviously we are used that way in some cases by customers, but we're the storage where, if you could shrink the wall-clock time of your pipeline from two weeks to four hours, and we did, that's 84 times faster, if you could do that, how valuable is that? That's what we do, and we see that more and more in modern enterprises. So when we started doing that, people were saying, hey, so your go-to-market is only HPC? No. If you look at AI, ML, life sciences, financials, and the list goes on, modern environments are now becoming what HPC was a few years ago, so there are massive amounts of data. So our go-to-market is to be very targeted toward these markets, and then they also push us to other sides of the house: hey, I have a Weka, so I might put my VMware on it, I might do my distributed compilation on it. It's growing organically, so that's fun to see. >> Awesome, tremendous amount of growth. I love that you talked about it very clearly: simplicity, speed, and scale. I think you did a great job of articulating why Weka is not a me-too. Last question: are there any upcoming webinars or events or announcements that folks can go to to learn more about Weka? >> Great question. I didn't come with my marketing hat, but we constantly have events, and what we usually do is talk about the markets that we go after, so for example, a while ago we were at Bio-IT, so we published some life science articles. I need to see what's in the pipeline, and I'll definitely share it with you. >> Well, I know you guys are going to be at re:Invent. >> We do, so hopefully we'll see you at re:Invent. We're at Supercomputing as well, if you'll be there. >> Fantastic, I see that on your website there. I don't think we're there, but we will see you. >> We're a strong believer in these conferences, in these communities, in being on the ground talking with people. Obviously, if you can't do it, we'll do it over Zoom, but this is priceless. >> Yeah, it is, there's nothing like it. Shimon, it's been great to have you on the program. Thank you so much for giving us an update on Weka, sharing what you guys are doing, how you're helping Kubernetes users, and what differentiates the technology. We appreciate all your insights, and your energy too. >> No, it's not me, it's the product. >> Ah, I love it. For Dave Nicholson, I'm Lisa Martin, coming to you live from Los Angeles. This is KubeCon + CloudNativeCon North America '21 coverage on theCUBE, wrapping up three days of wall-to-wall coverage. We thank you for watching, and we hope you stay well.
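The "84 times faster" figure quoted above follows directly from the times given in the interview; a quick check, using only numbers from the transcript:

```python
# Verifying the speed-up quoted above: a pipeline that took two weeks
# of wall-clock time now finishes in four hours.
before_hours = 14 * 24        # two weeks, expressed in hours
after_hours = 4
speedup = before_hours / after_hours
print(speedup)  # 84.0, matching the "84 times faster" figure
```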
Liran Zvibel & Andy Watson, WekaIO | CUBE Conversation, December 2018
(cheery music) >> Hi, I'm Peter Burris, and welcome to another CUBE Conversation from our studios in Palo Alto, California. Today we're going to be talking about some new advances in how data gets processed. Now, it may not sound exciting, but when you hear about some of the performance capabilities and how it liberates new classes of applications, this is important stuff. Now, to have that conversation we've got Weka.IO here with us: specifically, Liran Zvibel, the CEO of Weka.IO, joined by Andy Watson, the CTO of Weka.IO. Liran, Andy, welcome to theCUBE. >> Thanks. >> Thank you very much for having us. >> So Liran, you've been here before, Andy, you're a newbie, so Liran, let's start with you. Give us the Weka.IO update, what's going on with the company? >> So '18 has been a grand year for us, we've had great market adoption. We spent last year proving our technology, and this year we have accelerated our commercial successes: we've expanded to Europe, we've hired quite a lot of sales in the US, and we're seeing a lot of successes around machine learning, deep learning, and life sciences data processing. >> And you've hired a CTO. >> And we've hired the CTO, Andy Watson, which I am excited about. >> So Andy, what's your pedigree, what's your background? >> Well, I've been around a while, got the scars on my back to show it, mostly in storage, dating back even to Auspex before NetApp, but probably best known for the years I spent at NetApp. I was there from '95 through 2007, kind of the glory years; I was the second CTO at NetApp, as a matter of fact, and that was a pretty exciting time. We changed the way the world viewed shared storage, I think it's fair to say, at NetApp, and it feels the same here at Weka.IO, and that's one of the reasons I'm so excited to have joined this company, because it's the same kind of experience of having something that is so revolutionary that quite often, whether it's a customer or an analyst like yourself, people are a little skeptical. They find it hard to believe that we can do the things that we do, so it's gratifying when we have the data to back it up, and it's really a lot of fun to see how customers react when they actually have it in their environment and it changes their workflow and their life experience. >> Well, I will admit, I might be undermining my credibility here, but I will admit that back in the mid-90s I was a little bit skeptical about NetApp, but I'm considerably less skeptical about Weka.IO, just based on the conversations we've had. But let's turn to that, because there are classes of applications that are highly dependent on very large numbers of small files being able to be moved very, very rapidly, like machine learning. So you mentioned machine learning, Liran; talk a little bit about some of the market success that you're having, some of those applications' successes. >> Right, so machine learning actually works extremely well for us for two reasons.
For one big reason: machine learning is being performed by GPU servers, servers with several GPU offload engines in them, and what we see with this kind of server is that a single GPU server replaces ten or tens of CPU-based servers, so you actually need the IO performance to be ten or tens of times what the CPU servers needed. So we came up with a way of providing significantly higher IO, two orders of magnitude higher, to a single client on the one hand, and on the other hand we have solved the metadata side of data performance, so we can have directories with billions of files, and we can have a whole file system with trillions of files. When we look at the autonomous driving problem, for example: if you look at the high-end car makers, they have eight cameras around the cars. These cameras capture at small resolution, because you don't need very high resolution to recognize a lane, or a cat, or a pedestrian, but they capture at 60 frames per second, so in 30 minutes you get about 100k files per camera, roughly as many as traditional filers could put in a directory. But if you'd like to have your cars running in the Bay Area, you'd like to have all the data from the Bay Area in a single directory, and then you would need directories of billions of files, which we provide. And what we have heard from some of our customers that have had great success with our platform is that not only do they get hundreds of gigabytes per second of small-file read performance, they tell us that they have taken their standard time to epoch from about two weeks, before they switched to us, down to four hours.
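The directory-size arithmetic in that example can be sanity-checked; a small sketch, reading the ~100k figure as per camera, which is an assumption on my part since the transcript is ambiguous on that point:

```python
# File counts for the autonomous-driving example above:
# eight cameras capturing frames at 60 fps for a 30-minute drive.
cameras = 8
fps = 60
minutes = 30

frames_per_camera = fps * 60 * minutes      # 108,000, roughly the "100k files"
total_frames = frames_per_camera * cameras  # 864,000 files for one drive
print(frames_per_camera, total_frames)
```

Either way, a fleet of cars each producing hundreds of thousands of small files per half hour lands in billions-of-files territory very quickly.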
>> Now let's explore that, because one of the key reasons there is the scalability of the number of files you can handle. In other words, instead of having to run against a limit on the number of files that they can typically run through the system to saturate these GPUs, based on some other storage or file technology, they now don't have to stop and set up the job again and run it over and over; they can run the whole job against the entire expansive set of files, and that's crucial to speeding up the delivery of the outcome, right? >> Definitely. What these customers used to do before us is local caching, because NFS was not fast enough for them, so they would copy the data locally and then run over it on the local file system, because that has been the pinnacle of performance in recent years. We are the only storage currently, and I think we'll actually be the first of a wave of storage solutions, where a shared platform built for NVMe is actually faster than a local file system, so we let them go through any file, they don't have to pick up front which files go to which server, and we are even faster than the traditional caching solutions. >> And imagine having to collect the data and copy it to the local server, the application server, and do that again and again and again for a whole server farm, right? So it's bad enough to even do it once, to do it many times, and then to do it over and over and over again, it's a huge amount of work. >> And a lot of time? >> And a lot of time, and cumulatively that burden is going to slow you down, so that makes a big, big difference. And secondly, as Liran was explaining, if you put 100,000 files in a directory of other file systems, that is stressful.
You want to put more than 100,000 files in a directory of other file systems? That is a tragedy. We routinely handle millions of files in a directory; it doesn't matter to us at all, because just like we distribute the data, we also distribute the metadata, and that's completely counter to the way the other file systems are designed, because they were all designed in an era where the focus was on the physical geometry of hard disks, and we have been designed for flash storage. >> And the metadata associated with the distribution of that data typically was in one file, in one place, and that was the master serialization problem when you come right down to it. So we've got a lot of ML workloads, very large numbers of files, definitely improved performance because of the parallelism through your file system, in, as I said, the ML world. Let's generalize this. What does this mean overall, you've kind of touched upon it, but what does it mean for the way that customers are going to think about storage architectures in the future, as they are combining ML and related types of workloads with more traditional types of things? What's the impact of this on storage? >> So if you look at how people architect their solutions around storage recently, you have four different kinds of storage systems. If you need the utmost performance, you go to DAS; Fusion-io had a run perfecting DAS, and then the whole industry realized it. >> Direct attached storage. >> Direct attached storage, right, and then the industry realized, hey, it makes so much sense, so they created a standard out of it, created NVMe. But then you're wasting a lot of capacity, you cannot manage it, and you cannot back it up. Then, if you need some way to manage it, you would put your data over SAN; actually, our previous company was XIV Storage, which IBM acquired, and the vast majority of its use cases were actually people buying block and then overlaying a local file system over it, because that gets you so much higher performance, but then you cannot share the data. Now, if you put it on a filer, which is NetApp, or Isilon, or the other solutions, you can share the data, but your performance is limited and your scalability is limited, as Andy just said, and if you had to scale through the roof... >> With a shared storage approach. >> With a shared storage approach you had to go and port your application to an object storage, which is an enormous feat of engineering, and tons of these projects actually failed. We actually bring a new kind of storage, a shared storage as scalable as an object storage but faster than direct attached storage, so looking at the other traditional storage systems of the last 20 or 30 years, we actually have all the advantages people have come to expect from the different categories, but we don't have any of the downsides. >> Now give us some numbers, or do you have any benchmarks that you can talk about that kind of show or verify or validate this kind of vision that you've got, that Weka's delivering on? >> Definitely, the IO500? >> Sure, sure. We recently published our IO500 performance results at the SC18 event in Dallas, and there are two different metrics... >> So fast you can go back in time? >> Yes, exactly. There are two different metrics: one metric is like an aggregate total amount of performance, and it's a much longer list. I think the one that's more interesting is the 10-client version, which we like to focus on, because we believe that the most important area for a customer to focus on is how much IO you can deliver to an individual application server. And so this part of the benchmark is most representative of that, and on that rating we were able to come in second, well, after you filter out the irrelevant results, which, that's a separate process. >> Typical of every benchmark. >> Yes, exactly. Of the relevant, meaningful results, we came in second behind the world's largest and most expensive supercomputer at Oak Ridge, the Summit system. So they have a 40-rack system, and we have a half, or maybe a little bit more than half, of one rack of industry-standard hardware running our software. So compare that; the cost of our hardware footprint and so forth is much less than a million dollars. >> And what was the differential between the two? >> Five percent. >> Five percent? So, okay, sound of jaw dropping. A 40-rack system at Oak Ridge, five percent more performance than you guys running on effectively a half rack of, like, Supermicro or something like that? >> Oh, and it was the first time we ran the benchmark; we were just learning how to run it. Those guys are all experts, they had IBM in there at their elbow helping them with all their tuning and everything; this was literally the first time our engineers ran the benchmark. >> Is a large feature of that the fact that Oak Ridge had to get all that hardware to get the physical IO necessary to run serial jobs, and you guys can just do this in parallel on a relatively standard IO subsystem, an NVMe subsystem? >> Because beyond that, you have to learn how to use all those resources, right? All the tuning, all the expertise; one of the things people say is you need a PhD to administer one of those systems, and they're not far off, because it's true that it takes a lot of expertise. Our systems are dirt simple. >> Well, you've got to move the parallelism somewhere, and either you create it yourself, like you do at Oak Ridge, or you do it using your guys' stuff, through a file system. >> Exactly, and what we are showing is that we have tremendously higher IO density. Instead of using a local file system, most of which were created in the 90s, in the serial way of thinking, optimizing over hard drives, you can now say: hey, NVMe devices, SSDs, are beasts at running 4K IOs, so if you solve the networking problem, if the network is not the bottleneck anymore, and you just run all your IOs as a highly parallelized workload of 4K IOs, you actually get much higher performance than what was, up until we came along, the pinnacle of performance, which is a local file system over a local device. >> Well, so NFS has an effective throughput limitation of somewhere around a gigabyte per second, so if you've got a bunch of GPUs that are each wanting four, five, ten gigabytes per second of data coming in, you're not saturating them out of an effective one-gigabyte throughput rate. It's almost like you've got the New York City waterworks coming in to some of these big file systems, and you've got a little faucet that's actually spitting the data out into the GPUs; have I got that right? >> Good analogy. If you are creating a data lake and then you're going to sip at it with some tiny little straw, it doesn't matter how much data you have, you can't really leverage the value of all that data that you've accumulated if you're feeding it into your compute farm, GPU or not, because if you're feeding it into that farm slowly, then you'll never get to it all, right? And meanwhile more data's coming in every day, at a faster rate. It's an impossible situation, so the only solution really is to increase the rate at which you access the data, and that's what we do. >> So I could see how you're making, or would make, the IO bandwidth junkies at Oak Ridge really happy, but the other thing that at least I find interesting about Weka.IO, as you just talked about, is that you've come up with an approach that's specifically built for SSD. You've moved the parallelism into the file system, as opposed to having it be somewhere else, which is natural, because SSD is not built to persist data, it's built to deliver data, and that suggests, as you said earlier, that we're looking at a new way of thinking about storage as a consequence of technologies like Weka, technologies like NVMe. Now Andy, you came from NetApp, and I remember what NetApp did to the industry when it started talking about the advantages of sharing storage. Are we looking at something similar happening here with SSD and NVMe and Weka? >> Indeed, I think that's the whole point; it's one of the reasons I'm so excited about it. It's not only because we have this technology that opens up this opportunity, this potential being realized. I think the other thing is, there's a lot of features, a lot of meaningful software, that needs to be written around this architectural capability, and the team that I joined, their background, coming from having created XIV before, and the almost amazing way they all think together and recognize the market, and the way they interact with customers, allows the organization to address customer requirements realistically. So instead of just doing things that we want to do because they seem elegant, or because the technology sparkles in some interesting way, this company, and it reminds me of NetApp in the early days, and it was a driver of NetApp's big success, this company is very customer-focused, very customer-driven. So when customers tell us what they're trying to do, we want to know more. Tell us in detail how you're trying to get there. What are your requirements? Because if we understand better, then we can engineer what we're doing to meet you there, because we have the fundamental building blocks. Those are mostly done; now what we're trying to do is add the pieces that allow you to implement it into your workflow, into your data center, or into your strategy for leveraging the cloud. >> So Liran, when you're here in 2019 and we're having a similar conversation with this customer focus, you've got a value proposition for the IO bandwidth junkies, you can give more, but what's next in your sights? Are you going to show, for example, how you can get higher performance with less hardware? >> So we are already showing how you can get higher performance with less hardware, and I think as we go forward we're going to have more customers embracing us for more workloads. What we see already is they get us in for either the high end of their life sciences or their machine learning, and then people working around these people realize, hey, I could get some faster speed as well, and then we start expanding within these customers, and we get to see more and more workloads where people like us, and we can start telling stories about them. The other thing that comes naturally to us: we run natively in the cloud, and we actually let you move your workloads seamlessly between your on-premises environment and the cloud, and we are seeing tremendous interest in moving to the cloud today, but not a lot of organizations already do it. I think '19 and forward, we are going to see more and more enterprises seriously considering moving to the cloud, because we have almost 100% of our customers POCing cloudbursting, but not a lot of them using it. I think as time passes, all of them that have seen it working, when they did the initial test, will start leveraging this and getting the elasticity out of the cloud, because this is what you should get out of the cloud, so this is one avenue of expansion for us. We are going to spend more resources in Europe, where we have recently started building the team, and later in the year also JAPAC. >> Gentlemen, thanks very much for coming on theCUBE and talking to us about some new advances in file systems that are leading to greater performance, less specialized hardware, and enabling new classes of applications. Liran Zvibel is the CEO of Weka.IO, Andy Watson is the CTO of Weka.IO; thanks for being on theCUBE. >> Thank you very much. >> Yeah, thanks a lot. >> And once again, I'm Peter Burris, and thanks very much for participating in this CUBE Conversation. Until next time. (cheery music)
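The "waterworks and faucet" exchange above reduces to simple arithmetic; a hedged sketch, where the per-server demand and NFS figure come from the conversation but the farm size is a made-up illustration:

```python
# Aggregate bandwidth demand of a GPU farm versus one NFS-class pipe.
nfs_gb_per_s = 1.0              # effective NFS throughput quoted above
per_gpu_server_gb_per_s = 10.0  # high end of the "four, five, 10 gigabytes" range
servers = 16                    # hypothetical farm size, not from the transcript

demand = per_gpu_server_gb_per_s * servers  # aggregate GB/s the farm wants
pipes_needed = demand / nfs_gb_per_s        # NFS-class pipes to keep it fed
print(demand, pipes_needed)
```

Even a modest farm needs two orders of magnitude more throughput than one such pipe delivers, which is the case for moving the parallelism into the file system.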
Liran Zvibel, WekaIO | CUBEConversation, April 2018
[Music] hi I'm Stu minimun and this is the cube conversation in Silicon angles Palo Alto office happy to welcome back to the program Lear on survival who is the co-founder and CEO of Weka IO thanks so much for joining me thank you for having me over alright so on our research side you know we've really been saying that data is at the center of everything it's in the cloud it's in the network and of course in the storage industry data has always been there but I think especially for customers it's been more front and center well you know why is data becoming more important it's not data growth and some of the other things that we've talked about for decades but you know how was it changing what are you hearing from customers today so I think the main difference is that organization they're starting to understand that the more data they have the better service they're going to provide to their customers and there will be an overall better company than their competitors so about 10 years ago we started hearing about big data and other ways that in a more simpler form just went over sieved through a lot of data and tried to get some sort of high-level meaning out of it last few years people are actually employing deep learning machine learning technique to their vast amounts of data and they're getting much higher level of intelligence out of their huge capacities of data and actually with deep learning the more data you have the better outputs you get before we go into kind of the m/l and the deep learning piece just did kind of a focus on data itself there's some that say you know digital transformation is it's this buzzword when I talk to users absolutely they're going through transformations you know we're saying everybody's becoming a software company but how does data specifically help them with that you know what what what is your viewpoint there and what are you hearing from your customers so if you look at it from the consumer perspective so people now keep 
track record of their lives at much higher resolution than the and I'm not talking about the images rigid listen I'm talking about the vast amount of data that they store so if I look at how many pictures I have of myself as a kid and how many pictures I have of my kids like you could fit all of my pictures into albums I can probably fit my my kids like a week's worth of time into albums so people keep a lot more data as consumers and then organization keep a lot more data of their customers in order to provide better service and better overall product you know the industry as an industry we saw a real mixed bag when it came to Big Data when I was saying great I have lots more volume of data that doesn't necessarily mean that I got more value out of it so what are the one of the trends that you're seeing why is you know where things like you deep learning machine learning AI you know is it going to be different or is this just kind of the next iteration of well we're trying and maybe we didn't hit as well with big data let's see if this does it does better so I think that Big Data had its glory days and now where they're coming to to the end of that crescendo because people realized that what they got was sort of aggregate of things that they couldn't make too much sense of and then people really understand that for you to make better use of your data you need to employ way similarly to how the brain works so look a lot of data and then you have to have some sense out of their data and once you've made some sense out of that data we can now get computers to go through way more data and make a similar amount of sense out of that and actually get much much better results so just instead of going finding anecdotes or this thing that you were able to do with big date you're actually now are able to generate intelligent systems you know what one of the other things we saw is it used to be okay I have this this huge back catalogue or I'm going to survey all the data I've 
collected today you know it's much more you know real times a word that's been thrown around for many years you know whether it do you say live data or you know if you're at sensors where I need to have something where I can you know train models react immediately that that kind of immediacy is much more important you know that's what I'm assuming that's something that you're seeing from customers to indeed so what we say is that customers end up collecting vast amounts of data and then they train their models on these kind of data and then they're pushing these intelligent models to the edges and then you're gonna have edges running inference and that could be a straight camera it could be a camera in the store or it could be your car and then usually you run these inference at the endpoints using all the things you've trained the models back then and you will still keep the data push it back and then you should you still run inference at the data center sort of doing QA and now the edges also know to mark where they couldn't make sense of what they saw so the the data center systems know what should we look at first how we make our models smarter for the next iteration because these are closed-loop systems you train them you push through the edges the edges tell you how well you think they think they understood your train again and things improve we're now at the infancy of a lot of these loops but I think the following probably two to five years will take us through a very very fascinating revolution where systems all around us will become way way more intelligent yeah and there's interesting architectural discussions going on if you talk about this edge environment if I'm an autonomous vehicle now from an airplane of course I need to react there I can't go back to the cloud but you know what what happens in the cloud versus what happens at the edge where do where does Weka fit into that that whole discussion so where we currently are running we're running at 
the data centers so at Weka we created the fastest file system that's perfect for AI and machine learning and training and we make sure that your GPU field servers that are very expensive never sit idle the second component of our system is tearing two very effective object storages that can run into exabytes so we have the system that makes sure you can have as many GPU servers churning all the time and getting the results getting the new models while having the ability to read any form of data that was collected in the several years really through hundreds of petabytes of data sets and now we have customers talking about exabytes of data sets representing a single application not throughout the organization just for that training application yeah so a I in ml you know Keita is that that the killer use case for your customers today so that's one killer application just because of the vast amount of data and the high-performance nature of the clients we actually show clients that runwa kayo finished training sessions ten times faster than how they would use traditional NFS based solutions but just based on the different way we handle data another very strong application for us is around Life Sciences and genomics where we show that we're the only storage that let these processes remain CPU bound so any other storage at some points becomes IO bound so you couldn't paralyzed paralyzed the processing anymore we actually doesn't matter how many servers you run as clients you double the amount of clients you either get the twice the result the same amount of time or you get the same result it's half the time and with genomics nowadays there are applications that are life-saving so hospitals run these things and they need results as fast as they can so faster storage means better healthcare yeah without getting too deep in it because you know the storage industry has lots of wonkiness and it's there's so many pieces there but you know I hear life scientists I think 
object storage. I hear NVMe, I think block storage. You're file storage. When it comes down to it, why is that the right architecture for today, and what advantages does that give you? >> So we are actually the only company that went through the hassles and the hurdles of utilizing NVMe and NVMe over Fabrics for a parallel file system. All other solutions went the easier route and created a block device. The reason we've created a file system is that this is what computers understand, this is what the operating system understands. When you go to university and learn computer science, they teach you how to write programs, and those need a file system. Now, if you want to run your program over two servers or ten servers, what you need is a shared file system. Up until we came along, the gold standard was using NFS for sharing files across servers. But NFS was actually created in the 80s, when Ethernet ran at 10 megabit. Currently most of our customers already run 100 gigabit, which is four orders of magnitude faster. So they're seeing that they cannot run a network protocol that was designed for four orders of magnitude less speed with the current demanding workloads. This explains why we had to go and pick a totally different way of pushing data to the clients. With regard to object storages: object storages are great because they allow customers to aggregate hard drives into inexpensive, large-capacity solutions. The problem with object storages is that the programming model is different from the standard file system that computers understand, in two ways. A, when you write something, you don't know when it's actually going to get stored. It's called eventual consistency, and it's very difficult for mortal programmers to write a system that is sound, that is always correct, when you're writing to eventually consistent storage. The second thing is that objects cannot change. You cannot modify them; you create them, you get them, or you can delete them. They can
have versions, but this is also much different from how the average programmer is used to writing programs. So we are actually tying together the highest-performance NVMe over Fabrics at the front tier and these object storages, which are extremely efficient but very difficult to work with, at the back-end tier, into a single solution that has the highest performance and the best economics. >> Right. Liran, I want to give you the last word. Give us a little bit of a long view. You talked about where we've gone, how parallel architecture helps now that we're at 100 gig. Look out five years in the future. What's going to happen? Blockchain takes over the world, cloud dominates everything? From an infrastructure, application, and storage world, what does Weka think things will look like? >> So one very strong trend that we are seeing is around encryption. It doesn't matter what industry: storing things in clear text for many organizations just stops making sense, and people will demand more and more of their data to be encrypted, with tighter control around everything. That's one very strong trend that we're seeing. Another very strong trend is that enterprises would like to leverage the public cloud, but in an efficient way. If you were to run the economics, moving all your applications to the public cloud may end up being more expensive than running everything on prem, and I think a lot of organizations have realized that. The trick is going to be that each organization will have to find a balance: which services run on prem, and these are going to be the services that run around the clock, and which services have more of a bursty nature. Then organizations will learn how to leverage the public cloud for its elasticity, because if you're just running on the cloud, you're not leveraging the elasticity, you're doing it wrong. And we're actually helping a lot of our customers do it with our hybrid cloud ability to
have local workloads and cloud workloads, and getting these whole workflows to actually run is a fascinating process. >> Liran, thank you so much for joining us. Great to hear the update, not only on Weka but really on where the industry is going. Dynamic times here in the industry, with data at the center of it all. theCUBE is looking to cover it at all the locations, including here in our lovely Palo Alto studio. I'm Stu Miniman. Thanks so much for watching theCUBE. >> Thank you very much. [Music]
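The closed loop described in the interview above (train centrally, push models to the edges, have the edges flag what they couldn't recognize, then retrain on exactly those cases) can be sketched as a toy loop. This is purely illustrative, not Weka's or any vendor's API, and all function names are hypothetical:

```python
def train(labels):
    # Toy stand-in for a central training job: the "model" is just
    # the set of labels it has been trained to recognize.
    return set(labels)

def edge_infer(model, observation):
    # Edge inference: return the label if recognized, otherwise None,
    # which marks the observation for review back at the data center.
    return observation if observation in model else None

def closed_loop(initial_labels, observations, rounds=2):
    model = train(initial_labels)
    for _ in range(rounds):
        # The edges mark what they couldn't make sense of...
        flagged = [o for o in observations if edge_infer(model, o) is None]
        # ...and the data center looks at those first and retrains.
        model = train(model | set(flagged))
    return model
```

Starting from `{"car"}` and observing `["car", "bicycle"]`, one iteration extends the model to cover `"bicycle"`; each pass narrows what the edges fail on, which is the improvement cycle described here.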
**Summary and Sentiment Analysis are not shown because of an improper transcript.**
ENTITIES
Entity | Category | Confidence |
---|---|---|
Liran Zvibel | PERSON | 0.99+ |
100 gigabytes | QUANTITY | 0.99+ |
April 2018 | DATE | 0.99+ |
10 megabit | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
Weka IO | ORGANIZATION | 0.99+ |
Weka | ORGANIZATION | 0.99+ |
twice | QUANTITY | 0.99+ |
Palo Alto | LOCATION | 0.99+ |
second thing | QUANTITY | 0.99+ |
five years | QUANTITY | 0.98+ |
second component | QUANTITY | 0.98+ |
each organisation | QUANTITY | 0.98+ |
first year | QUANTITY | 0.98+ |
today | DATE | 0.97+ |
Stu minimun | PERSON | 0.97+ |
two ways | QUANTITY | 0.97+ |
Prem | ORGANIZATION | 0.96+ |
ten times | QUANTITY | 0.95+ |
about 10 years ago | DATE | 0.94+ |
one | QUANTITY | 0.94+ |
Stu minimun | PERSON | 0.94+ |
last few years | DATE | 0.93+ |
hundreds of petabytes of data sets | QUANTITY | 0.93+ |
first | QUANTITY | 0.92+ |
several years | QUANTITY | 0.92+ |
80s | DATE | 0.91+ |
single application | QUANTITY | 0.9+ |
decades | QUANTITY | 0.9+ |
a lot of data | QUANTITY | 0.89+ |
Silicon angles | LOCATION | 0.89+ |
half the time | QUANTITY | 0.87+ |
ten servers | QUANTITY | 0.87+ |
two very effective object | QUANTITY | 0.87+ |
single solution | QUANTITY | 0.86+ |
four orders | QUANTITY | 0.85+ |
four orders | QUANTITY | 0.85+ |
a week | QUANTITY | 0.84+ |
Palo Alto Studio | ORGANIZATION | 0.8+ |
lot more data | QUANTITY | 0.78+ |
WekaIO | ORGANIZATION | 0.78+ |
100 Gig | QUANTITY | 0.74+ |
Lear on | TITLE | 0.72+ |
double | QUANTITY | 0.72+ |
many pieces | QUANTITY | 0.65+ |
Keita | ORGANIZATION | 0.63+ |
lot of data | QUANTITY | 0.6+ |
lot | QUANTITY | 0.58+ |
lots | QUANTITY | 0.58+ |
application | QUANTITY | 0.56+ |
vast amounts of data | QUANTITY | 0.54+ |
exabytes | QUANTITY | 0.53+ |
trend | QUANTITY | 0.52+ |
CEO | PERSON | 0.5+ |
Big Data | ORGANIZATION | 0.45+ |
Liran Zvibel, WekaIO & Maor Ben Dayan, WekaIO | AWS re:Invent
>> Announcer: Live from Las Vegas, it's theCUBE, covering AWS re:Invent 2017, presented by AWS, Intel, and our ecosystem of partners. >> And we're back, here on the show floor in the exhibit hall at Sands Expo, live at re:Invent for AWS along with Justin Warren. I'm John Walls. We're joined by a couple of executives now from Weka IO, to my immediate right is Liran Zvibel, who is the co-founder and CEO, and then Maor Ben Dayan, who's the chief architect at Weka IO. Gentlemen, thanks for being with us. >> Thanks for having us. >> Appreciate you being here on theCube. First off, tell the viewers a little bit about your company, and I think a little about the unusual origination of the name. You were sharing that with me as well. So let's start with that, and then tell us a little bit more about what you do. >> Alright, so the name is Weka IO. Weka is actually a Greek unit, like mega and tera and peta, so it's actually a trillion exabytes, ten to the power of thirty. It's a huge capacity, so it works well for a storage company. Hopefully we will end up storing wekabytes. It will take some time. >> I think a little bit of time to get there. >> A little bit. >> We're working on it. >> One customer at a time. >> Give a little more about what you do, in terms of your relationship with AWS. >> Okay, so at Weka IO we create the highest performance file system, either on prem or in the cloud. We have a parallel file system over NVMe. Previous generation file systems did parallel work over hard drives, but those are 20-year-old technologies. We're the first file system to bring new parallel algorithms to NVMe, so we get you the lowest latency and highest throughput, either on prem or in the cloud. We are perfect for machine learning and life sciences applications. Also, you've mentioned media and entertainment earlier.
We can run on your hardware on prem, we can run on our instances, I3 instances, in AWS, and we can also take snapshots at native performance, so they don't take away performance, and we have the ability to take these snapshots and push them to S3-based object storage. This allows you to have DR or backup functionality if you're on prem, but if your object storage is actually AWS S3, it also lets you do cloud bursting. You can take your on-prem cluster, connect it to AWS S3, take a snapshot, push it to S3, and now, if you have a huge amount of computation that you need to do, and your local GPU servers don't have enough capacity or you just want to get the results faster, you would build a big enough cluster on AWS, get the results, and bring them back. >> You were explaining before that it's a big challenge to be able to do something that can do both low latency with millions and millions of small files but also high throughput for some large files. Media and entertainment tends to be very few but very, very large files, while with something like genomics research you'll have millions and millions of files but they're all quite tiny. That's quite hard, but you were saying it's actually easier to do the high throughput than it is the low latency. Maybe explain some of that. >> You want to take it? >> Sure. On the one hand, streaming lots of data is easy when you distribute the data over many servers or instances in AWS, like Lustre does or other solutions, but then doing small files becomes really hard. Now, this is where Weka innovated and really solved this bottleneck, so it really frees you to do whatever you want with the storage system without hitting any bottlenecks. This is the secret sauce of Weka. >> Right, and you were mentioning before, it's a file system, so it's NFS and SMB access to this data, but you're also saying that you can export to S3.
>> Actually we have NFS, we have SMB, but we also have native POSIX, so any application that you could up until now only run on a local file system such as EXT4 or ZFS, you can actually run in a shared manner. Anything that's written in the man pages we do, so it just works: locking, everything. That's one thing we're showing for life sciences, genomic workflows: we can scale their workflows without losing any performance. So if one server doing one kind of transformation takes time x, if you use 10 servers, it still takes time x to get 10x the results. If you have 100 servers, you're going to get 100x the results. What customers see with other storage solutions, either on prem or in the cloud, is that they're adding servers but they're getting way less results. We're giving customers five to 20 times more results than what they got on what they thought were high performance file systems prior to the Weka IO solution. >> Can you give me a real life example of this? When you talk about life sciences, you talk about genomic research, and we talk about the itty bitty files and millions of samples and whatever, but exactly, translate it for me. When it comes down to a real job task, a real chore, what exactly are you bringing to the table that will enable whatever research is being done or whatever examination's being done? >> I'll give you a general example, not specifically out of life sciences. We were doing a POC at a very large customer last week, and we were compared head to head with a best-of-breed, all-flash file system. They did a simple test. They created a large file system on both storage solutions, filled with many, many millions of small files, maybe even billions of small files, and they wanted to go through all the files, so they just ran the find command. The leading competitor finished the work in six and a half hours. We finished the same work in just under two hours.
More than a 3x time difference compared to a solution that is currently considered probably the fastest. >> Gold standard, allegedly, right? Allegedly. >> It's a big difference. During the same comparison, that customer just did an ls of a directory with a million files. The other leading solution took 55 seconds, and it took just under 10 seconds for us. >> We just get you the results faster, meaning your compute remains occupied and working. You're working with, let's say, GPU servers that are costly, but usually they are just idling around, waiting for the data to come to them. We just unstarve these GPU servers and let you get what you paid for. >> And particularly with something like the elasticity of AWS, if it takes me only two hours instead of six, that's gonna save me a lot of money, because I don't have to pay for those extra four hours. >> It does, and if you look at the price of the P3 instances, for a reason, those Volta GPUs aren't inexpensive, any second they're not idling around is a second you saved, and you're actually saving a lot of money. So we're showing customers that by deploying Weka IO on AWS and on premises, they're actually saving a lot of money. >> Explain some more about how you're able to bridge between both on-premises and cloud workloads, because I think you mentioned before that you would actually snapshot and then you could send the data as a cloud bursting capability. Is that the primary use case you see customers using, or is it another way of getting your data from your side into the cloud? >> Actually we have a slightly more complex feature, it's called tiering through the object storage.
Now, customers have humongous namespaces, hundreds of petabytes some of them, and it doesn't make sense to keep them all on NVMe flash, it's too expensive. So a big feature that we have is that we let you tier between your flash and object storage, and that lets you manage the economics. We're actually chopping down large files and turning them into many objects, similarly to how a traditional file system treats hard drives. We treat NVMe in a parallel fashion, that's a world first, but we also do all the tricks that a traditional parallel file system does to get good performance out of hard drives, toward the object storage. Now we take that tiering functionality and we couple it with our highest performance snapshotting abilities, so you can take the snapshot and just push it completely into the object storage, in a way that you don't require the original cluster anymore. >> So you've mentioned a few of the areas that are your expertise now and certainly where you're working. What are some other verticals that you're looking at? What are some other areas where you think that you can bring what you're doing, like in the life science space, and provide equal if not superior value? >> Currently. >> Like where are you going? >> Currently we focus on GPU-based execution, because that's where we save the most money for the customers, we give the biggest bang for the buck. Also genomics, because they have severe performance problems around building. We've shown a huge semiconductor company that was trying to build: they were forced to build on a local file system, and it took them 35 minutes. The fastest they tried was actually a battery-backed, RAM-based shared file system using NFS V4, and it took them four hours. It was too long, you only got a couple of compiles a day. It doesn't make sense.
We showed them that they can actually compile in 38 minutes, so a shared file system that is fully coherent, consistent, and protected only took 10% more time. But it didn't really take 10% more time, because what we enabled them to do is now share the build cache, so the next build coming in only took 10 minutes. A full build took slightly longer, but if you take the average, now their build was 13 or 14 minutes. So we've actually showed that a shared file system can save time. Other use cases are media and entertainment, for rendering. You have these use cases, they parallelize amazingly well. You can have tons of render nodes rendering your scenes, and the more render nodes you have, the quicker you can come up with your videos and your movies, or they look nicer. We enable our customers to scale their clusters to sizes they couldn't even imagine prior to us. >> It's impressive, really impressive. Great work, and thanks for sharing it with us here on theCube. First time for each, right? You're now Cube alumni, congratulations. >> Okay, thanks for having us. >> Thank you for being with us here. Again, we're live here at re:Invent, and back with more live coverage here on theCube right after this time out.
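The compile numbers quoted above are easy to sanity-check. Assuming, for illustration, one full 38-minute build followed by seven cache-assisted 10-minute builds (the exact mix of builds is not stated in the interview), the average lands in the quoted 13-14 minute range:

```python
local_build = 35          # minutes on a local file system
shared_full_build = 38    # minutes for a full build on the shared file system
cached_build = 10         # minutes when the shared build cache is warm

# "Only took 10% more time": the overhead of the first shared build.
overhead = shared_full_build / local_build - 1          # about 8.6%

# Hypothetical mix: one full build, then seven cache-assisted builds.
builds = [shared_full_build] + [cached_build] * 7
average = sum(builds) / len(builds)                     # 13.5 minutes
```

The point of the arithmetic is that a slightly slower coherent shared file system wins overall once the build cache is shared across incoming builds.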
SUMMARY :
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Justin Warren | PERSON | 0.99+ |
Liran Zvibel | PERSON | 0.99+ |
John Walls | PERSON | 0.99+ |
10x | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Maor Ben Dayan | PERSON | 0.99+ |
10 servers | QUANTITY | 0.99+ |
10 minutes | QUANTITY | 0.99+ |
six hours | QUANTITY | 0.99+ |
13 | QUANTITY | 0.99+ |
35 minutes | QUANTITY | 0.99+ |
millions | QUANTITY | 0.99+ |
55 seconds | QUANTITY | 0.99+ |
100 servers | QUANTITY | 0.99+ |
four hours | QUANTITY | 0.99+ |
six | QUANTITY | 0.99+ |
five | QUANTITY | 0.99+ |
100x | QUANTITY | 0.99+ |
14 minutes | QUANTITY | 0.99+ |
38 minutes | QUANTITY | 0.99+ |
20 times | QUANTITY | 0.99+ |
last week | DATE | 0.99+ |
Las Vegas | LOCATION | 0.99+ |
One customer | QUANTITY | 0.99+ |
hundreds of petabytes | QUANTITY | 0.99+ |
six and a half hours | QUANTITY | 0.99+ |
first time | QUANTITY | 0.99+ |
Intel | ORGANIZATION | 0.98+ |
Sands Expo | EVENT | 0.98+ |
thirty | QUANTITY | 0.98+ |
a million files | QUANTITY | 0.98+ |
Weka IO | ORGANIZATION | 0.98+ |
one server | QUANTITY | 0.98+ |
under two hours | QUANTITY | 0.98+ |
both | QUANTITY | 0.98+ |
Weka | ORGANIZATION | 0.98+ |
millions of samples | QUANTITY | 0.97+ |
each | QUANTITY | 0.97+ |
under 10 seconds | QUANTITY | 0.97+ |
two hours | QUANTITY | 0.97+ |
first file system | QUANTITY | 0.97+ |
IO | ORGANIZATION | 0.97+ |
billions of small files | QUANTITY | 0.96+ |
First | QUANTITY | 0.96+ |
one | QUANTITY | 0.96+ |
NFS V4 | TITLE | 0.96+ |
re:Invent | EVENT | 0.96+ |
ten | QUANTITY | 0.95+ |
millions of files | QUANTITY | 0.94+ |
AWSS3 | TITLE | 0.94+ |
Cube | ORGANIZATION | 0.94+ |
10% more time | QUANTITY | 0.93+ |
More than 3x time | QUANTITY | 0.91+ |
20 years old | QUANTITY | 0.9+ |
millions of small files | QUANTITY | 0.89+ |
a trillion exobytes | QUANTITY | 0.89+ |
first | QUANTITY | 0.87+ |
one kind | QUANTITY | 0.84+ |
mega | ORGANIZATION | 0.83+ |
re:Invent 2017 | EVENT | 0.81+ |
theCube | ORGANIZATION | 0.81+ |
WekalO | ORGANIZATION | 0.79+ |
AWS | EVENT | 0.78+ |
greek | OTHER | 0.78+ |
millions of | QUANTITY | 0.75+ |
tons | QUANTITY | 0.65+ |
S3 | TITLE | 0.63+ |
second | QUANTITY | 0.62+ |
terra | ORGANIZATION | 0.62+ |
re | EVENT | 0.61+ |
EXT4 | TITLE | 0.57+ |
render | QUANTITY | 0.57+ |
couple | QUANTITY | 0.56+ |
AS3 | TITLE | 0.55+ |
theCube | COMMERCIAL_ITEM | 0.53+ |
Day One Kickoff | PentahoWorld 2017
>> Narrator: Live from Orlando, Florida, it's theCUBE. Covering Pentaho World 2017. Brought to you by Hitachi Vantara. >> We are kicking off day one of Pentaho World. Brought to you, of course, by Hitachi Vantara. I'm your host, Rebecca Knight, along with my co-hosts. We have Dave Vellante and James Kobielus. Guys, I'm thrilled to be here in Orlando, Florida. Kicking off Pentaho World with theCUBE. >> Hey Rebecca, twice in one week. >> I know, this is very exciting, very exciting. So we were just listening to the keynotes. We heard a lot about the big three, the power of the big three. Which is internet of things, predictive analytics, big data. So the question for you both is where is Hitachi Vantara in this marketplace? And are they doing what they need to do to win? >> Well so the first big question everyone is asking is what the heck is Hitachi Vantara? (laughing) What is that? >> Maybe we should have started there. >> We joke, some people say it sounds like a SUV, Japanese company, blah blah blah. When we talked to Brian-- >> Jim: A well engineered SUV. >> So Brian Householder told us, well you know it really is about vantage and vantage points. And when you listen to their angles on insights and data, anywhere and however you want it. So they're trying to give their customers an advantage and a vantage point on data and insights. So that's kind of interesting and cool branding. The second big, I think, point is Hitachi has undergone a massive transformation itself. Certainly Hitachi America, which is really not a brand they use anymore, but Hitachi Data Systems. Brian Householder talked in his keynote, when he came in 14 years ago, Hitachi was 80 percent hardware, and infrastructure, and storage. And they've transformed that. They were about 50/50 last year, in terms of infrastructure versus software and services. But what they've done, in my view, is taken now the next step.
I think Hitachi has said, alright listen, storage is going to the cloud, Dell and EMC are knocking each other's heads off, China is coming in to play. Do we really want to try and dominate that business? Rather, why don't we play from our strengths? Which is devices, internet of things, the industrial internet. So they buy Pentaho two years ago, and we're going to talk more about that, bring in an analytics platform. And this sort of marrying of IT and OT, information technology and operation technology, together to go attack what is a trillion dollar marketplace. >> That's it, so Pentaho was a very strategic acquisition. For Hitachi, of course, Hitachi Data Systems plus Hitachi Insight Group plus Pentaho equals Hitachi Vantara. Pentaho was one of the pioneering vendors more than a decade ago in the whole open source analytics arena. If you cast your mind back to the middle of the millennium's first decade, open source was starting to come into its own. Of course, we already had Linux and so forth, but in terms of the data world, we're talking about the pre-Hadoop era, the pre-Spark era. We're talking about the pre-TensorFlow era. Pentaho, I should say at that time. Which is, by the way, now a product group within Hitachi Vantara. It's not a stand alone company. Pentaho established itself as the spearhead for open-source, predictive analytics, and data mining. They made something called Weka, which is an open-source data mining toolkit that was actually developed initially in New Zealand. With the core of their offering, they in many ways became very much a core player in terms of analytics as a service and so forth, but very much established themselves, Pentaho, as an up and coming solution provider taking a more or less, by the book, open source approach for delivering solutions to market. But they were entering a market that was already fairly mature in terms of data mining. Because you are talking about the mid-2000's.
You already had SAS, and SPSS, and some of the others that had been in that space, and done quite well for a long time. And so cut ahead to the present day. Pentaho had evolved to incorporate some fairly robust data integration, data transformation, all ETL capabilities into their portfolio. They had become a big data player in their own right, with a strong focus on embedded analytics, as the keynoters indicated this morning. There's a certain point where in this decade it became clear that they couldn't go it any further, in terms of differentiating themselves in this space. In a space that is dominated by Hadoop and Spark, and AI things like TensorFlow. Unless they were part of a more diversified solution provider that offered, especially I think the critical thing was the edge orientation of the industrial internet of things. Which is really where many of the opportunities are now for a variety of new markets that are opening up, including autonomous vehicles, which was the focus of here all-- >> Let's clarify some things a little bit. So Pentaho actually started before the whole Hadoop movement. >> Yeah, yeah. >> That's kind of interesting. You know they were a young company when Hadoop just started to take off. And they said alright, we can adopt these techniques and processes as well. So they weren't true legacy, right? >> Jim: No. >> So they were able to ride that sort of modern wave. But essentially they're in the business of data, I call it data management. And maybe that's not the right term. They do ingest, they're doing ETL, transformation anyway. They're embedding, they've got analytics, they're embedding analytics. Like you said, they're building on top of Weka. >> James: When BI first flashed as a hot topic in the market in the mid-2000's, they became a fairly substantial BI player. That actually helped them to grow in terms of revenue and customers. >> So they're one of those companies that touches on a lot of different areas. >> Yes.
>> So who do we sort of compare them to? Obviously, what you think of guys like Informatica. >> Yeah, yeah. >> Who do heavy ETL. >> Yes. You mentioned BI, you mentioned before. Like, guys like SAS. What about Tableau? >> Well, BI would be like, there's Tableau, and QlikView and so forth. But there's also very much-- >> Talend. >> Cognos under IBM. And, of course, there's the BusinessObjects portfolio under SAP. >> David: Right. And Talend would be? >> In fact I think Talend is in many ways the closest analog >> Right. >> to Pentaho in terms of a predominantly open-source, go to market approach, that involves both the robust data integration and cleansing and so forth from the back end. And also, a deep dive of open source analytics on the front end. >> So their differentiation, they sort of claim, is they're sort of end to end integration. >> Jim: Yeah. >> Which is something we've been talking about at Wikibon for a while. And George is doing some work there, you probably are too. It's an age old thing in software. Do you do best-of-breed or do you do sort of an integrated suite? Now the interesting thing about Pentaho is, they don't own their own cloud. Hitachi Vantara doesn't own their own cloud. So they do a lot of, it's an integrated pipeline, but it doesn't include its own database and other tooling. >> Jim: Yeah. >> Right, and so there is an interesting dynamic occurring that we want to talk to Donna Perlik about obviously, is how they position relative to roll your own. And then how they position, sort of, in the cloud world. >> And we should ask also how are they positioning now in the world of deep learning frameworks? I mean they don't provide, near as I know, their own deep learning frameworks to compete with the likes of TensorFlow, or MXNet, or CNTK and so forth. So where are they going in that regard? I'd like to know.
I mean there are some others that are big players in this space, like IBM, who don't offer their own deep learning framework, but support more than one of the existing frameworks in a portfolio that includes much of the other componentry. So in other words, what I'm saying is you don't need to have your own deep learning framework, or even an open-source deep learning code base, to compete in this new marketplace. And perhaps Pentaho, or Hitachi Vantara, roadmapping, maybe they'll take an IBM-like approach. Where they'll bundle support, or incorporate support, for two or more of these third party tools, or open source code bases, into their solution. Weka is not theirs either. It's open source. I mean Weka is an open source tool that they've supported from the get go. And they've done very well by it. >> It's just kind of like early-day machine learning. >> Okay, so we've heard about Hitachi's transformation internally. And then their messaging today was, of course-- >> Exactly, that's where I really wanted to go next. We're talking about it from the product and the technology standpoint. But one of the things we kept hearing about today was this idea of the double bottom line. And this is how Hitachi Vantara is really approaching the marketplace, by really focusing on better business, better outcomes, for their customers. And obviously for Hitachi Vantara, too, but also for bettering society. And that's what we're going to see on theCUBE today. We're going to have a lot of guests who will come on and talk about how they're using Pentaho to solve problems in healthcare data, in keeping kids from dropping out of college, from getting computing and other kinds of internet power to underserved areas. I think that's another really important approach that Hitachi Vantara is taking in its model.
>> The fact that Hitachi Vantara, I know, with the Pentaho solution, has been on the market for so long, and they have such a wide range of reference customers all over the world, in many verticals. >> Rebecca: That's a great point. >> The most vertical. Willing to go on camera and speak at some length about how they're using it inside their business and so forth. Speaks volumes about a solution provider. Meaning, they do good work. They provide good offerings. Their companies have invested a lot of money in them, and are willing to vouch for them. That says a lot. >> Rebecca: Right. >> And so the acquisition was in 2015. I don't believe it was a public number. It's Hitachi Limited. I don't think they had to report it, but the number I heard was about a half a billion. >> Jim: Uh-hm >> Which for a company with the potential of Pentaho, is actually pretty cheap, believe it or not. You see a lot of unicorns, billion dollar plus companies. But the more important thing is it allows Hitachi to further its transformation and really go after this trillion dollar business. Which is really going to be interesting to see how that unfolds. Because while Hitachi has a long-term view, it always takes a long-term view, you still got to make money. It's fuzzy, how you make money in IOT these days. Obviously, you can make money selling devices. >> How do you make money, open source anything? You know, so yeah. >> But they're sort of open source, with a hybrid model, right? >> Yeah. >> And we talked to Brian about this. There's a proprietary component in there so they can make their margin. At Wikibon, we see this three tier model emerging. A data model, where you've got the edge and some analytics, real time analytics at the edge, and maybe persist some of that data, but they're low cost devices. And then there's a sort of aggregation point, or a hub. I think Pentaho today called it a gateway. Maybe it was Brian from Forrester.
A gateway where you're sort of aggregating data, and then ultimately the third tier is the cloud. And that cloud, I think, vectors into two areas. One is on-prem and one is public cloud. What's interesting is Brian from Forrester basically said that puts the nail in the coffin of on-prem analytics and on-prem big data. >> Uh-hm >> I don't buy that. >> I don't buy that either. >> No, I think the cloud is going to go to your data. Wherever the data lives. The cloud model of self-service and agile and elastic is going to go to your data. >> Couple of weeks ago, of course, we at Wikibon did a webinar for our customers all around the notion of a true private cloud. And Dave, of course, and Peter Burris were on it. Explaining that hybrid clouds, of course, public and private play together. But where the cloud experience migrates to where the data is. In other words, that data will be both in public and in private clouds. But you will have the same reliability, high availability, scalability, ease of programming, and so forth, wherever you happen to put your data assets. In other words, many companies we talk to do this. They combine zonal architecture. They'll put some of their resources, like some of their analytics, in the private cloud for good reason. The data needs to stay there for security and so forth. But much in the public cloud where it's way cheaper quite often. Also, they can improve service levels for important things. What I'm getting at is that the whole notion of a true private cloud is critically important, and it's all data-centric. It's all gravitating to where the data is. And really analytics are gravitating to where the data is. And increasingly the data is on the edge itself. It's on those devices where it's being persisted, much of it. Because there's no need to bring much of the raw data to the gateway or to the cloud, if you can do the predominant bulk of the inferencing on that data at edge devices.
And more and more, the inferencing, to drive things like face recognition on your Apple phone, is happening on the edge. Most of the data will live there, and most of the analytics will be developed centrally, trained centrally, and pushed to those edge devices. That's the way it's working. >> Well, it is going to be an exciting conference. I can't wait to hear more from all of our guests, and both of you, Dave Vellante and Jim Kobielus. I'm Rebecca Knight, we'll have more from theCUBE's live coverage of PentahoWorld, brought to you by Hitachi Vantara, just after this.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Hitachi | ORGANIZATION | 0.99+ |
Brian | PERSON | 0.99+ |
George | PERSON | 0.99+ |
Rebecca Knight | PERSON | 0.99+ |
James Kobielus | PERSON | 0.99+ |
Jim Kobielus | PERSON | 0.99+ |
Rebecca | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
Dell | ORGANIZATION | 0.99+ |
Donna Perlik | PERSON | 0.99+ |
Pentaho | ORGANIZATION | 0.99+ |
James | PERSON | 0.99+ |
Jim | PERSON | 0.99+ |
Peter Burris | PERSON | 0.99+ |
2015 | DATE | 0.99+ |
EMC | ORGANIZATION | 0.99+ |
David | PERSON | 0.99+ |
New Zealand | LOCATION | 0.99+ |
Brian Householder | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
80 percent | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
Hitachi Vantara | ORGANIZATION | 0.99+ |
Hitachi Limited | ORGANIZATION | 0.99+ |
last year | DATE | 0.99+ |
Orlando, Florida | LOCATION | 0.99+ |
Onprem | ORGANIZATION | 0.99+ |
today | DATE | 0.99+ |
twice | QUANTITY | 0.99+ |
Apple | ORGANIZATION | 0.99+ |
Hitachi Data Systems | ORGANIZATION | 0.99+ |
Forrester | ORGANIZATION | 0.99+ |
two areas | QUANTITY | 0.99+ |
two years ago | DATE | 0.99+ |
Informatica | ORGANIZATION | 0.99+ |
one week | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
Weka | ORGANIZATION | 0.99+ |
both | QUANTITY | 0.98+ |
One | QUANTITY | 0.98+ |
Tableau | TITLE | 0.98+ |
PentahoWorld | EVENT | 0.98+ |
14 years ago | DATE | 0.98+ |
Hitachi America | ORGANIZATION | 0.98+ |
Wikibon | ORGANIZATION | 0.98+ |
Linux | TITLE | 0.97+ |
about a half a billion | QUANTITY | 0.97+ |
Arik Pelkey, Pentaho - BigData SV 2017 - #BigDataSV - #theCUBE
>> Announcer: Live from Santa Fe, California, it's theCUBE, covering Big Data Silicon Valley 2017. >> Welcome back, everyone. We're here live in Silicon Valley in San Jose for Big Data SV, in conjunction with Strata + Hadoop. Three days of coverage here in Silicon Valley on Big Data. It's our eighth year covering Hadoop and the Hadoop ecosystem, now expanding beyond just Hadoop into AI, machine learning, IoT, and cloud computing; all this compute is really making it happen. I'm John Furrier with my co-host George Gilbert. Our next guest is Arik Pelkey, who is the senior director of product marketing at Pentaho, which we've covered many times, and we covered their event at PentahoWorld. Thanks for joining us. >> Thank you for having me. >> So, in following you guys, I see Pentaho was once an independent company, bought by Hitachi, but still an independent group within Hitachi. >> That's right, very much so. >> Okay, so you guys have some news. Let's just jump into the news. You guys announced some of the machine learning. >> Exactly, yeah. So, Arik Pelkey, Pentaho. We are a data integration and analytics software company. You mentioned you've been doing this for eight years. We have been at Big Data for the past eight years as well. In fact, we were one of the first vendors to support Hadoop back in the day, so we've been along for the journey ever since then. What we're announcing today is really exciting. It's a set of machine learning orchestration capabilities which allows data scientists, data engineers, and data analysts to really streamline their data science processes. Everything from ingesting new data sources, through data preparation, feature engineering, which is where a lot of data scientists spend their time, through tuning their models, which can still be programmed in R, in Weka, in Python, and any other kind of data science tool of choice.
What we do is we help them deploy those models inside of Pentaho, as a step inside of Pentaho, and then we help them update those models as time goes on. So, really, what this is doing is it's streamlining. It's making them more productive so that they can focus their time on things like model building rather than data preparation and feature engineering. >> You know, it's interesting. The market is really active right now around machine learning, and even just last week at Google Next, which is their cloud event, they made the acquisition of Kaggle, which is kind of an open data science community. You mentioned the three categories: data engineer, data scientist, data analyst. Almost a progression, super geek to business facing, and there's different approaches. One of the comments from the CEO of Kaggle on the acquisition, when we wrote it up at SiliconANGLE, was, and I found this fascinating, I want to get your commentary and reaction to it: he said that data science tools are where software tools were generations ago, meaning that all the advances in open source, tooling, and software development are far along, but data science is still at that early stage and is going to get better. So, what's your reaction to that? Because the demand we're seeing is a lot of heavy lifting going on in the data science world, yet there's a lot of runway of more stuff to do. What is that more stuff? >> Right. Yeah, we're seeing the same thing. Last week I was at the Gartner Data and Analytics conference, and that was kind of the take there from one of their lead machine learning analysts: these are still really early days for data science software. So, there's a lot of Apache projects out there. There's a lot of other open source activity going on, but there are very few vendors that bring to the table an integrated, kind of full-platform approach to the data science workflow, and that's what we're bringing to market today.
Let me be clear, we're not trying to replace R, or Python, or MLlib, because those are the tools of the data scientists. They're not going anywhere. They spent eight years in their PhD programs working with these tools. We're not trying to change that. >> They're fluent with those tools. >> Very much so. They're also spending a lot of time doing feature engineering. Some research reports say between 70 and 80% of their time. What we bring to the table is a visual drag-and-drop environment to do feature engineering in a much faster, more efficient way than before. So, there's a lot of different kind of disparate, siloed applications out there that all do interesting things on their own, but what we're doing is we're trying to bring all of those together. >> And the trends are: reduce the time it takes to do stuff, and take away some of those tasks that you can use machine learning for. What unique capabilities do you guys have? Talk about that for a minute, just what Pentaho is doing that's unique and the value added for those guys. >> So, the big thing is, I keep going back to the data preparation part. I mean, that's 80% of the time, and that's still a really big challenge. There's other vendors out there that focus on just the data science kind of workflow, but where we're really unique is around being able to accommodate very complex data environments, and being able to onboard data. >> Give me an example of those environments. >> Geospatial data combined with data from your ERP or your CRM system, in all kinds of different formats. So, there might be 15 different data formats that need to be blended together and standardized before any of that can really happen. That's the complexity in the data. So, Pentaho, very consistent with everything else that we do outside of machine learning, is all about helping our customers solve those very complex data challenges before doing any kind of machine learning. One example is a customer called Caterpillar Marine Asset Intelligence.
So, they're doing predictive maintenance onboard container ships and ferries. They're taking data from hundreds and hundreds of sensors onboard these ships, combining that kind of operational sensor data together with geospatial data, and then they're serving up predictive maintenance alerts, if you will, or giving signals when it's time to replace an engine or replace a compressor or something like that. >> Versus waiting for it to break. >> Versus waiting for it to break, exactly. That's one of the real differentiators: that very complex data environment. And then I was starting to move toward the other differentiator, which is our end-to-end platform, which allows customers to deliver these analytics in an embedded fashion. So, kind of full circle, being able to send that signal out to an operational system, which is sometimes a challenge because you might have to rewrite the code. Deploying models is a really big challenge, but within Pentaho, because it is this fully integrated application, you can deploy the models within Pentaho and not have to jump out into a mainframe environment or something like that. So, I'd say the differentiators are very complex data environments, and then this end-to-end approach where deploying models is much easier than ever before. >> Perhaps, let's talk about the alternatives that customers might see. You have a tool suite, and others might have to put together a suite of tools. Maybe tell us some of the pitfalls; the geeky version would be the impedance mismatch. You know, like the chasms you'd find between each tool, where you have to glue them together. So what are some of those pitfalls? >> One of the challenges is, you have these data scientists working in silos oftentimes. You have data analysts working in silos, you might have data engineers working in silos. One of the big pitfalls is not really collaborating enough to the point where they can do all of this together. So, that's a really big area where we see pitfalls.
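As a toy illustration of what that kind of pipeline does, the sketch below standardizes sensor readings arriving in two different formats and applies a simple rolling-average rule to raise a maintenance alert before a part fails. Field names, units, the window, and the threshold are all invented for the example; a real system's blending and models are far richer.

```python
# Illustrative only: unify mixed-format readings, then flag drift early.

def standardize(reading):
    """Blend two hypothetical sensor formats into one schema (temps in C)."""
    if "temp_f" in reading:
        return {"ts": reading["ts"], "temp_c": (reading["temp_f"] - 32) * 5 / 9}
    return {"ts": reading["ts"], "temp_c": reading["temp_c"]}

def maintenance_alerts(temps, window=3, threshold=90.0):
    """Alert when the rolling mean of the last `window` readings exceeds threshold."""
    alerts = []
    for i in range(window, len(temps) + 1):
        if sum(temps[i - window:i]) / window > threshold:
            alerts.append(i - 1)  # index of the reading that tripped the rule
    return alerts

raw = [{"ts": 0, "temp_c": 80.0}, {"ts": 1, "temp_f": 176.0},  # 176F == 80C
       {"ts": 2, "temp_c": 85.0}, {"ts": 3, "temp_c": 95.0},
       {"ts": 4, "temp_f": 212.0}]                             # 212F == 100C
temps = [standardize(r)["temp_c"] for r in raw]
print(maintenance_alerts(temps))  # alert fires before outright failure
```

The point of the sketch is the ordering: the data problem (standardization and blending) has to be solved before the analytics rule can say anything useful, which is the argument being made in the interview.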
>> Is it binary, not collaborating, or is it that the round trip takes so long that the quality or number of collaborations is so drastically reduced that the output is of lower quality? >> I think it's probably a little bit of both. I think they want to collaborate, but one person might sit in Dearborn, Michigan and the other person might sit in Silicon Valley, so there's just a location challenge as well. The other challenge is, some of the data analysts might sit in IT and some of the data scientists might sit in an analytics department somewhere, so it kind of cuts across both location and functional area too. >> So let me ask from this point of view: we've been doing these shows for a number of years, and most people have their first data lakes up and running and their first maybe one or two use cases in production. Very sophisticated customers have done more, but what seems to be clear is that the highest value coming from those projects isn't to put a BI tool in front of them so much as to do advanced analytics on that data and apply those analytics to inform a decision, whether by a person or a machine. >> That's exactly right. >> So, how do you help customers over that hump, and what are some other examples that you can share? >> Yeah, so, speaking of transformative, I mean, that's what machine learning is all about. It helps companies transform their businesses. We like to talk about that at Pentaho. One customer industry example that I'll share is a company called IMS. IMS is in the business of providing data and analytics to insurance companies so that the insurance companies can price insurance policies based on usage. So, it's a usage model. IMS has a technology platform where they put sensors in a car and then, using your mobile phone, can track your driving behavior. Then, your insurance premium that month reflects the driving behavior that you had during that month.
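The usage-based pricing idea just described can be reduced to a small sketch: a base rate scaled by simple behavior counters pulled from the in-car sensors. The weights and the formula below are entirely invented for illustration; IMS's actual pricing model is not public in this conversation.

```python
# Hedged sketch of usage-based insurance pricing: premium scales with
# observed driving behavior instead of a fixed annual rate.
def monthly_premium(base, hard_brakes, sharp_turns, miles):
    # illustrative weights: each risky event nudges the multiplier up
    risk = 1.0 + 0.02 * hard_brakes + 0.01 * sharp_turns + 0.0001 * miles
    return round(base * risk, 2)

# a smooth month vs. an aggressive month on the same base rate
smooth = monthly_premium(100.0, hard_brakes=0, sharp_turns=1, miles=300)
rough = monthly_premium(100.0, hard_brakes=9, sharp_turns=12, miles=300)
print(smooth, rough)
```

The structural change this implies for the industry is the one discussed next: the rating variable moves from fixed actuarial categories to continuously observed behavior.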
In terms of transformative, this is completely upending the insurance industry, which has always had a very fixed approach to pricing risk. Now, they understand everything about your behavior. You know, are you turning too fast? Are you braking too fast? And they're taking it further than that too. They're able to now do kind of a retroactive look at an accident. So, after an accident, they can go back and kind of decompose what happened in the accident and determine whether or not it was your fault, or whether it was in fact the ice on the street. So, transformative? I mean, this is just changing things in a really big way. >> I want to get your thoughts on this. I'm just looking at some of the research. You know, we always have the good data, but there's also other data out there. In your news, 92% of organizations plan to deploy more predictive analytics; however, 50% of organizations have difficulty integrating predictive analytics into their information architecture, which is what the research shows. So my question to you is, there's a huge gap between the technology landscapes of front-end BI tools and complex data integration tools. That seems to be the sweet spot where the value's created. So, you have the demand, and front-end BI is kind of sexy and cool. Wow, I could power my business. But the complexity is really hard on the backend. Who's accessing it? What are the data sources? What's the governance? All these things are complicated, so how do you guys reconcile the front-end BI tools and the backend integration complexity? >> Our story from the beginning has always been this one integrated platform, both for complex data integration challenges together with visualizations, and that's very similar to what this announcement is all about for the data science market. We're very much in line with that. >> So, is it the cart before the horse? Is it that the BI tools are really driven by the data? I mean, it makes sense that the data has to be key.
Front-end BI could be easy if you have one data set. >> It's funny you say that. I presented at the Gartner conference last week, and my topic was, this just in: it's not about analytics. Kind of in jest, but it drew a really big crowd. So, it's about the data, right? It's about solving the data problem before you solve the analytics problem, whether it's a simple visualization or a complex fraud machine learning problem. It's about solving the data problem first. To that quote, I think one of the things they were referencing was the challenging information architectures into which companies are trying to deploy models, and part of that is, when you build a machine learning model, you use R and Python and all these other tools we're familiar with. In order to deploy that into a mainframe environment, someone has to then recode it in C++ or COBOL or something else. That can take a really long time. With our integrated approach, once you've done the feature engineering and the data preparation using our drag-and-drop environment, what's really interesting is that you're like 90% of the way there in terms of making that model production-ready. So, you don't have to go back and change all that code; it's already there because you used it in Pentaho. >> So obviously for those two technology groups I just mentioned, I think you have a good story there, but it creates problems. You've got product gaps, you've got organizational gaps, you have process gaps between the two. Are you guys going to solve that, or are you currently solving that today? There's a lot of little questions in there, but that seems to be the disconnect. You know, I can do this, I can do that, do I do them together? >> I mean, sticking to my story of one integrated approach to being able to do the entire data science workflow, from beginning to end, that's where we've really excelled.
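The deployment point can be illustrated with a small sketch: if the artifact produced during exploration can be serialized and loaded unchanged by the production pipeline, nothing has to be recoded in C++ or COBOL. Plain pickle and a hand-rolled scoring function stand in here for whatever a real platform uses under the hood; this is not Pentaho's actual mechanism.

```python
# Illustrative only: the exploration-phase model is handed to production
# as a serialized artifact, so the same code scores in both environments.
import pickle

def train():
    # stand-in "model": weights a data scientist learned during exploration
    return {"w": [0.4, 0.6], "bias": -0.5}

def predict(model, features):
    score = model["bias"] + sum(w * x for w, x in zip(model["w"], features))
    return "fraud" if score > 0 else "ok"

blob = pickle.dumps(train())           # hand-off artifact from exploration
production_model = pickle.loads(blob)  # the pipeline step loads it unchanged
print(predict(production_model, [1.0, 0.5]))
```

The design choice being argued for is exactly this: one runtime for both phases removes the translation step that otherwise eats the last 10% of the work.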
To the extent that more and more data engineers and data analysts and data scientists can get on this one platform, even if they're using R and Weka and Python. >> You guys want to close those gaps down, that's what you guys are doing, right? >> We want to make the process more collaborative and more efficient. >> So, Dave Vellante has a question on CrowdChat for you. Dave was in the snowstorm in Boston. Dave, good to see you, hope you're doing well shoveling out the driveway. Thanks for coming in digitally. His question is, HDS has been known for mainframes and storage, but Hitachi is an industrial giant. How is Pentaho leveraging Hitachi's IoT chops? >> Great question, thanks for asking. Hitachi acquired Pentaho about two years ago; this is before my time. I've been with Pentaho for about ten months. One of the reasons that they acquired Pentaho is because of a platform they've announced called Lumada, which is their IoT platform, and Pentaho is the analytics engine that drives that IoT platform. Lumada is about solving more of the hardware and sensor side, bringing data from the edge into being able to do the analytics. So, it's an incredibly great partnership between Lumada and Pentaho. >> Makes an internal customer too. >> It's a 90 billion dollar conglomerate, so yeah, the acquisition's been great, and we're still very much an independent company going to market on our own, but we now have a much larger channel through Hitachi's reps around the world. >> You've got IoT's use case right there in front of you. >> Exactly. >> But you are leveraging it big time, that's what you're saying? >> Oh yeah, absolutely. We're a very big part of their IoT strategy. It's the analytics. Both of the examples that I shared with you are in fact IoT, not by design, but because there's a lot of demand. >> You guys seeing a lot of IoT right now? >> Oh yeah.
We're seeing a lot of companies coming to us who have just hired a director or vice president of IoT to go out and figure out the IoT strategy. A lot of these are manufacturing companies, or they're coming from industries that are inefficient. >> Digitizing the business model. >> So, to the other point about Hitachi that I'll make, as it relates to data science: being a 90 billion dollar manufacturing and otherwise giant, we have a very deep bench of PhD data scientists that we can go to when there are very complex data science problems to solve at a customer site. So, if a customer's struggling with some of the basics of how to get up and running doing machine learning, we can bring our bench of data scientists at Hitachi to bear in those engagements, and that's a really big differentiator for us. >> Just to be clear, one last point: you've talked about how you handle the entire life cycle of modeling, from acquiring the data and prepping it, all the way through to building a model, deploying it, and updating it, which is a continuous process. I think as we've talked about before, data scientists, or just the DevOps community, have had trouble operationalizing the end of the model life cycle, where you deploy it and update it. Tell us how Pentaho helps with that. >> Yeah, it's a really big problem, and it's a very simple solution inside of Pentaho. It's basically a step inside of Pentaho. So, in the case of fraud, let's say for example, a prediction might say fraud, not fraud, fraud, not fraud, whatever it is. We can then bring that kind of full lifecycle back into the data workflow at the beginning. It's a simple drag-and-drop step inside of Pentaho to say which were right and which were wrong, and feed that back into the next prediction. We could also take it one step further, where there has to be a manual part of this too, where it goes to the customer service center, they investigate, and they say yes fraud, no fraud, and then that gets funneled back into the next prediction.
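The feedback step just described can be reduced to a minimal sketch: scored cases that get manually investigated come back with verified labels, and those cases are appended to the training set before the next model refresh. This is entirely illustrative; no Pentaho API or step name is implied.

```python
# Illustrative fraud feedback loop: fold verified investigation outcomes
# back into the labeled data that the next retraining cycle will see.
def fold_back(training, scored_cases, verdicts):
    """Append investigated cases as (features, verified label) pairs."""
    training = list(training)  # don't mutate the caller's data
    for features, verdict in zip(scored_cases, verdicts):
        training.append((features, verdict))
    return training

training = [((1, 0), "fraud"), ((0, 0), "ok")]  # existing labeled examples
scored = [(1, 1), (0, 1)]                # cases the model flagged this cycle
verdicts = ["fraud", "ok"]               # from the customer-service investigation
training = fold_back(training, scored, verdicts)
print(len(training))  # the next retrain sees 4 labeled examples
```

The manual investigation is the expensive part of the loop; the sketch shows why wiring its output straight back into the workflow, rather than into a side spreadsheet, is what makes the model improve over time.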
So yeah, it's a big challenge, and it's something that's relatively easy for us to do, just as part of the data science workflow inside of Pentaho. >> Well, Arik, thanks for coming on The Cube. We really appreciate it, good luck with the rest of the week here. >> Yeah, very exciting. Thank you for having me. >> You're watching The Cube here live in Silicon Valley, covering Strata Hadoop, and of course our Big Data SV event. We also have a companion event called Big Data NYC. We program with O'Reilly Strata Hadoop, and of course have been covering Hadoop really since it was founded. This is The Cube, I'm John Furrier, with George Gilbert. We'll be back with more live coverage today, for the next three days, here inside The Cube after this short break.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
George Gilbert | PERSON | 0.99+ |
Hitachi | ORGANIZATION | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Pentaho | ORGANIZATION | 0.99+ |
Dave | PERSON | 0.99+ |
90% | QUANTITY | 0.99+ |
Arik Pelkey | PERSON | 0.99+ |
Boston | LOCATION | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
Hitatchi | ORGANIZATION | 0.99+ |
John Furrier | PERSON | 0.99+ |
one | QUANTITY | 0.99+ |
50% | QUANTITY | 0.99+ |
eight years | QUANTITY | 0.99+ |
Arik | PERSON | 0.99+ |
One | QUANTITY | 0.99+ |
Lumada | ORGANIZATION | 0.99+ |
Last week | DATE | 0.99+ |
two technologies | QUANTITY | 0.99+ |
15 different data formats | QUANTITY | 0.99+ |
first | QUANTITY | 0.99+ |
92% | QUANTITY | 0.99+ |
One example | QUANTITY | 0.99+ |
Both | QUANTITY | 0.99+ |
Three days | QUANTITY | 0.99+ |
Python | TITLE | 0.99+ |
Kaggle | ORGANIZATION | 0.99+ |
one customer | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
eighth year | QUANTITY | 0.99+ |
last week | DATE | 0.99+ |
Santa Fe, California | LOCATION | 0.99+ |
two | QUANTITY | 0.99+ |
each tool | QUANTITY | 0.99+ |
90 billion dollar | QUANTITY | 0.99+ |
80% | QUANTITY | 0.99+ |
Caterpillar | ORGANIZATION | 0.98+ |
both | QUANTITY | 0.98+ |
NYC | LOCATION | 0.98+ |
first data | QUANTITY | 0.98+ |
Pentaho | LOCATION | 0.98+ |
San Jose | LOCATION | 0.98+ |
The Cube | TITLE | 0.98+ |
Big Data SV | EVENT | 0.97+ |
COBOL | TITLE | 0.97+ |
70 | QUANTITY | 0.97+ |
C++ | TITLE | 0.97+ |
IMS | TITLE | 0.96+ |
MLlib | TITLE | 0.96+ |
one person | QUANTITY | 0.95+ |
R | TITLE | 0.95+ |
Big Data | EVENT | 0.95+ |
Gardener Data and Analytics | EVENT | 0.94+ |
Gardner | EVENT | 0.94+ |
Strata Hadoop | TITLE | 0.93+ |