Josh Klahr & Prashanthi Paty | DataWorks Summit 2017
>> Announcer: Live from San Jose, in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2017. Brought to you by Hortonworks.

>> Hey, welcome back to theCUBE. Day two of the DataWorks Summit, I'm Lisa Martin with my cohost, George Gilbert. We've had a great day and a half so far, learning a ton in this hyper-growth big data world that's meeting IoT, machine learning, and data science. George and I are excited to welcome our next guests. We have Josh Klahr, the VP of Product Management from AtScale. Welcome Josh, welcome back.

>> Thank you.

>> And we have Prashanthi Paty, the Head of Data Engineering for GoDaddy. Welcome to theCUBE.

>> Thank you.

>> Great to have you guys here. So I wanted to talk to you about, one, how you're working together, but two, some of the trends you're seeing. As we've said, in the tech industry it's two degrees of Kevin Bacon, right? You two worked together back in the day at Yahoo. Talk to us about what you both saw and experienced in terms of the Hadoop adoption maturity cycle.

>> Sure.

>> You want to start, Josh?

>> Yeah, I'll start, and you can chime in and correct me. As you mentioned, Prashanthi and I worked together at Yahoo, in our central data group. It feels like a long time ago. We had two main jobs. The first was to collect all of the data from our ad systems and our audience systems and put it into a Hadoop cluster. At the time, we were doing this while Hadoop itself was still being developed. The other thing we did was support a bunch of BI consumers. So we built cubes, we built data marts, we used MicroStrategy and Tableau. The Hadoop experience was great in terms of low-cost storage and scale-out processing of what were really billions and billions, tens of billions of events a day. But when it came to BI, it felt like we were doing things the old way.
We were moving data off the cluster and making it small. In fact, you did a lot of that.

>> Well, yeah, at the end of the day we were using Hadoop as a staging layer. We would process a whole bunch of data there, then scale it back down and move it into relational stores or cubes, because we basically couldn't afford to point BI tools, or our end users, directly at Hadoop. So while we certainly did large-scale data processing in the Hadoop layer, we failed to turn on the insights right there.

>> Lisa: Okay.

>> Maybe there's a lesson in there for folks who are on slightly more mature versions of Hadoop now but can learn from some of your experiences. Were there issues in terms of having cleaned and curated data? Were there issues for BI with performance and the lack of proper file formats like Parquet? Where did you hit the wall?

>> It was both. You have to remember, we were probably one of the first teams to put a data warehouse on Hadoop. We were dealing with Pig versions like 0.5, 0.6, so we were putting a lot of demand on the tooling and the infrastructure. Hadoop was still at a very nascent stage at that time. That was one. And I think a lot of the focus was on, hey, now we have the ability to do clickstream analytics at scale, right? So we did a lot of the backend work. But the presentation layer is where I think we struggled.

>> So the idea is that you could do full resolution without sampling on the backend, and then you would extract and presumably denormalize so that you could essentially run data marts for subject-matter interests.

>> Yeah, and that's exactly what we did. We took all of this big data, but making it work for BI meant two things. One was performance: can you get interactive query response times. The other was the interface.
Can a Tableau user connect and understand what they're looking at? You had to make the data small again. And that was actually the genesis of AtScale, which is where I am today. We were frustrated with having a big data platform and then having to make the data small again in order to support BI.

>> That's a great transition, Josh. Let's talk about AtScale. You saw BI on Hadoop as this big white space. How have you succeeded there? And then let's talk about what GoDaddy is doing with AtScale and big data.

>> Yeah, we took the learnings from our experience at Yahoo and really thought about, if we were to start from scratch and solve the problem the way we wanted it solved, what would that system look like? It was a few things. One was an interface that worked for BI. I don't want to date myself, but my experience in the software space started with OLAP, and I can tell you OLAP isn't dead. When you talk to a Fortune 1000 enterprise about OLAP, that's how they think: in terms of measures, dimensions, and hierarchies. So one important thing for us was to project an OLAP interface on top of data that's Hadoop native: Hive tables, Parquet, ORC, all of the mess that may sit underneath the covers. The other thing was delivering performance. We've invested a lot in using the Hadoop cluster natively to deliver performant queries. We do this by creating aggregate tables and summary tables and by being smart about how we route queries. But we've done it in a way that makes a Hadoop admin very happy: you don't have to buy a bunch of AtScale servers in addition to your Hadoop cluster. We scale the way the Hadoop cluster scales, so we don't require separate technology, and we fit really nicely into the Hadoop ecosystem.

>> So, making the Hadoop admin happy is a good thing.
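The aggregate-table idea described here can be sketched in a few lines of plain Python. This is a toy illustration of the general technique, not AtScale's actual implementation: pre-roll raw events up to the grain BI users query most, then route queries to the small summary table instead of scanning raw events.

```python
from collections import defaultdict

# Toy event log: (day, country, product, revenue). In a real Hadoop
# deployment these would be rows in Hive/Parquet tables, not tuples.
events = [
    ("2017-06-13", "US", "domains", 10.0),
    ("2017-06-13", "US", "hosting", 25.0),
    ("2017-06-13", "DE", "domains", 7.0),
    ("2017-06-14", "US", "domains", 12.0),
]

def build_aggregate(events, dims):
    """Pre-aggregate revenue to the grain given by `dims` (column indices)."""
    agg = defaultdict(float)
    for row in events:
        key = tuple(row[i] for i in dims)
        agg[key] += row[3]
    return dict(agg)

# Summary table at (day, country) grain: built once, queried many times.
by_day_country = build_aggregate(events, dims=(0, 1))

def query_revenue(day, country):
    # Query routing: answer from the small aggregate, not the raw events.
    return by_day_country.get((day, country), 0.0)

print(query_revenue("2017-06-13", "US"))  # 35.0
```

The pre-aggregation cost is paid once at build time; every matching query afterward touches only the rolled-up rows, which is where the interactive response times come from.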
How do you make the business user happy? That user now needs, as we heard here yesterday, to merge more with the data science folks to be able to understand, or even articulate, "These are the business outcomes we want to look for and we want to see." How do you, under the hood if you will, make the business guys and gals happy?

>> I'll share my opinion and then Prashanthi can comment on her experience. As I mentioned before, business users want an interface that's simple to use. So one thing we do is give them the ability to just look at measures and dimensions. If I'm a business analyst who grew up doing analysis in Excel, the thing I like most is a big, fat, wide table. So we make an underlying Hadoop cluster, and what could be tens or hundreds of tables, look like a single big, fat, wide table to the analyst. Talk to a data scientist, talk to a business analyst, and that's the way they want to view the world. The other thing we do is give them fast response times. We give them interactivity, so they can really quickly get a sense of the shape of the data.

>> And that lets them get to time to value.

>> Yes.

>> I can imagine.

>> Just a follow-up on that. When you have to prepare the aggregates, essentially like the cubes, instead of the old BI tools running on a data mart, what is the additional latency required between data coming fresh into the data lake and transforming it into something that's consumption-ready for the business user?

>> Yeah, I think I can take that. If you look at the last 10 years, in the initial period, certainly at Yahoo, we just threw engineering resources at that problem. We had teams dedicated to building these aggregates. But the whole premise of Hadoop was the ability to do unstructured optimizations.
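The "big fat wide table" described above is essentially denormalization: joining dimension tables into the fact table so the analyst sees one flat schema instead of tens or hundreds of tables. A minimal sketch, with hypothetical table and column names invented for illustration:

```python
# Star schema in miniature: one fact table plus two dimension lookups.
fact_sales = [
    {"cust_id": 1, "prod_id": 10, "revenue": 99.0},
    {"cust_id": 2, "prod_id": 11, "revenue": 49.0},
]
dim_customer = {1: {"cust_name": "Acme"}, 2: {"cust_name": "Globex"}}
dim_product = {10: {"prod_name": "hosting"}, 11: {"prod_name": "domains"}}

def widen(fact_rows, dims):
    """Join each dimension into the fact rows, yielding one wide row per
    fact row: the flat, Excel-like view a BI user expects."""
    wide = []
    for row in fact_rows:
        flat = dict(row)
        for key_col, lookup in dims.items():
            flat.update(lookup[row[key_col]])
        wide.append(flat)
    return wide

wide_table = widen(fact_sales, {"cust_id": dim_customer, "prod_id": dim_product})
print(wide_table[0]["cust_name"], wide_table[0]["prod_name"])  # Acme hosting
```

The analyst then filters and aggregates one table; the join complexity is hidden in the layer that builds the wide view.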
By having a team find the new data coming in and then integrate it into your pipeline, we were adding a lot of latency. So we needed to figure out how to do this in a more seamless, more real-time way, and get to the real premise of Hadoop: getting it into the hands of our business users. I think that's where AtScale is doing a lot of good work, in terms of dynamically creating aggregates based on the design you put in the cube. So we are starting to work with them on our implementation, and we're looking forward to the results.

>> Tell us a little bit more about what you're looking to achieve. GoDaddy is a customer of AtScale. What are you looking to build together, and where are you in your journey right now?

>> Yeah, the main goal for us is to move beyond predefined models, dashboards, and reports. We want to be more agile with our schema changes, so time to market is one. And performance: the ability to put BI tools directly on top of Hadoop. And also to push as much of the semantics as possible down into the Hadoop layer. Those are the things we're looking to do.

>> That sounds like a classic business intelligence component, but rethought for the big data era.

>> I love that quote, and I feel it.

>> Prashanthi: Yes.

>> Josh: Yes. (laughing)

>> That's exactly what we're trying to do.

>> But some of the things you mentioned are non-trivial. Time goes into the pre-processing of data so that it's consumable, but you also want it to be dynamic, which is a trade-off, because that takes time. So is that a set of requirements, a wishlist for AtScale, or is that something you're building on your own?

>> I think there's a lot happening in that space.
AtScale is one of the first to come out with a product solving a real problem that we tried to solve for a long time. As we use them more and more, we'll surely push them to bring in more features. The algorithm they have to dynamically generate aggregates is something we're giving them quite a lot of feedback on.

>> Our last guest, from Pentaho, quoted in her keynote today what I believe was a McKinsey report saying that "40% of machine learning data is either not fully exploited or not used at all." So tell us, where is GoDaddy regarding machine learning? What are you seeing? What are you seeing at AtScale, and how are you going to work together to venture into that frontier?

>> Yeah, one of the key requirements we place on our data scientists is that you not only have to be very good at your data science job, you also have to be a very good programmer to make use of the big data technologies. And we're seeing some interesting developments, like very workload-specific engines coming into the market now: for search, for graph, and for machine learning as well. These are supposed to put the tools right into the hands of data scientists. I personally haven't worked with them enough to comment, but I do think the next realm of big data is these workload-specific engines sitting on top of Hadoop and realizing more of the insights for the end users.

>> Curious, can you elaborate a little more on those workload-specific engines? That sounds rather intriguing.

>> Well, for interacting with Hadoop on a real-time basis, we see search-based engines like Elasticsearch and Solr, and there is also Druid. At Yahoo, we were quite a big Druid shop, actually. We were using it as an interactive query layer directly between our BI applications, which were JavaScript-based, and Hadoop.
So I think there are quite a few means to realize insights from Hadoop now, and that's the space where I see workload-specific engines coming in.

>> And you mentioned, before we started, that you were using Mahout, presumably for machine learning. And I guess I thought the center of gravity for that type of analytics has moved to Spark, and you haven't mentioned Spark yet.

>> We are not using Mahout, though. I mentioned it as something that's in that space. But yeah, Spark is pretty interesting. Spark SQL, doing ETL with Spark, as well as using Spark SQL for queries, is something that looks very, very promising lately.

>> Quick question for you, from a business perspective, as the Head of Data Engineering at GoDaddy: how do you interact with your business users? The C-suite, for example, is embracing big data more and more and leveraging Hadoop as an enabler. What's the conversation like, or maybe even the influence of the GoDaddy business C-suite on engineering? How do you work collaboratively?

>> We do have very regular stakeholder meetings, and these are business stakeholders. We have representatives from our marketing, finance, product, and data science teams. We consider data science one of our customers. We take requirements from them, we give them a peek into the work we're doing, and we also let them be part of our agile team, so that when we have something released, they're the first ones looking at it and testing it. They're very much part of the process. I don't think we can afford to just sit back, work on a monolithic data warehouse, and at the end of the day say, "Hey, here is what we have," and ask them to go get the insights from it. It's a very agile process, and they're very much part of it.
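The pattern mentioned above, using the same SQL engine for both ETL and interactive queries, can be illustrated with Python's built-in sqlite3 module standing in for Spark SQL. This is a toy stand-in chosen so the example runs anywhere; Spark would execute the same shape of SQL distributed over Hive or Parquet tables via spark.sql(...).

```python
import sqlite3

# In-memory database stands in for the cluster's table storage.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clicks (day TEXT, page TEXT, n INTEGER)")
con.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [("2017-06-13", "home", 100),
     ("2017-06-13", "cart", 40),
     ("2017-06-14", "home", 120)],
)

# ETL step: roll raw clicks up into a daily summary table, in SQL.
con.execute(
    "CREATE TABLE daily AS SELECT day, SUM(n) AS total "
    "FROM clicks GROUP BY day"
)

# Query step: the same SQL interface serves the interactive BI query.
rows = con.execute("SELECT day, total FROM daily ORDER BY day").fetchall()
print(rows)  # [('2017-06-13', 140), ('2017-06-14', 120)]
```

The appeal is exactly what's described in the conversation: one declarative language covers both the pipeline transformation and the downstream query, instead of separate programming for each.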
>> One last question for you, sorry George. You mentioned you're sort of early in your partnership, unless I misunderstood. What has AtScale helped GoDaddy achieve so far, and what are your expectations, say, over the next six months?

>> We want the world. (laughing)

>> Lisa: Just that.

>> Yeah, but the premise is, Josh and I were part of the same team at Yahoo, where we faced the problems that AtScale is trying to solve. So the premise of being able to solve those problems, which is, like their name, basically delivering data at scale, is what I'm very much looking forward to from them.

>> Well, excellent. We want to thank you both for joining us on theCUBE. We wish you the best of luck in attaining the world. (all laughing)

>> Josh: There we go, thank you.

>> Excellent, guys. Josh Klahr, thank you so much.

>> My pleasure.

>> Prashanthi, thank you for being on theCUBE for the first time.

>> No problem.

>> You've been watching theCUBE live at day two of the DataWorks Summit. For my cohost George Gilbert, I'm Lisa Martin. Stick around guys, we'll be right back. (jingle)