Joel Cumming, Kik - Spark Summit East 2017 - #SparkSummit - #theCUBE
>> Narrator: Live from Boston, Massachusetts, this is the Cube, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert.

>> Welcome back to Boston, everybody, where it's a blizzard outside and a blizzard of content coming to you from Spark Summit East, #SparkSummit. This is the Cube, the worldwide leader in live tech coverage. Joel Cumming is here. He's the head of data at Kik. Kicking butt at Kik. Welcome to the Cube.

>> Thank you, thanks for having me.

>> So tell us about Kik, this cool mobile chat app. Checked it out a little bit.

>> Yeah, so Kik has been around since about 2010. We're, as you mentioned, a mobile chat app, a start-up based in Waterloo, Ontario. Kik really took off in 2010, when it got two million users in the first 22 days of its existence. So it was insanely popular, specifically with U.S. youth, and the reason for that is that Kik started off in a time when chatting through text cost money. Text messages cost money back in 2010, and not every kid had a phone like they do today. So if you had an iPod or an iPad, all you needed to do was sign up, and you had a user name and now you could text with your friends. Kids could do that just like their parents could with Kik, and that's really where we got our entrenchment with U.S. youth.

>> And you're the head of data. So talk a little bit about your background. What does it mean to be a head of data?

>> Yes, so prior to working at Kik I worked at Blackberry, and I like to say I was at Blackberry from around the time just before you bought your first Blackberry until just after you bought your first iPhone. So kind of in that range, but I was there for nine years.

>> Vellante: Can you do that with real estate?

>> Yeah, I'd love to be able to do that with real estate. But it was a great time at Blackberry. It was very exciting to be part of that growth. When I was there, we grew from three million to 80 million customers, from three thousand employees to 17 thousand employees, and of course, things went sideways for Blackberry, but toward the end I was working in BBM, leading a team of data scientists and data engineers there. And BBM, if you're not familiar with it, is a chat app as well, and Kik is headquartered across town. The appeal of moving to Kik was a company that was very small and fast-moving but wasn't actually leveraging data at all. So when I got there, they had a pile of logs sitting in S3, waiting for someone to take advantage of them. They were good at measuring events, and looking at how those events tracked over time, but not really combining them to understand or personalize any experience for their end customers.

>> So they knew enough to keep the data.

>> They knew enough to keep the data.

>> They just weren't sure what to do with it. Okay so, you come in, and where did you start?

>> So the first day that I started was the first day I used any AWS product. I had worked on the big data tools at the old place, with Hadoop and Pig and Hive and Oracle, those kinds of things, but I had never used an AWS product until I got there, and it was very much sink or swim. On my first day our CEO said in a meeting, "Okay, you're the data guy here now. I want you to tell me in a week why people leave Kik." And I'm like, man, we don't even have a database yet. The first thing I did was fire up a Redshift cluster.
It was the first time I had ever done that. I looked at the tools that were available in AWS to transform the data, using EMR and Pig and those kinds of things, and was fortunate enough to figure that out in a week. I didn't give him the full answer of why people left, but I was able to give him some ideas of places we could go, based on some preliminary exploration. So I went from leading a team of about 40 people to being a team of one and writing all the code myself. Super exciting, not the experience that everybody wants, but for me it was a lot of fun. Over the last three years we've built up the team. Now we have three data engineers and three data scientists, and data is a lot more important to people every day at Kik.

>> What sort of impact has your team had on the product itself and the customer experience?

>> In the beginning it was really just trying to understand the behaviors of people across Kik, and that took a while to really wrap our heads around. Any good data analysis combines behaviors that you have to ask people their opinion on and behaviors that we see them do. I had an old boss who used to work at Rogers, which is a telecom provider in Canada, and he said if you ask people what they watch, they tell you documentaries and the news and very important stuff, but if you see what they actually watch, it's reality TV and trashy shows. So the truth is really somewhere in the middle; there's an aspirational element. So for us, really understanding the data we already had, instrumenting new events, and then, in the last year and a half, building out an A/B testing framework has been instrumental in how we leverage data at Kik. We were making decisions by gut feel in the very beginning; then we moved into this era where we were doing A/B testing, very focused on statistical significance and rigor around all of our experiments; but then we stepped back and realized maybe the bets we were making weren't big enough. So we need to bet a little bit more on some bigger features that have the opportunity to move the needle. We've been doing that recently with a few features we've released, but data is super important now, both to stimulate the creativity of our product managers and to measure the success of those features.
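For context on the statistical rigor Joel mentions, here is a minimal sketch of the kind of significance test that typically sits behind an A/B result. The two-proportion z-test and the retention numbers below are illustrative assumptions, not Kik's actual framework.

```python
from math import sqrt, erfc

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, n_a: conversions and sample size in the control group.
    conv_b, n_b: conversions and sample size in the treatment group.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that the two groups are identical.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical experiment: day-7 retention of 4.0% (control) vs 4.6% (variant).
z, p = two_proportion_z_test(conv_a=400, n_a=10000, conv_b=460, n_b=10000)
print(f"z = {z:.2f}, p = {p:.4f}")  # p is about 0.037, significant at 0.05
```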
>> And how do you map to the product managers who are defining the new features? Are you a central group? Are you point guards within the different product groups? You provide evidence-based recommendations, but presumably they ultimately make the decisions. What's the dynamic?

>> It's a great question. In my experience, it's very difficult to build a structure that's perfect. In the purely centralized model you've got this problem where people come to you to ask for something and may get turned away because you're too busy, and in the decentralized model you tend to have lots of duplication and overlap, and maybe not share all the things that you need to share. So we tried to build a hybrid of both. We had our data engineers centralized, and we tried doing what we called tours of duty, where our data scientists would be embedded with various teams within the company. It could be the core messenger team, it could be our bot platform team, it could be our anti-spam team. They would sit with them, and it's very easy for product managers and developers to ask them questions and get answers, and then after a few months we would rotate those folks through a different tour of duty and they would sit with another team. We did that for a while, and it worked pretty well, but one of the major problems we found was that there's no good checkpoint to confirm that what they're doing is right. In software development you're releasing a version of software: there's QA, there's code review, and there's structure in place to ensure that yes, this number I'm providing is right. It's difficult for a data scientist who's out with a team to come back and get that peer review. So now we're reevaluating that. We use an agile approach, and we have primes for each of these groups, but now we all sit together.

>> So, on accountability: after the data scientist makes a recommendation that the product manager agrees with, how do you ensure that it measured up to the expectation? Sort of after the fact.

>> Yeah, so in those cases, with our A/B tests, it's nice to have that unbiased data resource on the team, embedded with them, who can step back and say yes, this idea worked, or it didn't work. That's the approach we're taking. It's not a dedicated resource, but a prime resource for each of these teams who is a subject matter expert and then evaluates the results in an unbiased kind of way.

>> So you've got this relatively small data team, even though it's quadruple the size it was when you started, and then the application development team. Are they colleagues? How do you interact with them?

>> Yeah, we're actually part of the engineering organization at Kik, part of R&D. At different times in my life I've been part of different organizations, whether it's marketing or I.T. or R&D, and R&D really fits nicely. The reason I think it's the best fit is that if there's data you need to understand users better, being part of R&D gives you much more direct control over getting that element instrumented within the product. If you're in marketing, you're like, hey, I'd love to know how many times people tap on that red button, but no event fires when that red button is tapped. Good luck trying to get the software developers to put that in. But when there's an inherent component of R&D that depends on data, and data has that direct path to those developers, getting that kind of thing done is much easier.

>> So from a tooling standpoint, thinking about data scientists and data engineers, a lot of the tools that we've seen in this so-called big data world have been quite bespoke. Different interfaces, different experiences. How are you addressing that? Does Spark help with that? Maybe talk about that a bit more.

>> Yeah, so I was fortunate enough to do a session today that talked about data V1 at Kik versus data V2 at Kik, and we drew a kind of line in the sand. When I started, it was just me, trying to answer these questions very quickly on the three- or five-day timelines that we get from our CEO.

>> Vellante: You've been here a week, come on!

>> Yeah, exactly. So you sacrifice data engineering and architecture when you're living like that, so that you can answer questions very quickly. It worked well for a while, but then all of a sudden we had 300 data pipelines, and they were a mess. They were hard to manage and control. We had code sometimes in SQL, sometimes in Python scripts, and sometimes on people's laptops. We had no real plan for GitHub integration. And then there were real scalability problems with Redshift. We were doing a lot of our workloads in Redshift to do transformations, just because you get the data into Redshift, write some SQL, and then you have your results, and we were running into contention problems with that. So we decided to stop, step back, and say, okay, how are we going to house all of this atomic data in a way that's efficient? We had started with Redshift when our database was 10 terabytes. Now it's 100, and we get five terabytes of new data per day, so putting it all in Redshift doesn't make sense; it's not all that useful. We don't want to get rid of the atomic data, so how do we keep all of it under our control? We decided to go the data lake route, even though we hate the term data lake: basically a folder structure within S3, stored in a query-optimized format like Parquet, so that we can access that data very quickly at an atomic level, at a cleansed level, and also at an aggregate level. So for us, data V2 was the evolution of stopping a lot of things we used to do, which was lots of data pipelines, code that was all over the place, and aggregations in Redshift, and starting to use Spark, specifically Databricks. Databricks we think of in two ways. One is managed Spark, so that we don't have to do all the configuration we used to have to do with EMR, and the second is notebooks that we can align with all the work we're doing, with revision control and GitHub integration as well.
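To make that layered S3-plus-Parquet layout concrete, here is a hedged PySpark sketch of an atomic/cleansed/aggregate structure like the one Joel describes; the bucket, paths, and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events-to-lake").getOrCreate()

# Hypothetical raw JSON event logs landed in S3 by the ingestion pipeline.
raw = spark.read.json("s3://example-bucket/raw/events/2017/02/08/")

# Cleansed layer: deduplicated events with a date column for partitioning,
# so queries that filter on date or event type prune files instead of
# scanning everything.
cleansed = (raw
            .withColumn("event_date", F.to_date(F.col("event_ts")))
            .dropDuplicates(["event_id"]))

(cleansed.write
 .mode("append")
 .partitionBy("event_date", "event_type")
 .parquet("s3://example-bucket/lake/cleansed/events/"))

# Aggregate layer: for example, daily active users computed from the
# cleansed layer; this top layer is the kind of table that can also be
# pushed into a warehouse like Redshift for analysts.
dau = (spark.read.parquet("s3://example-bucket/lake/cleansed/events/")
       .groupBy("event_date")
       .agg(F.countDistinct("user_id").alias("dau")))

dau.write.mode("overwrite").parquet("s3://example-bucket/lake/aggregate/dau/")
```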
>> A question to clarify: once you've got the data lake, which is the file system with the data in Parquet format, Parquet files, that's where you want some sort of interactive experience for business intelligence. Do you need some sort of MPP server on top of that to provide interactive performance? Because I know a lot of customers are struggling at that point: they've got all the data there, and it's kind of organized, but if they really want to munge through that huge volume they find it slows to a crawl.

>> Yeah, it's a great point. We're at the stage right now where, at the top layer of our data lake, where we aggregate and normalize, we also push that data into Redshift. What we're trying to do with Redshift is make it a read-only environment, so that our analysts and developers know they have consistent read performance on Redshift, where before, when it was a mix of batch jobs and read workloads, they didn't have that guarantee. So you're right, and we think what will probably happen over the next year or so is that the advancements in Spark will make it much more capable as a data warehousing product, and then you have to start to ask, do I need both Redshift and Spark for that kind of thing? I would hope that the cost-based optimizations that are coming, or at least the promise of them, will help Spark become more of a data warehouse, but we'll have to see.

>> So carry that thread a little further through. In terms of things that you'd like to see in the Spark roadmap, things that could be improved, what's your feedback to Databricks?

>> We're fortunate, we work with them pretty closely. We've been a customer for about half a year, and they've been outstanding to work with. Structured streaming is a great example of something we worked pretty closely with them on, and we're really excited about it. We have certain pockets within our company that require very real-time data: obviously the operational components, are your servers up or down, as well as our anti-spam team. They require very low-latency access to data. Typically, if we batch every hour, that's fine in most cases, but now that our data streams are coming in through Kinesis Firehose, structured streaming lets us process them without having to worry about checking whether it's time to start a job or whether all the data is there so we can run the batch. Structured streaming solves a lot of those problems; it simplifies a lot of that workload for us. So that's something we've been working with them on.
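As a rough illustration of that simplification, here is a hedged structured streaming sketch (Spark 2.1-era APIs) that treats Firehose-delivered files in S3 as a stream; the schema, paths, and windowing choices are illustrative, not Kik's actual pipeline.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("firehose-stream").getOrCreate()

# File-based streaming sources require an explicit schema; these fields
# are hypothetical.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Kinesis Firehose delivers micro-batches of records as files in S3;
# structured streaming picks up each new file as it arrives, with no
# hand-rolled "is the batch complete yet?" checks.
events = spark.readStream.schema(schema).json("s3://example-bucket/firehose/events/")

# Per-minute event counts; the watermark lets Spark finalize windows so
# they can be appended to a file sink.
counts = (events
          .withWatermark("event_ts", "10 minutes")
          .groupBy(F.window("event_ts", "1 minute"), "event_type")
          .count())

query = (counts.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "s3://example-bucket/lake/streaming/event_counts/")
         .option("checkpointLocation", "s3://example-bucket/checkpoints/event_counts/")
         .start())
```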
The other things we're really interested in — we've got a bit of a list, but the other major one is how you start to leverage this data for personalization back in the app. Today we think of data in two ways at Kik. The first is data as KPIs: the things you need to run your business, maybe A/B testing results, maybe how many active users you had yesterday, that kind of thing. The second is data as a product: how do you provide personalization at an individual level, based on your data science models, back out to the app? I should point out that at Kik we don't see anybody's messages. We don't read your messages; we don't have access to those. But we have the metadata around the transactions that you have, like most companies do. That helps us improve our products and services under our privacy policy: who's building good relationships, who's leaving the platform, and why are they doing it. But we can also surface components that are useful for personalization. If you've chatted with three different bots on our platform, that's important for us to know if we want to recommend another bot to you. Or, you know, the classic "people you may know" recommendations. We don't do that right now, but behind the scenes we have the kind of information that could help personalize that experience for you. Those two things are very different. In a lot of companies there's an R&D element to this. At Blackberry, the App World recommendation engine was something a team ran in production, but our team was helping those guys tweak and tune their models. It's the same kind of thing at Kik, where our data scientists are building models for personalization and we then need to serve them back up to the rest of the company. The process right now of taking the results of our models and putting them into a real-time serving system isn't that clean, so we do daily batches on things that don't need to be near real-time, things like predicted gender. If we know your first name, we've downloaded the list of baby names from the U.S. Social Security website, and we can say the name Pat is male 80 percent of the time and female 20 percent of the time, while Joel is male 99 percent of the time and female one percent, so depending on your tolerance for whatever you want to use this personalization for, we can give you our degree of confidence on that.
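A toy version of that name-based prediction is easy to sketch from the public Social Security baby-names files, which are yearly "name,sex,count" CSVs; the file name, threshold, and function below are illustrative, not Kik's model.

```python
import csv
from collections import defaultdict

# Tally male/female counts per name from an SSA file such as yob2015.txt
# (rows look like "Emma,F,20355").
counts = defaultdict(lambda: {"M": 0, "F": 0})
with open("yob2015.txt", newline="") as f:
    for name, sex, count in csv.reader(f):
        counts[name.lower()][sex] += int(count)

def predicted_gender(name, min_confidence=0.75):
    """Return (gender, confidence); gender is None below the threshold."""
    tally = counts[name.lower()]
    total = tally["M"] + tally["F"]
    if total == 0:
        return None, 0.0  # name not present in the data set
    confidence = max(tally["M"], tally["F"]) / total
    gender = "male" if tally["M"] >= tally["F"] else "female"
    return (gender if confidence >= min_confidence else None), confidence

print(predicted_gender("Joel"))  # overwhelmingly male in the SSA data
print(predicted_gender("Pat"))   # mixed name; may fall below the threshold
```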
That's one example of what we surface right now in our API, back to our own first-party components of our app. But in the future, with more real-time data coming in from Spark streaming, more real-time model scoring, and the ability to push that into some sort of capability that can be surfaced through an API, our data team will be much more flexible and fast at surfacing things that provide personalization to the end user, as opposed to what we have now, which is all this batch processing, loading once a day, and knowing we can't react on the fly.

>> So if I were to try to turn that into a sort of Spark roadmap, it sounds like the process is taking the analysis and doing perhaps even online training to update the models, or just rescoring if you're doing something slightly less fresh, but then serving it up from a high-speed serving layer. That's when you can take data that's coming in from the game and send it back to improve the game in real time.

>> Exactly. Yep.

>> That's what you're looking for.

>> Yeah.

>> You and a lot of other people.

>> Yeah, I think so.

>> So how's the event been for you?

>> It's been great. There are some really smart people here. It's humbling when you go to some of these sessions. We're fortunate in that we try not to have to think about a lot of the details that people are explaining here, but it's really good to understand them and know that there are some smart people fixing these problems. Like all events, there have been some really good sessions, but the networking is amazing, so I'm meeting lots of great people here and hearing their stories too.

>> And you're hoping to go to the hockey game tonight.

>> Yeah, I'd love to go to the hockey game. See if we can get through the snow.

>> Who are the Bruins playing tonight?

>> San Jose.

>> Oh, good.

>> It could be a good game.

>> Yeah, the rivalry. You guys into the hockey game? Alright, good. Alright, Joel, listen, thanks very much for coming on the Cube. Great segment. I really appreciate your insights and sharing.

>> Okay, thanks for having me.

>> You're welcome. Alright, keep it right there, everybody. George and I will be back right after this short break. This is the Cube. We're live from Spark Summit in Boston.