David Scott, Veritas | CUBE Conversation, June 2020
>> Announcer: From theCUBE studios in Palo Alto and in Boston, connecting with thought leaders all around the world, this is a CUBE conversation.

>> Hey, welcome back everybody. Jeff Frick here with theCUBE, coming to you today from our Palo Alto studios. COVID is still going on, so there's still no shows, but the good news is we've got the technology, we can reach out to the community and bring them in from far, far away. So today, joining us from Virginia across the country is Dave Scott. He is the director of Product Management for Veritas. Dave, great to see you.

>> Thanks Jeff, great to be here.

>> Absolutely. So let's jump into it. You guys have been about backup and recovery for years and years and years, but oh my goodness, how the landscape continues to evolve between, you know, public cloud and all the things happening with Amazon and Google and Microsoft, and then now, of course, the big push for hybrid, and, you know, where the workloads go, and kind of application-centric infrastructure. You guys still got to back up and secure all these things. I wonder if you can give us a little bit of your perspective on, you know, kind of the increasing complexity of the computing environment, as all these different pieces of the puzzle are kind of gaining traction at the same time.

>> Yeah, absolutely. I mean, I'm on the compliance side of the company, so I'm more on looking after requirements around collection of content, preparation for litigation, making sure you're adhering to compliance regulations in different parts of the world. And, I mean, that's a constantly evolving space. So basically the products I look after are Enterprise Vault, Enterprise Vault.cloud, and our eDiscovery platform. And, as you say, one of the biggest challenges is that customers are looking for flexibility in how they deploy our solutions. We've had a product in market with Enterprise Vault for about 20 years, so we have a lot of customers that have a lot of data on premise, and now, you know, they've got cloud mandates, they want to move that content to the cloud. So we have gotten very aggressive at building out our SaaS archiving solution, Enterprise Vault.cloud. But we also provide other options: if you want to move Enterprise Vault from your data center on premise to your tenant in Azure or Amazon, we fully support that. In fact, we're taking advantage of cloud services to make that a much more viable option for our customers.

>> So let's get into the regulation and the compliance, 'cause that's a big piece of the motivation. Beyond just, you know, making sure that the business can recover, the regulation and compliance thing is huge. You know, there's GDPR, which has been around now for a couple of years, the California protection act. And I think what I find interesting from your perspective is you have this kind of crazy sea of regulations that are different by country, by industry, by data type, and they're evolving all the time. So that's got to be a relatively complex little grid you've got to keep track of.

>> Yeah, it makes the job interesting. But it also is a huge competitive advantage for us. We have a team that researches data privacy regulations around the world, and it's been a competitive advantage in that we can be incredibly nimble in creating a new policy. We had some opportunities come up in Turkey; there's a regulation there that mirrors GDPR, called KVKK, as I think they call it locally.
And the joke is that it's kind of like GDPR, but with jail time for noncompliance. So there's a lot more motivation on the part of an IT department to make sure they're meeting that requirement. But it has to do with, you know, data privacy again, and ensuring the safety of the content. That's proliferating throughout the world. You mentioned the California Consumer Privacy Act; many other states are starting to follow the California Consumer Privacy Act. And I'm sure it won't be long before we have a data privacy act in the US that's nationwide instead of at the state level. In other industries that we serve, like the financial services industry, there's always been a lot of regulation around the SEC and FINRA in the US, and that's spreading to other countries now. You know, MiFID II in the European Union has been huge. And that dictates you need to capture all voice conversations, all text conversations, instant messages, everything that goes on between a broker and the end customer. It has to be captured, has to be supervised, and has to be maintained on WORM storage. So that's a great segment for us as well. That's an area we play very well in.

>> So it's interesting, 'cause in preparing for this, I saw some of the recent announcements around the concept of data supervision. So I think a lot of people are familiar with backup and recovery, and continuity, but specifically data supervision: what does that really mean? How is that different than kind of traditional backup and recovery, and what are some of the really key features or attributes to make that a successful platform?

>> Yeah, no, it is really outside of the realm of backup and recovery. Archiving is very different from backup and recovery. And archiving is about preserving the communication, and being able to monitor that communication, for the purposes of meeting compliance regulations. So, in the case of our solution, Veritas Advanced Supervision, it sounds a bit big brotherish, if I'm being honest, but it is a requirement for the financial service community that you sample a subset of those communications looking for violations. So you're looking for insider trading, you're looking for money laundering. In some companies, it's the HR departments, or even just trying to ensure that their employees are being compliant. And so you may sample a subset of content. But it's absolutely required within the financial services community. And we're starting to see a lot of other industries, you know, leveraging this technology just to ensure compliance with different regulations, or compliance with their own internal policies: ensuring a safe workplace, ensuring that there's not any sexual harassment or that type of thing going on through office communications. So it is a way of just monitoring your employees' communications.

>> So, I remember when people used to talk about messaging in kind of the generic sense, like I could never quite pin down what it meant, you know, is it an email, is it a text? I mean, little did I know that every single application that's now installed on every single device that I have has a messaging feature, you know, a direct messaging feature. So, I mean, the complexity, and I guess the variability, in the communication methods across all these applications, and, you know, probably more than half of the ones that most of us work on are SaaS as well, really adds a ton of complexity to the challenge that you were just talking about.

>> Oh, absolutely. I mean, I'm old.
You know, when I started, all of my communications were on a Microsoft mail server, all my files were on the file server, you know, in the server room down the hall. Now I've got about 20 different ways to communicate on my phone. And the fragmentation of communication does make that job a lot more challenging. You know, now you need to take a voice conversation and convert it to text. With COVID and, you know, the dawn, or at least the rapid growth, of telemedicine, there is a whole new potential market for this kind of supervision tool. Now you can capture every doctor-patient interaction that takes place over Zoom or over a Teams video, transcribe that content, and there's a wealth of value in that conversation. Not only can you tell if the doctor is responding to the patient, if the interaction is positive or negative, is the doctor helping to calm the patient down, do they have a good quality of interaction, that sort of thing. And so there's incredible value in capturing those communications, so you can learn from the, you know, learn best practices, I guess. And then feed that into a broader data lake, and correlate the interaction with patient outcomes: who are your great doctors, who are your, you know, that type of thing. So that's an area that we're very excited about going forward.
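To make that concrete, here is a toy sketch of scoring the tone of a transcribed call. This is not Veritas's method: the transcript, the tiny word lists, and the scoring rule are all illustrative assumptions, and a real system would pair an actual speech-to-text service with a trained sentiment model.

```python
# Toy lexicon-based tone scoring for a transcribed doctor-patient call.
# The word lists and sample transcript are illustrative assumptions.

POSITIVE = {"thank", "better", "good", "calm", "helpful", "great"}
NEGATIVE = {"worse", "pain", "confused", "frustrated", "angry", "unclear"}

def turn_sentiment(text: str) -> int:
    """Crude score: +1 per positive word, -1 per negative word."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

transcript = [
    ("doctor", "I understand, let's walk through the results together."),
    ("patient", "Thank you, I was confused but I feel better about it now."),
]

for speaker, text in transcript:
    print(speaker, turn_sentiment(text))  # patient scores +1 here
```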
>> Wow, that's pretty interesting. I never kind of thought that through, because I would have assumed that, you know, kind of most of the calls for this type of data were based on some type of litigation. You know, it was some type of an ask or a request. So I was going to ask you, how does that actually work within the context of this sea of data that you have? Is it usually around a specific individual who's got some issues, and you're kind of looking at their ecosystem of communications, or is it more of a pattern, or is it potentially more of a keyword type of thing that's triggering, you know, kind of this forensics into this tremendous amount of data that's in all these enterprises?

>> Yeah, it's a little bit of everything. So first of all, we have the ability to capture a lot of different native content sources, but we also leverage partners to bring in other content sources. We can capture over 80 different content sources: all the, you know, instant messaging, social media, of course email, but even voice communications and video communications. And to answer your question as far as litigation, I mean, it really depends on the incident, right? In the past, in the old days, any kind of litigation resulted in a fire drill where you're trying to find every scrap of evidence, every piece of information related to the case. By being a little bit proactive and capturing your email and your communication streams into immutable storage in an archive, you're ready for that litigation event. And you've already indexed that content, you've already classified that content, so you can find the needle in the haystack. You can find the relevant content to prove your innocence, or at least to comply with the request for information. Now that has also led to solving similar issues for the public sector: US federal, with the Freedom of Information Act. They're getting all kinds of requests right now for COVID-related communications. And that could be related to lawsuits, it could be related to just information around how stimulus funds are being spent. And they've got to respond to these requests very, very quickly.

Our team came up with a COVID-19 classification policy, where we can actually weed out the communications related to COVID-19, to allow those federal agencies, and even state and local agencies, to quickly respond to those types of requests. So that's been an exciting area for us. And then there's still the SEC requirements to monitor broker-dealers and conversations with end users, to ensure they're not doing anything they shouldn't be doing, like insider trading.

>> Right. Which is so different than kind of a post-event, you know, kind of forensics investigation and then collecting that data. So I'm curious, you know, how often are you having to update policies and update, you know, kind of the sniffers and the intelligence that goes behind the monitoring to trigger a flag? And then does that just go into their own internal kind of compliance regime and set off a whole other chain of events, I would imagine?

>> Yeah. I mean, there's a lot of things we can do with our classification policies. And like, in the case of the COVID policy, we just kind of crowd-sourced that internally, and created a policy in about a week, really. We, you know, shaped the basic policy and then kept refining it, refining it, testing it. And we were able to go from start to finish and have it publicly available within about a week and a half. It was really a great effort. And we have that kind of ability to be very nimble, to react to different types of regulations as they, you know, get out there. And then there's also a constant refining of even data privacy for every country that we support. You know, we have data privacy regulations for the entire European Union and for most countries around the world, obviously the US, Canada, Australia, and so on. And you can always make those policies better. So we've introduced feedback loops where our customers can give us feedback on what works and what isn't working, and we can tweak the policies as needed. But it is a great way to respond to whatever's going on in the world, to help our customer base, which, you know, is largely the financial verticals, the public sector verticals, but even healthcare is becoming more important for us.
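As a rough illustration of what a keyword-driven classification policy might look like, here is a minimal sketch. The patterns, the tag name, and the message shape are illustrative assumptions, not the actual Veritas policy, which was refined and tested far beyond this.

```python
import re
from dataclasses import dataclass, field

# Illustrative patterns for a COVID-19 style classification policy.
COVID_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"\bcovid[- ]?19\b", r"\bcoronavirus\b", r"\bpandemic\b",
    r"\bstimulus funds?\b", r"\bwork from home\b",
)]

@dataclass
class Message:
    sender: str
    body: str
    tags: set = field(default_factory=set)

def classify(msg: Message) -> Message:
    """Attach a 'COVID-19' tag when any policy pattern matches the body."""
    if any(p.search(msg.body) for p in COVID_PATTERNS):
        msg.tags.add("COVID-19")
    return msg

msg = classify(Message("agency@example.gov", "Update on stimulus funds disbursement"))
print(msg.tags)  # {'COVID-19'}
```

Tagging at archive time, as described above, is what makes the later search cheap: the expensive classification work happens once, on ingest, instead of during every request.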
>> So Dave, I wonder if there are some other use cases that people aren't thinking about, where you guys have seen value in this type of analytics.

>> Yeah, I mean, definitely the one thing that I think is just starting to emerge is the value that's inherent in communications. So I mentioned earlier the telemedicine idea: you know, can you learn from doctor-patient interactions if you're capturing them over telemedicine vehicles, you know, again, video chat, Zoom, and that sort of thing. But similarly, if you've captured communications for a long time, as many of our customers have, what can you do with that data? And how can it feed into a broader data lake to give you new insights? So for example, if you want to gauge whether a major deal is about to close, you know, you can rely on your sales reps to populate the CRM and give you an indication it's 10% complete, it's 50% complete, whatever. But you're dependent on all the games that salespeople play. It would be far better to look at the pattern of a traditional deal closing. You know, first you start out with one person at your company talking to one person at the target customer; that leads to meetings, that leads to calendar invites, that leads to emails being sent back and forth. You can look at the time of response: how quickly does the target customer respond to the sales rep? How often are they interacting? How many people are they interacting with? Is it spanning different geos? Is it spanning different groups within the company? Are there certain documents being sent back and forth, like a quote, for example? All of this can give you a higher confidence that that deal is going to close, or that the deal is failing; otherwise you don't really know. You can also look at historical data, and compare the current account manager to his predecessor. You know, does the current account manager interact with his customer as much as the former rep did? And is there a correlation in their effectiveness, you know, based on kind of their interactions and their just basic skills? So I think that's an exciting field, and it shines a new light on the data that you have to collect to comply with regulations, the data you have to collect for litigation and other reasons. Now there's other value there.
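A minimal sketch of the kind of deal-health signals just described, computed from message metadata alone. The domain names and field names are illustrative assumptions, not a real schema.

```python
from datetime import datetime
from statistics import mean

# Illustrative message metadata for one deal thread.
emails = [
    {"from": "rep@vendor.com",   "to": "buyer@target.com", "ts": datetime(2020, 6, 1, 9, 0)},
    {"from": "buyer@target.com", "to": "rep@vendor.com",   "ts": datetime(2020, 6, 1, 10, 30)},
    {"from": "cfo@target.com",   "to": "rep@vendor.com",   "ts": datetime(2020, 6, 2, 8, 15)},
]

def deal_signals(thread):
    """Response lag, message volume, and breadth of customer contacts."""
    inbound = [m for m in thread if m["to"].endswith("@vendor.com")]
    lags = []
    for m in inbound:
        prior = [p["ts"] for p in thread if p["ts"] < m["ts"]]
        if prior:  # hours since the most recent prior message
            lags.append((m["ts"] - max(prior)).total_seconds() / 3600)
    return {
        "messages": len(thread),
        "customer_contacts": len({m["from"] for m in inbound}),
        "avg_response_hours": mean(lags) if lags else None,
    }

print(deal_signals(emails))
```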
>> Right. That's a fascinating story. So the reason that you guys would be involved in this is because you're sitting on all that comms data, because you have to, for the regulation. I mean, what you're describing sounds like a perfect, you know, kind of Salesforce plug-in.

>> Absolutely.

>> With a much richer dataset, versus, as you said, relying on the salesperson to input the Salesforce information, which would require them to remember their password, which gets reset every three weeks. So the chances of that are pretty slim. (laughs)

>> Yeah. In fact, I think I read a stat recently that, you know, only about 10% of information is actually captured in a CRM, you know, contact information and that sort of thing. But if you're looking at their emails, if you're looking at their phone calls and their texts, and that sort of thing, you get a rich set of data on contacts and people that you're interacting with at a target customer. And, you know, sales, more than any other job, I think sales has high turnover. And so you need that record of, you know, the accounts. One account rep leaves, you don't want to lose all their contacts and start over again. You want a smooth handover to the next person.

>> Right.

>> If you capture all that content from their communications into the CRM, you're in great shape.

>> Dave, I want to get your take on something that's happening now, because you're so dialed into policy and regulations, which are such a giant determinant of what people can and should and should not do with data. When you take something like COVID, and the conversations about people going back to work, and contact tracing, to me it's like, wow! You know, it's kind of this privacy clash against HIPAA, and, you know, that's medical information. And yet it's like this particular disease has been deemed such that it kind of falls outside the traditional, you know, kind of HIPAA rules. They're not going to test me for any other ailments before I come in the door at work, but, you know, eventually we're going to be scanning people. So, you know, the levels of complexity and dynamicism, if that's even a word, around something like that, that's even a one-off within a specific, you know, kind of medical data, has got to be, you know, I guess, interesting and challenging. But from a policy perspective and the actual handling of that information, that's got to be a crazy challenge.

>> Yeah. I mean, we do expect that COVID is going to lead to all kinds of litigation and Freedom of Information Act requests. And that's a big reason why we saw the importance very quickly of a classification policy to highlight that content. So what we can do in this case is, first of all, identify where that content is stored. We have a product called Data Insight that can monitor your file system and quickly locate content: if you've got a document that includes, you know, patient data or anything related to COVID-19, we can find that. And now, as we bring in the communications, we can flag communications as we archive them and say, this is related to COVID-19. Then when litigation happens, you can do a quick search, and you can filter on the COVID-19 tag, and the people you're concerned with, and the date range you're concerned with; you can easily pull in all of the communications, all of the file content, anything related to COVID-19. And this is huge, again, for the public sector, where they're subject to, you know, Freedom of Information Act requests. But it's also going to affect every company, because, like, there's going to be litigation around when a company decided that they would work from home, and did they wait too long, you know, and did someone get sick because they weren't aggressive enough. There's going to be frivolous lawsuits, there's going to be more tangible lawsuits, and there's going to be all kinds of activity around how stimulus funds were spent and that sort of thing. So, yeah, that's a great example of a case where you've got to find the content quickly and respond to requests very quickly. Classification goes a long way there.
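And a minimal sketch of the tag-plus-people-plus-date-range retrieval just described, over an archive modeled as a plain list. The field names are illustrative assumptions; a real archive would back this with an index rather than a scan.

```python
from datetime import date

def search(archive, tag, custodians, start, end):
    """Return archived items matching a tag, a set of people, and a date range."""
    return [
        item for item in archive
        if tag in item["tags"]
        and item["custodian"] in custodians
        and start <= item["date"] <= end
    ]

archive = [
    {"custodian": "jsmith", "date": date(2020, 4, 2), "tags": {"COVID-19"}},
    {"custodian": "jsmith", "date": date(2019, 11, 5), "tags": set()},
]

hits = search(archive, "COVID-19", {"jsmith"}, date(2020, 3, 1), date(2020, 6, 30))
print(len(hits))  # 1
```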
>> Yeah. The lawyers have hardly gotten involved in this COVID thing yet. And, to your point, it's going to be both frivolous as well as justified. Did people come back too early? Did they take the right steps? It's going to be messy and sloppy, but it sounds like you're in a good position to help people get through it. So, you know, just kind of your final thoughts: you've been in this business for a long time. The rate and pace of change is only increasing; the complexity, the veracity, stealing some good old big data words; the velocity of the data is only increasing, the sources are growing exponentially. You know, as you kind of sit back and reflect, obviously a lot of exciting stuff is ahead, but what gets you up in the morning, beyond just continuing to race to keep up with the never-ending sea of regulatory change?

>> Yeah, that's a great question. I mean, I think we have a great portfolio that can really help us react to change, and take advantage of some of these new trends. And that is exciting, like telemedicine: the changes that come with COVID-19, what we could do for telemedicine, rating doctors, gauging their performance. We could do the same sort of thing for tele-education. You know, I have two kids that have had, you know, homeschooling for the last three months, and they're probably going to face that in the fall. And there might be some need to just rate how the teachers are doing, how well the classes are interacting, and what we can learn from best practices there. So I think that's an interesting space as well. But what keeps me going is the fact that we've got market-leading products in archiving, eDiscovery, and supervision. We're putting a lot of new energy into those solutions. They've been around a long time.

We've been archiving since 1998, I think, and doing supervision and discovery for 20 years. And, strangely, the market's still there, it's still expanding, it's still growing. It's kind of just keeping up with change, and trying to find better ways of surfacing the relevant human communications content. That's kind of the key to the job, I think.

>> Right. Well, yeah, finding that signal amongst the noise is going to get increasingly...

>> Exactly.

>> More difficult, which has been kind of a recurring theme here over the last 12 weeks or 15 weeks, or however long it's been. You know, this kind of light-switch moment on digital transformation is no longer "when are we going to get to it," or "we're going to do a POC," or "let's experiment a little bit here and there." It's, you know, ready, set, go, whether you're ready or not, whether that's a kindergarten teacher that's never taught online, a high school teacher, or somebody running a big business. So nothing but a great opportunity. (laughs)

>> Absolutely.

>> All right.

>> Absolutely. I mean, it's a changing world, and lots of opportunity comes with that.

>> All right. Well, Dave, thank you for sharing your insight. Obviously regulation and compliance, and I like that, you know, data supervision is not just backup and recovery; it's a much, much bigger opportunity and a lot higher value activity. So congrats to you and the team, and thanks for the update.

>> All right. Thank you, Jeff. Thanks for the time.

>> All right. He's Dave and I'm Jeff. You're watching theCUBE. Thanks for watching, we'll see you next time. (upbeat music)
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jeff | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Dave | PERSON | 0.99+ |
Dave Scott | PERSON | 0.99+ |
David Scott | PERSON | 0.99+ |
ORGANIZATION | 0.99+ | |
Microsoft | ORGANIZATION | 0.99+ |
50% | QUANTITY | 0.99+ |
Jeff Frick | PERSON | 0.99+ |
10% | QUANTITY | 0.99+ |
SEC | ORGANIZATION | 0.99+ |
Palo Alto | LOCATION | 0.99+ |
June 2020 | DATE | 0.99+ |
California Consumer Privacy Act | TITLE | 0.99+ |
Virginia | LOCATION | 0.99+ |
Freedom of Information Act | TITLE | 0.99+ |
20 years | QUANTITY | 0.99+ |
two kids | QUANTITY | 0.99+ |
US | LOCATION | 0.99+ |
FINRA | ORGANIZATION | 0.99+ |
1998 | DATE | 0.99+ |
COVID-19 | OTHER | 0.99+ |
Turkey | LOCATION | 0.99+ |
Boston | LOCATION | 0.99+ |
today | DATE | 0.99+ |
HIPAA | TITLE | 0.99+ |
about 20 years | QUANTITY | 0.99+ |
GDPR | TITLE | 0.98+ |
California protection act | TITLE | 0.98+ |
one person | QUANTITY | 0.98+ |
KVKK | ORGANIZATION | 0.98+ |
Veritas | ORGANIZATION | 0.98+ |
theCUBE | ORGANIZATION | 0.98+ |
both | QUANTITY | 0.98+ |
over 80 different content sources | QUANTITY | 0.98+ |
MiFID II | TITLE | 0.97+ |
one | QUANTITY | 0.97+ |
Australia | LOCATION | 0.97+ |
about a week and a half | QUANTITY | 0.97+ |
One | QUANTITY | 0.96+ |
Canada | LOCATION | 0.96+ |
Enterprise Vault.cloud. | TITLE | 0.96+ |
European union | ORGANIZATION | 0.95+ |
COVID-19 | TITLE | 0.94+ |
first | QUANTITY | 0.94+ |
about a week | QUANTITY | 0.92+ |
COVID | OTHER | 0.92+ |
more than half | QUANTITY | 0.91+ |
COVID | TITLE | 0.9+ |
one thing | QUANTITY | 0.89+ |
European union | ORGANIZATION | 0.88+ |
15 weeks | QUANTITY | 0.87+ |
Enterprise Vault.cloud | TITLE | 0.84+ |
CUBE | ORGANIZATION | 0.84+ |
Veritas | PERSON | 0.83+ |
about 20 different ways | QUANTITY | 0.82+ |
One account rep | QUANTITY | 0.82+ |
single device | QUANTITY | 0.79+ |
single application | QUANTITY | 0.73+ |
three weeks | QUANTITY | 0.73+ |
Enterprise | ORGANIZATION | 0.67+ |
last three months | DATE | 0.67+ |
couple of years | QUANTITY | 0.66+ |
eDiscovery | TITLE | 0.66+ |
last 12 weeks | DATE | 0.6+ |
years | QUANTITY | 0.59+ |
Azure | ORGANIZATION | 0.56+ |
Matthew Hunt | Spark Summit 2017
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

>> Welcome back to theCUBE. We're talking about data science and engineering at scale, and we're having a great time, aren't we, George?

>> We are!

>> Well, we have another guest now we're going to talk to. I'm very pleased to introduce Matt Hunt, who's a technologist at Bloomberg. Matt, thanks for joining us!

>> My pleasure.

>> Alright, we're going to talk about a lot of exciting stuff here today, but I want to first start with: you're a long-time member of the Spark community, right? How many Spark Summits have you been to?

>> Almost all of them, actually. It's quite amazing to see the 10th one, yes.

>> And you're pretty actively involved with the user group on the east coast?

>> Matt: Yeah, I run the New York users group.

>> Alright, well, what's that all about?

>> We have some 2,000 people in New York who are interested in finding out what goes on, which technologies to use, and what people are working on.

>> Alright, so hopefully you saw the keynote this morning with Matei?

>> Yes.

>> Alright, any comments or reactions to the things that he talked about as priorities?

>> Well, I've always loved the keynotes at the Spark Summits, because they announce something that you don't already know is coming in advance, at least for most people. The second Spark Summit actually had people gasping in the audience while they were demoing, a lot of senior people--

>> Well, the one millisecond today was kind of a wow one--

>> Exactly, and I would say that the one thing to pick out of the keynote that really stood out for me was the changes and improvements they've made for streaming, including potentially being able to do sub-millisecond times for some workloads.

>> Well, maybe talk to us about some of the apps that you're building at Bloomberg, and then I want you to join in, George, and drill down into some of the details.

>> Sure. Bloomberg is a large company with 4,000-plus developers; we've been working on apps for 30 years, so we actually have a wide range of applications, almost all of which are for news and the financial industry. We have a lot of homegrown technology that we've had to adapt over time, starting from when we built our own hardware, but there are some significant things that some of these technologies can potentially really help simplify over time. Some recent ones: trade anomaly detection would be one. How can you look for patterns of insider trading? How can you look for bad trades or attempts to spoof? There's a huge volume of trade data that comes in, so that's a natural application. Another would be regulatory: there's a regulatory system called MiFID, or MiFID II, the regulation required in Europe, where you have to be able to record every trade for seven years and provide daily reports, so there's clearly a lot around that. And then I would also just say, our other internal databases have significant analytics that can be done on them, and that's just kind of scratching the surface.

>> These applications sound like they're oriented towards streaming solutions, and really low latency. Has that been a constraint on what you can build so far?
>> I would definitely say that we have some things that are latency constrained. It tends to be not like high-frequency trading, where you care about microseconds, but milliseconds are important: how long does it take to get an answer? But I would say equally important with latency is efficiency, and those two often wind up being coupled together, though not always.

>> And so when you say coupled, is it because it's a trade-off, or 'cause you need both?

>> Right, so it's a little bit of both. For a number of things, there's an upper threshold for the latency that we can accept. Certain architectural changes imply higher latencies, but often greater efficiencies. Micro-batching often means that you can simplify and get greater throughput, but at a cost of higher latency. On the other hand, if you have a really large volume of things coming in, and your method of processing them isn't efficient enough, it gets too slow simply from that, and that's why it's not just one or the other.

>> So in getting down to one millisecond or below, can they expose knobs where you can choose the trade-offs between efficiency and latency, and is that relevant for the apps that you're building?

>> I mean, clearly if you can choose between micro-batching and not micro-batching, that's a knob that you can have, so that's one explicit one. But part of what's useful is, often when you sit down to try and determine what the main cause of latency is, you have to look at the full profile of the stack it's going through, and then you discover other inefficiencies that can be ironed out, and so it just makes it faster overall. I would say a lot of what the Databricks guys in the Spark community have worked on over the years is connected to that, Project Tungsten and so on: all these things that made things much slower, much less efficient than they needed to be, and we can close that gap a lot. I would say that from the very beginning.
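For reference, that micro-batch versus continuous knob surfaces directly in Spark's Structured Streaming API. A minimal sketch follows, assuming a Kafka source with the spark-sql-kafka package on the classpath; the broker and topic names are illustrative, and the continuous trigger shipped after this conversation, in Spark 2.3, limited to map-like operations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("latency-knob").getOrCreate()

# Read a stream of trades from Kafka (broker and topic are illustrative).
trades = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "trades")
          .load())

# Micro-batch trigger: better throughput and efficiency, with latency
# bounded below by the batch interval.
query = (trades.writeStream
         .format("console")
         .trigger(processingTime="2 seconds")
         .start())

# Continuous trigger (Spark 2.3+): rows are processed as they arrive,
# for roughly millisecond latency; the argument is the checkpoint
# interval, not a batch size.
# query = (trades.writeStream
#          .format("console")
#          .trigger(continuous="1 second")
#          .start())

query.awaitTermination()
```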
As always, there's a gap in any such system between theory and reality, and how much can you close that gap, but as for storage systems, this is something that, you and I have talked about this before, and I've written about it a fair amount too, Spark is historically an analytic system, so you have a bunch of data, and you can do analytics on it, but where's that data come from? Well, either it's streaming in, or you're reading from files, but most people need, essentially, an actual database. So what constitutes the universal system? You need file store, you need a distributive file store, you need a database with generally transactional semantics because the other forms are too hard for people to understand, you need analytics that are extensible, and you need a way to stream data in, and there's how close can you get to that, versus how much do you have to fit other parts that come together, very interesting question. >> So, so far, they've sort of outsourced that to DIY, do-it-yourself, but if they can find a sufficiently scalable relational database, they can do the sort of analytical queries, and they can sort of maintain state with transactions for some amount of the data flowing through. My impression is that, like Cassandra would be the, sort of the database that would handle all updates, and then some amount of those would be filtered through to a multi-model DBMS. When I say multi-model, I mean handles transactions and analytics. Knowing that you would have the option to drop that out, what applications would you undertake that you couldn't use right now, where the theme was, we're going to take big data apps into production, and then the competition that they show for streaming is of Kafka and Flink, so what does that do to that competitive balance? >> Right, so how many pieces do you need, and how well do they fit together is maybe the essence of that question, and people ask that all the time, and one of the limits has been, how mature is each piece, how efficient is it, and do they work together? And if you have to master 5,000 skills and 200 different products, that's a huge impediment to real-world usage. I think we're coalescing around a smaller set of options, so in the, Kafka, for example, has a lot of usage, and it seems to really be, the industry seems to be settling on that is what people are using for inbound streaming data, for ingest, I see that everywhere I go. But what happens when you move from Kafka into Spark, or Spark has to read from a database? This is partly a question of maturity. Relational databases are very hard to get right. The ones that we have have been under development for decades, right? I mean, DB2 has been around for a really long time with very, very smart people working on it, or Oracle, or lots of other databases. So at Bloomberg, we actually developed our own databases for relational databases that were designed for low latency and very high reliability, so we actually just opensourced that a few weeks ago, it's called ComDB2, and the reason we had to do that was the industry solutions at the time, when we started working on that, were inadequate to our needs, but we look at how long that took to develop for these other systems and think, that's really hard for someone else to get right, and so, if you need a database, which everyone does, how can you make that work better with Spark? 
And I think there're a number of very interesting developments that can make that a lot better, short of Spark becoming and integrating a database directly, although there's interesting possibilities with that too. How do you make them work well together, we could talk about for a while, 'cause that's a fascinating question. >> On that one topic, maybe the Databricks guys don't want to assume responsibility for the development, because then they're picking a winner, perhaps? Maybe, as Matei told us earlier, they can make the APIs easier to use for a database vendor to integrate, but like we've seen Splice Machine and SnappyData do the work, take it upon themselves to take data frames, the core data structure, in Spark, and give it transactional semantics. Does that sound promising? >> There're multiple avenues for potential success, and who can use which, in a way, depends on the audience. If you look at things like Cassandra and HBase, they're distributing key value stores that additional things are being built on, so they started as distributed, and they're moving towards more encompassing systems, versus relational databases, which generally started as single image on single machine, and are moving towards federation distribution, and there's been a lot with that with post grads, for example. One of the questions would be, is it just knobs, or why don't they work well together? And there're a number of reasons. One is, what can be pushed down, how much knowledge do you have to have to make that decision, and optimizing that, I think, is actually one of the really interesting things that could be done, just as we have database query optimizers, why not, can you determine the best way to execute down a chain? In order to do that well, there are two things that you need that haven't yet been widely adopted, but are coming. One is the very efficient copy of data between systems, and Apache Arrow, for example, is very, very interesting, and it's nearing the time when I think it's just going to explode, because it lets you connect these systems radically more efficiently in a standardized way, and that's one of the things that was missing, as soon as you hop from one system to another, all of a sudden, you have the semantic computational expense, that's a problem, we can fix that. The other is, the next level of integration requires, basically, exposing more hooks. In order to know, where should a query be executed and which operator should I push down, you need something that I think of as a meta-optimizer, and also, knowledge about the shape of the data, or statistics underlying, and ways to exchange that back and forth to be able to do it well. >> Wow, Matt, a lot of great questions there. We're coming up on a break, so we have to wrap things up, and I wanted to give you at least 30 seconds to maybe sum up what you'd like to see your user community, the Spark community, do over the next year. What are the top issues, things you'd love to see worked on? >> Right. It's an exciting time for Spark, because as time goes by, it gets more and more mature, and more real-world applications are viable. 
The hardest thing of all is to get, anywhere you in any organization's to get people working together, but the more people work together to enable these pieces, how do I efficiently work with databases, or have these better optimizations make streaming more mature, the more people can use it in practice, and that's why people develop software, is to actually tackle these real-world problems, so, I would love to see more of that. >> Can we all get along? (chuckling) Well, that's going to be the last word of this segue, Matt, thank you so much for coming on and spending some time with us here to share the story! >> My pleasure. >> Alright, thank you so much. Thank you George, and thank you all for watching this segment of theCUBE, please stay with us, as Spark Summit 2017 will be back in a few moments.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
George | PERSON | 0.99+ |
Matt Hunt | PERSON | 0.99+ |
Bloomberg | ORGANIZATION | 0.99+ |
Matthew Hunt | PERSON | 0.99+ |
Matt | PERSON | 0.99+ |
Matei | PERSON | 0.99+ |
New York | LOCATION | 0.99+ |
San Francisco | LOCATION | 0.99+ |
30 years | QUANTITY | 0.99+ |
seven years | QUANTITY | 0.99+ |
each piece | QUANTITY | 0.99+ |
Databricks | ORGANIZATION | 0.99+ |
one | QUANTITY | 0.99+ |
one millisecond | QUANTITY | 0.99+ |
5,000 skills | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
two things | QUANTITY | 0.99+ |
One | QUANTITY | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
Spark | TITLE | 0.98+ |
Europe | LOCATION | 0.98+ |
Spark Summit 2017 | EVENT | 0.98+ |
DB2 | TITLE | 0.98+ |
200 different products | QUANTITY | 0.98+ |
Spark Summits | EVENT | 0.98+ |
Spark Summit | EVENT | 0.98+ |
today | DATE | 0.98+ |
one system | QUANTITY | 0.97+ |
next year | DATE | 0.97+ |
4,000-plus developers | QUANTITY | 0.97+ |
first | QUANTITY | 0.96+ |
HBase | ORGANIZATION | 0.95+ |
second | QUANTITY | 0.94+ |
decades | QUANTITY | 0.94+ |
MiFID II | TITLE | 0.94+ |
one topic | QUANTITY | 0.92+ |
this morning | DATE | 0.92+ |
single machine | QUANTITY | 0.91+ |
One of | QUANTITY | 0.91+ |
ComDB2 | TITLE | 0.9+ |
few weeks ago | DATE | 0.9+ |
Cassandra | PERSON | 0.89+ |
earlier today | DATE | 0.88+ |
10th one | QUANTITY | 0.88+ |
2,000 people | QUANTITY | 0.88+ |
one thing | QUANTITY | 0.87+ |
Kafka | TITLE | 0.87+ |
single image | QUANTITY | 0.87+ |
MiFID | TITLE | 0.85+ |
Spark | ORGANIZATION | 0.81+ |
Splice Machine | TITLE | 0.81+ |
Project Tungsten | ORGANIZATION | 0.78+ |
theCUBE | ORGANIZATION | 0.78+ |
at least 30 seconds | QUANTITY | 0.77+ |
Cassandra | ORGANIZATION | 0.72+ |
Apache Spark | ORGANIZATION | 0.71+ |
questions | QUANTITY | 0.7+ |
things | QUANTITY | 0.69+ |
Apache Arrow | ORGANIZATION | 0.69+ |
SnappyData | TITLE | 0.66+ |