Mike Cohen, Splunk | Leading with Observability


 

(upbeat music playing)

>> Narrator: From theCUBE's studios in Palo Alto and in Boston, connecting with thought leaders all around the world, this is a CUBE conversation.

>> John: Hello, everyone, welcome to this CUBE conversation. I'm John Furrier, host of theCUBE. We're doing a content series called Leading with Observability, and this segment is on network observability for distributed services. We have CUBE alumni Mike Cohen, head of product management for network monitoring at Splunk. Mike, great to see you. It's been a while, going back to the OpenStack days and Red Hat Summit. Now you're here talking about observability with Splunk. Great to see you.

>> Mike: Thanks a lot for having me.

>> John: So right now observability is at the center of all the conversations, from monitoring to infrastructure investment, on-premises, cloud, and also cybersecurity. A lot of conversations, a lot of broad-reaching implications for observability. You're the head of product management for network observability at Splunk, and this is where the conversation is going: getting down to the network layer, down to where the packets move around. This is becoming important. Why is this the trend? What's the situation?

>> Mike: Yeah, so we're seeing a couple of different trends that are really driving how people think about observability. One of them is this huge migration towards public cloud architecture, where you're running on infrastructure you don't own yourself. The other is around how people are rebuilding and refactoring applications around service-based architectures, scale-out models, and cloud-native paradigms. Both of these things introduce a lot of new complexity into applications and really increase the surface area where problems can occur. What this means is that when you have gaps in visibility, or places where a separate tool is analyzing parts of your system, it becomes very hard to debug when things go wrong and to figure out where problems occur. What we've seen is that people really need an integrated solution to observability, one that can span from what your user is seeing all the way down to the deepest backend services and the core of the infrastructure you're operating, so you can really figure out where problems occur. Network observability plays a critical role in filling in one of those critical gaps.

>> John: You think about the past decade we've been on this wave, and it feels like now more than ever it's an inflection point, because of how much value cloud native has created: value creation, time to market, all the reasons people are investing in modern applications. But as you build out your architecture and infrastructure to make that happen, there's more going on. Everything-as-a-service creates new dependencies, new things to document. That's an opportunity on one hand; on the other hand, it's a technical challenge. Balancing technical debt against deploying new stuff, you've got to monitor it all. Monitoring has turned into observability, which is just a code word for cloud-scale monitoring, I guess. Is that how you see it? How do you talk about this? Because there's certainly a major shift happening right now, and the transition is pretty obvious.
>> Mike: Yeah, no, absolutely. We've seen a lot of new interest in the network visibility and network monitoring space. Again, one driver is that network infrastructure is becoming increasingly opaque as you move towards public cloud environments. It's always been a fun thing to blame the network, to say, "Oh, it's the network, we don't know what's going on." But it's not always the network. Sometimes it is, sometimes it isn't. You need to understand where these problems are really occurring, and to do that you need the right level of visibility into your systems. The other way we've started talking to people about this is that the network is an empowering capability, an untapped resource from which you can get new data about your distributed systems. SREs are struggling to understand these complex environments, but with capabilities like eBPF and monitoring from the operating system, we can get visibility into how processes and containers communicate, and that gives us insight into the system. It's a new source of data that hasn't existed in the past and is now available to help with the broader observability problem.

>> John: You mentioned SREs, Site Reliability Engineers, as the role is known. Google kind of pioneered it, and it's become a standard persona in large-scale infrastructure and cloud environments. Are you seeing the SRE role become more mainstream in enterprises? Some enterprises might not call the role SRE; they might call it the cloud architect. Can you tie that together? Because it is certainly happening. Is it proliferating?

>> Mike: For sure, absolutely. The title may vary across organizations, as you point out, and sometimes the exact organizational breakdown varies. But this role of someone who really cares about keeping the system up, caring for it, scaling it out, and thinking about its architecture is now a really critical one. Sometimes that role sits alongside the developers who are writing the code. This is happening in almost every organization we're dealing with today. It is becoming a mainstream occurrence.

>> John: Yeah, it's interesting. I'm going to ask you about what businesses are missing when they think about observability, but since you brought up that piece: it's almost as if Kubernetes created a demarcation line between the top half and the bottom half of the stack. You can do a lot of engineering underneath, from the bottom of the stack up to Kubernetes, and above that you can just be an infrastructure-as-code application developer. It's almost leveled out, with nice lanes. I'm oversimplifying it, but how do you react to that? Do you see it evolving that way too? Because it all seems cleaner now: you're engineering either below Kubernetes or above it.

>> Mike: Oh, absolutely. It's definitely where you see some of the deepest engagement.
As folks move towards Kubernetes, they start embracing containers and building microservices, and you'll see development teams really accelerate their pace of innovation in that environment. That's the driver behind this, so we do see that rebuilding and refactoring as one of the biggest drivers behind these initiatives.

>> John: What are businesses missing around observability? It seems to be, first of all, a very overfunded segment: a lot of new startups coming in, security vendors over here, network folks moving in. It's almost becoming a fabric feature of things. What does that mean for businesses? What are businesses missing or getting? How are people evaluating observability?

>> Mike: Sure. I'll start generically and then talk a little about the network area specifically. One of the things people are realizing they need in observability is an integrated suite. A disparate set of tools makes it very hard for SREs to take advantage of all of them and to use the data inside them to solve meaningful problems. As we've talked to more people in the industry, what they really want is something that brings all that data together and turns it into insight that helps them solve problems more quickly. That's the broader context of what's going on, and it's driving some of the work we're doing on the network side, because the network is a powerful new data set we can combine with everything else people have already been doing in observability.

>> John: What do you think about programmability? That's been a big topic. When you get into that mindset, you're almost bringing in the software-defined aspect heavily. How does that play in? What's your vision around making the network adaptable, programmable, measurable, fully surveilled?

>> Mike: Yeah. Again, what we're focused on is the capability of using the network as a means of visibility and observability for these systems. Networks are becoming highly flexible; once people get into a cloud environment, they have a very rich set of networking capabilities. What they want is to use that as a way of getting visibility into the system. I can talk for a minute or two about some of the capabilities we're exposing in network observability. One of them is being able to visualize and optimize a service architecture, really seeing what's connecting to what, automatically. We've been using a technology called eBPF, the extended Berkeley Packet Filter, which is part of everyone's Linux operating system: if you're running Linux, you basically have it already. It gives you an interesting touch point to observe the behavior of every process and container automatically. You can see, with very little overhead, what they're doing, and correlate that with data from systems like Kubernetes to understand how the distributed system behaves, to see how things connect to each other. We can use this to build a complete service map of the system in seconds, automatically, without developers having to do any additional work and without forcing anyone to change their code. They get visibility across the entire system automatically.
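To make that concrete, here is a minimal sketch of the kind of OS-level signal Mike is describing, in the spirit of the bcc toolkit's classic tcpv4connect tutorial example rather than Splunk's actual agent (the probe names and output format are illustrative; running it requires the bcc package and root privileges):

```python
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>
#include <net/sock.h>

BPF_HASH(currsock, u32, struct sock *);

// Entry probe: remember which socket this thread is trying to connect.
int kprobe__tcp_v4_connect(struct pt_regs *ctx, struct sock *sk) {
    u32 tid = bpf_get_current_pid_tgid();
    currsock.update(&tid, &sk);
    return 0;
}

// Return probe: once connect() succeeds, the destination fields are populated.
int kretprobe__tcp_v4_connect(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    struct sock **skpp = currsock.lookup(&tid);
    if (skpp == 0)
        return 0;
    if (PT_REGS_RC(ctx) == 0) {
        struct sock *skp = *skpp;
        u32 daddr = skp->__sk_common.skc_daddr;
        u16 dport = skp->__sk_common.skc_dport;
        bpf_trace_printk("tcp connect daddr=%x dport=%d\n", daddr, ntohs(dport));
    }
    currsock.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)  # compiles the C and attaches the kprobe/kretprobe pair by name
print("Tracing outbound TCP connects... Ctrl-C to stop.")
b.trace_print()     # streams each observed connection from the kernel trace pipe
```

Every outbound connection from every process shows up here with no change to application code, which is the property the conversation keeps returning to.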
>> John: That's like the original value proposition of Splunk when it came out: it was just a great tool for getting at the data in logs. Now the data is more complex, you're still instrumenting critical services, and they're now microservices; the trends are at the top of the stack and at the network layer. The network layer has always been a hard nut to crack, so I've got to ask you: why now? You mentioned earlier that everyone used to blame the network: "Oh, it's not my problem." You really can't finger-point once you have full instrumentation of the traffic patterns and the underlying processes. There seems to be good magic going on here. What's the core issue? Why is the time now?

>> Mike: Yeah. Unreliable networks, slow networks, DNS problems: these have always been present in systems. The problem is that they're becoming exacerbated because people have less visibility into them, and as systems become more distributed, the failure modes get more complex. Some of the longest, most challenging troubleshooting sessions are these network issues, which tend to be transient, tend to bounce around the system, and tend to trigger unrelated alerts inside your application stack, with multiple teams troubleshooting problems that don't really exist. So the network has caused some of the most painful outages teams see. When these outages happen, what you really need to know is: is it truly a network problem, or is it something in another part of my system? If I'm running a distributed service, which services are affected? Because that's the language my team thinks in now. As you mentioned, they're in Kubernetes, so they're trying to figure out which Kubernetes services are affected by the potential network outage I'm worried about. The other aspect is figuring out the scope of the impact. Are a couple of instances in my cloud provider not doing well? Is an entire availability zone having problems? Is a whole region an issue? Understanding the scope of the problem helps me, as an SRE, decide on the right mitigation, and by limiting the response as much as possible, it helps me hit my SLA, because I won't have to hit something with a huge hammer when a really small one might solve the problem.
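That scoping decision lends itself to a simple worked example. The sketch below is hypothetical (the instance IDs, zones, and threshold are made up, and a real system would use live telemetry), but it shows the shape of the question: group per-instance network health by cloud metadata and ask whether the blast radius is an instance, a zone, or wider.

```python
from collections import defaultdict

# (instance_id, availability_zone, TCP retransmit rate) -- illustrative data.
observations = [
    ("i-0a1", "us-east-1a", 0.002),
    ("i-0a2", "us-east-1a", 0.003),
    ("i-0b1", "us-east-1b", 0.190),
    ("i-0b2", "us-east-1b", 0.230),
    ("i-0c1", "us-east-1c", 0.001),
]

UNHEALTHY = 0.05  # retransmit rate above which an instance counts as degraded

by_zone = defaultdict(list)
for instance, zone, rate in observations:
    by_zone[zone].append(rate > UNHEALTHY)

for zone, flags in sorted(by_zone.items()):
    bad, total = sum(flags), len(flags)
    if bad == total:
        print(f"{zone}: zone-wide problem ({bad}/{total} degraded)")   # big hammer
    elif bad:
        print(f"{zone}: isolated instances ({bad}/{total} degraded)")  # small hammer
    else:
        print(f"{zone}: healthy")
```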
>> John: Yeah, this is one of the things that comes up. Just hearing you talk, I can see how complex it could be for the customer simply to document the dependencies. As services come online, some of them are going to be very dynamic, not just at the network layer but at the application level too; we mentioned Kubernetes, and you've got service meshes and microservices. You're going to need to track all of this, and that's a big part of what your suite does right now: the ability to help there. How are you guys helping people do that?

>> Mike: Yeah, absolutely. Understanding dependencies is one of the key aspects of these distributed systems. This began as a simple problem: you have a monolithic application, it runs on one machine, you understand its behavior. Once you move towards microservices, it's very easy for that to change from a handful of microservices, to hundreds, to thousands, running across thousands or tens of thousands of machines as you get bigger. Understanding that environment becomes a major challenge for teams. They end up with a hand-drawn diagram of how their services are supposed to behave, or they find out there's an interaction they didn't expect, and that may be the source of an issue. One of the capabilities we have, using network monitoring out of the operating system with eBPF, is that we can automatically discover every connection that's made. If you're able to watch the sockets as they're created in Linux, you can see how containers interact with each other, and you can use that to build automatic service dependency diagrams. Without the user having to change their code or anything about their system, you can automatically discover those dependencies, and you'll find things you didn't expect, things that changed over time and weren't well documented. That's the critical level of understanding you need in these environments.
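A toy version of that aggregation step makes the idea concrete. This is not Splunk's pipeline: the IPs and service names are invented, and in a real deployment the connection tuples would stream in from the eBPF agent and the IP-to-workload mapping would come from the Kubernetes API. The point is only how raw socket observations collapse into a dependency map.

```python
from collections import Counter

# Raw (source IP, destination IP) pairs, as an eBPF agent might report them.
connections = [
    ("10.0.1.4", "10.0.2.9"),
    ("10.0.1.4", "10.0.2.9"),
    ("10.0.1.4", "10.0.3.7"),
    ("10.0.2.9", "10.0.3.7"),
]

# Pod IP -> owning workload, as resolved from Kubernetes metadata (hardcoded here).
ip_to_service = {
    "10.0.1.4": "frontend",
    "10.0.2.9": "checkout",
    "10.0.3.7": "postgres",
}

def build_service_map(conns, resolver):
    """Collapse raw IP pairs into weighted service-to-service edges."""
    edges = Counter()
    for src_ip, dst_ip in conns:
        edge = (resolver.get(src_ip, "unknown"), resolver.get(dst_ip, "unknown"))
        edges[edge] += 1
    return edges

for (src, dst), n in sorted(build_service_map(connections, ip_to_service).items()):
    print(f"{src} -> {dst}  ({n} connections observed)")
```

Edges nobody drew on the whiteboard, such as an unexpected "frontend -> postgres" hop, are exactly the undocumented interactions Mike describes surfacing.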
>> John: It's interesting, you might have missed these things in the past, either because you weren't tracking the network well or you were using a different network tool. Packet loss by itself is one signal; service and host health is another. And if you can track everything, then you've got to build on it. So I love this direction. My question really is: how do you operationalize it? As an operator, am I getting alerts? Does it just auto-discover? How does this all work from a usability standpoint? What are the key features, what gets unlocked from that kind of instrumentation?

>> Mike: Yeah. When you do this instrumentation correctly, it can be automatic. You can run an agent on your instances that collects data based on the traffic and the interactions that occur, without you having to take any action. That's really the holy grail, and it's where some of the best value of these systems emerges: it just works out of the box. Then it pulls data from other systems, from your cloud provider and your Kubernetes environment, and uses that to build a picture of what's going on. That's where these systems get super valuable: they just work, without you having to do a ton of work behind the scenes.

>> John: So Mike, I've got to ask you a final question. Explain the distributed services aspect of observability. What should people walk away with as the main concept, and how does it apply to their environment? What should they be thinking about? What's the real story there?

>> Mike: Yeah, so the way we're thinking about this is: how can you turn the network from a liability into a strength in these distributed environments? By observing data at the network level, out of the operating system, you can automatically construct service maps, learn about your system, and improve the insight and understanding you have of your complex systems. You can identify network problems as they occur. You can understand how you're utilizing aspects of the network, which can drive things like cost optimization in your environment. You get better insights, you can troubleshoot problems better, and you can handle the blame game: is the network really the problem I'm seeing, or is it occurring somewhere else in my application? That's really critical in these complex distributed environments. And critically, you can do it in a way that doesn't add overhead to your development team. You don't have to change the code. You don't have to take on a complex engineering task. You can simply deploy agents that collect this data automatically.

>> John: Awesome. Take that complexity away, automate it, and help people get the job done. Great stuff. Mike, thanks for coming on theCUBE. Leading with Observability, I'm John Furrier with theCUBE. Thanks for watching.

>> Mike: Yeah, thanks a lot.

(gentle music playing)

Published: Feb 22, 2021

