PUBLIC SECTOR Speed to Insight
>>Hi, this is Cindy Mikey, vice president of industry solutions at Cloudera. Joining me today is Shev, our solutions engineer for the public sector. Today we're going to talk about speed to insight: why we use machine learning in the public sector, specifically around fraud, waste and abuse. As topics for today, we'll discuss machine learning and why the public sector uses it to target fraud, waste and abuse; the challenges; how we enhance your data and analytical approaches; the data landscape and analytical methods; and then Shev will go over a reference architecture and a case study. By definition, per the Government Accountability Office, fraud is an attempt to obtain something of value through unwelcome misrepresentation, waste is about squandering money or resources, and abuse is about behaving improperly or unreasonably to obtain something of value for your personal benefit. As we look at fraud across all industries, it's a top-of-mind area within the public sector. >>The types of fraud that we see are specifically around cybercrime, accounting fraud (whether from an individual perspective or within organizations), financial statement fraud, and bribery and corruption. Fraud really hits us from all angles, from both external and internal perpetrators; in fact, per the research by PwC, over half of fraud comes through some form of internal or external perpetrator. Looking at a recent report by the Association of Certified Fraud Examiners: within the US government in 2017, roughly $148 billion was identified as attributable to fraud, waste and abuse.
Specifically, of that, $57 billion was tied to reported monetary losses, and another $91 billion to areas where the monetary impact had not yet been measured. >>Breaking those areas down from an improper-payment perspective: over $65 billion within the health system, over $51 billion within social services, plus procurement fraud, fraud, waste and abuse in the grants and loan processes, payroll fraud, and other categories. Quite a few different topical areas. So, given those broad-stroke areas, what are the actual use cases agencies are pursuing, what is the data landscape, and what data and analytical methods can we use to help curtail and prevent some of the fraud, waste and abuse? As we look at analytical use cases in the public sector, from taxation to social services to public safety to other agency functions, we're going to focus specifically on use cases around fraud within the tax area. >>We'll briefly look at aspects of unemployment insurance fraud, benefit fraud, and payment integrity. Fraud has its underpinnings in quite a few different government agencies, different analytical methods, and the usage of different data. One of the key elements is that you can look at your data landscape as the specific data sources you need, but it's really about bringing together different data sources across a variety of types and velocities. Data has different dimensions.
We'll look at structured and semi-structured data and behavioral data. With predictive models we're typically looking at historical information, but if we're trying to prevent fraud before it happens, or while a case is in flight (which is specifically a use case Shev will talk about later), the question is how to work with that real-time, streaming information. >>How do I take advantage of data such as financial transactions, asset verification, tax records, and corporate filings? We can also look at more advanced data sources for investigation-type work: deep learning models over semi-structured or unstructured behavioral data, such as camera analysis and so forth. So there's quite a variety of data, and the breadth of the opportunity really comes when you can integrate and analyze data across all those different sources; in essence, a more extensive data landscape. Specifically, I want to focus on some of the methods, data sources, and analytical techniques we see being used in government agencies, as well as opportunities to adopt new methods. >>For audit planning, or assessing the likelihood of non-compliance, we see data sources such as a constituent's profile; we might investigate the forms they provided, compare that data against internal data sources, possibly look at net worth, compare it against other financial data, and compare across other constituent groups.
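The peer-group comparison just described can be sketched as a simple outlier check: score each constituent's reported figure against the distribution of their peer group and flag large deviations. This is a minimal illustration, not Cloudera tooling; the names, net-worth figures, and threshold are all made up.

```python
from statistics import mean, stdev

def peer_group_outliers(reported, threshold=3.0):
    """Flag constituents whose reported figure deviates from the
    peer-group mean by more than `threshold` standard deviations."""
    mu = mean(reported.values())
    sigma = stdev(reported.values())
    return {
        name: round((value - mu) / sigma, 2)  # z-score of the flagged filer
        for name, value in reported.items()
        if abs(value - mu) > threshold * sigma
    }

# Illustrative net-worth figures for one peer group (all values made up)
peer_group = {
    "constituent_a": 52_000,
    "constituent_b": 48_500,
    "constituent_c": 51_200,
    "constituent_d": 49_800,
    "constituent_e": 410_000,  # sits far outside the peer distribution
}
print(peer_group_outliers(peer_group, threshold=1.5))
```

A single z-score is crude on its own; in practice an agency would layer several such signals, but the structure (profile the peer group, score the individual) stays the same.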
Some of the techniques we use are basic natural language processing; maybe we'll do some text mining, or some probabilistic modeling, where we look at information within the agency and compare it against tax forms. Historically this information has been processed in batch, both structured and semi-structured. The data volumes can be low, but we're seeing them increase exponentially based on the types of events we're dealing with and the number of transactions. >>So getting the throughput matters, and Shev will talk specifically about that in a moment. The other areas of opportunity build on this: how do I handle compliance, how do I conduct audits or investigate potential fraud, and how do I find under-reported tax information? There you might pull in other types of data sources, whether property records, data supplied by the constituents themselves or by vendors, social media information, geographical information, or photos. Techniques we see used include sentiment analysis and link analysis: how do we blend those data sources together with natural language processing? What's important here is also the data velocity, whether batch or near real time, across all types of data, structured, semi-structured, or unstructured. The key and the value behind this is in recovering potential revenue from under-reported revenue, and in stopping fraudulent payments before they occur.
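The link-analysis technique just mentioned can be sketched very simply: group filings that share an identifying attribute value, since a bank account or address reused across supposedly unrelated filers is a classic review trigger. This is an illustrative sketch only; the records and field names are invented.

```python
from collections import defaultdict

def link_constituents(records):
    """Group filings that share an identifying attribute value
    (address, bank account) -- a basic link-analysis step."""
    by_value = defaultdict(set)
    for rec in records:
        for field in ("address", "bank_account"):
            if rec.get(field):
                by_value[(field, rec[field])].add(rec["id"])
    # Any attribute value shared by two or more distinct filers is a
    # link worth human review.
    return {key: sorted(ids) for key, ids in by_value.items() if len(ids) > 1}

# Illustrative filings: two "different" filers share one bank account
filings = [
    {"id": "F-101", "address": "12 Oak St", "bank_account": "ACCT-9"},
    {"id": "F-102", "address": "77 Elm Rd", "bank_account": "ACCT-9"},
    {"id": "F-103", "address": "3 Pine Ave", "bank_account": "ACCT-4"},
]
print(link_constituents(filings))
```

Real link analysis would run over a graph of many attribute types (phones, IPs, addresses from property and police records), but the core operation is this same shared-attribute join.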
It's also about increasing the level of compliance and improving the potential for prosecution of fraud cases. Additional areas of opportunity include economic planning. How do we perform link analysis, bringing in more of what we saw in the data landscape: constituent interaction, social media, potentially police records, property records, and other tax department database information? And then comparing one individual to other individuals, looking at people like a specific constituent: are there areas where we see other aspects of fraud potentially occurring? As we move forward, more advanced techniques around deep learning include computer vision, leveraging geospatial information, social network entity analysis, and agent-based modeling: simulation and Monte Carlo techniques that we typically see in the financial services industry, applied to fraud, waste and abuse within the public sector. That really lends itself to new opportunities. And on that, I'm going to turn it over to Shev to talk about the reference architecture for these use cases. >>Thanks, Cindy. I'm going to walk you through an example reference architecture for fraud detection using Cloudera's underlying technology. Before I get into the technical details, I want to talk about how this would be implemented at a much higher level. With fraud detection, what we're trying to do is identify anomalies, or novel behavior, within our data sets. Now, in order to understand what aspects of our incoming data represent anomalous behavior, we first need to understand what normal behavior is.
In essence, once we understand normal behavior, anything that deviates from it can be thought of as an anomaly. And in order to understand what normal behavior is, we need to be able to collect, store, and process a very large amount of historical data. That's where Cloudera's platform and the reference architecture you see before you come in. Let's start on the left-hand side of this reference architecture, with the collect phase. >>Fraud detection will always begin with data collection. We need to collect large amounts of information from systems that could be in the cloud, in the data center, or even on edge devices, and this data needs to be collected so we can create our normal-behavior profiles. These profiles will then in turn be used to create our predictive models for fraudulent activity. Now, on the data collection side, one of the main challenges many organizations face involves using a single technology that can handle data coming in with all different formats, protocols, and standards, at different varieties and velocities. Let me give you an example. We could be collecting data from a database that gets updated daily, and maybe that data is being collected in Avro format.
>>At the same time, we could be collecting data from an edge device that's streaming in every second, and that data may be coming in JSON or a binary format. This is a data collection challenge that can be solved with Cloudera DataFlow, a suite of technologies built on Apache NiFi and MiNiFi, allowing us to ingest all of this data through a drag-and-drop interface. So now we're collecting all of the data required to map out normal behavior. The next thing we need to do is enrich it, transform it, and distribute it to downstream systems for further processing. Let's walk through how that would work. First, enrichment: think of enrichment as adding additional information to your incoming data. Take financial transactions, for example, since Cindy mentioned them earlier. >>You can store the known locations of an individual in an operational database; with Cloudera that would be HBase. As an individual makes a new transaction, the geolocation in that transaction data can be enriched with the previously known locations of that very same individual, and all of that enriched data can later be used downstream for predictive analysis. So the data has been enriched; now it needs to be transformed. We want the data coming in as Avro, JSON, binary, and whatever other formats to be transformed into a single common format, so it can be used downstream for stream processing. Again, this is done through Cloudera DataFlow, which is backed by NiFi. The transformed, standardized data is then streamed to Kafka, and Kafka serves as a central repository, a buffer zone. >>Kafka provides you with extremely fast, resilient, and fault-tolerant storage, and it also gives you the consumer APIs you need to enable a wide variety of applications to leverage the enriched and transformed data within your buffer zone. I'll add that you can also store that data in a distributed file system, giving you the historical context you're going to need later on for machine learning. The next step in the architecture is to leverage Cloudera SQL Stream Builder, which enables us to write streaming SQL jobs on top of Apache Flink, so we can filter, analyze, and understand the data in the Kafka buffer zone in real time.
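The enrich, transform, distribute sequence can be sketched end to end in plain Python. In the real architecture these steps are NiFi processors, the profile lookup is an HBase get, and the publish is a Kafka produce, so the dict and list below are only stand-ins; all field names and records are invented for illustration.

```python
import json

# Stand-in for the HBase operational table of known locations per account
# (in the real pipeline this lookup would be an HBase get).
known_locations = {"A9": [(38.89, -77.03), (38.91, -77.04)]}

# Stand-in for the Kafka topic acting as the buffer zone.
buffer_topic = []

def transform(raw, source):
    """Normalize records from different sources into one common format."""
    if source == "batch_avro":       # daily database extract, legacy names
        return {"txn_id": raw["TXN_ID"], "account": raw["ACCT"],
                "amount": float(raw["AMT"]), "geo": tuple(raw["GEO"])}
    if source == "edge_json":        # streaming edge device, JSON payload
        msg = json.loads(raw)
        return {"txn_id": msg["id"], "account": msg["acct"],
                "amount": float(msg["value"]), "geo": tuple(msg["geo"])}
    raise ValueError(f"unknown source: {source}")

def enrich_and_publish(txn):
    """Attach the account's previously known locations, then publish the
    enriched record to the buffer for downstream stream processing."""
    txn["known_locations"] = known_locations.get(txn["account"], [])
    buffer_topic.append(txn)
    return txn

enrich_and_publish(transform({"TXN_ID": "T1", "ACCT": "A9", "AMT": "120.50",
                              "GEO": [38.90, -77.03]}, "batch_avro"))
enrich_and_publish(transform('{"id": "T2", "acct": "A9", "value": 75,'
                             ' "geo": [40.71, -74.00]}', "edge_json"))
print(len(buffer_topic), buffer_topic[1]["known_locations"])
```

The point of the common format is that everything downstream of the buffer (streaming SQL, model scoring) only ever sees one schema, regardless of where the record originated.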
>>I'll also add that if you have time series data, or if you need OLAP-style cubing, you can leverage Kudu, while EDA (exploratory data analysis) and visualization can be enabled through Cloudera's visualization technology. >>All right: we've filtered, analyzed, and explored our incoming data. We can now proceed to train our machine learning models, which will detect anomalous behavior in our historically collected data set. To do this, we can use a combination of supervised, unsupervised, and even deep learning techniques with neural networks, and these models can be tested on new incoming streaming data. Once we've obtained the accuracy and performance scores we want, we can take these models and deploy them into production. And once the models are productionalized, or operationalized, they can be leveraged within our streaming pipeline: as new data is ingested in real time, NiFi can query these models to detect whether the activity is anomalous or fraudulent, and if it is, alert downstream users and systems. This, in essence, is how fraudulent activity detection works.
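The "learn normal behavior, flag deviations" idea reduces to a few lines. A production deployment would serve a trained supervised or deep learning model behind the pipeline, so treat this per-account profile as a stand-in baseline; the history values are invented.

```python
from statistics import mean, stdev

def fit_profile(history):
    """Learn each account's 'normal behavior' from historical amounts."""
    return {acct: (mean(vals), stdev(vals)) for acct, vals in history.items()}

def is_anomalous(profile, account, amount, threshold=3.0):
    """A new transaction is anomalous if it deviates from the account's
    learned profile by more than `threshold` standard deviations."""
    mu, sigma = profile[account]
    return abs(amount - mu) > threshold * sigma

# Illustrative historical transaction amounts for one account
history = {"A9": [120.0, 95.0, 110.0, 105.0, 130.0]}
profile = fit_profile(history)
print(is_anomalous(profile, "A9", 112.0))    # close to the learned profile
print(is_anomalous(profile, "A9", 5_000.0))  # far outside the profile
```

As each new record arrives from the buffer, this is the query the streaming pipeline makes against the deployed model, alerting downstream systems when it returns true.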
>>This entire pipeline is powered by Cloudera's technology, and the IRS is one of the Cloudera customers leveraging our platform today, implementing a very similar architecture to detect fraud, waste, and abuse across a very large set of historical tax data. One of the neat things with the IRS is that they've recently leveraged the partnership between Cloudera and NVIDIA to accelerate their Spark-based analytics and machine learning, and the results have been nothing short of amazing. In fact, we have a quote here from Joe Ansaldi, the technical branch chief for the Research, Applied Analytics and Statistics division within the IRS: "With zero changes to our fraud detection workflow, we were able to obtain eight times the performance simply by adding GPUs to our mainstream big data servers. This improvement translates to half the cost of ownership for the same workloads." So embedding GPUs into the reference architecture I covered earlier has enabled the IRS to improve their time to insight by as much as 8x, while simultaneously reducing their underlying infrastructure costs by half. Cindy, back to you. >>Shev, thank you. I hope the analysis and information Shev and I have provided gives you some insight into how Cloudera is helping with the fraud, waste and abuse challenges within the public sector: handling any and all types of data; bringing together and analyzing information on the Cloudera platform, whether structured, semi-structured, or unstructured, in both batch and real time; and applying anomaly detection methods, neural network analysis, and time series analysis. As next steps, we'd love to have an additional conversation with you. You can also find additional information on how Cloudera works with the federal government by going to cloudera.com/solutions/public-sector, and we welcome scheduling a meeting with you. Again, thank you for joining Shev and me today. We greatly appreciate your time and look forward to future conversations.
PUBLIC SECTOR V1 | CLOUDERA
>>Hi, this is Cindy Mikey, vice president of industry solutions at caldera. Joining me today is chef is Molly, our solution engineer for the public sector. Today. We're going to talk about speed to insight. Why using machine learning in the public sector, specifically around fraud, waste and abuse. So topic for today, we'll discuss machine learning, why the public sector uses it to target fraud, waste, and abuse, the challenges. How do we enhance your data and analytical approaches the data landscape analytical methods and shad we'll go over reference architecture and a case study. So by definition, fraud, waste and abuse per the government accountability office is fraud. Isn't an attempt to obtain something about value through unwelcome misrepresentation waste is about squandering money or resources and abuse is about behaving improperly or unreasonably to actually obtain something of value for your personal benefit. So as we look at fraud, um, and across all industries, it's a top of mind, um, area within the public sector. >>Um, the types of fraud that we see is specifically around cyber crime, uh, looking at accounting fraud, whether it be from an individual perspective to also, uh, within organizations, looking at financial statement fraud, to also looking at bribery and corruption, as we look at fraud, it really hits us from all angles, whether it be from external perpetrators or internal perpetrators, and specifically from the research by PWC, the key focus area is we also see over half of fraud is actually through some form of internal or external, uh, perpetrators again, key topics. So as we also look at a report recently by the association of certified fraud examiners, um, within the public sector, the us government, um, in 2017, it was identified roughly $148 billion was attributable to fraud, waste and abuse. 
Specifically about 57 billion was focused on reported monetary losses and another 91 billion on areas where that opportunity or the monetary basis had not yet been measured. >>As we look at breaking those areas down again, we look at several different topics from permit out payment perspective. So breaking it down within the health system, over $65 billion within social services, over $51 billion to procurement fraud to also, um, uh, fraud, waste and abuse that's happening in the grants and the loan process to payroll fraud, and then other aspects, again, quite a few different topical areas. So as we look at those areas, what are the areas that we see additional type of focus, there's a broad stroke areas. What are the actual use cases that our agencies are using the data landscape? What data, what analytical methods can we use to actually help curtail and prevent some of the, uh, the fraud waste and abuse. So, as we look at some of the analytical processes and analytical use crate, uh, use cases in the public sector, whether it's from, uh, you know, the taxation areas to looking at, you know, social services, uh, to public safety, to also the, um, our, um, uh, additional agency methods, we're gonna use focused specifically on some of the use cases around, um, you know, fraud within the tax area. >>Uh, we'll briefly look at some of the aspects of, um, unemployment insurance fraud, uh, benefit fraud, as well as payment and integrity. So fraud has it it's, um, uh, underpinnings inquiry, like you different on government agencies and difficult, different analytical methods, and I usage of different data. So I think one of the key elements is, you know, you can look at your, your data landscape on specific data sources that you need, but it's really about bringing together different data sources across a different variety, a different velocity. So, uh, data has different dimensions. 
So we'll look at structured types of data of semi-structured data, behavioral data, as well as when we look at, um, you know, predictive models. We're typically looking at historical type information, but if we're actually trying to look at preventing fraud before it actually happens, or when a case may be in flight, which is specifically a use case that shad is going to talk about later is how do I look at more of that? >>Real-time that streaming information? How do I take advantage of data, whether it be, uh, you know, uh, financial transactions we're looking at, um, asset verification, we're looking at tax records, we're looking at corporate filings. Um, and we can also look at more, uh, advanced data sources where as we're looking at, um, investigation type information. So we're maybe going out and we're looking at, uh, deep learning type models around, uh, you know, semi or that, uh, behavioral, uh, that's unstructured data, whether it be camera analysis and so forth. So for quite a different variety of data and the, the breadth and the opportunity really comes about when you can integrate and look at data across all different data sources. So in a looking at a more extensive, uh, data landscape. So specifically I want to focus on some of the methods, some of the data sources and some of the analytical techniques that we're seeing, uh, being used, um, in the government agencies, as well as opportunities, uh, to look at new methods. >>So as we're looking at, you know, from a, um, an audit planning or looking at, uh, the opportunity for the likelihood of non-compliance, um, specifically we'll see data sources where we're maybe looking at a constituents profile, we might actually be investigating the forms that they've provided. We might be comparing that data, um, or leveraging internal data sources, possibly looking at net worth, comparing it against other financial data, and also comparison across other constituents groups. 
Some of the techniques that we use are some of the basic natural language processing, maybe we're going to do some text mining. We might be doing some probabilistic modeling, uh, where we're actually looking at, um, information within the agency to also comparing that against possibly tax forms. A lot of times it's information historically has been done on a batch perspective, both structured and semi-structured type information. And typically the data volumes can be low, but we're also seeing those data volumes on increase exponentially based upon the types of events that we're dealing with, the number of transactions. >>Um, so getting the throughput, um, and chef's going to specifically talk about that in a moment. The other aspect is, as we look at other areas of opportunity is when we're building upon, how do I actually do compliance? How do I actually look at conducting audits, uh, or potential fraud to also looking at areas of under-reported tax information? So there you might be pulling in some of our other types of data sources, whether it's being property records, it could be data that's being supplied by the actual constituents or by vendors to also pulling in social media information to geographical information, to leveraging photos on techniques that we're seeing used is possibly some sentiment analysis, link analysis. Um, how do we actually blend those data sources together from a natural language processing? But I think what's important here is also the method and the looking at the data velocity, whether it be batch, whether it be near real time, again, looking at all types of data, whether it's structured semi-structured or unstructured and the key and the value behind this is, um, how do we actually look at increasing the potential revenue or the, um, under reported revenue? >>Uh, how do we actually look at stopping fraudulent payments before they actually occur? 
Um, also looking at increasing the amount of, uh, the level of compliance, um, and also looking at the potential of prosecution of fraud cases. And additionally, other areas of opportunity could be looking at, um, economic planning. How do we actually perform some link analysis? How do we bring some more of those things that we saw in the data landscape on customer, or, you know, constituent interaction, bringing in social media, bringing in, uh, potentially police records, property records, um, other tax department, database information. Um, and then also looking at comparing one individual to other individuals, looking at people like a specific, like a constituent, are there areas where we're seeing, uh, >>Um, other >>Aspects of, of fraud potentially being occurring. Um, and also as we move forward, some of the more advanced techniques that we're seeing around deep learning is looking at computer vision, um, leveraging geospatial information, looking at social network entity analysis, uh, also looking at, uh, agent-based modeling techniques, where we're looking at simulation Monte Carlo type techniques that we typically see in the financial services industry, actually applying that to fraud, waste, and abuse within the, uh, the public sector. Um, and again, that really, uh, lends itself to a new opportunities. And on that, I'm going to turn it over to chef to talk about, uh, the reference architecture for, uh, doing these buckets. >>Thanks, Cindy. Um, so I'm gonna walk you through an example, reference architecture for fraud detection using, uh, Cloudera's underlying technology. Um, and you know, before I get into the technical details, uh, I want to talk about how this would be implemented at a much higher level. So with fraud detection, what we're trying to do is identify anomalies or novelists behavior within our datasets. 
Um, now in order to understand what aspects of our incoming data represents anomalous behavior, we first need to understand what normal behavior is. So in essence, once we understand normal behavior, anything that deviates from it can be thought of as an anomaly, right? So in order to understand what normal behavior is, we're going to need to be able to collect store and process a very large amount of historical data. And so incomes, clutters platform, and this reference architecture that needs to be for you. >>So, uh, let's start on the left-hand side of this reference architecture with the collect phase. So fraud detection will always begin with data collection. We need to collect large amounts of information from systems that could be in the cloud. It could be in the data center or even on edge devices, and this data needs to be collected so we can create our normal behavior profiles. And these normal behavioral profiles would then in turn, be used to create our predictive models for fraudulent activity. Now, uh, thinking, uh, to the data collection side, one of the main challenges that many organizations face, uh, in this phase, uh, involves using a single technology that can handle, uh, data that's coming in all different types of formats and protocols and standards with different velocities and velocities. Um, let me give you an example. Uh, we could be collecting data from a database that gets updated daily, uh, and maybe that data is being collected in Agra format. >>At the same time, we can be collecting data from an edge device that's streaming in every second, and that data may be coming in Jason or a binary format, right? So this is a data collection challenge that can be solved with cluttered data flow, which is a suite of technologies built on a patch NIFA in mini five, allowing us to ingest all of this data, do a drag and drop interface. So now we're collecting all of this data, that's required to map out normal behavior. 
The next thing that we need to do is enrich it, transform it and distribute it to, uh, you know, downstream systems for further process. Uh, so let's, let's walk through how that would work first. Let's taking Richmond for, uh, for enrichment, think of adding additional information to your incoming data, right? Let's take, uh, financial transactions, for example, uh, because Cindy mentioned it earlier, right? >>You can store known locations of an individual in an operational database, uh, with Cloudera that would be HBase. And as an individual makes a new transaction, their geolocation that's in that transaction data can be enriched with previously known locations of that very same individual. And all of that enriched data can be later used downstream for predictive analysis, predictable. So the data has been enrich. Uh, now it needs to be transformed. We want the data that's coming in, uh, you know, Avro and Jason and binary and whatever other format to be transformed into a single common format. So it can be used downstream for stream processing. Uh, again, this is going to be done through clutter and data flow, which is backed by NIFA, right? So the transformed semantic data is then going to be stricted to Kafka and coffin. It's going to serve as that central repository of syndicated services or a buffer zone, right? >>So coffee is going to pretty much provide you with, uh, extremely fast resilient and fault tolerance storage. And it's also gonna give you the consumer APIs that you need that are going to enable a wide variety of applications to leverage that enriched and transformed data within your buffer zone, uh, allowed that, you know, 17. So you can store that data in a distributed file system, give you that historical context that you're going to need later on for machine learning, right? So the next step in the architecture is to leverage a cluttered SQL stream builder, which enables us to write, uh, streaming SQL jobs on top of Apache Flink. 
With SQL Stream Builder, we can filter, analyze, and understand the data that's in the Kafka buffer in real time. I'll also add that if you have time-series data, or if you need OLAP-type cubing, you can leverage Kudu, while EDA, or exploratory data analysis, and visualization can all be enabled through Cloudera's visualization technology. >>All right, so we've filtered, we've analyzed, and we've explored our incoming data. We can now proceed to train our machine learning models, which will detect anomalous behavior in our historically collected data set. To do this, we can use a combination of supervised, unsupervised, and even deep learning techniques with neural networks. These models can be tested on new incoming streaming data, and once we've obtained the accuracy and performance scores that we want, we can take those models and deploy them into production. And once the models are productionalized, or operationalized, they can be leveraged within our streaming pipeline. So as new data is ingested in real time, NiFi can query these models to detect whether the activity is anomalous or fraudulent, and if it is, it can alert downstream users and systems, right? So this, in essence, is how fraudulent activity detection works. >>And this entire pipeline is powered by Cloudera's technology, right? And so the IRS is one of Cloudera's customers that's leveraging our platform today and implementing a very similar architecture to detect fraud, waste, and abuse across a very large set of historical tax data. And one of the neat things with the IRS is that they've recently leveraged the partnership between Cloudera and Nvidia to accelerate their Spark-based analytics and their machine learning, and the results have been nothing short of amazing, right?
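Stepping back, the core idea here, learning "normal" from history and flagging deviations, can be shown with a deliberately tiny stand-in for the supervised, unsupervised, and deep learning models described above. This z-score detector is a sketch only, not the technique any agency necessarily uses:

```python
from statistics import mean, stdev

class ZScoreDetector:
    """Toy anomaly detector: learn normal behavior from historical values,
    then flag new values that deviate more than `threshold` standard
    deviations from the historical mean."""

    def __init__(self, threshold: float = 3.0):
        self.threshold = threshold

    def fit(self, historical):
        # "Understanding normal behavior" = summarizing the historical data.
        self.mu, self.sigma = mean(historical), stdev(historical)
        return self

    def is_anomalous(self, value: float) -> bool:
        # Anything that deviates too far from normal is treated as an anomaly.
        return abs(value - self.mu) > self.threshold * self.sigma
```

In the streaming pipeline, the `is_anomalous` call is the role NiFi plays when it queries the deployed model for each newly ingested record.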
In fact, we have a quote here from Joe Ansaldi, the technical branch chief for the research, applied analytics and statistics division within the IRS: "With zero changes to our fraud detection workflow, we were able to obtain eight times the performance simply by adding GPUs to our mainstream big data servers. This improvement translates to half the cost of ownership for the same workloads." So embedding GPUs into the reference architecture I covered earlier has enabled the IRS to improve their time to insight by as much as 8x while simultaneously reducing their underlying infrastructure costs by half. Cindy, back to you. >>Nasheb, thank you. I hope you found that the analysis and information Nasheb and I have provided gives you some insight into how Cloudera is actually helping with the fraud, waste, and abuse challenges within the public sector: looking at any and all types of data; bringing together and analyzing information on the Cloudera platform, whether structured, semi-structured, or unstructured, in batch or in real time; looking at anomalies and anomaly detection; and applying neural network analysis and time-series methods. For next steps, we'd love to have additional conversations with you. You can also find additional information on how Cloudera is working in the federal government by going to cloudera.com/solutions/public-sector, and we welcome scheduling a meeting with you. Again, thank you for joining us today. We greatly appreciate your time and look forward to future progress. >>Good day, everyone. Thank you for joining me. I'm Cindy Maike, joined by Rick Taylor of Cloudera. We're here to talk about predictive maintenance for the public sector and how to increase asset service reliability. On today's agenda,
we'll talk specifically about how to optimize your equipment maintenance and how to reduce costs and asset failure with data and analytics. We'll go into a little more depth on the types of data and analytical methods we typically see used and the associated approach, and we'll go over a case study as well as a reference architecture. So by basic definition, predictive maintenance is about determining when an asset should be maintained and what specific maintenance activities need to be performed, based upon the asset's actual condition or state. It's also about predicting and preventing failures, and performing maintenance on your schedule to avoid costly unplanned downtime. >>McKinsey has analyzed maintenance costs across multiple industries and identified an opportunity to reduce overall maintenance costs by roughly 50% with different types of analytical methods. So let's look at those three types of models. First, we've got our traditional method, corrective maintenance, which is performing maintenance on an asset after the equipment fails. The challenge with that is we end up with unplanned downtime, disruptions in our schedules, as well as reduced quality in the performance of the asset. Then there's preventive maintenance, which is performing maintenance on a set schedule. The challenge there is that we're typically doing it regardless of the actual condition of the asset, which results in unnecessary downtime and expense. And so we're now really focused on condition-based maintenance, which leverages predictive maintenance techniques based upon actual conditions and real-time events and processes.
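The three strategies just described can be contrasted in a toy maintenance decision function. This is a hedged sketch only: the 500-hour service interval and the 7.1 mm/s vibration alarm level are assumed example thresholds, not values from any agency's program:

```python
# Contrast of the three maintenance strategies: corrective waits for
# failure, preventive uses a fixed interval, and condition-based reacts
# to the asset's measured state. Thresholds are illustrative assumptions.
def needs_maintenance(strategy: str, *, failed: bool = False,
                      hours_since_service: float = 0.0,
                      vibration_mm_s: float = 0.0) -> bool:
    if strategy == "corrective":
        return failed                           # act only after a breakdown
    if strategy == "preventive":
        return hours_since_service >= 500       # assumed fixed service interval
    if strategy == "condition-based":
        return failed or vibration_mm_s > 7.1   # assumed vibration alarm level
    raise ValueError(f"unknown strategy: {strategy}")
```

Note how corrective ignores the asset's condition entirely, preventive ignores it too (only the clock matters), and condition-based is the only one driven by real-time measurements.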
With condition-based maintenance, we've seen organizations, again per McKinsey, achieve a 50% reduction in downtime, as well as an overall 40% reduction in maintenance costs. That's looking across multiple industries, but let's look at it in the context of the public sector, based upon some activity by the Department of Energy several years ago. >>They've really looked at what predictive maintenance means to the public sector and what the benefit is: increasing return on investment in assets, reduction in downtime, as well as overall maintenance costs. Corrective or reactive maintenance is about performing maintenance once there's been a failure; from there the movement is toward preventive, which is based upon a set schedule, and then predictive, where we're monitoring real-time conditions. Most importantly, agencies are now actually leveraging IoT and data and analytics to further reduce those overall downtimes, and there's a research report by the Department of Energy that goes into more specifics on the opportunity within the public sector. So, Rick, let's talk a little bit about some of the challenges regarding data and predictive maintenance. >>Some of the challenges include data silos. Historically, our government organizations, and organizations in the commercial space as well, have multiple data silos that have spun up over time across multiple business units, and there's no single view of assets; oftentimes there's redundant information stored in these silos. Couple that with huge increases in data volume, with data growing exponentially, along with new types of data that we can now ingest: social media, semi-structured and unstructured data sources, and the real-time data we can now collect from the internet of things.
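As a toy illustration of the silo problem Rick just described, collapsing redundant records from multiple systems into a single view of each asset might look like the following. The silo names, field names, and values are all invented for illustration, not a real schema:

```python
# Two hypothetical silos holding partial, overlapping records about assets.
erp_silo = [{"asset_id": "pump-1", "location": "plant-A"}]
maint_silo = [{"asset_id": "pump-1", "last_service": "2021-03-01"},
              {"asset_id": "pump-2", "last_service": "2021-04-12"}]

def single_view(*silos):
    """Collapse records from all silos into one record per asset id."""
    assets = {}
    for silo in silos:
        for record in silo:
            # later silos fill in fields missing from earlier ones
            assets.setdefault(record["asset_id"], {}).update(record)
    return assets
```

The point is the single key per asset: once every silo's records are folded into one view, the redundancy disappears and downstream analytics see a complete picture.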
And so the challenge is to bring all these data assets together and begin to extract intelligence and insights from them, which in turn fuels machine learning and what we call artificial intelligence, which enables predictive maintenance. Next slide. >>So let's look specifically at the types of use cases. Rick and I are going to focus on those use cases where we see predictive maintenance coming into procurement, facilities, supply chain, operations, and logistics. There are various levels of maturity here: we're talking about predictive maintenance, but also about using information from a connected asset or vehicle for monitoring, through to leveraging predictive maintenance on data from connected warehouses, facilities, and buildings. All of these bring an opportunity to increase the quality and effectiveness of the missions within the agencies, while also improving cost efficiency, risk, and safety. As for the types of data, building on the new types of information Rick mentioned, some of the data elements we've typically seen start with failure history: when has an asset, a machine, or a component within a machine failed in the past? We also bring together maintenance history for a specific machine: are we getting error codes off of a machine or asset, and when have we replaced certain components? And we look at how the assets are actually being used: what were the operating conditions, as pulled from a sensor on that asset?
We also look at the features of an asset, whether that's engine size, make and model, or where the asset is located, as well as who has operated the asset: their certifications, their experience, and how they're utilizing the assets. And then we bring in some of the pattern analysis we've seen: what are the operating limits, are we getting service reliability, are we getting product recall information from the actual manufacturer? So, Rick, I know the data landscape has really changed; let's go over some of those components. >>Sure. This slide depicts some of the inputs that inform a predictive maintenance program. As we've talked about, there are silos of information: the ERP system of record, perhaps the spares and the service history. What we want to do is combine that information with sensor data, whether from facility and equipment sensors, or temperature and humidity, for example. All of this is combined together and then used to develop machine learning models that better inform predictive maintenance, because we do need to take into account the environmental factors that may cause additional wear and tear on the asset we're monitoring. Here are some examples of private sector maintenance use cases that also have broad applicability across the government. For example, one of the busiest airports in Europe is running Cloudera on Azure to capture, secure, and correlate sensor data collected from equipment within the airport, more specifically the people-moving equipment: the escalators, the elevators, and the baggage carousels. >>The objective here is to prevent breakdowns and improve airport efficiency and passenger safety. Another example is a container shipping port.
In this case, we use IoT data and machine learning to help customers recognize how their cargo-handling equipment is performing in different weather conditions, to understand how usage relates to failure rates, and to detect anomalies in transport systems, all of which improve efficiency. Another example is Navistar, a leading manufacturer of commercial trucks, buses, and military vehicles. Typically, vehicle maintenance, as Cindy mentioned, is based on miles traveled, or on a schedule, or on the time since the last service. But these are only two of the thousands of data points that can signal the need for maintenance. And as it turns out, unscheduled maintenance and vehicle breakdowns account for a large share of the total cost for a vehicle owner. So to help fleet owners move from a reactive approach to a more predictive model, Navistar built an IoT-enabled remote diagnostics platform called OnCommand. >>The platform brings in over 70 sensor data feeds for more than 375,000 connected vehicles, including engine performance, truck speed, acceleration, coolant temperature, and brake wear. This data is then correlated with other Navistar and third-party data sources, including weather, geolocation, vehicle usage, traffic, warranty, and parts inventory information. The platform then uses machine learning and advanced analytics to automatically detect problems early and predict maintenance requirements. So how does a fleet operator use this information? They can monitor truck health and performance from smartphones or tablets and prioritize needed repairs. They can also identify the nearest service location that has the relevant parts, the trained technicians, and the available service bays. So, wrapping up the benefits: Navistar has helped fleet owners reduce maintenance costs by more than 30%. The same platform is also used to help school buses run safely
and on time: for example, one school district with 110 buses that travel over a million miles annually reduced the number of PTOs needed year over year, thanks to predictive insights delivered by this platform. >>So I'd like to take a moment and walk through the data life cycle as depicted in this diagram. Data ingested from the edge may include feeds from the factory floor or things like connected vehicles, whether they're trucks, aircraft, heavy equipment, cargo vessels, et cetera. Next, the data lands on a secure and governed data platform, where it's combined with data from existing systems of record to provide additional insights, and this platform supports multiple analytic functions working together on the same data while maintaining strict security, governance, and control measures. Once processed, the data is used to train machine learning models, which are then deployed into production, monitored, and retrained as needed to maintain accuracy. The processed data is also typically placed in a data warehouse and used to support business intelligence, analytics, and dashboards. And in fact, this data life cycle is representative of one of our government customers doing condition-based maintenance across a variety of aircraft. >>The benefits they've discovered include less unscheduled maintenance, a reduction in mean man-hours to repair, increased maintenance efficiency, improved aircraft availability, and the ability to avoid cascading component failures, which typically cost more in repair cost and downtime. They're also able to better forecast the requirements for replacement parts and consumables, and last, and very importantly, this leads to enhanced safety. This chart overlays the secure open source Cloudera platform used in support of the data life cycle we've been discussing: Cloudera DataFlow provides the data ingest, data movement, and real-time streaming data query capabilities.
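The monitor-and-retrain step in the life cycle just described can be sketched as a tiny control loop. This is an illustrative skeleton under assumed names, not any platform's actual API: the model, trainer, and accuracy function are passed in as plain callables:

```python
# Sketch of the "monitored, and retrained as needed" step: score the
# production model on fresh labeled data, and retrain only when its
# accuracy drifts below a threshold. All names here are illustrative.
def monitor_and_retrain(model, fresh_data, fresh_labels, train_fn,
                        accuracy_fn, threshold=0.9):
    acc = accuracy_fn(model, fresh_data, fresh_labels)
    if acc < threshold:
        return train_fn(fresh_data, fresh_labels), acc  # deploy retrained model
    return model, acc  # current model is still accurate enough
```

In a production setting this loop would run on a schedule against newly labeled outcomes, which is how the deployed models "maintain accuracy" over time.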
So DataFlow gives us the capability to bring data in from the assets of interest, from the internet of things, while the data platform provides a secure, governed data lake and visibility across the full machine learning life cycle, eliminating silos and streamlining workflows across teams. The platform includes an integrated suite of secure analytic applications, and two that we're specifically calling out here are Cloudera Machine Learning, which supports the collaborative data science and machine learning environment that facilitates machine learning and AI, and Cloudera Data Warehouse, which supports the analytics and business intelligence, including those dashboards for leadership. Cindy, over to you. >>Rick, thank you. I hope that Rick and I have provided you some insights into how predictive maintenance and condition-based maintenance are being used, and can be used, within your respective agency: bringing together data sources that maybe you're having challenges with today, bringing in more real-time information from a streaming perspective, and blending that industrial IoT data with historical information to help optimize maintenance and reduce costs within each of your agencies. To learn a little bit more about Cloudera and what we're doing for predictive maintenance, please visit cloudera.com/solutions/public-sector, and we look forward to scheduling a meeting with you. On that, we appreciate your time today, and thank you very much.
Cindy Maike & Nasheb Ismaily | Cloudera
>>Hi, this is Cindy Mikey, vice president of industry solutions at Cloudera. Joining me today is chef is Molly, our solution engineer for the public sector. Today. We're going to talk about speed to insight. Why using machine learning in the public sector, specifically around fraud, waste and abuse. So topic for today, we'll discuss machine learning, why the public sector uses it to target fraud, waste, and abuse, the challenges. How do we enhance your data and analytical approaches the data landscape analytical methods and Shev we'll go over reference architecture and a case study. So by definition, fraud, waste and abuse per the government accountability office is fraud is an attempt to obtain something about a value through unwelcomed. Misrepresentation waste is about squandering money or resources and abuse is about behaving improperly or unreasonably to actually obtain something of value for your personal benefit. So as we look at fraud and across all industries, it's a top of mind, um, area within the public sector. >>Um, the types of fraud that we see is specifically around cyber crime, uh, looking at accounting fraud, whether it be from an individual perspective to also, uh, within organizations, looking at financial statement fraud, to also looking at bribery and corruption, as we look at fraud, it really hits us from all angles, whether it be from external perpetrators or internal perpetrators, and specifically from the research by PWC, the key focus area is we also see over half of fraud is actually through some form of internal or external are perpetrators again, key topics. So as we also look at a report recently by the association of certified fraud examiners, um, within the public sector, the us government, um, in 2017, it was identified roughly $148 billion was attributable to fraud, waste and abuse. 
Specifically of that 57 billion was focused on reported monetary losses and another 91 billion on areas where that opportunity or the monetary basis had not yet been measured. >>As we look at breaking those areas down again, we look at several different topics from an out payment perspective. So breaking it down within the health system, over $65 billion within social services, over $51 billion to procurement fraud to also, um, uh, fraud, waste and abuse that's happening in the grants and the loan process to payroll fraud, and then other aspects, again, quite a few different topical areas. So as we look at those areas, what are the areas that we see additional type of focus, there's broad stroke areas? What are the actual use cases that our agencies are using the data landscape? What data, what analytical methods can we use to actually help curtail and prevent some of the, uh, the fraud waste and abuse. So, as we look at some of the analytical processes and analytical use crate, uh, use cases in the public sector, whether it's from, uh, you know, the taxation areas to looking at social services, uh, to public safety, to also the, um, our, um, uh, additional agency methods, we're going to focus specifically on some of the use cases around, um, you know, fraud within the tax area. >>Uh, we'll briefly look at some of the aspects of unemployment insurance fraud, uh, benefit fraud, as well as payment and integrity. So fraud has its, um, uh, underpinnings in quite a few different on government agencies and difficult, different analytical methods and I usage of different data. So I think one of the key elements is, you know, you can look at your, your data landscape on specific data sources that you need, but it's really about bringing together different data sources across a different variety, a different velocity. So, uh, data has different dimensions. 
So we'll look at on structured types of data of semi-structured data, behavioral data, as well as when we look at, um, you know, predictive models, we're typically looking at historical type information, but if we're actually trying to lock at preventing fraud before it actually happens, or when a case may be in flight, which is specifically a use case, that shadow is going to talk about later it's how do I look at more of that? >>Real-time that streaming information? How do I take advantage of data, whether it be, uh, you know, uh, financial transactions we're looking at, um, asset verification, we're looking at tax records, we're looking at corporate filings. Um, and we can also look at more, uh, advanced data sources where as we're looking at, um, investigation type information. So we're maybe going out and we're looking at, uh, deep learning type models around, uh, you know, semi or that behavioral, uh, that's unstructured data, whether it be camera analysis and so forth. So quite a different variety of data and the, the breadth, um, and the opportunity really comes about when you can integrate and look at data across all different data sources. So in a sense, looking at a more extensive on data landscape. So specifically I want to focus on some of the methods, some of the data sources and some of the analytical techniques that we're seeing, uh, being used, um, in the government agencies, as well as opportunities, uh, to look at new methods. >>So as we're looking at, you know, from a, um, an audit planning or looking at, uh, the opportunity for the likelihood of non-compliance, um, specifically we'll see data sources where we're maybe looking at a constituents profile, we might actually be, um, investigating the forms that they've provided. We might be comparing that data, um, or leveraging internal data sources, possibly looking at net worth, comparing it against other financial data, and also comparison across other constituents groups. 
Some of the techniques that we use are some of the basic natural language processing, maybe we're going to do some text mining. We might be doing some probabilistic modeling, uh, where we're actually looking at, um, information within the agency to also comparing that against possibly tax forms. A lot of times it's information historically has been done on a batch perspective, both structured and semi-structured type information. And typically the data volumes can be low, but we're also seeing those data volumes increase exponentially based upon the types of events that we're dealing with, the number of transactions. >>Um, so getting the throughput, um, and chef's going to specifically talk about that in a moment. The other aspect is, as we look at other areas of opportunity is when we're building upon, how do I actually do compliance? How do I actually look at conducting audits, uh, or potential fraud to also looking at areas of under reported tax information? So there you might be pulling in some of our other types of data sources, whether it's being property records, it could be data that's being supplied by the actual constituents or by vendors to also pulling in social media information to geographical information, to leveraging photos on techniques that we're seeing used is possibly some sentiment analysis, link analysis. Um, how do we actually blend those data sources together from a natural language processing? But I think what's important here is also the method and the looking at the data velocity, whether it be batch, whether it be near real time, again, looking at all types of data, whether it's structured semi-structured or unstructured and the key and the value behind this is, um, how do we actually look at increasing the potential revenue or the, um, under reported revenue? >>Uh, how do we actually look at stopping fraudulent payments before they actually occur? 
Um, also looking at increasing the amount of, uh, the level of compliance, um, and also looking at the potential of prosecution of fraud cases. And additionally, other areas of opportunity could be looking at, um, economic planning. How do we actually perform some link analysis? How do we bring some more of those things that we saw in the data landscape on customer, or, you know, constituent interaction, bringing in social media, bringing in, uh, potentially police records, property records, um, other tax department, database information. Um, and then also looking at comparing one individual to other individuals, looking at people like a specific, like, uh, constituent, are there areas where we're seeing, uh, um, other aspects of, of fraud potentially being, uh, occurring. Um, and also as we move forward, some of the more advanced techniques that we're seeing around deep learning is looking at computer vision, um, leveraging geospatial information, looking at social network entity analysis, uh, also looking at, um, agent-based modeling techniques, where we're looking at simulation, Monte Carlo type techniques that we typically see in the financial services industry, actually applying that to fraud, waste, and abuse within the, the public sector. >>Um, and again, that really, uh, lends itself to a new opportunities. And on that, I'm going to turn it over to Chevy to talk about, uh, the reference architecture for doing these buckets. >>Sure. Yeah. Thanks, Cindy. Um, so I'm going to walk you through an example, reference architecture for fraud detection, using Cloudera as underlying technology. Um, and you know, before I get into the technical details, uh, I want to talk about how this would be implemented at a much higher level. So with fraud detection, what we're trying to do is identify anomalies or anomalous behavior within our datasets. 
Um, now in order to understand what aspects of our incoming data represents anomalous behavior, we first need to understand what normal behavior is. So in essence, once we understand normal behavior, anything that deviates from it can be thought of as an anomaly, right? So in order to understand what normal behavior is, we're going to need to be able to collect store and process a very large amount of historical data. And so incomes, clutters platform, and this reference architecture that needs to be for you. >>So, uh, let's start on the left-hand side of this reference architecture with the collect phase. So fraud detection will always begin with data collection. Uh, we need to collect large amounts of information from systems that could be in the cloud. It could be in the data center or even on edge devices, and this data needs to be collected so we can create from normal behavior profiles and these normal behavioral profiles would then in turn, be used to create our predictive models for fraudulent activity. Now, uh, uh, to the data collection side, one of the main challenges that many organizations face, uh, in this phase, uh, involves using a single technology that can handle, uh, data that's coming in all different types of formats and protocols and standards with different velocities and velocities. Um, let me give you an example. Uh, we could be collecting data from a database that gets updated daily, uh, and maybe that data is being collected in Agra format. >>At the same time, we can be collecting data from an edge device that's streaming in every second, and that data may be coming in Jace on or a binary format, right? So this is a data collection challenge that can be solved with cluttered data flow, which is a suite of technologies built on Apache NIFA and mini five, allowing us to ingest all of this data, do a drag and drop interface. So now we're collecting all of this data, that's required to map out normal behavior. 
The next thing that we need to do is enrich it, transform it, and distribute it to downstream systems for further processing. So let's walk through how that would work. First, for enrichment, think of adding additional information to your incoming data. Let's take financial transactions, for example, because Cindy mentioned them earlier, right? >>You can store known locations of an individual in an operational database, which with Cloudera would be HBase. As an individual makes a new transaction, the geolocation in that transaction data can be enriched with previously known locations of that very same individual, and all of that enriched data can be used later downstream for predictive analysis. So the data has been enriched; now it needs to be transformed. We want the data that's coming in as Avro, JSON, binary, and whatever other formats to be transformed into a single common format, so it can be used downstream for stream processing. Again, this is going to be done through Cloudera DataFlow, which is backed by NiFi, right? The transformed, standardized data is then going to be streamed to Kafka, and Kafka is going to serve as that central repository of syndicated services, or a buffer zone, right? >>So Kafka pretty much provides you with extremely fast, resilient, and fault-tolerant storage, and it's also going to give you the consumer APIs that you need to enable a wide variety of applications to leverage that enriched and transformed data within your buffer zone. I'll add that you can also store that data in a distributed file system, to give you the historical context that you're going to need later on for machine learning, right?
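A rough sketch of the normalize-then-enrich step just described, with a plain dict standing in for the HBase lookup of known locations. The record fields, formats, and account IDs here are hypothetical illustrations, not the actual schema:

```python
import json

# Stand-in for an HBase table of previously known locations per account.
KNOWN_LOCATIONS = {"acct-1": ["NYC", "Boston"]}

def normalize(raw, fmt):
    """Transform records arriving as JSON or CSV into one common dict format."""
    if fmt == "json":
        return json.loads(raw)
    if fmt == "csv":
        account, amount, location = raw.split(",")
        return {"account": account, "amount": float(amount), "location": location}
    raise ValueError(f"unsupported format: {fmt}")

def enrich(rec):
    """Add previously known locations so downstream models can compare."""
    rec["known_locations"] = KNOWN_LOCATIONS.get(rec["account"], [])
    rec["new_location"] = rec["location"] not in rec["known_locations"]
    return rec
```

In the architecture described above, NiFi would do this per-record work at ingest time; the point of the sketch is only that heterogeneous inputs end in one common, enriched shape.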
So the next step in the architecture is to leverage Cloudera SQL Stream Builder, which enables us to write streaming SQL jobs on top of Apache Flink, so we can filter, analyze, and understand the data that's in the Kafka buffer zone in real time. I'll also add that if you have time-series data, or if you need OLAP-style cubing, you can leverage Kudu, while EDA, or exploratory data analysis, and visualization can all be enabled through Cloudera's visualization technology. >>All right, so we've filtered, we've analyzed, and we've understood our incoming data. We can now proceed to train our machine learning models, which will detect anomalous behavior in our historically collected dataset. To do this, we can use a combination of supervised, unsupervised, and even deep learning techniques with neural networks, and these models can be tested on new incoming streaming data. Once we've obtained the accuracy, the performance, the F1 scores that we want, we can then take these models and deploy them into production. And once the models are productionalized, or operationalized, they can be leveraged within our streaming pipeline, so as new data is ingested in real time, NiFi can query these models to detect whether the activity is anomalous or fraudulent, and if it is, alert downstream users and systems, right? So this, in essence, is how fraudulent activity detection works, and this entire pipeline is powered by Cloudera's technology. Cindy, next slide please. >>Right. And so the IRS is one of Cloudera's customers that's leveraging our platform today and implementing a very similar architecture to detect fraud, waste, and abuse across a very large set of historical tax data.
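The model-evaluation gate Nasheb mentions, checking metrics such as the F1 score before promoting a model to production, can be sketched as follows. The 0.9 promotion threshold is an arbitrary illustration, not a Cloudera default:

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall over boolean labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def should_deploy(y_true, y_pred, minimum=0.9):
    """Promote the candidate model only if it clears the F1 bar."""
    return f1_score(y_true, y_pred) >= minimum
```

F1 is a reasonable gate for fraud detection precisely because fraud labels are heavily imbalanced, where raw accuracy can look excellent while missing every fraudulent case.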
And one of the neat things with the IRS is that they've recently leveraged the partnership between Cloudera and Nvidia to accelerate their Spark-based analytics and their machine learning, and the results have been nothing short of amazing, right? In fact, we have a quote here from Joe Ansaldi, the technical branch chief for the Research, Applied Analytics and Statistics division within the IRS: "With zero changes to our fraud detection workflow, we were able to obtain eight times the performance simply by adding GPUs to our mainstream big data servers. This improvement translates to half the cost of ownership for the same workloads." So embedding GPUs into the reference architecture I covered earlier has enabled the IRS to improve their time to insight by as much as 8x, while simultaneously reducing their underlying infrastructure costs by half. Cindy, back to you. >>Nasheb, thank you. I hope that you found the analysis and the information that Nasheb and I have provided gives you some insight into how Cloudera is actually helping with the fraud, waste, and abuse challenges within the public sector: specifically, looking at any and all types of data, and how the Cloudera platform brings together and analyzes information, whether your data is structured, semi-structured, or unstructured, in both a batch and a real-time perspective, looking at anomalies, being able to apply those anomaly detection methods, and looking at neural network analysis and time-series information. For next steps, we'd love to have an additional conversation with you. You can also find additional information on how Cloudera is working in the federal government by going to cloudera.com/solutions/public-sector, and we welcome scheduling a meeting with you. Again, thank you for joining us today. We greatly appreciate your time and look forward to future conversations. Thank you.
Stephan Ewen, data Artisans | Flink Forward 2018
>> Narrator: Live from San Francisco, it's the CUBE, covering Flink Forward, brought to you by data Artisans. >> Hi, this is George Gilbert. We are at Flink Forward, the conference put on by data Artisans for the Apache Flink community. This is the second Flink Forward in San Francisco, and we are honored to have with us Stephan Ewen, co-founder of data Artisans, co-creator of Apache Flink, and CTO of data Artisans. Stephan, welcome. >> Thank you, George. >> Okay, so with others we were talking about the use cases they were trying to solve, but you put all the pieces together in your head first and are building out something that gets broader and broader in its applicability. Help us, maybe from the bottom up, think through the problems you were trying to solve, and let's start with the ones that you saw first and then how the platform grows so that you can solve a broader scale of problems. >> Yes, yeah, happy to do that. So I think we have to take a few steps back and look at, let's say, the breadth of use cases that we're looking at. How did that influence some of the inherent decisions in how we've built Flink? How does that relate to what we presented earlier today, the stream processing platform, and so on? So, starting to work on Flink: stream processing is an extremely general and broad paradigm. We've actually started to say what Flink is underneath the hood: it's an engine to do stateful computations over data streams. It's a system that can process data streams the way a batch processor processes bounded data, and it can process data streams the way a real-time stream processor processes real-time streams of events.
It can also handle data streams with sophisticated event-by-event, stateful, timely logic, the way many applications that are implemented as data-driven microservices implement their logic. And the basic idea behind how Flink takes its approach is to start with the basic ingredients that you need, and try not to impose constraints around the use of them. When I give presentations, I very often say the basic building blocks for Flink are, first, flowing streams of data, streams being received from systems like Kafka, file systems, databases. You route them; you may want to repartition them, organize them by key, broadcast them, depending on what you need to do. Second, you implement computation on these streams, computation that can keep state almost as if it were a standalone Java application: you don't necessarily think in terms of writing state to a database, you think more in terms of maintaining your own variables. Third, sophisticated access to tracking time and the progress of data, the completeness of data; that's in some sense what is behind the event-time streaming notion, tracking completeness of data as of a certain point in time. And then, to round this all up, we give this a really nice operational tool by introducing the concept of distributed consistent snapshots. Sticking with these basic primitives, you have streams that just flow, no transactional barriers between operations, no micro-batches, just streams that flow, state variables that get updated, and then fault tolerance happening as an asynchronous background process. That is, in some sense, the core idea, and what helps Flink generalize from batch processing to real-time stream processing to event-driven applications.
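The building blocks Stephan lists can be illustrated with a toy sketch: a stream of events routed by key, with per-key state updated as if it were a local variable. This is single-threaded Python standing in for a distributed Flink job, with no snapshotting or event-time tracking shown:

```python
from collections import defaultdict

def run_keyed_job(stream):
    """Consume (key, amount) events, keep running per-key state,
    and emit a downstream record after each update."""
    state = defaultdict(float)             # keyed state: running total per key
    emitted = []
    for key, amount in stream:             # events arrive one by one
        state[key] += amount               # update state like a plain variable
        emitted.append((key, state[key]))  # emit the new aggregate downstream
    return emitted
```

In Flink itself this corresponds roughly to `keyBy()` plus a stateful map function; the state lives with the operator and is checkpointed asynchronously in the background, which is the fourth building block above.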
And what we saw today, in the presentation that I gave earlier, is how we use that to build a platform for stream processing and event-driven applications. That takes some of these things, in that case most prominently the fourth aspect, the ability to take application snapshots at any point in time, and uses them as an extremely powerful operational tool. You can think of it as a tool to archive applications, migrate applications, fork applications, modify them independently. >>And these snapshots are essentially individual snapshots at the node level, and then you're organizing them into one big logical snapshot. >>Yeah, each node takes its own snapshot, but they're consistently organized into a globally consistent snapshot, yes. That has a few very interesting and important implications. Just to give you one example where this makes things much easier: if you have an application that you want to upgrade, and you don't have a mechanism like that, what is the default way that many folks do these updates today? They try a rolling upgrade of all the individual nodes. You replace one, then the next, then the next, but that leads to this interesting situation where, at some point in time, there are actually two versions of the application running at the same time. >>And operating on the same data stream. >>Potentially, yeah, or on some partitions of the data stream you have one version and on some partitions you have another version. You may be at the point where you have to maintain two wire formats, where all pieces of your logic have to be written to understand both versions, or you try to use a data format that makes this a little easier. But it's just inherently a thing that you don't even have to worry about if you have these consistent distributed snapshots. It's just a way to switch from one application to the other as if nothing were shared or in-flight at any point in time.
It just gets many of these problems out of the way. >>Okay, and that snapshot applies to code and data? >>In Flink's architecture itself, the snapshot applies, first of all, only to data, and that is very important. >>George: Yeah. >>Because what it actually allows you to do is decouple the snapshot from the code if you want to. >>George: Okay. >>That allows you to do things like we showed earlier this morning: if you have an earlier snapshot where the data is correct, and then you change the code and introduce a bug, you can just say, okay, let me actually change the code and apply different code to a different snapshot. So you can roll back or roll forward different versions of code and different versions of state independently, or you can say, when I'm forking this application, I'm actually modifying it. That is a level of flexibility that's incredibly useful once you actually start to make use of it in practice. It's been one of the maybe least obvious things when you start to look into stream processing, but once you actually take stream processing into production, this operational flexibility is, I would say, very high up when a lot of users explain why they took Flink into streaming production and not others: the ability to do, for example, that. >>But this sounds then like, with some stream processors, the idea of unbundling the database: you have derived data at different sync points, and that derived data is for analysis, views, whatever. But it sounds like what you're doing is taking derived data of what the application is working on in progress and creating an essentially logically consistent view, not really derived data for some other application's use, but for operational use. >>Yeah, so. >>Is that a fair way to explain it? >>Yeah, let me try to rephrase it a bit. >>Okay.
>> When you start to take this streaming-style approach to things, which has been called turning the database inside out, or unbundling the database, your input sequence of events is arguably the ground truth, and what the stream processor computes is a view of the state of the world. Now, while this sounds super easy at first (views, you can always recompute a view, right?), in practice this view of the world is not just some lightweight thing that's only derived from the sequence of events. It's actually the state of the world that you want to use, and it might not be fully reproducible, either because the sequence of events has been truncated, or because the sequence of events is simply too long to feasibly recompute in a reasonable time. So having a way to work with this that cleanly complements this whole event-driven, log-driven architecture is what this snapshot tool also gives you. >>Okay, so that sounds like that was part of core Flink. >>That is part of core Flink's inherent design, yes. >>Okay, so then take us to the next level of abstraction: the scaffolding that you're building around it with the dA platform, and how that makes stream processing more accessible, how it empowers a whole other generation. >>Yeah, so there are different angles to what the dA platform does. One angle is just very pragmatically easing the rollout of applications by having one way to integrate the platform with your metrics, alerting, logging, the CI/CD pipeline, so that every application that you deploy over there just inherits all of that; the application developer doesn't have to worry about anything. They just say, this is my piece of code, I'm putting it there, and it's just going to be hooked in with everything else.
That's not rocket science, but it's extremely valuable, because there are a lot of tedious bits here and there that otherwise eat up a significant amount of the development time. The technologically more challenging part that this solves is where we're really integrating the application snapshots, the compute resources, the configuration management, and everything else into one model, where you don't have to think: I'm running a Flink job here; that Flink job has created a snapshot that is sitting around here; there's also a snapshot here which probably came from that Flink application; also, that Flink application that's running is actually just a new version of this one, which is, let's say, the testing or acceptance run for the version that we're about to deploy here. It ties all of these things together. >>So it's not just the artifacts from one program, it's how they all interrelate? >>It gives you exactly the idea of how they all interrelate, because an application over its lifetime will correspond to different configurations, different code versions, different deployments, production A/B testing, and so on, and how all of these things work together, how they interplay. Like I said before, Flink deliberately couples checkpoints and code in a rather loose way, to allow you to evolve the code differently and still be able to match a previous snapshot into a newer code version. We make heavy use of that, but we now also give you a good way of tracking all of these things together: how do they relate, when was which version running, what code version was that; having snapshots, we can always go back and reinstate earlier versions; and having the ability to always move a deployment from here to there, fork it, drop it, and so on.
That is one part of it, and the other part is the tight integration with Kubernetes. Initially, containers' sweet spot was stateless compute, but the way stream processing architectures work, the nodes are inherently not stateless: they have a view of the state of the world. This is always recoverable, and you can also change the number of containers; with Flink and other frameworks you have the ability to adjust this, >>Including repartitioning the-- >>Including repartitioning the state. But it's something you often have to be quite careful with, so that it all integrates with exact consistency: the right containers are running at the right point in time with the exact right version, and there's no split-brain situation where some node happens to still be running some other partitions at the same time, or a container goes down and it's ambiguous whether you're supposed to recover or rescale. Figuring all of these things out together is what the idea of integrating these things very tightly gives you. So think of it the following way, right? Initially, you just start with Docker. Docker is a way to say: I'm packaging up everything that a process needs, all of its environment, to make sure that I can deploy it here and here and here and it just always works; it's not like, "Oh, I'm missing the correct version of the library here," or "I'm interfering with that other process on a port."
On top of Docker, people added things like Kubernetes to orchestrate many containers together, forming an application. And then on top of Kubernetes there are things like Helm, or for certain frameworks there are Kubernetes Operators and so on, which try to raise the abstraction to say: okay, we're taking care of the aspects this application needs in addition to container orchestration. We're doing exactly that kind of thing: we're raising the abstraction one level up to say, okay, we're not just thinking about the containers, the compute, and maybe their local persistent storage, but we're looking at the entire stateful application, with its compute, its state, its archival storage, all of it together. >>Okay, let me peel off with a question about more conventionally trained developers and admins. They're used to databases for batch and request/response type jobs or applications. Do you see them becoming potential developers of continuous stream processing apps, or do you see it mainly for a new generation of developers? >>No, I would actually say that for a lot of the classic, call it request/response, or create-read-update-delete applications working against a database, there's huge potential for stream processing, or that kind of event-driven architecture, to help change this view. There's actually a fascinating talk here by the folks from (mumbles) who implemented an entire social network in this stream processing architecture, so not against a database but against a log and a stream processor instead. It comes with some really cool properties, like a very unique kind of operational flexibility: you can at the same time test and evolve, run and do very rapid iterations over your-- >>Because of the decoupling?
>> Exactly, because of the decoupling, because you don't always have to worry about: okay, I'm experimenting here with something, so let me first create a copy of the database, and then, once I actually think this is working out well, figure out how to either migrate those changes back, or make sure that the copy of the database I made is brought up to speed with the production database again before I switch over to the new version. So many of these things, the pieces just fall together easily in the streaming world. >>I think I asked this of Kostas, but if a business analyst wants to query the current state of what's in the cluster, do they go through some sort of head node that knows where the partitions lie, and then some sort of query optimizer figures out how to execute that with a cost model or something? In other words, if you wanted to do some sort of batch or interactive type... >>There are different answers to that, I think. First of all, there's the ability to look into the state of Flink, as in: you have the individual nodes that maintain state while they're doing the computation, and you can look into this, but it's more of a lookup thing. You're not running a query, as in a SQL query, against that particular state. If you would like to do something like that, what Flink gives you, as always, is a wide variety of connectors, so you can for example say: I'm describing my streaming computation here, you can describe it in SQL, and the result of this thing, I'm writing it to a nicely queryable data store, an in-memory database or so, and then you would actually run the dashboard-style exploratory queries against that particular database. So Flink's sweet spot at this point is not to run many small, fast, short-lived SQL queries against something that is running in Flink at the moment. That's not what it is yet built and optimized for.
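The pattern Stephan describes, a streaming job continuously writing its result into a queryable store that analysts then hit, might be sketched like this. A plain dict stands in for the in-memory database, and the field names are hypothetical:

```python
from collections import Counter

class DerivedStore:
    """Stand-in for the queryable data store the streaming job writes to."""
    def __init__(self):
        self.table = {}

    def upsert(self, key, value):
        self.table[key] = value

    def query(self, key):
        return self.table.get(key)

def streaming_job(events, store):
    """Continuously compute counts per region and write them out;
    analysts query `store`, not the stream processor itself."""
    counts = Counter()
    for event in events:
        counts[event["region"]] += 1
        store.upsert(event["region"], counts[event["region"]])
```

The division of labor is the point: the stream processor maintains the view incrementally, while short-lived exploratory queries go against the derived store.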
>> A more batch-oriented one would be the derived data that's in the form of a materialized view. >> Exactly, so these two sides play together very well, right? You have the more exploratory, batch-style queries that go against the view, and then you have the stream processor and streaming SQL used to continuously compute the view that you then explore. >> Do you see scenarios where you have traditional OLTP databases that are capturing business transactions, but now you want to inform those transactions, or potentially automate them, with machine learning? So you capture a transaction, and then there's sort of ambient data, whether it's about the user interaction or it's machine data flowing in, and maybe you don't capture the transaction right away, but you're capturing data for the transaction and the ambient data. From the ambient data you calculate some sort of analytic result, which could be a model score, and that informs the transaction that's running at the front end of this pipeline. Is that a model that you see in the future? >> That sounds like a use case that has actually been run; it's not uncommon, yeah. In some sense, a model like that is behind many of the fraud detection applications. You have the transaction that you capture, and you have a lot of contextual data that you receive, from which you either build a model in the stream processor, or you build a model offline and push it into the stream processor as, let's say, a stream of model updates, and then you're using that stream of model updates.
You derive your classifiers, or your rule engines, or your predictor state from that set of updates and from the history of previous transactions, and then you use that to attach a classification to the transaction. Once this is returned, this stream is fed back to the part of the computation that actually processes the transaction itself, to trigger the decision of whether to, for example, hold it back or let it go forward. >> So this is an application where people who have built traditional architectures would add this capability on for low-latency analytics? >> Yeah, that's one way to look at it, yeah. >> As opposed to a rip and replace, like, we're going to take out our request/response and our batch and put in stream processing. >> Yeah, so that is definitely a way that stream processing is used: you basically capture a change log of whatever is happening in a database, or you just immediately capture the events, the interactions from users and devices, and then you let the stream processor run side by side with the old infrastructure, and compute additional information that even a mainframe database might in the end use to decide what to do with a certain transaction. So it's a way to complement legacy infrastructure with new infrastructure, without having to break off or break away the legacy infrastructure. >> So let me ask in a different direction, more on the complexity that forms a tax for developers and administrators. Many of the open source community products and projects solve narrow functions within a broader landscape, and there's a tax on developers and admins in trying to make those work together, because of the different security models, data models, all that. >> There is a zoo of systems and technologies out there, and also of different paradigms for doing things.
Once systems have a similar paradigm in mind, they usually work together well, but there are different philosophical takes-- >> Give me some examples of the different paradigms that don't fit together well. >> For example... maybe one good example: initially, when streaming was a rather new thing, stream processors were very much thought of as a bit of an addition to, let's say, the batch stack or whatever other stack you currently had. You'd look at a stream processor as an auxiliary piece to do some approximate computation, and a big reason why that was the case is that these stream processors thought of state with a different consistency model, and the way they thought of time was actually different: the batch processors and the databases used timestamp fields, and the early stream processors-- >> They couldn't handle event time. >> Exactly, they just used processing time. That's why you could maybe complement the stack with these things, but it didn't really go well together; you couldn't just say, okay, I can actually take this batch job and interpret it also as a streaming job. Once the stream processors got a better interpretation... >> The event-time model. >> Exactly. So once the stream processors adopted a stronger consistency model, and a time model that is more compatible with reprocessing and so on, all of these things all of a sudden fit together much better. >> Okay, so do you see vendors who are oriented around a single or unified paradigm continuing to broaden their footprint, so that they can essentially take some of the complexity off the developer and the admin by providing one throat to choke, with the pieces designed to work together out of the box, unlike some of the zoos of the former Hadoop community?
In other words, a lot of vendors seem to be trying to build a broader footprint, so that it's something that's simpler to develop against and to operate? >> There are a few good efforts happening in that space right now. One that I really like is the idea of standardizing on some APIs. APIs are hard to standardize on, but you can at least standardize on semantics, which is something that, for example, Flink and Beam have been very keen on: trying to have an open discussion and a roadmap that is very compatible in thinking about streaming semantics. This has been taken to the next level, I would say, with the whole streaming SQL design. Beam is adding streaming SQL and Flink is adding streaming SQL, both in collaboration with the Apache Calcite project, so you get very similar standardized semantics and SQL compliance, and you start to get common interfaces, which is a very important first step, I would say. Standardizing on things like-- >> So SQL semantics across products that would be within a stream processing architecture? >> Yes, and I think this will become really powerful once other vendors start to adopt the same interpretation of streaming SQL and think of it as: yes, it's a way to take a changing data table here and project a view of this changing data table, a changing materialized view, into another system, and then use this as a starting point to maybe compute another derived view, you see. You can actually start to think more high-level about things: think really relational queries, dynamic tables, across different pieces of infrastructure.
Once you can do something like that, interplay in architectures becomes easier to handle, because even if on the runtime level things behave a bit differently, at least you start to establish a standardized model for thinking about how to compose your architecture, and even if you decide to change along the way, you're frequently saved the problem of having to rip everything out and redesign everything because the next system that you bring in just has a completely different paradigm that it follows. >> Okay, this is helpful. To be continued offline or back online on theCUBE. This is George Gilbert. We were having a very interesting and extended conversation with Stephan Ewen, CTO and co-founder of Data Artisans and one of the creators of Apache Flink. And we are at Flink Forward in San Francisco. We will be back after this short break.
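The picture Ewen sketches here, a changelog stream projecting into a changing materialized view that can then seed further derived views, can be illustrated with a toy plain-Python sketch. This is only an illustration of the mental model (the keys and numbers are made up, and no real engine works this literally):

```python
# A changelog: each entry either upserts or deletes a key in a table.
changelog = [
    ("upsert", "order-1", {"city": "SF", "amount": 10}),
    ("upsert", "order-2", {"city": "NY", "amount": 25}),
    ("upsert", "order-1", {"city": "SF", "amount": 15}),  # update in place
    ("delete", "order-2", None),
]

table = {}            # the "dynamic table" projected from the stream
totals_by_city = {}   # a derived materialized view: SUM(amount) GROUP BY city

def apply_change(op, key, row):
    """Apply one change to the table and incrementally update the view."""
    old = table.get(key)
    if old is not None:
        totals_by_city[old["city"]] -= old["amount"]   # retract the old value
    if op == "upsert":
        table[key] = row
        totals_by_city[row["city"]] = totals_by_city.get(row["city"], 0) + row["amount"]
    else:
        table.pop(key, None)

for op, key, row in changelog:
    apply_change(op, key, row)

print(totals_by_city)  # {'SF': 15, 'NY': 0}
```

The retract-then-add step is the essence of the model: because every change to the base table is also reflected in the derived view, the view stays a consistent projection of the changing table, and another system could consume it as its own input stream.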
Holden Karau, Google | Flink Forward 2018
>> Narrator: Live from San Francisco, it's the Cube, covering Flink Forward, brought to you by Data Artisans. (tech music) >> Hi, this is George Gilbert, we're at Flink Forward, the user conference for the Apache Flink Community, sponsored by Data Artisans. We are in San Francisco. This is the second Flink Forward conference here in San Francisco. And we have a very eminent guest, with a long pedigree, Holden Karau, formerly of IBM, and of Apache Spark fame, putting Apache Spark and Python together. >> Yes. >> And now, Holden is at Google, focused on the Beam API, which is an API that makes it possible to write portable stream processing applications across Google's Dataflow, as well as Flink and other stream processors. >> Yeah. >> And Holden has been working on integrating it with the Google TensorFlow framework, also open-sourced. >> Yes. >> So, Holden, tell us about the objective of putting these together. What type of use cases.... >> So, I think it's really exciting. And it's still very early days, I want to be clear. If you go out there and run this code, you are going to get a lot of really weird errors, but please tell us about the errors you get. The goal is really, and we see this in Spark, with the pipeline APIs, that most of our time in machine learning is spent doing data preparation. We have to get our data in a format where we can do our machine learning on top of it. And the tricky thing about the data preparation is that we also often have to have a lot of the same preparation code available to use when we're making our predictions. And what this means is that a lot of people essentially end up having to write, like, a stream-processing job to do their data preparation, and they have to write a corresponding online serving job, to do similar data preparation for when they want to make real predictions.
And by integrating tf.Transform and things like this into the Beam ecosystem, the idea is that people can write their data preparation in a simple, uniform way, that can be taken from the training time into the online serving time, without them having to rewrite their code, removing the potential for mistakes where we like, change one variable slightly in one place and forget to update it in another. And just really simplifying the deployment process for these models. >> Okay, so help us tie that back to, in this case, Flink. >> Yes. >> And also to clarify, that data prep.... My impression was data prep was a different activity. It was like design time and serving was run time. But you're saying that they can be better integrated? >> So, there's different types of data prep. Some types of data prep would be things like removing invalid records. And if I'm doing that, I don't have to do that at serving time. But one of the classic examples for data prep would be tokenizing my inputs, or performing some kind of hashing transformation. And if I do that, when I get new records to predict, they won't be in a pre-tokenized form, or they won't be hashed correctly. And my model won't be able to serve on these sort of raw inputs. So I have to re-create the data prep logic that I created for training at serving time. >> So, by having common Beam API and the common provider underneath it, like Flink and TensorFlow, it's the repeatable activities for transforming data to make it ready to feed to a machine-learning model that you want those.... It would be ideal to have those transformation activities be common in your prep work, and then in the production serving. >> Yes, very true. >> So, tell us what type of customers want to write to the Beam API and have that portability? >> Yeah, so that's a really good question. 
So, there's a lot of people who really want portability outside of Google Cloud, and that's one group of people, essentially people who want to adopt different Google Cloud technologies, but they don't want to be locked into Google Cloud forever. Which is completely understandable. There are other people who are more interested in being able to switch streaming engines, like, they want to be able to switch between Spark and Flink. And those are people who want to try out different streaming engines without having to rewrite their entire jobs. >> Does Spark Structured Streaming support the Beam API? >> So, right now, the Spark support for Beam is limited. It's in the old DStream API, it's not on top of the Structured Streaming API. It's a thing we're actively discussing on the mailing list, how to go about doing it, because there's a lot of intricacies involved in bringing new APIs in line. And since it already works there, there's less of a pressure. But it's something that we should look at more of. Where was I going with this? So the other one that I see is, like, Flink is a wonderful API, but it's very Java-focused. And so, Java's great, everyone loves it, but a lot of cool things that are being done nowadays are being built in Python, like TensorFlow. There's a lot of really interesting machine learning and deep learning stuff happening in Python. Beam gives a way for people to work with Python across these different engines. Flink supports Python, but it's maybe not a first-class citizen. And the Beam Python support is still a work in progress. We're working to get it to be better, but it's.... You can see the demos this afternoon, although if you're not here, you can't see the demo, but you can see the work happening in GitHub. And there's also work being done to support Go. >> To support Go.
>> So, would it be fair to say that the value of Beam, for potential Flink customers, they can work and start on Google Cloud platform. They can start on one of several stream processors. They can move to another one later, and they also inherit the better language support, or bindings from the Beam API? >> I think that's very true. The better language support, it's better for some languages, it's probably not as good for others. It's somewhat subjective, like what better language support is. But I think definitely for Go, it's pretty clear. This stuff is all stuff that's in the master branch, it's not released today. But if people are looking to play with it, I think it's really exciting. They can go and check it out from GitHub, and build it locally. >> So, what type of customers do you see who have moved into production with machine learning? >> So the.... >> And the streaming pipelines? >> The biggest customer that's in production is obviously, or not obviously, is Spotify. One of them is Spotify. They give a lot of talks about it. Because I didn't know we were going to be talking today, I didn't have a chance to go through my customer list and see who's okay with us mentioning them publicly. I'll just stick with Spotify. >> Without the names, the sort of use cases and the general industry.... >> I don't want to get in trouble. >> Okay. >> I'm just going to ... sorry. >> Okay. So then, let's talk about, does Google view Dataflow as their sort of strategic successor to map produce? >> Yes, so.... >> And is that a competitor then to Flink? >> I think Flink and Dataflow can be used in some of the same cases. But, I think they're more complementary. Flink is something you can run on-prem. You can run it in different Defenders. And Dataflow is very much like, "I can run this on Google Cloud." 
And part of the idea with Beam is to make it so that people who want to write Dataflow jobs but maybe want the flexibility to go back to something else later can still have that. Yeah, we couldn't swap in Flink or Dataflow execution engines if we're on Google Cloud, but.... We're not, how do I put it nicely? Provided people are running this stuff, they're burning CPU cycles, I don't really care if they're running Dataflow or Flink as the execution engine. Either way, it's a party for me, right? >> George: Okay. >> It's probably one of those, sort of, friendly competitions. Where we both push each other to do better and add more features that the respective projects have. >> Okay, 30 second question. >> Cool. >> Do you see people building stream processing applications with machine learning as part of it to extend existing apps or for ground-up new apps? >> Totally. I mostly see it as extending existing apps. This is obviously, possibly a bias, just for the people that I talk to. But, going ground-up with both streaming and machine learning at the same time, like, starting both of those projects fresh, is a really big hurdle to get over. >> George: For skills. >> For skills. It's really hard to pick up both of those at the same time. It's not impossible, but it's much more likely you'll build something ... maybe you'll build a batch machine learning system, realize you want to productionize your results more quickly. Or you'll build a streaming system, and then want to add some machine learning on top of it. Those are the two paths that I see. I don't see people jumping head first into both at the same time. But this could change. Batch has been king for a long time and streaming is getting its day in the sun. So, we could start seeing people becoming more adventurous and doing both at the same time. >> Holden, on that note, we'll have to call it a day. That was most informative. >> It's really good to see you again. >> Likewise. So this is George Gilbert.
We're on the ground at Flink Forward, the Apache Flink user conference, sponsored by Data Artisans. And we will be back in a few minutes after this short break. (tech music)
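The failure mode Karau describes, training-time and serving-time data preparation drifting apart, is easiest to see in a toy sketch. Here one plain-Python transform is reused verbatim on both paths; the tokenize-and-hash step is a stand-in example of my own, not tf.Transform's actual API:

```python
import hashlib

def prepare(record):
    """Shared data prep: lowercase, tokenize, and hash tokens into a small
    bucket vocabulary. Used identically at training time and serving time."""
    tokens = record["text"].lower().split()
    buckets = [int(hashlib.sha256(t.encode()).hexdigest(), 16) % 100 for t in tokens]
    return {"features": buckets, "label": record.get("label")}

# Training path: batch-prepare a historical dataset.
training_data = [prepare(r) for r in [
    {"text": "Great Ride", "label": 1},
    {"text": "late pickup", "label": 0},
]]

# Serving path: the SAME function runs on each live request, so a change
# to the tokenization can never silently apply to only one of the paths.
live_request = prepare({"text": "great ride"})

print(live_request["features"] == training_data[0]["features"])  # True
```

Keeping one `prepare` function is exactly the mistake-removal Karau mentions: change the hashing or tokenization once, and both the training pipeline and the online serving job pick it up together.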
Shuyi Chen, Uber | Flink Forward 2018
>> Announcer: Live from San Francisco, it's theCUBE covering Flink Forward, brought to you by Data Artisans. (upbeat music) >> This is George Gilbert. We are at Flink Forward, the user conference for the Apache Flink community, sponsored by Data Artisans, the company behind Flink. And we are here with Shuyi Chen from Uber, and Shuyi works on a very important project, the Calcite query optimizer, the SQL query optimizer, that's used in Apache Flink as well as several other projects. Why don't we start with, Shuyi, tell us where Calcite's used and its role. >> Calcite is basically used in the Flink Table and SQL API, as the SQL parser and query optimizer and planner for Flink. >> OK. >> Yeah. >> So now let's go to Uber and talk about the pipeline or pipelines you guys have been building, and then how you've been using Flink and Calcite to enable the SQL API and the Table API. What workloads are you putting on that platform, or on that pipeline? >> Yeah, so basically I'm the technical lead of the stream processing platform in Uber, and so we use Apache Flink as the stream processing engine for Uber. Basically we build two different platforms: one is called AthenaX, which uses Flink SQL, so basically enabling users to use SQL to compose the stream processing logic. And we have a UI, and with one click, they can just deploy the stream processing job in production. >> When you say UI, did you build a custom UI to essentially turn it into a business intelligence tool, so you have a visual way of constructing your queries? Is that what you're describing, or? >> Yeah, so it's similar to how you write a SQL query to query a database. We have a UI for you to write your SQL query, with all the syntax highlighting and all the hints, to write a SQL query so that even the data scientists and also non-engineers in general can actually use that UI to compose stream processing jobs.
Okay, give us an example of some applications, 'cause this sounds like it's a high-level API, so it makes it more accessible to a wider audience. So what are some of the things they build? >> So for example, our Uber Eats team, they use the SQL API as the stream processing tool to build their Restaurant Manager Dashboard. >> Okay. >> So basically, the data lives in Kafka and gets streamed in real time into the Flink job, which is composed using the SQL API, and then it gets stored in our OLAP database, Pinot. Then when the restaurant owners open the Restaurant Manager, they will see the dashboard of their real-time earnings and everything. And with the SQL API, they no longer need to write the Flink job, they don't need to use Java or Scala code, or do any testing or debugging. It's all SQL, so they, yeah. >> And then what's the SQL coverage, the SQL semantics that are implemented in the current Calcite engine? >> So it's about basic transformation, projection, hopping and tumbling windows, and also joins, and group by, and having, and also, not to mention, the event time and processing time support. >> And you can shuffle from anywhere, you don't have to have two partitions with the same join key on one node. You can have arbitrary, the data placement can be arbitrary for the partitions? >> Well, the SQL is declarative, right? And so once the user composes the logic, the underlying planner will actually take care of the key by and group by, everything. >> Okay, 'cause the reason I ask is many of the early Hadoop-based MPP SQL engines had the limitation where you had to co-locate the partitions that you were going to join. >> That's the same thing for Flink. >> Oh. >> But the SQL part just takes care of that. >> Okay. >> So you just describe what you want, but underneath it gets translated into a Flink program that actually will do all the co-location.
>> Oh, it redoes it for you, okay. >> Yeah, yeah. So now they don't even need to learn Flink, they just need to learn the SQL, yeah. >> Now you said there's a second platform that Uber is building on top of Flink. >> Yeah, the second platform is, we call it the Flink-as-a-service platform. So the motivation is, we found that SQL actually cannot satisfy all the advanced needs in Uber to build stream processing, for the reason, like for example, they will need to call out to RPC services within their stream processing application, or even chain the RPC calls, which is hard to express in SQL, and also when they have a complicated DAG, like a workflow, it's very difficult to debug individual stages, so they want the control to actually use the native Flink DataStream API or DataSet API to build their stream or batch job. >> Is the DataSet API the lowest level one? >> No, it's on the same level with the DataStream, so it's one for streaming, one for batch. >> Okay, DataStream and then the other was Table? >> DataSet. >> Oh, DataSet, DataStream, DataSet. >> Yeah. >> And there's one lower than that, right? >> Yeah, there's one lower API, but it's usually, most people don't use that API. >> So that's system programmers? >> Yeah, yeah. >> So then tell me, who is using, like what type of programmer uses the DataStream or the DataSet API, and what do they build at Uber? >> So for example, in one of the talks later, there's a marketplace team, marketplace dynamics team, that's actually using the platform to do online model update, machine learning model update, using Flink, and so basically they need to take in the model that is trained offline and do a few group bys, by time and location, and then apply the model, and then incrementally update the model. >> And so are they taking a window of updates and then updating the model and then somehow promoting it as the candidate or, >> Yeah, yeah, yeah. Something similar, yeah. >> Okay, that's interesting.
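The online update pattern described above, incrementally refining a per-key estimate as events arrive, can be sketched in plain Python. The keys and numbers here are hypothetical, and a running mean stands in for whatever model the marketplace team actually updates:

```python
from collections import defaultdict

# Per-key running estimate, updated incrementally as events stream in.
# state[key] holds (count, mean); each new observation folds into the
# mean without reprocessing history -- the online-update pattern.
state = defaultdict(lambda: (0, 0.0))

def update(location, hour, eta_seconds):
    """Fold one observation into the running mean for (location, hour)."""
    key = (location, hour)
    count, mean = state[key]
    count += 1
    mean += (eta_seconds - mean) / count   # incremental mean update
    state[key] = (count, mean)
    return mean

update("soma", 9, 300.0)
update("soma", 9, 500.0)
print(state[("soma", 9)])  # (2, 400.0)
```

In the Flink version, the group-by over time and location becomes a keyed stream, and the `(count, mean)` pair lives in Flink's managed keyed state rather than an in-process dict.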
And what type of, so are these the data scientists who are using this API? >> Well, data scientists are not really, it's not designed for data scientists. >> Oh, so they're just handing the models off, they're preparing the models offline and then they're being updated inline on the stream processing platform. >> Yes. >> And so it's maybe data engineers who are essentially updating the features that get fed in and are continually training, or updating the models. >> Basically it's an online model update. So as Kafka events come in, it continues to refine the model. >> Okay, and so as Uber looks out a couple of years, what sorts of things do you see adding to one of these, either of these pipelines, and do you see a shift away from the batch and request-response type workloads towards more continuous processing? >> Yes, actually we do see that trend. Actually, before joining the stream processing platform team in Uber, I was in marketplace as well, and at that point we always saw there's a shift, like people would love to use stream processing technology to actually replace some of the normal backend service applications. >> Tell me some examples. >> Yeah, for example... So in our dispatch platform, we have the need to actually shard the workload by, for example, riders, to different hosts to process. For example, compute, say, ETA, or compute some of the time averages, and this was before done in backend services, say using our internal distributed systems to do the sharding. But actually with Flink, this can just be done very easily, right. And so actually there's a shift, those people will also want to adopt stream processing technology, so long as this is not a request-response style application. >> So the key thing, just to make sure I understand, is that Flink can take care of the distributed joins, whereas when it was a database-based workload, a DBA had to set up the sharding, and now it's sort of more transparent, like it's more automated?
>> I think it's... more of the support. So before, people writing backend services had to write everything: the state management, the sharding, everything, they need to-- >> George: Oh, it's not even database-based-- >> Yeah, it's not database, it's real time. >> So they have to do the physical data management, and Flink takes care of that now? >> Yeah, yeah. >> Oh, got it, got it. >> For some of the applications it's real time, so we don't really need to store the data all the time in the database, so it's usually kept in memory and somehow gets snapshotted. But for a normal backend service writer, they have to do everything. But Flink has already built-in support for state management and all the sharding, partitioning, and the time window aggregation primitives, and it's all built in, and they don't need to worry about re-implementing the logic and re-architecting the system again and again. >> So it's a new platform for real time, it gives you a whole lot of services, higher abstraction, for real-time applications. >> Yeah, yeah. >> Okay. Alright, with that, Shuyi, we're going to have to call it a day. This was Shuyi Chen from Uber talking about how they're building more and more of their real-time platforms on Apache Flink and using a whole bunch of services to complement it. We are at Flink Forward, the Data Artisans user conference for the Apache Flink community, we're in San Francisco, this is the second Flink Forward conference, and we'll be back in a couple minutes, thanks. (upbeat music)
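What the built-in state management and sharding Chen describes buys you can be sketched as a toy in plain Python: a key-by step hash-routes every event for a given key to the same parallel task, which keeps that key's state locally. The routing function and event data below are made up for illustration; the point is that this is exactly the bookkeeping a hand-rolled backend service would have to implement itself:

```python
# Toy key-by: route each event to one of N parallel "tasks" by hashing
# its key, and keep per-key state inside each task.
PARALLELISM = 3
tasks = [{} for _ in range(PARALLELISM)]   # per-task keyed state

def key_by(key):
    """Deterministic routing: every event for `key` hits the same task."""
    return sum(key.encode()) % PARALLELISM

def process(driver_id):
    state = tasks[key_by(driver_id)]
    state[driver_id] = state.get(driver_id, 0) + 1   # per-key event counter

for event_key in ["d1", "d2", "d1", "d3", "d1"]:
    process(event_key)

# All of d1's state lives on exactly one task, so no cross-host
# coordination is needed to read or update it.
owner = tasks[key_by("d1")]
print(owner["d1"])  # 3
```

A stream processor adds, on top of this routing, the pieces the sketch omits: fault-tolerant snapshots of the per-task state, rescaling, and the time-window primitives layered over the keyed state.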
Greg Benson, SnapLogic | Flink Forward 2018
>> Announcer: Live from San Francisco, it's theCUBE covering Flink Forward brought to you by Data Artisans. >> Hi, this is George Gilbert. We are at Flink Forward on the ground in San Francisco. This is the user conference for the Apache Flink Community. It's the second one in the US and this is sponsored by Data Artisans. We have with us Greg Benson, who's Chief Scientist at SnapLogic and also professor of computer science at University of San Francisco. >> Yeah, that's great, thanks for havin' me. >> Good to have you. So, Greg, tell us a little bit about how SnapLogic currently sets up its, well how it builds its current technology to connect different applications. And then talk about, a little bit, where you're headed and what you're trying to do. >> Sure, sure, so SnapLogic is a data and app integration Cloud platform. We provide a graphical interface that lets you drag and drop. You can open components that we call Snaps and you kind of put them together like Lego pieces to define relatively sophisticated tasks so that you don't have to write Java code. We use machine learning to help you build out these pipelines quickly, so we can anticipate, based on your data sources, what you are going to need next, and that lends itself to rapid building of these pipelines. We have a couple of different ways to execute these pipelines. You can think of it as sort of this specification of what the pipeline's supposed to do. We have a proprietary engine that we can execute on single nodes, either in the Cloud or behind your firewall in your data center. We also have a mode which can translate these pipelines into Spark code and then execute those pipelines at scale. So, you can do sort of small, low latency processing to sort of larger, batch processing on very large data sets.
Yeah, good question. I'd love to just back up a little bit. So, because I have this dual role of Chief Scientist and professor of Computer Science, I'm able to get graduate students to work on research projects for credit, and then eventually as interns at SnapLogic. A recent project that we've been working on, we started last fall, so we've been working on it about six or seven months now, is investigating Flink as a possible new back end for the SnapLogic platform. So this allows us to, you know, explore and prototype and just sort of figure out if there's going to be a good match between an emerging technology and our platform. So, to go back to your question, what would this address? Well, without going into too much of the technical differences between Flink and Spark, which I imagine has come up in some of your conversations or comes up here, because they can solve similar use cases, our experience with Flink is that the code base is easy to work with, both from taking our specification of pipelines and then converting them into Flink code that can run. But there's another benefit that we see from Flink, and that is, whenever any product, whether it's our product or anybody else's product, uses something like Spark or Flink as a back end, there's this challenge, because you're converting something that your users understand into this target, right, this Spark API code or Flink API code. And the challenge there is, if something goes wrong, how do you propagate that back to the users so the user doesn't have to read log files or get into the nuts and bolts of how Spark really works. >> It's almost like you've compiled the code, and now if something doesn't work right, you need to work at the source level. >> That's exactly right, and that's what we don't want our users to do, right? >> Right.
>> So one promising thing about Flink is that we're able to integrate the code base in such a way that we have a better understanding of what's happening in the failure conditions that occur. And we're working on ways to propagate those back to the user so they can take actionable steps to remedy those without having to understand the Flink API code itself. >> And what is it, then, about Flink or its API that gives you that feedback about errors or, you know, operational status, that gives you better visibility than you would get in something else like Spark? >> Yeah, so without getting too, too deep on the subject, what we have found is, one thing nice about the Flink code base is that the core is written in Scala, but a lot of, all the IO and memory handling is written in Java, and that's where we need to do our primary interfacing, and the building blocks, sort of the core building blocks to get from, for example, something that you build with a DataSet API to execution. We have found it easier to follow the transformation steps that Flink takes to end up with the resulting sort of optimized Flink pipeline. Now by understanding that transformation, like you were saying, the compilation step, by understanding it, then we can work backwards and understand, when something happens, how to trace it back to what the user was originally trying to specify. >> The GUI specification. >> Yeah. Right. >> So, help me understand, though: it sounds like you're the one essentially building a compiler from a graphical specification language down to Spark as the, you know, sort of pseudo-compiled code, >> Yep. >> Or Flink. And, but if you're the one doing that compilation, I'm still struggling to understand why you would have better reverse engineering capabilities with one.
>> It just is a matter of getting visibility into the steps that the underlying frameworks are taking, and so, I'm not saying this is impossible to do in Spark, but we have found that it's been easier for us to get into the transformation steps that Flink is taking. >> Almost like, for someone who's had as much programming as one semester in night school, like a variable inspector that's already there, >> Yeah, that's a good one, there you go, yeah, yeah, yeah. >> Okay, so you don't have to go try and add it, you can't actually add it, and you don't have to then infer it from all this log data. >> Now, I should add, there's another potential benefit of Flink. You were asking about use cases and what does Flink address. As you know, Flink is a streaming platform, in addition to being a batch platform, and Flink does streaming differently than how Spark does. Spark takes a microbatch approach. What we're also looking at in my research effort is how to take advantage of Flink's streaming approach to allow the SnapLogic GUI to be used to specify streaming Flink applications. Initially we're just focused on the batch mode but now we're also looking at the potential to convert these graphical pipelines into streaming Flink applications, which would be a great benefit to customers who want-- >> George: Real time integration. >> Want to do what Alibaba and all the other companies are doing but take advantage of it without having to get to the nuts and bolts of the programming. Do it through the GUI. >> Wow, so it's almost like, it's like, Flink, Beam, in terms of abstraction layers, >> Sure. >> And then SnapLogic. >> Greg: Sure, yes. >> Not that you would compile to Beam, but the idea that you would have processing and a real-time pipeline. >> Yes. >> Okay. So that's actually interesting, so that would open up a whole new set of capabilities.
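The streaming-versus-microbatch distinction raised here can be made concrete without either framework. The sketch below computes the same running counts two ways: record-at-a-time (an update can be emitted after every event, which is the true-streaming model) and micro-batched (results only appear at batch boundaries, which is where the extra latency comes from). It models the cadence difference only, nothing about Flink's or Spark's internals:

```python
# Toy contrast: record-at-a-time streaming vs. micro-batching, both
# maintaining running counts per key.

from collections import Counter

def stream_counts(events):
    counts = Counter()
    for e in events:                  # one record in, one state update, one emit
        counts[e] += 1
        yield (e, counts[e])

def microbatch_counts(events, batch_size):
    counts = Counter()
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:  # results only surface per batch
            counts.update(batch)
            yield dict(counts)
            batch = []
    if batch:                         # flush the trailing partial batch
        counts.update(batch)
        yield dict(counts)

events = ["a", "b", "a", "a", "b"]
print(list(stream_counts(events)))        # five incremental updates
print(list(microbatch_counts(events, 2))) # three batch snapshots
```

Five input events produce five incremental results in the streaming version but only three snapshots under a batch size of two, which is the latency trade-off in miniature.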
>> Yeah and, you know, it follows our, you know, company's vision in allowing lots of users to do very sophisticated things without being, you know, Hadoop developers or Spark developers, or even Flink developers. We do a lot of the hard work of trying to give you a representation that's easier to work with, right, but also allow you to sort of evolve that and debug it and also eventually get the performance out of these systems. One of the challenges, of course, of Spark and Flink is that they have to be tuned, and so what we're trying to do, using some of our machine learning, is eventually gather information that can help us identify how to tune different types of workflows in different environments. And if we're able to do that in its entirety, then we, you know, we take out a lot of the really hard work that goes into making a lot of these streaming applications both scalable and performing. >> Performing. So this would be, but to do that, you would probably have to collect, well, what's the term? I guess data from the operations of many customers, >> Right. >> Because, as training data, just as the developer alone, you won't really have enough. >> Absolutely, and so you have to bootstrap that. For our machine learning that we currently use today, we leverage, you know, the thousands of pipelines, the trillions of documents that we now process on a monthly basis, and that allows us to provide good recommendations when you're building pipelines, because we have a lot of information. >> Oh, so you are serving the runtime, these runtime compilations. >> Yes. >> Oh, they're not all hosted on the customer premises. >> Oh, no no no, we do both. So it's interesting, we do both. So you can, you can deploy completely in the cloud, we're a complete SaaS provider for you. Most of our customers though, you know, banks, healthcare, want to run our engine behind their firewalls.
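The metrics-driven tuning idea above can be sketched with a deliberately crude stand-in: observe per-subtask throughput from a previous run and suggest a parallelism that meets a target rate with some headroom. Real learned tuning would have to model backpressure, state size, and skew; this only captures the basic arithmetic, and the function name and headroom factor are invented for the example:

```python
# A crude, hypothetical stand-in for metrics-driven parallelism tuning.

import math

def suggest_parallelism(observed_records_per_sec, subtasks, target_rate,
                        headroom=1.2, max_parallelism=128):
    per_subtask = observed_records_per_sec / subtasks   # measured capacity
    needed = math.ceil(target_rate * headroom / per_subtask)
    return max(1, min(needed, max_parallelism))         # clamp to sane bounds

# A job did 50,000 rec/s with 4 subtasks; we must absorb 200,000 rec/s.
print(suggest_parallelism(50_000, 4, 200_000))  # 20 subtasks with 20% headroom
```

Even this toy shows why fleet-wide operational data matters: the `observed_records_per_sec` input has to come from somewhere, and one developer's test runs rarely cover enough workload shapes to generalize.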
Even when we do that though, we still have metadata that we can get introspection, sort of anonymized, but we can get introspection into how things are behaving. >> Okay. That's very interesting. Alright, Greg we're going to have to end it on that note, but uh you know, I guess everyone stay tuned. That sounds like a big step forward in sort of specification of real time pipelines at a graphical level. >> Yeah, well, it's, I hope to be talking to you again soon with more results. >> Looking forward to it. With that, this is George Gilbert, we are at Flink Forward, the user conference for the Apache Flink conference, sorry for the Apache Flink user community, sponsored by Data Artisans, we will be back shortly. (upbeat music)
Kostas Tzoumas, data Artisans | Flink Forward 2018
(techno music) >> Announcer: Live, from San Francisco, it's theCUBE. Covering Flink Forward, brought to you by data Artisans. (techno music) >> Hello again everybody, this is George Gilbert, we're at the Flink Forward Conference, sponsored by data Artisans, the provider of both Apache Flink and the commercial distribution, the dA Platform that supports the productionization and operationalization of Flink, and makes it more accessible to mainstream enterprises. We're privileged to have Kostas Tzoumas, CEO of data Artisans, with us today. Welcome Kostas. >> Thank you. Thank you George. >> So, tell us, let's start with sort of an idealized application-use case, that is in the sweet spot of Flink, and then let's talk about how that's going to broaden over time. >> Yeah, so just a little bit of an umbrella above that. So what we see very, very consistently, we see it in tech companies, and we see, so modern tech companies, and we see it in traditional enterprises that are trying to move there, is a move towards a business that runs in real time. Runs 24/7, is data-driven, so decisions are made based on data, and is software operated. So increasingly decisions are made by AI, by software, rather than someone looking at something and making a decision, yeah. So for example, some of the largest users of Apache Flink are companies like Uber, Netflix, Alibaba, Lyft, they are all working in this way. >> Can you tell us about the size of their, you know, something in terms of records per day, or cluster size, or, >> Yeah, sure. So, latest I heard, Alibaba is powering Alibaba Search, more than a thousand nodes, terabytes of state, I'm pretty sure they will give us bigger numbers today. Netflix has reported doing about one trillion events per day. >> George: Wow. >> On Flink. So pretty big sizes. >> So and is Netflix, I think I read, is powering their real time recommendation updates.
>> They are powering a bunch of things, a bunch of applications, there's a lot of routing events internally. I think they have a talk, they had a talk definitely at the last conference, where they talk about this. And it's really a variety of use cases. It's really about building a platform, internally. And offering it to all sorts of departments in the company, be that for recommendations, be that for BI, be that for running stateful microservices, you know, all sorts of things. And we also see the more traditional enterprise moving to this modus operandi. For example, ING is also one of our biggest partners, it's a global consumer bank based in the Netherlands, and their CEO is saying that ING is not a bank, it's a tech company that happens to have a banking license. So that's how they want to operate. So what we see, is stream processing is really the enabler for this kind of business, for this kind of modern business where we interact with, in real time, they interact with the consumer in real time, they push notifications, they can change the pricing, et cetera, et cetera. So this is really the crux of stateful stream processing, for me. >> So okay, so tell us, for those who, you know, have a passing understanding of how Kafka's evolving, how Apache Spark and Structured Streaming's evolving, as distinct from, but also, Databricks. What is it about having state management that's sort of integrated, that, for example, might make it easy to elastically change a cluster size by repartitioning? What can you assume about managing state internally, that makes things easier? >> Yeah, so I think really the, the sweet spot of Flink is that if you are looking for stream processing, for a stream processing engine, and for a stateful stream processing engine for that matter, Flink is the definition of this. It's the definitive solution to this problem.
It was created from scratch, with this in mind, it was not sort of a bolt-on on top of something else, so it's streaming from the get-go. And we have done a lot of work to make state a first-class citizen. What this means, is that in Flink programs, you can keep state that scales to terabytes, we have seen that, and you can manage this state together with your application. So Flink has this model based on checkpoints, where you take a checkpoint of your application and state together, and you can restart at any time from there. So it's really, the core of Flink is around state management. >> And you manage exactly-once semantics across the checkpointing? >> It's exactly once, it's application-level exactly once. We have also introduced end-to-end exactly once with Kafka. So Kafka-Flink-Kafka exactly once. So fully consistent. >> Okay so, let's drill down a little bit. What are some of the things that customers would do with an application running on, let's say, a big cluster or a couple clusters, where they want to operate both on the application logic and on the state, that having it integrated, you know, makes much easier? >> Yeah, so it is a lot about a flipped architecture and about making operations and DevOps much, much easier. So traditionally what you would do is create, let's say, a containerized stateless application and have a centralized data store to keep all your states. What you do now, is the state becomes part of the application. So this has several benefits. It has performance benefits, it has organizational benefits in the company. >> Autonomy >> Autonomy between teams. It has, you know, it gives you a lot of flexibility on what you can do with the applications, like, for example right, scaling an application.
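The checkpoint model described above, snapshotting the application state together with the input position so that recovery rolls both back to the same logical time, can be sketched as a toy. Flink's real mechanism uses distributed checkpoint barriers across operators; this single-process version only models why pairing state with an input offset gives an exactly-once effect on the state:

```python
# Toy checkpointing: snapshot (state, input offset) atomically; on failure,
# restore both, so each record is reflected in the state exactly once even
# though some records get reprocessed.

import copy

def run_with_checkpoints(log, every, fail_at=None):
    state, offset = {}, 0
    checkpoint = ({}, 0)                       # (state snapshot, log offset)
    processed = 0
    while offset < len(log):
        if fail_at is not None and processed == fail_at:
            # "crash and recover": roll state AND input position back together
            state, offset = copy.deepcopy(checkpoint[0]), checkpoint[1]
            fail_at = None
            continue
        word = log[offset]
        state[word] = state.get(word, 0) + 1   # a running word count as state
        offset += 1
        processed += 1
        if offset % every == 0:                # periodic snapshot of both parts
            checkpoint = (copy.deepcopy(state), offset)
    return state

log = ["a", "b", "a", "c", "b", "a"]
print(run_with_checkpoints(log, every=2))             # no failure
print(run_with_checkpoints(log, every=2, fail_at=5))  # crash mid-stream, same counts
```

The failing run redoes some work after recovery, but because the state was rolled back with the offset, no record is ever double-counted; that is the application-level exactly-once guarantee in miniature.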
What you can do with Flink is that you have an application running with parallelism over 100 and you are getting a higher volume and you want to scale it to 500, right, so you can simply with Flink take a snapshot of the state and the application together, and then restart it at 500 and Flink is going to resolve the state. So no need to do anything on a database. >> And then it'll reshard and Flink will reshard it. >> Will reshard and it will restart. And then one step further with the product that we have introduced, dA Platform, which includes Flink, you can simply do this with one click or with one REST command. >> So, the resharding was possible with the Apache Flink, and the dA Platform just makes it that much easier along with other operations. >> Yeah so what the dA Platform does is it gives you an API for common operational tasks, that we observed everybody that was deploying Flink at a decent scale needs to do. It abstracts, it is based on Kubernetes, but it gives you a higher-level API than Kubernetes. You can manage the application and the state together, and it gives that to you in a REST API, in a UI, et cetera. >> Okay, so in other words it's sort of like by abstracting even up from Kubernetes you might have a cluster as a first-class citizen but you're treating it almost like a single entity and then under the covers you're managing the things that happen across the cluster. >> So what we have in the dA Platform is a notion of a deployment which is, think of it as, I think of it as a cluster, but it's basically based on containers. So you have this notion of deployments that you can manage, (coughs) sorry, and then you have a notion of an application. And an application is a Flink job that evolves over time. And then you have a very, you know, bird's-eye view on this. You can, when you update the code, this is the same application with updated code.
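The snapshot-and-restart rescaling described here (parallelism 100 to 500 with no database involved) rests on one trick: keys hash into a fixed number of "key groups," snapshots store state per key group, and a restart at a new parallelism just reassigns whole groups to subtasks. Flink uses this idea with a max-parallelism setting; the sketch below is a simplified model with invented numbers:

```python
# Toy key-group rescaling: state is partitioned by key group, so changing
# the number of subtasks is a reassignment of groups, not a key-by-key
# migration through an external store.

NUM_KEY_GROUPS = 16                     # fixed "max parallelism" for the toy

def key_group(key):
    return hash(key) % NUM_KEY_GROUPS

def assign(parallelism):
    """Map each key group to a subtask index."""
    return {g: g * parallelism // NUM_KEY_GROUPS for g in range(NUM_KEY_GROUPS)}

def build_state(events, parallelism):
    """state[subtask][key] -> count, standing in for partitioned operator state."""
    owners = assign(parallelism)
    state = [dict() for _ in range(parallelism)]
    for key in events:
        local = state[owners[key_group(key)]]
        local[key] = local.get(key, 0) + 1
    return state

def rescale(state, new_parallelism):
    """'Snapshot' per key group, then hand whole groups to the new subtasks."""
    groups = {}
    for shard in state:
        for key, count in shard.items():
            groups.setdefault(key_group(key), {})[key] = count
    owners = assign(new_parallelism)
    new_state = [dict() for _ in range(new_parallelism)]
    for g, kv in groups.items():
        new_state[owners[g]].update(kv)
    return new_state

events = ["user%d" % (i % 7) for i in range(100)]
old = build_state(events, parallelism=2)
new = rescale(old, new_parallelism=5)   # scale 2 -> 5, no database involved
merged = {k: v for shard in new for k, v in shard.items()}
print(merged == {k: v for shard in old for k, v in shard.items()})  # True
```

Because a key's group never changes, every key's state survives the move intact regardless of the old and new parallelism.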
You can travel through a history, you can visit the logs, and you can do common operational tasks, like as I said, rescaling, updating the code, rollbacks, replays, migrate to a new deployment target, et cetera. >> Let me ask you, outside of the big tech companies who have built much of the application management scaffolding themselves, you can democratize access to stream processing because the capabilities, you know, are not in the skill set of traditional, mainstream developers. So question, the first thing I hear from a lot of sort of newbies, or people who want to experiment, is, "Well, it's so easy to manage the state in a shared database, even if I'm processing, you know, continuously." Where should they make the trade-off? When is it appropriate to use a shared database? Maybe, you know, for real OLTP work, and then when can you sort of scale it out and manage it integrally with the rest of the application? >> So when should we use a database and when should we use streaming, right? >> Yeah, and even if it's streaming with the embedded state. >> Yeah, that's a very good question. I think it really depends on the use case. So what we see in the market, is many enterprises start with a use case that either doesn't scale, or it's not developer friendly enough to have this database/application level separation. And then it quickly spreads out in the whole company and other teams start using it. So for example, in the work we did with ING, they started with a fraud detection application, where the idea was to load models dynamically in the application, as the data scientists are creating new models, and have a scalable fraud detection system that can handle their load. And then we have seen other teams in the company adopting processing after that.
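The "application that evolves over time" framing behind those operational tasks can be modeled in a few lines. The sketch below is a toy, not the dA Platform API: it just records every code update or rescale as a new version of the same logical application, which makes rollback and history browsing trivial bookkeeping operations:

```python
# Toy model of an application as a versioned history of deployments,
# where rollback is simply redeploying the previous configuration.

class Application:
    def __init__(self, name):
        self.name = name
        self.history = []          # (version, config) pairs, oldest first

    def deploy(self, config):
        version = len(self.history) + 1
        self.history.append((version, dict(config)))
        return version

    def current(self):
        return self.history[-1]

    def rollback(self):
        """Redeploy the previous configuration as a new version."""
        assert len(self.history) >= 2, "nothing to roll back to"
        _, prev_cfg = self.history[-2]
        return self.deploy(prev_cfg)

app = Application("fraud-detector")                  # hypothetical app name
app.deploy({"jar": "v1.jar", "parallelism": 100})
app.deploy({"jar": "v1.jar", "parallelism": 500})    # a rescale is a new version
app.rollback()                                       # back to parallelism 100
print(app.current())  # (3, {'jar': 'v1.jar', 'parallelism': 100})
```

In a real platform each deploy would also carry the matching state snapshot, which is what makes "roll back the code" and "roll back the state" a single coordinated action.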
>> Okay, so that sounds like where the model becomes part of the application logic and it's a version of the application logic and then, >> The version of the model >> Is associated with the checkpoint >> Correct. >> So let me ask you then, what happens when you're managing, let's say, terabytes of state across a cluster, and someone wants to query across that distributed state. Is there in Flink a query manager that, you know, knows about where all the shards are and the statistics around the shards to do a cost-based query? >> So there is a feature in Flink called queryable state that gives you the ability to do, very simple for now, queries on the state. This feature is evolving, it's in progress. And it will get more sophisticated and more production-ready over time. >> And that enables a different class of users. >> Exactly. I wouldn't, like, to be frank, I wouldn't use it for complex data warehousing scenarios. That still needs a data warehouse, but you can do point queries and a few, you know, slightly more sophisticated queries. >> So this is different. This type of state would be different from like in Kafka where you can store, you know, the commit log for X amount of time and then replay it. This, it's in a database I assume, not in a log form and so, you have faster access. >> Exactly, and it's placed together with a log, so, you can think of the state in Flink as the materialized view of the log, at any given point in time, with various versions. >> Okay. >> And really, the way replay works is, roll back the state to a prior version and roll back the log, the input log, to that same logical time. >> Okay, so how do you see Flink spreading out, now that it's been proven in the most demanding customers, and now we have to accommodate skills, you know, where the developers and DevOps don't have quite the same distributed systems knowledge?
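The point-query side of this, what queryable state enables, can be sketched without Flink. Keyed state lives partitioned across subtasks, and a point query is routed to the one partition that owns the key, so there is no external database on the read path. Flink's actual queryable-state API differs in the details; this only shows the routing idea:

```python
# Toy "queryable state": the job keeps keyed running totals partitioned
# across subtasks; a point query hashes the key to its owning partition
# and reads the value locally.

PARTITIONS = 4

def owner(key):
    return hash(key) % PARTITIONS

class KeyedJob:
    def __init__(self):
        self.state = [dict() for _ in range(PARTITIONS)]

    def process(self, key, amount):
        shard = self.state[owner(key)]
        shard[key] = shard.get(key, 0) + amount   # keyed running total

    def query(self, key):
        """Point query: route to the owning partition, read its local state."""
        return self.state[owner(key)].get(key, 0)

job = KeyedJob()
for key, amount in [("alice", 30), ("bob", 5), ("alice", 12)]:
    job.process(key, amount)
print(job.query("alice"))  # 42
print(job.query("carol"))  # 0 (no state for that key)
```

This also makes the limitation Kostas mentions visible: a point lookup is one hash and one read, but a cross-key analytical query would have to fan out over every partition, which is exactly the job a data warehouse is built for.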
>> Yeah, I mean we do a lot of work at data Artisans with financial services, insurance, very traditional companies, but it's definitely something that is work in progress, in the sense that our product, the dA Platform, makes operations much easier. This was a common problem everywhere, this was something that tech companies solved for themselves, and we wanted to solve it for everyone else. Application development is yet another thing, and as we saw today in the last keynote, we are working together with Google and the Beam community to bring Python, Go, all sorts of languages into Flink. >> Okay so that'll help at the developer level, and you're also doing work at the operations level with the platform. >> And of course there's SQL, right? So Flink has Stream SQL, which is standard SQL. >> And would you see, at some point, actually sort of managing the platform for customers, either on-prem or in the cloud? >> Yeah, so right now, the platform is running on Kubernetes, which means that typically the customer installs it in their clusters, in their Kubernetes clusters. Which can be either their own machines, or it can be a Kubernetes service from a cloud vendor. Moving forward I think it will be very interesting, yes, to move to more hosted solutions. Make it even easier for people. >> Do you see a breakpoint or a transition between the most sophisticated customers who, either are comfortable on their own premises, or who were cloud, sort of, native from the beginning, and then sort of the rest of the mainstream? You know, what sort of applications might they move to the cloud or might coexist between on-prem and the cloud? >> Well I think it's clear that the cloud is, you know, every new business starts on the cloud, that's clear. There's a lot of enterprise that is not yet there, but there's big willingness to move there. And there's a lot of hybrid cloud solutions as well.
>> Do you see mainstream customers rewriting applications because they would be so much more powerful in stream processing, or do you see them doing just new applications? >> Both, we see both. It's always easier to start with a new application, but we do see a lot of legacy applications in big companies that are not working anymore. And we see those rewritten. And very core applications, very core to the business. >> So could that be, could you be sort of the source and in an analytic processing for the continuous data and then that sort of feeds a transaction and some parameters that then feed a model? >> Yeah. >> Is that, is that a, >> Yeah. >> so in other words you could augment existing OLTP applications with analytics then inform them in real time essentially. >> Absolutely. >> Okay, 'cause that sounds like then something that people would build around what exists. >> Yeah, I mean you can do, you can think of stream processing, in a way, as transaction processing. It's not a dedicated OLTP store, but you can think of it in this flipped architecture right? Like the log is essentially the re-do log, you know, and then you create the materialized views, that's the write path, and then you have the read path, which is queryable state. This is this whole CQRS idea right? >> Yeah, Command-Query-Response. >> Exactly. >> So, this is actually interesting, and I guess this is critical, it's sort of like a new way of doing distributed databases. I know that's not the word you would choose, but it's like the derived data, managed by, sort of coming off of the state changes, then in the stream processor that goes through a single sort of append-only log, and then reading, and how do you manage consistency on the materialized views that derive data? >> Yeah, so we have seen Flink users implement that. So we have seen, you know, companies really base the complete product on the CQRS pattern. I think this is a little bit further out. 
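The CQRS shape discussed at the end of this exchange, the log as redo log, a fold over it as the materialized view, and replay as re-running the fold from an earlier offset, compresses into a very small sketch:

```python
# Toy CQRS: an append-only log is the write path; the materialized view is
# a fold over the log; "replay" and time travel are just re-folding from a
# chosen offset.

def apply_event(view, event):
    kind, key, value = event
    if kind == "put":
        view[key] = value
    elif kind == "delete":
        view.pop(key, None)
    return view

def materialize(log, upto=None):
    """Rebuild the view from the log; `upto` picks a logical point in time."""
    view = {}
    for event in (log if upto is None else log[:upto]):
        apply_event(view, event)
    return view

log = [("put", "a", 1), ("put", "b", 2), ("put", "a", 3), ("delete", "b", None)]
print(materialize(log))          # {'a': 3}         -- the current view
print(materialize(log, upto=2))  # {'a': 1, 'b': 2} -- the view as of offset 2
```

The "roll back the state and roll back the log to the same logical time" recovery described earlier is the same operation: restore a saved view, then re-apply the log from the offset that view corresponds to.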
Consistency-wise, Flink gives you the exactly once consistency on the write path, yeah. What we see a lot more is an architecture where there's a lot of transactional stores in the front end that are running, and then there needs to be some kind of global, of single source of truth, between all of them. And a very typical way to do that is to get these logs into a stream, and then have a Flink application that can actually scale to that. Create a single source of truth from all of these transactional stores. >> And by having, by feeding the transactional stores into this sort of hub, I presume, some cluster as a hub, and even if it's in the form of sort of a log, how can you replay it with sufficient throughput, I guess not to be a data warehouse but to, you know, have low latency for updating the derived data? And is that derived data I assume, in non-Flink products? >> Yeah, so the way it works is that, you know, you can get the change logs from the databases, you can use something like Kafka to buffer them up, and then you can use Flink for all the processing and to do the reprocessing with Flink, this is really one of the core strengths of Flink. Basically what you do is, you replay the Flink program together with the states you can get really, really high throughput reprocessing there. >> Where does the super high throughput come from? Is that because of the integration of state and logic? >> Yeah, that is because Flink is a true streaming engine. It is a high-performance streaming engine. And it manages the state, there's no tier, >> Crossing a boundary? >> no tier crossing and there's no boundary crossing when you access state. It's embedded in the Flink application. >> Okay, so that you can optimize the IO path? >> Correct. >> Okay, very, very interesting. 
So, it sounds like the Kafka guys, the Confluent folks, their aspirations, from the last time we talked to 'em, doesn't extend to analytics, you know, I don't know whether they want partners to do that, but it sounds like they have a similar topology, but they're, but I'm not clear how much of a first-class citizen state is, other than the log. How would you characterize the trade-offs between the two? >> Yeah, so, I mean obviously I cannot comment on Confluent, but like, what I think is that the state and the log are two very different things. You can think of the log as storage, it's a kind of hot storage because it's the most recent data but you know, you cannot query it, it's not a materialized view, right. So for me the separation is between processing state and storage. The log is is a kind of storage, so kind of message queue. State is really the active data, the real-time active data that needs to have consistency guarantees, and that's a completely different thing. >> Okay, and that's the, you're managing, it's almost like you're managing under the covers a distributed database. >> Yes, kind of. Yeah a distributed key-value store if you wish. >> Okay, okay, and then that's exposed through multiple interfaces, data stream, table. >> Data stream, table API, SQL, other languages in the future, et cetera. >> Okay, so going further down the line, how do you see the sort of use cases that are going to get you across the chasm from the big tech companies into the mainstream? >> Yeah, so we are already seeing that a lot. So we're doing a lot of work with financial services, insurance companies a lot of very traditional businesses. And it's really a lot about maintaining single source of truth, becoming more real-time in the way they interact with the outside world, and the customer, like they do see the need to transform. 
If we take financial services and investment banks for example, there is a big push in this industry to modernize the IT infrastructure, to get rid of legacy, to adopt modern solutions, become more real-time, et cetera. >> And so they really needed this, like the application platform, the dA Platform, because operationalizing what Netflix did is going to be very difficult, maybe, for non-tech companies. >> Yeah, I mean, you know, it's always a trade-off right, and you know for some, some companies build, some companies buy, and for many companies it's much more sensible to buy. That's why we have software products. And really, our motivation was that we worked in the open-source Flink community with all the big tech companies. We saw their successes, we saw what they built, we saw, you know, their failures. We saw everything and we decided to build this for everybody else, for everyone that, you know, is not Netflix, is not Uber, cannot hire software developers so easily, or with such good quality. >> Okay, alright, on that note, Kostas, we're going to have to end it, and to be continued, one with Stefan next, apparently. >> Nice. >> And then hopefully next year as well. >> Nice. Thank you. >> Alright, thanks Kostas. >> Thank you George. Alright, we're with Kostas Tzoumas, CEO of data Artisans, the company behind Apache Flink and now the application platform that makes Flink run for mainstream enterprises. We will be back, after this short break. (techno music)
Steven Wu, Netflix | Flink Forward 2018
>> Narrator: Live from San Francisco, it's theCube, covering Flink Forward, brought to you by Data Artisans. >> Hi, this is George Gilbert. We're back at Flink Forward, the Flink conference sponsored by Data Artisans, the company that commercializes Apache Flink and provides additional application management platforms that make it easy to take stream processing at scale for commercial organizations. We have Steven Wu from Netflix, always a company that is pushing the edge of what's possible, and one of the early Flink users. Steven, welcome. >> Thank you. >> And tell us a little about the use case that was first, you know, applied to Flink. >> Sure, our first use case is a routing job for the Keystone data pipeline. The Keystone data pipeline processes over three trillion events per day, so we have a thousand routing jobs that do some simple filtering and projection, but the scale of the routing jobs is a challenge for us, and we recently migrated our routing jobs to Apache Flink. >> And so is the function of a routing job, is it like an ETL pipeline? >> Not exactly an ETL pipeline, but more like it's a data pipeline to deliver data from the producers to the data sinks where people can consume those data, like Elasticsearch, Kafka, or Hive. >> Oh, so almost like the source and sink with a hub in the middle? >> Yes, that is exactly- >> Okay. >> That's the one with our big use case. And the other thing is our data engineers, they also need some stream processing today to do data analytics, so their job can be stateless or it can be stateful; if it's a stateful job it can be as big as a terabyte of state for a single job. >> So tell me what these stateful jobs, what are some of the things that you use state for? >> So, for example, like a session of user activity, like if you have clicked the video on the online UI, all those activities, they would need to be sessionalized into windows, for the windows, sessionalized, yeah, those are the states, typical.
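The sessionization state Steven describes, per-user activity grouped into sessions that close after an inactivity gap, is easy to show in miniature. Flink's session windows do this with event time, watermarks, and a state backend; the sketch below keeps only the core windowing logic, with invented users and timestamps:

```python
# Toy sessionizer: per-user event timestamps are grouped into sessions,
# and a gap larger than `timeout` between consecutive events closes the
# current session. The open-session dict is the "keyed state."

from collections import defaultdict

def sessionize(events, timeout):
    """events: (user, timestamp) pairs, in timestamp order per user."""
    open_sessions = defaultdict(list)      # keyed state: user -> event times
    closed = []
    for user, ts in events:
        window = open_sessions[user]
        if window and ts - window[-1] > timeout:
            closed.append((user, window))  # inactivity gap: emit the session
            window = []
        window.append(ts)
        open_sessions[user] = window
    # flush whatever is still open at end of input
    closed.extend((user, w) for user, w in open_sessions.items() if w)
    return closed

events = [("u1", 0), ("u2", 1), ("u1", 5), ("u1", 40), ("u2", 50)]
print(sessionize(events, timeout=30))
# u1 splits at the 5 -> 40 gap; u2 splits at the 1 -> 50 gap
```

At Netflix scale the open-session map is exactly the kind of state that grows toward a terabyte, which is why it has to live in the engine's managed state rather than in process memory alone.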
>> OK, and what sort of calculations might you be doing? And which of the Flink APIs are you using? >> So, right now we're using the data stream API, so a little bit low level, we haven't used the Flink SQL yet but it's in our road map, yeah. >> OK, so what is the data stream, you know, down closer to the metal, what does that give you control over, right now, that is attractive? And will you have as much control with the SQL API? >> OK, yes, so the low level data stream API can give you the full feature set of everything. High level SQL is much easier to use, but obviously the feature set is more limited. Yeah, so that's a trade-off there. >> So, tell me about, for a stateful application, is there sort of scaffolding about managing this distributed cluster that you had to build that you see coming down the pipe from Flink and Data Artisans that might make it easier, either for you or for mainstream customers? >> Sure, I think internal state management, I think that is where Flink really shines compared to other stream processing engines. So they do a lot of the work underneath already. I think the main thing we need from Flink for the future, near future, is regarding the job recovery performance. But the state management API is very mature. Flink is, I think it's more mature than most of the other stream processing engines. >> Meaning like Kafka, Spark. So, in the state management, can a business user or business analyst issue a SQL query across the cluster and Flink figures out how to manage the distribution of the query and the filtering and presentation of the results transparently across the cluster?
>> I'm not an expert on Flink SQL, but I think yes, essentially Flink SQL will convert to a Flink job which will be using the data stream API, so they will manage the state, yes, but, >> So, when you're using the lower level data stream API, you have to manage the distributed state and sort of retrieving and filtering, but that's something at a higher level abstraction, hopefully that'll be, >> No, I think that in either case, I think the state management is handled by Flink. >> Okay. >> Yeah. >> Distributed. >> All the state management, yes. >> Even if it's querying at the data stream level? >> Yeah, but if you query at the SQL level, you won't be able to deal with those state APIs directly. You can still do actual windowing. Let's say you have a SQL app doing a session window, sessions split by idle time; that would be translated to a Flink job, and Flink will manage those windows, manage that session state, so you do not need to worry about it. Either way, you do not need to worry about state management. Apache Flink takes care of it. >> So tell me, some of the other products you might have looked at, is the issue that if they have a clean separation from the storage layer, for large scale state management, you know, as opposed to, in memory, is it that the large scale is almost treated like a second tier and therefore, you almost have a separate set or a restricted set of operations at distributed state level versus at the compute level, would that be a limitation of other streaming processors? >> No, I don't see that. I think different stream engines have taken different approaches. You find, like, Google Cloud Dataflow, they are thinking about using Bigtable, for example. But that is external state management. Flink decided to take the approach of embedded state management inside of Flink. >> And when it's external, what's the trade-off?
>> That's a good question. I think if it's external, the latency may be higher, and your throughput might be a little lower, because you're going over the network. But the benefit of that external state management is that your job becomes stateless. That makes recovery much faster after a job failure, so there's a trade-off there. >> OK. >> Yes. >> OK, got it. Alright, Steven, we're going to have to end it on that, but that was most enlightening, and thanks for joining. >> Sure, thank you. >> This is George Gilbert, for Wikibon and theCube, we're again at Flink Forward in San Francisco with Data Artisans, we'll be back after a short break. (techno music)
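The embedded-versus-external state trade-off Steven describes can be sketched minimally. Both classes below are hypothetical stand-ins, not Flink or Dataflow code: the embedded counter keeps state inside the job process (fast per event, but the state must be checkpointed and restored on failure), while the external counter writes through to a store that outlives the job (a network round trip per event, but the worker itself stays stateless).

```python
# Hypothetical sketch of the trade-off: embedded state lives inside the job
# (fast, must be recovered on failure); external state lives in a remote
# store (one round trip per update, but the worker is stateless).
class EmbeddedCounter:
    def __init__(self):
        self.counts = {}   # lives inside the job; lost unless checkpointed

    def add(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

class ExternalCounter:
    def __init__(self, store):
        self.store = store  # stands in for e.g. Bigtable; survives the job

    def add(self, key):
        # each update is "one network round trip" in a real deployment
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]
```

With the external variant, a crashed worker restarts with nothing to restore, which is exactly the faster-recovery benefit mentioned above.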
Andrew Gao, Capital One | Flink Forward 2018
>> Announcer: Live from San Francisco. It's theCUBE. Covering Flink Forward, brought to you by data Artisans. >> Hi, this is George Gilbert, we're with theCUBE today at Flink Forward, data Artisans conference, annual conference here in San Francisco for the Apache Flink community and we're joined by Andrew Gao from Capital One. Capital One's always doing bleeding edge use cases with the latest technology. Andrew, good to have you. >> Thank you, it's good to be here. >> So, tell us about your latest, most bleeding edge use case with Apache Flink. What are you guys trying to enable? >> Yeah, sure, for the last year and a half, me and a couple teams have been working on developing a fraud decisioning platform on Kubernetes. We've been running in production since September, and we have three use cases on it now. >> Okay, so tell me about, let's pick one use case, and tell me what is it about stream processing to start with that makes it better and then let's talk about the Apache Flink tooling that's coming out to make it even more accessible. >> Sure, in terms of what we're using Apache Flink for, the use case I worked on specifically was the one where customers go to a bank and they either cash their check or try to withdraw cash, and we deployed a defense there to make real time decisions on their past transactions. >> And again, just to be clear, the defense is to make sure it should be authorized. That this is fraud or not fraud. >> Whether they should call their fraud operators or not, pretty much. >> Okay, so tell us how Flink made that better relative to what you were doing before. >> Sure, at Capital One, we're definitely sold on the idea of stream processing throughout the company; we're moving towards a Kappa-related architecture. We've had a pretty good experience just like, building up features in real time using Flink and then sending them off to our models, our machine learning models to return a result.
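The pattern Andrew describes — features built up in real time from the transaction stream, then handed to a pre-trained model for a fraud / not-fraud answer — might look like the sketch below. The feature (events per card in a rolling window) and the threshold "model" are invented for illustration; a real deployment would call an actual trained model.

```python
# Hypothetical sketch: maintain a rolling per-card feature from the stream
# and ask a "model" (here just a threshold) whether to flag the event.
from collections import defaultdict

class FraudDecisioner:
    def __init__(self, threshold=3):
        self.recent = defaultdict(list)  # per-card event times: the feature state
        self.threshold = threshold       # stand-in for a real ML model

    def decide(self, card, ts, window=3600):
        # keep only events inside the rolling window, then add this one
        events = [t for t in self.recent[card] if ts - t <= window]
        events.append(ts)
        self.recent[card] = events
        # "model": flag if too many events in the window
        return "review" if len(events) > self.threshold else "ok"
```

The point is the division of labor: the streaming job keeps features fresh event by event, while the model itself was trained offline and only gets asked questions here.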
>> Oh, so, okay so here the use case is sort of continuous learning for the models and to keep the application going live. >> Well we're not so far to the point of continuously training models, though that is probably the end goal. But we will have our models that we developed offline and then we'll use the transactions that are coming in through our streams to calculate these features and send those features to the models which will tell us whether it's fraud or not fraud. >> Okay, so now, Capital One and a few other companies are really sophisticated and have been able to take the open source code, you know, and then put the, sort of, infrastructure around it to make it easy to operate. Easier for your dev-ops teams. What is it that you see coming from the data Artisans folks that might make that job much easier? >> So, unfortunately we started developing our Kubernetes platform before any of this data Artisans platform was announced. So we're excited to, we've already done some work with deploying Flink applications on Kubernetes but we're definitely excited to see what the original contributors have to offer with their data Artisans platform. >> And in addition to the resource management, what about orchestrating, you know, what's essentially a very distributed application, where you have the compute and state management co-located on notes, and in fact, highly integrated. So, things like, have you been able to do elastic scaling and repartitioning of the data and check points, you know, for distributed state and then restoring that for rolling out new versions, things like that? >> So, so far, at least for Flink, we generally provision our clusters ahead of time, so there's no re-scaling at the Flink cluster level. We actually have multiple Flink clusters running on, like a single Kubernetes cluster. In terms of state management, so far, I won't say our experience has been painless, but it's been pretty good to us in terms of restoring from failures. 
>> Restoring from? >> Failures, like if we have task managers that die, Kubernetes can just let them die and it will recreate, it'll auto heal pretty much and recover from the checkpoint by itself. >> Okay, and have you guys been monitoring the capabilities rolling out with the DA platform? >> Yeah, especially with the resource manager. So right now as I said, we do have multiple Flink clusters on the Kubernetes platform, and that was pretty much to address the issue of resource sharing and that being a problem. Some apps, if one app died on the same Flink cluster, it could impact the other apps. So, we, our approach so far was to separate that, like have multiple Flink clusters and separate them by namespaces. But it seems like the resource manager could offer a similar feature. >> And so what are some of the things that you'd like to add that either your additional tooling might help with or where additional, sort of applications, support framework in the form of the DA platform might make some of your wishlist easier. >> That's a hard one. >> I didn't mean to stump you, yeah. >> That's a hard. >> 'Cause you have to weigh between what's coming from the vendor and what you're doing. >> Yeah exactly. We do have. There's a lot of things internally that we want to do, and so far we've pretty much dealt with our problems ourselves and just did workarounds, so we haven't had too much experience directing that type of work towards the Flink contributors. >> Okay, okay. Well it sounds like you're definitely pushing the envelope on production and I imagine you see lots more use cases coming down the road. >> Yeah, we have three use cases right now that are running in production but this fraud platform we're trying to build is supposed to handle pretty much all the bank fraud. >> All fraud, ultimately. >> Ideally. >> Ideally, wow okay.
On that note, Andrew, we should end it and hopefully we'll see you back next year and you can tell us how far you've gone. >> All right sounds good. >> So this is George Gilbert with Andrew Gao of Capital One and we will be back after this short break with more from Flink Forward in San Francisco.
Xiaowei Jiang | Flink Forward 2017
>> Welcome everyone, we're back at the first Flink Forward Conference in the U.S. It's the Flink User Conference sponsored by Data Artisans, the creators of Apache Flink. We're on the ground at the Kabuki Hotel, and we've heard some very high-impact customer presentations this morning, including Uber and Netflix. And we have the great honor to have Xiaowei Jiang from Alibaba with us. He's Senior Director of Research, and what's so special about having him as our guest is that they have the largest Flink cluster in operation in the world that we know of, and that the Flink folks know of as well. So welcome, Xiaowei. >> Thanks for having me. >> So we gather you have a 1,500 node cluster running Flink. Let's sort of unpack how you got there. What were some of the use cases that drove you in the direction of Flink and complementary technologies to build with? >> Okay, I'll explain a few use cases. The first use case that prompted us to look into Flink is the classical search ETL case, where basically it needs to process all the data that's necessary for the search service. So we looked into Flink about two years ago. The next case we use is the A/B testing framework, which is used to evaluate how your machine learning model works. So, today we're using it in a few other very interesting cases, like we are using machine learning to adjust the ranking of search results, to personalize your search results in real-time to deliver the best search results for our users. We are also using it to do real-time anti-fraud detection for ads. So these are the typical use cases we are doing. >> Okay, this is very interesting because with the ads, and the one before that, was it fraud? >> Ads is anti-fraud. Before that is machine learning, real-time machine learning. >> So for those, low latency is very important. Now, help unpack that.
Are you doing the training for these models like in a central location and then pushing the models out close to where they're going to be used for like the near real-time decisions? Or is that all run in the same cluster? >> Yeah, so basically we are doing two things. We use Flink to do real-time feature updates, which change the features in real-time, like in a few seconds. So for example, when a user buys a product, the inventory needs to be updated. Such features get reflected in the ranking of search results in real-time. We also use it to do real-time training of the model itself. This becomes important in some special events. For example, on China Singles Day, which is the largest shopping holiday in China, it generates more revenue than Black Friday in the United States already. On such a day, because things go on sale for almost 50 percent off, the user's behavior changes a lot. So whatever model you trained before does not work reliably. So it's really nice to have a way to adjust the model in real-time to deliver the best experience to our users. All this is actually running in the same cluster. >> OK, that's really interesting. So, it's like you have a multi-tenant solution that sounds like it's rather resource intensive. >> Yes. >> When you're changing a feature, or features, in the models, how do you go through the process of evaluating them and finding out their efficacy before you put them into production? >> Yeah, so this is exactly the A/B testing framework I just mentioned earlier. >> George: Okay. >> So, we also use Flink to track the metrics, the performance of these models in real time. Once these data are processed, we upload them into our OLAP system so we can see the performance of the models in real time. >> Okay. Very, very impressive. So, explain perhaps why Flink was appropriate for those use cases. Is it because you really needed super low latency, or that you wanted a less resource-intensive sort of streaming engine to support these?
What made it fit that right sweet spot? >> Yeah, so search has lots of different products. They have lots of different data processing needs, so when we looked into all these needs, we quickly realized we actually need a compute engine that can do both batch processing and streaming processing. And in terms of streaming processing, we have a few needs. For example, we really need super low latency. So in some cases, for example, if a product is sold out, and is still displayed in your search results, when users click and try to buy, they cannot buy it. It's a bad experience. So, the sooner you can get the data processed, the better. So with- >> So near real-time for you means, how many milliseconds does the- >> It's usually like a second. One second, something like that. >> But that's one second end to end talking to inventory. >> That's right. >> How much time would the model itself have to- >> Oh, it's very short. Yeah. >> George: In the single digit milliseconds? >> It's probably around that. There are some scenarios that require single digit milliseconds. Like a security scenario; that's something we are currently looking into. So when you do transactions on our site, we need to detect if it's a fraud transaction. We want to be able to block such transactions in real-time. For that to happen, we really need a latency that's below 10 milliseconds. So when we're looking at compute engines, this is also one of the requirements we will think about. So we really need a compute engine which is able to deliver sub-second latency if necessary, and at the same time can also do batch efficiently. So we are looking for solutions that can cover all our computation needs. >> So one way of looking at it is many vendors and customers talk about elasticity as in the size of the cluster, but you're talking about elasticity or scaling in terms of latency. >> Yes, latency and the way of doing computation.
So you can view the security scenario as super strict on the latency requirement, but view batch as the most relaxed version of the latency requirement. We want the full spectrum; each is a part of the full spectrum. It's possible that you can use different engines for each scenario, but that means you are required to maintain more code bases, which can be a headache. And we believe it's possible to have a single solution that works for all these use cases. >> So, okay, last question. Help us understand, for mainstream customers who don't hire the top Ph.D.s out of the Chinese universities but who have skilled data scientists but not an unending supply, and aspire to build solutions like this; tell us some of the trade-offs they should consider given that, you know, the skillset and the bench strength is very deep at Alibaba, and it's perhaps not as widely disseminated or dispersed within a mainstream enterprise. How should they think about the trade-offs in terms of the building blocks for this type of system? >> Yeah, that's a very good question. So we actually thought about this. So, initially what we did is we were using the DataSet and DataStream APIs, which are relatively lower level APIs. So to develop an application with this is reasonable, but it still requires some skill. So we want a way to make it even simpler, for example, to make it possible for data scientists to do this. And so in the last half a year, we spent a lot of time working on the Table API and SQL support, which basically tries to describe your computation logic or data processing logic using SQL. SQL is used widely, so a lot of people have experience in it. So we are hoping that with this approach, it will greatly lower the threshold for people to use Flink. At the same time, SQL is also a nice way to unify the streaming processing and the batch processing. With SQL, you only need to write your processing logic once. You can run it in different modes.
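Xiaowei's point about writing the processing logic once for both modes can be shown with a toy aggregation: the same sum, computed over a bounded batch or updated incrementally as events arrive, converges to the same answer. The functions below are illustrative Python, not Flink's Table API.

```python
# Toy illustration of batch/stream unification: one aggregation, two
# execution modes. The streaming version emits a running result per event;
# its final emission equals the batch result over the same data.
def batch_sum(events):
    """Batch mode: see all the data, compute once."""
    return sum(events)

def stream_sum(events):
    """Streaming mode: update and emit incrementally, one event at a time."""
    total = 0
    emitted = []
    for e in events:
        total += e
        emitted.append(total)  # result emitted after each arriving event
    return emitted

events = [3, 1, 4, 1, 5]
```

This is the property a unifying SQL layer relies on: a bounded input is just a stream that happens to end, so the last streaming result matches the batch answer.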
>> So, okay this is interesting because some of the Flink folks say you know, structured streaming, which is a table construct with dataframes in Spark, is not a natural way to think about streaming. And yet, the Spark guys say hey that's what everyone's comfortable with. We'll live with probabilistic answers instead of deterministic answers, because we might have late arrivals in the data. But it sounds like there's a feeling in the Flink community that you really do want to work with tables despite their shortcomings, because so many people understand them. >> So ease of use is definitely one of the strengths of SQL, and the other strength of SQL is it's very descriptive. The user doesn't need to say exactly how you do the computation, but it just tells you what I want to get. This gives the framework a lot of freedom in optimization. So users don't need to worry about hard details to optimize their code. It lets the system do its work. At the same time, I think that deterministic things can be achieved in SQL. It just means the framework needs to handle such kind things correctly with the implementation of SQL. >> Okay. >> When using SQL, you are not really sacrificing such determinism. >> Okay. This is, we'll have to save this for a follow-up conversation, because there's more to unpack there. But Xiaowei Jiang, thank you very much for joining us and imparting some of the wisdom from Alibaba. We are on the ground at Flink Forward, the Data Artisans conference for the Flink community at the Kabuki hotel in San, Francisco; and we'll be right back.
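The real-time feature update Xiaowei described earlier — a purchase decrementing inventory so the ranking reflects it within seconds — can be sketched as follows. The item names and the scoring rule are assumptions for illustration only, not Alibaba's ranking logic.

```python
# Hypothetical sketch: a purchase event updates a feature (inventory), and
# the ranking function reads the fresh value immediately, so sold-out items
# drop to the bottom of the results without retraining anything.
inventory = {"itemA": 1, "itemB": 5}

def on_purchase(item):
    """Feature update applied as the purchase event streams in."""
    inventory[item] = max(0, inventory[item] - 1)

def rank(items):
    """Toy ranking: in-stock items first, then by remaining inventory."""
    return sorted(items, key=lambda i: (inventory[i] == 0, -inventory[i]))
```

The model itself need not change here; only a feature it consumes is kept fresh by the stream, which is the first of the two things described above (the second, real-time training, updates the model as well).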
Dean Wampler Ph.D | Flink Forward 2017
>> Welcome everyone to the first ever U.S. user conference of Apache Flink, sponsored by data Artisans, the creators of Flink. The conference kicked off this morning with some very high-profile customer use cases, including Netflix and Uber, which were quite impressive. We're on the ground at the Kabuki Hotel in San Francisco and our first guest is Dean Wampler, VP of fast data engineering at Lightbend. Welcome Dean. >> Thank you. Good to see you again George. >> So, big picture context setting, Spark exploded on the scene, blew away the expectations, even of its creators, with the speed and the deeply integrated libraries, and essentially replaced MapReduce really quickly. >> Yeah. >> So what is behind Flink's rapid adoption? >> Right, I think it's an interesting story and if you'd asked me a year ago, I probably would've said, well I'm not sure we really need Flink, Spark seems to meet all our needs. But, I pretty quickly changed my mind as I got to know about Flink because, it is a broad ecosystem, there's a wide variety of problems people are trying to solve, and what Flink is doing very well is solving low latency streaming, but still at scale, like Spark. Where Spark is still primarily a micro-batch model, so it has longer latency. And Flink has been on the cutting edge too, of embracing some of the more advanced streaming scenarios, like proper handling of late arrival of data, windowing semantics, things like this. So it's really filling an important niche, but a fairly broad niche that people have. And also, not everybody needs the full-featured capabilities of Spark like batch analytics or whatever, and so having one tool that's focused just on processing streams is often a good idea.
And there is something very attractive about having a more focused tool with, you know, fewer things to break basically. >> You mentioned sort of lower latency and a few extra, a few fewer bells and whistles. Can you give us some examples of use cases where you wouldn't need perhaps all of the integrated libraries of Spark or the big footprint that gives you all that resilience and, you know, the functional programming that lets you sort of, recreate lineage. Tell us sort of how a customer who's approaching this should pick the trade-offs. >> Right. Well normally when you have a low latency problem, it means you have less time to do work, so you tend to do simpler things, in that time frame. But, just to give you a really interesting example, I was talking with a development team at a bank recently that does credit card authorizations. You click "buy" on a website and there's maybe a few hundred milliseconds when the user is expecting a reply, right. But it turns out there's so many things going on in that loop, from browser to servers and back, that they only have about ten milliseconds, when they get the data, to make a decision about whether this looks fraudulent or it looks legit, and they make a decision. So ten milliseconds is fairly narrow, that means you have to have your models already done and ready to go. And a quick way to actually apply them, you know, take this data, ask the model is this okay, and get a response. So, a lot of it is kind of boiling down to that, it's either, I would say one of two things, it's either I'm doing basic filtering, transforming of data, like raw data coming into my environment. Or I have some maybe more sophisticated analytics that are running behind the scenes, and then in real time, so to speak, data is coming in and I'm asking questions against those models about this data, like authorizing credit cards.
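Dean's ten-millisecond authorization example implies the model work happens ahead of time: train offline, push per-key scores or parameters to a fast store, and keep the online path to a key lookup plus a cheap comparison. A toy sketch follows; the store contents, key format, and threshold are invented for illustration.

```python
# Hypothetical sketch of a low-latency decision path: the "model" already
# ran offline, so the online step is just a key-based lookup plus a
# comparison, keeping the hot path well inside a tight latency budget.
score_store = {"card-123": 0.02, "card-456": 0.91}  # precomputed risk scores

def authorize(card_id, threshold=0.5):
    """Return True to authorize; unknown cards default to lowest risk here."""
    risk = score_store.get(card_id, 0.0)  # O(1) lookup, no model inference
    return risk < threshold
```

Anything expensive (training, scoring large feature sets) happens off this path and only refreshes the store, which is what makes a single-digit-millisecond budget plausible at all.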
Okay, so trained and scored in the background and then, with this low latency solution you can look up, key based look up I guess, to an external store, okay. So how is Lightbend making it simple to put, what essentially has to be for any pipeline it appears, multiple products together seamlessly. >> That is the challenge. I mean it would be great if you could just deploy Flink, and that was the only thing you needed or Kafka, or pick any one of them. But of course, the reality is, we always have to integrate a bunch of tools together, and it's that integration that's usually the hard part. How do I know why this thing's misbehaving, when maybe it's something upstream that's misbehaving? That sort of thing. So, we've been surveying the landscape to understand, first of all, what are the tools that seem to be most mature, most vibrant as a community, that address the variety of scenarios people are trying to deal with, some of which we just discussed. And what are the kind of integration problems that you have to solve to make these reliable systems? So we've been building a platform, called the Fast Data Platform, that's approaching its first beta, that is designed to help solve a lot of those problems for you, so you can focus on your actual business problems. >> And from a customer point of view, would you take end-to-end ownership of that solution, so that if they chose you could manage it On-Prem or in the Cloud, and handle level three support across the stack? >> That's an interesting question. We think eventually we'll get to that point of more of a service offering, but right now most of the customers we're talking to are still more interested in managing things themselves, but not having as much of a hassle of doing it themselves. 
So what we're trying to balance is tooling that makes it easier to get started quickly and build applications, but that also leverages some of the modern machine-learning and artificial-intelligence techniques to automatically detect and correct a lot of common problems, and other management scenarios. So at least it's not quite as much of a "you're on your own" situation as it could be if you were just trying to glue everything together yourself. >> So if I understand, it sounds like the first stage in the journey is: help me rationalize what I'm trying to get to work together On-Prem, and part of that is using machine-learning now as part of management. And then, over time, this management gets better and better at root-cause analysis and auto-remediation, and then it can move into the Cloud, and these disparate components become part of a single SaaS solution under that management. >> That's the long-term goal, definitely, yeah. >> Looking out at where all this intense interest is right now in IoT applications: we can't really send all the data back to the Cloud, get an immediate answer, and then drive an action. How do you see that shaping up in terms of what's on the edge and what's in the Cloud? >> Yeah, that's a really interesting question, and there are some particular challenges, because a lot of companies will migrate to the Cloud in a piecemeal fashion, so they've got a sort of hybrid deployment scenario with things On-Premise and in the Cloud, and so forth. One of the things you mentioned that's pretty important is: I've got all this data coming in, how do I capture it reliably? So tools like Kafka are really good for that, and Pravega, which Strachan from EMC mentioned, is filling the same need: I need to capture stuff reliably, serve downstream consumers, and make it easy to do analytics over this stream, which looks a lot different than a traditional database, where the data is kind of at rest; it's not static, but it's not moving.
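The capture-reliably-and-serve-downstream-consumers role described for Kafka can be illustrated with a toy in-memory log. This is a stand-in for the access pattern only, not the technology: real Kafka adds partitioning, replication, retention, and durable storage, none of which are modeled here.

```python
from collections import defaultdict

class Log:
    """Toy append-only log: events are captured once, and multiple downstream
    consumers each read at their own offset without interfering with each other,
    mirroring the consumer-offset pattern Kafka popularized."""

    def __init__(self):
        self._events = []
        self._offsets = defaultdict(int)  # consumer name -> next offset to read

    def append(self, event):
        self._events.append(event)

    def poll(self, consumer, max_records=10):
        start = self._offsets[consumer]
        batch = self._events[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

log = Log()
for amount in (12.5, 980.0, 33.1):
    log.append({"amount_usd": amount})

# Two independent consumers each see the full stream.
print(len(log.poll("scorer")))    # 3
print(len(log.poll("archiver")))  # 3
print(len(log.poll("scorer")))    # 0 -- this consumer has caught up
```

The key property, and the reason such logs look "a lot different than a traditional database," is that consumers read a moving stream at their own pace rather than querying a snapshot at rest.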
So that's one of the things you have to do well, and then figure out how to get that data to the right consumer, and account for all of the latencies. Like, if I needed that ten-millisecond credit card authorization, but I had data split over my On-Premise and my Cloud environments, you know, that would not work very well. So a lot of that kind of data-flow architecture becomes really important. >> Do you see Lightbend offering that management solution that enforces SLAs, or do you see sourcing that technology from others and then integrating it tightly with the particular software building blocks that make up the pipeline? >> It's a little of both. We're sort of in the early stages of building services along those lines. Some of the technology we've had for a while, our Akka middleware system and the streaming API on top of it, would be a really good basis for that kind of platform, where you can think about SLA requirements and trade off performance, or whatever, versus getting answers in a reasonable time, good recovery and error scenarios, stuff like that. So it's all early days, but we are thinking very hard about that problem, because ultimately, at the end of the day, that's what customers care about. They don't care about Kafka versus Spark, or whatever. They just care about: I've got data coming in, I need an answer in ten milliseconds or I lose money, and that's the kind of thing they want you to solve for them, so that's really what we have to focus on. >> So, last question before we have to go: do you see potentially a scenario where there's one type of technology on the edge, or many types, and then something more dominant in the Cloud, where basically you do more training, model training, and out on the edge you do the low-latency predictions or prescriptions? >> That's pretty much the architecture that has emerged.
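The train-in-the-cloud, predict-at-the-edge split both speakers describe can be sketched like this. The model, the data, and the function names are all illustrative assumptions: a slow "cloud" job fits a tiny logistic-regression model in the background, then ships only the learned parameters to an "edge" scorer that makes cheap per-event predictions.

```python
import math

def train_in_cloud(examples, epochs=300, lr=0.1):
    """Slow, latency-tolerant background job: fit a 1-feature logistic
    regression by stochastic gradient descent and return its parameters."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def edge_predict(params, x):
    """Fast path: just apply the shipped parameters -- no training here."""
    w, b = params
    return 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5

# The "cloud" learns from labeled history that large values look risky...
data = [(0.1, 0), (0.2, 0), (0.3, 0), (2.0, 1), (2.5, 1), (3.0, 1)]
params = train_in_cloud(data)
# ...and the "edge" applies the model with none of the heavy lifting.
print(edge_predict(params, 0.2), edge_predict(params, 2.8))
```

As the conversation notes, this split is largely independent of physical deployment: the "edge" half could just as well be a Europe-hosted Cloud service sitting close to its consumers, with the training half back in the U.S.A.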
I'm going to talk a little bit about this today in my talk, where, like we said earlier, I may have a very short window in which I have to make a decision, but it's based on a model that I have been building for a while and can build in the background, where I have more tolerance for the time it takes. >> Up in the Cloud? >> Up in the Cloud. Actually, this is kind of independent of deployment scenario, but it could be both, like that: you could have something that is closer to the consumer of the data, maybe in the Cloud and deployed in Europe for European customers, but working with systems back in the U.S.A. that are doing the heavy lifting of building these models and so forth. We live in a world where you can put things where you want, you can move things around, you can glue things together, and a lot of times it's just knowing what's the right combination of stuff. >> Alright Dean, it was great to see you and to hear the story. It sounds compelling. >> Thank you very much. >> So, this is George Gilbert. We are on the ground at Flink Forward, data Artisans' user conference for the Flink product, and we will be back after this short break.