Joseph Nelson, Roboflow | AWS Startup Showcase

(chill electronic music) >> Hello everyone, welcome to theCUBE's presentation of the AWS Startups Showcase, AI and machine learning, the top startups building generative AI on AWS. This is the season three, episode one of the ongoing series covering the exciting startups from the AWS ecosystem, talk about AI and machine learning. Can't believe it's three years and season one. I'm your host, John Furrier. Got a great guest today, we're joined by Joseph Nelson, the co-founder and CEO of Roboflow, doing some cutting edge stuff around computer vision and really at the front end of this massive wave coming around, large language models, computer vision. The next gen AI is here, and it's just getting started. We haven't even scratched a service. Thanks for joining us today. >> Thanks for having me. >> So you got to love the large language model, foundation models, really educating the mainstream world. ChatGPT has got everyone in the frenzy. This is educating the world around this next gen AI capabilities, enterprise, image and video data, all a big part of it. I mean the edge of the network, Mobile World Conference is happening right now, this month, and it's just ending up, it's just continue to explode. Video is huge. So take us through the company, do a quick explanation of what you guys are doing, when you were founded. Talk about what the company's mission is, and what's your North Star, why do you exist? >> Yeah, Roboflow exists to really kind of make the world programmable. I like to say make the world be read and write access. And our North Star is enabling developers, predominantly, to build that future. If you look around, anything that you see will have software related to it, and can kind of be turned into software. The limiting reactant though, is how to enable computers and machines to understand things as well as people can. And in a lot of ways, computer vision is that missing element that enables anything that you see to become software. So in the virtue of, if software is eating the world, computer vision kind of makes the aperture infinitely wide. It's something that I kind of like, the way I like to frame it. And the capabilities are there, the open source models are there, the amount of data is there, the computer capabilities are only improving annually, but there's a pretty big dearth of tooling, and an early but promising sign of the explosion of use cases, models, and data sets that companies, developers, hobbyists alike will need to bring these capabilities to bear. So Roboflow is in the game of building the community around that capability, building the use cases that allow developers and enterprises to use computer vision, and providing the tooling for companies and developers to be able to add computer vision, create better data sets, and deploy to production, quickly, easily, safely, invaluably. >> You know, Joseph, the word in production is actually real now. You're seeing a lot more people doing in production activities. That's a real hot one and usually it's slower, but it's gone faster, and I think that's going to be more the same. And I think the parallel between what we're seeing on the large language models coming into computer vision, and as you mentioned, video's data, right? I mean we're doing video right now, we're transcribing it into a transcript, linking up to your linguistics, times and the timestamp, I mean everything's data and that really kind of feeds. So this connection between what we're seeing, the large language and computer vision are coming together kind of cousins, brothers. I mean, how would you compare, how would you explain to someone, because everyone's like on this wave of watching people bang out their homework assignments, and you know, write some hacks on code with some of the open AI technologies, there is a corollary directly related to to the vision side. Can you explain? >> Yeah, the rise of large language models are showing what's possible, especially with text, and I think increasingly will get multimodal as the images and video become ingested. Though there's kind of this still core missing element of basically like understanding. So the rise of large language models kind of create this new area of generative AI, and generative AI in the context of computer vision is a lot of, you know, creating video and image assets and content. There's also this whole surface area to understanding what's already created. Basically digitizing physical, real world things. I mean the Metaverse can't be built if we don't know how to mirror or create or identify the objects that we want to interact with in our everyday lives. And where computer vision comes to play in, especially what we've seen at Roboflow is, you know, a little over a hundred thousand developers now have built with our tools. That's to the tune of a hundred million labeled open source images, over 10,000 pre-trained models. And they've kind of showcased to us all of the ways that computer vision is impacting and bringing the world to life. And these are things that, you know, even before large language models and generative AI, you had pretty impressive capabilities, and when you add the two together, it actually unlocks these kind of new capabilities. So for example, you know, one of our users actually powers the broadcast feeds at Wimbledon. So here we're talking about video, we're streaming, we're doing things live, we've got folks that are cropping and making sure we look good, and audio/visual all plugged in correctly. When you broadcast Wimbledon, you'll notice that the camera controllers need to do things like track the ball, which is moving at extremely high speeds and zoom crop, pan tilt, as well as determine if the ball bounced in or out. The very controversial but critical key to a lot of tennis matches. And a lot of that has been historically done with the trained, but fallible human eye and computer vision is, you know, well suited for this task to say, how do we track, pan, tilt, zoom, and see, track the tennis ball in real time, run at 30 plus frames per second, and do it all on the edge. And those are capabilities that, you know, were kind of like science fiction, maybe even a decade ago, and certainly five years ago. Now the interesting thing, is that with the advent of of generative AI, you can start to do things like create your own training data sets, or kind of create logic around once you have this visual input. And teams at Tesla have actually been speaking about, of course the autopilot team's focused on doing vision tasks, but they've combined large language models to add reasoning and logic. So given that you see, let's say the tennis ball, what do you want to do? And being able to combine the capabilities of what LLM's represent, which is really a lot of basically, core human reasoning and logic, with computer vision for the inputs of what's possible, creates these new capabilities, let alone multimodality, which I'm sure we'll talk more about. >> Yeah, and it's really, I mean it's almost intoxicating. It's amazing that this is so capable because the cloud scales here, you got the edge developing, you can decouple compute power, and let Moore's law and all the new silicone and the processors and the GPUs do their thing, and you got open source booming. You're kind of getting at this next segment I wanted to get into, which is the, how people should be thinking about these advances of the computer vision. So this is now a next wave, it's here. I mean I'd love to have that for baseball because I'm always like, "Oh, it should have been a strike." I'm sure that's going to be coming soon, but what is the computer vision capable of doing today? I guess that's my first question. You hit some of it, unpack that a little bit. What does general AI mean in computer vision? What's the new thing? Because there are old technology's been around, proprietary, bolted onto hardware, but hardware advances at a different pace, but now you got new capabilities, generative AI for vision, what does that mean? >> Yeah, so computer vision, you know, at its core is basically enabling machines, computers, to understand, process, and act on visual data as effective or more effective than people can. Traditionally this has been, you know, task types like classification, which you know, identifying if a given image belongs in a certain category of goods on maybe a retail site, is the shoes or is it clothing? Or object detection, which is, you know, creating bounding boxes, which allows you to do things like count how many things are present, or maybe measure the speed of something, or trigger an alert when something becomes visible in frame that wasn't previously visible in frame, or instant segmentation where you're creating pixel wise segmentations for both instance and semantic segmentation, where you often see these kind of beautiful visuals of the polygon surrounding objects that you see. Then you have key point detection, which is where you see, you know, athletes, and each of their joints are kind of outlined is another more traditional type problem in signal processing and computer vision. With generative AI, you kind of get a whole new class of problem types that are opened up. So in a lot of ways I think about generative AI in computer vision as some of the, you know, problems that you aimed to tackle, might still be better suited for one of the previous task types we were discussing. Some of those problem types may be better suited for using a generative technique, and some are problem types that just previously wouldn't have been possible absent generative AI. And so if you make that kind of Venn diagram in your head, you can think about, okay, you know, visual question answering is a task type where if I give you an image and I say, you know, "How many people are in this image?" We could either build an object detection model that might count all those people, or maybe a visual question answering system would sufficiently answer this type of problem. Let alone generative AI being able to create new training data for old systems. And that's something that we've seen be an increasingly prominent use case for our users, as much as things that we advise our customers and the community writ large to take advantage of. So ultimately those are kind of the traditional task types. I can give you some insight, maybe, into how I think about what's possible today, or five years or ten years as you sort go back. >> Yes, definitely. Let's get into that vision. >> So I kind of think about the types of use cases in terms of what's possible. If you just imagine a very simple bell curve, your normal distribution, for the longest time, the types of things that are in the center of that bell curve are identifying objects that are very common or common objects in context. Microsoft published the COCO Dataset in 2014 of common objects and contexts, of hundreds of thousands of images of chairs, forks, food, person, these sorts of things. And you know, the challenge of the day had always been, how do you identify just those 80 objects? So if we think about the bell curve, that'd be maybe the like dead center of the curve, where there's a lot of those objects present, and it's a very common thing that needs to be identified. But it's a very, very, very small sliver of the distribution. Now if you go out to the way long tail, let's go like deep into the tail of this imagined visual normal distribution, you're going to have a problem like one of our customers, Rivian, in tandem with AWS, is tackling, to do visual quality assurance and manufacturing in production processes. Now only Rivian knows what a Rivian is supposed to look like. Only they know the imagery of what their goods that are going to be produced are. And then between those long tails of proprietary data of highly specific things that need to be understood, in the center of the curve, you have a whole kind of messy middle, type of problems I like to say. The way I think about computer vision advancing, is it's basically you have larger and larger and more capable models that eat from the center out, right? So if you have a model that, you know, understands the 80 classes in COCO, well, pretty soon you have advances like Clip, which was trained on 400 million image text pairs, and has a greater understanding of a wider array of objects than just 80 classes in context. And over time you'll get more and more of these larger models that kind of eat outwards from that center of the distribution. And so the question becomes for companies, when can you rely on maybe a model that just already exists? How do you use your data to get what may be capable off the shelf, so to speak, into something that is usable for you? Or, if you're in those long tails and you have proprietary data, how do you take advantage of the greatest asset you have, which is observed visual information that you want to put to work for your customers, and you're kind of living in the long tails, and you need to adapt state of the art for your capabilities. So my mental model for like how computer vision advances is you have that bell curve, and you have increasingly powerful models that eat outward. And multimodality has a role to play in that, larger models have a role to play in that, more compute, more data generally has a role to play in that. But it will be a messy and I think long condition. >> Well, the thing I want to get, first of all, it's great, great mental model, I appreciate that, 'cause I think that makes a lot of sense. The question is, it seems now more than ever, with the scale and compute that's available, that not only can you eat out to the middle in your example, but there's other models you can integrate with. In the past there was siloed, static, almost bespoke. Now you're looking at larger models eating into the bell curve, as you said, but also integrating in with other stuff. So this seems to be part of that interaction. How does, first of all, is that really happening? Is that true? And then two, what does that mean for companies who want to take advantage of this? Because the old model was operational, you know? I have my cameras, they're watching stuff, whatever, and like now you're in this more of a, distributed computing, computer science mindset, not, you know, put the camera on the wall kind of- I'm oversimplifying, but you know what I'm saying. What's your take on that? >> Well, to the first point of, how are these advances happening? What I was kind of describing was, you know, almost uni-dimensional in that you have like, you're only thinking about vision, but the rise of generative techniques and multi-modality, like Clip is a multi-modal model, it has 400 million image text pairs. That will advance the generalizability at a faster rate than just treating everything as only vision. And that's kind of where LLMs and vision will intersect in a really nice and powerful way. Now in terms of like companies, how should they be thinking about taking advantage of these trends? The biggest thing that, and I think it's different, obviously, on the size of business, if you're an enterprise versus a startup. The biggest thing that I think if you're an enterprise, and you have an established scaled business model that is working for your customers, the question becomes, how do you take advantage of that established data moat, potentially, resource moats, and certainly, of course, establish a way of providing value to an end user. So for example, one of our customers, Walmart, has the advantage of one of the largest inventory and stock of any company in the world. And they also of course have substantial visual data, both from like their online catalogs, or understanding what's in stock or out of stock, or understanding, you know, the quality of things that they're going from the start of their supply chain to making it inside stores, for delivery of fulfillments. All these are are visual challenges. Now they already have a substantial trove of useful imagery to understand and teach and train large models to understand each of the individual SKUs and products that are in their stores. And so if I'm a Walmart, what I'm thinking is, how do I make sure that my petabytes of visual information is utilized in a way where I capture the proprietary benefit of the models that I can train to do tasks like, what item was this? Or maybe I'm going to create AmazonGo-like technology, or maybe I'm going to build like delivery robots, or I want to automatically know what's in and out of stock from visual input fees that I have across my in-store traffic. And that becomes the question and flavor of the day for enterprises. I've got this large amount of data, I've got an established way that I can provide more value to my own customers. How do I ensure I take advantage of the data advantage I'm already sitting on? If you're a startup, I think it's a pretty different question, and I'm happy to talk about. >> Yeah, what's startup angle on this? Because you know, they're going to want to take advantage. It's like cloud startups, cloud native startups, they were born in the cloud, they never had an IT department. So if you're a startup, is there a similar role here? And if I'm a computer vision startup, what's that mean? So can you share your your take on that, because there'll be a lot of people starting up from this. >> So the startup on the opposite advantage and disadvantage, right? Like a startup doesn't have an proven way of delivering repeatable value in the same way that a scaled enterprise does. But it does have the nimbleness to identify and take advantage of techniques that you can start from a blank slate. And I think the thing that startups need to be wary of in the generative AI enlarged language model, in multimodal world, is building what I like to call, kind of like sandcastles. A sandcastle is maybe a business model or a capability that's built on top of an assumption that is going to be pretty quickly wiped away by improving underlying model technology. So almost like if you imagine like the ocean, the waves are coming in, and they're going to wipe away your progress. You don't want to be in the position of building sandcastle business where, you don't want to bet on the fact that models aren't going to get good enough to solve the task type that you might be solving. In other words, don't take a screenshot of what's capable today. Assume that what's capable today is only going to continue to become possible. And so for a startup, what you can do, that like enterprises are quite comparatively less good at, is embedding these capabilities deeply within your products and delivering maybe a vertical based experience, where AI kind of exists in the background. >> Yeah. >> And we might not think of companies as, you know, even AI companies, it's just so embedded in the experience they provide, but that's like the vertical application example of taking AI and making it be immediately usable. Or, of course there's tons of picks and shovels businesses to be built like Roboflow, where you're enabling these enterprises to take advantage of something that they have, whether that's their data sets, their computes, or their intellect. >> Okay, so if I hear that right, by the way, I love, that's horizontally scalable, that's the large language models, go up and build them the apps, hence your developer focus. I'm sure that's probably the reason that the tsunami of developer's action. So you're saying picks and shovels tools, don't try to replicate the platform of what could be the platform. Oh, go to a VC, I'm going to build a platform. No, no, no, no, those are going to get wiped away by the large language models. Is there one large language model that will rule the world, or do you see many coming? >> Yeah, so to be clear, I think there will be useful platforms. I just think a lot of people think that they're building, let's say, you know, if we put this in the cloud context, you're building a specific type of EC2 instance. Well, it turns out that Amazon can offer that type of EC2 instance, and immediately distribute it to all of their customers. So you don't want to be in the position of just providing something that actually ends up looking like a feature, which in the context of AI, might be like a small incremental improvement on the model. If that's all you're doing, you're a sandcastle business. Now there's a lot of platform businesses that need to be built that enable businesses to get to value and do things like, how do I monitor my models? How do I create better models with my given data sets? How do I ensure that my models are doing what I want them to do? How do I find the right models to use? There's all these sorts of platform wide problems that certainly exist for businesses. I just think a lot of startups that I'm seeing right now are making the mistake of assuming the advances we're seeing are not going to accelerate or even get better. >> So if I'm a customer, if I'm a company, say I'm a startup or an enterprise, either one, same question. And I want to stand up, and I have developers working on stuff, I want to start standing up an environment to start doing stuff. Is that a service provider? Is that a managed service? Is that you guys? So how do you guys fit into your customers leaning in? Is it just for developers? Are you targeting with a specific like managed service? What's the product consumption? How do you talk to customers when they come to you? >> The thing that we do is enable, we give developers superpowers to build automated inventory tracking, self-checkout systems, identify if this image is malignant cancer or benign cancer, ensure that these products that I've produced are correct. Make sure that that the defect that might exist on this electric vehicle makes its way back for review. All these sorts of problems are immediately able to be solved and tackled. In terms of the managed services element, we have solutions as integrators that will often build on top of our tools, or we'll have companies that look to us for guidance, but ultimately the company is in control of developing and building and creating these capabilities in house. I really think the distinction is maybe less around managed service and tool, and more around ownership in the era of AI. So for example, if I'm using a managed service, in that managed service, part of their benefit is that they are learning across their customer sets, then it's a very different relationship than using a managed service where I'm developing some amount of proprietary advantages for my data sets. And I think that's a really important thing that companies are becoming attuned to, just the value of the data that they have. And so that's what we do. We tell companies that you have this proprietary, immense treasure trove of data, use that to your advantage, and think about us more like a set of tools that enable you to get value from that capability. You know, the HashiCorp's and GitLab's of the world have proven like what these businesses look like at scale. >> And you're targeting developers. When you go into a company, do you target developers with freemium, is there a paid service? Talk about the business model real quick. >> Sure, yeah. The tools are free to use and get started. When someone signs up for Roboflow, they may elect to make their work open source, in which case we're able to provide even more generous usage limits to basically move the computer vision community forward. If you elect to make your data private, you can use our hosted data set managing, data set training, model deployment, annotation tooling up to some limits. And then usually when someone validates that what they're doing gets them value, they purchase a subscription license to be able to scale up those capabilities. So like most developer centric products, it's free to get started, free to prove, free to poke around, develop what you think is possible. And then once you're getting to value, then we're able to capture the commercial upside in the value that's being provided. >> Love the business model. It's right in line with where the market is. There's kind of no standards bodies these days. The developers are the ones who are deciding kind of what the standards are by their adoption. I think making that easy for developers to get value as the model open sources continuing to grow, you can see more of that. Great perspective Joseph, thanks for sharing that. Put a plug in for the company. What are you guys doing right now? Where are you in your growth? What are you looking for? How should people engage? Give the quick commercial for the company. >> So as I mentioned, Roboflow is I think one of the largest, if not the largest collections of computer vision models and data sets that are open source, available on the web today, and have a private set of tools that over half the Fortune 100 now rely on those tools. So we're at the stage now where we know people want what we're working on, and we're continuing to drive that type of adoption. So companies that are looking to make better models, improve their data sets, train and deploy, often will get a lot of value from our tools, and certainly reach out to talk. I'm sure there's a lot of talented engineers that are tuning in too, we're aggressively hiring. So if you are interested in being a part of making the world programmable, and being at the ground floor of the company that's creating these capabilities to be writ large, we'd love to hear from you. >> Amazing, Joseph, thanks so much for coming on and being part of the AWS Startup Showcase. Man, if I was in my twenties, I'd be knocking on your door, because it's the hottest trend right now, it's super exciting. Generative AI is just the beginning of massive sea change. Congratulations on all your success, and we'll be following you guys. Thanks for spending the time, really appreciate it. >> Thanks for having me. >> Okay, this is season three, episode one of the ongoing series covering the exciting startups from the AWS ecosystem, talking about the hottest things in tech. I'm John Furrier, your host. Thanks for watching. (chill electronic music)

Published Date : Mar 9 2023

SUMMARY :

of the AWS Startups Showcase, of what you guys are doing, of the explosion of use and you know, write some hacks on code and do it all on the edge. and the processors and of the traditional task types. Let's get into that vision. the greatest asset you have, eating into the bell curve, as you said, and flavor of the day for enterprises. So can you share your your take on that, that you can start from a blank slate. but that's like the that right, by the way, How do I find the right models to use? Is that you guys? and GitLab's of the world Talk about the business model real quick. in the value that's being provided. The developers are the that over half the Fortune and being part of the of the ongoing series

SENTIMENT ANALYSIS :

ENTITIES

EntityCategoryConfidence
Joseph NelsonPERSON

0.99+

JosephPERSON

0.99+

WalmartORGANIZATION

0.99+

AmazonORGANIZATION

0.99+

John FurrierPERSON

0.99+

TeslaORGANIZATION

0.99+

400 millionQUANTITY

0.99+

2014DATE

0.99+

80 objectsQUANTITY

0.99+

AWSORGANIZATION

0.99+

three yearsQUANTITY

0.99+

ten yearsQUANTITY

0.99+

80 classesQUANTITY

0.99+

first questionQUANTITY

0.99+

five yearsQUANTITY

0.99+

MicrosoftORGANIZATION

0.99+

twoQUANTITY

0.99+

RoboflowORGANIZATION

0.99+

WimbledonEVENT

0.99+

todayDATE

0.98+

bothQUANTITY

0.98+

five years agoDATE

0.98+

GitLabORGANIZATION

0.98+

oneQUANTITY

0.98+

North StarORGANIZATION

0.98+

first pointQUANTITY

0.97+

eachQUANTITY

0.97+

over 10,000 pre-trained modelsQUANTITY

0.97+

a decade agoDATE

0.97+

RivianORGANIZATION

0.97+

Mobile World ConferenceEVENT

0.95+

over a hundred thousand developersQUANTITY

0.94+

EC2TITLE

0.94+

this monthDATE

0.93+

season oneQUANTITY

0.93+

30 plus frames per secondQUANTITY

0.93+

twentiesQUANTITY

0.93+

sandcastleORGANIZATION

0.9+

HashiCorpORGANIZATION

0.89+

theCUBEORGANIZATION

0.88+

hundreds of thousandsQUANTITY

0.87+

waveEVENT

0.87+

North StarORGANIZATION

0.86+

400 million image text pairsQUANTITY

0.78+

season threeQUANTITY

0.78+

episode oneQUANTITY

0.76+

AmazonGoORGANIZATION

0.76+

over halfQUANTITY

0.69+

a hundred millionQUANTITY

0.68+

Startup ShowcaseEVENT

0.66+

Fortune 100TITLE

0.66+

COCOTITLE

0.65+

RoboflowPERSON

0.6+

ChatGPTORGANIZATION

0.58+

DatasetTITLE

0.53+

MoorePERSON

0.5+

COCOORGANIZATION

0.39+