CB Bohn, Principal Data Engineer, Micro Focus | The Convergence of File and Object

>> Announcer: From around the globe, it's theCUBE, presenting The Convergence of File and Object, brought to you by Pure Storage. >> Okay, now we're going to get the customer perspective on object. We'll talk about the convergence of file and object, but really focusing on the object pieces. This is a content program that's being made possible by Pure Storage, and it's co-created with theCUBE. Christopher "CB" Bohn is here. He's the lead architect for the Micro Focus enterprise data warehouse and principal data engineer at Micro Focus. CB, welcome, good to see you. >> Thanks, Dave, good to be here. >> So tell us more about your role at Micro Focus. It's a pan-Micro Focus role, because we know the company is a multinational software firm. It acquired the software assets of HP, of course, including Vertica. Tell us where you fit. >> Yeah, so Micro Focus is, like you said, a worldwide company that sells a lot of software products all over the place, to governments and so forth. And it also grows, often by acquiring other companies. So there is the problem of integrating new companies and their data. And what's happened over the years is that they've had a number of different discrete data systems, so you've got this data spread all over the place, and they've never been able to get full, complete introspection on the entire business because of that. So my role was to come in and design a central data repository and an enterprise data warehouse that all reporting could be generated against. And so that's what we're doing, and we selected Vertica as the EDW system and Pure Storage FlashBlade as the communal repository. >> Okay, so you obviously had experience with Vertica in your previous role, so it's not like you were starting from scratch. But paint a picture of what life was like before you embarked on this sort of consolidated approach to your data warehouse. Was it just disparate data all over the place? A lot of M&A going on; where did the data live? 
>> CB: Right, so again, the data was all over the place, including under people's desks on their own private, dedicated SQL Servers. A lot of data in Micro Focus is run on SQL Server, which has pros and cons, because that's a great transactional database but it's not really good for analytics, in my opinion. But a lot of stuff was running on that. They had one Vertica instance that was doing some select reporting. It wasn't a very powerful system, and it was what they call Vertica Enterprise Mode, where it had dedicated nodes that had the compute and storage in the same locus on each server. Vertica Eon Mode is a whole new world because it separates compute from storage. It was first implemented in AWS, so that you could spin up different numbers of compute nodes and they all share the same communal storage. But there has been demand for that kind of capability in an on-prem situation. Pure Storage was the first vendor to come along and have an S3 emulation that was actually workable. And so Vertica worked with Pure Storage to make that all happen, and that's what we're using. >> Yeah, I know, back when we used to do face-to-face, we would be at Pure Accelerate, and Vertica was always there. We'd stop by the booth and see what they were doing, so there's tight integration there. And you mentioned Eon Mode and the ability to scale storage and compute independently. I think Vertica is the only one, I know they were the first, and I'm not sure anybody else does that both for cloud and on-prem. So how are you using Eon Mode? Are you both in AWS and on-prem, or are you exclusively cloud? Maybe you could describe that a little bit. >> Right, so there are a number of internal rules at Micro Focus, and AWS is not approved for their business processes, at least not all of them. They really wanted to be on-prem, and all the transactional systems are on-prem. 
And so we wanted to have the analytics, the OLAP stuff, close to the OLTP stuff, right? So that's why they're co-located very close to each other. And what's nice about this situation is that it's an S3 object store on the Pure FlashBlade. We could copy those S3 objects over to AWS if we needed to, and we could spin up a version of Vertica there and keep going. It's like a tertiary DR strategy, because we're actually setting up a second FlashBlade Vertica system, geo-located elsewhere, for backup. And we can get into it if you want to talk about how the latest version of the Pure software for the FlashBlade allows synchronization of those FlashBlades across network boundaries, which is really nice, because if a giant sinkhole opens up under our colo facility and we lose that thing, then we just have to switch the DNS and we're back in business off the DR. And then the third option is that we could copy those objects over to AWS and be up and running there. So we're feeling pretty confident about being able to weather whatever comes along. >> Yeah, I'm actually very interested in that conversation, but before we go there: you mentioned you're going to have the OLAP close to the OLTP. Was that for latency reasons, data movement reasons, security, all of the above? >> Yeah, it's really all of the above, because we are operating under the same subnet. So to gain access to that data, you'd have to be within that VPN environment. We didn't want it going out over the public internet. And just for latency reasons also: we have a lot of data, and we're continually doing ETL processes into Vertica from our production transactional databases. >> Right, so they've got to be proximate. So I'm interested: you're using the Pure FlashBlade as an object store, and most people think, oh, object, simple but slow. Not the case for you, is that right? 
>> Not the case at all. >> Why is that? >> This thing is ripping. Well, you have to understand about Vertica and the way it stores data. It stores data in what they call storage containers, and those are immutable on disk, whether it's on AWS or on an Enterprise Mode Vertica. If you do an update or delete, it actually has to go and retrieve that storage container from disk, destroy it and rebuild it, which is why you want to avoid updates and deletes with Vertica, because the way it gets its speed is by sorting and ordering and encoding the data on disk so it can read it really fast. But if you do an operation where you're deleting or updating a record in the middle of that, then you've got to rebuild that entire thing. So that actually matches up really well with S3 object storage, because it's kind of the same way: objects get destroyed and rebuilt too. So that matches up very well with Vertica, and we were able to design the system so that it's append-only. Now, we had some reports that we were running in SQL Server, which were taking seven days. So we moved those to Vertica from SQL Server and we rewrote the queries, which had been written in T-SQL with a bunch of loops and so forth, and, this is amazing, it went from seven days to two seconds to generate this report. That has tremendous value to the company, because it used to have this long cycle of seven days to get a new introspection into what they call the knowledge base, and now all of a sudden it's almost on demand, two seconds to generate it. That's great, and that's because of the way the data is stored. And the S3, you asked about it being slow. Well, not in that context. Because what happens with Vertica Eon Mode is that when you set up your compute nodes, they also have local storage, which is called the depot. It's kind of a cache. 
So the data will be drawn from the FlashBlade and cached locally. And it was thought, when they designed that, that it would cut down on the latency. But it turns out that if you have your compute nodes close, meaning minimal hops, to the FlashBlade, you can actually tell Vertica, don't even bother caching that stuff, just read it directly on the fly from the FlashBlade, and the performance is still really good. It depends on your situation. But I know, for example, a major telecom company that uses the same topology we're talking about here, and they did the same thing. They just dropped the cache, because the FlashBlade was able to deliver the data fast enough. >> So you're talking about speed-of-light issues and just the overhead of switching infrastructure; that's eliminated, and as a result you can go directly to the storage array? >> That's correct, yeah. It's fast enough that it's almost as if it's local to the compute node. But every situation is different, depending on your needs. If you've got a few tables that are heavily used, then yeah, put them in the cache, because that'll probably be a little bit faster. But if you have a lot of ad hoc queries going on, you may exceed the storage of the local cache, and then you're better off having it just read directly from the FlashBlade. >> Got it. >> Okay. >> So it's an append-only approach. So you're not >> Right. >> overwriting a record. But then what, you have to automatically re-index, and that's the intelligence of the system? How does that work? >> Oh, this is where we did a little bit of magic. There's not really anything like magic, but I'll tell you what it is. (Dave laughing) Vertica does not have indexes. They don't exist. Instead, as I told you earlier, it gets its speed by sorting and ordering and encoding the data on disk. 
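As a rough illustration of how sorting and encoding can stand in for an index, here is a toy Python sketch of run-length encoding on a sorted column. Run-length encoding is a common columnar-storage technique and is used here only as an assumed example; this is not Vertica's storage format, and the names `rle_encode` and `count_matches` and the sample data are hypothetical.

```python
from itertools import groupby


def rle_encode(sorted_values):
    """Collapse a sorted column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(sorted_values)]


def count_matches(runs, target):
    """Count rows equal to target by scanning runs instead of individual rows."""
    return sum(length for value, length in runs if value == target)


# Hypothetical column of country codes, one entry per row.
column = ["UK", "IN", "UK", "MX", "IN", "UK", "RO", "UK"]

# Sorting first puts equal values next to each other, so each value
# becomes a single run and the encoded column is much shorter.
runs = rle_encode(sorted(column))
print(runs)                       # [('IN', 2), ('MX', 1), ('RO', 1), ('UK', 4)]
print(count_matches(runs, "UK"))  # 4
```

The sketch also hints at why in-place updates are costly in such a layout: changing one row in the middle of a sorted, encoded run means re-sorting and re-encoding the data around it.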
So when you've got an append-only situation, the natural question is: if I have a unique record with, let's say, ID 123, what happens if I append a new version of that? Well, the way Vertica operates is that there's a thing called a projection, which is actually like a materialized columnar data store. And you can have what they call a Top-K projection, which says, only put in this projection the records that meet a certain condition. So there's a field that we like to call a discriminator field, which is usually the latest update timestamp. So let's say we have record 123, and it had yesterday's date, and that's the latest version. Now a new version comes in. At load time, Vertica looks at that data, and then it looks in the projection and says, does this record exist already? If it doesn't, then it adds it. If it does, then the new one now goes into that projection. And so what you end up having is a projection that is the latest snapshot of the data, which is the reality of what the table is today. But inherent in that is that you now have a table that has all the change history of those records, which is awesome. >> Yeah. >> Because you often want to go back and revisit what happened. >> But that materialized view is the most current, and the system knows that, at least (murmuring). >> Right. So we then create views that draw from that projection, so that our users don't have to worry about any of that. They just select from this view and they're getting the latest, greatest snapshot of what the reality of the data is right now. But if they want to go back and say, well, how did this data look two days ago? That's an easy query for them to do also. So they get the best of both worlds. >> So could you just plug any flash array into your system and achieve the same results, or is there anything really unique about Pure? 
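The append-only pattern CB describes, a full change history plus a "latest version per record" snapshot keyed on a discriminator timestamp, can be sketched in a few lines of Python. This is only an illustration of the logic, not how Vertica implements Top-K projections; the `Row` class, the `latest_snapshot` function, the field names, and the sample dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Row:
    record_id: int
    updated_at: date  # the "discriminator" field: latest update timestamp
    value: str


def latest_snapshot(history, as_of=None):
    """Return the newest version of each record, optionally as of a given date."""
    latest = {}
    for row in history:
        if as_of is not None and row.updated_at > as_of:
            continue  # skip versions that did not exist yet on that date
        current = latest.get(row.record_id)
        if current is None or row.updated_at > current.updated_at:
            latest[row.record_id] = row
    return latest


# Append-only table: new versions are added, nothing is updated in place.
history = [
    Row(123, date(2021, 4, 26), "draft"),
    Row(456, date(2021, 4, 26), "open"),
    Row(123, date(2021, 4, 27), "approved"),  # newer version of record 123
]

today = latest_snapshot(history)
earlier = latest_snapshot(history, as_of=date(2021, 4, 26))
print(today[123].value)    # approved
print(earlier[123].value)  # draft
```

The same history serves both questions: the default call plays the role of the "latest snapshot" view, while passing `as_of` answers "how did this data look on an earlier date" without any row ever having been overwritten.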
>> Yeah, well, they're the only ones that have really dialed in the S3 object format, I think, because I don't think AWS actually publishes every last detail of that S3 spec. So there's a certain amount of reverse engineering they had to do, I think. But they got it right. Maybe a year and a half ago or so, they were at like 99%, but they've since worked with the Vertica people to make sure that the object format was true to what it should be, so that Vertica doesn't care if it's on AWS or on Pure FlashBlade. Pure did a really good job of dialing in that format. Vertica just knows S3; it doesn't know and doesn't care where the data is going, it just works. >> So essentially the vendor R&D abstracted that complexity, so you didn't have to rewrite the application, is that right? >> Right. When Vertica ships its software, you don't get a specific version for Pure or AWS. It's all in one package, and then when you configure it, it knows, oh okay, I'm just pointed at this port on the Pure Storage FlashBlade, and it just works. >> CB, what does your data team look like? How is it evolving? A lot of customers I talk to complain that they struggle to get value out of the data and they don't have the expertise. What does your team look like? Is it changing, or did the pandemic change things at all? I wonder if you could bring us up to date on that. >> Yeah, in some ways Micro Focus has an advantage in that it's such a widely dispersed company across the world. It's headquartered in the UK, but I'm in the Bay Area, and I deal with people in Mexico, Romania, India. >> Okay, enough. >> All over the place, yeah. So when this started, it was actually a bigger project, and it got scaled back. It was almost to the point where it was going to be cut. 
Okay, but then we said, well, let's try to do almost a skunkworks type of thing with reduced staff. And so we're just a handful. You could count the number of key people on this on one hand. But we got it all together, and it's been a dramatic transformation for the company. Now it's won approval and admiration from the highest echelons of this company: hey, this is really providing value. And the company is starting to get views into their business that they didn't have before. >> That's awesome. I mean, I've watched Micro Focus for years. To me, part of their DNA is private equity. I mean, they're sharp investors, they do great M&A. >> CB: Yeah. >> They know how to drive value, and they're doing modern M&A. We've seen what they did with SUSE, and obviously driving value out of Vertica. They've got some really sharp financial people there. So they must have loved the skunkworks: fast ROI, small denominator, big numerator. (laughing) >> Well, I think that in this case smaller is better when you're doing development. It's a too-many-cooks type of thing, and if you've got people who know what they're doing... I've got a lot of experience with Vertica, and I've been on the advisory board for Vertica for a long time. >> Right. >> And I was able to learn from people who had already done it. We're like the second or third company to do a Pure FlashBlade Vertica installation, and some of the best companies that had already done it are members of the advisory board also. So I learned from the best, and we were able to get this thing up and running quickly. And we've got a handful of other key people who know how to write SQL and so forth. >> Yeah, so, I mean, look, Pure is a fit. I sound like a fanboy, but Pure is all about simplicity, and so is object. 
So that means you don't have to worry about wrangling storage and worrying about LUNs and all that other nonsense, and file names, but >> I've been burned by hardware in the past, where they built to a price, and so they cheaped out on stuff like fans or other things, and then these components fail and the whole thing goes down. But this hardware is super good quality. And so I'm happy with the quality that we're getting. >> So CB, last question. What's next for you? Where do you want to take this initiative? >> Well, we are in the process now of... so I designed the system to combine the best of the Kimball approach to data warehousing and the Inmon approach. What we do is bring over all the data we've got and put it into a pristine staging layer. Like I said, because it's append-only, it's essentially a log of all the transactions that are happening in this company, just as they appear. And then, from the Kimball side of things, we're designing the data marts now. That's what the end users actually interact with. So we're examining the transactional systems to say, how are these business objects created? What's the logic there? And we're recreating those logical models in Vertica. We've done a handful of them so far, and it's working out really well. So going forward, we've got a lot of work to do to create just about every object that the company needs. >> CB, you're an awesome guest, it's really always a pleasure talking to you, and >> Thank you. >> congratulations, and good luck going forward. Stay safe. >> Thank you, you too, Dave. >> All right, thank you. And thank you for watching The Convergence of File and Object. This is Dave Vellante for theCUBE. (soft music)

Published Date : Apr 28 2021


