Partner im RedaktionsNetzwerk Deutschland
PodcastsTechnologieData Engineering Podcast

Data Engineering Podcast

Tobias Macey
Data Engineering Podcast
Neueste Episode

Verfügbare Folgen

5 von 466
  • From Data Discovery to AI: The Evolution of Semantic Layers
    SummaryIn this episode of the Data Engineering Podcast, host Tobias Macy welcomes back Shinji Kim to discuss the evolving role of semantic layers in the era of AI. As they explore the challenges of managing vast data ecosystems and providing context to data users, they delve into the significance of semantic layers for AI applications. They dive into the nuances of semantic modeling, the impact of AI on data accessibility, and the importance of business logic in semantic models. Shinji shares her insights on how SelectStar is helping teams navigate these complexities, and together they cover the future of semantic modeling as a native construct in data systems. Join them for an in-depth conversation on the evolving landscape of data engineering and its intersection with AI.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Shinji Kim about the role of semantic layers in the era of AIInterviewIntroductionHow did you get involved in the area of data management?Semantic modeling gained a lot of attention ~4-5 years ago in the context of the "modern data stack". What is your motivation for revisiting that topic today?There are several overlapping concepts – "semantic layer," "metrics layer," "headless BI." How do you define these terms, and what are the key distinctions and overlaps?Do you see these concepts converging, or do they serve distinct long-term purposes?Data warehousing and business intelligence have been around for decades now. What new value does semantic modeling beyond practices like star schemas, OLAP cubes, etc.?What benefits does a semantic model provide when integrating your data platform into AI use cases?How is it different between using AI as an interface to your analytical use cases vs. powering customer facing AI applications with your data?Putting in the effort to create and maintain a set of semantic models is non-zero. What role can LLMs play in helping to propose and construct those models?For teams who have already invested in building this capability, what additional context and metadata is necessary to provide guidance to LLMs when working with their models?What's the most effective way to create a semantic layer without turning it into a massive project? There are several technologies available for building and serving these models. What are the selection criteria that you recommend for teams who are starting down this path?What are the most interesting, innovative, or unexpected ways that you have seen semantic models used?What are the most interesting, unexpected, or challenging lessons that you have learned while working with semantic modeling?When is semantic modeling the wrong choice?What do you predict for the future of semantic modeling?Contact InfoLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?Closing AnnouncementsThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.LinksSelectStarSun MicrosystemsMarkov Chain Monte CarloSemantic ModelingSemantic LayerMetrics LayerHeadless BICubePodcast EpisodeAtScaleStar SchemaData VaultOLAP CubeRAG == Retrieval Augmented GenerationAI Engineering Podcast EpisodeKNN == K-Nearest NeighbersHNSW == Hierarchical Navigable Small Worlddbt Metrics LayerSoda DataLookMLHexPowerBITableauSemantic View (Snowflake)Databricks GenieSnowflake Cortex AnalystMalloyThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    --------  
    49:30
  • Balancing Off-the-Shelf and Custom Solutions in Data Engineering
    SummaryIn this episode of the Data Engineering Podcast Tulika Bhatt, a senior software engineer at Netflix, talks about her experiences with large-scale data processing and the future of data engineering technologies. Tulika shares her journey into the data engineering field, discussing her work at BlackRock and Verizon before joining Netflix, and explains the challenges and innovations involved in managing Netflix's impression data for personalization and user experience. She highlights the importance of balancing off-the-shelf solutions with custom-built systems using technologies like Spark, Flink, and Iceberg, and delves into the complexities of ensuring data quality and observability in high-speed environments, including robust alerting strategies and semantic data auditing.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Tulika Bhatt about her experiences working on large scale data processing and her insights on the future trajectory of the supporting technologiesInterviewIntroductionHow did you get involved in the area of data management?Can you start by outlining the ways that operating at large scale change the ways that you need to think about the design of data systems?When dealing with small-scale data systems it can be feasible to have manual processes. What are the elements of large scal data systems that demand autopmation?How can those large-scale automation principles be down-scaled to the systems that the rest of the world are operating?A perennial problem in data engineering is that of data quality. The past 4 years has seen a significant growth in the number of tools and practices available for automating the validation and verification of data. In your experience working with high volume data flows, what are the elements of data validation that are still unsolved?Generative AI has taken the world by storm over the past couple years. How has that changed the ways that you approach your daily work?What do you see as the future realities of working with data across various axes of large scale, real-time, etc.?What are the most interesting, innovative, or unexpected ways that you have seen solutions to large-scale data management designed?What are the most interesting, unexpected, or challenging lessons that you have learned while working on data management across axes of scale?What are the ways that you are thinking about the future trajectory of your work??Contact InfoLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?Closing AnnouncementsThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.LinksBlackRockSparkFlinkKafkaCassandraRocksDBNetflix Maestro workflow orchestratorPagerdutyIcebergThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    --------  
    46:05
  • StarRocks: Bridging Lakehouse and OLAP for High-Performance Analytics
    SummaryIn this episode of the Data Engineering Podcast Sida Shen, product manager at CelerData, talks about StarRocks, a high-performance analytical database. Sida discusses the inception of StarRocks, which was forked from Apache Doris in 2020 and evolved into a high-performance Lakehouse query engine. He explains the architectural design of StarRocks, highlighting its capabilities in handling high concurrency and low latency queries, and its integration with open table formats like Apache Iceberg, Delta Lake, and Apache Hudi. Sida also discusses how StarRocks differentiates itself from other query engines by supporting on-the-fly joins and eliminating the need for denormalization pipelines, and shares insights into its use cases, such as customer-facing analytics and real-time data processing, as well as future directions for the platform.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Sida Shen about StarRocks, a high performance analytical database supporting shared nothing and shared data patternsInterviewIntroductionHow did you get involved in the area of data management?Can you describe what StarRocks is and the story behind it?There are numerous analytical databases on the market. What are the attributes of StarRocks that differentiate it from other options?Can you describe the architecture of StarRocks?What are the "-ilities" that are foundational to the design of the system?How have the design and focus of the project evolved since it was first created?What are the tradeoffs involved in separating the communication layer from the data layers?The tiered architecture enables the shared nothing and shared data behaviors, which allows for the implementation of lakehouse patterns. What are some of the patterns that are possible due to the single interface/dual pattern nature of StarRocks?The shared data implementation has cacheing built in to accelerate interaction with datasets. What are some of the limitations/edge cases that operators and consumers should be aware of?StarRocks supports management of lakehouse tables (Iceberg, Delta, Hudi, etc.), which overlaps with use cases for Trino/Presto/Dremio/etc. What are the cases where StarRocks acts as a replacement for those systems vs. a supplement to them?The other major category of engines that StarRocks overlaps with is OLAP databases (e.g. Clickhouse, Firebolt, etc.). Why might someone use StarRocks in addition to or in place of those techologies?We would be remiss if we ignored the dominating trend of AI and the systems that support it. What is the role of StarRocks in the context of an AI application?What are the most interesting, innovative, or unexpected ways that you have seen StarRocks used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on StarRocks?When is StarRocks the wrong choice?What do you have planned for the future of StarRocks?Contact InfoLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?Closing AnnouncementsThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.LinksStarRocksCelerDataApache DorisSIMD == Single Instruction Multiple DataApache IcebergClickHousePodcast EpisodeDruidFireboltPodcast EpisodeSnowflakeBigQueryTrinoDatabricksDremioData LakehouseDelta LakeApache HiveC++Cost-Based OptimizerIceberg Summit Tencent Games PresentationApache PaimonLancePodcast EpisodeDelta UniformApache ArrowStarRocks Python UDFDebeziumPodcast EpisodeThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    --------  
    59:41
  • Exploring NATS: A Multi-Paradigm Connectivity Layer for Distributed Applications
    SummaryIn this episode of the Data Engineering Podcast Derek Collison, creator of NATS and CEO of Synadia, talks about the evolution and capabilities of NATS as a multi-paradigm connectivity layer for distributed applications. Derek discusses the challenges and solutions in building distributed systems, and highlights the unique features of NATS that differentiate it from other messaging systems. He delves into the architectural decisions behind NATS, including its ability to handle high-speed global microservices, support for edge computing, and integration with Jetstream for data persistence, and explores the role of NATS in modern data management and its use cases in industries like manufacturing and connected vehicles.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Derek Collison about NATS, a multi-paradigm connectivity layer for distributed applications.InterviewIntroductionHow did you get involved in the area of data management?Can you describe what NATS is and the story behind it?How have your experiences in past roles (cloud foundry, TIBCO messaging systems) informed the core principles of NATS?What other sources of inspiration have you drawn on in the design and evolution of NATS? (e.g. Kafka, RabbitMQ, etc.)There are several patterns and abstractions that NATS can support, many of which overlap with other well-regarded technologies. When designing a system or service, what are the heuristics that should be used to determine whether NATS should act as a replacement or addition to those capabilities? (e.g. considerations of scale, speed, ecosystem compatibility, etc.)There is often a divide in the technologies and architecture used between operational/user-facing applications and data systems. How does the unification of multiple messaging patterns in NATS shift the ways that teams think about the relationship between these use cases?How does the shared communication layer of NATS with multiple protocol and pattern adaptaters reduce the need to replicate data and logic across application and data layers?Can you describe how the core NATS system is architected?How have the design and goals of NATS evolved since you first started working on it?In the time since you first began writing NATS (~2012) there have been several evolutionary stages in both application and data implementation patterns. How have those shifts influenced the direction of the NATS project and its ecosystem?For teams who have an existing architecture, what are some of the patterns for adoption of NATS that allow them to augment or migrate their capabilities?What are some of the ecosystem investments that you and your team have made to ease the adoption and integration of NATS?What are the most interesting, innovative, or unexpected ways that you have seen NATS used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on NATS?When is NATS the wrong choice?What do you have planned for the future of NATS?Contact InfoGitHubLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?Closing AnnouncementsThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.LinksNATSNATS JetStreamSynadiaCloud FoundryTIBCOApplied Physics Lab - Johns Hopkins UniversityCray SupercomputerRVCM Certified MessagingTIBCO ZMSIBM MQJMS == Java Message ServiceRabbitMQMongoDBNodeJSRedisAMQP == Advanced Message Queueing ProtocolPub/Sub PatternCircuit Breaker PatternZero MQAkamaiFastlyCDN == Content Delivery NetworkAt Most OnceAt Least OnceExactly OnceAWS KinesisMemcachedSQSSegmentRudderstackPodcast EpisodeDLQ == Dead Letter QueueMQTT == Message Queueing Telemetry TransportNATS Kafka Bridge10BaseT NetworkWeb AssemblyRedPandaPodcast EpisodePulsar FunctionsmTLSAuthZ (Authorization)AuthN (Authentication)NATS Auth CalloutsOPA == Open Policy AgentRAG == Retrieval Augmented GenerationAI Engineering Podcast EpisodeHome AssistantPodcast.__init__ EpisodeTailscaleOllamaCDC == Change Data CapturegRPCThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    --------  
    1:12:50
  • Advanced Lakehouse Management With The LakeKeeper Iceberg REST Catalog
    SummaryIn this episode of the Data Engineering Podcast Viktor Kessler, co-founder of Vakmo, talks about the architectural patterns in the lake house enabled by a fast and feature-rich Iceberg catalog. Viktor shares his journey from data warehouses to developing the open-source project, Lakekeeper, an Apache Iceberg REST catalog written in Rust that facilitates building lake houses with essential components like storage, compute, and catalog management. He discusses the importance of metadata in making data actionable, the evolution of data catalogs, and the challenges and innovations in the space, including integration with OpenFGA for fine-grained access control and managing data across formats and compute engines.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Viktor Kessler about architectural patterns in the lakehouse that are unlocked by a fast and feature-rich Iceberg catalogInterviewIntroductionHow did you get involved in the area of data management?Can you describe what LakeKeeper is and the story behind it? What is the core of the problem that you are addressing?There has been a lot of activity in the catalog space recently. What are the driving forces that have highlighted the need for a better metadata catalog in the data lake/distributed data ecosystem?How would you characterize the feature sets/problem spaces that different entrants are focused on addressing?Iceberg as a table format has gained a lot of attention and adoption across the data ecosystem. The REST catalog format has opened the door for numerous implementations. What are the opportunities for innovation and improving user experience in that space?What is the role of the catalog in managing security and governance? (AuthZ, auditing, etc.)What are the channels for propagating identity and permissions to compute engines? (how do you avoid head-scratching about permission denied situations)Can you describe how LakeKeeper is implemented?How have the design and goals of the project changed since you first started working on it?For someone who has an existing set of Iceberg tables and catalog, what does the migration process look like?What new workflows or capabilities does LakeKeeper enable for data teams using Iceberg tables across one or more compute frameworks?What are the most interesting, innovative, or unexpected ways that you have seen LakeKeeper used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on LakeKeeper?When is LakeKeeper the wrong choice?What do you have planned for the future of LakeKeeper?Contact InfoLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?Closing AnnouncementsThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.LinksLakeKeeperSAPMicrosoft AccessMicrosoft ExcelApache IcebergPodcast EpisodeIceberg REST CatalogPyIcebergSparkTrinoDremioHive MetastoreHadoopNATSPolarsDuckDBPodcast EpisodeDataFusionAtlanPodcast EpisodeOpen MetadataPodcast EpisodeApache AtlasOpenFGAHudiPodcast EpisodeDelta LakePodcast EpisodeLance Table FormatPodcast EpisodeUnity CatalogPolaris CatalogApache GravitinoPodcast Episode KeycloakOpen Policy Agent (OPA)Apache RangerApache NiFiThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    --------  
    57:13

Weitere Technologie Podcasts

Über Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Podcast-Website

Hören Sie Data Engineering Podcast, Darknet Diaries und viele andere Podcasts aus aller Welt mit der radio.de-App

Hol dir die kostenlose radio.de App

  • Sender und Podcasts favorisieren
  • Streamen via Wifi oder Bluetooth
  • Unterstützt Carplay & Android Auto
  • viele weitere App Funktionen

Data Engineering Podcast: Zugehörige Podcasts

Rechtliches
Social
v7.18.2 | © 2007-2025 radio.de GmbH
Generated: 5/22/2025 - 5:41:18 AM