Posted: 2021-12-19 00:15:33 RSS feed digest for 2021-12-19 00:00 (23 items)

Category / Site / Article title or trending keyword / Link URL / Frequent words & summary (search volume) / Date registered
AWS New posts tagged "lambda" - Qiita Setting up the LTE-M CO2 sensor RS-LTECO2 starter kit and sending various alerts https://qiita.com/keni_w/items/33c9c132545aee386894 Since the transmission interval is a configured value, the other readings are what get visualized. 2021-12-18 23:24:25
js New posts tagged "JavaScript" - Qiita Applying WebGL to GIS (DEM edition) https://qiita.com/Kanahiro/items/9c7096c307d3c816582f I found that WebGL can also do some fun things once you have an elevation-value texture. 2021-12-18 23:35:30
Program New questions for [all tags] | teratail Connecting to Oracle running on Linux in VirtualBox https://teratail.com/questions/374481?rss=all For study purposes I created a Linux virtual machine in VirtualBox and set up Oracle on it, but I can't get SQL Developer to connect to it and I'm stuck. 2021-12-18 23:51:09
Program New questions for [all tags] | teratail Laravel: how to implement a dynamic select box https://teratail.com/questions/374480?rss=all 2021-12-18 23:50:35
Program New questions for [all tags] | teratail JavaScript tab menu https://teratail.com/questions/374479?rss=all As the title says, I tried building a tab menu in JavaScript. 2021-12-18 23:27:45
Program New questions for [all tags] | teratail On Windows, using Install/Update Tools in the official Go extension for VSCode to install gopkgs keeps returning "Error: Command failed" https://teratail.com/questions/374478?rss=all What I tried and the current situation: after installing Go and the VSCode Go extension on a Windows PC, I chose View > Command Palette > Install/Update Tools and tried to install gopkgs, but the installation fails every time. 2021-12-18 23:21:37
Docker New posts tagged "docker" - Qiita [Docker Compose] Enabling HTTPS access to a website on a domain managed by Route 53 https://qiita.com/polarbear08/items/a0f35e5a47a3d0c67970 This article was written as an entry in the Docker Advent Calendar. 2021-12-18 23:06:06
Azure New posts tagged "Azure" - Qiita I want more people to know about Azure Static Web Apps https://qiita.com/mikaka360/items/d4b9ac4cd98fba0cdf5e Done. After the resource is created, try clicking the URL shown. 2021-12-18 23:58:33
Tech blog Developers.IO I tried "Mappin Drop", a tool for easily creating and designing maps https://dev.classmethod.jp/articles/i-tried-making-a-map-with-mappin-drop/ mappind 2021-12-18 14:54:16
Overseas TECH MakeUseOf XP-Pen Artist Pro 16TP Review: The Best Affordable Multi-Touch 4K Drawing Tablet? https://www.makeuseof.com/xppen-artist-pro-16tp-review-4k-drawing-tablet/ drawing 2021-12-18 14:55:22
Overseas TECH MakeUseOf How Does the 360 Degree Camera in a Car Work? https://www.makeuseof.com/how-does-360-degree-camera-in-car-work/ camera 2021-12-18 14:30:40
Overseas TECH MakeUseOf Management Skills for Resumes: How to Show You Are a Good Team Leader https://www.makeuseof.com/management-skills-for-resume/ Every manager needs a set of skills that support production, motivation, and inspiration. Here are some essential management skills for your resume. 2021-12-18 14:15:12
Overseas TECH DEV Community How to fix "Property 'ethereum' does not exist on type 'Window & typeof globalThis'.ts(2339)" https://dev.to/nmurgor/how-to-fix-property-ethereum-does-not-exist-on-type-window-typeof-globalthists2339-42dh Introduction: When working with the MetaMask APIs you might encounter the warning "Property 'ethereum' does not exist on type 'Window & typeof globalThis'. ts(2339)" when accessing window.ethereum in a TypeScript project. MetaMask injects a global API into websites visited by its users at window.ethereum. To fix this warning and the corresponding compile error, add this declaration at the top level of your file, after the imports: declare var window: any. Note that this is not the best fix, as it only silences the warning by telling the TypeScript compiler to treat window as type any and hence ignore any warnings. It also sacrifices the IntelliSense and auto-completion provided by the code editor on the window object. 2021-12-18 14:50:31
Overseas TECH DEV Community How to send e-mail using Python https://dev.to/raghavmri/how-to-send-e-mail-using-python-kmk Have you ever wondered how companies like Amazon, Walmart, and eBay send emails to your inbox immediately after your purchase? Do you think it's an email sent by a human? Most of you will have guessed that it's an automated e-mail, but how do they do that? To answer that question, we are going to try sending an email using simple Python code. Feel free to customize the code according to your needs. A Simple Program. Points to note: do let me know in the comments where I can improve my writing, as I am still learning new things too. Thanks for reading! 2021-12-18 14:31:54
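The RSS excerpt above drops the article's actual code listing, so here is a minimal sketch of the kind of program it describes, using Python's standard smtplib; the SMTP host, port, and credentials are placeholder assumptions, not values from the article.

```python
import smtplib
from email.message import EmailMessage

# Placeholder values -- substitute your own SMTP host, port, and credentials.
SMTP_HOST = "smtp.example.com"
SMTP_PORT = 587
SENDER = "orders@example.com"
PASSWORD = "app-password-here"

def send_order_confirmation(recipient: str, order_id: str) -> None:
    """Send a plain-text confirmation e-mail, similar to the automated
    messages described in the article."""
    msg = EmailMessage()
    msg["Subject"] = f"Your order {order_id} has been received"
    msg["From"] = SENDER
    msg["To"] = recipient
    msg.set_content("Thanks for your purchase! We'll let you know when it ships.")

    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
        server.starttls()          # upgrade the connection to TLS
        server.login(SENDER, PASSWORD)
        server.send_message(msg)

if __name__ == "__main__":
    send_order_confirmation("customer@example.com", "A-1001")
```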
Overseas TECH DEV Community Ruby 3 Mindset, what's new? https://dev.to/shayanholakouee/ruby-3-mindset-whats-new--1kh9 The number 3 is very significant in the Ruby 3 release, be it the release version number, making performance 3x faster, or the trio of core contributors (Matz, TenderLove, Koichi). Similarly, there were three major goals for Ruby 3: being faster, having better concurrency, and ensuring correctness. Ruby 3's main selling point is that it's 3 times faster than Ruby 2. Matz and the team have been trying to bring a lot of features into the latest version of Ruby, but their foremost priority has been to ensure that performance improves drastically and that backward compatibility is maintained. There is a whole bunch of new additions, like type checking, Ractor, and the scheduler, along with improvements to existing areas such as performance, fibers, memory, concurrency, and static analysis. We'll take a deeper look at each of them one by one. One of the most promising updates for Ruby 3 CPU performance is the new Ruby JIT. Another new feature that might significantly improve the language's performance is the Medium Internal Representation JIT compiler, which is lightweight and convenient to work with. We hope that this update will actually be included in Ruby, since it might solve Ruby's issues and result in significant performance improvement beyond the current JIT. The point of a new major version number is that it breaks things from previous releases. One final note: in order to manage projects developed on different versions of Ruby, you'll want a version manager that lets you easily switch back and forth. I've used rbenv; I'm also aware of chruby and RVM, which are frequently recommended and accomplish the same thing. 2021-12-18 14:17:09
Overseas TECH DEV Community Storage Best Practices for Data and Analytics Applications | AWS White Paper Summary https://dev.to/awsmenacommunity/storage-best-practices-for-data-and-analytics-applications-aws-white-paper-summary-3cmd

Introduction
Amazon Simple Storage Service (Amazon S3) and Amazon Simple Storage Service Glacier (Amazon S3 Glacier) provide ideal storage solutions for data lakes. Data lakes powered by Amazon S3 provide you with the unmatched availability, agility, and flexibility required to combine different types of data and analytics approaches to gain deeper insights, in ways that traditional data silos and data warehouses cannot. In addition, data lakes built on Amazon S3 integrate with other analytical services for ingestion, inventory, transformation, and security of your data in the data lake. This guide explains each of these options and provides best practices for building, securing, managing, and scaling a data lake built on Amazon S3.

Because organizations are collecting and analyzing increasing amounts of data, traditional on-premises solutions for data storage, data management, and analytics can no longer keep pace. Data silos that aren't built to work well together make it difficult to consolidate storage for more comprehensive and efficient analytics. This, in turn, limits an organization's agility and its ability to derive more insights and value from its data, and it reduces the capability to seamlessly adopt more sophisticated analytics tools and processes because it necessitates upskilling of the workforce.

A data lake is an architectural approach that allows you to store all your data in a centralized repository so that it can be categorized, catalogued, secured, and analyzed by a diverse set of users and tools. In a data lake you can ingest and store structured, semi-structured, and unstructured data, and transform these raw data assets as needed. Using a cloud-based data lake you can easily decouple compute from storage and scale each component independently, which is a huge advantage over an on-premises or Hadoop-based data lake. You can use a complete portfolio of data exploration, analytics, machine learning, reporting, and visualization tools on the data. A data lake makes data and the optimal analytics tools available to more users across more lines of business, enabling them to get all of the business insights they need, whenever they need them.

More organizations are building data lakes for various use cases. To guide customers in their journey, Amazon Web Services (AWS) has developed a data lake architecture that allows you to build scalable, secure data lake solutions cost-effectively using Amazon S3 and other AWS services. Using the capabilities of a data lake built on Amazon S3 you can: ingest and store data from a wide variety of sources into a centralized platform; build a comprehensive data catalog to find and use data assets stored in the data lake; secure, protect, and manage all of the data stored in the data lake; use tools and policies to monitor, analyze, and optimize infrastructure and data; transform raw data assets in place into optimized, usable formats; query data assets in place; integrate the unstructured data assets from Amazon S3 with structured data assets in a data warehouse solution to gather valuable business insights; store data assets in separate buckets as the data goes through the extraction, transformation, and load process; use a broad and deep portfolio of data analytics, data science, machine learning, and visualization tools; quickly integrate current and future third-party data processing tools; securely share processed datasets and results; and scale virtually to unlimited capacity. The remainder of this paper provides more information about each of these capabilities. The following figure illustrates a sample AWS data lake platform. (Figure: High-level AWS data lake technical reference architecture.)

Central storage: Amazon S3 as the data lake storage platform
A data lake built on AWS uses Amazon S3 as its primary storage platform. Amazon S3 provides an optimal foundation for a data lake because of its virtually unlimited scalability and high durability. You can seamlessly and non-disruptively increase storage from gigabytes to petabytes of content, paying only for what you use. Amazon S3 is designed to provide 99.999999999% durability. It has scalable performance, ease-of-use features, and native encryption and access control capabilities. Amazon S3 integrates with a broad portfolio of AWS and third-party ISV tools for data ingestion, data processing, and data security. Key data lake-enabling features of Amazon S3 include the following.

Decoupling of storage from compute and data processing: In traditional Hadoop and data warehouse solutions, storage and compute are tightly coupled, making it difficult to optimize costs and data processing workflows. With Amazon S3, you can cost-effectively store all data types in their native formats. You can then launch as many or as few virtual servers as you need using Amazon Elastic Compute Cloud (Amazon EC2) to run analytical tools, and use services in the AWS analytics portfolio such as Amazon Athena, AWS Lambda, Amazon EMR, and Amazon QuickSight to process your data.

Centralized data architecture: Amazon S3 makes it easy to build a multi-tenant environment where multiple users can run different analytical tools against the same copy of the data. This improves both cost and data governance over traditional solutions, which require multiple copies of data to be distributed across multiple processing platforms.

S3 Cross-Region Replication: You can use Cross-Region Replication to copy your objects across S3 buckets within the same account or even to a different account. Cross-Region Replication is particularly useful in meeting compliance requirements, minimizing latency by storing objects closer to the user location, and improving operational efficiency.

Integration with clusterless and serverless AWS services: You can use Amazon S3 with Athena, Amazon Redshift Spectrum, and AWS Glue to query and process data. Amazon S3 also integrates with AWS Lambda serverless computing to run code without provisioning or managing servers. You can process event notifications from S3 through AWS Lambda, such as when an object is created or deleted from a bucket. With all of these capabilities, you only pay for the actual amounts of data you process or for the compute time that you consume. For machine learning use cases, you need to store the model training data and the model artifacts generated during model training; Amazon SageMaker integrates seamlessly with Amazon S3, so you can store the training data and model artifacts in a single S3 bucket or in different buckets.

Standardized APIs: Amazon S3 RESTful application programming interfaces (APIs) are simple, easy to use, and supported by most major third-party independent software vendors (ISVs), including leading Apache Hadoop and analytics tool vendors. This allows customers to bring the tools they are most comfortable with and knowledgeable about to help them perform analytics on data in Amazon S3.
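Because the summary leans on S3's standardized REST APIs, a short sketch of how a producer might land a raw object in a data lake bucket and read it back may help; the bucket and key names below are placeholders, not anything from the paper.

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical names -- replace with your own data lake bucket and prefix.
RAW_BUCKET = "example-datalake-raw"
KEY = "sales/2021/12/18/orders.json"

# Write a raw JSON asset into the raw layer of the data lake.
record = {"order_id": "A-1001", "amount": 42.50, "currency": "USD"}
s3.put_object(
    Bucket=RAW_BUCKET,
    Key=KEY,
    Body=json.dumps(record).encode("utf-8"),
    ServerSideEncryption="AES256",   # encrypt at rest with SSE-S3
)

# Any analytics tool that speaks the S3 API can read the same object back.
response = s3.get_object(Bucket=RAW_BUCKET, Key=KEY)
print(json.loads(response["Body"].read()))
```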
Secure by default: Amazon S3 is secure by default. Amazon S3 supports user authentication to control access to data. It provides access control mechanisms such as bucket policies and access control lists to grant fine-grained access to data stored in S3 buckets to specific users and groups of users. You can also manage access to shared data within Amazon S3 using S3 Access Points (more details about S3 Access Points are included in the Securing, protecting, and managing data section). You can also securely access data stored in S3 through SSL endpoints using the HTTPS protocol. An additional layer of security can be implemented by encrypting data in transit and data at rest using server-side encryption (SSE).

Amazon S3 for storage of raw and iterative data sets: When working with a data lake, the data undergoes various transformations. With extract, transform, load (ETL) processes and analytical operations, various versions of the same data sets are created or required for advanced processing. You can create different layers for storing the data based on the stage of the pipeline, such as raw, transformed, curated, and logs. Within these layers, you can also create additional tiers based on the sensitivity of the data.

Storage classes for cost savings, durability, and availability: Amazon S3 provides a range of storage classes for various use cases: S3 Standard (general-purpose storage for frequently accessed data); S3 Standard-Infrequent Access (S3 Standard-IA) and S3 One Zone-Infrequent Access (S3 One Zone-IA) for infrequently accessed, long-lived data; and S3 Glacier and S3 Glacier Deep Archive for long-term archival of data. Using an S3 Lifecycle policy, you can move data across different storage classes for compliance and cost optimization.

Scalability and support for structured, semi-structured, and unstructured data: Amazon S3 is a petabyte-scale object store which provides virtually unlimited scalability to store any type of data. You can store structured data (such as relational data), semi-structured data (such as JSON, XML, and CSV files), and unstructured data (such as images or media files). This makes Amazon S3 the appropriate storage solution for your cloud data lake.

Data ingestion methods
A core capability of a data lake architecture is the ability to quickly and easily ingest multiple types of data: real-time streaming data and bulk data assets from on-premises storage platforms; structured data generated and processed by legacy on-premises platforms, mainframes, and data warehouses; and unstructured and semi-structured data (images, text files, audio and video, and graphs). AWS provides services and capabilities to ingest different types of data into your data lake built on Amazon S3, depending on your use case. This section provides an overview of various ingestion services.

Amazon Kinesis Data Firehose
Amazon Kinesis Data Firehose is part of the Kinesis family of services that makes it easy to collect, process, and analyze real-time streaming data at any scale. Kinesis Data Firehose is a fully managed service for delivering real-time streaming data directly to data lakes (Amazon S3), data stores, and analytical services for further processing. Kinesis Data Firehose automatically scales to match the volume and throughput of streaming data and requires no ongoing administration. It can also be configured to transform streaming data before it's stored in a data lake built on Amazon S3; its transformation capabilities include compression, encryption, data batching, and Lambda functions. Kinesis Data Firehose integrates with Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka to deliver streaming data into destinations such as Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and third-party solutions such as Splunk.

Kinesis Data Firehose can convert your input JSON data to Apache Parquet and Apache ORC before storing the data in your data lake built on Amazon S3. Parquet and ORC, being columnar data formats, help save space and allow faster queries on the stored data compared to row-based formats such as JSON. Kinesis Data Firehose can compress data before it's stored in Amazon S3; it currently supports GZIP, ZIP, and SNAPPY compression formats, and GZIP is the preferred format because it can be used by Amazon Athena, Amazon EMR, and Amazon Redshift. Kinesis Data Firehose also allows you to invoke Lambda functions to perform transformations on the input data. Using Lambda blueprints, you can transform input comma-separated values (CSV) or structured text, such as Apache Log and Syslog formats, into JSON first. You can optionally store the source data in another S3 bucket. Kinesis Data Firehose also provides the ability to group and partition the target files using custom prefixes, such as dates, for S3 objects; this facilitates faster querying through partitioning and enables incremental processing with the same feature. (Figure: Delivering real-time streaming data with Kinesis Data Firehose to different destinations, with optional backup.)

Kinesis Data Firehose also natively integrates with Amazon Kinesis Data Analytics, which provides an efficient way to analyze and transform streaming data using Apache Flink and SQL applications. Apache Flink is an open-source framework and engine for processing streaming data using Java and Scala. Using Kinesis Data Analytics, you can develop applications to perform time-series analytics, feed real-time dashboards, and create real-time metrics. You can also use Kinesis Data Analytics to transform an incoming stream and create a new data stream that can be written back into Kinesis Data Firehose before it is delivered to a destination.

Finally, Kinesis Data Firehose encryption supports Amazon S3 server-side encryption with AWS Key Management Service (AWS KMS) for encrypting delivered data in your data lake built on Amazon S3. You can choose not to encrypt the data, or to encrypt it with a key from the list of AWS KMS keys that you own (refer to the Data encryption with Amazon S3 and AWS KMS section of this document). Kinesis Data Firehose can concatenate multiple incoming records and then deliver them to Amazon S3 as a single S3 object. This is an important capability because it reduces Amazon S3 transaction costs and transactions per second. You can grant your application access to send data to Kinesis Data Firehose using AWS Identity and Access Management (IAM); using IAM, you can also grant Kinesis Data Firehose access to S3 buckets, an Amazon Redshift cluster, or an Amazon OpenSearch Service cluster. You can also use Kinesis Data Firehose with virtual private cloud (VPC) endpoints (AWS PrivateLink). AWS PrivateLink is an AWS technology that enables private communication between AWS services using an elastic network interface with private IPs in your Amazon VPC.
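As a companion to the Firehose description above, here is a minimal sketch of a producer pushing records into a delivery stream that is assumed to be already configured with an S3 destination; the stream name is a placeholder.

```python
import json
import time
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream, assumed to be configured with an S3 destination.
STREAM_NAME = "example-clickstream-to-s3"

def send_click_event(user_id: str, page: str) -> None:
    """Send one JSON record; Firehose batches records and writes them to S3."""
    event = {"user_id": user_id, "page": page, "ts": int(time.time())}
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

if __name__ == "__main__":
    send_click_event("u-123", "/products/42")
```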
AWS Snow Family
AWS Snow Family, comprised of AWS Snowcone, AWS Snowball, and AWS Snowmobile, offers hardware devices of varying capacities for moving data from on-premises locations to AWS. The devices also offer cloud computing capabilities at the edge for applications that need to perform computations closer to the source of the data. Using Snowcone, you can transfer data generated continuously from sensors, IoT devices, and machines to the AWS Cloud; Snowcone features 8 TB of storage. Snowball and Snowmobile are used to transfer massive amounts of data, up to 100 PB. Snowball moves terabytes of data into your data lake built on Amazon S3. You can use it to transfer databases, backups, archives, healthcare records, analytics datasets, historic logs, IoT sensor data, and media content, especially in situations where network conditions hinder transfer of large amounts of data both into and out of AWS.

AWS Snow Family uses physical storage devices to transfer large amounts of data between your on-premises data centers and your data lake built on Amazon S3. You can use Storage Optimized Snowball devices to securely and efficiently migrate bulk data from on-premises storage platforms and Hadoop clusters. Snowball supports encryption and uses AES 256-bit encryption; encryption keys are never shipped with the Snowball device, so the data transfer process is highly secure. Data is transferred from the Snowball device to your data lake built on Amazon S3 and stored as S3 objects in their original or native format. Snowball also has a Hadoop Distributed File System (HDFS) client, so data may be migrated directly from Hadoop clusters into an S3 bucket in its native format. Snowball devices can be particularly useful for migrating terabytes of data from data centers and locations with intermittent internet access.

AWS Glue
AWS Glue is a fully managed, serverless ETL service that makes it easier to categorize, clean, transform, and reliably transfer data between different data stores in a simple and cost-effective way. The core components of AWS Glue consist of a central metadata repository known as the AWS Glue Data Catalog, which is a drop-in replacement for an Apache Hive metastore (refer to the Catalog and search section of this document for more information), and an ETL job system that automatically generates Python and Scala code and manages ETL jobs. The following figure depicts the high-level architecture of an AWS Glue environment. (Figure: Architecture of an AWS Glue environment.)

To ETL data from source to target, you create a job in AWS Glue, which involves the following steps. Before you can run an ETL job, define a crawler and point it at the data source to identify the table definition and the metadata required to run the ETL job; the metadata and table definitions are stored in the Data Catalog. The data source can be an AWS service, such as Amazon RDS, Amazon S3, Amazon DynamoDB, or Kinesis Data Streams, as well as a third-party JDBC-accessible database. Similarly, a data target can be an AWS service, such as Amazon S3, Amazon RDS, or Amazon DocumentDB (with MongoDB compatibility), as well as a third-party JDBC-accessible database. Either provide a script to perform the ETL job, or let AWS Glue generate the script automatically. Run the job on demand, or use the scheduler component, which can initiate the job in response to an event or on a schedule at a defined time. When the job runs, the script extracts the data from the source, transforms it, and finally loads it into the data target.
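To make the crawler-then-job flow above concrete, the following sketch registers a crawler over a raw S3 prefix and then starts an already defined ETL job with boto3; all names and the IAM role ARN are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical resources -- create the role and job ahead of time.
CRAWLER_NAME = "raw-sales-crawler"
DATABASE_NAME = "datalake_raw"
GLUE_ROLE_ARN = "arn:aws:iam::123456789012:role/ExampleGlueServiceRole"
JOB_NAME = "raw-to-parquet-sales"

# 1. Point a crawler at the raw S3 prefix so table definitions land in the Data Catalog.
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=GLUE_ROLE_ARN,
    DatabaseName=DATABASE_NAME,
    Targets={"S3Targets": [{"Path": "s3://example-datalake-raw/sales/"}]},
)
glue.start_crawler(Name=CRAWLER_NAME)

# 2. Once the catalog tables exist, run the ETL job that writes Parquet
#    into the transformed bucket (the job script itself is defined in Glue).
run = glue.start_job_run(JobName=JOB_NAME)
print("Started job run:", run["JobRunId"])
```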
AWS DataSync
AWS DataSync is an online data transfer service that helps in moving data between on-premises storage systems and AWS storage services, as well as between different AWS storage services. You can automate data movement from on-premises Network File System (NFS) shares, Server Message Block (SMB) shares, or a self-managed object store to your data lake built on Amazon S3. DataSync allows data encryption and data integrity validation to ensure safe and secure transfer of data. DataSync also supports an HDFS connector to read directly from on-premises Hadoop clusters and replicate your data to your data lake built on Amazon S3.

AWS Transfer Family
AWS Transfer Family is a fully managed and secure transfer service that helps you move files into and out of AWS storage services, for example your data lake built on Amazon S3 storage and Amazon Elastic File System (Amazon EFS) Network File System (NFS) file systems. AWS Transfer Family supports Secure Shell (SSH) File Transfer Protocol (SFTP), FTP Secure (FTPS), and FTP. You can use AWS Transfer Family to ingest data into your data lakes built on Amazon S3 from third parties such as vendors and partners, to perform internal transfers within the organization, and to distribute subscription-based data to customers.

Storage Gateway
Storage Gateway can be used to integrate legacy on-premises data processing platforms with a data lake built on Amazon S3. The File Gateway configuration of Storage Gateway offers on-premises devices and applications a network file share through an NFS connection. Files written to this mount point are converted to objects stored in Amazon S3 in their original format, without any proprietary modification. This means that you can integrate applications and platforms that don't have native Amazon S3 capabilities (such as on-premises lab equipment, mainframe computers, databases, and data warehouses) with S3 buckets, and then use analytical tools such as Amazon EMR or Amazon Athena to process this data.

Apache Hadoop distributed copy command
Amazon S3 natively supports distributed copy (DistCp), which is a standard Apache Hadoop data transfer mechanism. This allows you to run DistCp jobs to transfer data from an on-premises Hadoop cluster to an S3 bucket. The command to transfer data is similar to the following: hadoop distcp hdfs://source-folder s3a://destination-bucket

AWS Direct Connect
AWS Direct Connect establishes a dedicated network connection between your on-premises internal network and AWS. AWS Direct Connect links the internal network to an AWS Direct Connect location over a standard Ethernet fiber-optic cable. Using the dedicated connection, you can create a virtual interface directly to Amazon S3, which can be used to securely transfer data from on-premises into a data lake built on Amazon S3 for analysis.

AWS Database Migration Service
AWS Database Migration Service (AWS DMS) facilitates the movement of data from various data stores, such as relational databases, NoSQL databases, data warehouses, and other data stores, into AWS. AWS DMS allows one-time migration and ongoing replication (change data capture) to keep the source and target data stores in sync. Using AWS DMS, you can use Amazon S3 as a target for the supported database sources. An AWS DMS task for Amazon S3 writes both full-load migration and change data capture (CDC) data in comma-separated value (CSV) format by default. You can also write the data in Apache Parquet format (.parquet) for more compact storage and faster query options. Both CSV and Parquet formats are favorable for in-place querying using services such as Amazon Athena and Amazon Redshift Spectrum (refer to the In-place querying section of this document for more information); as mentioned earlier, Parquet format is recommended for analytical querying. It is useful to use AWS DMS to migrate databases from on-premises, or across different AWS accounts, to your data lake built on Amazon S3, either during initial data transfer or on a regular basis.

Data lake foundation
Amazon S3 provides the foundation for building a data lake, along with integration to other services that can be tailored to your business needs. A common challenge faced by users when building a data lake is the categorization of data and maintaining data across the different stages it goes through in the transformation process. The following figure depicts a sample data lake and the transformation journey data goes through in its lifecycle. (Figure: Sample data lake transformation journey.)

This section provides a recommended bucket strategy for building a data lake foundation. A data lake can be broadly categorized across four distinct buckets. Raw data: data ingested from the data sources in its raw format, which is the immutable copy of the data; this can include structured, semi-structured, and unstructured data objects such as databases, backups, archives, JSON, CSV, XML, text files, or images. Transformed: this bucket consists of transformed data, normalized to a specific use case for performance improvement and cost reduction; in this stage, data can be transformed into columnar data formats such as Apache Parquet and Apache ORC, which can be used by Amazon Athena. Curated: the transformed data can be further enriched by blending it with other data sets to provide additional insights; this layer typically contains S3 objects which are optimized for analytics, for reporting using Amazon Athena and Amazon Redshift Spectrum, and for loading into massively parallel processing data warehouses such as Amazon Redshift. Logs: this bucket stores process logs for Amazon S3 and other services in the data lake architecture; the logs can include S3 access logs, CloudWatch logs, or CloudTrail logs.

The following table shows a recommended sample folder structure for your data lake per environment. Each environment can have the same folder structure, with tags to segregate across environments. (Table: Recommended sample folder structure for your data lake per environment.)

It is recommended that you follow best practices when defining the bucket strategy for your data lake built on Amazon S3. Bucket names must be unique within a partition (currently there are three partitions: aws [Standard Regions], aws-cn [China Regions], and aws-us-gov [AWS GovCloud (US) Regions]); however, names can be reused after the buckets are deleted, with several exceptions. Deleted bucket names are not available to reuse immediately; hence, if users want to keep a particular bucket name, they should not delete the bucket. All bucket names across all AWS Regions should comply with DNS naming conventions. Buckets can store an unlimited number of objects without impacting performance. The objects can be stored in a single bucket or across multiple buckets; however, you cannot create a bucket from within another bucket. Production S3 buckets should be hosted under a different AWS account, separate from non-production workloads. Build an automatic ingestion mechanism to catalog and create the multiple layers of data storage, including raw, transformed, and curated. Consider building automatic data classification rules based on schema and data. Consider additional folders within the data lake, such as reports, downstream applications, or user folders. Enable versioning if needed for protection from accidental deletes. Use separate buckets for S3 data which needs to be replicated. (A sketch of setting up these layer buckets follows.)
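As a small illustration of the bucket-per-layer recommendation above, this sketch creates hypothetical raw, transformed, curated, and logs buckets with versioning enabled and public access blocked; the names and Region are placeholders, not recommendations from the paper.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Hypothetical layer buckets for one environment (names must be globally unique).
LAYERS = ["raw", "transformed", "curated", "logs"]

for layer in LAYERS:
    bucket = f"example-datalake-{layer}-dev"
    s3.create_bucket(Bucket=bucket)  # us-east-1 needs no LocationConstraint
    # Versioning protects each layer against accidental deletes and overwrites.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    # Block all public access on every data lake bucket by default.
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
```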
Catalog and search
The earliest challenges that inhibited building a data lake were keeping track of all of the raw assets as they were loaded into the data lake, and then tracking all of the new data assets and versions created by data transformation, data processing, and analytics. Thus, an essential component of a data lake built on Amazon S3 is the Data Catalog. The Data Catalog provides an interface to query all assets stored in data lake S3 buckets and is designed to provide a single source of truth about the contents of the data lake.

AWS Glue
AWS Glue is a fully managed ETL service that makes it easier to categorize, clean, transform, and reliably transfer data between different data stores. The AWS Glue Data Catalog, a component of AWS Glue, provides a unified metadata repository for performing analytical operations across various data sources, such as Amazon EMR, Amazon Athena, Amazon Redshift, and Amazon Redshift Spectrum, and any application that is compatible with a Hive metastore. To create a Data Catalog, use AWS Glue crawlers that crawl the data sources registered with AWS Glue in the AWS Management Console. The data source can be an S3 bucket; a table in Amazon RDS, Amazon DynamoDB, or Amazon Redshift; or any external database that supports JDBC connectivity. You can define custom classifiers or use the built-in classifiers provided by AWS Glue to classify the data. AWS Glue provides classifiers for common file types such as CSV, JSON, AVRO, and XML, as well as classifiers for common relational database management systems using a JDBC connection. You can create a custom classifier using a grok pattern, an XML tag, JavaScript Object Notation (JSON), or comma-separated values (CSV). The following figure depicts how AWS Glue builds a Data Catalog. (Figure: AWS Glue Data Catalog.)

The Data Catalog is a database that stores metadata in tables consisting of data schema, data location, and runtime metrics. The Data Catalog is also Apache Hive metastore compatible and can be used as a central repository for storing structural and operational metadata. AWS Glue also provides out-of-the-box integration with Amazon EMR that allows you to use the Data Catalog as an external Hive metastore. The Data Catalog is recommended especially when you need a persistent metastore or a metastore shared between different applications, services, clusters, or AWS accounts. The Data Catalog can also be used to create an external table for Athena or Amazon Redshift.

AWS Lake Formation
AWS Lake Formation helps you easily build, secure, and manage data lakes. Lake Formation provides centralized governance and access control for the data in a data lake built on S3, and controls access to the data through various services such as AWS Glue, Athena, Amazon Redshift Spectrum, Amazon QuickSight, and Amazon EMR. AWS Lake Formation can connect to an S3 bucket and orchestrate a dataflow that can ingest, clean, transform, and organize the raw data. Lake Formation uses the AWS Glue Data Catalog to automatically classify data in data lakes, data sources, transforms, and targets. Apart from the metadata, the Data Catalog also stores information consisting of resource links to shared databases and tables in external accounts, for cross-account access to the data in a data lake built on S3. Lake Formation provides a grant/revoke permission model to control access to Data Catalog resources (consisting of databases and metadata tables), S3 buckets, and the underlying data in these buckets. Lake Formation permissions, along with IAM policies, provide granular access to the data stored in data lakes built on S3. These permissions can be used to share Data Catalog resources with external AWS accounts, and users from these accounts can run jobs and queries combining data from multiple data catalogs across multiple accounts.

Comprehensive data catalog
You have the flexibility to create a comprehensive data catalog using standard AWS services such as AWS Lambda, DynamoDB, and Amazon OpenSearch Service. At a high level, AWS Lambda triggers are used to populate DynamoDB tables with object names and metadata when those objects are put into S3, and Amazon OpenSearch Service is used to search for specific assets, related metadata, and data classifications. The following figure shows a high-level architectural overview of this solution. (Figure: Comprehensive data catalog using AWS Lambda, DynamoDB, and Amazon OpenSearch Service.)
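The Lambda-plus-DynamoDB pattern just described can be sketched roughly as follows: a function subscribed to S3 ObjectCreated events writes each object's name and basic metadata into a DynamoDB table. The table name and attribute layout are illustrative assumptions, not part of the whitepaper.

```python
import urllib.parse
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical catalog table with partition key "bucket" and sort key "key".
catalog_table = dynamodb.Table("example-datalake-catalog")

def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated events; records object metadata in DynamoDB."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        catalog_table.put_item(
            Item={
                "bucket": bucket,
                "key": key,
                "size_bytes": size,
                "event_time": record["eventTime"],
            }
        )
    return {"indexed": len(records)}
```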
Transforming data assets
A core value of a data lake is that it is the collection point and repository for all of an organization's data assets, regardless of their native formats. This allows quick ingestion, elimination of data duplication and data sprawl, and centralized governance and management. After the data assets are collected, they need to be transformed into normalized formats to be used by a variety of data analytics and processing tools. The key to democratizing the data and making the data lake available to the widest number of users of varying skill sets and responsibilities is to transform data assets into a format that allows for efficient one-time SQL querying.

Data transformation, or the ETL process, prepares the data to be consumed by downstream systems for advanced analytics, visualizations, business reporting, or machine learning. The first step involves extracting information from the different data sources, which was discussed in the Data ingestion methods section; the data is stored in a raw bucket (refer to the Data lake foundation section of this document). After you have the data, you need to transform it. Transformations might involve aggregating data from different sources or changing the file format of the data received, such as from CSV to Parquet, to reduce the file size and optimize Athena queries. You can compress the files or convert the data format, for example from string to double or from number to date. After the transformation is complete, the data can be written into the transformed bucket (refer to the Data lake foundation section of this document), which is then used further by downstream systems.

There are multiple ways to transform data assets, and the "best" way often comes down to the nature of the analytics application, individual preference, skill sets, and the tools available. When a data lake is built on AWS, there is a wide variety of tools and services available for data transformation, so you can pick the methods and tools that best suit your purpose. Because the data lake is inherently multi-tenant, multiple data transformation jobs using different tools can run concurrently.

There are two common and straightforward methods to transform data assets into Parquet in a data lake built on S3 using Amazon EMR clusters. The first method involves creating an Amazon EMR cluster with Hive installed, using the raw data assets in Amazon S3 as input, transforming those data assets into Hive tables, and then writing those Hive tables back out to Amazon S3 in Parquet format. The second method is to use Spark on Amazon EMR; with this method, a typical transformation can be achieved with only a few lines of PySpark code.

A third data transformation method for a data lake built on S3 is to use AWS Glue. AWS Glue simplifies and automates difficult and time-consuming data discovery, conversion, mapping, and job scheduling tasks. AWS Glue guides you through the process of transforming and moving your data assets with an easy-to-use console that helps you understand your data sources, transform and prepare these data assets for analytics, and load them reliably from S3 data sources back into S3 destinations. AWS Glue automatically crawls raw data assets in your data lake's S3 buckets, identifies data formats, and then suggests schemas and transformations so that you don't have to spend time hand-coding data flows. You can then edit these transformations, if necessary, using the tools and technologies you already know, such as Python, Spark, Git, and your favorite integrated developer environment, and then share them with other AWS Glue users of the data lake. AWS Glue's flexible job scheduler can be set up to run data transformation flows on a recurring basis, in response to triggers, or even in response to AWS Lambda events. AWS Glue automatically and transparently provisions hardware resources and distributes ETL jobs on Apache Spark nodes so that ETL run times remain consistent as data volume grows. AWS Glue coordinates the execution of data lake jobs in the right sequence and automatically retries failed jobs. With AWS Glue, there are no servers or clusters to manage, and you pay only for the resources consumed by your ETL jobs.

AWS Lambda functions can also be used for transforming the data stored in your data lake built on Amazon S3. Lambda functions can respond to event notifications from Amazon S3 when an object is created or deleted, and you can configure a Lambda function to perform an action asynchronously based on the event.

When data is stored in a data lake built on S3, you can share the data with multiple applications. These applications can vary in nature and purpose (for example, e-commerce, analytics, and marketing applications) and can necessitate a different view for each application. S3 Object Lambda can be used to add your own code to process the data retrieved before returning it to an application, for example masking personally identifiable information (PII) or credit card information before transferring it to the application. S3 Object Lambda uses a Lambda function to automatically process and transform the data as it is being retrieved from an S3 bucket. (Figure: Using S3 Object Lambda to transform data before retrieval by applications.) S3 Object Lambda can be very useful for redacting PII data for analytics applications, format conversion, enriching data from other data sources, resizing objects, or even implementing custom authorization rules for access to the data stored in your data lake built on Amazon S3.

Another transformation feature that S3 supports natively is batch operations. Using S3 Batch Operations, you can define an operation to be performed on the exabytes of objects stored in your bucket. S3 Batch Operations manages the entire lifecycle of the batch operation by tracking progress, sending notifications, and storing a detailed completion report for all the operations performed on the objects. By using S3 Batch Operations, you can copy objects from one bucket into another bucket in the same or a different Region, or invoke a Lambda function to transform the objects. S3 Batch Operations can also be used to manage the tags defined on objects and to restore large numbers of objects from S3 Glacier. S3 Batch Operations, along with S3 Object Lock, can be used to manage retention dates and legal holds, applying compliance and governance rules to multiple objects at the same time. S3 Object Lock works on the write-once-read-many (WORM) model, which can help prevent accidental deletion of objects and can also help in adhering to regulatory controls.

In-place querying
One of the most important capabilities of a data lake on AWS is the ability to perform in-place transformation and querying of data assets. This allows users to run sophisticated analytical queries directly on their data stored in S3, without having to copy and load data into separate analytics platforms or data warehouses. There are various tools to perform in-place querying of data stored in a data lake, such as Presto on Amazon EMR and various partner tools. This section provides an overview of serverless services that not only help perform in-place querying but also avoid the procurement and management of servers. Users can query S3 data without any additional infrastructure and pay only for the queries that they run. This makes the ability to analyze vast amounts of unstructured data accessible to any data lake user who can use SQL, and makes it far more cost-effective than the traditional method of performing an ETL process, creating a Hadoop cluster or data warehouse, loading the transformed data into these environments, and then running query jobs. AWS Glue, as described in the previous sections, provides the data discovery and ETL capabilities, and Amazon Athena and Amazon Redshift Spectrum provide the serverless in-place querying capabilities.

In addition to in-place querying using Athena and Redshift Spectrum, S3 also provides capabilities to retrieve a subset of your data through S3 Select and S3 Glacier Select, which improves the performance of accessing large amounts of data from your data lake built on S3. Using S3 Select, users can run SQL statements to filter and retrieve only a subset of the data stored in their data lake. S3 Select operates on objects stored in CSV, JSON, or Apache Parquet format, including objects compressed with formats such as GZIP or BZIP2. Users can also delimit the result set, thus reducing the latency of retrieving the data and optimizing cost. An S3 Select call is sketched below.
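As an illustration of the S3 Select capability described above, this sketch filters rows server-side from a CSV object; the bucket, key, and column names are assumptions made for the example.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical CSV object in the transformed layer with a header row
# containing at least the columns "order_id" and "amount".
resp = s3.select_object_content(
    Bucket="example-datalake-transformed",
    Key="sales/2021/12/orders.csv",
    ExpressionType="SQL",
    Expression=(
        "SELECT s.order_id, s.amount FROM s3object s "
        "WHERE CAST(s.amount AS FLOAT) > 100"
    ),
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; Records events carry the filtered payload.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```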
Amazon Athena
Amazon Athena is an interactive query service that makes it easier for you to analyze data directly in S3 using standard SQL. With a few actions in the AWS Management Console, you can point Athena directly at data assets stored in the data lake built on S3 and begin using standard SQL to run one-time queries and get results in a matter of seconds. Athena is serverless, so there is no infrastructure to set up or manage, and you only pay for the volume of data assets scanned during the queries you run. Athena scales automatically, running queries in parallel, so results are fast even with large datasets and complex queries. You can use Athena to process unstructured, semi-structured, and structured data sets; supported data asset formats include CSV, JSON, and columnar data formats such as Apache Parquet and Apache ORC. You can also use Athena to run one-time queries using ANSI SQL without first aggregating or loading the data into Athena. Athena integrates with Amazon QuickSight for easy data visualization, and it can also be used with third-party reporting and business intelligence tools by connecting those tools to Athena with a JDBC driver.

Athena also natively integrates with the AWS Glue Data Catalog, which provides a persistent metadata store for the data stored in S3. You can create a table and use Athena to query the data based on a metadata store that integrates with the ETL and data discovery features of AWS Glue. When querying an existing table, Athena uses Presto, a distributed SQL engine, under the hood. Athena can also be used to query S3 Inventory using standard SQL. Amazon S3 Inventory is an S3 tool to help manage storage; you can use it to audit and report on the replication and encryption status of objects for business, compliance, and regulatory needs. Athena supports querying S3 Inventory files in ORC, Parquet, or CSV format. AWS recommends using ORC-formatted or Parquet-formatted inventory files because these formats provide faster query performance and lower query costs; ORC and Parquet are columnar formats that allow the reader to read, decompress, and process only the columns required by the current query.

Amazon Redshift Spectrum
Another way to perform in-place querying of data assets in a data lake built on Amazon S3 is to use Amazon Redshift Spectrum. Amazon Redshift is a large-scale managed data warehouse service that supports massively parallel processing; Amazon Redshift Spectrum, by contrast, allows you to run Redshift SQL queries directly against massive amounts of data (up to exabytes) stored in a data lake built on S3. Amazon Redshift Spectrum applies sophisticated query optimization, scaling processing across thousands of nodes, so results are fast even with large data sets and complex queries. Amazon Redshift Spectrum can directly query a wide variety of data assets stored in the data lake, including CSV, TSV, Parquet, Sequence, and RCFile. Because Amazon Redshift Spectrum supports the SQL syntax of Amazon Redshift, you can run sophisticated queries using the same business intelligence tools that you use today. You also have the flexibility to run queries that span both the frequently accessed data assets stored locally in Amazon Redshift and your full data sets stored in S3. Because Amazon Athena and Amazon Redshift share a common data catalog and common data formats, you can use both Athena and Amazon Redshift Spectrum against the same data assets. You would typically use Athena for one-time data discovery and SQL querying, and then use Amazon Redshift Spectrum for more complex queries and for scenarios where a large number of data lake users want to run concurrent business intelligence and reporting workloads.

The broader analytics portfolio
With a data lake built on AWS, data assets get ingested and stored in one massively scalable, low-cost, performant platform. Data discovery, transformation, and SQL querying can all be done in place using AWS services such as AWS Glue, Amazon Athena, and Amazon Redshift Spectrum. In addition, there is a wide variety of other AWS services that can be directly integrated with S3 to create any number of sophisticated analytics, machine learning, and artificial intelligence data processing pipelines. This allows you to quickly solve a wide range of analytics business challenges on a single platform, against common data assets, without having to worry about provisioning hardware and installing and configuring complex software packages before loading data and performing analytics. Plus, you only pay for what you consume. Some of the most common AWS services that can be used with data assets in a data lake built on S3 are described in this section.
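To go with the Athena description above, here is a minimal sketch of submitting a query from boto3 and waiting for the result; the database, table, and output location are hypothetical.

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical Glue database/table and an S3 prefix for query results.
QUERY = "SELECT order_id, amount FROM sales_curated.orders WHERE amount > 100 LIMIT 10"
OUTPUT = "s3://example-datalake-athena-results/"

qid = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": OUTPUT},
)["QueryExecutionId"]

# Poll until the query finishes (simplified; production code should back off and handle errors).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:  # first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```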
Amazon EMR
Amazon EMR is a highly distributed computing framework used to quickly and easily process data in a cost-effective manner. Amazon EMR uses Apache Hadoop, an open-source framework, to distribute data and processing across an elastically resizable cluster of EC2 instances, and allows you to use all the common Hadoop tools such as Hive, Pig, Spark, Flink, Hudi, Hue, Livy, MXNet, Presto, TensorFlow, HBase, or Zeppelin. Amazon EMR does all the heavy lifting involved with provisioning, managing, and maintaining the infrastructure and software of a Hadoop cluster, and is integrated directly with S3. With Amazon EMR, you can launch a persistent cluster that stays up indefinitely, or a temporary cluster that terminates after the analysis is complete; in either scenario, you only pay for the hours the cluster is up. Amazon EMR supports a variety of EC2 instance types, encompassing general-purpose, compute-optimized, memory-optimized, storage/I/O-optimized, and GPU instances (for example, the M, C, R, Z, I, D, G, and P instance families), and all Amazon EC2 pricing options (On-Demand, Reserved, and Spot). The latest version of EMR at the time of writing also supports ARM-based instance types; AWS Graviton processors are custom built by Amazon Web Services and use 64-bit Arm Neoverse cores to deliver the best price performance. When you launch an Amazon EMR cluster (called a job flow), you choose how many and what type of EC2 instances to provision. Organizations with many different lines of business and a large number of users can build a single data lake solution, store their data assets in S3, and then spin up multiple EMR clusters to share data assets in a multi-tenant fashion.

Amazon EMR integrates with S3 for input, output, and intermediate data storage. You can choose HDFS, which runs on a cluster consisting of master and core nodes, for processing data that does not need to persist beyond the lifecycle of the cluster. You can use the Amazon EMR File System (EMRFS), which uses S3 as a data layer, for applications running on your cluster; data stored on EMRFS persists even after the cluster lifecycle. This mechanism helps separate compute and storage, enabling you to independently scale compute and storage depending on the workload: you can scale your compute needs by resizing your cluster, and you can scale your storage needs by using S3.
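A transient-cluster launch like the one described above might look roughly like this with boto3; the release label, instance sizes, log bucket, and default roles are assumptions made to keep the sketch short.

```python
import boto3

emr = boto3.client("emr")

# Launch a small transient Spark cluster that terminates when its steps finish.
response = emr.run_job_flow(
    Name="example-transient-spark",
    ReleaseLabel="emr-6.5.0",                       # assumed release label
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://example-datalake-logs/emr/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,       # transient: shut down after steps
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",              # default EMR roles assumed to exist
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])
```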
Amazon SageMaker
Machine learning is another important data lake use case. Amazon SageMaker is a fully managed machine learning service that provides a set of features to build, train, and deploy machine learning models. SageMaker provides a suite of built-in algorithms and mechanisms to build custom algorithms based on user requirements. SageMaker also provides an integrated Jupyter notebook instance to prepare and transform the data, develop code for model training, and subsequently deploy, test, and validate the models. SageMaker provides visualization tools and wizards that guide you through the process of creating machine learning models without having to learn complex algorithms and technology, and it can create machine learning models based on datasets stored in your data lake built on S3. Combining SageMaker with your data lake built on S3 greatly enhances machine learning capabilities: training machine learning models requires a large amount of data, and a data lake backed by S3 and Amazon S3 Glacier provides cost-effective storage for it.

Amazon QuickSight
Amazon QuickSight is a fast business analytics service that makes it easy for you to build visualizations, perform targeted analysis, and quickly get business insights from your data assets stored in the data lake, any time, on any device. You can use Amazon QuickSight to seamlessly discover AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, and Amazon S3, connect to any of these data sources and data assets, and get insights from this data in minutes. Amazon QuickSight allows organizations using the data lake to seamlessly scale their business analytics capabilities to hundreds of thousands of users. It delivers fast and responsive query performance by using a Super-fast, Parallel, In-memory Calculation Engine (SPICE).

Amazon OpenSearch Service
Amazon OpenSearch Service is a managed service where you can deploy and manage OpenSearch clusters at scale. Common use cases include log analytics, real-time application monitoring, and clickstream analytics. Amazon OpenSearch Service can be used to add a search box to a website; analyze metrics, security event data, and application and infrastructure logs; or store data to automate business workflows. Amazon OpenSearch Service manages infrastructure provisioning and maintenance, thus reducing the overhead of managing an OpenSearch cluster yourself. Amazon OpenSearch Service integrates with various open-source and native AWS services to provide a seamless experience: it integrates with open-source Logstash to conveniently upload bulk data into an Amazon OpenSearch Service domain, and with open-source Kibana for visualization. Amazon OpenSearch Service also works with Kinesis Data Firehose for ingestion of streaming data, and with AWS CloudTrail and Amazon CloudWatch for logging and monitoring, respectively. You can also use your data lake built on Amazon S3 to store streaming data before it is ingested into the Amazon OpenSearch Service domain. The UltraWarm and cold storage tiers of Amazon OpenSearch Service use Amazon S3 for storing infrequently accessed, read-only data in a cost-efficient manner.

Securing, protecting, and managing data

Monitoring and optimizing the data lake environment
Beyond the efforts required to architect and build a data lake, your organization must also consider the operational aspects of a data lake, and how to cost-effectively and efficiently operate a production data lake at large scale. Key elements you must consider are monitoring the operations of the data lake, making sure that it meets performance expectations and service-level agreements, analyzing utilization patterns, and using this information to optimize the cost and performance of your data lake. AWS provides multiple features and services to help optimize a data lake that is built on AWS, including S3 storage analytics, Amazon CloudWatch metrics, AWS CloudTrail, and Amazon S3 Glacier.

Data lake monitoring
A key aspect of operating a data lake environment is understanding how all of the components that comprise the data lake are operating and performing, and generating notifications when issues occur or operational performance falls below predefined thresholds.

Amazon CloudWatch
As an administrator, you need to look at the complete data lake environment holistically. This can be achieved using Amazon CloudWatch. CloudWatch is a monitoring service for AWS Cloud resources and the applications that run on AWS. You can use CloudWatch to collect and track metrics, collect and monitor log files, set thresholds, and initiate alarms, which allows you to react automatically to changes in your data lake built on S3. You can also use CloudWatch metrics to understand and improve the performance of applications that use Amazon S3. CloudWatch generates daily storage metrics for your data lake built on S3 by collecting and processing storage data for your S3 buckets. You can also monitor the requests to your data lake built on S3 and identify and act upon operational issues quickly. In addition, you can monitor S3 replication: the operations pending replication, their total size, and the maximum time required for replication to the destination Region.
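As an example of the daily S3 storage metrics mentioned above, this sketch pulls the BucketSizeBytes metric for one hypothetical bucket over the last two weeks.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Daily storage metrics for a hypothetical data lake bucket.
now = datetime.datetime.utcnow()
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="BucketSizeBytes",
    Dimensions=[
        {"Name": "BucketName", "Value": "example-datalake-raw"},
        {"Name": "StorageType", "Value": "StandardStorage"},
    ],
    StartTime=now - datetime.timedelta(days=14),
    EndTime=now,
    Period=86400,              # one datapoint per day
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), f"{point['Average'] / 1e9:.2f} GB")
```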
data privacy service that uses machine learning and pattern matching to discover monitor and protect your sensitive data stored in your data lake Macie can be used to scan your data lakes and discover sensitive information such as PII or financial data and identify and report overly permissive or unencrypted buckets Using Macie you can run a sensitive data discovery job to identify sensitive information using built in criteria and techniques such as machine learning and pattern matching called as managed data identifiers to analyze objects in your data lake You can also define your own criteria called as custom data identifiers using regular expressions defining a text pattern to match and optionally character sequences and a proximity rule that refines the results Additionally after enabled Macie maintains a complete inventory of your S buckets and evaluates and monitors the bucket for security and access control If Macie detects a potential issue the service creates a policy finding Policy findings are generated when policies or settings for S bucket are changed in a way that reduces the security of the bucket and its objects Macie integrates with other AWS services such as Amazon EventBridge which is a serverless event bus service that can send findings data to services such as AWS Lambda and Amazon Simple Notification Service Amazon SNS to act on the policy findings AWS CloudTrailAn operational data lake has many users and multiple administrators and may be subject to compliance and audit requirements so it s important to have a complete audit trail of actions taken and who has performed these actions AWS CloudTrail is an AWS service that enables governance compliance operational auditing and risk auditing of AWS accounts CloudTrail continuously monitors and retains events related to API calls across AWS services that comprise a data lake CloudTrail provides a history of AWS API calls for an account including API calls made through the AWS Management Console AWS SDKs command line tools and most data lakes built on S services You can identify which users and accounts made requests or took actions against AWS services that support CloudTrail the source IP address the actions were made from and when the actions occurred CloudTrail can be used to simplify data lake compliance audits by automatically recording and storing activity logs for actions made within AWS accounts Integration with Amazon CloudWatch logs provides a convenient way to search through log data identify out of compliance events accelerate incident investigations and expedite responses to auditor requests CloudTrail logs are stored in a separate logs bucket within your data lake refer to the Data lake foundation section of this document for durability and deeper analysis Data lake optimizationOptimizing a data lake environment includes minimizing operational costs By building a data lake on S you only pay for the data storage and data processing services that you actually use as you use them You can reduce costs by optimizing how you use these services Data asset storage is often a significant portion of the costs associated with a data lake Fortunately AWS has several features that can be used to optimize and reduce costs These include S Lifecycle management S storage class analysis S Intelligent Tiering S Storage Lens and Amazon S Glacier storage class Amazon S Lifecycle managementAmazon S Lifecycle management allows you to create lifecycle rules which can be used to automatically migrate data assets to a lower cost tier 
AWS CloudTrail
An operational data lake has many users and multiple administrators, and may be subject to compliance and audit requirements, so it is important to have a complete audit trail of the actions taken and of who performed those actions. AWS CloudTrail is an AWS service that enables governance, compliance, operational auditing, and risk auditing of AWS accounts. CloudTrail continuously monitors and retains events related to API calls across the AWS services that comprise a data lake. CloudTrail provides a history of AWS API calls for an account, including API calls made through the AWS Management Console, AWS SDKs, command line tools, and most services used in data lakes built on S3. You can identify which users and accounts made requests or took actions against AWS services that support CloudTrail, the source IP address the actions were made from, and when the actions occurred. CloudTrail can be used to simplify data lake compliance audits by automatically recording and storing activity logs for actions made within AWS accounts. Integration with Amazon CloudWatch Logs provides a convenient way to search through log data, identify out-of-compliance events, accelerate incident investigations, and expedite responses to auditor requests. CloudTrail logs are stored in a separate logs bucket within your data lake (refer to the Data lake foundation section of this document) for durability and deeper analysis.

Data lake optimization
Optimizing a data lake environment includes minimizing operational costs. By building a data lake on S3, you pay only for the data storage and data processing services that you actually use, as you use them. You can reduce costs by optimizing how you use these services. Data asset storage is often a significant portion of the costs associated with a data lake. Fortunately, AWS has several features that can be used to optimize and reduce costs, including S3 Lifecycle management, S3 storage class analysis, S3 Intelligent-Tiering, S3 Storage Lens, and the Amazon S3 Glacier storage class.

Amazon S3 Lifecycle management
Amazon S3 Lifecycle management allows you to create lifecycle rules that automatically migrate data assets to a lower-cost tier of storage, such as the S3 Standard-IA storage class or the Amazon S3 Glacier storage class, or let them expire when they are no longer needed. A lifecycle configuration, which consists of an XML file, comprises a set of rules with predefined actions that you want Amazon S3 to perform on data assets during their lifetime. Lifecycle configurations can perform actions based on data asset age and data asset names, but can also be combined with S3 object tagging to perform very granular management of data assets. If the access pattern for your S3 buckets is constantly changing or evolving, you can use the S3 Intelligent-Tiering storage class for automatic cost savings.
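As a concrete, hypothetical example of the lifecycle rules described above, the following boto3 sketch transitions objects under a raw-data prefix to S3 Standard-IA after 90 days and to the S3 Glacier storage class after 365 days, then expires them after roughly seven years; the bucket name, prefix, and thresholds are illustrative only.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical lifecycle rule for raw data assets under the "raw/" prefix.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-data-lake-raw",  # placeholder bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-and-expire-raw-assets",
                    "Filter": {"Prefix": "raw/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 90, "StorageClass": "STANDARD_IA"},
                        {"Days": 365, "StorageClass": "GLACIER"},
                    ],
                    "Expiration": {"Days": 2555},  # roughly seven years
                }
            ]
        },
    )

The same rule could instead use a tag-based filter when age- and prefix-based rules are not granular enough.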
Amazon S3 Storage class analysis
One of the challenges of developing and configuring lifecycle rules for the data lake is gaining an understanding of how data assets are accessed over time. It only makes economic sense to transition data assets to a more cost-effective storage or archive tier if those objects are infrequently accessed; otherwise, the data access charges associated with these more cost-effective storage classes can negate any potential savings. Amazon S3 provides storage class analysis to help you understand how data lake data assets are used. Storage class analysis uses machine learning algorithms on collected access data to help you develop lifecycle rules that will optimize costs.

Seamlessly tiering to lower-cost storage tiers is an important capability of a data lake, particularly as its users plan for and move to more advanced analytics and machine learning capabilities. Data lake users will typically ingest raw data assets from many sources and transform those assets into harmonized formats that they can use for one-time querying and ongoing business intelligence querying through SQL. However, users also want to perform more advanced analytics using streaming analytics, machine learning, and artificial intelligence. These more advanced analytics capabilities consist of building data models, validating these data models with data assets, and then training and refining these models with historical data. Keeping more historical data assets, particularly raw data assets, allows for better training and refinement of models. Additionally, as your organization's analytics sophistication grows, you may want to go back and reprocess historical data to look for new insights and value. These historical data assets are infrequently accessed and consume a lot of capacity, so they are often well suited to being stored on an archival storage layer. Another long-term storage need for the data lake is to keep processed data assets and results in long-term retention for compliance and audit purposes, so that they can be accessed by auditors when needed. Both of these use cases are well served by the Amazon S3 Glacier storage class, which is an S3 storage class optimized for infrequently used cold data and for storing write-once-read-many (WORM) data. You can also use S3 Object Lock to adhere to regulatory compliance.

S3 Intelligent-Tiering
S3 Intelligent-Tiering is designed to optimize your storage cost by monitoring your access patterns and automatically moving the data within your buckets. S3 Intelligent-Tiering makes use of two low-latency, high-throughput access tiers: one tier for frequently accessed data and another tier for infrequently accessed data. S3 Intelligent-Tiering monitors access patterns and automatically moves data that has not been accessed for 30 consecutive days to the infrequent access tier, without operational overhead or performance impact. S3 Intelligent-Tiering is recommended for data that has unpredictable access patterns, regardless of object type, size, and retention period, such as data lakes, data analytics applications, or new applications. If data in the infrequent access tier is accessed, it is moved back to the frequent access tier. Additionally, you can configure S3 Intelligent-Tiering to automatically move data that has not been accessed for 90 consecutive days to an archive access tier, and data that has not been accessed for 180 consecutive days to a deep archive access tier. You can also configure the last-access time for archiving up to a maximum of 730 days for both the Archive Access tier and the Deep Archive Access tier. The S3 Intelligent-Tiering Archive Access tier provides the same performance as the Amazon S3 Glacier storage class, and the S3 Intelligent-Tiering Deep Archive Access tier provides the same performance as the Amazon S3 Glacier Deep Archive storage class.
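The optional archive tiers described above are enabled per bucket with an Intelligent-Tiering configuration. The boto3 sketch below opts a hypothetical prefix into the Archive Access tier after 90 days without access and the Deep Archive Access tier after 180 days; the bucket name, prefix, and configuration ID are placeholders.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical configuration: archive rarely accessed objects under "historical/".
    s3.put_bucket_intelligent_tiering_configuration(
        Bucket="example-data-lake-curated",  # placeholder bucket name
        Id="archive-historical-data",
        IntelligentTieringConfiguration={
            "Id": "archive-historical-data",
            "Filter": {"Prefix": "historical/"},
            "Status": "Enabled",
            "Tierings": [
                {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
                {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
            ],
        },
    )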
S3 Storage Lens
As your data lake becomes more popular and expands to accommodate data from multiple applications across multiple accounts and S3 buckets, it can become increasingly complicated to understand usage of the data across the organization, optimize cost, and understand the security posture. S3 Storage Lens gives you visibility into your object storage across your organization, with point-in-time metrics, trend lines, and actionable insights. You can generate insights at the organization, account, Region, bucket, and prefix level. S3 aggregates your usage and metrics across all of the accounts and provides an account snapshot on the S3 console (Buckets) home page. You can use the Storage Lens dashboard to visualize insights and trends, identify outliers, receive recommendations for storage cost optimization, and so on. You can use the Storage Lens dashboard to identify your largest buckets and take the necessary actions to optimize their cost, since the rate charged depends on object size, duration of storage, and storage class. If you upload objects using multipart upload, some uploads may fail or never complete; the incomplete uploads remain in your buckets and are chargeable. You can identify these incomplete multipart uploads using the Storage Lens dashboard. Additionally, Storage Lens can help identify multiple versions of objects. Finally, you can also use Storage Lens to uncover cold buckets in your account, which are buckets that have not been accessed for a long period of time. All of these insights can be accessed from the Storage Lens dashboard.

Amazon S3 Glacier
Amazon S3 Glacier is a low-cost Amazon S3 storage class that provides durable storage with security features for data archiving and backup. S3 Glacier has the same data durability as Amazon S3 and supports lifecycle management on data assets stored in S3, so that data assets can seamlessly migrate from Amazon S3 to S3 Glacier. The S3 Glacier storage class is a great storage choice when low storage cost is essential, data assets are rarely retrieved, and retrieval latency of several minutes to several hours is acceptable. Different types of data lake assets may have different retrieval needs. For example, compliance data may be infrequently accessed and relatively small in size, but it needs to be made available in minutes when auditors request it, whereas historical raw data assets may be very large but can be retrieved in bulk over the course of a day when needed. S3 Glacier allows data lake users to specify retrieval times when the data retrieval request is created, with longer retrieval times leading to lower retrieval costs. For processed data and records that need to be securely retained, Amazon S3 Glacier Vault Lock allows data lake administrators to deploy and enforce compliance controls on individual S3 Glacier vaults with a lockable policy. Administrators can specify controls such as WORM in a Vault Lock policy and lock the policy against future edits. After it is locked, the policy becomes immutable, and S3 Glacier will enforce the prescribed controls to help achieve your compliance objectives and provide an audit trail for these assets using AWS CloudTrail.

Cost and performance optimization
You can optimize your data lake for cost and performance. Amazon S3 provides a very performant foundation for a data lake because its enormous scale provides virtually limitless throughput and extremely high transaction rates. Using S3 best practices for data asset naming ensures high levels of performance; these best practices can be found in the Amazon Simple Storage Service Developer Guide. Another area of optimization is to use optimal data formats when transforming raw data assets into normalized formats in preparation for querying and analytics. These optimal data formats can compress data and reduce the data capacity needed for storage, and they can also substantially increase query performance for the analytics services commonly used with a data lake built on S3.

Data lake environments are designed to ingest and process many types of data, and to store raw data assets for future archival and reprocessing purposes as well as processed and normalized data assets for active querying, analytics, and reporting. A key best practice to reduce storage and analytics processing costs, and to improve analytics querying performance, is to use an optimized data format, particularly a format like Apache Parquet. Parquet is a columnar, compressed storage file format that is designed for querying large amounts of data, regardless of the data processing framework, data model, or programming language. Compared to common raw data log formats such as CSV, JSON, or TXT, Parquet can reduce the required storage footprint, improve query performance significantly, and greatly reduce querying costs for AWS services that charge by the amount of data scanned. Amazon tests comparing the CSV and Parquet formats, using one TB of log data stored in CSV format and converted to Parquet, showed substantial space savings for the Parquet copy of the data, a markedly faster representative Athena query against Parquet, far less data scanned for that query, and a correspondingly lower cost to run it. Parquet has the additional benefit of being an open data format that can be used by multiple querying and analytics tools in a data lake built on Amazon S3, particularly Amazon Athena, Amazon EMR, Amazon Redshift, and Amazon Redshift Spectrum. Additional options for performance optimization include right-sizing S3 objects and partitioning the data based on the business dates that are typically used when querying.
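As one concrete way to apply the Parquet and partitioning guidance above (an illustration, not a prescription from the paper), a CSV-backed table that is already registered in the AWS Glue Data Catalog can be rewritten as compressed, partitioned Parquet with an Athena CTAS statement submitted through boto3; the database, table, column, and bucket names are hypothetical.

    import boto3

    athena = boto3.client("athena")

    # CTAS statement: rewrite a hypothetical CSV-backed table as Snappy-compressed,
    # date-partitioned Parquet. Partition columns must come last in the SELECT list.
    ctas = """
    CREATE TABLE datalake_db.web_logs_parquet
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = 's3://example-data-lake-curated/web_logs_parquet/',
        partitioned_by = ARRAY['event_date']
    ) AS
    SELECT request_url, status_code, bytes_sent, event_date
    FROM datalake_db.web_logs_csv
    """

    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": "datalake_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )

Downstream Athena, Amazon EMR, and Redshift Spectrum queries can then read the Parquet output directly, scanning only the columns and partitions they need.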
Future-proofing the data lake
A data lake built on AWS can immediately solve a broad range of business analytics challenges and quickly provide value to your business. However, business needs are constantly evolving, AWS and the analytics partner ecosystem are rapidly evolving and adding new services and capabilities, and businesses and their data lake users achieve more experience and analytics sophistication over time. Therefore, it is important that the data lake can evolve seamlessly and non-disruptively as needed. AWS future-proofs your data lake with a standardized storage solution that grows with your organization by ingesting and storing all of your business's data assets on a platform with virtually unlimited scalability and well-defined APIs, and that integrates with a wide variety of data processing tools. This allows you to add new capabilities to your data lake as you need them, without infrastructure limitations or barriers. Additionally, you can perform agile analytics experiments against data lake assets to quickly explore new processing methods and tools, and then scale the promising ones into production, without the need to build new infrastructure, duplicate or migrate data, or have users migrate to a new platform.

Conclusion
Amazon S3 is a scalable, highly durable, and reliable service for building and managing a secure data lake at scale. You can ingest and store structured, semi-structured, and unstructured data from a wide variety of sources into a centralized platform. With a data lake built on S3, you can use native AWS services to run big data analytics, artificial intelligence (AI), and machine learning (ML) applications to gain insights from your unstructured data sets. You also have the flexibility to use your preferred analytics, AI, and ML solutions from the AWS Partner Network (APN). S3 provides a wide variety of features to empower IT managers, storage administrators, and data scientists to enforce access policies, manage objects at scale, audit activities, and secure data across their data lake built on S3. In closing, a data lake built on AWS allows you to evolve your business around your data assets and to use these data assets to quickly and agilely drive more business value and competitive differentiation, without limits.

Reference
Original paper 2021-12-18 14:10:14
Apple AppleInsider - Frontpage News YouTube TV loses Disney networks as Google fails to make a deal https://appleinsider.com/articles/21/12/18/youtube-tv-loses-disney-networks-as-google-fails-to-make-a-deal?utm_medium=rss YouTube TV loses Disney networks as Google fails to make a deal. YouTube TV subscribers are no longer able to watch Disney-owned networks on the streaming service after Google failed to make a new deal with the media giant before Friday's deadline. Users of YouTube TV using the service on Saturday will see that a number of channels have disappeared from the roster. The channels, including ESPN, ABC, FX, and National Geographic, are pulled for subscribers due to Google and Disney negotiations failing to progress to a deal. The blackout is due to an inability of the two companies to meet in the middle in order to renew an agreement concerning the content. While the two are still supposedly continuing talks on the matter, the channels are no longer watchable on YouTube TV. Read more 2021-12-18 14:55:00
Apple AppleInsider - Frontpage News How to manage notifications in iOS 15 and iPadOS 15 https://appleinsider.com/articles/21/11/01/how-to-manage-notifications-in-ios-15-and-ipados-15?utm_medium=rss How to manage notifications in iOS 15 and iPadOS 15. Apple makes managing your notifications even easier on iOS 15 and iPadOS 15, allowing you to choose when, and where, you see all your alerts. While there's no doubt that notifications can be useful, too many at once can be overwhelming. Thanks to new updates introduced in iOS 15 and iPadOS 15, you can now minimize interruptions throughout your day. We'll show you how to change your alert styles, set up notification summaries, and how to control your group notification settings. Read more 2021-12-18 14:20:00
ニュース BBC News - Home Covid-19: Cabinet ministers briefed on latest UK data as Omicron surges https://www.bbc.co.uk/news/uk-59711474?at_medium=RSS&at_campaign=KARANGA emergency 2021-12-18 14:51:15
ニュース BBC News - Home Covid: Action needed to limit hospital admissions - Sage scientists https://www.bbc.co.uk/news/health-59707252?at_medium=RSS&at_campaign=KARANGA england 2021-12-18 14:54:11
ニュース BBC News - Home Aston Villa v Burnley postponed after Villa record 'increased amount' of positive Covid-19 cases https://www.bbc.co.uk/sport/football/59665333?at_medium=RSS&at_campaign=KARANGA Aston Villa v Burnley postponed after Villa record 'increased amount' of positive Covid-19 cases. Aston Villa's match with Burnley is postponed after the Birmingham club confirmed an increased amount of positive Covid-19 test results in their squad. 2021-12-18 14:05:33
ニュース BBC News - Home Ryan Reynolds and Wrexham: More playing as club on Football Manager https://www.bbc.co.uk/news/uk-wales-59698398?at_medium=RSS&at_campaign=KARANGA popularity 2021-12-18 14:16:48
北海道 北海道新聞 Nordic combined: Akito Watabe finishes seventh in the men's event at Ski World Cup round 7 https://www.hokkaido-np.co.jp/article/624814/ Nordic combined 2021-12-18 23:13:00
