Posted: 2023-06-26 11:12:09 | RSS feed digest as of 2023-06-26 11:00 (12 items)
Category | Site | Article Title / Trend Word | Link URL | Frequent Words / Summary / Search Volume | Registered |
---|---|---|---|---|---|
IT | ITmedia (all articles) | [ITmedia Business Online] 80% of small and medium-sized companies plan to pay a summer bonus; what are their reasons for raising or cutting it? | https://www.itmedia.co.jp/business/articles/2306/26/news068.html | itmedia | 2023-06-26 10:23:00 |
IT | ITmedia (all articles) | [ITmedia Business Online] Ranking of average new-condominium prices in Tokyo's 23 wards: Chiyoda is No. 2 at 230 million yen; what about No. 1 Minato? | https://www.itmedia.co.jp/business/articles/2306/26/news069.html | itmedia | 2023-06-26 10:20:00 |
AWS | AWS Japan Blog | Weekly AWS - week of 2023/6/19 | https://aws.amazon.com/jp/blogs/news/aws-weekly-20230619/ | awsglue | 2023-06-26 01:54:37 |
js | New posts tagged JavaScript - Qiita | [TypeScript/JavaScript] Array operations, presented in order of how often they are used | https://qiita.com/rokumura7/items/e1e7f4006897e9247572 | javascript | 2023-06-26 10:16:58 |
js | New posts tagged JavaScript - Qiita | Bring a linter to your test code, too | https://qiita.com/ken7253_/items/c0f9e1c707693a5c6773 | linter | 2023-06-26 10:10:16 |
js | New posts tagged JavaScript - Qiita | React folder structure | https://qiita.com/omo_taku/items/9553430a64bd3e79b948 | | 2023-06-26 10:05:06 |
Tech Blog | Developers.IO | Bad at organizing notes? Sort them into four boxes: the PARA method | https://dev.classmethod.jp/articles/cm-an-nm-notion-gathering-1st-para-method/ | kankan | 2023-06-26 01:01:39 |
Overseas TECH | DEV Community | Unsupervised Clustering with K-Means | https://dev.to/pedrohgoncalves/unsupervised-clustering-with-k-means-5h26 | (full article text, reconstructed below) | 2023-06-26 01:30:43 |

**Unsupervised Clustering with K-Means**

In the past few weeks I have been studying clustering and some of its models, to apply in a project at the company I work for. When you study clustering you quickly come across the centroid model, which leads you to K-Means, the most famous method for working with this type of clustering. We will use K-Means for our activity.

Speaking a bit about how we will perform our activity: we will use a dataset containing sales information from an unspecified company, and we will cluster its customers taking into account some of their behaviors in the store. For this purpose we will also utilize the RFM concept (recency, frequency, monetary), which is widely used by marketing teams.

It's important to remember that there are various other types of clustering and centroid-based clustering algorithms. This article focuses specifically on K-Means and a practical application of its algorithm, and will not discuss data transformation or data visualization. If you would like to give feedback on my code, you can visit the repository where the code used in this article is located, as there are some visualization and transformation aspects not shown here.

**K-Means**

K-Means is an unsupervised algorithm, which means it does not require labels on the events, unlike supervised algorithms that need labels for training. Unsupervised algorithms are designed to learn from the data itself, autonomously identifying patterns that are often not visible to the naked eye. The goal of the algorithm is to generate K clusters, where K is defined by the scientist, reducing the variance within clusters and increasing the similarity among points assigned to the same cluster.

**How it works**

1. The algorithm randomly places K centroids in the feature space.
2. Iterating over each point and each centroid, the distance between them is calculated (using the Euclidean distance formula), and each point is assigned to the centroid with the shortest distance.
3. The position of each centroid is recalculated as the mean of the coordinates of the points assigned to its cluster.
4. Steps 2 and 3 are repeated until the positions of the centroids no longer change significantly, or until a certain number of iterations is reached.

**Determining the ideal number of K**

To determine the number K of clusters we will use the elbow method, the most common choice for this task. We will also use the point-line distance calculation to refine and better define our number of clusters.

The elbow method calculates the sum of squared distances between the points within a cluster. Its goal is to minimize the total inertia (variability of the points) of each cluster. The formula for this calculation is as follows:

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$$

where $K$ is the number of clusters, $x$ is a point within cluster $C_k$, and $\mu_k$ is the mean (centroid) of that cluster.

The point-line distance calculation gives the perpendicular distance of each point to a line defined by two points. We use it to find the cluster count with the greatest homogeneity within clusters and the greatest difference between clusters:

$$d = \frac{\lvert (y_2 - y_1)\,x_0 - (x_2 - x_1)\,y_0 + x_2 y_1 - y_2 x_1 \rvert}{\sqrt{(y_2 - y_1)^2 + (x_2 - x_1)^2}}$$

Here $P_1 = (x_1, y_1)$ is our starting point and $P_2 = (x_2, y_2)$ our last point: $y_2$ is the y value, on a Cartesian plane, of the last point $P_2$, and the same applies to $y_1$ for the first point; the same logic applies to $x_1$ and $x_2$. As we iterate over each candidate number of clusters, $x_0$ and $y_0$ are the x and y values of the cluster count being evaluated.
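Before moving to the implementation, here is a minimal from-scratch sketch of the four-step loop described under "How it works". This is not the article's code but an illustrative NumPy version; the function name and defaults are invented for this example:

```python
import numpy as np

def kmeans_sketch(points: np.ndarray, k: int, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Minimal K-Means loop: returns the cluster index of each point."""
    rng = np.random.default_rng(seed)
    # Step 1: place K centroids at randomly chosen data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        centroids_new = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move significantly
        if np.allclose(centroids_new, centroids):
            break
        centroids = centroids_new
    return labels

# Example: three clusters over 100 random 2-D points
labels = kmeans_sketch(np.random.rand(100, 2), k=3)
```

In practice you would use scikit-learn's KMeans, as the article does below; it adds smarter initialization (k-means++) and multiple restarts.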
We will start by defining two functions. calculateWcss iterates over a small range of cluster counts (we don't want too many customer clusters in our dataset, and the range is generally determined and tested with the data and business teams); it computes the total inertia for each number of clusters and returns them as a list. plotFigure plots that curve:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame
from sklearn.cluster import KMeans

def calculateWcss(data) -> list:
    wcss = []
    for k in range(1, 10):  # the exact bounds were garbled in the source; a small range is assumed
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(X=data)
        data["clusters"] = kmeans.labels_
        wcss.append(kmeans.inertia_)
    return wcss

def plotFigure(quadraticSum: list, figsize: tuple):
    plt.figure(figsize=figsize)
    plt.plot(quadraticSum)
    plt.xlabel("Clusters")
    plt.show()

dfRecencyModel = dfOrderCostumer[["recency"]]
quadraticSum = calculateWcss(dfRecencyModel)
plotFigure(quadraticSum, (10, 5))  # figsize value assumed
```

Calling the calculateWcss function on the column that represents the number of days since the last purchase, and plotting it with plotFigure, we get the elbow curve. Interpreting this graph we might think: "Well, the largest number of clusters is the best, because it has the lowest inertia." That is not entirely incorrect, but not entirely correct either. As mentioned earlier, we don't want too many clusters; we are looking for the point where the inertia stops decreasing drastically, always aiming for the fewest clusters. On reevaluation, the counts around the bend are strong candidates, so we will use the point-line distance calculation to settle on the number of clusters we will apply.

Let's define the distancePointLine function in code. It calculates the distance of each candidate cluster count from the line through P1 and P2, the first and last points of the inertia curve computed by calculateWcss, and returns the candidate with the greatest perpendicular distance from that line:

```python
def distancePointLine(wcss: list) -> int:
    import math
    # P1 and P2 are the endpoints of the inertia curve; the exact x constants were
    # garbled in the source and are assumed here to match calculateWcss's range.
    x1, y1 = 1, wcss[0]
    x2, y2 = len(wcss), wcss[len(wcss) - 1]
    distance = []
    for i in range(len(wcss)):
        x0, y0 = i + 1, wcss[i]
        numerator = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
        denominator = math.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)
        distance.append(numerator / denominator)
    return distance.index(max(distance))  # index into the tested range; offset by its start to get K
```
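To sanity-check the reconstruction above, here is a small self-contained demo on a synthetic inertia curve with an obvious elbow (the values are invented for illustration):

```python
# Inertia drops sharply until the third value, then flattens: a classic elbow
fake_wcss = [1000, 400, 180, 150, 130, 120, 115, 112, 110]
print(distancePointLine(fake_wcss))  # prints 2: the elbow is the third tested count
```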
**Clustering our dataset**

In our dataset we have information such as recency (used above to determine the ideal number of clusters; it is the number of days since the last purchase), frequency (the number of times a particular customer has made purchases in our store), and monetary value (the amount the customer has spent in our store). Typically people would cluster on all the features (columns) together; however, we will perform separate clustering for each feature, specifically four clusters per feature.

Let's start by defining a function that takes as parameters a new column name for the cluster, the name of the feature to be used as the basis for clustering, the multidimensional array of that feature separated from the DataFrame, the DataFrame itself to which the clustering is added, and whether the rating (the cluster a customer lands in) should be in ascending or descending order. We will use the cluster number as the rating: the lowest-rated cluster will represent the customers who have spent the least money or have been inactive the longest on the platform.

```python
def orderCluster(clusterName: str, target_name: str, featureColumn: DataFrame,
                 dfAppend: DataFrame, ascending: bool) -> DataFrame:
    kmeans = KMeans(n_clusters=nmrCluster)  # nmrCluster: the cluster count chosen earlier
    dfUse = dfAppend
    dfUse[clusterName] = kmeans.fit_predict(featureColumn)
    # Rank the clusters by the mean of the target feature and use the rank as the rating
    groupbyCluster = dfUse.groupby(clusterName)[target_name].mean().reset_index()
    groupbyCluster = groupbyCluster.sort_values(by=target_name, ascending=ascending).reset_index(drop=True)
    groupbyCluster["index"] = groupbyCluster.index
    groupbyCluster.drop(columns=[target_name], inplace=True)
    dfUsageMerged = pd.merge(dfUse, groupbyCluster, on=clusterName)
    dfUsageMerged.drop(columns=[clusterName], inplace=True)
    dfUsageMerged.rename(columns={"index": clusterName}, inplace=True)
    return dfUsageMerged
```

Now we call the orderCluster function for each feature and collect the cluster columns into the dfMain DataFrame (on which we performed some transformations after reading the CSV file). We then sum the three ratings into a score and map score ranges to segments (the exact thresholds were garbled in the source and are assumed below):

```python
finalDataframe = dfMain[["id_unique_costumer", "recency", "recency_cluster",
                         "order_approved", "frequency_cluster",
                         "agg_value", "revenue_cluster"]]
finalDataframe["pontuation"] = (finalDataframe["recency_cluster"]
                                + finalDataframe["frequency_cluster"]
                                + finalDataframe["revenue_cluster"])
finalDataframe["segmentation"] = "Inactive"
finalDataframe.loc[finalDataframe["pontuation"] > 3, "segmentation"] = "Business"  # thresholds assumed
finalDataframe.loc[finalDataframe["pontuation"] > 5, "segmentation"] = "Master"
finalDataframe.loc[finalDataframe["pontuation"] > 7, "segmentation"] = "Premium"
```

And then we can plot a graph to visualize the distribution of each segmentation, using the features agg_value (amount of money spent) and recency (number of days since the last purchase). Here's the function to plot the graph:

```python
def plot_segment(x, y, data):
    sns.set(palette="muted", color_codes=True, style="whitegrid")
    # Some scatterplot arguments were garbled in the source; hue_order is assumed
    sns.scatterplot(x=x, y=y, hue="segmentation", data=data,
                    hue_order=["Premium", "Master", "Business", "Inactive"])
    plt.show()

plot_segment("recency", "agg_value", finalDataframe)
```

With the plotted graph it becomes clear that our customers classified as Premium (obviously few) have spent higher amounts than the average and made more recent purchases, while the inactive ones have not spent much and haven't purchased for some time. Based on this, our company can communicate in a more targeted way, offering customized services to Premium customers and providing some type of discount coupon to encourage inactive customers to return and spend in our store.

**Digging deeper into RFM**

Let's further analyze our recency clusters with the following code:

```python
finalDataframe.groupby("recency_cluster")["recency"].describe().reset_index()
```

This tells us the average number of days since the last purchase for each recency cluster, and therefore which cluster has the lowest recency. The RFM concept creates further customer groups from the clusters a customer belongs to, taking the attributes of recency, frequency, and monetary value together. For example, a customer in a low recency cluster, a middling frequency cluster, and a high monetary-value cluster hasn't made a recent purchase, has made a reasonable number of purchases in our store, and has spent a high amount of money; an RFM analysis would allocate this customer to a dormant or hibernating segment. We can implement this classification in our algorithm, but I present it here as a challenge (a possible starting point is sketched after the conclusion). I recommend reading more about RFM and how to implement it in your business alongside unsupervised clustering.

**Conclusion**

In this article we have learned how to determine the ideal number of clusters and how to cluster our customers based on the widely used RFM concept. If you would like to explore other models, data visualization, or data transformation, I suggest checking out my GitHub repository, where I frequently work on data engineering projects and related topics. Thank you very much for reading. Repository
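As a hedged starting point for the RFM challenge above: the segment names, thresholds, and helper function below are invented for illustration, and the cluster ratings follow orderCluster's convention that a higher rating is a better customer.

```python
def rfm_label(recency_cluster: int, frequency_cluster: int, revenue_cluster: int) -> str:
    # Illustrative mapping only: 0 = worst rating, 3 = best, per orderCluster above
    if recency_cluster <= 1 and frequency_cluster >= 2 and revenue_cluster >= 2:
        return "Hibernating"  # bought often and spent a lot, but not recently
    if recency_cluster >= 2 and frequency_cluster >= 2:
        return "Loyal"
    if recency_cluster >= 2:
        return "Recent"
    return "At risk"

finalDataframe["rfm_segment"] = finalDataframe.apply(
    lambda row: rfm_label(row["recency_cluster"],
                          row["frequency_cluster"],
                          row["revenue_cluster"]),
    axis=1,
)
```

From here, the per-segment actions discussed above (customized services, discount coupons) can be applied per RFM label as well.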
Overseas TECH | DEV Community | FLaNK Stack Weekly on 26 June 2023 | https://dev.to/tspannhw/flank-stack-weekly-on-26-june-2023-3082 | FLiPN-FLaNK Stack Weekly, by Tim Spann (@PaaSDev). My friend wrote an awesome new book on streaming; I highly recommend picking up a copy. Join me in person for steak & stack, or virtually for FLaNK Stack: a Wednesday evening in June (EDT) at The Capital Grille, W Wisconsin Ave, Milwaukee, WI, also live-streamed to YouTube. This will be a hybrid event with a Zoom; the in-person event will be in Milwaukee. In this interactive session Tim will lead participants through how to best build streaming data pipelines. He will cover how to build applications for some common use cases and highlight tips, tricks, best practices, and patterns. He will show how to build the easy way, then dive deep into the underlying open-source technologies, including Apache NiFi, Apache Flink, Apache Kafka, and Apache Iceberg. If you wish to follow along, please download the open-source projects beforehand; you can also download this helpful streaming platform. All source code and slides will be shared for those interested in building their own FLaNK apps. Hardware for FLaNK: the amazing team at Ampere Computing sent us a 2U Mt. Jade; we will be running AI, IoT, MiNiFi, NiFi, Kafka, Flink, Pulsar, Spark, Iceberg, Ozone, HBase, Kudu, Hive, Impala, Jupyter, and more workloads on it. Updates: CDF-PC on CDP Public Cloud. New advanced UIs: the Flow Designer now supports the advanced configuration UI for UpdateAttribute and for JoltTransformJson. New canvas navigation: the Flow Designer now supports Birdseye and Zoom controls. New troubleshooting: the Flow Designer now supports Processor Diagnostics with an active Test Session. Multi-select: the Flow Designer now supports multi-selection on the canvas and bulk actions for Start, Stop, Enable, Disable, Move, Change parent group, Copy, Paste, and Delete. New ReadyFlows for this release: CDW Ingest, CDP Kafka to Snowflake, Slack to S3, and an updated Confluent Cloud to Snowflake using the new Snowpipe processors. Code & community: please join my meetup group (NJ / NYC / Philly / virtual). This issue's number takes a jump: LinkedIn says we already had more issues, so I am assuming two other articles got assimilated, and I will go with the higher number. Courses & videos: Ververica (YouTube); conference videos: "Hail Hydrate! From Stream to Lake" (PowerShell.org, YouTube). Articles (dev.to/tspannhw): ingesting events into dockerized IBM Db2 via JDBC with Apache NiFi; functions anywhere with FaaS. Events: June: NLIT Summit, Milwaukee; June: NiFi Meetup, Milwaukee and hybrid; July: Hours to Data Innovation, Data Flow; October: Hours to Data Innovation, Data Flow; Cloudera events and more. Code, NiFi code, tools. Tim Spann | 2023-06-26 01:10:11 |
Finance | RSS FILE - Japan Securities Dealers Association | Bond lending transaction balances (formerly: bond lending transaction status) | https://www.jsda.or.jp/shiryoshitsu/toukei/taishaku/index.html | lending | 2023-06-26 01:30:00 |
Marketing | AdverTimes | TV commercial for the 「家庭教師のトライ」 tutoring service's summer courses: this summer, Heidi and Peter are caught in a time loop | https://www.advertimes.com/20230626/article424312/ | 家庭教師のトライ | 2023-06-26 01:24:58 |
News | THE BRIDGE | MiiTel Meetings, which summarizes online meetings with ChatGPT, now supports Google Meet #IVSPRWeek | https://thebridge.jp/2023/06/online-meeting-tool-miitel-meetings-supports-google-meet-ivsprweek | IVS PR Week is a "Startup PR Week" organized by the startup conference IVS and the press-release distribution service PR TIMES; this covers new-product news from participating companies. | 2023-06-26 01:05:03 |