From 4e58f2d643759d6628e33e8acda6d609e512fe6b Mon Sep 17 00:00:00 2001
From: yanirs
Date: Mon, 9 Sep 2024 00:56:47 +0000
Subject: [PATCH] deploy: dd7b0d6e75ef6b21582eaa8979f3c4640706fa07

diff --git a/2014/12/29/stochastic-gradient-boosting-choosing-the-best-number-of-iterations/index.html b/2014/12/29/stochastic-gradient-boosting-choosing-the-best-number-of-iterations/index.html

[Figure: Gradient Boosting out of bag experiment (toy dataset)]

My approach (TSO) beat both 5-fold cross-validation (CV) and the GBM/scikit-learn method (SKO), as TSO obtains its minimum at the closest number of iterations to the test set’s (T) optimal value.

The next step in testing TSO’s viability was to rerun Ridgeway’s experiments from Section 3.3 of the GBM documentation (R code here). I used the same 12 UCI datasets that Ridgeway used, running 5×2 cross-validation on each one. For each dataset, the score was obtained by dividing the mean loss of the best method on the dataset by the loss of each method. Hence, all scores are between 0.0 and 1.0, with the best score being 1.0. The following figure summarises the results on the 12 datasets.
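This relative scoring can be sketched in a few lines of Python; the loss values below are made up for illustration, not taken from the experiments:

```python
# Hypothetical mean losses for each method on a single dataset
# (illustrative values only, not results from the actual runs).
mean_losses = {"CV": 0.21, "SKO": 0.25, "TSO": 0.23}

# Score each method relative to the best (lowest-loss) method on the dataset,
# so all scores fall in (0, 1] and the best method scores exactly 1.0.
best_loss = min(mean_losses.values())
scores = {method: best_loss / loss for method, loss in mean_losses.items()}
```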

Gradient Boosting out of bag experiment (UCI datasets)

The following table shows the raw data that was used to produce the figure.

Dataset          CV      SKO     TSO
creditrating     0.9962  0.9771  1
breastcancer     1       0.6675  0.4869
mushrooms        0.9588  0.9963  1
abalone          1       0.9754  0.9963
ionosphere       0.9919  1       0.8129
diabetes         1       0.9869  0.9985
autoprices       1       0.9565  0.5839
autompg          1       0.8753  0.9948
bostonhousing    1       0.8299  0.5412
haberman         1       0.9793  0.9266
cpuperformance   0.9934  0.9160  1
adult            1       0.9824  0.9991

The main finding is that CV remains the most reliable approach. Even when CV is not the best-performing method, it’s not much worse than the best method (this is in line with Ridgeway’s findings). TSO yielded the best results on 3/12 of the datasets, and beat SKO 7/12 times. However, TSO’s results are the most variable of the three methods: when it fails, it often yields very poor results.

In conclusion, stick to cross-validation for the best results. It’s more computationally intensive than SKO and TSO, but can be parallelised. I still think that there may be a way to avoid cross-validation, perhaps by extending SKO/TSO in more intelligent ways (see some interesting ideas by Eugene Dubossarsky here and here). Any comments/ideas are very welcome.


diff --git a/2015/07/06/learning-about-deep-learning-through-album-cover-classification/index.html b/2015/07/06/learning-about-deep-learning-through-album-cover-classification/index.html

    --dataset-path /path/to/dataset \
    --model-architecture AlexNet \
    --model-params lc0_num_filters=64

    There are many more command-line flags (possibly too many), which make it easy both to tinker with various settings and to run more rigorous experiments. My initial tinkering with convnets didn’t yield impressive results in terms of predictive accuracy on my dataset. It turned out that this was partly due to the lack of preprocessing – the less exciting but crucial part of any predictive modelling work.

    The importance of preprocessing

    My initial focus was on getting things to work on the dataset without worrying too much about preprocessing. I hadn’t done any image classification work before, so I had to learn about the right type of preprocessing to use. I kept it pretty simple and applied a handful of standard transformations.

    Baselines

    After building the experimental environment and a fair bit of tinkering, I decided it was time for some more serious experiments. The results of my initial games were rather disappointing – slightly better than a random baseline, which yields an accuracy score of 10%. Therefore, I ran some baselines to get an idea of what’s possible on this dataset.

    The first baseline I tried was a random forest with 1,000 trees, which yielded 15.25% accuracy. This baseline was trained directly on the pixel values without any preprocessing other than downsampling. It’s worth noting that the downsampling size didn’t make much of a difference to this baseline (I tried a few values in the range 50×50-350×350). This baseline was also not particularly sensitive to whether RGB or grayscale values were used to represent the images.
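A rough sketch of such a baseline with scikit-learn, using random synthetic "images" in place of the album covers and fewer trees to keep it quick (everything here is illustrative, not the original experiment code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the album-cover data:
# flattened grayscale images downsampled to 50x50 pixels, 10 genre labels.
rng = np.random.default_rng(0)
X = rng.random((200, 50 * 50))
y = rng.integers(0, 10, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The post used 1,000 trees; 100 keeps this sketch fast.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

On this random data the accuracy hovers around the 10% chance level; the real covers carried enough signal to reach 15.25%.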

    The next experiments were with baselines that utilised pretrained Caffe models. Training a random forest with 1,000 trees on features extracted from the highest fully-connected layer (fc7) in the CaffeNet and VGGNet-19 models yielded accuracies of 16.72% and 16.40% respectively. This was pretty disappointing, as I expected these features to perform much better. The reason may be that album covers are very different from ImageNet images, and the representations in fc7 are too specific to ImageNet. Indeed, when fine-tuning the CaffeNet model (following the procedure outlined here), I got the best accuracy on the dataset: 22.60%. Using Caffe to train the same network from scratch didn’t even get close to this accuracy. However, I didn’t try to tune Caffe’s learning parameters. Instead, I went back to running experiments with my code.

    It’s worth noting that the classes identified by the CaffeNet model often have little to do with the actual content of the image. Better baseline results may be obtained by using models that were pretrained on a richer dataset than ImageNet. The following table presents three example covers together with the top-five classes identified by the CaffeNet model for each image. The tags assigned by Clarifai’s API are also presented for comparison. From this example, it looks like Clarifai’s model is more successful at identifying the correct elements than the CaffeNet model, indicating that a baseline that uses the Clarifai tags may yield competitive performance.

    Album: October by Wille P (hiphop_rap)
    CaffeNet: digital clock, spotlight, jack-o’-lantern, volcano, traffic light
    Clarifai: tree, landscape, sunset, desert, sun, sunrise, nature, evening, sky, travel

    Album: Demo by Blackrat (metal)
    CaffeNet: spider web, barn spider, chain, bubble, fountain
    Clarifai: skull, bone, nobody, death, vector, help, horror, medicine, black and white, tattoo

    Album: The Kool-Aid Album by Mr. Merge (soul)
    CaffeNet: dishrag, paper towel, honeycomb, envelope, chain mail
    Clarifai: symbol, nobody, sign, illustration, color, flag, text, stripes, business, character

    Training from scratch

    My initial experiments were with various convnet architectures, where I manually varied the filter sizes and number of layers to have a reasonable number of parameters and ensure that the model is trainable on a GPU with 4GB of memory. As mentioned, this approach yielded unimpressive results. Following the relative success of the fine-tuned CaffeNet baseline, I decided to run more rigorous experiments on variants of AlexNet (which is very similar to CaffeNet).

    Given the large number of hyperparameters that need to be set when training deep convnets, I realised that setting values manually or via grid search is unlikely to yield the best results. To address this, I used hyperopt to search for the best configuration of values. The hyperparameters that were included in the search were the learning method (Nesterov momentum versus Adam with their respective parameters), the learning rate, whether crops are mirrored or not, the number of crops to use (1 or 5), dropout probabilities, the number of hidden units in the fully-connected layers, and the number of filters in each convolutional layer.

    Each configuration suggested by hyperopt was trained for 10 epochs, and the promising setups were trained until results stopped improving. The results of the search were rather disappointing, with the best accuracy being 17.19%. However, I learned a lot by finding hyperparameters in this manner – in the past I’ve only used a combination of manual settings with grid search.

    There are many possible reasons why the results are so poor. It could be that there’s just too little data to train a good classifier, which is supported by the inability to beat the fine-tuned results. This is in line with the results obtained by Zeiler and Fergus (2013), who found that convnets pretrained on ImageNet performed much better on the Caltech-101 and Caltech-256 datasets than the same networks trained from scratch. However, it could also be that I just didn’t run enough experiments – I definitely feel like I haven’t explored everything as well as I’d like. In addition, I’m still building my intuition for what works and why. I should work more on visualising the way the network learns to uncover more hidden gotchas in addition to those I’ve already found. Finally, it could be that it’s just too hard to distinguish between covers from the genres I chose for the study.

    Ideas for future work

    There are many avenues for improving on the work I’ve done so far. The code could definitely be made more robust and better tested, optimised and parallelised. It would be worth investing more in hyperparameter and architecture search, including incorporation of ideas from non-vanilla convnets (e.g., GoogLeNet). This search should be guided by visualisation and a deeper understanding of the trained networks, which may also come from analysing class-level accuracy (certain genres seem to be easier to distinguish than others). In addition, more sophisticated preprocessing may yield improved results.

    If the goal were to get the best possible performance on my dataset, I’d invest in establishing the human performance baseline on the dataset by running some tests with Mechanical Turk. My guess is that humans would perform better than the algorithms tested so far due to access to external knowledge. Therefore, incorporating external knowledge in the form of manual features or additional data sources may yield the most substantial performance boosts. For example, text on an album cover may contain important clues about its genre, and models pretrained on style datasets may be more suitable than ImageNet models. In addition, it may be beneficial to use a model to detect multiple elements in images where the universe is not restricted to ImageNet classes. This approach was taken by Alexandre Passant, who used Clarifai’s API to tag and classify doom metal and K-pop album covers. Finally, using several different models in an ensemble is likely to help squeeze a bit more accuracy out of the dataset.

    Another direction that may be worth exploring is using image data for recommendation work. The reason I chose to work on this problem was my exposure to album covers through my work on Bandcamp Recommender – a music recommendation system. It is well-known that visual elements influence the way users interact with recommender systems. This is especially true in Bandcamp Recommender’s case, as users see the album covers before they choose to play them. This leads me to conjecture that considering features that describe the album covers when generating recommendations would increase user interaction with the system. However, it’s hard to tell whether it’d increase the overall relevance of the results. You can’t judge an album by its cover. Or can you…?

    Conclusion

    While I’ve learned a lot from working on this project, there’s still much more to discover. It was especially great to learn some generally-applicable lessons about hyperparameter optimisation and improvements to vanilla gradient descent. Despite the many potential ways of improving performance on my dataset, my next steps in the field would probably include working on problems for which obtaining a good solution is feasible and useful. For example, I have some ideas for applications to marine creature identification.

    Feedback and suggestions are always welcome. Please feel free to contact me privately or via the comments section.

    Acknowledgement: Thanks to Brian Basham and Diogo Moitinho de Almeida for useful tips and discussions.


diff --git a/2015/10/02/the-wonderful-world-of-recommender-systems/index.html b/2015/10/02/the-wonderful-world-of-recommender-systems/index.html

      [Figure: Hynt recommendation widget]

      Hynt is a recommender-system-as-a-service for e-commerce whose development I led up until the middle of last year. The general idea is that merchants simply add a few lines of JavaScript to their shop pages and Hynt does the hard work of recommending relevant items from the store, while considering the user and page context. Going live with Hynt reaffirmed many well-known UI/UX lessons. Most notably:

      • Above the fold is better than below. Engagement with Hynt widgets that were visible without scrolling was higher than those that were lower on the page.
      • More recommendations are better than a few. Hynt widgets are responsive, adapting to the size of the container they’re placed in. Engagement was more likely when more recommendations were displayed, because users were more likely to find something they liked without scrolling through the widget.
      • Fast is better than slow. If recommendations load faster, more people see them, which increases engagement. In Hynt’s case speed was especially important because the widgets load asynchronously after the host page finishes loading.

      Another important UI/UX element is explanations. Displaying a plausible explanation next to a recommendation can do wonders, without making any changes to the underlying recommendation algorithms. The impact of explanations has been studied extensively by Nava Tintarev and Judith Masthoff. They have identified seven different aims of explanations, which are summarised in the following table (reproduced from their survey of explanations in recommender systems).

      Aim              Definition
      Transparency     Explain how the system works
      Scrutability     Allow users to tell the system it is wrong
      Trust            Increase user confidence in the system
      Effectiveness    Help users make good decisions
      Persuasiveness   Convince users to try or buy
      Efficiency       Help users make decisions faster
      Satisfaction     Increase the ease of use or enjoyment

      Explanations are ubiquitous in real-world recommender systems. For example, Amazon uses explanations like “frequently bought together”, and “customers who bought this item also bought”, while Netflix presents different lists of recommendations where each list is driven by a different reason. However, as the following Netflix example shows, it is worth making sure that the explanations you provide don’t make you look stupid.


      [Figure: Amazon frequently bought together]

      Hackers beware: Bootstrap sampling may be harmful

      Bootstrap sampling techniques are very appealing, as they don’t require knowing much about statistics and opaque formulas. Instead, all one needs to do is resample the given data many times, and calculate the desired statistics. Therefore, bootstrapping has been promoted as an easy way of modelling uncertainty to hackers who don’t have much statistical knowledge. For example, the main thesis of the excellent Statistics for Hackers talk by Jake VanderPlas is: “If you can write a for-loop, you can do statistics”. Similar ground was covered by Erik Bernhardsson in The Hacker’s Guide to Uncertainty Estimates, which provides more use cases for bootstrapping (with code examples). However, I’ve learned in the past few weeks that there are quite a few pitfalls in bootstrapping. Much of what I’ve learned is summarised in a paper titled What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum by Tim Hesterberg. I doubt that many hackers would be motivated to read a paper with such a title, so my goal with this post is to make some of my discoveries more accessible to a wider audience. To learn more about the issues raised in this post, it’s worth reading Hesterberg’s paper and other linked resources.

      For quick reference, here’s a summary of the advice in this post:

      • Use an accurate method for estimating confidence intervals
      • Use enough resamples – at least 10-15K
      • Don’t compare confidence intervals visually
      • Ensure that the basic assumptions apply to your situation

      Pitfall #1: Inaccurate confidence intervals

      Confidence intervals are a common way of quantifying the uncertainty in an estimate of a population parameter. The percentile method is one of the simplest bootstrapping approaches for generating confidence intervals. For example, let’s say we have a data sample of size n and we want to estimate a 95% confidence interval for the population mean. We take r bootstrap resamples from the original data sample, where each resample is a sample with replacement of size n. We calculate the mean of each resample and store the means in a sorted array. We then return the 95% confidence interval as the values that fall at the 0.025r and 0.975r indices of the sorted array (i.e., the 2.5% and 97.5% percentiles). The following table shows what the first two resamples may look like for a data sample of size n=5.

              Original sample   Resample #1   Resample #2
      Values  10                30            20
              12                20            20
              20                12            30
              30                12            30
              45                45            30
      Mean    23.4              23.8          26
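The percentile method described above takes only a few lines of NumPy. This is a sketch for the mean of the toy sample, not production code – and, as discussed next, the method itself is inaccurate for samples this small:

```python
import numpy as np

def percentile_ci(sample, num_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-method bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    # Draw resamples with replacement and record each resample's mean.
    means = np.sort([rng.choice(sample, size=n, replace=True).mean()
                     for _ in range(num_resamples)])
    # The interval endpoints are the alpha/2 and 1 - alpha/2 percentiles
    # of the sorted resample means.
    lower = means[int(alpha / 2 * num_resamples)]
    upper = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper

low, high = percentile_ci(np.array([10, 12, 20, 30, 45]))
```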

      The percentile method is nice and simple. Any programmer should be able to easily implement it in their favourite programming language, assuming they can actually program. Unfortunately, this method is just not accurate enough for small sample sizes. Quoting Hesterberg (emphasis mine):

      Hackers beware: Bootstrap sampling may be harmful

      Bootstrap sampling techniques are very appealing, as they don’t require knowing much about statistics and opaque formulas. Instead, all one needs to do is resample the given data many times, and calculate the desired statistics. Therefore, bootstrapping has been promoted as an easy way of modelling uncertainty to hackers who don’t have much statistical knowledge. For example, the main thesis of the excellent Statistics for Hackers talk by Jake VanderPlas is: “If you can write a for-loop, you can do statistics”. Similar ground was covered by Erik Bernhardsson in The Hacker’s Guide to Uncertainty Estimates, which provides more use cases for bootstrapping (with code examples). However, I’ve learned in the past few weeks that there are quite a few pitfalls in bootstrapping. Much of what I’ve learned is summarised in a paper titled What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum by Tim Hesterberg. I doubt that many hackers would be motivated to read a paper with such a title, so my goal with this post is to make some of my discoveries more accessible to a wider audience. To learn more about the issues raised in this post, it’s worth reading Hesterberg’s paper and other linked resources.

      For quick reference, here’s a summary of the advice in this post:

      • Use an accurate method for estimating confidence intervals
      • Use enough resamples – at least 10-15K
      • Don’t compare confidence intervals visually
      • Ensure that the basic assumptions apply to your situation

      Pitfall #1: Inaccurate confidence intervals

      Confidence intervals are a common way of quantifying the uncertainty in an estimate of a population parameter. The percentile method is one of the simplest bootstrapping approaches for generating confidence intervals. For example, let’s say we have a data sample of size n and we want to estimate a 95% confidence interval for the population mean. We take r bootstrap resamples from the original data sample, where each resample is a sample with replacement of size n. We calculate the mean of each resample and store the means in a sorted array. We then return the 95% confidence interval as the values that fall at the 0.025r and 0.975r indices of the sorted array (i.e., the 2.5% and 97.5% percentiles). The following table shows what the first two resamples may look like for a data sample of size n=5.

Original sample | Resample #1 | Resample #2
10         | 30         | 20
12         | 20         | 20
20         | 12         | 30
30         | 12         | 30
45         | 45         | 30
Mean: 23.4 | Mean: 23.8 | Mean: 26.0
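
As a concrete illustration, here’s a minimal plain-Python sketch of the percentile method described above (function and variable names are mine, not from Hesterberg’s paper):

```python
import random
import statistics

def percentile_bootstrap_ci(data, num_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    Draw num_resamples resamples with replacement (each the same size as
    data), record each resample's mean, and read the interval off the
    sorted means at the alpha/2 and 1 - alpha/2 quantiles.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(num_resamples)
    )
    lo = means[int(num_resamples * alpha / 2)]
    hi = means[int(num_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# The toy data sample of size n=5 from the table above
sample = [10, 12, 20, 30, 45]
lo, hi = percentile_bootstrap_ci(sample)
print(f"95% CI for the mean: [{lo:.1f}, {hi:.1f}]")
```

The simplicity is the point: it really is just a for-loop, a sort, and two array lookups, which is exactly why the method’s accuracy problems are so easy to overlook.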

      The percentile method is nice and simple. Any programmer should be able to easily implement it in their favourite programming language, assuming they can actually program. Unfortunately, this method is just not accurate enough for small sample sizes. Quoting Hesterberg (emphasis mine):

      The sample sizes needed for different intervals to satisfy the “reasonably accurate” (off by no more than 10% on each side) criterion are: n ≥ 101 for the bootstrap t, 220 for the skewness-adjusted t statistic, 2,235 for expanded percentile, 2,383 for percentile, 4,815 for ordinary t (which I have rounded up to 5,000 above), 5,063 for t with bootstrap standard errors and something over 8,000 for the reverse percentile method.

      In a shorter version of the paper cited above, Hesterberg concludes that:

      In practice, implementing some of the more accurate bootstrap methods is difficult (especially those not described here), and people should use a package rather than attempt this themselves.

      In short, make sure you’re using an accurate method for estimating confidence intervals when dealing with sample sizes of less than a few thousand values. Using a package is a great idea, but unfortunately I don’t know of any Python bootstrapping package that is feature-complete: ARCH and scikits-bootstrap support advanced confidence interval methods but don’t support analysis of two samples of uneven sizes, while bootstrapped works with samples of uneven sizes but only supports the percentile and the reverse percentile method (which Hesterberg found to be even less accurate). If you know of any better Python packages, please let me know! (I don’t use R, but I suspect the situation is better there). Update: ARCH now supports analysis of samples of uneven sizes following an issue I reported. It seems to be the best Python bootstrapping package, so I recommend using it.

      Pitfall #2: Not enough resamples

      Accurate bootstrap estimates require a large number of resamples. Many code snippets use 1,000 resamples, probably because it looks like a large number. However, seeming large isn’t enough. Quoting Hesterberg again:

      For both the bootstrap and permutation tests, the number of resamples needs to be 15,000 or more, for 95% probability that simulation-based one-sided levels fall within 10% of the true values, for 95% intervals and 5% tests. I recommend r = 10,000 for routine use, and more when accuracy matters.

      […]

      We want decisions to depend on the data, not random variation in the Monte Carlo implementation. We used r = 500,000 in the Verizon project.

That’s right, half a million resamples! Accuracy mattered in the Verizon case, as the results of the analysis determined whether large penalties were paid or not. In short, use at least 10,000 to 15,000 resamples to be safe. Don’t use 1,000.
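
To get a feel for the Monte Carlo noise Hesterberg describes, one can rerun the same percentile estimate with different random seeds and watch how much the interval’s upper endpoint wanders for a small versus a large r. A rough sketch (the specific r values, seed counts, and function names are my own arbitrary choices):

```python
import random
import statistics

def upper_endpoint(data, num_resamples, seed):
    """Upper end of a 95% percentile bootstrap CI for the mean of data."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(num_resamples)
    )
    return means[int(num_resamples * 0.975) - 1]

sample = [10, 12, 20, 30, 45]

# How much does the reported endpoint move between reruns that differ only
# in their random seed? Compare the spread for a small r and a large r.
spread_small = statistics.stdev(upper_endpoint(sample, 100, seed) for seed in range(30))
spread_large = statistics.stdev(upper_endpoint(sample, 10_000, seed) for seed in range(30))
print(f"Endpoint spread across reruns: r=100: {spread_small:.2f}, r=10,000: {spread_large:.2f}")
```

The endpoint jitters far more between reruns at r=100 than at r=10,000, so with few resamples, two analysts running the same code on the same data can reach different conclusions purely due to simulation noise.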

      Pitfall #3: Comparison of single-sample confidence intervals

Confidence intervals are commonly used to decide if the difference between two samples is statistically significant. Bootstrapping provides a straightforward way of estimating confidence intervals without making assumptions about the way the data was generated. For example, given two samples, we can obtain confidence intervals for the mean of each sample and plot the two intervals side by side. The temptation is to conclude that the difference isn’t significant whenever the intervals overlap, but this is a mistake: two single-sample intervals can overlap even when the difference between the samples is statistically significant. The sounder approach is to estimate a confidence interval for the difference itself.
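
Rather than eyeballing two separate intervals, the difference in means can be bootstrapped directly: resample each group independently (preserving each group’s size, so uneven sizes are fine) and take percentiles of the resampled differences. A sketch with made-up data (the names and values are mine, and the percentile method’s accuracy caveats from Pitfall #1 still apply):

```python
import random
import statistics

def bootstrap_diff_ci(sample_a, sample_b, num_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(sample_a) - mean(sample_b).

    Each group is resampled independently and keeps its own size, so the
    two samples don't need to be the same length.
    """
    rng = random.Random(seed)
    diffs = sorted(
        statistics.fmean(rng.choices(sample_a, k=len(sample_a)))
        - statistics.fmean(rng.choices(sample_b, k=len(sample_b)))
        for _ in range(num_resamples)
    )
    lo = diffs[int(num_resamples * alpha / 2)]
    hi = diffs[int(num_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical measurements from a control and a treatment group
control = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 11.9, 10.8]
treatment = [13.5, 12.2, 14.1, 11.8, 12.9, 13.7]
lo, hi = bootstrap_diff_ci(treatment, control)
print(f"95% CI for the difference in means: [{lo:.2f}, {hi:.2f}]")
```

If the resulting interval excludes zero, the difference is significant at the chosen level; no visual comparison of per-sample intervals is needed.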

      Plumbing, Decisions, and Automation: De-hyping Data & AI

[Image: contrasting an amateur and a professional otter; the amateur asks about tools, the professional asks about plumbing, decisions, and automation]

      Data & AI health is hard to define. Recently, it occurred to me that its essence can be distilled with three questions:

      1. Plumbing: What’s the state of your data engineering lifecycles?
      2. Decisions: How do you use descriptive, predictive, and causal modelling to support decisions?
      3. Automation: How do you use AI to automate processes?

      These questions help identify gaps and opportunities. While each question focuses on the present state, it’s natural to follow up with plans for a brighter future.

      In practice, you would go deep on each area. Each question is a door that leads to a corridor with many more doors.

      Amateurs versus professionals

      If you’ve ever worked with data, you’d have a sense of what amateur and professional answers to the above questions may look like. In practice, answers are multifaceted and fall on a continuum. But here are some simplified examples from each end of the continuum:

• Plumbing – Amateur: rudimentary pipelines, manually-populated spreadsheets. Professional: all necessary data is trustworthy and available on tap.
• Decisions – Amateur: relying on one-off charts and models, along with the intuition of HiPPOs (highest-paid persons’ opinions). Professional: relying on relevant data and modelling efforts that are proportional to the gravity of each decision.
• Automation – Amateur: superficial use of off-the-shelf tools. Professional: deep, mindful integration of tech to replace manual work where it delivers the most value.

      Going down the rabbit hole

      The three areas pretty much define my career, but there is always much more to learn. The main message of this post is that little has changed since Harrington Emerson uttered these words in 1911:

      As to methods, there may be a million and then some, but principles are few. The person who grasps principles can successfully select their own methods. The person who tries methods, ignoring principles, is sure to have trouble.

      (OK, one thing did change – Emerson used man rather than person, but I fixed it for him.)

      You can explore further with these posts:

      1. Plumbing: Fully understanding the data engineering lifecycle is more important than mastering a single tool.
      2. Decisions: According to my 2018 definition, this is what data science is all about. There’s endless depth to building descriptive, predictive, and causal models. But the key to rising above tool hype is understanding the why of data science, which is to support decisions.
      3. Automation: The term AI is around peak hype right now. This makes it easy for cynics to dismiss the over-excited claims of AI proponents. Avoid cynicism – simply think of AI as automation and understand that relentless but mindful automation is key to success in our world.

      More questions to probe the Data-to-AI health of startups

      This post is a slight detour from the series on my Data-to-AI Health Check for Startups. I figured it’s a valuable detour since I now see the triad of Plumbing, Decisions, and Automation as the essence of Data & AI health for any organisation.

      Previous posts in the series:

      You can download a guide containing all the questions as a PDF. I’m still planning to cover Processes & Project Management next – hopefully I won’t get detoured again. Feedback is always welcome!

        Causal inference resources

        This is a list of some causal inference resources, which I update from time to time. You can also check out my posts on causal inference and A/B testing.

Books:

• Causal Inference: What if by Miguel Hernán and Jamie Robins: The most practical book I’ve read. Highly recommended.
• Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing by Ron Kohavi, Diane Tang, and Ya Xu: Building on the authors’ decades of industry experience, this is pretty much the bible of online experiments, which is how causal inference is often done in practice.
• Why: A Guide to Finding and Using Causes by Samantha Kleinberg: A high-level intro to the topic. I discussed highlights in Why you should stop worrying about deep learning and deepen your understanding of causality instead.
• Causality, Probability, and Time by Samantha Kleinberg: More technical than Kleinberg’s other book. As the title suggests, the element of time is central to the methods presented in the book. However, I’m still unsure about the practicality of those methods on real data. See my post Diving deeper into causality: Pearl, Kleinberg, Hill, and untested assumptions for more details.
• Causal Inference in Statistics: A Primer by Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell: A fairly accessible introduction to Judea Pearl’s work. I didn’t find it that practical, but I believe it helped me understand the graphical modelling parts of Causal Inference by Hernán and Robins.
• Elements of Causal Inference: Foundations and Learning Algorithms by Jonas Peters, Dominik Janzing, and Bernhard Schölkopf: The name of the book is an obvious reference to the classic book The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unfortunately, the Elements of Causal Inference isn’t as widely applicable as Hastie et al.’s book – it contains some interesting ideas, but it appears that algorithms for causal learning from data with minimal assumptions aren’t yet scalable enough for practical use. This will probably change in the future.
• Mostly Harmless Econometrics by Joshua D. Angrist and Jörn-Steffen Pischke: I started reading this book on my Kindle and was put off by some formatting issues. It also seemed like a less-general version of Pearl’s work. I may get back to it one day.
• Causality: Models, Reasoning, and Inference by Judea Pearl: I haven’t read it, and I doubt it’d be very practical given the opinions of people who have. But maybe I’ll get to it one day.
• The Book of Why: The New Science of Cause and Effect by Judea Pearl and Dana Mackenzie: An accessible overview of the field, focusing on Pearl’s contributions, but with plenty of historical background. Worth reading to get excited about the causal revolution.
• Causal Machine Learning by Robert Osazuwa Ness: Still a draft as of September 2022, but it looks promising.

Articles:

Courses:


          Deep learning resources

          This page summarises the deep learning resources I’ve consulted in my album cover classification project.

Tutorials and blog posts

• Convolutional Neural Networks for Visual Recognition Stanford course notes: an excellent resource, very up-to-date and useful, despite still being a work in progress
• DeepLearning.net’s Theano-based tutorials: not as up-to-date as the Stanford course notes, but still a good introduction to some of the theory and general Theano usage
• Lasagne’s documentation and tutorials: still a bit lacking, but good when you know what you’re looking for
• lasagne4newbs: Lasagne’s convnet example with richer comments
• Using convolutional neural nets to detect facial keypoints tutorial: the resource that made me want to use Lasagne
• Classifying plankton with deep neural networks: an epic post, which I found while looking for Lasagne examples
• Various Wikipedia pages: a bit disappointing – the above resources are much better

Papers

• Adam: a method for stochastic optimization (Kingma and Ba, 2015): an improvement over SGD with Nesterov momentum, AdaGrad and RMSProp, which I found to be useful in practice
• Algorithms for Hyper-Parameter Optimization (Bergstra et al., 2011): the work behind Hyperopt – pretty useful stuff, not only for deep learning
• Convolutional Neural Networks at Constrained Time Cost (He and Sun, 2014): interesting experimental work on the tradeoffs between number of filters, filter sizes, and depth – deeper is better (but with diminishing returns); smaller filter sizes are better; delayed subsampling and spatial pyramid pooling are helpful
• Deep Learning in Neural Networks: An Overview (Schmidhuber, 2014): 88 pages and 888 references (35 content pages) – good for finding references, but a bit hard to follow; not so good for understanding how the various methods work and how to use or implement them
• Going deeper with convolutions (Szegedy et al., 2014): the GoogLeNet paper – interesting and compelling results, especially given the improvement in performance while reducing computational complexity
• ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., 2012): the classic paper that arguably started (or significantly boosted) the recent buzz around deep learning – many interesting ideas; fairly accessible
• On the importance of initialization and momentum in deep learning (Sutskever et al., 2013): applying Nesterov momentum to deep learning – good read, simple concept, interesting results
• Random Search for Hyper-Parameter Optimization (Bergstra and Bengio, 2012): very compelling reasoning and experiments showing that random search outperforms grid search in many cases
• Recognizing Image Style (Karayev et al., 2014): identifying image style, which is similar to album genre – found that using models pretrained on ImageNet yielded the best results in some cases
• Very deep convolutional networks for large-scale image recognition (Simonyan and Zisserman, 2014): the VGGNet paper – interesting experiments and architectures – deep and homogeneous
• Visualizing and Understanding Convolutional Networks (Zeiler and Fergus, 2013): interesting work on visualisation, but I’ll need to apply it to understand it better

parasocial relationshipshttps://yanirseroussi.com/til/2024/01/08/the-power-of-parasocial-relationships/Mon, 08 Jan 2024 06:00:00 +0000https://yanirseroussi.com/til/2024/01/08/the-power-of-parasocial-relationships/Repeated exposure to media personas creates relationships that help justify premium fees.Positioning is a common problem for data scientistshttps://yanirseroussi.com/til/2023/12/18/positioning-is-a-common-problem-for-data-scientists/Mon, 18 Dec 2023 00:30:00 +0000https://yanirseroussi.com/til/2023/12/18/positioning-is-a-common-problem-for-data-scientists/With the commodification of data scientists, the problem of positioning has become more common: My takeaways from Genevieve Hayes interviewing Jonathan Stark.Transfer learning applies to energy market biddinghttps://yanirseroussi.com/til/2023/12/14/transfer-learning-applies-to-energy-market-bidding/Thu, 14 Dec 2023 00:15:00 +0000https://yanirseroussi.com/til/2023/12/14/transfer-learning-applies-to-energy-market-bidding/An interesting approach to bidding of energy storage assets, showing that training on New York data is transferable to Queensland.Supporting volunteer monitoring of marine biodiversity with modern web and data toolshttps://yanirseroussi.com/2023/11/29/supporting-volunteer-monitoring-of-marine-biodiversity-with-modern-web-and-data-tools/Wed, 29 Nov 2023 02:00:00 +0000https://yanirseroussi.com/2023/11/29/supporting-volunteer-monitoring-of-marine-biodiversity-with-modern-web-and-data-tools/Summarising the work Uri Seroussi and I did to improve Reef Life Survey&rsquo;s Reef Species of the World app.Our Blue Machine is changing, but we are not helplesshttps://yanirseroussi.com/til/2023/11/28/our-blue-machine-is-changing-but-we-are-not-helpless/Tue, 28 Nov 2023 06:40:00 +0000https://yanirseroussi.com/til/2023/11/28/our-blue-machine-is-changing-but-we-are-not-helpless/One of my many highlights from Helen Czerski&rsquo;s Blue Machine.You don't need a proprietary API for static 
mapshttps://yanirseroussi.com/til/2023/11/21/you-dont-need-a-proprietary-api-for-static-maps/Tue, 21 Nov 2023 06:00:00 +0000https://yanirseroussi.com/til/2023/11/21/you-dont-need-a-proprietary-api-for-static-maps/For many use cases, libraries like cartopy are better than the likes of Mapbox and Google Maps.Lessons from reluctant data engineeringhttps://yanirseroussi.com/2023/10/25/lessons-from-reluctant-data-engineering/Wed, 25 Oct 2023 04:45:00 +0000https://yanirseroussi.com/2023/10/25/lessons-from-reluctant-data-engineering/Video and summary of a talk I gave at DataEngBytes Brisbane on what I learned from doing data engineering as part of every data science role I had.Artificial intelligence was a marketing term all along – just call it automationhttps://yanirseroussi.com/til/2023/10/06/artificial-intelligence-was-a-marketing-term-all-along-just-call-it-automation/Fri, 06 Oct 2023 05:00:00 +0000https://yanirseroussi.com/til/2023/10/06/artificial-intelligence-was-a-marketing-term-all-along-just-call-it-automation/Replacing &lsquo;artificial intelligence&rsquo; with &lsquo;automation&rsquo; is a useful trick for cutting through the hype.The lines between solo consulting and product building are blurryhttps://yanirseroussi.com/til/2023/09/25/the-lines-between-solo-consulting-and-product-building-are-blurry/Mon, 25 Sep 2023 00:00:00 +0000https://yanirseroussi.com/til/2023/09/25/the-lines-between-solo-consulting-and-product-building-are-blurry/It turns out that problems like finding a niche and defining the ideal clients are key to any solo business.Google's Rules of Machine Learning still apply in the age of large language modelshttps://yanirseroussi.com/til/2023/09/21/googles-rules-of-machine-learning-still-apply-in-the-age-of-large-language-models/Thu, 21 Sep 2023 21:30:00 +0000https://yanirseroussi.com/til/2023/09/21/googles-rules-of-machine-learning-still-apply-in-the-age-of-large-language-models/Despite the excitement around large language models, building with 
machine learning remains an engineering problem with established best practices.My rediscovery of quiet writing on the open webhttps://yanirseroussi.com/2023/08/28/my-rediscovery-of-quiet-writing-on-the-open-web/Mon, 28 Aug 2023 05:30:00 +0000https://yanirseroussi.com/2023/08/28/my-rediscovery-of-quiet-writing-on-the-open-web/Reflections on publishing on this website: Writing publicly to share thoughts and documentation beats chasing views and likes.The Minimalist Entrepreneur is too prescriptive for mehttps://yanirseroussi.com/til/2023/08/21/the-minimalist-entrepreneur-is-too-prescriptive-for-me/Mon, 21 Aug 2023 03:15:00 +0000https://yanirseroussi.com/til/2023/08/21/the-minimalist-entrepreneur-is-too-prescriptive-for-me/While I found the story of Gumroad interesting, The Minimalist Entrepreneur seems to over-generalise from the founder&rsquo;s experience.Revisiting Start Small, Stay Small in 2023 (Chapter 2)https://yanirseroussi.com/til/2023/08/17/revisiting-start-small-stay-small-in-2023-chapter-2/Thu, 17 Aug 2023 07:45:00 +0000https://yanirseroussi.com/til/2023/08/17/revisiting-start-small-stay-small-in-2023-chapter-2/A summary of the second chapter of Rob Walling&rsquo;s Start Small, Stay Small, along with my thoughts &amp; reflections.Revisiting Start Small, Stay Small in 2023 (Chapter 1)https://yanirseroussi.com/til/2023/08/16/revisiting-start-small-stay-small-in-2023-chapter-1/Wed, 16 Aug 2023 05:45:00 +0000https://yanirseroussi.com/til/2023/08/16/revisiting-start-small-stay-small-in-2023-chapter-1/A summary of the first chapter of Rob Walling&rsquo;s Start Small, Stay Small, along with my thoughts &amp; reflections.Email notifications on public GitHub commitshttps://yanirseroussi.com/til/2023/08/14/email-notifications-on-public-github-commits/Mon, 14 Aug 2023 05:15:00 +0000https://yanirseroussi.com/til/2023/08/14/email-notifications-on-public-github-commits/GitHub publishes an Atom feed, which means you can use any RSS reader to follow commits.The rule of 
thirds can probably be ignoredhttps://yanirseroussi.com/til/2023/08/11/the-rule-of-thirds-can-probably-be-ignored/Fri, 11 Aug 2023 03:15:00 +0000https://yanirseroussi.com/til/2023/08/11/the-rule-of-thirds-can-probably-be-ignored/Turns out that the rule of thirds for composing visuals may not be that important.Using YubiKey for SSH accesshttps://yanirseroussi.com/til/2023/07/23/using-yubikey-for-ssh-access/Sun, 23 Jul 2023 00:07:15 +0000https://yanirseroussi.com/til/2023/07/23/using-yubikey-for-ssh-access/Some pointers for setting up SSH access with YubiKey on Ubuntu 22.04.Making a TIL section with Hugo and PaperModhttps://yanirseroussi.com/til/2023/07/17/making-a-til-section-with-hugo-and-papermod/Mon, 17 Jul 2023 00:06:15 +0000https://yanirseroussi.com/til/2023/07/17/making-a-til-section-with-hugo-and-papermod/How I added a Today I Learned section to my Hugo site with the PaperMod theme.You can't save timehttps://yanirseroussi.com/til/2023/07/11/you-cant-save-time/Tue, 11 Jul 2023 00:00:00 +0000https://yanirseroussi.com/til/2023/07/11/you-cant-save-time/Time can be spent doing different activities, but it can&rsquo;t be stored and saved for later.Was data science a failure mode of software engineering?https://yanirseroussi.com/2023/06/30/was-data-science-a-failure-mode-of-software-engineering/Fri, 30 Jun 2023 00:06:30 +0000https://yanirseroussi.com/2023/06/30/was-data-science-a-failure-mode-of-software-engineering/Yes, data science projects have suffered from classic software engineering mistakes, but the field is maturing with the rise of new engineering roles.How hackable are automated coding assessments?https://yanirseroussi.com/2023/05/26/how-hackable-are-automated-coding-assessments/Fri, 26 May 2023 00:03:00 +0000https://yanirseroussi.com/2023/05/26/how-hackable-are-automated-coding-assessments/Exploring the hackability of speed-based coding tests, using CodeSignal&rsquo;s Industry Coding Framework as a case study.Remaining relevant as a small language 
modelhttps://yanirseroussi.com/2023/04/21/remaining-relevant-as-a-small-language-model/Fri, 21 Apr 2023 00:06:30 +0000https://yanirseroussi.com/2023/04/21/remaining-relevant-as-a-small-language-model/Bing Chat recently quipped that humans are small language models. Here are some of my thoughts on how we small language models can remain relevant (for now).ChatGPT is transformative AIhttps://yanirseroussi.com/2022/12/11/chatgpt-is-transformative-ai/Sun, 11 Dec 2022 00:00:00 +0000https://yanirseroussi.com/2022/12/11/chatgpt-is-transformative-ai/My perspective after a week of using ChatGPT: This is a step change in finding distilled information, and it&rsquo;s only the beginning.Causal Machine Learning is off to a good start, despite some issueshttps://yanirseroussi.com/2022/09/12/causal-machine-learning-book-draft-review/Mon, 12 Sep 2022 02:45:00 +0000https://yanirseroussi.com/2022/09/12/causal-machine-learning-book-draft-review/Reviewing the first three chapters of the book Causal Machine Learning by Robert Osazuwa Ness.The mission matters: Moving to climate tech as a data scientisthttps://yanirseroussi.com/2022/06/06/the-mission-matters-moving-to-climate-tech-as-a-data-scientist/Mon, 06 Jun 2022 00:00:00 +0000https://yanirseroussi.com/2022/06/06/the-mission-matters-moving-to-climate-tech-as-a-data-scientist/Discussing my recent career move into climate tech as a way of doing more to help mitigate dangerous climate change.Building useful machine learning tools keeps getting easier: A fish ID case studyhttps://yanirseroussi.com/2022/03/20/building-useful-machine-learning-tools-keeps-getting-easier-a-fish-id-case-study/Sun, 20 Mar 2022 04:30:00 +0000https://yanirseroussi.com/2022/03/20/building-useful-machine-learning-tools-keeps-getting-easier-a-fish-id-case-study/Lessons learned building a fish ID web app with fast.ai and Streamlit, in an attempt to reduce my fear of missing out on the latest deep learning developments.Analysis strategies in online A/B experiments: 
Intention-to-treat, per-protocol, and other lessons from clinical trialshttps://yanirseroussi.com/2022/01/14/analysis-strategies-in-online-a-b-experiments/Fri, 14 Jan 2022 00:05:40 +0000https://yanirseroussi.com/2022/01/14/analysis-strategies-in-online-a-b-experiments/Epidemiologists analyse clinical trials to estimate the intention-to-treat and per-protocol effects. This post applies their strategies to online experiments.Use your human brain to avoid artificial intelligence disastershttps://yanirseroussi.com/2021/11/22/use-your-human-brain-to-avoid-artificial-intelligence-disasters/Mon, 22 Nov 2021 03:45:00 +0000https://yanirseroussi.com/2021/11/22/use-your-human-brain-to-avoid-artificial-intelligence-disasters/Overview of a talk I gave at a deep learning course, focusing on AI ethics as the need for humans to think about the context and consequences of applying AI.Migrating from WordPress.com to Hugo on GitHub + Cloudflarehttps://yanirseroussi.com/2021/11/10/migrating-from-wordpress-com-to-hugo-on-github-cloudflare/Wed, 10 Nov 2021 06:30:00 +0000https://yanirseroussi.com/2021/11/10/migrating-from-wordpress-com-to-hugo-on-github-cloudflare/My reasons for switching from WordPress.com to Hugo on GitHub + Cloudflare, along with a summary of the solution components and migration process.My work with Automattichttps://yanirseroussi.com/2021/10/07/my-work-with-automattic/Thu, 07 Oct 2021 00:00:00 +0000https://yanirseroussi.com/2021/10/07/my-work-with-automattic/Back-dated meta-post that gathers my posts on Automattic blogs into a summary of the work I&rsquo;ve done with the company.Some highlights from 2020https://yanirseroussi.com/2021/04/05/some-highlights-from-2020/Mon, 05 Apr 2021 06:41:48 +0000https://yanirseroussi.com/2021/04/05/some-highlights-from-2020/Sharing remote teamwork insights, my climate &amp; sustainability activism, Reef Life Survey publications, and progress on Automattic&rsquo;s Experimentation Platform.Many is not enough: Counting simulations to 
bootstrap the right wayhttps://yanirseroussi.com/2020/08/24/many-is-not-enough-counting-simulations-to-bootstrap-the-right-way/Mon, 24 Aug 2020 01:35:17 +0000https://yanirseroussi.com/2020/08/24/many-is-not-enough-counting-simulations-to-bootstrap-the-right-way/Going deeper into correct testing of different methods for bootstrap estimation of confidence intervals.Software commodities are eating interesting data science workhttps://yanirseroussi.com/2020/01/11/software-commodities-are-eating-interesting-data-science-work/Sat, 11 Jan 2020 09:22:35 +0000https://yanirseroussi.com/2020/01/11/software-commodities-are-eating-interesting-data-science-work/Being a data scientist can sometimes feel like a race against software commodities that replace interesting work. What can one do to remain relevant?A day in the life of a remote data scientisthttps://yanirseroussi.com/2019/12/12/a-day-in-the-life-of-a-remote-data-scientist/Wed, 11 Dec 2019 22:06:19 +0000https://yanirseroussi.com/2019/12/12/a-day-in-the-life-of-a-remote-data-scientist/Video of a talk I gave on remote data science work at the Data Science Sydney meetup.Bootstrapping the right way?https://yanirseroussi.com/2019/10/06/bootstrapping-the-right-way/Sun, 06 Oct 2019 06:48:07 +0000https://yanirseroussi.com/2019/10/06/bootstrapping-the-right-way/Video and summary of a talk I gave at YOW! Data on bootstrap estimation of confidence intervals.Hackers beware: Bootstrap sampling may be harmfulhttps://yanirseroussi.com/2019/01/08/hackers-beware-bootstrap-sampling-may-be-harmful/Mon, 07 Jan 2019 21:07:56 +0000https://yanirseroussi.com/2019/01/08/hackers-beware-bootstrap-sampling-may-be-harmful/Bootstrap sampling has been promoted as an easy way of modelling uncertainty to hackers without much statistical knowledge. 
But things aren&rsquo;t that simple.The most practical causal inference book I’ve read (is still a draft)https://yanirseroussi.com/2018/12/24/the-most-practical-causal-inference-book-ive-read-is-still-a-draft/Mon, 24 Dec 2018 02:37:50 +0000https://yanirseroussi.com/2018/12/24/the-most-practical-causal-inference-book-ive-read-is-still-a-draft/Causal Inference by Miguel Hernán and Jamie Robins is a must-read for anyone interested in the area.Reflections on remote data science workhttps://yanirseroussi.com/2018/11/03/reflections-on-remote-data-science-work/Sat, 03 Nov 2018 06:33:13 +0000https://yanirseroussi.com/2018/11/03/reflections-on-remote-data-science-work/Discussing the pluses and minuses of remote work eighteen months after joining Automattic as a data scientist.Defining data science in 2018https://yanirseroussi.com/2018/07/22/defining-data-science-in-2018/Sun, 22 Jul 2018 08:27:43 +0000https://yanirseroussi.com/2018/07/22/defining-data-science-in-2018/Updating my definition of data science to match changes in the field. 
It is now broader than before, but its ultimate goal is still to support decisions.Advice for aspiring data scientists and other FAQshttps://yanirseroussi.com/2017/10/15/advice-for-aspiring-data-scientists-and-other-faqs/Sun, 15 Oct 2017 09:15:25 +0000https://yanirseroussi.com/2017/10/15/advice-for-aspiring-data-scientists-and-other-faqs/Frequently asked questions by visitors to this site, especially around entering the data science field.State of Bandcamp Recommender, Late 2017https://yanirseroussi.com/2017/09/02/state-of-bandcamp-recommender/Sat, 02 Sep 2017 10:19:02 +0000https://yanirseroussi.com/2017/09/02/state-of-bandcamp-recommender/Call for BCRecommender maintainers followed by a decision to shut it down, as I don&rsquo;t have enough time and Bandcamp now offers recommendations.My 10-step path to becoming a remote data scientist with Automattichttps://yanirseroussi.com/2017/07/29/my-10-step-path-to-becoming-a-remote-data-scientist-with-automattic/Sat, 29 Jul 2017 05:39:26 +0000https://yanirseroussi.com/2017/07/29/my-10-step-path-to-becoming-a-remote-data-scientist-with-automattic/I wanted a well-paid data science-y remote job with an established company that offers a good life balance and makes products I care about. 
I got it eventually.Exploring and visualising Reef Life Survey datahttps://yanirseroussi.com/2017/06/03/exploring-and-visualising-reef-life-survey-data/Sat, 03 Jun 2017 00:49:05 +0000https://yanirseroussi.com/2017/06/03/exploring-and-visualising-reef-life-survey-data/Web tools I built to visualise Reef Life Survey data and assist citizen scientists in underwater visual census work.Customer lifetime value and the proliferation of misinformation on the internethttps://yanirseroussi.com/2017/01/08/customer-lifetime-value-and-the-proliferation-of-misinformation-on-the-internet/Sun, 08 Jan 2017 20:02:30 +0000https://yanirseroussi.com/2017/01/08/customer-lifetime-value-and-the-proliferation-of-misinformation-on-the-internet/There&rsquo;s a lot of misleading content on the estimation of customer lifetime value. Here&rsquo;s what I learned about doing it well.Ask Why! Finding motives, causes, and purpose in data sciencehttps://yanirseroussi.com/2016/09/19/ask-why-finding-motives-causes-and-purpose-in-data-science/Mon, 19 Sep 2016 21:28:44 +0000https://yanirseroussi.com/2016/09/19/ask-why-finding-motives-causes-and-purpose-in-data-science/Video and summary of a talk I gave at the Data Science Sydney meetup, about going beyond the what &amp; how of predictive modelling.If you don’t pay attention, data can drive you off a cliffhttps://yanirseroussi.com/2016/08/21/seven-ways-to-be-data-driven-off-a-cliff/Sun, 21 Aug 2016 21:34:17 +0000https://yanirseroussi.com/2016/08/21/seven-ways-to-be-data-driven-off-a-cliff/Seven common mistakes to avoid when working with data, such as ignoring uncertainty and confusing observed and unobserved quantities.Is Data Scientist a useless job title?https://yanirseroussi.com/2016/08/04/is-data-scientist-a-useless-job-title/Thu, 04 Aug 2016 22:26:03 +0000https://yanirseroussi.com/2016/08/04/is-data-scientist-a-useless-job-title/It seems like anyone who touches data can call themselves a data scientist, which makes the title useless. 
The work they do can still be useful, though.Making Bayesian A/B testing more accessiblehttps://yanirseroussi.com/2016/06/19/making-bayesian-ab-testing-more-accessible/Sun, 19 Jun 2016 10:32:15 +0000https://yanirseroussi.com/2016/06/19/making-bayesian-ab-testing-more-accessible/A web tool I built to interpret A/B test results in a Bayesian way, including prior specification, visualisations, and decision rules.Diving deeper into causality: Pearl, Kleinberg, Hill, and untested assumptionshttps://yanirseroussi.com/2016/05/15/diving-deeper-into-causality-pearl-kleinberg-hill-and-untested-assumptions/Sat, 14 May 2016 19:57:03 +0000https://yanirseroussi.com/2016/05/15/diving-deeper-into-causality-pearl-kleinberg-hill-and-untested-assumptions/Discussing the need for untested assumptions and temporality in causal inference. Mostly based on Samantha Kleinberg&rsquo;s Causality, Probability, and Time.The rise of greedy robotshttps://yanirseroussi.com/2016/03/20/the-rise-of-greedy-robots/Sun, 20 Mar 2016 20:33:43 +0000https://yanirseroussi.com/2016/03/20/the-rise-of-greedy-robots/Is artificial/machine intelligence a future threat? 
I argue that it&rsquo;s already here, with greedy robots already dominating our lives.Why you should stop worrying about deep learning and deepen your understanding of causality insteadhttps://yanirseroussi.com/2016/02/14/why-you-should-stop-worrying-about-deep-learning-and-deepen-your-understanding-of-causality-instead/Sun, 14 Feb 2016 11:04:11 +0000https://yanirseroussi.com/2016/02/14/why-you-should-stop-worrying-about-deep-learning-and-deepen-your-understanding-of-causality-instead/Causality is often overlooked but is of much higher relevance to most data scientists than deep learning.The joys of offline data collectionhttps://yanirseroussi.com/2016/01/24/the-joys-of-offline-data-collection/Sun, 24 Jan 2016 00:32:25 +0000https://yanirseroussi.com/2016/01/24/the-joys-of-offline-data-collection/Insights on data collection and machine learning from spending a month sailing, diving, and counting fish with Reef Life Survey.This holiday season, give me real insightshttps://yanirseroussi.com/2015/12/08/this-holiday-season-give-me-real-insights/Tue, 08 Dec 2015 06:57:25 +0000https://yanirseroussi.com/2015/12/08/this-holiday-season-give-me-real-insights/Some companies present raw data or information as &ldquo;insights&rdquo;. 
This post surveys some examples, and discusses how they can be turned into real insights.The hardest parts of data sciencehttps://yanirseroussi.com/2015/11/23/the-hardest-parts-of-data-science/Mon, 23 Nov 2015 04:14:21 +0000https://yanirseroussi.com/2015/11/23/the-hardest-parts-of-data-science/Defining feasible problems and coming up with reasonable ways of measuring solutions is harder than building accurate models or obtaining clean data.Migrating a simple web application from MongoDB to Elasticsearchhttps://yanirseroussi.com/2015/11/04/migrating-a-simple-web-application-from-mongodb-to-elasticsearch/Wed, 04 Nov 2015 03:53:18 +0000https://yanirseroussi.com/2015/11/04/migrating-a-simple-web-application-from-mongodb-to-elasticsearch/Migrating BCRecommender from MongoDB to Elasticsearch made it possible to offer a richer search experience to users at a similar cost, among other benefits.Miscommunicating science: Simplistic models, nutritionism, and the art of storytellinghttps://yanirseroussi.com/2015/10/19/nutritionism-and-the-need-for-complex-models-to-explain-complex-phenomena/Mon, 19 Oct 2015 00:02:32 +0000https://yanirseroussi.com/2015/10/19/nutritionism-and-the-need-for-complex-models-to-explain-complex-phenomena/Nutritionism is a special case of misinterpretation and miscommunication of scientific results – something many data scientists encounter in their work.The wonderful world of recommender systemshttps://yanirseroussi.com/2015/10/02/the-wonderful-world-of-recommender-systems/Fri, 02 Oct 2015 05:25:57 +0000https://yanirseroussi.com/2015/10/02/the-wonderful-world-of-recommender-systems/Giving an overview of the field and common paradigms, and debunking five common myths about recommender systems.You don’t need a data scientist (yet)https://yanirseroussi.com/2015/08/24/you-dont-need-a-data-scientist-yet/Mon, 24 Aug 2015 08:25:30 +0000https://yanirseroussi.com/2015/08/24/you-dont-need-a-data-scientist-yet/Hiring data scientists prematurely is wasteful and 
frustrating. Here are some questions to ask before you hire your first data scientist.Goodbye, Parse.comhttps://yanirseroussi.com/2015/07/31/goodbye-parse-com/Fri, 31 Jul 2015 03:29:50 +0000https://yanirseroussi.com/2015/07/31/goodbye-parse-com/Migrating my web apps away from Parse.com due to reliability issues. Self-hosting is a better solution.Learning about deep learning through album cover classificationhttps://yanirseroussi.com/2015/07/06/learning-about-deep-learning-through-album-cover-classification/Mon, 06 Jul 2015 22:21:42 +0000https://yanirseroussi.com/2015/07/06/learning-about-deep-learning-through-album-cover-classification/Progress on my album cover classification project, highlighting lessons that would be useful to others who are getting started with deep learning.Deep learning resourceshttps://yanirseroussi.com/deep-learning-resources/Mon, 06 Jul 2015 00:38:44 +0000https://yanirseroussi.com/deep-learning-resources/This page summarises the deep learning resources I&rsquo;ve consulted in my album cover classification project. 
Tutorials and blog posts Convolutional Neural Networks for Visual Recognition Stanford course notes: an excellent resource, very up-to-date and useful, despite still being a work in progress DeepLearning.net&rsquo;s Theano-based tutorials: not as up-to-date as the Stanford course notes, but still a good introduction to some of the theory and general Theano usage Lasagne&rsquo;s documentation and tutorials: still a bit lacking, but good when you know what you&rsquo;re looking for lasagne4newbs: Lasagne&rsquo;s convnet example with richer comments Using convolutional neural nets to detect facial keypoints tutorial: the resource that made me want to use Lasagne Classifying plankton with deep neural networks: an epic post, which I found while looking for Lasagne examples Various Wikipedia pages: a bit disappointing – the above resources are much better Papers Adam: a method for stochastic optimization (Kingma and Ba, 2015): an improvement over SGD with Nesterov momentum, AdaGrad and RMSProp, which I found to be useful in practice Algorithms for Hyper-Parameter Optimization (Bergstra et al.Hopping on the deep learning bandwagonhttps://yanirseroussi.com/2015/06/06/hopping-on-the-deep-learning-bandwagon/Sat, 06 Jun 2015 05:00:22 +0000https://yanirseroussi.com/2015/06/06/hopping-on-the-deep-learning-bandwagon/To become proficient at solving data science problems, you need to get your hands dirty. 
Here, I used album cover classification to learn about deep learning.First steps in data science: author-aware sentiment analysishttps://yanirseroussi.com/2015/05/02/first-steps-in-data-science-author-aware-sentiment-analysis/Sat, 02 May 2015 08:31:10 +0000https://yanirseroussi.com/2015/05/02/first-steps-in-data-science-author-aware-sentiment-analysis/I became a data scientist by doing a PhD, but the same steps can be followed without a formal education program.My divestment from fossil fuelshttps://yanirseroussi.com/2015/04/24/my-divestment-from-fossil-fuels/Fri, 24 Apr 2015 00:19:36 +0000https://yanirseroussi.com/2015/04/24/my-divestment-from-fossil-fuels/Recent choices I&rsquo;ve made to reduce my exposure to fossil fuels, including practical steps that can be taken by Australians and generally applicable lessons.My PhD workhttps://yanirseroussi.com/phd-work/Mon, 30 Mar 2015 03:23:33 +0000https://yanirseroussi.com/phd-work/An overview of my PhD in data science / artificial intelligence. 
Thesis title: Text Mining and Rating Prediction with Topical User Models.The long road to a lifestyle businesshttps://yanirseroussi.com/2015/03/22/the-long-road-to-a-lifestyle-business/Sun, 22 Mar 2015 09:43:47 +0000https://yanirseroussi.com/2015/03/22/the-long-road-to-a-lifestyle-business/Progress since leaving my last full-time job and setting out on an independent path that includes data science consulting and work on my own projects.Learning to rank for personalised search (Yandex Search Personalisation – Kaggle Competition Summary – Part 2)https://yanirseroussi.com/2015/02/11/learning-to-rank-for-personalised-search-yandex-search-personalisation-kaggle-competition-summary-part-2/Wed, 11 Feb 2015 06:34:17 +0000https://yanirseroussi.com/2015/02/11/learning-to-rank-for-personalised-search-yandex-search-personalisation-kaggle-competition-summary-part-2/My team&rsquo;s solution to the Yandex Search Personalisation competition (finished 9th out of 194 teams).Is thinking like a search engine possible? 
(Yandex search personalisation – Kaggle competition summary – part 1)https://yanirseroussi.com/2015/01/29/is-thinking-like-a-search-engine-possible-yandex-search-personalisation-kaggle-competition-summary-part-1/Thu, 29 Jan 2015 10:37:39 +0000https://yanirseroussi.com/2015/01/29/is-thinking-like-a-search-engine-possible-yandex-search-personalisation-kaggle-competition-summary-part-1/Insights on search personalisation and SEO from participating in a Kaggle competition (finished 9th out of 194 teams).Automating Parse.com bulk data importshttps://yanirseroussi.com/2015/01/15/automating-parse-com-bulk-data-imports/Thu, 15 Jan 2015 04:41:16 +0000https://yanirseroussi.com/2015/01/15/automating-parse-com-bulk-data-imports/A script for importing data into the Parse backend-as-a-service.Stochastic Gradient Boosting: Choosing the Best Number of Iterationshttps://yanirseroussi.com/2014/12/29/stochastic-gradient-boosting-choosing-the-best-number-of-iterations/Mon, 29 Dec 2014 02:30:06 +0000https://yanirseroussi.com/2014/12/29/stochastic-gradient-boosting-choosing-the-best-number-of-iterations/Exploring an approach to choosing the optimal number of iterations in stochastic gradient boosting, following a bug I found in scikit-learn.SEO: Mostly about showing up?https://yanirseroussi.com/2014/12/15/seo-mostly-about-showing-up/Mon, 15 Dec 2014 04:25:25 +0000https://yanirseroussi.com/2014/12/15/seo-mostly-about-showing-up/Increasing SEO traffic to BCRecommender by adding content and opening up more pages for crawling. 
It turns out that thin content is better than no content.Fitting noise: Forecasting the sale price of bulldozers (Kaggle competition summary)https://yanirseroussi.com/2014/11/19/fitting-noise-forecasting-the-sale-price-of-bulldozers-kaggle-competition-summary/Wed, 19 Nov 2014 09:17:34 +0000https://yanirseroussi.com/2014/11/19/fitting-noise-forecasting-the-sale-price-of-bulldozers-kaggle-competition-summary/Summary of a Kaggle competition to forecast bulldozer sale price, where I finished 9th out of 476 teams.BCRecommender Traction Updatehttps://yanirseroussi.com/2014/11/05/bcrecommender-traction-update/Wed, 05 Nov 2014 02:29:35 +0000https://yanirseroussi.com/2014/11/05/bcrecommender-traction-update/Update on BCRecommender traction using three channels: blogger outreach, search engine optimisation, and content marketing.What is data science?https://yanirseroussi.com/2014/10/23/what-is-data-science/Thu, 23 Oct 2014 03:22:08 +0000https://yanirseroussi.com/2014/10/23/what-is-data-science/Data science has been a hot term in the past few years. Still, there isn&rsquo;t a single definition of the field. 
This post discusses my favourite definition.Greek Media Monitoring Kaggle competition: My approachhttps://yanirseroussi.com/2014/10/07/greek-media-monitoring-kaggle-competition-my-approach/Tue, 07 Oct 2014 03:21:35 +0000https://yanirseroussi.com/2014/10/07/greek-media-monitoring-kaggle-competition-my-approach/Summary of my approach to the Greek Media Monitoring Kaggle competition, where I finished 6th out of 120 teams.Applying the Traction Book’s Bullseye framework to BCRecommenderhttps://yanirseroussi.com/2014/09/24/applying-the-traction-books-bullseye-framework-to-bcrecommender/Wed, 24 Sep 2014 04:57:39 +0000https://yanirseroussi.com/2014/09/24/applying-the-traction-books-bullseye-framework-to-bcrecommender/Ranking 19 channels with the goal of getting traction for BCRecommender.Bandcamp recommendation and discovery algorithmshttps://yanirseroussi.com/2014/09/19/bandcamp-recommendation-and-discovery-algorithms/Fri, 19 Sep 2014 14:26:55 +0000https://yanirseroussi.com/2014/09/19/bandcamp-recommendation-and-discovery-algorithms/The recommendation backend for my BCRecommender service for personalised Bandcamp music discovery.Building a recommender system on a shoestring budget (or: BCRecommender part 2 – general system layout)https://yanirseroussi.com/2014/09/07/building-a-recommender-system-on-a-shoestring-budget/Sun, 07 Sep 2014 10:48:44 +0000https://yanirseroussi.com/2014/09/07/building-a-recommender-system-on-a-shoestring-budget/Iterating on my BCRecommender service with the goal of keeping costs low while providing a valuable music recommendation service.Building a Bandcamp recommender system (part 1 – motivation)https://yanirseroussi.com/2014/08/30/building-a-bandcamp-recommender-system-part-1-motivation/Sat, 30 Aug 2014 08:11:38 +0000https://yanirseroussi.com/2014/08/30/building-a-bandcamp-recommender-system-part-1-motivation/My motivation behind building BCRecommender, a free recommendation &amp; discovery service for Bandcamp music.How to (almost) win Kaggle 
competitionshttps://yanirseroussi.com/2014/08/24/how-to-almost-win-kaggle-competitions/Sun, 24 Aug 2014 12:40:53 +0000https://yanirseroussi.com/2014/08/24/how-to-almost-win-kaggle-competitions/Summary of a talk I gave at the Data Science Sydney meetup with ten tips on almost-winning Kaggle competitions.Data’s hierarchy of needshttps://yanirseroussi.com/2014/08/17/datas-hierarchy-of-needs/Sun, 17 Aug 2014 13:09:30 +0000https://yanirseroussi.com/2014/08/17/datas-hierarchy-of-needs/Discussing the hierarchy of needs proposed by Jay Kreps. Key takeaway: Data-driven algorithms &amp; insights can only be as good as the underlying data.Kaggle competition tips and summarieshttps://yanirseroussi.com/kaggle/Sat, 05 Apr 2014 23:46:10 +0000https://yanirseroussi.com/kaggle/Pointers to all my Kaggle advice posts and competition summaries.Kaggle beginner tipshttps://yanirseroussi.com/2014/01/19/kaggle-beginner-tips/Sun, 19 Jan 2014 10:34:28 +0000https://yanirseroussi.com/2014/01/19/kaggle-beginner-tips/First post! An email I sent to members of the Data Science Sydney Meetup with tips on how to get started with Kaggle competitions.About Yanir: Startup Data & AI Consultanthttps://yanirseroussi.com/about/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/about/About Yanir Seroussi, a hands-on data tech lead with over a decade of experience. Yanir helps climate/nature tech startups ship data-intensive solutions.Book a free fifteen-minute callhttps://yanirseroussi.com/free-intro-call/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/free-intro-call/Booking form for a quick intro call with Yanir Seroussi.Causal inference resourceshttps://yanirseroussi.com/causal-inference-resources/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/causal-inference-resources/This is a list of some causal inference resources, which I update from time to time. You can also check out my posts on causal inference and A/B testing. 
-Books: -Causal Inference: What if by Miguel Hernán and Jamie Robins: The most practical book I&rsquo;ve read. Highly recommended. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing by Ron Kohavi, Diane Tang, and Ya Xu: Building on the authors&rsquo; decades of industry experience, this is pretty much the bible of online experiments, which is how causal inference is often done in practice.Free Guide: Data-to-AI Health Check for Startupshttps://yanirseroussi.com/data-to-ai-health-check/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/data-to-ai-health-check/Download a free PDF guide that helps you assess a startup&rsquo;s Data-to-AI health by probing eight key areas.Helping climate & nature tech startups ship data-intensive solutionshttps://yanirseroussi.com/consult/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/consult/Consulting for climate &amp; nature tech startups: Strategic advice, implementation of Data/AI/ML solutions, and hiring help by an experienced tech leader.Speaking engagements by Yanir: Startup Data & AI Consultanthttps://yanirseroussi.com/talks/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/talks/Yanir Seroussi speaks on data science, artificial intelligence, machine learning, and career journey.Stay in touchhttps://yanirseroussi.com/contact/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/contact/Contact me or subscribe to the mailing list. 
\ No newline at end of file +Yanir Seroussi | Data & AI for Startup Impacthttps://yanirseroussi.com/Recent content on Yanir Seroussi | Data & AI for Startup ImpactHugo -- gohugo.ioen-auText and figures licensed under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) by [Yanir Seroussi](https://yanirseroussi.com/about/), except where noted otherwiseMon, 02 Sep 2024 02:30:00 +0000Juggling delivery, admin, and leads: Monthly biz recaphttps://yanirseroussi.com/2024/09/02/juggling-delivery-admin-and-leads-monthly-biz-recap/Mon, 02 Sep 2024 02:30:00 +0000https://yanirseroussi.com/2024/09/02/juggling-delivery-admin-and-leads-monthly-biz-recap/Highlights and lessons from my solo expertise biz, including value pricing, fractional cash flow, and distractions from admin &amp; politics.AI hype, AI bullshit, and the real dealhttps://yanirseroussi.com/2024/08/26/ai-hype-ai-bullshit-and-the-real-deal/Mon, 26 Aug 2024 01:00:00 +0000https://yanirseroussi.com/2024/08/26/ai-hype-ai-bullshit-and-the-real-deal/My views on separating AI hype and bullshit from the real deal. 
The general ideas apply to past and future hype waves in tech.Giving up on the minimum viable data stackhttps://yanirseroussi.com/2024/08/19/giving-up-on-the-minimum-viable-data-stack/Mon, 19 Aug 2024 03:30:00 +0000https://yanirseroussi.com/2024/08/19/giving-up-on-the-minimum-viable-data-stack/Exploring why universal advice on startup data stacks is challenging, and the importance of context-specific decisions in data infrastructure.Keep learning: Your career is never truly donehttps://yanirseroussi.com/2024/08/12/keep-learning-your-career-is-never-truly-done/Mon, 12 Aug 2024 01:30:00 +0000https://yanirseroussi.com/2024/08/12/keep-learning-your-career-is-never-truly-done/Podcast chat on my career journey from software engineering to data science and independent consulting.First year lessons from a solo expertise biz in Data & AIhttps://yanirseroussi.com/2024/08/05/first-year-lessons-from-a-solo-expertise-biz-in-data-and-ai/Mon, 05 Aug 2024 08:45:00 +0000https://yanirseroussi.com/2024/08/05/first-year-lessons-from-a-solo-expertise-biz-in-data-and-ai/Reflections on building a solo expertise business in Data &amp; AI, focusing on climate tech startups. 
Lessons learned from the first year of transition.AI/ML lifecycle models versus real-world messhttps://yanirseroussi.com/2024/07/29/ai-ml-lifecycle-models-versus-real-world-mess/Mon, 29 Jul 2024 06:00:00 +0000https://yanirseroussi.com/2024/07/29/ai-ml-lifecycle-models-versus-real-world-mess/The real world of AI/ML doesn&rsquo;t fit into a neat diagram, so I created another diagram and a maturity heatmap to model the mess.Your first Data-to-AI hire: Run a lovable processhttps://yanirseroussi.com/2024/07/22/your-first-data-to-ai-hire-run-a-lovable-process/Mon, 22 Jul 2024 01:00:00 +0000https://yanirseroussi.com/2024/07/22/your-first-data-to-ai-hire-run-a-lovable-process/Video and key points from the second part of a webinar on a startup&rsquo;s first data hire, covering tips for defining the role and running the process.Learn about Dataland to avoid expensive hiring mistakeshttps://yanirseroussi.com/2024/07/15/learn-about-dataland-to-avoid-expensive-hiring-mistakes/Mon, 15 Jul 2024 05:30:00 +0000https://yanirseroussi.com/2024/07/15/learn-about-dataland-to-avoid-expensive-hiring-mistakes/Video and key points from the first part of a webinar on a startup&rsquo;s first data hire, covering data &amp; AI definitions and high-level recommendations.Exploring an AI product idea with the latest ChatGPT, Claude, and Geminihttps://yanirseroussi.com/2024/07/08/exploring-an-ai-product-idea-with-the-latest-chatgpt-claude-and-gemini/Mon, 08 Jul 2024 02:45:00 +0000https://yanirseroussi.com/2024/07/08/exploring-an-ai-product-idea-with-the-latest-chatgpt-claude-and-gemini/Asking identical questions about my MagicGrantMaker idea yielded near-identical responses from the top chatbot models.Stay alert! 
Security is everyone's responsibilityhttps://yanirseroussi.com/2024/07/01/stay-alert-security-is-everyones-responsibility/Mon, 01 Jul 2024 02:00:00 +0000https://yanirseroussi.com/2024/07/01/stay-alert-security-is-everyones-responsibility/Questions to assess the security posture of a startup, focusing on basic hygiene and handling of sensitive data.Five team-building mistakes, according to Patty McCordhttps://yanirseroussi.com/til/2024/06/26/five-team-building-mistakes-according-to-patty-mccord/Wed, 26 Jun 2024 00:00:00 +0000https://yanirseroussi.com/til/2024/06/26/five-team-building-mistakes-according-to-patty-mccord/Takeaways from an interview with Patty McCord on The Startup Podcast.Is your tech stack ready for data-intensive applications?https://yanirseroussi.com/2024/06/24/is-your-tech-stack-ready-for-data-intensive-applications/Mon, 24 Jun 2024 02:00:00 +0000https://yanirseroussi.com/2024/06/24/is-your-tech-stack-ready-for-data-intensive-applications/Questions to assess the quality of tech stacks and lifecycles, with a focus on artificial intelligence, machine learning, and analytics.Dealing with endless data changeshttps://yanirseroussi.com/til/2024/06/22/dealing-with-endless-data-changes/Sat, 22 Jun 2024 22:50:00 +0000https://yanirseroussi.com/til/2024/06/22/dealing-with-endless-data-changes/Quotes from Demetrios Brinkmann on the relationship between MLOps and DevOps, with MLOps allowing for managing changes that come from data.AI ain't gonna save you from bad datahttps://yanirseroussi.com/2024/06/17/ai-aint-gonna-save-you-from-bad-data/Mon, 17 Jun 2024 02:00:00 +0000https://yanirseroussi.com/2024/06/17/ai-aint-gonna-save-you-from-bad-data/Since we&rsquo;re far from a utopia where data issues are fully handled by AI, this post presents six questions humans can use to assess data projects.The rules of the passion economyhttps://yanirseroussi.com/til/2024/06/12/the-rules-of-the-passion-economy/Wed, 12 Jun 2024 02:50:00 
+0000https://yanirseroussi.com/til/2024/06/12/the-rules-of-the-passion-economy/Summary of the main messages from the book The Passion Economy by Adam Davidson.Startup data health starts with healthy event trackinghttps://yanirseroussi.com/2024/06/10/startup-data-health-starts-with-healthy-event-tracking/Mon, 10 Jun 2024 04:00:00 +0000https://yanirseroussi.com/2024/06/10/startup-data-health-starts-with-healthy-event-tracking/Expanding on the startup health check question of tracking Kukuyeva&rsquo;s five business aspects as wide events.How to avoid startups with poor development processeshttps://yanirseroussi.com/2024/06/03/how-to-avoid-startups-with-poor-development-processes/Mon, 03 Jun 2024 02:45:00 +0000https://yanirseroussi.com/2024/06/03/how-to-avoid-startups-with-poor-development-processes/Questions that prospective data specialists and engineers should ask about development processes before accepting a startup role.Plumbing, Decisions, and Automation: De-hyping Data & AIhttps://yanirseroussi.com/2024/05/27/plumbing-decisions-and-automation-de-hyping-data-and-ai/Mon, 27 May 2024 02:00:00 +0000https://yanirseroussi.com/2024/05/27/plumbing-decisions-and-automation-de-hyping-data-and-ai/Three essential questions to understand where an organisation stands when it comes to Data &amp; AI (with zero hype).Adapting to the economy of algorithmshttps://yanirseroussi.com/til/2024/05/25/adapting-to-the-economy-of-algorithms/Sat, 25 May 2024 00:00:00 +0000https://yanirseroussi.com/til/2024/05/25/adapting-to-the-economy-of-algorithms/Overview of the book The Economy of Algorithms by Marek Kowalkiewicz.Question startup culture before accepting a data-to-AI rolehttps://yanirseroussi.com/2024/05/20/question-startup-culture-before-accepting-a-data-to-ai-role/Mon, 20 May 2024 02:25:00 +0000https://yanirseroussi.com/2024/05/20/question-startup-culture-before-accepting-a-data-to-ai-role/Eight questions that prospective data-to-AI employees should ask about a startup&rsquo;s work 
and data culture.Probing the People aspects of an early-stage startuphttps://yanirseroussi.com/2024/05/13/probing-the-people-aspects-of-an-early-stage-startup/Mon, 13 May 2024 02:00:00 +0000https://yanirseroussi.com/2024/05/13/probing-the-people-aspects-of-an-early-stage-startup/Ten questions that prospective employees should ask about a startup&rsquo;s team, especially for data-centric roles.Business questions to ask before taking a startup data rolehttps://yanirseroussi.com/2024/05/06/business-questions-to-ask-before-taking-a-startup-data-role/Mon, 06 May 2024 04:30:00 +0000https://yanirseroussi.com/2024/05/06/business-questions-to-ask-before-taking-a-startup-data-role/Fourteen questions that prospective employees should ask about a startup&rsquo;s business model and product, especially for data-focused roles.Mentorship and the art of actionable advicehttps://yanirseroussi.com/2024/04/29/mentorship-and-the-art-of-actionable-advice/Mon, 29 Apr 2024 06:30:00 +0000https://yanirseroussi.com/2024/04/29/mentorship-and-the-art-of-actionable-advice/Reflections on what it takes to package expertise and deliver timely, actionable advice outside the context of employee relationships.Assessing a startup's data-to-AI healthhttps://yanirseroussi.com/2024/04/22/assessing-a-startups-data-to-ai-health/Mon, 22 Apr 2024 06:00:00 +0000https://yanirseroussi.com/2024/04/22/assessing-a-startups-data-to-ai-health/Reviewing the areas that should be assessed to determine a startup&rsquo;s opportunities and challenges on the data/AI/ML front.AI does not obviate the need for testing and observabilityhttps://yanirseroussi.com/2024/04/15/ai-does-not-obviate-the-need-for-testing-and-observability/Mon, 15 Apr 2024 05:00:00 +0000https://yanirseroussi.com/2024/04/15/ai-does-not-obviate-the-need-for-testing-and-observability/It&rsquo;s easy to prototype with AI, but production-grade AI apps require even more thorough testing and observability than traditional software.LinkedIn is a teachable 
skillhttps://yanirseroussi.com/til/2024/04/11/linkedin-is-a-teachable-skill/Thu, 11 Apr 2024 01:45:25 +0000https://yanirseroussi.com/til/2024/04/11/linkedin-is-a-teachable-skill/A high-level overview of things I learned from Justin Welsh&rsquo;s LinkedIn Operating System course.My experience as a Data Tech Lead with Work on Climatehttps://yanirseroussi.com/2024/04/08/my-experience-as-a-data-tech-lead-with-work-on-climate/Mon, 08 Apr 2024 02:00:00 +0000https://yanirseroussi.com/2024/04/08/my-experience-as-a-data-tech-lead-with-work-on-climate/The story of how I joined Work on Climate as a volunteer and became its data tech lead, with lessons applied to consulting &amp; fractional work.The data engineering lifecycle is not going anywherehttps://yanirseroussi.com/til/2024/04/05/the-data-engineering-lifecycle-is-not-going-anywhere/Fri, 05 Apr 2024 01:00:00 +0000https://yanirseroussi.com/til/2024/04/05/the-data-engineering-lifecycle-is-not-going-anywhere/My key takeaways from reading Fundamentals of Data Engineering by Joe Reis and Matt Housley.Artificial intelligence, automation, and the art of counting fishhttps://yanirseroussi.com/2024/04/01/artificial-intelligence-automation-and-the-art-of-counting-fish/Mon, 01 Apr 2024 06:00:00 +0000https://yanirseroussi.com/2024/04/01/artificial-intelligence-automation-and-the-art-of-counting-fish/Discussing the use of AI to automate underwater marine surveys as an example of the uneven distribution of technological advancement.Atomic Habits is full of actionable advicehttps://yanirseroussi.com/til/2024/03/12/atomic-habits-is-full-of-actionable-advice/Tue, 12 Mar 2024 06:19:31 +0000https://yanirseroussi.com/til/2024/03/12/atomic-habits-is-full-of-actionable-advice/I put the book to use after the first listen, and will definitely revisit it in the future to form better habits.Questions to consider when using AI for PDF data 
extractionhttps://yanirseroussi.com/2024/03/11/questions-to-consider-when-using-ai-for-pdf-data-extraction/Mon, 11 Mar 2024 00:00:00 +0000https://yanirseroussi.com/2024/03/11/questions-to-consider-when-using-ai-for-pdf-data-extraction/Discussing considerations that arise when attempting to automate the extraction of structured data from PDFs and similar documents.Two types of startup data problemshttps://yanirseroussi.com/2024/03/04/two-types-of-startup-data-problems/Mon, 04 Mar 2024 02:00:00 +0000https://yanirseroussi.com/2024/03/04/two-types-of-startup-data-problems/Classifying startups as ML-centric or non-ML is a helpful exercise to uncover the data challenges they&rsquo;re likely to face.Avoiding AI complexity: First, write no codehttps://yanirseroussi.com/2024/02/26/avoiding-ai-complexity-first-write-no-code/Mon, 26 Feb 2024 01:45:00 +0000https://yanirseroussi.com/2024/02/26/avoiding-ai-complexity-first-write-no-code/Two stories of getting AI functionality to production, which demonstrate the risks inherent in custom development versus starting with a no-code approach.Building your startup's minimum viable data stackhttps://yanirseroussi.com/2024/02/19/building-your-startups-minimum-viable-data-stack/Mon, 19 Feb 2024 00:00:00 +0000https://yanirseroussi.com/2024/02/19/building-your-startups-minimum-viable-data-stack/First post in a series on building a minimum viable data stack for startups, introducing key definitions, components, and considerations.The three Cs of indie consulting: Confidence, Cash, and Connectionshttps://yanirseroussi.com/til/2024/02/17/the-three-cs-of-indie-consulting-confidence-cash-and-connections/Sat, 17 Feb 2024 02:00:00 +0000https://yanirseroussi.com/til/2024/02/17/the-three-cs-of-indie-consulting-confidence-cash-and-connections/Jonathan Stark makes a compelling argument why you should have the three Cs before quitting your job to go solo consulting.Nudging ChatGPT to invent books you have no time to 
readhttps://yanirseroussi.com/2024/02/12/nudging-chatgpt-to-invent-books-you-have-no-time-to-read/Mon, 12 Feb 2024 05:00:00 +0000https://yanirseroussi.com/2024/02/12/nudging-chatgpt-to-invent-books-you-have-no-time-to-read/Getting ChatGPT Plus to elaborate on possible book content and produce a PDF cheatsheet, with the goal of learning about its capabilities.Future software development may require fewer humanshttps://yanirseroussi.com/til/2024/02/06/future-software-development-may-require-fewer-humans/Tue, 06 Feb 2024 06:15:00 +0000https://yanirseroussi.com/til/2024/02/06/future-software-development-may-require-fewer-humans/Reflecting on an interview with Jason Warner, CEO of poolside.Substance over titles: Your first data hire may be a data scientisthttps://yanirseroussi.com/2024/02/05/substance-over-titles-your-first-data-hire-may-be-a-data-scientist/Mon, 05 Feb 2024 02:45:00 +0000https://yanirseroussi.com/2024/02/05/substance-over-titles-your-first-data-hire-may-be-a-data-scientist/Advice for hiring a startup&rsquo;s first data person: match skills to business needs, consider contractors, and get help from data people.New decade, new tagline: Data & AI for Impacthttps://yanirseroussi.com/2024/01/19/new-decade-new-tagline-data-and-ai-for-impact/Fri, 19 Jan 2024 00:00:00 +0000https://yanirseroussi.com/2024/01/19/new-decade-new-tagline-data-and-ai-for-impact/Shifting focus to &lsquo;Data &amp; AI for Impact&rsquo;, with more startup-related content, increased posting frequency, and deeper audience engagement.Psychographic specialisations may work for discipline generalistshttps://yanirseroussi.com/til/2024/01/09/psychographic-specialisations-may-work-for-discipline-generalists/Tue, 09 Jan 2024 03:00:00 +0000https://yanirseroussi.com/til/2024/01/09/psychographic-specialisations-may-work-for-discipline-generalists/When focusing on a market segment defined by personal beliefs, it&rsquo;s often fine to position yourself as a generalist in your craft.The power of 
parasocial relationshipshttps://yanirseroussi.com/til/2024/01/08/the-power-of-parasocial-relationships/Mon, 08 Jan 2024 06:00:00 +0000https://yanirseroussi.com/til/2024/01/08/the-power-of-parasocial-relationships/Repeated exposure to media personas creates relationships that help justify premium fees.Positioning is a common problem for data scientistshttps://yanirseroussi.com/til/2023/12/18/positioning-is-a-common-problem-for-data-scientists/Mon, 18 Dec 2023 00:30:00 +0000https://yanirseroussi.com/til/2023/12/18/positioning-is-a-common-problem-for-data-scientists/With the commodification of data scientists, the problem of positioning has become more common: My takeaways from Genevieve Hayes interviewing Jonathan Stark.Transfer learning applies to energy market biddinghttps://yanirseroussi.com/til/2023/12/14/transfer-learning-applies-to-energy-market-bidding/Thu, 14 Dec 2023 00:15:00 +0000https://yanirseroussi.com/til/2023/12/14/transfer-learning-applies-to-energy-market-bidding/An interesting approach to bidding of energy storage assets, showing that training on New York data is transferable to Queensland.Supporting volunteer monitoring of marine biodiversity with modern web and data toolshttps://yanirseroussi.com/2023/11/29/supporting-volunteer-monitoring-of-marine-biodiversity-with-modern-web-and-data-tools/Wed, 29 Nov 2023 02:00:00 +0000https://yanirseroussi.com/2023/11/29/supporting-volunteer-monitoring-of-marine-biodiversity-with-modern-web-and-data-tools/Summarising the work Uri Seroussi and I did to improve Reef Life Survey&rsquo;s Reef Species of the World app.Our Blue Machine is changing, but we are not helplesshttps://yanirseroussi.com/til/2023/11/28/our-blue-machine-is-changing-but-we-are-not-helpless/Tue, 28 Nov 2023 06:40:00 +0000https://yanirseroussi.com/til/2023/11/28/our-blue-machine-is-changing-but-we-are-not-helpless/One of my many highlights from Helen Czerski&rsquo;s Blue Machine.You don't need a proprietary API for static 
mapshttps://yanirseroussi.com/til/2023/11/21/you-dont-need-a-proprietary-api-for-static-maps/Tue, 21 Nov 2023 06:00:00 +0000https://yanirseroussi.com/til/2023/11/21/you-dont-need-a-proprietary-api-for-static-maps/For many use cases, libraries like cartopy are better than the likes of Mapbox and Google Maps.Lessons from reluctant data engineeringhttps://yanirseroussi.com/2023/10/25/lessons-from-reluctant-data-engineering/Wed, 25 Oct 2023 04:45:00 +0000https://yanirseroussi.com/2023/10/25/lessons-from-reluctant-data-engineering/Video and summary of a talk I gave at DataEngBytes Brisbane on what I learned from doing data engineering as part of every data science role I had.Artificial intelligence was a marketing term all along – just call it automationhttps://yanirseroussi.com/til/2023/10/06/artificial-intelligence-was-a-marketing-term-all-along-just-call-it-automation/Fri, 06 Oct 2023 05:00:00 +0000https://yanirseroussi.com/til/2023/10/06/artificial-intelligence-was-a-marketing-term-all-along-just-call-it-automation/Replacing &lsquo;artificial intelligence&rsquo; with &lsquo;automation&rsquo; is a useful trick for cutting through the hype.The lines between solo consulting and product building are blurryhttps://yanirseroussi.com/til/2023/09/25/the-lines-between-solo-consulting-and-product-building-are-blurry/Mon, 25 Sep 2023 00:00:00 +0000https://yanirseroussi.com/til/2023/09/25/the-lines-between-solo-consulting-and-product-building-are-blurry/It turns out that problems like finding a niche and defining the ideal clients are key to any solo business.Google's Rules of Machine Learning still apply in the age of large language modelshttps://yanirseroussi.com/til/2023/09/21/googles-rules-of-machine-learning-still-apply-in-the-age-of-large-language-models/Thu, 21 Sep 2023 21:30:00 +0000https://yanirseroussi.com/til/2023/09/21/googles-rules-of-machine-learning-still-apply-in-the-age-of-large-language-models/Despite the excitement around large language models, building with 
machine learning remains an engineering problem with established best practices.My rediscovery of quiet writing on the open webhttps://yanirseroussi.com/2023/08/28/my-rediscovery-of-quiet-writing-on-the-open-web/Mon, 28 Aug 2023 05:30:00 +0000https://yanirseroussi.com/2023/08/28/my-rediscovery-of-quiet-writing-on-the-open-web/Reflections on publishing on this website: Writing publicly to share thoughts and documentation beats chasing views and likes.The Minimalist Entrepreneur is too prescriptive for mehttps://yanirseroussi.com/til/2023/08/21/the-minimalist-entrepreneur-is-too-prescriptive-for-me/Mon, 21 Aug 2023 03:15:00 +0000https://yanirseroussi.com/til/2023/08/21/the-minimalist-entrepreneur-is-too-prescriptive-for-me/While I found the story of Gumroad interesting, The Minimalist Entrepreneur seems to over-generalise from the founder&rsquo;s experience.Revisiting Start Small, Stay Small in 2023 (Chapter 2)https://yanirseroussi.com/til/2023/08/17/revisiting-start-small-stay-small-in-2023-chapter-2/Thu, 17 Aug 2023 07:45:00 +0000https://yanirseroussi.com/til/2023/08/17/revisiting-start-small-stay-small-in-2023-chapter-2/A summary of the second chapter of Rob Walling&rsquo;s Start Small, Stay Small, along with my thoughts &amp; reflections.Revisiting Start Small, Stay Small in 2023 (Chapter 1)https://yanirseroussi.com/til/2023/08/16/revisiting-start-small-stay-small-in-2023-chapter-1/Wed, 16 Aug 2023 05:45:00 +0000https://yanirseroussi.com/til/2023/08/16/revisiting-start-small-stay-small-in-2023-chapter-1/A summary of the first chapter of Rob Walling&rsquo;s Start Small, Stay Small, along with my thoughts &amp; reflections.Email notifications on public GitHub commitshttps://yanirseroussi.com/til/2023/08/14/email-notifications-on-public-github-commits/Mon, 14 Aug 2023 05:15:00 +0000https://yanirseroussi.com/til/2023/08/14/email-notifications-on-public-github-commits/GitHub publishes an Atom feed, which means you can use any RSS reader to follow commits.The rule of 
thirds can probably be ignoredhttps://yanirseroussi.com/til/2023/08/11/the-rule-of-thirds-can-probably-be-ignored/Fri, 11 Aug 2023 03:15:00 +0000https://yanirseroussi.com/til/2023/08/11/the-rule-of-thirds-can-probably-be-ignored/Turns out that the rule of thirds for composing visuals may not be that important.Using YubiKey for SSH accesshttps://yanirseroussi.com/til/2023/07/23/using-yubikey-for-ssh-access/Sun, 23 Jul 2023 00:07:15 +0000https://yanirseroussi.com/til/2023/07/23/using-yubikey-for-ssh-access/Some pointers for setting up SSH access with YubiKey on Ubuntu 22.04.Making a TIL section with Hugo and PaperModhttps://yanirseroussi.com/til/2023/07/17/making-a-til-section-with-hugo-and-papermod/Mon, 17 Jul 2023 00:06:15 +0000https://yanirseroussi.com/til/2023/07/17/making-a-til-section-with-hugo-and-papermod/How I added a Today I Learned section to my Hugo site with the PaperMod theme.You can't save timehttps://yanirseroussi.com/til/2023/07/11/you-cant-save-time/Tue, 11 Jul 2023 00:00:00 +0000https://yanirseroussi.com/til/2023/07/11/you-cant-save-time/Time can be spent doing different activities, but it can&rsquo;t be stored and saved for later.Was data science a failure mode of software engineering?https://yanirseroussi.com/2023/06/30/was-data-science-a-failure-mode-of-software-engineering/Fri, 30 Jun 2023 00:06:30 +0000https://yanirseroussi.com/2023/06/30/was-data-science-a-failure-mode-of-software-engineering/Yes, data science projects have suffered from classic software engineering mistakes, but the field is maturing with the rise of new engineering roles.How hackable are automated coding assessments?https://yanirseroussi.com/2023/05/26/how-hackable-are-automated-coding-assessments/Fri, 26 May 2023 00:03:00 +0000https://yanirseroussi.com/2023/05/26/how-hackable-are-automated-coding-assessments/Exploring the hackability of speed-based coding tests, using CodeSignal&rsquo;s Industry Coding Framework as a case study.Remaining relevant as a small language 
modelhttps://yanirseroussi.com/2023/04/21/remaining-relevant-as-a-small-language-model/Fri, 21 Apr 2023 00:06:30 +0000https://yanirseroussi.com/2023/04/21/remaining-relevant-as-a-small-language-model/Bing Chat recently quipped that humans are small language models. Here are some of my thoughts on how we small language models can remain relevant (for now).ChatGPT is transformative AIhttps://yanirseroussi.com/2022/12/11/chatgpt-is-transformative-ai/Sun, 11 Dec 2022 00:00:00 +0000https://yanirseroussi.com/2022/12/11/chatgpt-is-transformative-ai/My perspective after a week of using ChatGPT: This is a step change in finding distilled information, and it&rsquo;s only the beginning.Causal Machine Learning is off to a good start, despite some issueshttps://yanirseroussi.com/2022/09/12/causal-machine-learning-book-draft-review/Mon, 12 Sep 2022 02:45:00 +0000https://yanirseroussi.com/2022/09/12/causal-machine-learning-book-draft-review/Reviewing the first three chapters of the book Causal Machine Learning by Robert Osazuwa Ness.The mission matters: Moving to climate tech as a data scientisthttps://yanirseroussi.com/2022/06/06/the-mission-matters-moving-to-climate-tech-as-a-data-scientist/Mon, 06 Jun 2022 00:00:00 +0000https://yanirseroussi.com/2022/06/06/the-mission-matters-moving-to-climate-tech-as-a-data-scientist/Discussing my recent career move into climate tech as a way of doing more to help mitigate dangerous climate change.Building useful machine learning tools keeps getting easier: A fish ID case studyhttps://yanirseroussi.com/2022/03/20/building-useful-machine-learning-tools-keeps-getting-easier-a-fish-id-case-study/Sun, 20 Mar 2022 04:30:00 +0000https://yanirseroussi.com/2022/03/20/building-useful-machine-learning-tools-keeps-getting-easier-a-fish-id-case-study/Lessons learned building a fish ID web app with fast.ai and Streamlit, in an attempt to reduce my fear of missing out on the latest deep learning developments.Analysis strategies in online A/B experiments: 
Intention-to-treat, per-protocol, and other lessons from clinical trialshttps://yanirseroussi.com/2022/01/14/analysis-strategies-in-online-a-b-experiments/Fri, 14 Jan 2022 00:05:40 +0000https://yanirseroussi.com/2022/01/14/analysis-strategies-in-online-a-b-experiments/Epidemiologists analyse clinical trials to estimate the intention-to-treat and per-protocol effects. This post applies their strategies to online experiments.Use your human brain to avoid artificial intelligence disastershttps://yanirseroussi.com/2021/11/22/use-your-human-brain-to-avoid-artificial-intelligence-disasters/Mon, 22 Nov 2021 03:45:00 +0000https://yanirseroussi.com/2021/11/22/use-your-human-brain-to-avoid-artificial-intelligence-disasters/Overview of a talk I gave at a deep learning course, focusing on AI ethics as the need for humans to think on the context and consequences of applying AI.Migrating from WordPress.com to Hugo on GitHub + Cloudflarehttps://yanirseroussi.com/2021/11/10/migrating-from-wordpress-com-to-hugo-on-github-cloudflare/Wed, 10 Nov 2021 06:30:00 +0000https://yanirseroussi.com/2021/11/10/migrating-from-wordpress-com-to-hugo-on-github-cloudflare/My reasons for switching from WordPress.com to Hugo on GitHub + Cloudflare, along with a summary of the solution components and migration process.My work with Automattichttps://yanirseroussi.com/2021/10/07/my-work-with-automattic/Thu, 07 Oct 2021 00:00:00 +0000https://yanirseroussi.com/2021/10/07/my-work-with-automattic/Back-dated meta-post that gathers my posts on Automattic blogs into a summary of the work I&rsquo;ve done with the company.Some highlights from 2020https://yanirseroussi.com/2021/04/05/some-highlights-from-2020/Mon, 05 Apr 2021 06:41:48 +0000https://yanirseroussi.com/2021/04/05/some-highlights-from-2020/Sharing remote teamwork insights, my climate &amp; sustainability activism, Reef Life Survey publications, and progress on Automattic&rsquo;s Experimentation Platform.Many is not enough: Counting simulations to 
bootstrap the right wayhttps://yanirseroussi.com/2020/08/24/many-is-not-enough-counting-simulations-to-bootstrap-the-right-way/Mon, 24 Aug 2020 01:35:17 +0000https://yanirseroussi.com/2020/08/24/many-is-not-enough-counting-simulations-to-bootstrap-the-right-way/Going deeper into correct testing of different methods for bootstrap estimation of confidence intervals.Software commodities are eating interesting data science workhttps://yanirseroussi.com/2020/01/11/software-commodities-are-eating-interesting-data-science-work/Sat, 11 Jan 2020 09:22:35 +0000https://yanirseroussi.com/2020/01/11/software-commodities-are-eating-interesting-data-science-work/Being a data scientist can sometimes feel like a race against software commodities that replace interesting work. What can one do to remain relevant?A day in the life of a remote data scientisthttps://yanirseroussi.com/2019/12/12/a-day-in-the-life-of-a-remote-data-scientist/Wed, 11 Dec 2019 22:06:19 +0000https://yanirseroussi.com/2019/12/12/a-day-in-the-life-of-a-remote-data-scientist/Video of a talk I gave on remote data science work at the Data Science Sydney meetup.Bootstrapping the right way?https://yanirseroussi.com/2019/10/06/bootstrapping-the-right-way/Sun, 06 Oct 2019 06:48:07 +0000https://yanirseroussi.com/2019/10/06/bootstrapping-the-right-way/Video and summary of a talk I gave at YOW! Data on bootstrap estimation of confidence intervals.Hackers beware: Bootstrap sampling may be harmfulhttps://yanirseroussi.com/2019/01/08/hackers-beware-bootstrap-sampling-may-be-harmful/Mon, 07 Jan 2019 21:07:56 +0000https://yanirseroussi.com/2019/01/08/hackers-beware-bootstrap-sampling-may-be-harmful/Bootstrap sampling has been promoted as an easy way of modelling uncertainty to hackers without much statistical knowledge. 
But things aren&rsquo;t that simple.The most practical causal inference book I’ve read (is still a draft)https://yanirseroussi.com/2018/12/24/the-most-practical-causal-inference-book-ive-read-is-still-a-draft/Mon, 24 Dec 2018 02:37:50 +0000https://yanirseroussi.com/2018/12/24/the-most-practical-causal-inference-book-ive-read-is-still-a-draft/Causal Inference by Miguel Hernán and Jamie Robins is a must-read for anyone interested in the area.Reflections on remote data science workhttps://yanirseroussi.com/2018/11/03/reflections-on-remote-data-science-work/Sat, 03 Nov 2018 06:33:13 +0000https://yanirseroussi.com/2018/11/03/reflections-on-remote-data-science-work/Discussing the pluses and minuses of remote work eighteen months after joining Automattic as a data scientist.Defining data science in 2018https://yanirseroussi.com/2018/07/22/defining-data-science-in-2018/Sun, 22 Jul 2018 08:27:43 +0000https://yanirseroussi.com/2018/07/22/defining-data-science-in-2018/Updating my definition of data science to match changes in the field. 
It is now broader than before, but its ultimate goal is still to support decisions.Advice for aspiring data scientists and other FAQshttps://yanirseroussi.com/2017/10/15/advice-for-aspiring-data-scientists-and-other-faqs/Sun, 15 Oct 2017 09:15:25 +0000https://yanirseroussi.com/2017/10/15/advice-for-aspiring-data-scientists-and-other-faqs/Frequently asked questions by visitors to this site, especially around entering the data science field.State of Bandcamp Recommender, Late 2017https://yanirseroussi.com/2017/09/02/state-of-bandcamp-recommender/Sat, 02 Sep 2017 10:19:02 +0000https://yanirseroussi.com/2017/09/02/state-of-bandcamp-recommender/Call for BCRecommender maintainers followed by a decision to shut it down, as I don&rsquo;t have enough time and Bandcamp now offers recommendations.My 10-step path to becoming a remote data scientist with Automattichttps://yanirseroussi.com/2017/07/29/my-10-step-path-to-becoming-a-remote-data-scientist-with-automattic/Sat, 29 Jul 2017 05:39:26 +0000https://yanirseroussi.com/2017/07/29/my-10-step-path-to-becoming-a-remote-data-scientist-with-automattic/I wanted a well-paid data science-y remote job with an established company that offers a good life balance and makes products I care about. 
I got it eventually.Exploring and visualising Reef Life Survey datahttps://yanirseroussi.com/2017/06/03/exploring-and-visualising-reef-life-survey-data/Sat, 03 Jun 2017 00:49:05 +0000https://yanirseroussi.com/2017/06/03/exploring-and-visualising-reef-life-survey-data/Web tools I built to visualise Reef Life Survey data and assist citizen scientists in underwater visual census work.Customer lifetime value and the proliferation of misinformation on the internethttps://yanirseroussi.com/2017/01/08/customer-lifetime-value-and-the-proliferation-of-misinformation-on-the-internet/Sun, 08 Jan 2017 20:02:30 +0000https://yanirseroussi.com/2017/01/08/customer-lifetime-value-and-the-proliferation-of-misinformation-on-the-internet/There&rsquo;s a lot of misleading content on the estimation of customer lifetime value. Here&rsquo;s what I learned about doing it well.Ask Why! Finding motives, causes, and purpose in data sciencehttps://yanirseroussi.com/2016/09/19/ask-why-finding-motives-causes-and-purpose-in-data-science/Mon, 19 Sep 2016 21:28:44 +0000https://yanirseroussi.com/2016/09/19/ask-why-finding-motives-causes-and-purpose-in-data-science/Video and summary of a talk I gave at the Data Science Sydney meetup, about going beyond the what &amp; how of predictive modelling.If you don’t pay attention, data can drive you off a cliffhttps://yanirseroussi.com/2016/08/21/seven-ways-to-be-data-driven-off-a-cliff/Sun, 21 Aug 2016 21:34:17 +0000https://yanirseroussi.com/2016/08/21/seven-ways-to-be-data-driven-off-a-cliff/Seven common mistakes to avoid when working with data, such as ignoring uncertainty and confusing observed and unobserved quantities.Is Data Scientist a useless job title?https://yanirseroussi.com/2016/08/04/is-data-scientist-a-useless-job-title/Thu, 04 Aug 2016 22:26:03 +0000https://yanirseroussi.com/2016/08/04/is-data-scientist-a-useless-job-title/It seems like anyone who touches data can call themselves a data scientist, which makes the title useless. 
The work they do can still be useful, though.Making Bayesian A/B testing more accessiblehttps://yanirseroussi.com/2016/06/19/making-bayesian-ab-testing-more-accessible/Sun, 19 Jun 2016 10:32:15 +0000https://yanirseroussi.com/2016/06/19/making-bayesian-ab-testing-more-accessible/A web tool I built to interpret A/B test results in a Bayesian way, including prior specification, visualisations, and decision rules.Diving deeper into causality: Pearl, Kleinberg, Hill, and untested assumptionshttps://yanirseroussi.com/2016/05/15/diving-deeper-into-causality-pearl-kleinberg-hill-and-untested-assumptions/Sat, 14 May 2016 19:57:03 +0000https://yanirseroussi.com/2016/05/15/diving-deeper-into-causality-pearl-kleinberg-hill-and-untested-assumptions/Discussing the need for untested assumptions and temporality in causal inference. Mostly based on Samantha Kleinberg&rsquo;s Causality, Probability, and Time.The rise of greedy robotshttps://yanirseroussi.com/2016/03/20/the-rise-of-greedy-robots/Sun, 20 Mar 2016 20:33:43 +0000https://yanirseroussi.com/2016/03/20/the-rise-of-greedy-robots/Is artificial/machine intelligence a future threat? 
I argue that it&rsquo;s already here, with greedy robots already dominating our lives.Why you should stop worrying about deep learning and deepen your understanding of causality insteadhttps://yanirseroussi.com/2016/02/14/why-you-should-stop-worrying-about-deep-learning-and-deepen-your-understanding-of-causality-instead/Sun, 14 Feb 2016 11:04:11 +0000https://yanirseroussi.com/2016/02/14/why-you-should-stop-worrying-about-deep-learning-and-deepen-your-understanding-of-causality-instead/Causality is often overlooked but is of much higher relevance to most data scientists than deep learning.The joys of offline data collectionhttps://yanirseroussi.com/2016/01/24/the-joys-of-offline-data-collection/Sun, 24 Jan 2016 00:32:25 +0000https://yanirseroussi.com/2016/01/24/the-joys-of-offline-data-collection/Insights on data collection and machine learning from spending a month sailing, diving, and counting fish with Reef Life Survey.This holiday season, give me real insightshttps://yanirseroussi.com/2015/12/08/this-holiday-season-give-me-real-insights/Tue, 08 Dec 2015 06:57:25 +0000https://yanirseroussi.com/2015/12/08/this-holiday-season-give-me-real-insights/Some companies present raw data or information as &ldquo;insights&rdquo;. 
This post surveys some examples, and discusses how they can be turned into real insights.The hardest parts of data sciencehttps://yanirseroussi.com/2015/11/23/the-hardest-parts-of-data-science/Mon, 23 Nov 2015 04:14:21 +0000https://yanirseroussi.com/2015/11/23/the-hardest-parts-of-data-science/Defining feasible problems and coming up with reasonable ways of measuring solutions is harder than building accurate models or obtaining clean data.Migrating a simple web application from MongoDB to Elasticsearchhttps://yanirseroussi.com/2015/11/04/migrating-a-simple-web-application-from-mongodb-to-elasticsearch/Wed, 04 Nov 2015 03:53:18 +0000https://yanirseroussi.com/2015/11/04/migrating-a-simple-web-application-from-mongodb-to-elasticsearch/Migrating BCRecommender from MongoDB to Elasticsearch made it possible to offer a richer search experience to users at a similar cost, among other benefits.Miscommunicating science: Simplistic models, nutritionism, and the art of storytellinghttps://yanirseroussi.com/2015/10/19/nutritionism-and-the-need-for-complex-models-to-explain-complex-phenomena/Mon, 19 Oct 2015 00:02:32 +0000https://yanirseroussi.com/2015/10/19/nutritionism-and-the-need-for-complex-models-to-explain-complex-phenomena/Nutritionism is a special case of misinterpretation and miscommunication of scientific results – something many data scientists encounter in their work.The wonderful world of recommender systemshttps://yanirseroussi.com/2015/10/02/the-wonderful-world-of-recommender-systems/Fri, 02 Oct 2015 05:25:57 +0000https://yanirseroussi.com/2015/10/02/the-wonderful-world-of-recommender-systems/Giving an overview of the field and common paradigms, and debunking five common myths about recommender systems.You don’t need a data scientist (yet)https://yanirseroussi.com/2015/08/24/you-dont-need-a-data-scientist-yet/Mon, 24 Aug 2015 08:25:30 +0000https://yanirseroussi.com/2015/08/24/you-dont-need-a-data-scientist-yet/Hiring data scientists prematurely is wasteful and 
frustrating. Here are some questions to ask before you hire your first data scientist.Goodbye, Parse.comhttps://yanirseroussi.com/2015/07/31/goodbye-parse-com/Fri, 31 Jul 2015 03:29:50 +0000https://yanirseroussi.com/2015/07/31/goodbye-parse-com/Migrating my web apps away from Parse.com due to reliability issues. Self-hosting is a better solution.Learning about deep learning through album cover classificationhttps://yanirseroussi.com/2015/07/06/learning-about-deep-learning-through-album-cover-classification/Mon, 06 Jul 2015 22:21:42 +0000https://yanirseroussi.com/2015/07/06/learning-about-deep-learning-through-album-cover-classification/Progress on my album cover classification project, highlighting lessons that would be useful to others who are getting started with deep learning.Deep learning resourceshttps://yanirseroussi.com/deep-learning-resources/Mon, 06 Jul 2015 00:38:44 +0000https://yanirseroussi.com/deep-learning-resources/<p>This page summarises the deep learning resources I&rsquo;ve consulted in <a href="https://yanirseroussi.com/2015/06/06/hopping-on-the-deep-learning-bandwagon/">my album cover classification project</a>.</p> +<h3 id="tutorials-and-blog-posts">Tutorials and blog posts</h3> +<ul> +<li><a href="http://cs231n.github.io/" target="_blank" rel="noopener">Convolutional Neural Networks for Visual Recognition Stanford course notes</a>: an excellent resource, very up-to-date and useful, despite still being a work in progress</li> +<li><a href="http://deeplearning.net/tutorial/" target="_blank" rel="noopener">DeepLearning.net&rsquo;s Theano-based tutorials</a>: not as up-to-date as the Stanford course notes, but still a good introduction to some of the theory and general Theano usage</li> +<li><a href="http://lasagne.readthedocs.org/en/latest/" target="_blank" rel="noopener">Lasagne&rsquo;s documentation and tutorials</a>: still a bit lacking, but good when you know what you&rsquo;re looking for</li> +<li><a 
href="https://github.com/enlitic/lasagne4newbs" target="_blank" rel="noopener">lasagne4newbs</a>: Lasagne&rsquo;s convnet example with richer comments</li> +<li><a href="http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/" target="_blank" rel="noopener">Using convolutional neural nets to detect facial keypoints tutorial</a>: the resource that made me want to use Lasagne</li> +<li><a href="http://benanne.github.io/2015/03/17/plankton.html" target="_blank" rel="noopener">Classifying plankton with deep neural networks</a>: an epic post, which I found while looking for Lasagne examples</li> +<li><a href="https://en.wikipedia.org/wiki/Main_Page" target="_blank" rel="noopener">Various Wikipedia pages</a>: a bit disappointing – the above resources are much better</li> +</ul> +<h3 id="papers">Papers</h3> +<ul> +<li><a href="http://arxiv.org/abs/1412.6980" target="_blank" rel="noopener">Adam: a method for stochastic optimization (Kingma and Ba, 2015)</a>: an improvement over SGD with Nesterov momentum, AdaGrad and RMSProp, which I found to be useful in practice</li> +<li><a href="http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization" target="_blank" rel="noopener">Algorithms for Hyper-Parameter Optimization (Bergstra et al., 2011)</a>: the work behind <a href="https://github.com/hyperopt/hyperopt" target="_blank" rel="noopener">Hyperopt</a> – pretty useful stuff, not only for deep learning</li> +<li><a href="http://arxiv.org/abs/1412.1710" target="_blank" rel="noopener">Convolutional Neural Networks at Constrained Time Cost (He and Sun, 2014)</a>: interesting experimental work on the tradeoffs between number of filters, filter sizes, and depth – deeper is better (but with diminishing returns); smaller filter sizes are better; delayed subsampling and spatial pyramid pooling are helpful</li> +<li><a href="http://arxiv.org/abs/1404.7828" target="_blank" rel="noopener">Deep Learning in Neural 
Networks: An Overview (Schmidhuber, 2014)</a>: 88 pages and 888 references (35 content pages) – good for finding references, but a bit hard to follow; not so good for understanding how the various methods work and how to use or implement them</li> +<li><a href="http://arxiv.org/abs/1409.4842" target="_blank" rel="noopener">Going deeper with convolutions (Szegedy et al., 2014)</a>: the GoogLeNet paper – interesting and compelling results, especially given the improvement in performance while reducing computational complexity</li> +<li><a href="http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks" target="_blank" rel="noopener">ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., 2012)</a>: the classic paper that arguably started (or significantly boosted) the recent buzz around deep learning – many interesting ideas; fairly accessible</li> +<li><a href="http://www.cs.toronto.edu/~gdahl/papers/momentumNesterovDeepLearning.pdf" target="_blank" rel="noopener">On the importance of initialization and momentum in deep learning (Sutskever et al., 2013)</a>: applying Nesterov momentum to deep learning – good read, simple concept, interesting results</li> +<li><a href="http://jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf" target="_blank" rel="noopener">Random Search for Hyper-Parameter Optimization (Bergstra and Bengio, 2012)</a>: very compelling reasoning and experiments showing that random search outperforms grid search in many cases</li> +<li><a href="http://sergeykarayev.com/files/1311.3715v3.pdf" target="_blank" rel="noopener">Recognizing Image Style (Karayev et al., 2014)</a>: identifying image style, which is similar to album genre – found that using models pretrained on ImageNet yielded the best results in some cases</li> +<li><a href="http://arxiv.org/abs/1409.1556" target="_blank" rel="noopener">Very deep convolutional networks for large scale image recognition (Simonyan and Zisserman, 
2014)</a>: VGGNet paper – interesting experiments and architectures – deep and homogeneous</li> +<li><a href="http://arxiv.org/abs/1311.2901" target="_blank" rel="noopener">Visualizing and Understanding Convolutional Networks (Zeiler and Fergus, 2013)</a>: interesting work on visualisation, but I&rsquo;ll need to apply it to understand it better</li> +</ul>Hopping on the deep learning bandwagonhttps://yanirseroussi.com/2015/06/06/hopping-on-the-deep-learning-bandwagon/Sat, 06 Jun 2015 05:00:22 +0000https://yanirseroussi.com/2015/06/06/hopping-on-the-deep-learning-bandwagon/To become proficient at solving data science problems, you need to get your hands dirty. Here, I used album cover classification to learn about deep learning.First steps in data science: author-aware sentiment analysishttps://yanirseroussi.com/2015/05/02/first-steps-in-data-science-author-aware-sentiment-analysis/Sat, 02 May 2015 08:31:10 +0000https://yanirseroussi.com/2015/05/02/first-steps-in-data-science-author-aware-sentiment-analysis/I became a data scientist by doing a PhD, but the same steps can be followed without a formal education program.My divestment from fossil fuelshttps://yanirseroussi.com/2015/04/24/my-divestment-from-fossil-fuels/Fri, 24 Apr 2015 00:19:36 +0000https://yanirseroussi.com/2015/04/24/my-divestment-from-fossil-fuels/Recent choices I&rsquo;ve made to reduce my exposure to fossil fuels, including practical steps that can be taken by Australians and generally applicable lessons.My PhD workhttps://yanirseroussi.com/phd-work/Mon, 30 Mar 2015 03:23:33 +0000https://yanirseroussi.com/phd-work/An overview of my PhD in data science / artificial intelligence. 
Thesis title: Text Mining and Rating Prediction with Topical User Models.The long road to a lifestyle businesshttps://yanirseroussi.com/2015/03/22/the-long-road-to-a-lifestyle-business/Sun, 22 Mar 2015 09:43:47 +0000https://yanirseroussi.com/2015/03/22/the-long-road-to-a-lifestyle-business/Progress since leaving my last full-time job and setting on an independent path that includes data science consulting and work on my own projects.Learning to rank for personalised search (Yandex Search Personalisation – Kaggle Competition Summary – Part 2)https://yanirseroussi.com/2015/02/11/learning-to-rank-for-personalised-search-yandex-search-personalisation-kaggle-competition-summary-part-2/Wed, 11 Feb 2015 06:34:17 +0000https://yanirseroussi.com/2015/02/11/learning-to-rank-for-personalised-search-yandex-search-personalisation-kaggle-competition-summary-part-2/My team&rsquo;s solution to the Yandex Search Personalisation competition (finished 9th out of 194 teams).Is thinking like a search engine possible? 
(Yandex search personalisation – Kaggle competition summary – part 1)https://yanirseroussi.com/2015/01/29/is-thinking-like-a-search-engine-possible-yandex-search-personalisation-kaggle-competition-summary-part-1/Thu, 29 Jan 2015 10:37:39 +0000https://yanirseroussi.com/2015/01/29/is-thinking-like-a-search-engine-possible-yandex-search-personalisation-kaggle-competition-summary-part-1/Insights on search personalisation and SEO from participating in a Kaggle competition (finished 9th out of 194 teams).Automating Parse.com bulk data importshttps://yanirseroussi.com/2015/01/15/automating-parse-com-bulk-data-imports/Thu, 15 Jan 2015 04:41:16 +0000https://yanirseroussi.com/2015/01/15/automating-parse-com-bulk-data-imports/A script for importing data into the Parse backend-as-a-service.Stochastic Gradient Boosting: Choosing the Best Number of Iterationshttps://yanirseroussi.com/2014/12/29/stochastic-gradient-boosting-choosing-the-best-number-of-iterations/Mon, 29 Dec 2014 02:30:06 +0000https://yanirseroussi.com/2014/12/29/stochastic-gradient-boosting-choosing-the-best-number-of-iterations/Exploring an approach to choosing the optimal number of iterations in stochastic gradient boosting, following a bug I found in scikit-learn.SEO: Mostly about showing up?https://yanirseroussi.com/2014/12/15/seo-mostly-about-showing-up/Mon, 15 Dec 2014 04:25:25 +0000https://yanirseroussi.com/2014/12/15/seo-mostly-about-showing-up/Increasing SEO traffic to BCRecommender by adding content and opening up more pages for crawling. 
It turns out that thin content is better than no content.Fitting noise: Forecasting the sale price of bulldozers (Kaggle competition summary)https://yanirseroussi.com/2014/11/19/fitting-noise-forecasting-the-sale-price-of-bulldozers-kaggle-competition-summary/Wed, 19 Nov 2014 09:17:34 +0000https://yanirseroussi.com/2014/11/19/fitting-noise-forecasting-the-sale-price-of-bulldozers-kaggle-competition-summary/Summary of a Kaggle competition to forecast bulldozer sale price, where I finished 9th out of 476 teams.BCRecommender Traction Updatehttps://yanirseroussi.com/2014/11/05/bcrecommender-traction-update/Wed, 05 Nov 2014 02:29:35 +0000https://yanirseroussi.com/2014/11/05/bcrecommender-traction-update/Update on BCRecommender traction using three channels: blogger outreach, search engine optimisation, and content marketing.What is data science?https://yanirseroussi.com/2014/10/23/what-is-data-science/Thu, 23 Oct 2014 03:22:08 +0000https://yanirseroussi.com/2014/10/23/what-is-data-science/Data science has been a hot term in the past few years. Still, there isn&rsquo;t a single definition of the field. 
This post discusses my favourite definition.Greek Media Monitoring Kaggle competition: My approachhttps://yanirseroussi.com/2014/10/07/greek-media-monitoring-kaggle-competition-my-approach/Tue, 07 Oct 2014 03:21:35 +0000https://yanirseroussi.com/2014/10/07/greek-media-monitoring-kaggle-competition-my-approach/Summary of my approach to the Greek Media Monitoring Kaggle competition, where I finished 6th out of 120 teams.Applying the Traction Book’s Bullseye framework to BCRecommenderhttps://yanirseroussi.com/2014/09/24/applying-the-traction-books-bullseye-framework-to-bcrecommender/Wed, 24 Sep 2014 04:57:39 +0000https://yanirseroussi.com/2014/09/24/applying-the-traction-books-bullseye-framework-to-bcrecommender/Ranking 19 channels with the goal of getting traction for BCRecommender.Bandcamp recommendation and discovery algorithmshttps://yanirseroussi.com/2014/09/19/bandcamp-recommendation-and-discovery-algorithms/Fri, 19 Sep 2014 14:26:55 +0000https://yanirseroussi.com/2014/09/19/bandcamp-recommendation-and-discovery-algorithms/The recommendation backend for my BCRecommender service for personalised Bandcamp music discovery.Building a recommender system on a shoestring budget (or: BCRecommender part 2 – general system layout)https://yanirseroussi.com/2014/09/07/building-a-recommender-system-on-a-shoestring-budget/Sun, 07 Sep 2014 10:48:44 +0000https://yanirseroussi.com/2014/09/07/building-a-recommender-system-on-a-shoestring-budget/Iterating on my BCRecommender service with the goal of keeping costs low while providing a valuable music recommendation service.Building a Bandcamp recommender system (part 1 – motivation)https://yanirseroussi.com/2014/08/30/building-a-bandcamp-recommender-system-part-1-motivation/Sat, 30 Aug 2014 08:11:38 +0000https://yanirseroussi.com/2014/08/30/building-a-bandcamp-recommender-system-part-1-motivation/My motivation behind building BCRecommender, a free recommendation &amp; discovery service for Bandcamp music.How to (almost) win Kaggle 
competitionshttps://yanirseroussi.com/2014/08/24/how-to-almost-win-kaggle-competitions/Sun, 24 Aug 2014 12:40:53 +0000https://yanirseroussi.com/2014/08/24/how-to-almost-win-kaggle-competitions/Summary of a talk I gave at the Data Science Sydney meetup with ten tips on almost-winning Kaggle competitions.Data’s hierarchy of needshttps://yanirseroussi.com/2014/08/17/datas-hierarchy-of-needs/Sun, 17 Aug 2014 13:09:30 +0000https://yanirseroussi.com/2014/08/17/datas-hierarchy-of-needs/Discussing the hierarchy of needs proposed by Jay Kreps. Key takeaway: Data-driven algorithms &amp; insights can only be as good as the underlying data.Kaggle competition tips and summarieshttps://yanirseroussi.com/kaggle/Sat, 05 Apr 2014 23:46:10 +0000https://yanirseroussi.com/kaggle/Pointers to all my Kaggle advice posts and competition summaries.Kaggle beginner tipshttps://yanirseroussi.com/2014/01/19/kaggle-beginner-tips/Sun, 19 Jan 2014 10:34:28 +0000https://yanirseroussi.com/2014/01/19/kaggle-beginner-tips/First post! An email I sent to members of the Data Science Sydney Meetup with tips on how to get started with Kaggle competitions.About Yanir: Startup Data & AI Consultanthttps://yanirseroussi.com/about/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/about/About Yanir Seroussi, a hands-on data tech lead with over a decade of experience. Yanir helps climate/nature tech startups ship data-intensive solutions.Book a free fifteen-minute callhttps://yanirseroussi.com/free-intro-call/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/free-intro-call/Booking form for a quick intro call with Yanir Seroussi.Causal inference resourceshttps://yanirseroussi.com/causal-inference-resources/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/causal-inference-resources/<p>This is a list of some causal inference resources, which I update from time to time. 
You can also check out my posts on <a href="https://yanirseroussi.com/tags/causal-inference/">causal inference</a> and <a href="https://yanirseroussi.com/tags/a/b-testing/">A/B testing</a>.</p> +<p><strong>Books</strong>:</p> +<ul> +<li><a href="https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/" target="_blank" rel="noopener"><em>Causal Inference: What if</em></a> by Miguel Hernán and Jamie Robins: <a href="https://yanirseroussi.com/2018/12/24/the-most-practical-causal-inference-book-ive-read-is-still-a-draft/">The most practical book I&rsquo;ve read</a>. Highly recommended.</li> +<li><a href="https://experimentguide.com/" target="_blank" rel="noopener"><em>Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing</em></a> by Ron Kohavi, Diane Tang, and Ya Xu: Building on the authors&rsquo; decades of industry experience, this is pretty much the bible of online experiments, which is how causal inference is often done in practice.</li> +<li><a href="http://www.skleinberg.org/why/" target="_blank" rel="noopener"><em>Why: A Guide to Finding and Using Causes</em></a> by Samantha Kleinberg: A high-level intro to the topic. I discussed highlights in <a href="https://yanirseroussi.com/2016/02/14/why-you-should-stop-worrying-about-deep-learning-and-deepen-your-understanding-of-causality-instead/"><em>Why you should stop worrying about deep learning and deepen your understanding of causality instead</em></a>.</li> +<li><a href="http://www.skleinberg.org/causality_book/index.html" target="_blank" rel="noopener"><em>Causality, Probability, and Time</em></a> by Samantha Kleinberg: More technical than Kleinberg&rsquo;s other book. As the title suggests, the element of time is central to the methods presented in the book. However, I&rsquo;m still unsure about the practicality of those methods on real data. 
See my post <a href="https://yanirseroussi.com/2016/05/15/diving-deeper-into-causality-pearl-kleinberg-hill-and-untested-assumptions/"><em>Diving deeper into causality: Pearl, Kleinberg, Hill, and untested assumptions</em></a> for more details.</li> +<li><a href="http://bayes.cs.ucla.edu/PRIMER/" target="_blank" rel="noopener"><em>Causal Inference in Statistics: A Primer</em></a> by Judea Pearl, Madelyn Glymour, Nicholas P. Jewell: A fairly accessible introduction to Judea Pearl&rsquo;s work. I didn&rsquo;t find it that practical, but I believe it helped me understand the graphical modelling parts of <em>Causal Inference</em> by Hernán and Robins.</li> +<li><a href="https://mitpress.mit.edu/books/elements-causal-inference" target="_blank" rel="noopener"><em>Elements of Causal Inference: Foundations and Learning Algorithms</em></a> by Jonas Peters, Dominik Janzing, and Bernhard Schölkopf: The name of the book is an obvious reference to the classic book <a href="https://web.stanford.edu/~hastie/ElemStatLearn/" target="_blank" rel="noopener"><em>The Elements of Statistical Learning</em></a> by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unfortunately, the <em>Elements of Causal Inference</em> isn&rsquo;t as widely applicable as Hastie et al.&rsquo;s book – it contains some interesting ideas, but it appears that algorithms for causal learning from data with minimal assumptions aren&rsquo;t yet scalable enough for practical use. This will probably change in the future.</li> +<li><a href="http://www.mostlyharmlesseconometrics.com/" target="_blank" rel="noopener"><em>Mostly Harmless Econometrics</em></a> by Joshua D. Angrist and Jörn-Steffen Pischke: I started reading this book on my Kindle and was put off by some formatting issues. It also seemed like a less-general version of Pearl&rsquo;s work. 
I may get back to it one day.</li> +<li><a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html" target="_blank" rel="noopener"><em>Causality: Models, Reasoning, and Inference</em></a> by Judea Pearl: I haven&rsquo;t read it, and I doubt it&rsquo;d be very practical given <a href="https://www.reddit.com/r/statistics/comments/8lu1sr/causal_inference_book_recommendations/" target="_blank" rel="noopener">the opinions of people who have</a>. But maybe I&rsquo;ll get to it one day.</li> +<li><a href="http://bayes.cs.ucla.edu/WHY/" target="_blank" rel="noopener"><em>The Book of Why: The New Science of Cause and Effect</em></a> by Judea Pearl and Dana Mackenzie: An accessible overview of the field, focusing on Pearl&rsquo;s contributions, but with plenty of historical background. Worth reading to get excited about the causal revolution.</li> +<li><a href="https://www.manning.com/books/causal-machine-learning" target="_blank" rel="noopener"><em>Causal Machine Learning</em></a> by Robert Osazuwa Ness: Still a draft as of September 2022, but <a href="https://yanirseroussi.com/2022/09/12/causal-machine-learning-book-draft-review/">it looks promising</a>.</li> +</ul> +<p><strong>Articles</strong>:</p>Free Guide: Data-to-AI Health Check for Startupshttps://yanirseroussi.com/data-to-ai-health-check/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/data-to-ai-health-check/Download a free PDF guide that helps you assess a startup&rsquo;s Data-to-AI health by probing eight key areas.Helping climate & nature tech startups ship data-intensive solutionshttps://yanirseroussi.com/consult/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/consult/Consulting for climate &amp; nature tech startups: Strategic advice, implementation of Data/AI/ML solutions, and hiring help by an experienced tech leader.Speaking engagements by Yanir: Startup Data & AI Consultanthttps://yanirseroussi.com/talks/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/talks/Yanir Seroussi speaks on 
data science, artificial intelligence, machine learning, and career journey.Stay in touchhttps://yanirseroussi.com/contact/Mon, 01 Jan 0001 00:00:00 +0000https://yanirseroussi.com/contact/Contact me or subscribe to the mailing list. \ No newline at end of file