
Kaggle beginner tips

These are a few points from an email I sent to members of the Data Science Sydney Meetup. I suppose other Kaggle beginners may find it useful.

My first steps when working on a new competition are:

  • Read all the instructions carefully to understand the problem. One important thing to look at is what measure is being optimised. For example, minimising the mean absolute error (MAE) may require a different approach from minimising the mean square error (MSE).
  • Read messages on the forum. Especially when joining a competition late, you can learn a lot from the problems other people had. And sometimes there’s even code to get you started (though code quality may vary and it’s not worth relying on).
  • Download the data and look at it a bit to understand it better, noting any insights you may have and things you would like to try. Even if you don’t know how to model something, knowing what you want to model is half of the solution. For example, in the DSG Hackathon (predicting air quality), we noticed that even though we had to produce hourly predictions for pollutant levels, the measured levels don’t change every hour (probably due to limitations in the measuring equipment). This led us to try a simple “model” for the first few hours, where we predicted exactly the last measured value, which proved to be one of our most valuable insights. Stupid and uninspiring, but we did finish 6th :-). The main message is: look at the data!
  • Set up a local validation environment. This will allow you to iterate quickly without making submissions, and will increase the accuracy of your model. For those with some programming experience: local validation is your private development environment, the public leaderboard is staging, and the private leaderboard is production.
    What you use for local validation depends on the type of problem. For example, for classic prediction problems you may use one of the classic cross-validation techniques. For forecasting problems, you should try and have a local setup that is as close as possible to the setup of the leaderboard. In the Yandex competition, the leaderboard is based on data from the last three days of search activity. You should use a similar split for the training data (and of course, use exactly the same local setup for all the team members so you can compare results).
  • Get the submission format right. Make sure that you can reproduce the baseline results locally.
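The local-validation point above can be sketched in a few lines of Python. Everything here is made up for illustration: the data is a toy trend, but the holdout mirrors the Yandex example (score on the most recent days) and the baseline mirrors the DSG Hackathon trick of predicting the last measured value.

```python
import numpy as np

def time_split(timestamps, values, holdout_days=3):
    """Hold out the last `holdout_days` days to mimic a leaderboard
    that is scored on the most recent data."""
    cutoff = timestamps.max() - holdout_days
    train_mask = timestamps <= cutoff
    return values[train_mask], values[~train_mask]

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# Toy data: 30 days of a noisy upward trend.
rng = np.random.default_rng(0)
days = np.arange(30)
y = 10 + 0.5 * days + rng.normal(0, 1, 30)
train, valid = time_split(days, y)

# Persistence baseline: predict the last observed training value.
baseline_pred = np.full_like(valid, train[-1])
print(f"baseline MAE on the local holdout: {mae(valid, baseline_pred):.2f}")
```

Scoring candidate models against such a holdout lets you iterate without burning leaderboard submissions.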

Now, the way things often work is:

  • You try many different approaches and ideas. Most of them lead to nothing. Hopefully some lead to something.
  • Create ensembles of the various approaches.
  • Repeat until you run out of time.
  • Win. Hopefully.

Note that in many competitions, the differences between the top results are not statistically significant, so winning may depend on luck. But getting one of the top results also depends to a large degree on your persistence. To avoid disappointment, I think the main goal should be to learn things, so spend time trying to understand how the methods that you’re using work. Libraries like sklearn make it really easy to try a bunch of models without understanding how they work, but you’re better off trying fewer things and developing the ability to reason about why they work or don’t work.

An analogy for programmers: while you can use an array, a linked list, a binary tree, and a hash table interchangeably in some situations, understanding when to use each one can make a world of difference in terms of performance. It’s pretty similar for predictive models (though they are often not as well-behaved as data structures).

Finally, it’s worth watching this video by Phil Brierley, who won a bunch of Kaggle competitions. It’s really good, and doesn’t require much understanding of R.

Any comments are welcome!

Subscribe

    Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

    Hi Yanir!

    I have a question.

When you say: “For example, minimising the mean absolute error (MAE) may require a different approach from minimising the mean square error (MSE).”, can you explain what kind of approach (or methods, or rules of thumb) you would take to minimise MAE or MSE in machine learning?

    Thanks for your time in advance!

    Regards,

    Flavio

    Hi Flavio!

    The optimisation approach depends on the data and method you’re using.

    A basic example is when you don’t have any features, only a sample of target values. In that case, if you want to minimise the MAE you should choose the sample median, and if you want to minimise the MSE you should choose the sample mean. Here’s a proof of why: https://www.dropbox.com/s/b1195thcqebnxyn/mae-vs-rmse.pdf
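This claim is also easy to check empirically. The following sketch uses a made-up skewed sample to show that the sample median beats the sample mean on MAE, and vice versa for MSE:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=1000)  # skewed, so mean != median

def mae(c):
    return np.mean(np.abs(sample - c))

def mse(c):
    return np.mean((sample - c) ** 2)

med, mean = np.median(sample), np.mean(sample)
assert mae(med) <= mae(mean)  # the median minimises MAE
assert mse(mean) <= mse(med)  # the mean minimises MSE
```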

    For more complex problems, if you’re using a machine learning package you can often specify the type of loss function to minimise (see https://en.wikipedia.org/wiki/Loss_function#Selecting_a_loss_function). But even if your measure isn’t directly optimised (e.g., MAE is harder to minimise than MSE because it’s not differentiable at zero), you can always do cross-validation to find the parameters that optimise it.
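As a sketch of that cross-validation approach (the model, data, and parameter grid below are arbitrary choices for illustration): even though ridge regression optimises a squared loss internally, scikit-learn lets you select its regularisation strength by cross-validated MAE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.5, 200)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_absolute_error",  # select by MAE rather than the default R^2
    cv=5,
)
search.fit(X, y)
print("best alpha by cross-validated MAE:", search.best_params_["alpha"])
```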

    I hope this helps.

    Hi Yanir!

    I appreciate your work! I’d like to know: should I jump directly into machine learning algorithms and programming, or first master maths and statistics? I am new to this field.


    Can you elaborate on what you mean in Tip 5 by stating “The main scenarios when you should skip local validation is when the data is too small …”? What I experienced is that with too few observations, the leaderboard becomes very misleading, so my intuition would be to use more local validation for small datasets, not less.

    Good point. What I was referring to are scenarios where local validation is unreliable.

    For example, in the Arabic writer identification competition (http://blog.kaggle.com/2012/04/29/on-diffusion-kernels-histograms-and-arabic-writer-identification/), each of the 204 writers had only two training paragraphs (all containing the same text), while the test/leaderboard instances were a third paragraph with different content. I tried many forms of local validation but none of them yielded results that were consistent with the leaderboard, so I ended up relying on the leaderboard score.

    Ah, thanks, that clarifies what you meant. The (currently still running) Africa Soil Property contest (https://www.kaggle.com/c/afsis-soil-properties) seems a bit similar. I won’t put much more energy into that contest, but I am curious how it will work out in the end, and what things will have worked for the winners (maybe not much except pure luck).

    Could you provide some tips on #3 (‘Getting to know your data’) with respect to best-practice visualisations for gaining insights from data – especially considering the fact that datasets often have a large number of features. Plotting feature-vs-label graphs does seem helpful, but for a large number of features it will be impractical. So how should one go about data analysis via visualisation?

    It really depends on the dataset. For personal use, I don’t worry too much about pretty visualisations. Often just printing some summary statistics works well.

    Most text classification problems are hard to visualise. If, for example, you use bag of words (or n-grams) as your feature set, you could just print the top words for each label, or the top words that vary between labels. Another thing to look at would be commonalities between misclassified instances – these could be dependent on the content of the texts or their length.
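For example, “print the top words for each label” can be as simple as counting words per label. The mini-corpus below is made up for illustration; on a real dataset you might use scikit-learn’s CountVectorizer instead:

```python
from collections import Counter

# Hypothetical mini-corpus of (text, label) pairs.
docs = [
    ("great fun loved it", "pos"),
    ("loved the acting great film", "pos"),
    ("boring waste of time", "neg"),
    ("terrible boring plot", "neg"),
]

# Accumulate word counts per label.
by_label = {}
for text, label in docs:
    by_label.setdefault(label, Counter()).update(text.split())

for label, counts in by_label.items():
    print(label, counts.most_common(3))
```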

    Examples:

    • In the Greek Media Monitoring competition (http://yanirseroussi.com/2014/10/07/greek-media-monitoring-kaggle-competition-my-approach/), I found that ‘Despite being manually annotated, the data isn’t very clean. Issues include identical texts that have different labels, empty articles, and articles with very few words. For example, the training set includes ten “articles” with a single word. Five of these articles have the word 68839, but each of these five was given a different label.’ – this was discovered by just printing some summary statistics and looking at misclassified instances
    • Looking into the raw data behind one of the widely-used sentiment analysis datasets, I found an issue that was overlooked by many other people who used the dataset: http://www.cs.cornell.edu/people/pabo/movie-review-data/ (look for the comment with my name – found four years after the original dataset was published)

    I hope this helps.

    Thanks a lot! So to summarize, the following three approaches could help:

    1. Using summary statistics such as means, standard deviations, and variances, and looking out for outliers in the data
    2. Looking at misclassified instances during validation to find some sort of pattern in them
    3. Looking at label-specific raw data

    I apologize for the long overdue response, and thanks for these tips. This will surely be useful in my next Kaggle competition.
    Reblogged this on Dr. Manhattan’s Diary and commented:

    Building a Bandcamp recommender system (part 1 – motivation)

    I’ve been a Bandcamp user for a few years now. I love the fact that they pay out a significant share of the revenue directly to the artists, unlike other services. In addition, despite the fact that fans may stream all the music for free and even easily rip it, almost $80M has been paid out to artists through Bandcamp to date (including almost $3M in the last month) – serving as strong evidence that the traditional music industry’s fight against piracy is a waste of resources and time.

    One thing I’ve been struggling with since starting to use Bandcamp is the discovery of new music. Originally (in 2011), I used the browse-by-tag feature, but it is often too broad to find music that I like. A newer feature is the Discoverinator, which is meant to emulate the experience of browsing through covers at a record store – sadly, I could never find much stuff I liked using that method. Last year, Bandcamp announced Bandcamp for fans, which includes the ability to wishlist items and discover new music by stalking/following other fans. In addition, they released a mobile app, which made the music purchased on Bandcamp much easier to access.

    All these new features definitely increased my engagement and helped me find more stuff to listen to, but I still feel that Bandcamp music discovery could be much better. Specifically, I would love to be served personalised recommendations and be able to browse music that is similar to specific tracks and albums that I like. Rather than waiting for Bandcamp to implement these features, I decided to do it myself. Visit BCRecommender – Bandcamp recommendations based on your fan account to see where this effort stands at the moment.

    While BCRecommender has already helped me discover new music to add to my collection, building it gave me many more ideas on how it can be improved, so it’s definitely a work in progress. I’ll probably tinker with the underlying algorithms as I go, so recommendations may occasionally seem weird (but this always seems to be the case with recommender systems in the real world). In subsequent posts I’ll discuss some of the technical details and where I’d like to take this project.


    It’s probably worth noting that BCRecommender is not associated with or endorsed by Bandcamp, but I doubt they would mind since it was built using publicly-available information, and is full of links to buy the music back on their site.


      Hi!

      I just found these articles a few years after their publication… I saw that the BCRecommender seems not active anymore and that the last post is from 2015.

      Any update? I’m interested to have your feedback.

      Thanks,

      Clément

      So true. Thanks for saying it so well.


        Hi, very nice trick! Trying to implement this as we speak, does this code still work? I get to the collections page, but I don’t think the upload is working. I’m new to Phantomjs. Thanks!
        Hi Walter! Yeah, the code stopped working when Parse redesigned their website. I never fixed it because I ended up porting my projects away from Parse. If you fix it let me know and I’ll update this post. By the way, you may find it easier to use Selenium (or something similar) as a wrapper around PhantomJS, as it should result in cleaner code. For example, check out Python’s Selenium bindings: http://selenium.googlecode.com/svn/trunk/docs/api/py/index.html

          I do not understand how your feature set helped the model learn anything. For example, user_num_query_actions (the number of queries performed by the user): how will it affect the order of search results for a new/test query?


          “What I really wanted was a stable part-time gig.”: They’re remarkably hard to find. It’s an absurdity of our time that many people are overemployed - selling more of their time than they want for more money than they need - even while many other people are underemployed - unable to sell enough of their time for enough money to live comfortably.
          That’s very true. The interesting thing is that it’s a problem that is not unique to this century. It was discussed by Thoreau in Walden (1854), Bertrand Russell in In Praise of Idleness (1932), and David Graeber in On the Phenomenon of Bullshit Jobs (2013), to name a few. People seem to be worried about robots taking their jobs, but the scarier thought is that robots will never take our jobs, because we’ll keep coming up with ways of staying employed rather than enjoy the affluence afforded by technological advancements.


          Thanks for sharing your standpoint on this.


          Thanks for the stimulation. I’m still fascinated by the lure to extract sentiment from text, but it seems like so often the sentiment that the author intended never fully came to expression in the text. Maybe an interdisciplinary approach will be required to teach machines to parse the intentions implicit in text, and, like other media phenomena, a loop will have to form: perhaps the awareness that explicit intentions and sentiment are of benefit to authors in a world that (one day) automates the sorting of all its documents will cause writing styles to adapt. The effect of the best we can on what we’re doing now is one of those things you begin to see a pattern in. Here’s an API that correlates patterns of unstructured info: dev.keywordmeme.com Would love your feedback. Let me know if it’s useful to you or if you have any comments. Well done on the carbon post, btw. Glad I found your blog.

          Thank you for the comment! I agree that analysing sentiment is very tricky due to the fact that people often don’t express themselves so well. If I remember correctly, inter-annotator agreement on some sentiment analysis tasks is only 70-80%, so it’s unlikely that machines will ever achieve perfect performance.

          dev.keywordmeme.com redirects to a github page – where is the API?

          You bet. I’m fascinated to see what seems to be a real live push toward an interdisciplinary approach. That 70-80% performance might be pushed over the hump by humans with special training until such a time as the process can be formalized. It looks like auditing ML-driven processes could be a new category of employment through this next technological plateau. The human-machine relationship in a friendly old configuration! Sorry about the link. This should work: http://www.keywordmeme.com/. It makes you register, just a heads up. Hit the engineers up on github if you have any questions or if things aren’t working. Which is possible. Take care! :)

          Hi Yanir

          Thank you very much for this post. Helpful for somebody like me seeking to become a data scientist.

          I’m a software engineer, currently master data architect.

          I’m taking MOOCs in order to fill the gaps, so let’s say I’m on a good track :)

          However, once I’ve found a problem and got my hands dirty, how do I find a mentor? And afterwards, how do I get published?

          I think this would be hard via academic channels.

          Finding a mentor depends on where you are. Good places to start would be your current workplace (if you work with data scientists), or local meetups (if there are any in your area). Another option would be to contribute to open source projects in the field as a way of getting to know people and getting feedback. Finally, there are courses like the one by Thinkful, where you can pay to be mentored.

          Regarding getting published, I agree that it’d be hard to get published in many academic venues without help from people who know how it’s done. However, you can always start your own blog and link to it from places like Reddit and DataTau. Even if you don’t get any feedback, publishing often forces you to think more deeply about the subject of your article.

          At the workplace, it will be a bit hard.

          I Live in Paris, meetups would be a good option.

          You’re right, publishing forces you to think more deeply; feedback from readers is also a good way to learn.


          I think it’s all about what you expect. We used Parse for prototypes and it’s been working great for us so far. I actually think we still have one of the prototypes running over it which we haven’t touched in almost a year (a mobile/web strategy game now only available on FB - https://apps.facebook.com/foresttribes/). It’s been great since it also saved us the need to develop a backend admin tool to manage/balance the game or add additional content.

          Overall, I never had a real live public product running on Parse to comment on the experience, but for prototypes I’m perfectly happy with the service.

          Agreed, it’s perfectly fine for prototypes, but a bit too unreliable for public-facing live products. If Parse were more robust, it’d be perfect for many use cases.


          I enjoyed the post - though I offer some contrary points to consider:

          I have learned that if it is clear that you will need a data scientist (judged by someone who knows what they do), then you should get them as soon as possible. Don’t wait. Data scientists work best when they have full context for the problem they are here to solve. Getting them in early allows them to help frame the problem. This framing is critical. If the framing is off, it takes a very long time (sometimes never) to get it back on track. A late-to-the-game data scientist can be too influenced by the existing framing they are given. They tend to think within that box, when in reality, the box was never the right way to approach the problem. Even if they do see outside of it, it can be very difficult to convince the original framers that there is a better way to do things (people can get quite attached to their vision).

          It also can be wise to NOT WAIT till there is data to analyze. Too often, data is an afterthought. It’s important for the data scientist to get in early on the initiative so he or she can help define the needed instrumentation and data acquisition strategy. They can even guide the needs of the data warehouse and other repositories where the newly captured data will reside.

          Further, it is often the case that it is the data scientist that identifies the specific problem to solve. At my company, I estimate that over half of the ideas for new data products, features, and services come from the data science team – not the business. This is intuitive as the data scientists are the folks that are most intimate with the data and are least constrained by what is possible to do with data. Give them business context and they will come up with problems/solutions that no one has thought of.

          Finally, I find heuristics to be dangerous. At best they are suboptimal, and more often than not, they are just plain wrong (those with extensive A/B testing experience can attest to the fact that our intuition fails us again and again). Undoing a bad heuristic can be very painful - in the technical work, the coordination work, and in the resetting of expectations. It’s hard to get people to not walk on a paved path … even if that path is the long way or a dead-end.

          I totally agree with “Q5: Are you committed to being data-driven?”. This comes down to business model and culture. Is your business model one where data science can be the source of strategic differentiation? Is your culture able to support empiricism? The answer to both of these has to be ‘yes’ in order to commit to being data-driven.

          Thank you for your thoughtful comments, Eric!

          I generally agree that it can be beneficial to involve data scientists early on and to avoid thoughtless heuristics, but that it all depends on having a supportive data-driven environment and on resource constraints. As mentioned under Q2, getting advice from a data scientist in the early stages of the product is worthwhile, so it may be smart to pay for a few days of consulting, but not necessarily a good idea to hire a full-timer. A lot of it depends on the general product vision.

          Another note regarding heuristics and intuition: While some may be dangerous, you can view many modelling decisions as heuristics. For example, when building a predictive model, you have to make some intuition-driven choices around features (no model uses all the knowledge in the world), learning algorithms and their hyperparameters. You just can’t test everything, so there’s a need for compromises if you aim to ever deliver anything.


          How did you arrive at the conclusion that accuracy doesn’t matter? The Netflix quote and chart don’t seem connected to me. The quote refers to the massive ensembling used to achieve the challenge score threshold. The chart seems to show that you can go a long way from the baseline by improving features and models.

          I’d say accuracy, or more generally, the score used for evaluation, doesn’t matter as long as it’s good enough. However, it’s not that easy to arrive at “good enough”. Consider Spotify: I find their daily recommendations abysmal. Discover Weekly is much better, but still has room to improve.

          I agree. I said that predictive accuracy has some importance, but it is not the only thing that matters. You’re right about it needing to be good enough, where the definition of good enough is domain-dependent.

          Daniel Lopresti said it well years ago (he spoke about web search but it applies to recommendation scenarios where suggestions are browsable):

          Browsing is a comfortable and powerful paradigm (the serendipity effect). Search results don't have to be very good. Recall? Not important (as long as you get at least some good hits). Precision? Not important (as long as at least some of the hits on the first page you return are good).

          It is not clear to me why accuracy is not important in recommenders and search?

          It is important, but its importance tends to be exaggerated to the exclusion of all other metrics. As I said in the post, things like the way you present your results (UI/UX) and novelty/serendipity are also very important. In addition, the goal of the system is often to optimise a different goal from offline accuracy, such as revenue or engagement. In such cases it is best to focus on what you want to improve rather than offline accuracy.

          By the way, I attended a talk by Ted Dunning a few months ago, where he said that one of the most important tweaks in real-life recommender systems is adding random recommendations (essentially decreasing offline accuracy). This allows the system to learn from user feedback on a wider range of items, improving performance in the long run.
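A minimal sketch of that exploration idea (an epsilon-greedy mix; the names and the mixing scheme are made up for illustration, not Dunning's actual method):

```python
import random

def recommend(ranked_items, catalogue, k=5, epsilon=0.1, seed=0):
    """Return top-k recommendations, replacing each slot with a random
    catalogue item with probability epsilon, to keep gathering feedback
    on the long tail."""
    rng = random.Random(seed)
    recs = []
    for item in ranked_items[:k]:
        if rng.random() < epsilon:
            recs.append(rng.choice(catalogue))
        else:
            recs.append(item)
    return recs

ranked = ["a", "b", "c", "d", "e", "f"]
catalogue = ranked + ["x", "y", "z"]
print(recommend(ranked, catalogue))
```

With epsilon set to zero this reduces to plain top-k; the random slots are what let the system learn about items the current model never surfaces.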

          Thank you very much for your fast response.

          [Image: PHD Comics: Science News Cycle]

          Selling your model with simple explanations

          People like simple explanations for complex phenomena. If you work as a data scientist, or if you are planning to become/hire one, you’ve probably seen storytelling listed as one of the key skills that data scientists should have. Unlike “real” scientists that work in academia and have to explain their results mostly to peers who can handle technical complexities, data scientists in industry have to deal with non-technical stakeholders who want to understand how the models work. However, these stakeholders rarely have the time or patience to understand how things truly work. What they want is a simple hand-wavy explanation to make them feel as if they understand the matter – they want a story, not a technical report (an aside: don’t feel too smug, there is a lot of knowledge out there and in matters that fall outside of our main interests we are all non-technical stakeholders who get fed simple stories).

          One of the simplest stories that most people can understand is the story of correlation. Going back to the running example of predicting health based on diet, it is well-known that excessive consumption of certain fats under certain conditions is correlated with an increase in likelihood of certain diseases. This is simplified in some stories to “consuming more fat increases your chance of disease”, which leads to the conclusion that consuming no fat at all decreases the chance of disease to zero. While this may sound ridiculous, it’s the sad reality. According to a recent survey, while the image of fat has improved over the past few years, 42% of Americans still try to limit or avoid all fats.

          A slightly more involved story is that of linear models – looking at the effect of the most important factors, rather than presenting a single factor’s contribution. This storytelling technique is commonly used even with non-linear models, where the most important features are identified using various techniques. The problem is that people still tend to interpret this form of presentation as a simple linear relationship. Expanding on the previous example, this approach goes from a single-minded focus on fat to the need to consume less fat and sugar, but more calcium, protein and vitamin D. Unfortunately, even linear models with tens of variables are hard for people to use and follow. In the case of nutrition, few people really track the intake of all the nutrients covered by recommended daily intakes.

          Few interesting relationships are linear

          Complex phenomena tend to be explained by complex non-linear models. For example, it’s not enough to consume the “right” amount of calcium – you also need vitamin D to absorb it, but popping a few vitamin D pills isn’t going to work well if you don’t consume them with fat, though over-consumption of certain fats is likely to lead to health issues. This list of human-friendly rules can go on and on, but reality is much more complex. It is naive to think that it is possible to predict something as complex as human health with a simple linear model that is based on daily nutrient intake. That being said, some relationships do lend themselves to simple rules of thumb. For example, if you don’t have enough vitamin C, you’re very likely to get scurvy, and people who don’t consume enough vitamin B1 may contract beriberi. However, when it comes to cancers and other diseases that take years to develop, linear models are inadequate.

          An accurate model to predict human health based on diet would be based on thousands to millions of variables, and would consider many non-linear relationships. It is fairly safe to assume that there is no magic bullet that simply explains how diet affects our health, and no superfood is going to save us from the complexity of our nutritional needs. It is likely that even if we had such a model, it would not be completely accurate. All models are wrong, but some models are useful. For example, the vitamin C versus scurvy model is very useful, but it is often wrong when it comes to predicting overall health. Predictions made by useful complex models can be very hard to reason about and explain, but it doesn’t mean we shouldn’t use them.

          The ongoing quest for sellable complex models

          All of the above should be pretty obvious to any modern data scientist. The culture of preferring complex models with high predictive accuracy to simplistic models with questionable predictive power is now prevalent (see Leo Breiman’s 2001 paper for a discussion of these two cultures of statistical modelling). This is illustrated by the focus of many Kaggle competitions on producing accurate models and the recent successes of deep learning for computer vision. Especially with deep learning for vision, no one expects a handful of variables (pixels) to be predictive, so traditional explanations of variable importance are useless. This does lead to a general suspicion of such models, as they are too complex for us to reason about or fully explain. However, it is very hard to argue with the empirical success of accurate modelling techniques.

          Nonetheless, many data scientists still work in environments that require simple explanations. This may lead some data scientists to settle for simple models that are easier to sell. In my opinion, it is better to make up a simple explanation for an accurate complex model than settle for a simple model that doesn’t really work. That being said, some situations do call for simple or inflexible models due to a lack of data or the need to enforce strong prior assumptions. In Albert Einstein’s words, “it can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience”. Make things as simple as possible, but not simpler, and always consider the interests of people who try to sell you simplistic (or unnecessarily complex) explanations.

            Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.


            Excellent, excellent common-sense article, which seems to be very uncommon nowadays in a hype-filled world. Thanks for reminding us that accurate problem description should trump everything else!

            Thank you for a great article. Yes, well-defined problems and well-defined performance evaluation are key to designing any data-driven model.

            I also found that sometimes we have the question we want to pursue, but getting to an answer is not straightforward. For instance, I’m trying to find affinity between food ingredients using only data analytics. One may think that this problem is trivial. In fact, to find this, one has to totally rethink how to represent the data (having ingredients in a table or a dataset produced nothing). Yes, finding affinity between two ingredients is trivial, but as the number grows, one has to change the setting. In my case, I had to think of ingredients as part of a complete network, where the network is a recipe. It was then, and only then, that I was able to find affinity between many ingredients.

            Yes, many good points here, thanks for this. There is even another difficulty apart from problem definition and solution measurement: the semantics of the data itself. Are the definitions real (referring to other concepts) or nominal (“a cheeseburger is a burger with cheese”)? Scope and context can easily be lost, and can only be put back by a human being taking a decision; no amount of empirical modelling can re-discover this. Also, the precision and accuracy of the data may be unknown and/or insufficient to solve the problem posed. If you have detected these issues, sometimes you can re-formulate the problem, but typically it’s not clear from the column headings alone (if you even have these). Even worse, the definitions may be incoherent or nonsensical: e.g., in classical econometric modelling, the definition of a rational agent entails that the agent have knowledge of the future!
            Thanks Andrew! I agree that often what you can do is very limited by the data. I’ve also encountered cases where I had to infer meaning from cryptic column names. In many cases the small arbitrary decisions that we make along the way can have a major influence on the final results!

            Is this article a Poe? The amount of muddled priors throughout it is disturbing. The word “sophistry” keeps leaping to mind. E.g.:

            > For instance, if a government embarks on the building of a pyramid […] s/“a government”/Paris/g

            What would real LinkedIn insights look like? First, I think that the focus on profile views is somewhat misguided. It’s not that hard to artificially generate profile views – simply view other people’s profiles. There is no intrinsic value in someone having viewed your profile – the value comes from a connection that leads to an interesting offer or conversation. Second, LinkedIn is about professional networking that is based on real-world activity. As such, it only forms a small part of the world of professional networking by allowing people to have an online presence that makes them contactable by people they don’t already know. When it comes to insights, it’d be useful to know the true causal factors that lead to interesting connections – much more useful than suggestions such as “add software development as a skill on your profile to get up to 3% more profile views”.

            Summary: Real insights are about the why

            There are many other examples of pseudo-insights out there. The reason is probably that the field of analytics is becoming increasingly commoditised, and it is easier to rebrand an analytics dashboard as an insights dashboard than to provide real insights. Providing real insights requires moving up the DIKW pyramid from data and information to knowledge and wisdom – from describing the past to learning general lessons that allow you to influence the future. Providing real insights can be very hard, as it often requires inferring the causes of events – the why that comes after the what and how. More on this later – I have just started reading Samantha Kleinberg’s Why: A Guide to Finding and Using Causes and will report (hopefully real) insights on causality in future posts.

              Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

              Nice post. Mostly agree. Insights are hard to automate, though, but we (the WordPress.com Data Team) are working on it.

              Some of the things we’ve found that have the biggest impact on successful blogging are:

              • Turning on Publicize so your posts are pushed out to various social media channels
              • Publishing regularly. It doesn’t have to be daily, but it does need to be regular. We still don’t understand how the periodicity plays into this.
              • Including images in posts, which is correlated with more traffic.

              There’s still a lot to learn here. Interested in helping? https://automattic.com/work-with-us/data-wrangler/ :)

              Thanks Greg! All those factors make sense. Personally, I prefer sharing posts manually to turning on Publicize, but I suppose it has the same effect. My guess is that one of the reasons why images are important is that having at least one image makes posts stick out when shared on social media.

              By the way, I did apply for the data wrangler position a couple of months ago but never heard back. It’s probably too late now, as I have a different position (and a few other options) lined up when I get home from vacation next month :)

              Hey Yanir, that’s embarrassing. :)

              Sorry we haven’t gotten back to you yet. I do see you in our queue. It’s been a busy two months, so we’re a bit backed up, but we’re getting back on track in the next week or two. Certainly understand if that doesn’t fit into your own timeline. Sorry if that ends up being the case.


              It seems to me that causality is another of our thought conveniences, just one more attempt at linearising our frustratingly non-linear existence, akin to teaching with Newtonian physics, segue to Einstein’s relativistic mechanics when the kids are ready (if ever). Cyclic systems can self-perpetuate in non-repeating cycles (chaos theory) but also respond with or resist change arising from external inputs. I believe when people speak of causality, what they are really thinking about (and desiring) is a conversation around stability versus volatility.

              Hey Yanir - great post.

              If you’ve not already, you should read Mostly Harmless Econometrics. They take quite a different approach to causality than Pearl (though there is a lot of conceptual overlap). It definitely helps build intuition for the topic. It’s also worth reading the relevant mid-70s papers from Rubin.

              Thanks for the pointers, Jim! I’ll check those resources out.
              It’s not a different approach. The notation is different but the two frameworks (Pearl’s and Neyman-Rubin) have been proved equivalent.
              I took a look at the Amazon sample for Causality, Probability and Time but I doubt if I’ll buy it just yet. I’ve got Judea Pearl’s Probabilistic Reasoning in Intelligent Systems already and think I want to work through that in a programming language (R is my first choice) before buying any more books. ;-)
              I appreciate this post. I teach General Psychology, and this is a central issue that I present to my students. In the meantime, I regularly come across articles, in peer-reviewed as well as mainstream publications, which discuss correlational data as if it were supporting a causal relationship. As I tell my students, one of the difficulties is the use of the word “factor” in both types of discussions. In correlation, factors are pieces of information which give you a more likely guess about an unknown piece of information. In causation, factors are things that contribute to something else existing. Both concepts feed the mind’s desire to find patterns in the relevant world which inform our decisions/behaviors so that we can continue living, hopefully in a pleasant state. We are often tricked by these patterns (illusions, etc.), but most of the time they pan out in a beneficial way. Making the leap from “this is how things tend to work in my immediate experience” to “this is how things work everywhere for everyone” is where theories are born, where science lives, and where we often make mistakes along the way. Proceed with caution from observation to theory, but by all means, proceed!
              Really great read! This is something many of my colleagues have discussed in the past. Here’s an article that might help us get closer to causality with observational data: http://goo.gl/MP7WQo, and here is a video about it: https://www.youtube.com/watch?v=uhONGgfx8Do
              The problem with the search for causality (or, more generally, explainability) is that in many cases it is “not interesting”. If I click on Google search results, neither I nor Google’s algorithm developers are truly interested in how the algorithm decided to rank Page A before Page B. It is OK for me, as an end user, not to care about those details, just as I don’t care about hydraulics every time I take a shower. Is it OK for me, as a data scientist, not to care about the reasons behind my models? Honestly, I don’t yet know.

              I agree that in many cases the reasoning behind models isn’t interesting, as long as the models produce satisfactory results. Web search is actually a good example. Yes, many end users don’t really care how Google ranks pages, but SEO practitioners go to great lengths to understand search algorithms and get pages to rank well (see https://moz.com/search-ranking-factors for example).

              As data scientists, it’s important to consider model stability in production. Sculley et al. said it well in their paper on machine learning technical debt (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43146.pdf): “Machine learning systems often have a difficult time distinguishing the impact of correlated features. This may not seem like a major problem: if two features are always correlated, but only one is truly causal, it may still seem okay to ascribe credit to both and rely on their observed co-occurrence. However, if the world suddenly stops making these features co-occur, prediction behavior may change significantly.”
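The failure mode Sculley et al. describe is easy to reproduce with a toy model (synthetic data; the feature names are mine, not from the paper). Here a duplicated feature gets half the credit during training, and predictions degrade badly once the co-occurrence stops holding:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
causal = rng.normal(0, 1, n)
proxy = causal.copy()                       # always co-occurs with `causal`
y = 2.0 * causal + rng.normal(0, 0.1, n)    # only `causal` truly drives y

# The minimum-norm least squares solution splits the credit between the two
# identical features instead of attributing it all to the causal one.
X = np.column_stack([causal, proxy])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted weights:", np.round(coefs, 2))

# The world stops making the features co-occur: `proxy` is now independent.
causal_new = rng.normal(0, 1, n)
proxy_new = rng.normal(0, 1, n)
pred = np.column_stack([causal_new, proxy_new]) @ coefs
rmse = np.sqrt(np.mean((pred - 2.0 * causal_new) ** 2))
print(f"RMSE after the shift: {rmse:.2f}")  # far above the ~0.1 noise level
```

In-sample everything looks fine, because crediting either feature gives the same predictions; the damage only shows up when the correlation structure changes in production.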

              Finally, in many cases what we really care about is interventionality. I don’t think it’s a real word, but what it means is that you don’t really care whether A causes B, you want to know whether intervening to change A would change B. These inferences are critical in fields like medicine and marketing, but we can look at an example from the world of blogging, which is probably more relevant to you. Many bloggers would like to attract more readers. A possible costly intervention would be to switch platforms from WordPress to Medium. Cheaper interventions may be changing the site’s layout, writing titles that get people interested, and posting links to your content on relevant channels. Another intervention would be trying to post at different times (as implied by WordPress insights and discussed in http://yanirseroussi.com/2015/12/08/this-holiday-season-give-me-real-insights/). Obviously, one would like to apply the interventions with the highest return on investment first, and data that helps with ranking the interventions is very interesting.

              James Woodward’s Making Things Happen gives a fantastic, relatively non-technical analysis of causation that fits well with Pearl’s approach.
              Thanks! I’ll check it out.
              I’ve been thinking about this lately quite a bit. The fact that I can type this comment and send it across the internet rests on the ability to create a completely controlled causal environment. Inside the computer, all noise and randomness is kept below the threshold of the data, and every process is completely causal. Meanwhile, outside the computer, most measurements are mostly noise, and extracting any sort of causal relation is very difficult and often impossible. My mind seems to have some sort of idea of cause as something like the interaction of balls on a pool table. The cue ball strikes the eight ball and knocks it into the corner pocket, etc. But when one tries to measure things, mostly one finds nothing like this. Instead, one finds that some measurements tend to be found with other measurements most of the time, but not all of the time. Cause thus seems a statistical thing, and in no way absolute. I have difficulty reconciling the two views. One thing that occurred to me to investigate was the manner in which several huge internet outages developed involving the BGP protocol. It seemed to me that every individual packet must experience a completely causal path, but the aggregate turns into the statistical causal form we most usually deal with. I haven’t followed up on this idea so far, however.
              Interesting. I think that one of the dividing factors between traditional software engineering and data science is the attitude towards uncertainty. Whereas, as you say, coding is all about creating a controlled deterministic environment, data science and statistics thrive on uncertainty. It’s similar with computer networks as well, where there is always a non-deterministic element (e.g., packets may be lost, arrive out-of-order, or come in bursts).

              Causation, Prediction, and Search is also seminal (https://www.cs.cmu.edu/afs/cs.cmu.edu/project/learn-43/lib/photoz/.g/scottd/fullbook.pdf).

              Disclosure: I did my PhD just down the street from the authors of Causation, Prediction, and Search, and Woodward was on my thesis committee.

              There is a subtle difference between Woodward’s approach and that of Pearl and of Spirtes et al., which Glymour discusses in the following places:

              https://www.ncbi.nlm.nih.gov/pubmed/24887161 http://repository.cmu.edu/cgi/viewcontent.cgi?article=1280&context=philosophy

              Basically, Woodward starts with the notion of an intervention on a variable and defines other concepts (e.g. direct cause) in terms of it, whereas Pearl and Spirtes et al. start with the notion of direct cause. One consequence of this difference is that properties like sex and race that cannot be intervened upon in a straightforward way cannot be causes for Woodward, strictly speaking, but can be for Pearl and Spirtes et al. This is a fine point, however, and it’s very nearly true that they simply provide alternative formulations of the same theory, with Woodward focusing on conceptual issues and the others focusing on methodology.

              Thanks for all the pointers, Greg! I’ll definitely check them out. Personally, I have a slight bias towards Pearl, as he is my academic grandfather (he was my advisor’s advisor), but I’m keen on learning as much as possible on all the different approaches to causality. It is a fascinating area!
              “Thinking, Fast & Slow” touches on some of this in later chapters. Some algebra is used to help illustrate the deception that causality, and efforts towards finding it, can cause.
              Thanks! That book has been on my to-read list for a while now.
              Great post, thanks Yanir. I have been afraid of being lost in big data, i.e., swarmed by such a vast amount of correlations.

              The rise of greedy robots

              Given the impressive advancement of machine intelligence in recent years, many people have been speculating on what the future holds when it comes to the power and roles of robots in our society. Some have even called for regulation of machine intelligence before it’s too late. My take on this issue is that there is no need to speculate – machine intelligence is already here, with greedy robots already dominating our lives.

              Machine intelligence or artificial intelligence?

              The problem with talking about artificial intelligence is that it creates an inflated expectation of machines that would be completely human-like – we won’t have true artificial intelligence until we can create machines that are indistinguishable from humans. While the goal of mimicking human intelligence is certainly interesting, it is clear that we are very far from achieving it. We currently can’t even fully simulate C. elegans, a 1mm worm with 302 neurons. However, we do have machines that can perform tasks that require intelligence, where intelligence is defined as the ability to learn or understand things or to deal with new or difficult situations. Unlike artificial intelligence, there is no doubt that machine intelligence already exists.

              Airplanes provide a famous example: we don’t commonly think of them as performing artificial flight – they are machines that fly faster than any bird. Likewise, computers are super-intelligent machines. They can perform calculations that humans can’t, store and recall enormous amounts of information, translate text, play Go, drive cars, and much more – all without requiring rest or food. The robots are here, and they are becoming increasingly useful and powerful.

              Who are those greedy robots?

              Greed is defined as a selfish desire to have more of something (especially money). It is generally seen as a negative trait in humans. However, we have been cultivating an environment where greedy entities – for-profit organisations – thrive. The primary goal of for-profit organisations is to generate profit for their shareholders. If these organisations were human, they would be seen as the embodiment of greed, as they are focused on making money and little else. Greedy organisations “live” among us and have been enjoying a plethora of legal rights and protections for hundreds of years. These entities, which were formed and shaped by humans, now form and shape human lives.

              Humans running for-profit organisations have little choice but to play by their rules. For example, many people acknowledge that corporate tax avoidance is morally wrong, as revenue from taxes supports the infrastructure and society that enable corporate profits. However, any executive of a public company who refuses to do everything they legally can to minimise their tax bill is likely to lose their job. Despite being separate from the greedy organisations we run, humans have to act greedily to effectively serve their employers.

              The relationship between greedy organisations and greedy robots is clear. Much of the funding that goes into machine intelligence research comes from for-profit organisations, with the end goal of producing profit for these entities. In the words of Jeffrey Hammerbacher: “The best minds of my generation are thinking about how to make people click ads.” Hammerbacher, an early Facebook employee, was referring to Facebook’s business model, where considerable resources are dedicated to getting people to engage with advertising – the main driver of Facebook’s revenue. Indeed, Facebook has hired Yann LeCun (a prominent machine intelligence researcher) to head its artificial intelligence research efforts. While LeCun’s appointment will undoubtedly result in general research advancements, Facebook’s motivation is clear – they see machine intelligence as a key driver of future profits. They, and other companies, use machine intelligence to build greedy robots, whose sole goal is to increase profits.

              Greedy robots are all around us. Advertising-driven companies like Facebook and Google use sophisticated algorithms to get people to click on ads. Retail companies like Amazon use machine intelligence to mine through people’s shopping history and generate product recommendations. Banks and mutual funds utilise algorithmic trading to drive their investments. None of this is science fiction, and it doesn’t take much of a leap to imagine a world where greedy robots are even more dominant. Just like we have allowed greedy legal entities to dominate our world and shape our lives, we are allowing greedy robots to do the same, just more efficiently and pervasively.

              Will robots take your job?

              The growing range of machine intelligence capabilities gives rise to the question of whether robots are going to take over human jobs. One salient example is that of self-driving cars, which are projected to render millions of professional drivers obsolete in the next few decades. The potential impact of machine intelligence on jobs was summarised very well by CGP Grey in his video Humans Need Not Apply. The main message of the video is that machines will soon be able to perform any job better or more cost-effectively than any human, thereby making humans unemployable for economic reasons. The video ends with a call to society to consider how to deal with a future where there are simply no jobs for a large part of the population.

              Despite all the technological advancements since the start of the industrial revolution, the prevailing mode of wealth distribution remains paid labour, i.e., jobs. The implication of this is that much of the work we do is unnecessary or harmful – people work because they have no other option, but their work doesn’t necessarily benefit society. This isn’t a new insight, as the following quotes demonstrate:

              • “Most men appear never to have considered what a house is, and are actually though needlessly poor all their lives because they think that they must have such a one as their neighbors have. […] For more than five years I maintained myself thus solely by the labor of my hands, and I found that, by working about six weeks in a year, I could meet all the expenses of living.” – Henry David Thoreau, Walden (1854)
              • “I think that there is far too much work done in the world, that immense harm is caused by the belief that work is virtuous, and that what needs to be preached in modern industrial countries is quite different from what always has been preached. […] Modern technique has made it possible to diminish enormously the amount of labor required to secure the necessaries of life for everyone. […] If, at the end of the war, the scientific organization, which had been created in order to liberate men for fighting and munition work, had been preserved, and the hours of the week had been cut down to four, all would have been well. Instead of that the old chaos was restored, those whose work was demanded were made to work long hours, and the rest were left to starve as unemployed.” – Bertrand Russell, In Praise of Idleness (1932)
              • “In the year 1930, John Maynard Keynes predicted that technology would have advanced sufficiently by century’s end that countries like Great Britain or the United States would achieve a 15-hour work week. There’s every reason to believe he was right. In technological terms, we are quite capable of this. And yet it didn’t happen. Instead, technology has been marshaled, if anything, to figure out ways to make us all work more. In order to achieve this, jobs have had to be created that are, effectively, pointless. Huge swathes of people, in Europe and North America in particular, spend their entire working lives performing tasks they secretly believe do not really need to be performed. The moral and spiritual damage that comes from this situation is profound. It is a scar across our collective soul. Yet virtually no one talks about it.” – David Graeber, On the Phenomenon of Bullshit Jobs (2013)

              This leads to the conclusion that we are unlikely to experience the utopian future in which intelligent machines do all our work, leaving us ample time for leisure. Yes, people will lose their jobs. But it is not unlikely that new unnecessary jobs will be invented to keep people busy, or worse, many people will simply be unemployed and will not get to enjoy the wealth provided by technology. Stephen Hawking summarised it well recently:

              If machines produce everything we need, the outcome will depend on how things are distributed. Everyone can enjoy a life of luxurious leisure if the machine-produced wealth is shared, or most people can end up miserably poor if the machine-owners successfully lobby against wealth redistribution. So far, the trend seems to be toward the second option, with technology driving ever-increasing inequality.

              Where to from here?

              Many people believe that the existence of powerful greedy entities is good for society. Indeed, there is no doubt that we owe many beneficial technological breakthroughs to competition between for-profit companies. However, a single-minded focus on profit means that in many cases companies do what they can to reduce their responsibility for harmful side-effects of their activities. Examples include environmental pollution, multinational tax evasion, and health effects of products like tobacco and junk food. As history shows us, in truly unregulated markets, companies would happily utilise slavery and child labour to reduce their costs. Clearly, some regulation of greedy entities is required to obtain the best results for society.

              With machine intelligence becoming increasingly powerful every day, some people think that to produce the best outcomes, we just need to wait for robots to be intelligent enough to completely run our lives. However, as anyone who has actually built intelligent systems knows, the outputs of such systems are strongly dependent on the inputs and goals set by system designers. Machine intelligence is just a tool – a very powerful tool. Like nuclear energy, we can use it to improve our lives, or we can use it to obliterate everything around us. The collective choice is ours to make, but it is far from simple.

                Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                Yes, the world has always been greedy. This reminds me of Dijkstra’s greedy algorithm, which is used to find the shortest route. There are a lot of “steps” for an organization to become profitable. Greediness tries to find the most cost-efficient way to achieve the goal of being profitable. Let us assume that each road is a railway and trains traverse to their destinations. Each decision path will sacrifice other trains waiting to cross to their destination. If human stupidity does not overrule again, our scarce resources will ultimately be constrained by economics to one element only: time. Where do we want humans to allocate their time?

                Greediness will always thrive in the sense that it is seen as a trait of growth by society. War, which in today’s society we ultimately condemn, was viewed in the past as one way for a nation to gain growth. Growth was limited to the domain of a specific country, and the rest were treated as enemies. The end of war was enforced by the right not to interfere with one another’s property. People had to find other means to gain growth, and thus the concept of greediness was reinforced. Greediness is using emotional appeal to manipulate other people’s habits within a specific domain.

                The problem with greediness is whether people evaluate the emotional appeal as matching something positive or negative for themselves and society. The maximum capacity for getting that right depends on:

                1. The effort of people to have multi-disciplinary knowledge of multiple domains
                2. The effort of people to apply their knowledge in their daily decisions.

                Most consumers are passive on the above two points due to society constraints. More specifically, if people focus learning from other domains, they have the risk of underperforming on their main domain having a competitive disadvantage on their prospect of their career. This limited domain in bayesian terms makes people have low confidence for many topics leaving others to influence our decision making. I think that low confidence is the main causation we see a high rise trend where people’s decisions rely more upon push messages instead of pull messages. I haven’t seen to this day sophisticated push messages where the user has a choice of options what to see except in the on boarding phase of a product. In addition, the on boarding phase are only reasons why you should use me. There will never be a phase on reasons when not to use me. If a specific domain of a product could say to the user based on the personalized information it gathered: “Hey, you shouldn’t be using me in this situation. Use Bob instead, it will make your life easier” (This will be possible with the evolution of data). The problem is a specific domain will never explain alternative domains that can solve a user’s individual problem better because there is not a commission fee of recommending one user to another domain with qualitative information. This causes a domain which consists a set of employees to not have an interest in researching alternative domains that can solve a specific problem better because the current system has not placed a platform to reward it with a commission fee. Instead, the only way for a specific domain to thrive is by copying others ideas or owning them through acquisitions. This demotivates innovation in great sum. So far, it is only people with consciousness, with value or no value, such as start-up entrepreneurs that leave old positions and people who contribute in open source correspondingly, that go the extra mile to innovate. 
My whole hypothesis is that our natural instincts are those of a machine learner, and our only task is to make progress on everything, even our own personal lives.

                If those two points happen, the rule of greediness will be overruled. People will consciously evaluate whether an emotional appeal makes sense in the big picture, because their jobs will force them to associate their domain with alternatives to gain a commission fee. That will give them more robust interdisciplinary domain knowledge, making them more confident pulling information about other domains they come to know, rather than having it pushed at them. Passive consumers will become less passive. Value was once driven by war, now by greediness; later it will be all about evaluation.

                Your point about people doing less work implies a society even more passive than it already is. I do not propose that, as it would make our situation worse. The problem is the type of tasks people do, not tasks themselves. People need to do tasks that progress our society instead of being passive, as in the game of Civilization. It is the only way to be happy and have a purpose. Just as the machine learning instances we humans create have an end-goal purpose, we humans are machine learners with a purpose: to handle any situation that becomes a problem. Our starter pack of problems was human suffering, hunger, and death. Now these press on us less, and we have to find motivation beyond extrinsic rewards.

                Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                Interesting point on the causal significance. How does this work when you have confounders in x? I’d have thought that x must contain the set of prima facie causes for which we have true exogenous variation.

                Also, how does it work when you have bad controls in x (where x includes post-treatment causes that are plausibly varied by c)?

                Good questions :)

                To be honest, I’m not completely sure it works in all these cases, as there is always a need for interpretation to decide whether the identified causes are genuine. I tried playing a bit with the toy data from Pearl’s report on Simpson’s Paradox, but the results are not entirely convincing. However, I’m also not fully convinced that Pearl’s solution fully resolves Simpson’s Paradox, and Kleinberg does go through a few scenarios where her approach doesn’t work in her book, so I’d say that there are still quite a few open problems in the area.

                Post-treatment causes are partly addressed by the definition in Huang and Kleinberg (2015), where significance is weighted by the number of timepoints where e follows c. Again, that definition doesn’t handle all cases, but I think it’s an interesting line of research. I would definitely like to see their results reproduced by other researchers and expanded to other datasets, though.
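To give a concrete feel for the kind of quantity being discussed, here is a toy Python sketch of average causal significance: the change in the probability of an effect when the cause holds, averaged over background factors. This is an illustration only, with made-up variable names and binary data; it omits the time windows that are central to Kleinberg's actual definitions.

```python
import random

def avg_significance(data, c, e, background):
    """Toy version of average causal significance: the change in P(e)
    when c holds, averaged over strata defined by background factors.
    `data` is a list of dicts mapping variable names to booleans."""
    diffs = []
    for x in background:
        with_c = [row[e] for row in data if row[c] and row[x]]
        without_c = [row[e] for row in data if not row[c] and row[x]]
        if with_c and without_c:
            diffs.append(sum(with_c) / len(with_c)
                         - sum(without_c) / len(without_c))
    return sum(diffs) / len(diffs) if diffs else 0.0

# Toy data: e follows c with probability 0.8, and occurs with
# probability 0.2 otherwise, regardless of the background factor b.
rng = random.Random(0)
data = []
for _ in range(2000):
    c = rng.random() < 0.5
    data.append({'c': c,
                 'b': rng.random() < 0.5,
                 'e': rng.random() < (0.8 if c else 0.2)})
print(avg_significance(data, 'c', 'e', ['b']))  # close to 0.6
```

On data like this, where the background factor is irrelevant, the estimate recovers the raw probability gap; the interesting (and hard) cases are the ones discussed above, where background factors confound or mediate the relationship.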

                Excellent article! It has been very useful to understand what the topic of causality is about and triggered my interest to continue learning more!
                Thanks for this post! I share your troubles over Pearl/time/feedback loops!
                Nice post. Have you had any chance to apply them to real datasets? Please share those results.
                Great post. I did not know about Kleinberg and Hill’s work. I knew a similar list of criteria from this article, which is much more recent: https://doi.org/10.1177%2F0951629805050859 Regarding Kleinberg: adding time certainly is valuable, but doesn’t the smoking example change the research question from whether smoking causes lung cancer to when it causes lung cancer? The latter question is more informative and implies the former, but I’d say it is fine to ask the first question when one is not interested in the time of occurrence of cancer.
                Thank you! I agree that the latter question is more informative, but I now think that saying that “smoking causes cancer” isn’t particularly meaningful, as it ignores both timing and dosage. A good summary of the case for well-defined interventions was provided by Miguel Hernán in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5207342/
                The limits of Pearl’s theory on feedback loops bothers me too. However, have you studied much Control Theory? Or dynamical systems in general? It explicitly deals with feedback loops. I’d be keen to get your thoughts on the comparison of Control Theory vs Pearl’s Causal Inference.
                Thanks for the comment! No, I haven’t studied Control Theory. Maybe I’ll look into it one day. :)

                Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                My hunch is that the % of HDI chosen is of less interest to a user than seeing how each test iteration alters the HDI and shifts the level of overlap between the HDI and ROPE toward one of the outcomes. In the example given above, would a fair interpretation be that the differences appear weighted more toward the negative than the positive? As for precision, shouldn’t it be made a function of the minimum effect requested? A larger ROPE would require less precision, and vice versa?
                Thanks for your comment, John! I think that it appears weighted more towards the negative because the beta distribution is symmetric when the mean is 0.5 (alpha = beta), and asymmetric in other cases, making it less pointy. According to Kruschke’s simulations, using the precision stopping rule makes the success rate estimate closer to the true mean of the underlying distribution than with other stopping rules, which tend to overestimate the success rate. I’m not sure we’d get the same results if precision were a function of the minimum effect, but I’d like to run more simulations to get a better feeling for how it works.
                Could you please tell me how you calculate the HDI and ROPE? I am trying to replicate this calculator in R. Thanks!
                The source code for the calculation is here: https://github.com/yanirs/yanirs.github.io/blob/master/tools/split-test-calculator/src/bayes.coffee#L139 – it shouldn’t be too hard to translate to R.
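For readers who don't want to wade through CoffeeScript, the HDI of a Beta posterior can be approximated by sampling, which is easy to port to R. This is a rough sketch rather than a translation of the linked code; the 95% mass and the example ROPE values are arbitrary choices:

```python
import random

def beta_hdi(alpha, beta, mass=0.95, n=50_000, seed=42):
    """Approximate the highest density interval of a Beta(alpha, beta)
    distribution as the narrowest interval containing `mass` of n samples."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(n))
    k = int(mass * n)
    # Slide a window covering k samples and keep the narrowest one.
    i = min(range(n - k), key=lambda j: samples[j + k] - samples[j])
    return samples[i], samples[i + k]

# Posterior for 100 successes in 5000 trials, with a uniform Beta(1, 1) prior.
lo, hi = beta_hdi(1 + 100, 1 + 5000 - 100)
rope = (0.015, 0.025)  # example ROPE around a 2% baseline rate
print(f"95% HDI: [{lo:.4f}, {hi:.4f}]")
print("HDI inside ROPE:", rope[0] <= lo and hi <= rope[1])
```

The sampling approach trades a little precision for simplicity; with a closed-form density you can do better, but the narrowest-window idea is the same.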

                Thanks for the post!

                I’m more of a business stakeholder simply trying to improve our testing practices, rather than a data scientist who understands the theories at a detailed level.

                I’m a bit confused why, if I enter the default example in your calculator (5000 trials each, 100 successes vs 130), the recommendation is to implement EITHER variant.

                Whereas, using a tool such as the following suggests a 97.8% chance the variant with 130 successes will outperform the control: https://abtestguide.com/bayesian/

                This calculator also seems to suggest the 130-successes variant should be chosen, not EITHER, as there is 95% confidence the result is not due to chance: https://abtestguide.com/calc/

                A secondary question is: if there is no predetermined sample size with the Bayesian approach, how do you plan how long to run the test for? Mainly to deal with stakeholder communication and project planning, but also to avoid peeking.
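For what it's worth, the 97.8% figure from the Bayesian calculator linked above is the posterior probability that the variant's true rate exceeds the control's, which can be checked by simulation. This sketch assumes independent uniform Beta(1, 1) priors; the exact number depends on the priors used:

```python
import random

def prob_b_beats_a(successes_a, trials_a, successes_b, trials_b,
                   n=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1, 1) priors on the two conversion rates."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        rate_a = rng.betavariate(1 + successes_a, 1 + trials_a - successes_a)
        rate_b = rng.betavariate(1 + successes_b, 1 + trials_b - successes_b)
        wins += rate_b > rate_a
    return wins / n

# The example from the comment: 100/5000 vs 130/5000 successes.
print(prob_b_beats_a(100, 5000, 130, 5000))  # roughly 0.97-0.98
```

Note that a high probability of the variant being better is not inconsistent with an HDI-plus-ROPE rule recommending either variant: the difference may well be real but too small to matter, which is exactly the case the ROPE is designed to flag.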

                Many thanks,

                Why Data Scientist is a useless job title

                Given that a data scientist is someone who does data analysis, and/or a scientist, and/or an engineer, what does it mean for a person to hold a Data Scientist position? It can mean anything, as it depends on the company and industry. A job title like Data Scientist at Company is about as meaningful as Engineer at Organisation, Scientist at Institution, or Doctor at Hospital. It gives you a general idea what the person’s background is, but provides little clue as to what the person actually does on a day-to-day basis.

                Don’t believe me? Let’s look at a few examples. Noah Lorang (Basecamp) is OK with mostly doing arithmetic. David Robinson (Stack Overflow) builds machine learning features and internal R packages, and visualises data. Robert Chang (Twitter) helps surface product insights, create data pipelines, run A/B tests, and build predictive models. Rob Hyndman (Monash University) and Jake VanderPlas (University of Washington) are academic data scientists who contribute to major R and Python open-source libraries, respectively. From personal knowledge, data scientists in many Australian enterprises focus on generating reports and building dashboards. And in my current role at Car Next Door I do a little bit of everything, e.g., implement new features, fix bugs, set up data pipelines and dashboards, run experiments, build predictive models, and analyse data.

                To be clear, the work done by many data scientists is very useful. The number of decisions made based on arbitrary thresholds and some means multiplied together on a spreadsheet can be horrifying to those of us with minimal knowledge of basic statistics. Having a good data scientist on board can have a transformative effect on a business. But it’s also very easy to end up with ineffective hires working on low-impact tasks if the business has no idea what their data scientists should be doing. This situation isn’t uncommon, given the wide range of activities that may be performed by data scientists, the lack of consensus on the definition of the field, and a general disagreement over who deserves to be called a real data scientist. We need to move beyond the hype towards clearer definitions that would help align the expectations of data scientists with those of their current and future employers.

                It’s time to specialise

                Four years ago, I changed my LinkedIn title from software engineer with a research background to data scientist. Various offers started coming my way, and they haven’t stopped since. Many people have done the same. To be a data scientist, you just need to call yourself a data scientist. The dilution of the term means that as a job title, it is useless. Useless terms are unlikely to last, so if you’re seriously thinking of becoming a data scientist, you should also consider specialising. I believe we’ll see the emergence of new specific titles, such as Machine Learning Engineer. In addition, less “sexy” titles, such as Data Analyst, may end up making a comeback. In any case, those of us who invest in building their skills, delivering value in their job, and making sure people know about it don’t have much to worry about.

                What do you think? Is specialisation inevitable or are generalist data scientists here to stay? Please let me know privately, via Twitter, or in the comments section.

                Subscribe

                Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                  I think exactly the same, but for the moment the title is somewhat needed. I’ll give my example: I’m a data scientist because my company wants to differentiate between regular data analysts (who can’t code, but are learning, with me helping them) and backend software engineers (who can code better than me, but lack the business knowledge and have a tendency to throw fancy algorithms at numbers without thinking about method and usefulness for the business).

                  Eventually we will have new job titles, but for now we are stuck with “data scientists”. As soon as the hype fades, we’ll see people moving to new titles.

                  Great article - and really the ambiguity surrounding the Data Scientist title hurts everyone - Data Scientists are frustrated that they’re expected to do everything, and others are frustrated that their Data Scientists can’t do everything that they’ve heard data scientists can do. I think this will change over time as data scientists (or whatever they will be called) roles get further defined.

                  Good article - I’ve always had a bit of a problem with the term “Data Scientist”, in that it implies that the person with such a title is somehow involved in scientific research on data or has a deep academic background, neither of which is usually true.

                  Now, what does everyone think about the term “Data Architect” - I could do a rant on that one. Suffice it to say, you are a DATABASE Architect, not a DATA architect. Data is data, it is just raw numbers. You don’t design data, you design a data model which eventually gets translated into a database. Sorry, I guess that was a bit of a rant …

                  Yeah, people like the word data more than the word database these days. There are also the various places where you can put data. You can drown it in a lake, for example…

                  Great article, and couldn’t agree more - there’s deep irony in “Data Science” as a job title.

                  I’ve started to use the term “Entrepreneurial Analyst” to be more precise about the focus on the outcome and also to allow latitude for hypotheses, exploration and discovery.

                  Reblogged this on codefying and commented: Especially like the antiparallel structure of scientific inquiry and engineering design.
                  I find it interesting that, as you said, data analysis is very useful if done by effective hires. It’s important to understand the data you are provided with and the context it fits in. Otherwise, you could come to conclusions that miss the mark. It is important to have those who are properly qualified analyze data accurately.
                  Great article! I also feel that anybody with five or more years of experience in data science can be considered for a Data Scientist role. Professionals with less experience can always take roles as Data Analysts or Data Engineers. There are many certification programs that can provide you with the relevant skill-sets, such as Hortonworks certifications, Cloudera certifications, Data Science Council of America (DASCA) certifications, etc.

                  Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                  Thanks Yanir for this post! Once again, you hit the nail on the head! We’re probably all guilty of making a number of those mistakes at one point or another in our careers. And it wouldn’t surprise me that a lot of companies are making all of those mistakes at the same time. I especially liked #6. Instead of stupidity, I would suggest that ego is responsible for it.
                  Yeah, I think that Bertrand Russell was a bit too harsh – it’s really ignorance that often causes overconfidence rather than stupidity. And yes, I have made this mistake as well. Many things often look misleadingly simple if you don’t get into the fine details.
                  Reblogged this on QA-notes and commented: All common sense, but as with many things, having it written down focusses the mind :-)

                  Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.


                  When I started doing data science in a business setting (after years of doing quantitative genetics in academic settings), I was puzzled by the talk of “customer lifetime value”, partly due to the issues you’ve mentioned. Even with appropriate clarifications - it’s over, say, five years, not forever, etc. - it’s a peculiar quantity, in that as typically calculated, at least by the business-y people in my vicinity, it isn’t an average over a single population of customers. Instead, it’s average, say, first-month net present value over customers who’ve been around for at least one month (or maybe over all such customers who’ve also been around for at most one year, to reduce the influence of customer behavior farther in the past, when the product catalog, marketing strategies, etc. were different), plus average second-month net present value over customers who’ve been around for at least two months, etc., that is, it’s a sum of averages over a sequence of populations of customers (which may not even be nested). And there can be further subtleties. For example, in the context of a “freemium” service such as the one that is my primary client at present, sometimes people want to measure time from when a customer signs up for an account, whereas other times people want to measure time from when a customer first buys something, which may be much later. Altogether, I’ve found that “customer lifetime value” generally requires a good deal of explanation.

                  “no amount of cheap demagoguery and misinformation can alter the objective reality of our world.”: Alas, that isn’t quite true. Next week, the objective reality of how the USA is governed will be altered substantially, partly due to blatant demagoguery and misinformation.

                  Great analysis Yanir!
                  Thanks Ralph! I meant the last sentence in the sense of “and yet it moves”. People’s actions and choices are definitely affected by demagoguery and misinformation, but the spread of misinformation doesn’t change reality by itself. For example, Trump et al.’s climate science denialism isn’t going to alter the reality of anthropogenic climate change, though their actions are probably going to accelerate it.

                  This is why Investment Banking and Venture Capital firms should hire Data Scientists.

                  I think your post and the links you share can have a part on the Google search results as well in near future :)

                  Great post.

                  There’s also the BTYD package in R that I’ve seen used for CLV calculations, although I don’t know if it could be used for anything industrial. All credit for this knowledge goes to Dan McCarthy, who just put out some great research on using CLV in non-contractual settings.

                  Hi Yanir!

                  Nice post.

                  How can the models you mentioned be altered in the case of a subscription based business in order to calculate the lifetime value of the customers?

                  Thanks Eleni! I think that in the case of subscription-based products, you’re better off using different models, as churn is observed and can be predicted (e.g., using a package like lifelines). Once you have an estimate of when a customer is going to churn, it’s easy to estimate their LTV (assuming constant recurring revenue). In any case, the general principle of not using closed formulas without testing their accuracy on your data still applies here.
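To make the last step concrete, here is a minimal sketch of turning a churn estimate into an LTV figure under the constant-recurring-revenue assumption mentioned above. The revenue, horizon, and discount rate are made-up example numbers:

```python
def ltv(monthly_revenue, expected_months, monthly_discount_rate=0.0):
    """Discounted lifetime value under constant recurring revenue,
    given a churn model's estimate of how long the customer will stay."""
    return sum(
        monthly_revenue / (1 + monthly_discount_rate) ** month
        for month in range(expected_months)
    )

# A $30/month customer predicted to churn after 24 months,
# discounted at about 0.8% per month (roughly 10% per year).
print(round(ltv(30, 24, 0.008), 2))
```

In practice you'd replace the single point estimate of churn time with the survival curve from the churn model, weighting each month's revenue by the probability the customer is still around, but the discounting logic stays the same.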
                  Thanks for the article, Yanir! I am a huge proponent of using formulas for CLV only as a starting point, when similar historical models aren’t available. When a good historical financial model is available, it becomes much more useful than the generic formula. I was just speaking with a service vendor who was trying to convince us to let his company perform exhaustive FMEAs on all of our equipment, when we had years of failure data on which to base a maintenance strategy. Only rely on the theoretical when the empirical isn’t an option.

                  Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.


                  Very enlightening post! It was awesome to see that your Elasticsearch insights made it into a PR. I bet that was worth the whole thing!
                  That’s very exciting! I wanted to ask: are you a self-learner or do you have a degree? Can you please share your background? Thank you.
                  Thanks Mostafa. Yes, I have a BSc in computer science, and a PhD in what you would now call data science. See: https://www.linkedin.com/in/yanirseroussi/

                  This was an amazing post, Yanir! Loved the breakdown and the patience you had for the whole process, very well played and you really deserved it! :)

                  P.S: Really can connect as I’ve been working independently for a while now and would definitely be open to looking for long-term contracts or remote jobs like this.

                  Your post is really a therapy for most people who apply for jobs and lose hope while waiting. I believe patience is the key to everything. Thanks.

                  Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                  Cool… not sure why and when I subscribed to your mailing list, and now quite surprised to hear that Bandcamp Recommender was your project. I am a Bandcamp freak… Bandcamp has recently started showing recommendations at the bottom ;-) seems primitive though. Example: https://ogreyouasshole.bandcamp.com/album/crossword-lost-sigh-days-james-mcnew-remixes Would love to hear about the basic logic you used behind the “recommendations”. I have no technical knowledge at all, but a few years ago I thought of a basic recommendation model, though I couldn’t take it forward… I thought “contextualizing” artists would be a cool way to connect bands.

                  I want to become a data science freelancer. Can you provide some advice?

                  As with any freelancing job, expect to spend much of your time on sales and networking. I've only explored the freelancing path briefly, but Radim Řehůřek has published great slides on the topic. If you're thinking of freelancing as a way of gaining financial independence, also consider spending less, earning more, and investing wisely.

                  Can you recommend an academic data science degree?

                  Sorry, but I don't know much about those degrees. Boris Gorelik has some interesting thoughts on studying data science.

                  Will you be my mentor?

                  Probably not, unless you're hard-working, independent, and doing something I find interesting. Feel free to contact me if you believe we'd both find the relationship beneficial.

                  Can you help with my project?

                  Possibly. If you think I'd find your project exciting, please do contact me.


                  What about ethics?

                  What about them? There isn't a single definition of right and wrong, as morality is multi-dimensional. I believe it's important to question your own choices, and avoid applying data science blindly. For me, this means divesting from harmful industries like fossil fuels and striving to go beyond the creation of greedy robots (among other things).

                  I’m a manager. When should I hire a data scientist and start using machine learning?

                  There's a good chance you don't need a data scientist yet, but you should be aware of common pitfalls when trying to be data-driven. It's also worth reading Paras Chopra's post on what you need to know before you board the machine learning train.

                  Do you want to buy my products or services?

                  No. If I did, I'd contact you.

                  I have a question that isn’t answered here or anywhere on the internet, and I think you can help. Can I contact you?

                  Sure, use the form on this page.

                    Subscribe

                    Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                    Thanks so much for sharing this Yanir!

                    Indeed, such questions come up frequently. Thanks for providing answers to help guide folks. I might add a few things:

                    when ready for the job search… Advice to Data Scientists on Where to Work http://multithreaded.stitchfix.com/blog/2015/03/31/advice-for-data-scientists/

                    if you are going to get into data science, do it for the right reasons. Let your passion drive! https://www.quora.com/How-do-I-move-from-data-scientist-to-data-science-management

                    Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                    Great set of definitions and path of evolutions here!

                    There has to be chaos and confusion as the field evolves, surely, but the consensus, as you very well mentioned, is decisions. Anything done in the data world that doesn’t lead to decisions isn’t viable in the long term.

                    Thanks for sharing your thoughts, love reading your blog.

                    Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                    I have been working remotely for WRI for nearly 2 years, and I can resonate with almost everything you have said. Great blog!
                    Interested. Though not trained as a data scientist yet, I’ve been a BI consultant for over a decade. Let me know if you have any opportunities.
                    I am working for Accenture as an analyst. The article is very similar to my real life. I studied data science at a top university and worked on a few capstone projects.

                    Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.


                    “the basic bootstrap makes no assumption about the underlying distribution of the data”: I suppose bootstrapping per se doesn’t, but some things people like to use it for do. For example, suppose the original sample is from a Cauchy distribution, and bootstrapping is used to compute a confidence interval around the sample mean; no matter how many bootstrap replicates are used, the computed interval is worthless, because the original distribution doesn’t have a mean. Of course, that’s an extreme case unlikely to arise in practice, but it immediately raises doubt that guidelines like “n ≥ 101 for the bootstrap t” should be applied uncritically. As you obviously agree, it’s best to know and think about where the data came from when deciding which statistical methods to apply.
                    Learned a lot from the post. One question on the CIs for means and the difference between the means. If the two CIs for the means do not overlap, does it always imply that the difference is significant? Or can the error go in both ways, meaning that it is possible to have non-overlapping CIs and the CI of the difference includes 0?
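One way to explore this question is empirically: compute percentile-bootstrap CIs for each mean and for the difference, and see how they relate on simulated data. A pure-Python sketch, where the toy data and bootstrap settings are arbitrary:

```python
import random

def percentile_ci(values, n_boot=4000, level=0.95, seed=1):
    """Percentile bootstrap CI for the mean of a sample."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(values) for _ in values) / len(values)
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    return means[lo_idx], means[n_boot - 1 - lo_idx]

def diff_ci(xs, ys, n_boot=4000, level=0.95, seed=2):
    """Percentile bootstrap CI for mean(ys) - mean(xs), resampling
    the two groups independently."""
    rng = random.Random(seed)
    diffs = sorted(
        sum(rng.choice(ys) for _ in ys) / len(ys)
        - sum(rng.choice(xs) for _ in xs) / len(xs)
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    return diffs[lo_idx], diffs[n_boot - 1 - lo_idx]

rng = random.Random(0)
a = [rng.gauss(0.0, 1.0) for _ in range(200)]
b = [rng.gauss(0.5, 1.0) for _ in range(200)]
print("mean(a) CI:", percentile_ci(a))
print("mean(b) CI:", percentile_ci(b))
print("difference CI:", diff_ci(a, b))
```

As a rule of thumb for means, comparing per-group CIs for overlap is the more conservative check: the standard error of the difference is smaller than the sum of the two standard errors, so non-overlapping CIs generally imply a difference CI that excludes zero, while overlapping CIs can still correspond to a significant difference.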

                    Reblogged this on Boris Gorelik and commented:

                    Anything is better when bootstrapped. Read my co-worker’s post on bootstrapping. Also make sure to follow the links Yanir gives to support his claims.

                    Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                    Reblogged this on Boris Gorelik and commented:

                    Many years ago, I terribly overfit a model, which caused losses of a lot of shekels (a LOT). It’s not that I wasn’t aware of the potential for overfitting. I was. Among other things, I used several bootstrapping simulations. It turns out that I applied the bootstrapping in the wrong way. My particular problem was that I “forgot” about confounding parameters and that I “forgot” that peeking into the future is a bad thing.

                    Anyhow, Yanir Seroussi, my data scientist coworker, gave a very good talk on bootstrapping.

                    Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.


                    This is the same conclusion I reached when deciding between deepening data science skills vs engineering; now I’m deeper into cloud services and off-the-shelf ML tools.
                    Good points, thanks Boris!

                    Hi Yanir!

                    The post really resonated with me. I find more and more that I do more engineering than science during my day.

                    I believe the data engineering part, including cloud and full stack development skills, will prove to be the skills that keep you relevant in industry. If you combine these with knowledge on which techniques to use regarding data science and machine learning, then you can be unstoppable.

                    Otherwise, as you said, it’s better to stay in academia.

                    Best, Antonios

                    Hi Yanir,

                    I am glad I found your post. I am switching careers and want to work with data for social good. I learned data analysis, and am thinking of exploring machine learning to see if it’s something for me. Being an engineer (not tech related), I would definitely be content with your option 1, where I understand what’s going on behind the scenes but don’t get into research and maths.

                    Interesting to know what the trend is in the field.

                    Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                    Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.


                    Loved this short blog. Planning to transition to climate tech as a DS guy, and slowly cultivating a pent-up passion for conserving marine life, so there are too many things I can relate to here haha
                    Thank you! Good luck with the transition. 🙂

                    Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

                    This is a very well written post and it’s great to hear your reasoning on this. Congrats on your new position!

                    I’m amused that the first sample on the home page features the user asking, “this code is not working like i expect - how do i fix it?” In my mind, I hear the voice of HAL (Douglas Rain) answer, “This sort of thing has cropped up before, and it has always been due to human error.”

                    I’ve been assuming ChatGPT is just the latest specimen of what typically passes for AI these days, a system with an elaborate model of utterances disconnected from any deeper and richer model of the world to which utterances refer, hence brittle and shallow. (Such systems more or less realize John Searle’s “Chinese room” scenario, although unlike Searle, I don’t think they represent any fundamental limit on AI, merely the current, crude state of the art.) However, you’ve convinced me to try it out.

                    Thanks Ralph! It’s definitely still early days, but it feels like a step change in chatbot tech, much like the significant improvements in image recognition from a decade ago. I can see it only getting more capable with all the interaction data that they’re collecting.

                    Reading of the despair of the many silent/invisible contributors of intellectual property copied, without financial recompense, which went into these LLMs. It’s a training data gold rush/free for all right now, just as unedifying and thoughtless as actual gold rushes in history were. I fear this will also quickly lead to Apple-style closed gardens of proprietary creative talent (think of top artists, writers and thinkers signed up to train AI rather than creating content for direct human consumption). Counter-ML tech like Glaze will only delay the inevitable. https://glaze.cs.uchicago.edu/.

                    I can think of some ways where governments might respond, e.g. special taxes and incentives on AI businesses to fund creative academies and collectives, much as public universities are today.

                    Thanks John! Yeah, I doubt that tech like Glaze can be made future proof, as they admit on that page. Besides, I lean more towards the view that all creative work is derivative and copying isn’t theft. Copyright mostly protects platforms and businesses rather than individuals. While I empathise with individuals who feel like their work is being exploited without their permission, I don’t see the training of machine learning models as being that different from artists learning from other artists.

                    Thoughtful government intervention would be great, but it’s unlikely to be applied in a timely manner or evenly across jurisdictions.

                    2024

                    June

                    Five team-building mistakes, according to Patty McCord

                    Takeaways from an interview with Patty McCord on The Startup Podcast.

                    June 26, 2024

                    Dealing with endless data changes

                    Quotes from Demetrios Brinkmann on the relationship between MLOps and DevOps, with MLOps allowing for managing changes that come from data.

                    June 22, 2024

                    The rules of the passion economy

                    Summary of the main messages from the book The Passion Economy by Adam Davidson.

                    June 12, 2024

                    May

                    Adapting to the economy of algorithms

                    Overview of the book The Economy of Algorithms by Marek Kowalkiewicz.

                    May 25, 2024

                    April

                    LinkedIn is a teachable skill

A high-level overview of things I learned from Justin Welsh’s LinkedIn Operating System course.

                    April 11, 2024

                    The data engineering lifecycle is not going anywhere

                    My key takeaways from reading Fundamentals of Data Engineering by Joe Reis and Matt Housley.

                    April 5, 2024

                    March

                    Atomic Habits is full of actionable advice

                    I put the book to use after the first listen, and will definitely revisit it in the future to form better habits.

                    March 12, 2024

                    February

                    The three Cs of indie consulting: Confidence, Cash, and Connections

                    Jonathan Stark makes a compelling argument why you should have the three Cs before quitting your job to go solo consulting.

                    February 17, 2024

                    Future software development may require fewer humans

                    Reflecting on an interview with Jason Warner, CEO of poolside.

                    February 6, 2024

                    January

                    Psychographic specialisations may work for discipline generalists

                    When focusing on a market segment defined by personal beliefs, it’s often fine to position yourself as a generalist in your craft.

                    January 9, 2024

                    The power of parasocial relationships

                    Repeated exposure to media personas creates relationships that help justify premium fees.

                    January 8, 2024

                    2023

                    December

                    Positioning is a common problem for data scientists

                    With the commodification of data scientists, the problem of positioning has become more common: My takeaways from Genevieve Hayes interviewing Jonathan Stark.

                    December 18, 2023

                    Transfer learning applies to energy market bidding

                    An interesting approach to bidding of energy storage assets, showing that training on New York data is transferable to Queensland.

                    December 14, 2023

                    November

                    Our Blue Machine is changing, but we are not helpless

                    One of my many highlights from Helen Czerski’s Blue Machine.

                    November 28, 2023

                    You don’t need a proprietary API for static maps

                    For many use cases, libraries like cartopy are better than the likes of Mapbox and Google Maps.

                    November 21, 2023

                    October

                    Artificial intelligence was a marketing term all along – just call it automation

                    Replacing ‘artificial intelligence’ with ‘automation’ is a useful trick for cutting through the hype.

                    October 6, 2023

                    September

                    The lines between solo consulting and product building are blurry

                    It turns out that problems like finding a niche and defining the ideal clients are key to any solo business.

                    September 25, 2023

                    Google’s Rules of Machine Learning still apply in the age of large language models

                    Despite the excitement around large language models, building with machine learning remains an engineering problem with established best practices.

                    September 21, 2023

                    August

                    The Minimalist Entrepreneur is too prescriptive for me

                    While I found the story of Gumroad interesting, The Minimalist Entrepreneur seems to over-generalise from the founder’s experience.

                    August 21, 2023

                    Revisiting Start Small, Stay Small in 2023 (Chapter 2)

                    A summary of the second chapter of Rob Walling’s Start Small, Stay Small, along with my thoughts & reflections.

                    August 17, 2023

                    Revisiting Start Small, Stay Small in 2023 (Chapter 1)

                    A summary of the first chapter of Rob Walling’s Start Small, Stay Small, along with my thoughts & reflections.

                    August 16, 2023

                    Email notifications on public GitHub commits

                    GitHub publishes an Atom feed, which means you can use any RSS reader to follow commits.

                    August 14, 2023

                    The rule of thirds can probably be ignored

                    Turns out that the rule of thirds for composing visuals may not be that important.

                    August 11, 2023

                    July

                    Using YubiKey for SSH access

                    Some pointers for setting up SSH access with YubiKey on Ubuntu 22.04.

                    July 23, 2023

                    Making a TIL section with Hugo and PaperMod

                    How I added a Today I Learned section to my Hugo site with the PaperMod theme.

                    July 17, 2023

                    You can’t save time

                    Time can be spent doing different activities, but it can’t be stored and saved for later.

                    July 11, 2023