On a tangent: The term "perceptron" in MLPs may be a bit confusing since you don't really want only linear neurons in your network. Using MLPs, you want to learn complex functions to solve non-linear problems. Thus, your network is conventionally composed of one or multiple "hidden" layers that connect the input and output layer. Those hidden layers normally have some sort of sigmoid activation function (log-sigmoid, the hyperbolic tangent, etc.). For example, think of a log-sigmoid unit in your network as a logistic regression unit that returns continuous-valued outputs in the range 0-1. A simple MLP could look like this:
where y_hat is the final class label that you return as the prediction based on the inputs (x) if this is a classification task. The "a"s are your activated neurons, and the "w"s are the weight coefficients.
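To make this a bit more concrete, here is a minimal NumPy sketch of a forward pass through such an MLP with one hidden layer of log-sigmoid units. The layer sizes, the random weights, and the variable names are made-up toy values for illustration, not taken from the figure above.

```python
import numpy as np

def sigmoid(z):
    """Log-sigmoid activation: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(3)                  # input features
W1 = rng.standard_normal((4, 3))   # weights: input -> hidden layer (4 units)
W2 = rng.standard_normal((1, 4))   # weights: hidden layer -> output unit

a_hidden = sigmoid(W1 @ x)         # the activated hidden neurons (the "a"s)
y_hat = sigmoid(W2 @ a_hidden)     # continuous output in (0, 1)
print(y_hat[0])                    # threshold at 0.5 to get a binary class label
```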
Now, this is where "deep learning" comes into play.
Let's consider a ConvNet in the context of image classification.
Here, you use so-called "receptive fields" (think of them as "windows") that slide over your image. You then connect each of those receptive fields (for example, of size 5x5 pixels) to one unit in the next layer; the resulting layer of units is also called a "feature map". After you are done with this mapping, you have constructed a so-called convolutional layer. Note that your feature detectors are basically replicates of one another -- they share the same weights. The idea is that if a feature detector is useful in one part of the image, it is likely useful somewhere else as well; at the same time, this allows each patch of the image to be represented in several ways.
Next, you have a "pooling" layer, where you reduce neighboring features from your feature map into single units (by taking the maximum feature or by averaging them, for example). You do this over many rounds and eventually arrive at a representation that is nearly invariant to small translations of the input (the convolution operation itself is translation equivariant). This is very powerful, since you can detect objects in an image no matter where they are located.
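As a rough illustration of these two operations, here is a minimal NumPy sketch that slides a single 5x5 shared-weight feature detector over a toy image and then applies 2x2 max-pooling; the function names and the random toy values are hypothetical and only meant to mirror the description above.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the shared-weight "receptive field" (kernel) over the image."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Reduce neighboring features by keeping only the maximum in each block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size            # crop so the blocks fit evenly
    fm = feature_map[:h, :w]
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((28, 28))           # toy grayscale image
kernel = rng.standard_normal((5, 5))   # one shared-weight 5x5 feature detector
feature_map = conv2d(image, kernel)    # convolutional layer output (24x24)
pooled = max_pool(feature_map)         # pooling layer output (12x12)
print(feature_map.shape, pooled.shape)
```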
**faq/issues-with-clustering.md**
I wouldn't necessarily call most of them "issues" but rather "challenges".
- The number of clusters is (typically) not known a priori (that's basically the defining characteristic of unsupervised learning problems), but there are a few "performance" or "evaluation" metrics one can plot against the value of *k* to infer a "satisfying" grouping; this is also called the elbow method:
*(elbow plot: distortion as a function of the number of clusters k)*
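For reference, here is a minimal scikit-learn sketch of the elbow method; the blob dataset generated here is just a hypothetical stand-in for the dataset shown in the plot.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset with 3 "true" groups (a stand-in, not the dataset from the plots)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit k-means for a range of k and record the within-cluster SSE ("distortion")
distortions = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_)

# Plot distortion vs. k and look for the "elbow" where the curve flattens out
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Distortion (within-cluster SSE)')
plt.show()
```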
Looking at the elbow plot above, it seems that k=3 would be a good pick. Let's have a look at the accompanying 2D dataset that I used to train the *k*-means algorithm and see if our intuition agrees:
I'd say k=3 is definitely a reasonable pick. However, note that the "elbow" is typically not as clear as shown above. Moreover, in practice we normally work with higher-dimensional datasets, so we can't simply plot our data and double-check visually (though we could use unsupervised dimensionality reduction techniques such as PCA). In fact, if we already knew that the 3 clusters belong to three different groups, this would be a classification task.
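As a side note on that parenthetical remark, a quick way to eyeball the cluster structure of a higher-dimensional dataset is to project it onto its first two principal components; this is only a sketch on a made-up 10-dimensional blob dataset.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Made-up 10-dimensional dataset, standing in for "real" high-dimensional data
X_highdim, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

# Project onto the first 2 principal components for a visual sanity check
X_2d = PCA(n_components=2).fit_transform(X_highdim)
plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.show()
```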
Anyway, there are other useful evaluation metrics such as the silhouette coefficient, which gives us some idea of the cluster sizes and shapes. Using the same dataset, let me give you a "good" silhouette plot (with k=3) and a not-so-decent one (k=2):
*(silhouette plots for k=3 and k=2)*
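For the curious, here is a minimal scikit-learn sketch that compares the average silhouette coefficient for k=3 and k=2 on a made-up blob dataset (the plots above are built from the per-sample values, e.g., via `silhouette_samples`; this sketch only compares the averages).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy stand-in for the dataset used above
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Values close to 1 indicate dense, well-separated clusters;
# values near 0 indicate overlapping clusters.
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```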
I would say that the biggest "shortcoming" of *k*-means may be that we assume the groups come in spherical or globular shapes, which is rarely the case with "real-world" data. In contrast, I think of choosing the "optimal" *k* as just another hyperparameter optimization procedure, which is also necessary for almost every supervised learning algorithm.
**faq/why-python.md**
Maybe I should start with the short answer. You are welcome to stop reading this article below this paragraph, because it really nails it: I am a scientist, and I like to get my stuff done. I like to have an environment where I can quickly prototype and jot down my models and ideas. I need to solve very particular problems. I analyze given datasets to draw my conclusions. This is what matters most to me: How can I get the job done most productively? What do I mean by "productively"? Well, I typically run an analysis only once (the testing of different ideas and debugging aside); I don't need to repeatedly run a particular piece of code 24/7, and I am not developing software applications or web apps for end users. When I *quantify* "productivity," I literally estimate the sum of (1) the time that it takes to get the idea written down in code, (2) the time to debug it, and (3) the time to execute it. To me, "most productively" means "how long does it take to get the results?" Now, over the years, I have figured out that Python is for me. Not always, but very often. Like everything else in life, Python is not a "silver bullet," and it's not the "best" solution to every problem. However, it comes pretty close if you compare programming languages across the spectrum of common and not-so-common problem tasks; Python is probably the most versatile and capable all-rounder.
Remember: "Premature optimization is the root of all evil" (Donald Knuth). If you are part of a software engineering team that wants to optimize the next game-changing high-frequency trading model from your machine learning and data science division, Python is probably not for you (but maybe it was the data science team's language of choice, so it may still be useful to learn how to read it). So, my little piece of advice is to evaluate your daily problem tasks and needs when you choose a language. "If all that you have is a hammer, everything starts to look like a nail" -- you are too smart to fall for this trap! However, keep in mind that there is a balance. There are occasions where the hammer may be the best choice even if a screwdriver would probably be the "nicer" solution. Again, it comes down to productivity.
I needed to develop a bunch of novel algorithms to "screen" 15 million small chemical compounds.
Trust me, time was really "limited": we had just gotten our grant application accepted and research funded a few weeks before the results had to be collected (our collaborators were doing experiments on larvae of a certain fish species that only spawns in spring). Therefore, I started thinking, "How could I get those results to them as quickly as possible?" Well, I know C++ and FORTRAN, and if I implemented those algorithms in those languages, executing the "screening" run might be faster compared to a Python implementation. This was more of an educated guess; I don't really know if it would have been substantially faster. But there was one thing I knew for sure: if I started developing the code in Python, I would be able to get it to run in a few days -- maybe it would take a week to get the respective C++ versions coded up. I would worry about a more efficient implementation later; at that moment, it was just important to get those results to my collaborators -- "Premature optimization is the root of all evil." On a side note: the same train of thought applies to data storage solutions. Here, I just went with SQLite. CSV didn't quite make sense since I had to annotate and retrieve certain molecules repeatedly. I surely didn't want to scan or rewrite a CSV file from start to end every time I wanted to look up a molecule or manipulate its entry -- issues with memory capacity aside. Maybe MySQL would have been even better, but for the reasons mentioned above, I wanted to get the job done quickly, and setting up an additional SQL server ... there was no time for that; SQLite was just fine to get the job done.
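Just to illustrate the kind of random access I mean, here is a minimal sqlite3 sketch; the table name, columns, and values are hypothetical, not the actual schema from that project.

```python
import sqlite3

# Hypothetical schema: one row per molecule, with a score that gets updated repeatedly
conn = sqlite3.connect('molecules.db')
conn.execute('CREATE TABLE IF NOT EXISTS molecules (id TEXT PRIMARY KEY, smiles TEXT, score REAL)')
conn.execute('INSERT OR REPLACE INTO molecules VALUES (?, ?, ?)', ('mol_001', 'CCO', 0.87))
conn.commit()

# Look up and annotate a single molecule without touching the rest of the dataset --
# exactly the kind of access that a flat CSV file makes painful
row = conn.execute('SELECT * FROM molecules WHERE id = ?', ('mol_001',)).fetchone()
print(row)
conn.execute('UPDATE molecules SET score = ? WHERE id = ?', (0.91, 'mol_001'))
conn.commit()
conn.close()
```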
The verdict: **Choose the language that satisfies *your* needs!**
However, there is one little caveat here! How can a beginning programmer possibly know about the advantages and disadvantages of a language before learning it, and how should the programmer know if this language will be useful to her at all? This is what I would do: just search for particular applications and solutions related to your most common problem tasks on Google and [GitHub](https://github.com). You don't need to read and understand the code. Just look at the end product.
> In the one and only true way. The object-oriented version of 'Spaghetti code' is, of course, 'Lasagna code'. (Too many layers). — Roberto Waltman.
If you are interested, these are my favorite and most frequently used Python "tools":
- [scikit-learn](http://scikit-learn.org/stable/): The most convenient API for the daily, more basic machine learning tasks.
- [matplotlib](http://matplotlib.org): My library of choice when it comes to plotting. Sometimes I also use [seaborn](http://stanford.edu/~mwaskom/software/seaborn/index.html) for particular plots; the heat maps, for example, are particularly great!
- [pandas](http://pandas.pydata.org): Working with relatively small datasets, mostly from CSV files.
- [sqlite3](https://docs.python.org/2/library/sqlite3.html): Annotating and querying "medium-sized" datasets.
- [IPython notebooks](http://ipython.org): What can I say, 90% of my research takes place in IPython notebooks. It's just a great environment to have everything in one place: ideas, code, comments, LaTeX equations, illustrations, plots, outputs, ...
Note that the IPython Project recently evolved into [Project Jupyter](https://jupyter.org). Now, you can use the Jupyter notebook environment not only for Python but also for R, Julia, and many more languages.
However, keep in mind that MATLAB comes with a big price tag, and I think it is slowly fading from academia as well as industry. Plus, I am a big open-source enthusiast after all ;). In addition, its performance is also not that compelling compared to other "productive" languages, as the benchmarks below show:
*(benchmark chart)*
(Benchmark times relative to C -- smaller is better, C performance = 1.0; Source: [http://julialang.org/benchmarks/](http://julialang.org/benchmarks/))
@@ -132,7 +132,7 @@ To be honest, I have to admit that I am not necessarily a big fan of the "@" sym
[[back to top](#table-of-contents)]
I think Julia is a great language, and I would like to recommend it to someone who's getting started with programming and machine learning. I am not sure if I really should, though. Why? There is this sad, somewhat paradoxical thing about committing to programming languages. With Julia, we cannot tell if it will become "popular" enough in the next few years.
> There are only two kinds of languages: the ones people complain about and the ones nobody uses — Bjarne Stroustrup
To take one of my favorite Python quotes out of its original context: "We are all adults here" -- let's not waste our time with language wars. Choose the tool that "clicks" for you. When it comes to perspectives on the job market: there is no right or wrong here either. I don't think a company that wants to hire you as a "data scientist" really cares about your favorite toolbox -- programming languages are just "tools" after all. The most important skill is to think like a "data scientist," to ask the right questions, and to solve problems. The hard part is the math and machine learning theory; a new programming language can easily be learned. Just think about it: you learned how to swing a hammer to drive a nail in, so how hard can it possibly be to pick up a hammer from a different manufacturer?
But if you are still interested, look at the TIOBE Index, for example, *one* measure of the popularity of programming languages:

However, if we look at [The 2015 Top Ten Programming Languages](http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages) by IEEE Spectrum, the R language is climbing fast (left column: 2015, right column: 2014).
Speaking of hammers and nails again, Python is extremely versatile.
Well, this is a pretty long answer to a seemingly very simple question. Trust me, I can go on for hours and days. But why complicate things? Let's bring this to a conclusion: